Case Study: SoundCloud's Multi-Year Journey Breaking the Monolith

SoundCloud is one of the most popular websites on the internet, and for good reason: it hosts everything from electro hip hop to podcasts, from book readings to radio shows. Underneath all of that media is a massive system of interconnected data manipulation and service endpoints.

It might be surprising to the reader, then, that until relatively recently, this functionality was handled as a giant monolith. SoundCloud has officially completed its 8-year journey to breaking the monolith and adopting a more reasonable microservice architecture.

Today, we’re going to review this journey. We’ll look at why it had to happen, how they implemented the process, and whether or not it was a success. We’ll distill some finer points for developers looking to implement a similar approach and identify the key strategies to SoundCloud’s success.

Facing the Monolithic Challenge

SoundCloud is an interesting platform. Though it’s at the forefront of the conversation about modern web architecture, SoundCloud is a relatively old site, established in 2008. While 14 years might not seem like a long time to some people, on the internet, that’s quite a ways back. During this time, SoundCloud has seen multiple phases in its lifecycle, with each stage bringing a different user base, a different demand on resources, and different business logic underpinning it all.

When SoundCloud was first created, it somewhat famously existed as a single Ruby on Rails app that served both the website and the public API as a monolith. Over time, however, this monolith became rather weighty. The heaviness of the core platform became more apparent as the team adopted microservice development, abstracting away functionality into independent microservices.

Very quickly, SoundCloud discovered that leveraging a monolith for core functions and pushing to external microservices is often more complicated than it’s worth. To quote SoundCloud via its own blog:

“In the new reality of a microservices architecture, where some new features now existed outside of the Rails application, and some services supplemented the existing features of the Rails application, we needed to decide how to maintain the public API going forward. It was crucial to ensure that important features — like serving content — continued to work for existing integrations, even though their implementations had changed and now spanned multiple services.

We tried various approaches and learned that our Rails application didn’t perform well when interacting with multiple microservices to serve user traffic.”

Finding a Solution: The Strangler Pattern

Faced with the issue of moving from a monolith to a microservice paradigm, SoundCloud had to make a choice: full migration at the cost of stability, or stability at the cost of a rapid changeover. SoundCloud chose a middle route known as the Strangler Pattern.

The Strangler Pattern is an interesting approach to rewriting a system that is in constant use. Martin Fowler described the pattern after a visit to Australia, where he encountered strangler figs. These figs start in the top branches of trees and, over time, work their way down to the soil below, where they take root. They eventually kill the tree in which they first took refuge, replacing its supportive structure with their own biomass.

In the case of the monolith to microservice conversion, the idea of the strangler fig is fitting. Engineers adopt small components one at a time, slowly overtaking the existing system and eventually replacing it entirely. To build such a system, however, one must fully commit to the process. SoundCloud experienced firsthand that loose adherence often creates more problems than it solves.
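The incremental takeover described above can be sketched as a thin routing facade. This is a minimal illustration only, not SoundCloud's implementation: the `StranglerFacade` class, the `tracks_service` handler, and the path prefixes are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def monolith(path: str) -> str:
    """Stand-in for the legacy application that still owns most routes."""
    return f"monolith handled {path}"


def tracks_service(path: str) -> str:
    """Stand-in for a newly extracted service (hypothetical name)."""
    return f"tracks-service handled {path}"


class StranglerFacade:
    """Sends a request to a migrated service if one owns the path, else to the monolith."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        """Hand ownership of every path under `prefix` to the new handler."""
        self._migrated[prefix] = handler

    def handle(self, path: str) -> str:
        # Longest matching prefix wins, so a narrow migration can override a broad one.
        matches = [p for p in self._migrated if path.startswith(p)]
        if matches:
            return self._migrated[max(matches, key=len)](path)
        # Anything not yet migrated still falls through to the legacy system.
        return self._legacy(path)


facade = StranglerFacade(monolith)
facade.migrate("/tracks", tracks_service)
print(facade.handle("/tracks/42"))
print(facade.handle("/playlists/7"))
```

Each call to `migrate` moves one more slice of traffic off the monolith. When nothing falls through to the legacy handler any longer, the old system can be retired.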

Again, quoting the SoundCloud blog on this process:

“Although this end goal informed the original decision to adopt the Strangler pattern, our choice to use it was more motivated by an immediate need rather than planning for a future free of the public API monolith.

As a result, the Strangler was left, along with the monolith, largely unmaintained while feature development on our internal APIs continued at pace. To facilitate the continued development of our internal APIs, it became necessary to duplicate code paths for accessing core entities, e.g. tracks, playlists, users, etc.”

What SoundCloud was left with was a monolith that still dictated much of the backend, a collection of interconnected microservices that duplicated much of this internal codebase, and a hybrid solution that delivered the strengths of neither approach. As SoundCloud grew, so too did the core problem. By January 2020, SoundCloud identified that they needed to fix their approach’s underlying issues.

An Urgent Fix and a Detailed Battle Plan

Faced with the urgent need to move away from the monolith and execute upon the promise of their microservices, SoundCloud began the arduous task of plotting their approach. In theory, slowly replacing existing monolithic pieces over time is a simple concept. Yet, the pathway was not so simple. Thus, a battle plan needed to be devised.

For SoundCloud, a critical first step was to figure out the scope of the work to be done. Upon reviewing the collection of APIs and apps currently working on the backend, SoundCloud found that many of the official applications were making direct calls to the public API. In these cases, they were able to simply migrate the applications to use the official Backend for Frontend (BFF). This BFF was designed to bridge the boundary between public and internal, and moving functions to it drastically reduced the work that had to be done.

Additionally, SoundCloud moved a lot of common functionality into what they call their “Value-Added Services.” The VAS functionality allowed SoundCloud to further reduce the scope of the work to be done by abstracting what needed to occur in the core API into adjacent microservices and systems.

SoundCloud engineers then reviewed their codebase, creating a list of challenges they identified as potential blockers. The biggest blocker to moving away from the Ruby monolith was re-learning what was possible and implementable. SoundCloud notes in their blog that “some things that come for free (or “magic”) in Ruby were things we needed to implement ourselves — for example, multipart request parameter parsing.” This is a significant reason so many monolithic providers stick to what they know – it can often be more time-consuming to figure out how to “just do what the API used to do” with new technology. In this case, SoundCloud tried to build supportive structures where necessary, leaning on existing tech and dependencies to deliver ongoing functionality.

Flipping the Switch

The way that SoundCloud actually switched over to the new systems is quite ingenious. Once the new code was developed in parallel to the existing monolith, there were two pathways a request could travel through — the ported implementation and the original code. To test these new implementations, SoundCloud utilized a proxy for the public API, pushing requests to the original code in tandem with a push to the new code.

As each implementation generated its response, the two responses were compared. If they were inconsistent, a telemetry event was triggered, flagging the discrepancy for a developer to review and fix. This live testing allowed for continued service to the end user while testing with real requests. In this way, SoundCloud could more effectively test its new code without relying on theoretical bug testing or mocking.

Once the new code “proved” itself by being consistent against the old code, the developers could remove the proxy and the ported response could be defaulted to for requests. This cut a huge amount of testing out of the system. It should be mentioned that mutation methods such as PUT or POST required their own regression testing, as allowing parallel requests in this way would result in duplicate entry generation issues.
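The parallel comparison described in this section can be sketched roughly as follows. This is an assumption-laden illustration, not SoundCloud's actual proxy: `ShadowProxy`, `Request`, `Response`, and both backend functions are hypothetical, and the "telemetry event" is reduced to appending to an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Request:
    method: str
    path: str


@dataclass(frozen=True)
class Response:
    status: int
    body: str


Backend = Callable[[Request], Response]

# Only read-only methods are mirrored; replaying a POST or PUT would duplicate writes.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


@dataclass
class ShadowProxy:
    original: Backend
    ported: Backend
    # Stand-in for a telemetry sink: (request, original response, ported response).
    mismatches: List[Tuple[Request, Response, Response]] = field(default_factory=list)

    def handle(self, request: Request) -> Response:
        primary = self.original(request)
        if request.method in SAFE_METHODS:
            try:
                candidate = self.ported(request)
            except Exception as exc:
                # A crash in the new code is recorded as a mismatch, never shown to the user.
                candidate = Response(status=-1, body=f"ported raised {exc!r}")
            if candidate != primary:
                self.mismatches.append((request, primary, candidate))
        # The caller always receives the original implementation's answer.
        return primary


ported_calls: List[Request] = []


def original_backend(request: Request) -> Response:
    return Response(200, f"resource at {request.path}")


def ported_backend(request: Request) -> Response:
    ported_calls.append(request)
    if request.path == "/users/1":
        # Simulated porting bug, so the comparison has something to catch.
        return Response(200, "resource at /users/one")
    return Response(200, f"resource at {request.path}")


proxy = ShadowProxy(original_backend, ported_backend)
proxy.handle(Request("GET", "/tracks/1"))   # responses agree
proxy.handle(Request("GET", "/users/1"))    # responses differ and are recorded
proxy.handle(Request("POST", "/tracks"))    # mutation: not mirrored to the ported code
print(len(proxy.mismatches), len(ported_calls))
```

In this sketch, the user-facing answer always comes from the original code, so a defect in the ported path shows up only in the mismatch record. Once an endpoint stops producing mismatches, the proxy can be removed and the ported implementation served directly.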

Major Lessons From the SoundCloud Migration

A handful of major lessons can be found in this process. While some of these are very specific to SoundCloud’s particular situation, they are still good to keep in mind.

Firstly, SoundCloud’s efforts showed that telemetry-based decision-making is superior to guesswork. SoundCloud’s use of telemetry to detect undocumented endpoints and to find code whose behavior did not match expectations led to a drastic reduction in the required work. Developers should consider implementing such systems for their own porting processes.
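One simple way to surface undocumented endpoints from telemetry is to count observed requests and subtract the documented set. The sketch below assumes request logs are already reduced to (method, route) pairs. The `DOCUMENTED` set, the sample log, and the `undocumented_traffic` helper are hypothetical and do not come from SoundCloud's tooling.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Endpoint = Tuple[str, str]  # (HTTP method, route)

# Hypothetical set of endpoints that appear in the public documentation.
DOCUMENTED: Set[Endpoint] = {("GET", "/tracks"), ("GET", "/users")}


def undocumented_traffic(access_log: Iterable[Endpoint], documented: Set[Endpoint]) -> Counter:
    """Count hits on endpoints that receive traffic but are not documented."""
    hits = Counter(access_log)
    return Counter({endpoint: n for endpoint, n in hits.items() if endpoint not in documented})


# Hypothetical sample of observed requests.
sample_log = [
    ("GET", "/tracks"),
    ("GET", "/tracks/related"),
    ("GET", "/tracks/related"),
    ("GET", "/users"),
]

report = undocumented_traffic(sample_log, DOCUMENTED)
for endpoint, count in report.most_common():
    print(endpoint, count)
```

Sorting by hit count shows which undocumented endpoints are used heavily enough to need porting and which receive so little traffic that they may not be worth carrying over.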

Secondly, we should note that, at least in the early stages, the fix can cause more issues than the original problem. SoundCloud notes that the Strangler Pattern introduced significant complexity, especially in light of the period at the beginning of their efforts where minimal work was accomplished. This can be resolved by having a detailed plan ahead of time. Migration is always a complex issue, so having a clear step-by-step plan for implementation and final deliverables can help mitigate much of this complexity.

SoundCloud also points out that changes, no matter how incremental, can increase bugs and issues in the system, and that migration is not a one-stop fix for all concerns. As with any development process, developers should review the steps they’ve taken to ensure that no additional issues were introduced, no bugs replicated, and no duplicate pathways maintained. Even changes that seem minor in the abstract can have huge ramifications over time.

Judging Success

One potentially tricky question to answer is this: was the process a success?

The answer depends entirely on how you define success. Yes, SoundCloud migrated its services away from the monolith and into its new approach. In that sense, this was a huge success. It should be noted, though, that SoundCloud states that organizations should consider whether the work is actually worth it in terms of business impact.

That brings up an interesting point that is often ignored when discussing these larger projects. Simply put, all development processes have their costs. While the cost of not moving away from the monolith seemed, in this case, to be much higher than the cost of moving away, it does bear consideration for other developers and systems. While this process could be considered a success for SoundCloud, not every company will have the money, time, and resources to carry out this endeavor so smoothly.

It should also be noted here that much of this cost is variable: the specific structure applied, the efficiency, the dedication to the process, and even the state of the current monolith can drastically change the outcome. Therefore, this case study should be considered an example of SoundCloud’s process, not a predictor of how other efforts will turn out.