Sagas in a Distributed World: Utopia or Dystopia


In my previous article, Legacy Modernization or Chasing Rainbows, I discussed the myriad complexities and challenges we encounter when migrating our legacy monolith systems to microservices. In the course of this journey, we often arrive at the Data-Decomposition phase, where we run into something we had taken for granted in our monolith: the ACID transactions of an RDBMS.

To elaborate, a relational database (RDBMS) guarantees that transactions are Atomic, Consistent, Isolated, and Durable (ACID) to ensure their correctness. Atomicity ensures that either all operations in the transaction succeed or all fail. Consistency takes the data from one valid state to another valid state. Isolation guarantees that concurrent transactions behave as if they were running sequentially and do not step on each other. Lastly, Durability means that committed transactions are persisted irrespective of any type of system failure. In a distributed scenario, where a transaction spans several services, preserving these ACID guarantees becomes a key concern.

Transactions in a monolith are rather trivial to implement: given that everything we need resides in the same RDBMS, it’s not something we need to think much about. But as we start breaking up this monolith into multiple microservices, we venture into new territory where things are, unfortunately, not that simple anymore. What used to execute nicely against a single monolith database is now distributed across multiple databases and multiple microservices. Here, we encounter the problem of distributed transactions, i.e., a transaction that is not confined to one database but is spread across several of them.

A solution often suggested for this problem is long-running transactions, or Sagas, as they are popularly called.

What Is the Saga Pattern?

At its core, the Saga Pattern can be seen as a direct result of the database-per-service pattern. In the database-per-service pattern, each microservice is responsible for its own data. However, this leads to an interesting situation. What happens when a business transaction involves data that spans across multiple microservices? This is where the need for the Saga Pattern arises.

The Saga Pattern is a microservices architectural pattern for implementing a transaction that spans multiple services. A Saga is essentially a sequence of local transactions executing a set of operations across multiple microservices, applying all-or-nothing semantics to the business transaction as a whole. Sagas split one overarching business transaction into a series of local database transactions, each executed by a participating service.

In a Saga, each of these local transactions has a corresponding compensating transaction. As the Saga proceeds through its steps, if any transaction fails, the compensating actions of every transaction that previously succeeded are invoked, in reverse order, to nullify their effects.
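To make the mechanics concrete, here is a minimal, illustrative sketch of an orchestrated Saga in Python. The step names, the in-memory context, and the deliberately failing shipment step are all hypothetical; a real implementation would call remote services and persist the Saga’s progress durably.

```python
# A minimal, illustrative Saga orchestrator; step names and context are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SagaStep:
    name: str
    do: Callable[[Dict], None]    # the local transaction executed by one service
    undo: Callable[[Dict], None]  # its compensating transaction

def run_saga(steps: List[SagaStep], ctx: Dict) -> bool:
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.do(ctx)
            completed.append(step)
        except Exception:
            # A step failed: invoke the compensating action of every step that
            # already succeeded, in reverse order, to nullify its effect.
            for done in reversed(completed):
                done.undo(ctx)
            return False
    return True

def fail_shipment(ctx: Dict) -> None:
    raise RuntimeError("carrier unavailable")  # simulate a failing local transaction

steps = [
    SagaStep("reserve-stock", lambda c: c.update(stock="reserved"),
             lambda c: c.update(stock="released")),
    SagaStep("charge-payment", lambda c: c.update(payment="charged"),
             lambda c: c.update(payment="refunded")),
    SagaStep("create-shipment", fail_shipment,
             lambda c: c.update(shipment="cancelled")),
]
ctx: Dict = {}
print(run_saga(steps, ctx), ctx)  # False {'stock': 'released', 'payment': 'refunded'}
```

The part to note is the except branch: the moment one local transaction fails, every previously successful step has to be walked back via its compensating action.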

Sagas are an exciting design concept, but this seemingly nice story does not weave so nicely in the real world. Read on to find out why.

Every “Do” Has an Equal and Opposite “Undo”

When moving to microservices, one of the first things you’ll realize is that individual services do not exist in isolation. While the goal is to create loosely coupled, independent services with as little interaction as possible, the chances are high that one service will need another service to operate. Or, multiple services will need to act in concert to achieve a consistent outcome of an operation in your business domain. Effectively, this means that for each Do-Command we will need a corresponding Undo-Command, a rollback counterpart that reverses the changes made by that Do-Command.

This interaction can give rise to some cringe-worthy scenarios.

Scenario 1: Services in your monolith that have migrated to microservices

This should be the easiest to deal with. For services within your own monolith that have already moved to microservices, you can request the service owners to expose an Undo-Command. Hopefully, they will oblige, assuming everybody in the domain is moving in the same direction.

Scenario 2: Services in your own monolith that have NOT migrated to microservices

It may be the case that you are the pioneer in your domain. Your monolith is intact, but you are testing the waters with a separate microservice. You did all the detailed analysis for a specific functionality and are confident about moving it to a separate microservice encapsulating its own database, except for one minor glitch: for this functionality, your microservice needs the monolith to do something on its behalf. No problem with that; the functionality already exists in the monolith. You just need to expose it as an endpoint. So far, so good.

Reality hits hard when you realize that you need to run this complete functionality in a transaction across these two disparate systems and their corresponding databases. For that, you’ll need a Saga (a distributed transaction) that runs across them. Once you start going down that road, you’ll realize it’s not enough to analyze the functionality you wanted to move out of the monolith. You also have to analyze the additional functionality in the monolith that your microservice interacts with and implement an “Undo” for it in the monolith, thereby adding more code to the monolith, the stark opposite of what you intended to do in the first place.

Scenario 3: Services in other domains in our organization

This scenario is a twist on the one above. Here the service we want to run a “Do” functionality on, as part of our transaction, lies with another domain within our organization. This kind of scenario is rampant in the case of legacy applications, particularly those running on mainframes. Now, this other domain has exposed the “Do” functionality as a command/endpoint for our consumption, but do they have a rollback counterpart, an “Undo” functionality, exposed for it (or even implemented, for that matter)? Here, Conway’s Law comes into the picture. It states that “organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations”. In other words, what our team can build depends on how our organization communicates internally, so the success or failure of this solution will largely be determined by Conway’s Law. In this scenario, the only thing we can do is request the other team to include the Undo functionality in their product backlog and wait for its prioritization and subsequent implementation.

Cascading “Do(s)” and “Undo(s)”

An astute reader will have observed from the above that any service providing an “Undo” functionality needs to deal with all the cascading “Undo(s)” across all the other services it interacts with in its “Do” counterpart. All of those services need to provide an “Undo” functionality of their own and, in turn, handle the “Undo(s)” of all the cascading services they interact with. As you can imagine, it’s a rabbit hole from there. Try doing this for a non-trivial application, and all of a sudden you realize how far those dominoes go. These are the kinds of decisions that any organization needs to align on before embarking on this journey of legacy migration.

Turning the Problem on Its Head

Looking at it from one level above, the same rationale applies to every Do-functionality that our microservice exposes; for every exposed “Do” we will have to expose its rollback counterpart as well. That amounts to practically twice, if not more, the effort and time required to implement the complete functionality. This can completely derail the project unless it is considered during the planning stage.

Undo-Functionality/Compensating Request

It must be borne in mind that an Undo-Functionality/Compensating Request semantically undoes the effect of a request. If we think about it deeply, there are a lot of layers to this sentence.

Depending on the business scenario, there are a couple of ways to achieve this. We might be able to undo the change by reverting the values in a particular table and column to their previous values, or we might need to invalidate a previous record and insert a new one, or we might need to insert a compensating record to undo the effect of the previous one (think debit and credit).

The fundamental problem with these approaches is: how do we recall the previous value to revert to? Even if we save the previous values somewhere, how do we ensure that nobody has changed them in the interim, before we revert to our saved set of prior values? These are complex design decisions without good answers.
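As an illustration, here is a small sketch against a hypothetical schema in SQLite showing two of these styles: inserting a compensating record (the debit/credit case), and reverting a column to a saved prior value only if nobody has changed it in the interim. Whether such a “revert only if unchanged” check is acceptable for your business is exactly the kind of design decision without a good general answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (account TEXT, amount INTEGER)")
conn.execute("CREATE TABLE profile (account TEXT PRIMARY KEY, tier TEXT)")
conn.execute("INSERT INTO profile VALUES ('A', 'basic')")

# Style 1: undo a debit by inserting a compensating credit record (debit/credit).
def debit(account: str, amount: int) -> None:
    conn.execute("INSERT INTO ledger VALUES (?, ?)", (account, -amount))

def compensate_debit(account: str, amount: int) -> None:
    conn.execute("INSERT INTO ledger VALUES (?, ?)", (account, amount))

# Style 2: revert a column to a saved prior value, but only if nobody has
# changed it in the interim.
def upgrade_tier(account: str, new_tier: str) -> str:
    (prior,) = conn.execute(
        "SELECT tier FROM profile WHERE account = ?", (account,)).fetchone()
    conn.execute("UPDATE profile SET tier = ? WHERE account = ?", (new_tier, account))
    return prior  # the Saga must remember this to be able to compensate later

def compensate_upgrade(account: str, prior: str, expected_current: str) -> bool:
    cur = conn.execute(
        "UPDATE profile SET tier = ? WHERE account = ? AND tier = ?",
        (prior, account, expected_current))
    return cur.rowcount == 1  # False: somebody changed it, a blind revert is unsafe

debit("A", 100)
compensate_debit("A", 100)           # net effect on the ledger is zero
prior = upgrade_tier("A", "gold")
print(compensate_upgrade("A", prior, expected_current="gold"))  # True in this run
```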

Another problem with this approach is determining the payload of the Compensating Request, and of all the subsequent cascading Compensating Requests. This issue becomes even more evident in the case of fire-and-forget asynchronous operations, where we do not have any kind of reference/id to run the Compensating Request against.
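A tiny sketch of the payload problem, with hypothetical service and endpoint names: a synchronous Do returns a reference that we can keep in the Saga’s state and later use to build the compensating request, whereas a fire-and-forget publish returns nothing, leaving us with no id to compensate against.

```python
import uuid
from typing import Dict

saga_state: Dict[str, str] = {}  # would be persisted by the Saga orchestrator

def reserve_stock(order_id: str) -> None:
    # Synchronous Do: the downstream service replies with a reservation id.
    reservation_id = str(uuid.uuid4())     # stands in for the service's response
    saga_state[order_id] = reservation_id  # remember it for the Undo payload

def cancel_reservation(order_id: str) -> None:
    # The compensating request can be built only because we kept the reference.
    print(f"POST /reservations/{saga_state[order_id]}/cancel")

def publish_fire_and_forget(event: dict) -> None:
    # Fire-and-forget: nothing comes back, so there is no id to compensate against.
    pass

reserve_stock("order-1")
cancel_reservation("order-1")
publish_fire_and_forget({"type": "order-created", "order": "order-1"})
```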

Retry & Idempotency

While we are on this topic, let’s also touch upon one other intricacy. While it may be acceptable for a Do-Request to be aborted, Compensating Requests cannot be aborted. So, we need to keep retrying them until they succeed, which means they must be idempotent. This deserves some deep thinking, as failure is not really an option for the Undo functionality. The corresponding compensating transaction will have to be retried until it succeeds, and until that happens, the system will remain in an inconsistent state. This effectively translates to an implementation invariant: Sagas require idempotent compensating operations from all the participating services. Realistically speaking, this is quite a tall order to impose.
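What “retry until it succeeds, and make it idempotent” might look like is sketched below. The idempotency key (here simply the Saga id) and the in-memory store are assumptions for illustration; in reality the key would have to be persisted durably by the participating service.

```python
import time
from typing import Callable, Set

applied_compensations: Set[str] = set()  # would be a durable store in a real service

def compensate_refund(saga_id: str, account: str, amount: int) -> None:
    # Idempotent: a retry carrying the same saga_id must not refund twice.
    if saga_id in applied_compensations:
        return
    print(f"refund {amount} to {account}")
    applied_compensations.add(saga_id)

def retry_until_success(action: Callable[[], None], delay_seconds: float = 1.0) -> None:
    # A compensating request cannot be aborted: keep retrying until it succeeds.
    while True:
        try:
            action()
            return
        except Exception:
            time.sleep(delay_seconds)  # back off, then try again

retry_until_success(lambda: compensate_refund("saga-42", "A", 100))
retry_until_success(lambda: compensate_refund("saga-42", "A", 100))  # replay refunds only once
```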

Compensating Requests Are Not Time-Travel

Admittedly, we cannot make something that has already happened unhappen. At best, we can hope to compensate semantically, so that its effect is negated in some cases. In others, we may just have to live with the fact that not everything can be semantically reversed. This can mean different things in different contexts.

For example, say an account was opened and immediately closed. This does not negate the fact that the account was opened and active for a brief period of time. Closing the account is not the same thing as Undoing the creation of the account. It will likely appear in the financial and regulatory reports. Depending on how long the account was active, there may have been transactions executed on it or even an account statement generated for the account.

Another scenario could be where we debit money from Account-A and credit money to Account-B, but the credit to Account-B fails, and when we go to undo the debit on Account-A, the account is closed. Depending on the financial, legal, and regulatory conditions the business operates under, there can be severe implications in all of these cases.

Similarly, some tasks cannot be semantically reversed in any meaningful way. One example is sending an email. Of course, we can send a “disregard the last email” message, but that does not negate the fact that the customer might have acted on that email before getting a disregard mail from the application. Additionally, we have to be sure that we do not flood the customer’s inbox with hundreds of these messages if the compensating action needs to be retried.

Another side effect of this inability to go back in time is the impact on business processing or calculations done on real-time data. In this new world of Sagas, it may not be enough to pull some real-time data points from another service and use them in our calculations at that moment (think forex rates or stock prices). We also need to save these values along with the calculations performed, to be able to revert successfully later. And the revert becomes even more complex if we consider other calculations being executed on the same entities between our Do(s) and Undo(s).
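For instance, a currency-conversion step might have to persist the rate it actually used alongside its result, so that a later compensation can reverse exactly what was done rather than recomputing with whatever the live rate happens to be at Undo time. The record structure below is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ConversionRecord:
    saga_id: str
    amount_eur: float
    rate_used: float   # snapshot of the real-time rate at Do time
    amount_usd: float

def convert(saga_id: str, amount_eur: float, current_rate: float) -> ConversionRecord:
    # Persist the inputs together with the result; the live rate will have moved by Undo time.
    return ConversionRecord(saga_id, amount_eur, current_rate, amount_eur * current_rate)

def compensate_convert(record: ConversionRecord) -> float:
    # Reverse using the snapshotted rate, not today's rate.
    return record.amount_usd / record.rate_used

rec = convert("saga-7", 100.0, current_rate=1.10)
print(compensate_convert(rec))  # ~100.0, regardless of where the rate is now
```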

Dirty-Reads: Saga Updates Are Visible Before the Saga Completes

A long-running transaction, as the name implies, may persist for an extended period of time. Since a Saga provides neither atomicity nor isolation across its steps, its partial updates are visible even before the Saga completes. As local transactions are committed while the Saga is running, their changes are already visible to other concurrent transactions, despite the possibility that the Saga will eventually fail, causing all previously applied transactions to be compensated. From the perspective of the overall Saga, the isolation level is comparable to “read uncommitted.” This can lead to various ramifications.
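The effect can be illustrated with two local commits in SQLite (simplified to a single connection for brevity): the Saga’s first step commits, a concurrent reader immediately sees the debited balance, and only later does the compensating transaction put the money back. Any decision taken on that intermediate read was based on data that was subsequently “rolled back”.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
db.execute("INSERT INTO accounts VALUES ('A', 500)")
db.commit()

# Saga step 1: the local transaction commits immediately.
db.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 'A'")
db.commit()

# A concurrent reader (a report, a batch job, another service) already sees 400,
# even though the overall Saga has not completed yet.
print(db.execute("SELECT balance FROM accounts WHERE id = 'A'").fetchone())  # (400,)

# Saga step 2 fails elsewhere; the compensating transaction restores the balance.
db.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 'A'")
db.commit()
print(db.execute("SELECT balance FROM accounts WHERE id = 'A'").fetchone())  # (500,)
```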

It is possible that, between persisting a change and rolling it back, the initial change has been acted upon by some other system, such as another application, microservice, or batch process. For example, we purchase stock in our “Do” function, but when we return to run our “Undo”, the stock has already been sold to a third party.

Unless handled carefully, this might lead to an avalanche of end-to-end cascading incorrect/orphaned data. We then run the dire risk of ending up with a corrupted database or making business decisions based on data that was later ‘rolled back’.

More so, what if we update some core data, and then some other domain performs a new update on the same data before we decide to roll back our update? What should the end state of the data be then?

Conclusion

The Saga pattern offers a powerful and flexible solution for implementing long-running business transactions that require multiple, separate services to agree on either applying or aborting a set of data changes. Of course, we should aspire to a service cut that reduces the need for interaction with remote services as much as possible. But depending on business requirements, the need for such interaction spanning multiple services might be inevitable, particularly when it comes to integrating legacy systems or systems that are not under our control.

The main benefit of the Saga Pattern is that it helps maintain data consistency across multiple services without tight coupling. This is an extremely important aspect of microservices architecture.

However, the main disadvantage of the Saga Pattern is its inherent complexity. Sagas are not a technical problem; the industry has collectively sorted out the individual technical pieces of this puzzle, and there are enough tools to satisfy your requirements. The complexity lies in having all of these pieces work in tandem to achieve the expected results correctly, consistently, and reliably, in all scenarios and at all times, including the edge cases. When implementing complex patterns like Sagas, it’s vital to understand their constraints and semantics, for example, how they behave in failure scenarios. You must ensure that (eventual) consistency is also achieved under such unforeseen circumstances.

In my opinion, Sagas can help solve certain challenges and scenarios. They should be adopted or explored if the need arises. However, implementing Sagas is incredibly complex and error-prone, especially in a non-trivial domain like banking or finance. So, unless you have Netflix- or Twitter-grade developers working for you within a domain that is conducive to the philosophy of Sagas, you might want to reconsider the approach. Like many things in software architecture, there is no silver bullet to this problem. It is all about knowing your domain and your use cases, and being aware of the available solutions. This will dictate whether you end up building a microservices architecture or a distributed monolith.