The Problem with Distributed Transactions

Updated June 3, 2026
M
Magic Magnets Team
9 min read

Transactions are one of the most powerful ideas in computer science. A database transaction gives you a superpower: wrap a bunch of operations in a BEGIN and COMMIT, and the database guarantees they either all happen or none of them do. That's ACID (atomicity, consistency, isolation, durability), and on a single database, it works beautifully.

Then you build a distributed system, and the superpower disappears.

The Classic Problem: Bank Transfer Across Systems

Let's say you're transferring $500 from your Chase account to a friend's Bank of America account. Simple enough, right?

From the software's perspective, this is two operations:

  1. Debit $500 from Chase's database
  2. Credit $500 to Bank of America's database

On a single database, you'd just wrap both in one transaction. But these are two completely separate systems, each with their own databases, their own servers, their own failure modes. There's no shared transaction coordinator that both systems trust.

So what happens?

Quiz Time

A bank transfer debits Account A successfully, then the network drops before crediting Account B. What is this failure mode called?

What Goes Wrong

Partial Failures

Imagine Chase successfully debits your account, then the network drops before the credit reaches Bank of America. You've lost $500. It's not in your account and it never arrived. This is a partial failure — the worst outcome in distributed systems. Some work succeeded, some didn't, and the world is now inconsistent.

Network Timeouts

Chase sends the credit request to Bank of America and waits. And waits. After 30 seconds, Chase gets a timeout error. But here's the cruel part: a timeout doesn't tell you what happened. Did Bank of America receive the request and credit the account but the response was lost? Did Bank of America never receive the request at all? Chase has no way to know. If it retries, it might credit the account twice. If it doesn't retry, the money is lost.

Process Crashes

Chase debits the account. Then the Chase application server crashes before it can send the credit request to Bank of America. The debit is committed to the database, but the credit never happens.

These aren't theoretical edge cases. Networks drop packets. Servers crash. Disks fill up. In a system handling millions of transactions, partial failures happen constantly. The question is: how do you design around them?

Quiz Time

A service sends a credit request to an external system and receives a timeout. What does the timeout tell the caller about whether the operation completed?

Why You Can't Just Use a Global Lock

Your first instinct might be: "Just lock both accounts before doing anything." This is essentially what Two-Phase Commit (2PC) tries to do. A coordinator asks all participants, "Can you do this?" It only proceeds if everyone says yes. If anyone says no (or doesn't respond), everyone rolls back.

2PC works on paper. In practice, it has brutal failure modes:

  • The coordinator can crash — after participants vote "yes" and lock their resources, but before the coordinator sends the commit. Participants are stuck holding locks forever, waiting for a coordinator that's never coming back.
  • It's blocking — if the coordinator is slow or partitioned, every participant blocks. At scale, this is a latency nightmare.
  • It doesn't scale — every participant has to hold a lock for the entire duration of the transaction, across a network. The round trips kill your throughput.

2PC is used in practice (XA transactions in Java EE, distributed databases), but mainly in systems where you control all participants and can afford the overhead. Across third-party systems like Chase and Bank of America? Forget it.

Quiz Time

Two-Phase Commit (2PC) guarantees atomicity across distributed systems without any blocking or locking concerns.

The Modern Approaches

If you can't have true atomic transactions across systems, you have to embrace a different mindset: design for failure recovery instead of failure prevention.

Saga Pattern

The Saga pattern breaks a distributed transaction into a sequence of local transactions, each published as an event. If step N fails, you execute compensating transactions to undo steps 1 through N-1.

For the bank transfer:

  1. Chase debits the account → emits "Debit successful" event
  2. Bank of America credits the account → emits "Credit successful" event
  3. If the credit fails → Bank of America emits "Credit failed" event
  4. Chase receives "Credit failed" → executes compensation (re-credits the deducted amount)

The key insight: you're not preventing partial failures, you're recovering from them. The system eventually reaches a consistent state, even if it passes through inconsistent states along the way.

Quiz Time

In the Saga pattern, what mechanism is used to undo work when a step fails partway through a distributed transaction?

Outbox Pattern

The Outbox pattern solves a subtler problem: how do you ensure you don't lose an event after a local transaction commits? The classic bug: you commit to your database, then the process crashes before publishing the event to Kafka. The event is lost.

The fix: write the event to an outbox table in the same local transaction as your business data. A separate process reads the outbox and publishes to the message broker. Since writing to the outbox is local, it's covered by your local ACID guarantees. You can't lose the event.

The Outbox pattern is what makes Sagas reliable. Without it, you get "at most once" delivery. With it, you get "at least once" — and idempotency handles duplicates.

Quiz Time

The Outbox pattern writes events to a dedicated table in the same local transaction as the business data to guarantee event delivery.

Eventual Consistency

Both Sagas and the Outbox pattern embrace eventual consistency: the system isn't guaranteed to be consistent at every instant, but it will become consistent eventually, given no further failures. For most business operations — order processing, email sending, inventory updates — eventual consistency is completely acceptable.

The cases where it's not acceptable (bank balances, seat reservations, inventory at the exact moment of purchase) require more careful design: idempotency keys, reservation patterns, and careful use of locks at the individual service level.

Choosing Your Approach

SituationApproach
Simple, low-volume workflows across your own services2PC (if your DB supports it)
Long-running business processes with clear compensation logicSaga pattern
Preventing event loss after local commitsOutbox pattern
Third-party systems with no shared coordinatorSaga + Outbox + idempotent APIs
Quiz Time

Which combination is most appropriate for integrating with third-party systems that have no shared transaction coordinator?

Summary

Distributed transactions are hard because networks lie, processes crash, and partial failures are inevitable. True ACID guarantees across multiple systems are either impossible or too expensive to be practical. The modern answer is to stop trying to prevent failure and instead design systems that can detect and recover from it. Sagas give you compensating transactions. The Outbox pattern gives you reliable event delivery. Eventual consistency gives you a model for reasoning about correctness over time — not at every instant. Embrace the uncertainty; engineer around it.

How helpful was this content?

Comments

0/2000

Sign in to join the discussion

Saved on this device only

Sign in to sync progress across devices