Handling Failures in Distributed Systems
Updated June 3, 2026
Imagine you're trying to order food at a chaotic, loud, and ridiculously busy restaurant. You shout your order to the waiter. Did he hear you? Did he ignore you? Is the chef cooking it? Did the waiter get hit by a swinging door on the way to the kitchen?
You wait 10 minutes. No food. Do you order again? If you do, will you be charged twice and get two meals?
This uncertainty is exactly what software experiences every millisecond in a distributed system.
The Core Concept
On a single computer, things are comparatively reliable. If you call a function, it executes, or the whole process fails along with it. In a distributed system, components communicate over the network, and the network is fundamentally unreliable: messages can be lost, delayed, or duplicated.
The first rule of distributed systems is: Everything fails, all the time. Servers crash, hard drives die, network packets are dropped, and latency spikes. Your job is not to prevent failures, but to design systems that handle them gracefully.
Types of Failures
- Crash Failures: A server completely dies. (Easiest to handle. It just stops responding.)
- Omission Failures: A server drops a request or fails to send a response.
- Byzantine Failures: A server acts maliciously or sends completely incorrect data (e.g., a hacked node or a corrupted memory chip). This is the hardest to solve and is a major focus in blockchain systems.
Essential Survival Strategies
How do we build reliable systems out of unreliable parts? We use these fundamental patterns:
1. Timeouts and Retries
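If Service A calls Service B, it shouldn't wait forever. It must have a strict timeout. If it doesn't get a response within, say, 500 ms, it assumes failure and tries again. Note that a timeout only tells Service A that no response arrived; Service B may still have processed the request, which is why retries have to be safe to repeat.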
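As a rough illustration of the timeout-and-retry idea, here is a minimal Python sketch. The name `call_with_retries` and its parameters are hypothetical, and it assumes the operation accepts a timeout in seconds and raises `TimeoutError` when no response arrives in time.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[float], T],
    timeout_s: float = 0.5,
    max_attempts: int = 3,
) -> T:
    """Call operation(timeout_s), retrying when it times out.

    Assumption: `operation` enforces the timeout itself and raises
    TimeoutError if no response arrives within `timeout_s` seconds.
    """
    last_error: Optional[BaseException] = None
    for _ in range(max_attempts):
        try:
            return operation(timeout_s)
        except TimeoutError as exc:
            # No response in time: treat it as a failure and try again.
            last_error = exc
    # Every attempt timed out, so give up and report the failure.
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

This version retries immediately and only on timeouts. A real client would usually wait between attempts and decide which other errors are worth retrying.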
2. Exponential Backoff and Jitter
If Service B goes down, and 1,000 clients immediately retry every second, they create a "retry storm" that acts like a DDoS attack and can keep Service B down long after it would otherwise have recovered. Instead, clients should use Exponential Backoff: wait 1s, then 2s, then 4s, then 8s, usually up to some maximum. Adding Jitter (a random extra amount of time, so the waits become, say, 1.2s, then 2.5s) prevents all clients from retrying at the exact same millisecond.
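The delay schedule described above could be computed along these lines. This is a sketch under assumptions: `backoff_delays`, the 30-second cap, and the choice of adding up to 50% random extra time are all illustrative, not a standard.

```python
import random
from typing import List, Optional


def backoff_delays(
    base_s: float = 1.0,
    cap_s: float = 30.0,
    attempts: int = 5,
    jitter_fraction: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one wait time (in seconds) per retry attempt."""
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        # Exponential part: base, 2*base, 4*base, ... limited by the cap.
        exponential = min(cap_s, base_s * 2 ** attempt)
        # Jitter part: a random extra wait so clients spread out.
        jitter = rng.uniform(0, exponential * jitter_fraction)
        delays.append(exponential + jitter)
    return delays
```

A caller would sleep for each delay in turn between attempts. Other jitter schemes exist, such as picking the whole delay at random between zero and the exponential value; which one fits best depends on the workload.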
3. Idempotency (The "Order Twice" Problem)
If you retry a request, you must ensure it doesn't execute twice. If you retry a "Charge $50" request, the customer shouldn't be billed $100.
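An API is Idempotent if making the same request multiple times has the same effect as making it once. This is usually achieved by passing a unique RequestId with the transaction, so the server can recognize a retry and return the original result instead of executing the operation again.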
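To make the RequestId idea concrete, here is a small in-memory sketch. `PaymentService` and its fields are hypothetical names, and it assumes a single process; a real service would keep the request IDs in durable storage and make the check-and-record step atomic.

```python
from typing import Dict


class PaymentService:
    """Toy payment service that deduplicates charges by request ID."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}  # request_id -> stored response
        self.total_charged = 0

    def charge(self, request_id: str, amount: int) -> dict:
        # A request ID we have already seen is a retry: return the
        # stored response and do not charge again.
        if request_id in self._results:
            return self._results[request_id]
        self.total_charged += amount
        result = {"request_id": request_id, "charged": amount, "status": "ok"}
        self._results[request_id] = result
        return result


service = PaymentService()
service.charge("req-123", 50)
service.charge("req-123", 50)  # retry of the same request
print(service.total_charged)   # 50, not 100
```

The client has to generate the ID once and reuse it for every retry of the same logical request; a fresh ID per attempt would defeat the check.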
4. Circuit Breakers
If Service B is completely down, it's foolish to keep sending it requests and waiting for timeouts.
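A Circuit Breaker detects repeated failures and "trips." Once tripped, Service A immediately stops calling Service B and returns an error (or a fallback cached value) without waiting. After a configured cool-down period, it sends a single test request to see if Service B has recovered. If that request succeeds, normal traffic resumes; if it fails, the breaker stays tripped for another cool-down period.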
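The trip, fail-fast, and test-request behaviour can be sketched in a few lines of Python. This is a simplified, single-threaded illustration: `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are assumptions made for the example, not the API of any particular library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is tripped and the call is skipped."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means not tripped

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Tripped and still cooling down: fail fast, no remote call.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let this call through as the test request.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            was_tripped = self._opened_at is not None
            if was_tripped or self._failures >= self.failure_threshold:
                # Trip the breaker, or re-trip it after a failed test request.
                self._opened_at = self._clock()
            raise
        # Success: reset the failure count and allow normal traffic again.
        self._failures = 0
        self._opened_at = None
        return result
```

Service A would wrap each call to Service B in `breaker.call(...)` and catch `CircuitOpenError` to return an error or a cached fallback. A production breaker would also need to be thread-safe and would typically report its state to monitoring.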
[!TIP] Netflix popularized the Circuit Breaker pattern with their open-source library Hystrix, ensuring a failure in their recommendation engine doesn't break the entire video streaming platform.
Summary
- Failures are the norm, not the exception, in distributed systems.
- Always use strict Timeouts and Retries with Exponential Backoff and Jitter.
- Make critical operations (like payments) Idempotent.
- Use Circuit Breakers to prevent cascading failures across microservices.