Reliability
Updated June 6, 2026
Reliability and availability are often used interchangeably. They shouldn't be. Understanding the difference matters both for system design and for talking clearly about incidents.
Here's the distinction:
- Availability is about uptime: is the system responding right now?
- Reliability is about correctness over time: does the system do what it's supposed to do, consistently, without failing?
A system can be highly available but unreliable. Imagine a payment service that's always up (great availability!) but occasionally processes transactions twice (terrible reliability). Or a search engine that's always reachable but returns wrong results 5% of the time.
Reliability is the property that the system behaves correctly and consistently, not just that it's responding.
A payment service is always reachable but occasionally charges customers twice. Which property does it lack?
Measuring Reliability: MTBF and MTTR
Two metrics dominate reliability conversations in production engineering.
MTBF — Mean Time Between Failures
MTBF is the average time between system failures. High MTBF means your system fails infrequently. If your service crashes once every 30 days on average, your MTBF is 30 days.
Improving MTBF is mostly about prevention: building systems that don't fail often. This involves good software engineering practices, thorough testing, staged rollouts, and resilient architecture.
MTTR — Mean Time To Recovery
MTTR is the average time it takes to recover from a failure. It covers the whole cycle: detection, response, and fix. If a failure happens at 3am, how long until the system is healthy again: 5 minutes or 5 hours?
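As a rough illustration, both metrics can be computed from an incident log. The sketch below assumes a hypothetical list of (started, recovered) timestamp pairs and treats MTBF as the mean uptime between one recovery and the next failure. Teams define these metrics in slightly different ways, so treat this as one reasonable reading rather than the definition.

```python
from datetime import datetime, timedelta

def mtbf_and_mttr(incidents):
    """Return (MTBF, MTTR) for incidents given as (started, recovered) pairs.

    Assumes at least two incidents, sorted by start time.
    """
    # MTTR: average of the time spent recovering from each failure.
    repair_times = [recovered - started for started, recovered in incidents]
    mttr = sum(repair_times, timedelta()) / len(repair_times)

    # MTBF (as assumed here): average uptime between a recovery
    # and the start of the next failure.
    uptimes = [
        next_start - prev_recovered
        for (_, prev_recovered), (next_start, _) in zip(incidents, incidents[1:])
    ]
    mtbf = sum(uptimes, timedelta()) / len(uptimes)
    return mtbf, mttr

# Hypothetical incident log, for illustration only.
incidents = [
    (datetime(2026, 1, 1, 3, 0), datetime(2026, 1, 1, 3, 30)),
    (datetime(2026, 1, 31, 3, 30), datetime(2026, 1, 31, 4, 0)),
    (datetime(2026, 3, 2, 4, 0), datetime(2026, 3, 2, 5, 0)),
]
mtbf, mttr = mtbf_and_mttr(incidents)
print(f"MTBF: {mtbf}, MTTR: {mttr}")
```

With this made-up log, the service runs for 30 days between failures and takes 40 minutes on average to recover.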
The relationship between MTBF and MTTR determines your effective availability:
Availability ≈ MTBF / (MTBF + MTTR)
Here's the insight: improving MTTR often has more impact than improving MTBF. Failures will happen, which is a fact of distributed systems. The organizations that maintain high reliability are often the ones with the best incident response processes, not necessarily the ones with the fewest failures.
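To make the formula concrete, here is a small worked example using the numbers mentioned earlier: a service that fails about once every 30 days, recovering in either 5 hours or 5 minutes. The function name and constants are illustrative, not taken from any particular tool.

```python
def availability(mtbf_hours, mttr_hours):
    """Approximate availability as MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

MTBF_HOURS = 30 * 24  # one failure every 30 days on average

slow_recovery = availability(MTBF_HOURS, 5.0)     # 5 hours to recover
fast_recovery = availability(MTBF_HOURS, 5 / 60)  # 5 minutes to recover

print(f"5-hour MTTR:   {slow_recovery:.2%}")  # about 99.31%
print(f"5-minute MTTR: {fast_recovery:.2%}")  # about 99.99%
```

The failure rate is identical in both cases. Only the recovery time changes, and that alone moves the service from roughly two nines to nearly four.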
According to the availability formula, what is the most impactful way to improve availability when failures are inevitable?
Fault Tolerance Patterns
Reliable systems are built on the assumption that individual components will fail, and they are designed to absorb those failures gracefully.
Graceful Degradation
When part of a system fails, the rest should keep working — at reduced capacity if necessary. Netflix calls this "let it fail gracefully": if the recommendation engine is down, show popular titles instead of crashing the entire homepage.
The pattern: define what each dependency failure means for the user experience, and implement a fallback for each one.
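A minimal sketch of that pattern, using the homepage example above. Every name here (`fetch_recommendations`, `POPULAR_TITLES`, `homepage_titles`) is hypothetical, and the recommendation call is a stand-in that always fails so the fallback path is visible.

```python
POPULAR_TITLES = ["Title A", "Title B", "Title C"]  # hypothetical static fallback


class RecommendationsUnavailable(Exception):
    """Raised when the (hypothetical) recommendation service cannot answer."""


def fetch_recommendations(user_id):
    # Stand-in for a real network call; here it always fails.
    raise RecommendationsUnavailable("recommendation engine is down")


def homepage_titles(user_id, fetch=fetch_recommendations):
    """Return personalized titles if possible, otherwise popular ones."""
    try:
        return {"titles": fetch(user_id), "personalized": True}
    except RecommendationsUnavailable:
        # Degrade: the page still renders, just without personalization.
        return {"titles": POPULAR_TITLES, "personalized": False}


print(homepage_titles("user-123"))
```

The caller receives a usable page either way, and the `personalized` flag makes the degraded state observable for monitoring.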
Circuit Breakers
Imagine Service A calling Service B. Service B starts responding slowly, perhaps because it's overloaded or recovering from a crash. Without protection, Service A will pile up slow requests, exhaust thread pools, and cascade the failure upstream to every caller.
Fragile System: Service A blocks on sync calls when Service B is slow or down
A circuit breaker monitors failure rates. When failures exceed a threshold, the circuit "opens," meaning Service A stops calling Service B and immediately returns a fallback response. After a timeout, it "half-opens" to test if Service B has recovered.
This pattern, popularized by Netflix's Hystrix library, prevents one struggling service from taking down everything that depends on it.
What does a circuit breaker do when the failure rate of a downstream service exceeds its threshold?
Circuit Breaker: circuit "opens" on failure threshold, returning fallback responses immediately
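The closed, open, and half-open behavior described above can be sketched as a small state machine. This is a simplified illustration, not Hystrix's implementation: it counts consecutive failures instead of tracking a failure rate, it is not thread-safe, and the class name and parameters are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout_s:
            return "half-open"      # timeout elapsed: allow a trial call
        return "open"

    def call(self, fn, fallback):
        if self.state == "open":
            return fallback()       # fail fast, do not touch the dependency
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures while closed,
            # (re)opens the circuit and restarts the timeout.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Service A would wrap each call to Service B in `breaker.call(call_service_b, fallback)`, where both callables are supplied by the application.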
Retries with Backoff
When a request fails, retry it, but not immediately. Immediate retries into a failing system make things worse. The standard pattern is exponential backoff with jitter: wait a bit, retry; wait longer, retry; wait even longer, retry; and add randomness so that clients don't all retry at the same moment and create a thundering herd.
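One common way to implement this is "full jitter": before each retry, sleep for a random duration between zero and an exponentially growing ceiling. The sketch below assumes that variant. The function name, defaults, and the injectable `sleep` and `rng` parameters are illustrative choices, not a standard API.

```python
import random
import time


def retry_with_backoff(fn, max_retries=4, base_s=0.1, cap_s=10.0,
                       sleep=time.sleep, rng=random):
    """Call fn, retrying on any exception with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Ceiling doubles each attempt (0.1s, 0.2s, 0.4s, ...) up to cap_s.
            ceiling = min(cap_s, base_s * 2 ** attempt)
            # Full jitter: pick a random wait in [0, ceiling] so clients spread out.
            sleep(rng.uniform(0, ceiling))
```

In real code you would usually retry only on errors known to be transient (timeouts, connection resets) and only for operations that are safe to repeat. Retrying a non-idempotent request is how a payment gets processed twice.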
Timeouts Everywhere
Every external call, including database queries, HTTP requests, and message queue reads, needs a timeout. Without timeouts, a single slow dependency can hold threads open indefinitely, eventually starving the entire service of resources. Set timeouts aggressively. A request that takes 10 seconds to fail is nearly as bad as one that never returns.
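Most client libraries accept a timeout parameter directly, and that is usually the right place to set one. As a library-agnostic sketch, the example below uses Python's `asyncio.wait_for` to bound how long a call may take. `slow_dependency` and `call_with_timeout` are hypothetical names, and the sleep stands in for a real network call.

```python
import asyncio


async def slow_dependency(delay_s):
    """Stand-in for an external call that takes delay_s seconds."""
    await asyncio.sleep(delay_s)
    return "real response"


async def call_with_timeout(awaitable, timeout_s, fallback):
    """Wait at most timeout_s for the call, then give up and use the fallback."""
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the pending call, so it stops consuming resources.
        return fallback


print(asyncio.run(call_with_timeout(slow_dependency(1.0), 0.05, "fallback")))
```

The caller gets an answer within about 50 milliseconds instead of waiting a full second for the slow dependency.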
Full Resilience: circuit breaker, retry queues, fallback cache, and health monitoring working together inside the network boundary
Chaos Engineering: Netflix's Chaos Monkey
Here's a counterintuitive idea: one of the best ways to build reliable systems is to deliberately break them in production.
Netflix pioneered this approach with a tool called Chaos Monkey, part of what they called the Simian Army. Chaos Monkey randomly terminates EC2 instances in Netflix's production environment during business hours. The philosophy: if you're going to experience random failures anyway, you'd rather discover them when your engineers are awake and ready to respond.
By forcing services to survive random instance terminations, Netflix's teams built resilient architectures out of necessity. Services that couldn't handle an instance dying were fixed quickly. The organization developed muscle memory for responding to failures.
Chaos engineering has since expanded into a discipline of its own:
- Chaos Monkey: kills random instances
- Chaos Gorilla: takes down an entire AWS availability zone
- Latency Monkey: introduces artificial network delays
- Chaos Kong: simulates an entire AWS region failure
The key principle: test your failure modes before your users discover them. A failure you've practiced recovering from is far less scary than one you've never seen before.
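The core idea can be shown with a toy simulation. This is not Netflix's tool or its API: the fleet is just a set of made-up instance names, and "terminating" an instance means removing it from the set and then checking whether the service would still have enough capacity.

```python
import random


def terminate_random_instance(instances, rng):
    """Remove and return one randomly chosen instance (the 'monkey')."""
    victim = rng.choice(sorted(instances))  # sorted for reproducibility
    instances.remove(victim)
    return victim


def service_is_healthy(instances, min_healthy=1):
    """The service survives as long as enough instances remain."""
    return len(instances) >= min_healthy


rng = random.Random(42)          # seeded so the experiment is repeatable
fleet = {"i-a", "i-b", "i-c"}    # hypothetical instance IDs

victim = terminate_random_instance(fleet, rng)
print(f"terminated {victim}; healthy: {service_is_healthy(fleet, min_healthy=2)}")
```

A service that only stays healthy with all three instances running would fail this experiment, which is exactly the kind of weakness the practice is meant to surface early.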
The Blame-Free Post-Mortem
When things go wrong, as they inevitably will, how your organization responds matters enormously for long-term reliability. The best engineering cultures run blameless post-mortems: structured analyses of what went wrong, why it went wrong, and what systemic changes will prevent recurrence.
Post-mortems that assign blame to individuals don't improve systems; instead, they make engineers afraid to take risks or report issues. Post-mortems that focus on systems, processes, and tooling create compounding improvements over time.
Google's Site Reliability Engineering (SRE) book describes this as "postmortem culture," treating every significant incident as a learning opportunity for the organization rather than a search for someone to blame.
What is the primary goal of a blameless post-mortem?
Summary
Reliability is about consistent correct behavior over time, not just uptime. MTBF (time between failures) and MTTR (time to recover) are the key metrics, and improving MTTR often matters more than reducing failure frequency. Fault tolerance patterns like graceful degradation, circuit breakers, retries, and timeouts help systems survive component failures. Netflix's Chaos Monkey takes this a step further: deliberately break things in production so you find the weaknesses before your users do. And when failures happen, blameless post-mortems are how organizations get better over time.