Availability
Updated June 6, 2026
Availability is simple to define and surprisingly hard to achieve: availability is the percentage of time a system is operational and able to serve requests.
If your system is down for 1 hour out of 100 hours, it's 99% available. Easy math. The hard part is that the difference between 99% and 99.99% is not "a little better"; it's a completely different engineering challenge.
The Nines
In the industry, availability is usually described in "nines," meaning the number of 9s in the availability percentage:
| Availability | Downtime per year | Downtime per month | Downtime per week |
|---|---|---|---|
| 99% (two nines) | 87.6 hours | 7.3 hours | 1.68 hours |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | 10.1 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds |
Let that sink in. Five nines means your entire annual downtime budget is about 5 minutes. That's not a typo.
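The figures in the table come from simple arithmetic, sketched below. The function name `downtime_budget` and the 365-day year are assumptions made for illustration, not part of any standard library or SLA definition.

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_budget(availability_pct: float,
                    period_hours: float = HOURS_PER_YEAR) -> timedelta:
    """Return the downtime allowed by an availability target over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1 - availability_pct / 100
    return timedelta(hours=period_hours * unavailable_fraction)


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_budget(pct).total_seconds() / 60
    print(f"{pct}% -> {minutes:.1f} minutes per year")
```

Passing a different `period_hours` (for example `HOURS_PER_YEAR / 12` for an average month) gives the other columns.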
Most consumer apps target three or four nines. Payment systems, healthcare, and emergency services push for five. True five-nines availability is extremely expensive and requires deep architectural investment, including redundancy everywhere, global distribution, rigorous change management, and runbooks for every conceivable failure.
Here's the thing about SLAs: a vendor promising 99.9% uptime still gets to be down for almost 44 minutes per month without breaking that promise. Read the fine print.
Why High Availability Is Hard
The dirty secret of availability is that every component in your system, including servers, databases, networks, load balancers, and DNS providers, can and will fail. Independently. Simultaneously. At the worst possible time.
And downtime compounds. If you have three services each with 99.9% availability, chained together (A calls B calls C), and their failures are independent, your end-to-end availability is 99.9% × 99.9% × 99.9% ≈ 99.7%. Add more dependencies and the math gets grim fast.
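The chained-service arithmetic can be sketched as a product of availabilities. This is a simplified model that assumes failures are independent and that every service in the chain must be up; `serial_availability` is a name chosen here for illustration.

```python
from math import prod


def serial_availability(availabilities):
    """End-to-end availability when every dependency in the chain must be up.

    Assumes failures are independent, which real systems only approximate.
    """
    return prod(availabilities)


chain = [0.999, 0.999, 0.999]  # A calls B calls C
print(f"{serial_availability(chain):.4%}")  # prints 99.7003%
```

Under the same assumptions, ten such dependencies in a chain come out at roughly 99.0%, which is a full nine lost.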
This is why high availability isn't about making individual components bulletproof; instead, it's about designing systems that survive component failures without going down.
Single Point of Failure (SPOF): a single server and database setup with no redundancy
High Availability Patterns
1. Redundancy
The most fundamental HA pattern: run multiple copies of everything. If one instance dies, another takes over. No single instance should be indispensable.
This applies at every layer:
- Multiple application servers behind a load balancer
- Database replicas: a primary and one or more standbys
- Multiple availability zones to ensure a datacenter outage doesn't take you down
- Multiple DNS providers, as DNS failure is more common than people think
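To see why redundancy helps, here is a rough sketch of the availability of several interchangeable copies. It assumes failures are independent and that any single healthy copy can carry the load; correlated failures (a shared datacenter, a bad deploy pushed everywhere) make real gains smaller. The function name is hypothetical.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability of N interchangeable copies where one healthy copy is enough.

    Idealized model: assumes independent failures and instant failover.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system is down only when every copy is down at the same time.
    return 1 - (1 - single) ** copies


for n in (1, 2, 3):
    print(f"{n} copies of a 99% component -> {redundant_availability(0.99, n):.4%}")
```

In this idealized model, each extra copy of a 99% component adds roughly two nines, which is why redundancy is usually cheaper than trying to harden a single instance.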
2. Failover
Redundancy is useless unless your system can detect failures and route around them automatically. That's failover.
Active-passive failover: one node handles traffic (active), another sits on standby (passive). If the active node fails, traffic is routed to the passive node. The downside: the passive node is idle capacity, and failover takes some time, ranging from seconds to minutes depending on detection speed.
Active-Passive Failover: standby components ready to take over, with async database replication
Active-active failover: all nodes handle traffic simultaneously. If one fails, the remaining nodes absorb its share. More efficient but harder to implement, especially for databases where write conflicts become a concern.
Active-Active Failover: all servers actively serving traffic simultaneously, sharing the load
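The two failover styles differ mainly in how a router picks a node. The sketch below is a toy in-memory model, not a real load balancer: `Node`, `pick_active_passive`, and `ActiveActiveRouter` are hypothetical names, and the `healthy` flag stands in for whatever failure detection the system uses.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Node:
    name: str
    healthy: bool = True


def pick_active_passive(nodes):
    """Return the first healthy node in priority order (index 0 is the primary)."""
    for node in nodes:
        if node.healthy:
            return node
    raise RuntimeError("no healthy nodes available")


class ActiveActiveRouter:
    """Round-robin across every node that is currently healthy."""

    def __init__(self, nodes):
        self.nodes = nodes
        self._counter = count()

    def pick(self):
        healthy = [n for n in self.nodes if n.healthy]
        if not healthy:
            raise RuntimeError("no healthy nodes available")
        return healthy[next(self._counter) % len(healthy)]


primary, standby = Node("primary"), Node("standby")
print(pick_active_passive([primary, standby]).name)  # primary
primary.healthy = False  # simulate the active node failing
print(pick_active_passive([primary, standby]).name)  # standby
```

In the active-passive case the standby only receives traffic once the primary is marked unhealthy; in the active-active case a failed node simply drops out of the rotation and the others absorb its share.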
3. Health Checks
How does your load balancer know a server is down? Health checks: periodic pings to an endpoint (usually /health or /ping) that verify the service is alive and responding. If a node fails health checks, it's taken out of rotation until it recovers.
Good health checks go beyond "is the process running?" by verifying that the service can actually do its job: connect to the database, reach its dependencies, and return a valid response. A server that's up but can't reach the database is not healthy.
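A "deep" health check along these lines might look like the following sketch. It is framework-agnostic: `run_health_check` is a hypothetical helper, and the two check functions are stand-ins for real probes such as running a trivial database query. The 200/503 status codes are a common convention, assumed here rather than taken from any particular load balancer.

```python
from typing import Callable, Dict, Tuple


def run_health_check(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run every dependency check and return (HTTP status, per-check results)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


# Hypothetical probes, stubbed out for the example.
def database_reachable() -> bool:
    raise ConnectionError("database unreachable")


def cache_reachable() -> bool:
    return True


status, details = run_health_check(
    {"database": database_reachable, "cache": cache_reachable}
)
print(status, details)
```

The process is running and the cache is fine, but the node still reports itself unhealthy because it cannot reach the database. One design caution: if every node shares the same failing dependency, deep checks can pull all nodes out of rotation at once, so some teams keep a separate shallow liveness check as well.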
4. Geographic Distribution
A datacenter fire, a regional network outage, or a natural disaster can take out everything in a single geographic area. Truly high-availability systems distribute across multiple regions.
AWS calls these "regions" (us-east-1, eu-west-1, etc.). Inside each region are multiple "availability zones" (AZs), each made up of one or more isolated datacenters with independent power and networking. The standard HA recommendation is to spread across at least two AZs, and for critical systems, across regions.
Multi-AZ Active-Active: fully redundant deployment across Availability Zones inside a Cloud Region
The AWS Outage Story
In December 2021, AWS us-east-1 had a major outage. This took down not just AWS customers but services those customers depended on, including Disney+, Slack, Duolingo, Ring cameras, Roomba, and even Amazon's own package tracking. All disrupted, in large part because they depended on a single AWS region.
The lesson wasn't "don't use AWS." The lesson was: don't assume your cloud provider is infinitely reliable. AWS's published SLAs commit to figures such as 99.99% at the region level for some services, and the exact number varies by service. Even 99.99% still allows over 50 minutes of downtime per year, per service.
Systems that need true high availability run across multiple cloud providers or multiple regions, with automated failover between them. While it's expensive and complex, for the right use cases, it's the only way.
Availability vs. Uptime
One subtle but important distinction: availability is about the user experience, not just the process being alive.
A server can be running while:
- Returning 500 errors due to a dependency failure
- Responding so slowly that requests time out
- Serving stale data because replication is broken
None of these are "up" from a user's perspective. Good availability monitoring measures the end-to-end user experience, not just whether a process is alive.
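One way to make this concrete is to compute availability from request outcomes rather than from process uptime. The sketch below is a minimal illustration: `RequestOutcome`, `request_availability`, and the 2-second timeout threshold are all assumptions, and it covers only the first two failure modes above (errors and slowness), since detecting stale data needs a separate freshness signal.

```python
from dataclasses import dataclass


@dataclass
class RequestOutcome:
    status_code: int
    latency_ms: float


def request_availability(outcomes, timeout_ms: float = 2000.0) -> float:
    """Fraction of requests that succeeded from the user's point of view.

    A request counts as failed if it returned a 5xx status or took longer
    than timeout_ms (an assumed threshold; pick one that fits your clients).
    """
    if not outcomes:
        raise ValueError("need at least one request outcome")
    good = sum(
        1 for o in outcomes
        if o.status_code < 500 and o.latency_ms <= timeout_ms
    )
    return good / len(outcomes)


sample = [
    RequestOutcome(200, 120.0),
    RequestOutcome(200, 95.0),
    RequestOutcome(500, 30.0),    # fast, but an error
    RequestOutcome(200, 5000.0),  # succeeded, but too slow to be useful
]
print(f"{request_availability(sample):.0%}")  # prints 50%
```

The server behind these four requests was "up" the whole time, yet only half of the requests were served in a way a user would call available.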
Summary
Availability is the percentage of time your system is operational, expressed in "nines." The jump from 99% to 99.999% is not incremental; it requires fundamentally different architecture. The core patterns for high availability are redundancy (multiple copies), failover (automatic rerouting when things break), health checks (detecting failures fast), and geographic distribution (surviving regional outages). AWS outages are a reminder that even the best infrastructure fails, meaning the only way to survive is to design assuming failure will happen.