Availability

Updated June 6, 2026

Magic Magnets Team

Availability is simple to define and surprisingly hard to achieve: it is the percentage of time a system is operational and able to serve requests.

If your system is down for 1 hour out of 100 hours, it's 99% available. Easy math. The hard part is that the difference between 99% and 99.99% is not "a little better"; it's a completely different engineering challenge.

The Nines

In the industry, availability is usually described in "nines," counting how many 9s appear in the percentage:

Availability	Downtime per year	Downtime per month	Downtime per week
99% (two nines)	87.6 hours	7.3 hours	1.68 hours
99.9% (three nines)	8.76 hours	43.8 minutes	10.1 minutes
99.99% (four nines)	52.6 minutes	4.38 minutes	1.01 minutes
99.999% (five nines)	5.26 minutes	26.3 seconds	6.05 seconds

Let that sink in. Five nines means your entire annual downtime budget is about 5 minutes. That's not a typo.
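
The figures in the table follow from one line of arithmetic: allowed downtime is the unavailable fraction multiplied by the length of the period. Here is a minimal sketch, assuming a 365-day year and a month defined as one twelfth of it (the function name `downtime_hours` is ours, not from any library):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Hours of downtime allowed in a period at the given availability."""
    return (1 - availability_pct / 100) * period_hours


for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_hours(pct, HOURS_PER_YEAR)
    per_month = downtime_hours(pct, HOURS_PER_YEAR / 12)
    per_week = downtime_hours(pct, 7 * 24)
    print(
        f"{pct}%: {per_year:.2f} h/year, "
        f"{per_month * 60:.1f} min/month, {per_week * 60:.2f} min/week"
    )
```

Each extra nine divides the budget by ten, which is why every step up the table is a different class of engineering problem.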

Most consumer apps target three or four nines. Payment systems, healthcare, and emergency services push for five. True five-nine availability is extremely expensive and requires deep architectural investment, including redundancy everywhere, global distribution, rigorous change management, and runbooks for every conceivable failure.

Here's the thing about SLAs: a vendor promising 99.9% uptime still gets to be down for almost 44 minutes per month. Read the fine print.

Why High Availability Is Hard

The dirty secret of availability is that every component in your system, including servers, databases, networks, load balancers, and DNS providers, can and will fail. Independently. Simultaneously. At the worst possible time.

And downtime compounds. If you have three services each with 99.9% availability, chained together (A calls B calls C), and their failures are independent, your end-to-end availability is 99.9% × 99.9% × 99.9% ≈ 99.7%. Add more dependencies and the math gets grim fast.
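
The chain calculation generalizes to any number of dependencies. The sketch below assumes that failures are independent and that every request needs every service in the chain; `serial_availability` is an illustrative name, not a library function:

```python
from math import prod


def serial_availability(availabilities):
    """End-to-end availability when a request needs every service to be up."""
    return prod(availabilities)


three_services = [0.999, 0.999, 0.999]
print(f"{serial_availability(three_services):.4%}")  # 99.7003%

# Ten dependencies at three nines each already drop you to about two nines.
ten_services = [0.999] * 10
print(f"{serial_availability(ten_services):.2%}")
```

Real dependencies often fail together (a shared network, a shared region), so treat the product as a rough planning figure rather than a guarantee.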

This is why high availability isn't about making individual components bulletproof; instead, it's about designing systems that survive component failures without going down.

No redundancy — one server and one database. Any failure here takes the entire system down. This is a Single Point of Failure at every layer.

Single Point of Failure (SPOF): a single server and database setup with no redundancy

High Availability Patterns

1. Redundancy

The most fundamental HA pattern: run multiple copies of everything. If one instance dies, another takes over. No single instance should be indispensable.

This applies at every layer:

Multiple application servers behind a load balancer
Database replicas: a primary and one or more standbys
Multiple availability zones to ensure a datacenter outage doesn't take you down
Multiple DNS providers, as DNS failure is more common than people think
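
Redundancy works in the opposite direction from chaining: the layer is down only when every copy is down at once. As a rough sketch, assuming copies fail independently and failover between them is instant and perfect (both optimistic assumptions), the availability of a redundant layer can be estimated like this:

```python
def redundant_availability(instance_availability: float, copies: int) -> float:
    """Availability of a layer that is up as long as at least one copy is up."""
    if copies < 1:
        raise ValueError("need at least one copy")
    # The layer fails only if all copies fail simultaneously.
    return 1 - (1 - instance_availability) ** copies


for copies in (1, 2, 3):
    print(copies, f"{redundant_availability(0.99, copies):.4%}")
```

Under these assumptions, two 99% instances give roughly 99.99% for the layer. Correlated failures, such as both copies sharing a rack or a bad deploy, erode that number quickly, which is why the list above spans zones and providers.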

2. Failover

Redundancy is useless unless your system can detect failures and route around them automatically. That's failover.

Active-passive failover: one node handles traffic (active), another sits on standby (passive). If the active node fails, traffic is routed to the passive node. The downside: the passive node is idle capacity, and failover takes some time, ranging from seconds to minutes depending on detection speed.

Active-passive setup — the load balancer sends traffic to the active server. The standby takes over automatically if the active node fails. The database replica promotes to primary on failure (~30s failover window).

Active-Passive Failover: standby components ready to take over, with async database replication
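
To make the detection delay concrete, here is a minimal sketch of active-passive failover logic. The class name `ActivePassivePair`, the `failure_threshold` parameter, and the node names are all hypothetical; real systems delegate this to a load balancer or cluster manager:

```python
class ActivePassivePair:
    """Tracks which of two nodes should receive traffic."""

    def __init__(self, active: str, standby: str, failure_threshold: int = 3):
        self.active = active
        self.standby = standby
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record_health_check(self, healthy: bool) -> str:
        """Record one health check of the active node.

        Returns the node that should serve traffic after this check.
        """
        if healthy:
            self.consecutive_failures = 0
            return self.active
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Promote the standby; the failed node becomes the new standby.
            self.active, self.standby = self.standby, self.active
            self.consecutive_failures = 0
        return self.active


pair = ActivePassivePair("server-a", "server-b")
for result in (True, False, False, False):
    print(pair.record_health_check(result))
```

With a threshold of three failed checks, the failover window is at least three check intervals long. Lowering the threshold speeds up detection but risks failing over on a single dropped packet.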

Active-active failover: all nodes handle traffic simultaneously. If one fails, the remaining nodes absorb its share. More efficient but harder to implement, especially for databases where write conflicts become a concern.

Active-active setup — all servers are active and handle traffic simultaneously. If one server fails, the load balancer routes all traffic to the remaining active server(s). Writes continue to flow to the primary database, replicated asynchronously.

Active-Active Failover: all servers actively serving traffic simultaneously, sharing the load
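
The active-active behavior can be sketched as round-robin routing over whichever nodes are currently healthy. This is an illustration only: `ActiveActivePool` and the node names are made up, and it ignores the database write-conflict problem mentioned above:

```python
from itertools import count


class ActiveActivePool:
    """Round-robin routing across all currently healthy nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._counter = count()

    def mark_down(self, node: str) -> None:
        self.healthy.discard(node)

    def mark_up(self, node: str) -> None:
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self) -> str:
        """Return the node that should handle the next request."""
        candidates = [n for n in self.nodes if n in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy nodes available")
        return candidates[next(self._counter) % len(candidates)]


pool = ActiveActivePool(["server-a", "server-b", "server-c"])
print([pool.pick() for _ in range(3)])
pool.mark_down("server-b")
print([pool.pick() for _ in range(4)])  # only server-a and server-c remain
```

Note the capacity implication: when one of three nodes fails, each survivor takes on half again as much traffic, so the pool needs headroom for that before the failure happens.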

3. Health Checks

How does your load balancer know a server is down? Health checks — periodic pings to an endpoint (usually /health or /ping) that verify the service is alive and responding. If a node fails health checks, it's taken out of rotation until it recovers.

Good health checks go beyond "is the process running?" by verifying that the service can actually do its job: connect to the database, reach its dependencies, and return a valid response. A server that's up but can't reach the database is not healthy.
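
A "deep" health check of this kind might look like the sketch below. It is framework-agnostic: `run_health_check` and the stub functions `database_reachable` and `payments_api_reachable` are hypothetical stand-ins for real connectivity probes, and the result would be returned from whatever handler serves /health:

```python
from typing import Callable, Dict, Tuple


def run_health_check(
    checks: Dict[str, Callable[[], bool]]
) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check; report 200 only if all of them pass."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as unhealthy
            results[name] = f"error: {exc}"
    status = 200 if all(r == "ok" for r in results.values()) else 503
    return status, results


# Hypothetical probes; real ones would open a connection or send a request.
def database_reachable() -> bool:
    return True


def payments_api_reachable() -> bool:
    raise TimeoutError("no response within 1s")


print(run_health_check({
    "database": database_reachable,
    "payments_api": payments_api_reachable,
}))
```

One design caution: if every instance checks the same shared dependency, a brief blip in that dependency can pull all instances out of rotation at once. Some teams therefore keep a shallow liveness check separate from a deeper readiness check.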

4. Geographic Distribution

A datacenter fire, a regional network outage, or a natural disaster can take out everything in a single geographic area. Truly high-availability systems distribute across multiple regions.

AWS calls these "regions" (us-east-1, eu-west-1, etc.). Inside each region are multiple "availability zones" (AZs), each made up of one or more isolated datacenters with independent power and networking. The standard HA recommendation is to spread across at least two AZs, and for critical systems, across regions.

Multi-AZ active-active — two availability zones each run a full set of servers and databases inside the AWS Region. Traffic splits between AZs; if one zone fails entirely, the other absorbs all traffic. Cross-AZ replication keeps both databases in sync.

Multi-AZ Active-Active: fully redundant deployment across Availability Zones inside a Cloud Region
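
"The other zone absorbs all traffic" only holds if the surviving zones have enough capacity. Below is a small sketch of that check, assuming capacity is measured in interchangeable instances; the function name and the placement dictionaries are illustrative:

```python
def survives_zone_loss(instances_per_zone: dict, required_instances: int) -> bool:
    """True if losing any single zone still leaves enough instances."""
    if not instances_per_zone:
        return False
    total = sum(instances_per_zone.values())
    return all(
        total - lost >= required_instances
        for lost in instances_per_zone.values()
    )


# Peak traffic needs 2 instances in this example.
print(survives_zone_loss({"us-east-1a": 4}, 2))                   # single zone
print(survives_zone_loss({"us-east-1a": 3, "us-east-1b": 1}, 2))  # lopsided
print(survives_zone_loss({"us-east-1a": 2, "us-east-1b": 2}, 2))  # balanced
```

Only the balanced placement passes. The cost is visible in the numbers: surviving the loss of one of two zones means running twice the capacity you need on a normal day.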

The AWS Outage Story

In December 2021, AWS us-east-1 had a major outage. It disrupted not just AWS's direct customers but also the services built on top of them, including Disney+, Slack, Duolingo, Ring cameras, Roomba, and even Amazon's own package tracking. All were affected because they depended on a single AWS region.

The lesson wasn't "don't use AWS." The lesson was: don't assume your cloud provider is infinitely reliable. AWS's commitments vary by service, and even a 99.99% target for a service in a region still allows over 50 minutes of downtime per year, per service.

Systems that need true high availability run across multiple regions or multiple cloud providers, with automated failover between them. That is expensive and complex, but for the right use cases it's the only way.

Availability vs. Uptime

One subtle but important distinction: availability is about the user experience, not just the process being alive.

A server can be running while:

Returning 500 errors due to a dependency failure
Responding so slowly that requests time out
Serving stale data because replication is broken

None of these are "up" from a user's perspective. Good availability monitoring measures the end-to-end user experience, not just whether a process is alive.
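
One way to measure this is to compute availability from request outcomes rather than from process uptime. The sketch below is a simplification: the `RequestOutcome` type and the 2-second timeout are assumptions, and it does not detect stale data, which needs a separate freshness check:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestOutcome:
    status_code: int
    latency_seconds: float


def user_facing_availability(
    outcomes: List[RequestOutcome], timeout_seconds: float = 2.0
) -> float:
    """Fraction of requests that succeeded from the user's point of view."""
    if not outcomes:
        raise ValueError("no requests observed")
    good = sum(
        1
        for o in outcomes
        # A 5xx response or a reply slower than the timeout counts as a failure.
        if o.status_code < 500 and o.latency_seconds <= timeout_seconds
    )
    return good / len(outcomes)


observed = [
    RequestOutcome(200, 0.12),
    RequestOutcome(500, 0.05),  # dependency failure
    RequestOutcome(200, 5.40),  # too slow, the client gave up
    RequestOutcome(200, 0.30),
]
print(f"{user_facing_availability(observed):.0%}")  # 50%
```

The process was "up" for all four requests, yet only half of them were served successfully.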

Summary

Availability is the percentage of time your system is operational, expressed in "nines." The jump from 99% to 99.999% is not incremental; it requires fundamentally different architecture. The core patterns for high availability are redundancy (multiple copies), failover (automatic rerouting when things break), health checks (detecting failures fast), and geographic distribution (surviving regional outages). AWS outages are a reminder that even the best infrastructure fails, meaning the only way to survive is to design assuming failure will happen.
