Consensus Algorithms

Updated June 3, 2026

Magic Magnets Team

10 min read

There's a fundamental problem at the heart of distributed systems: how do multiple nodes agree on a single value when any of them might fail or messages might get lost?

This is the consensus problem. And it's harder than it sounds.

What Does "Agreement" Mean?

Imagine you have three database nodes. A client writes a value. The nodes need to agree:

Which value got committed?
In what order did writes happen?
Who's in charge if the primary fails?

These all reduce to the same underlying problem: getting multiple independent nodes to agree on something in the presence of failures.

A correct consensus algorithm must satisfy three properties:

Agreement: All non-faulty nodes decide on the same value
Validity: The decided value must have been proposed by some node (you can't just make something up)
Termination: All non-faulty nodes eventually decide

Sounds simple. The catch: FLP impossibility (Fischer, Lynch, Paterson, 1985) proves that no deterministic consensus algorithm can guarantee termination in a purely asynchronous system where even one node might crash. It's mathematically impossible.

Quiz Time

What does the FLP impossibility result prove?

So how does anything work? By relaxing assumptions. Real systems assume partial synchrony — networks are mostly reliable with bounded delays — and they use randomization or timeouts to ensure termination in practice.

Paxos: The Original

Paxos was described by Leslie Lamport in a 1989 paper he famously wrote in the style of a Greek parliament. The paper sat unpublished for almost a decade because reviewers found it too quirky. When it was finally published in 1998, it became one of the most cited papers in distributed systems.

Paxos solves single-value consensus in three roles: Proposers (who want to commit a value), Acceptors (who vote), and Learners (who learn the decided value).

The algorithm runs in two phases:

Phase 1: Prepare A proposer picks a proposal number n and sends a Prepare(n) message to a majority of acceptors. Each acceptor responds with the highest proposal number it's already accepted (if any) and promises not to accept anything with a lower number.

Phase 2: Accept If the proposer gets responses from a majority, it sends an Accept(n, v) message. The value v is either its own proposed value or the highest-numbered value returned in Phase 1 (to avoid overwriting already-committed values). Acceptors accept if they haven't promised to ignore it.

Once a majority accepts, the value is decided.

The honest assessment of Paxos: it's correct, but it's notoriously difficult to understand and even harder to implement correctly. Most people who think they understand Paxos have understood a simplified version. Multi-Paxos (for replicating a log of values, not just one) introduces additional complexity. Chubby (Google's distributed lock service) is built on Paxos. Their engineers have said it took a significant effort to make the implementation production-ready.

Raft: Built to Be Understood

Raft was designed explicitly as an alternative to Paxos that's easier to understand. The authors' paper (Ongaro and Ousterhout, 2014) is titled "In Search of an Understandable Consensus Algorithm." They surveyed students after teaching both algorithms and found Raft significantly easier to grasp.

Raft decomposes consensus into three sub-problems:

1. Leader Election

Raft elects a leader who takes full responsibility for managing the replicated log. Followers only do what the leader tells them.

Time is divided into terms, monotonically increasing integers. Each term begins with an election. If followers stop receiving heartbeats from the leader (leader crash), they start a new election.

A node becomes a candidate and requests votes from peers. A node votes for a candidate if:

It hasn't voted in this term yet
The candidate's log is at least as up-to-date as the voter's log

The first candidate to get a majority wins and becomes the new leader.

2. Log Replication

The leader accepts client writes, appends them to its log, and sends them to followers via AppendEntries RPC. Once a majority of nodes have written the entry to their logs, it's considered committed. The leader notifies followers, who apply the entry to their state machines.

Client → Leader: "set x = 5"
Leader → Follower A, B, C: AppendEntries(x=5)
A, B confirm (majority!) → Leader commits
Leader → Client: "OK"
Leader → Followers: "entry is committed, apply it"

3. Safety

Raft's election mechanism ensures that a leader always has all committed entries. A candidate with a less up-to-date log cannot win an election (followers won't vote for it). This property, Leader Completeness, is what makes Raft safe.

Property	Raft	Multi-Paxos
Understandability	High	Low
Leader election	Explicit	Implicit
Log replication	Leader-driven	More complex
Widely adopted	Yes (etcd, CockroachDB)	Yes (Chubby, Spanner)

ZAB: Zookeeper's Variant

Zookeeper uses a protocol called ZAB (Zookeeper Atomic Broadcast), which is conceptually similar to Paxos but optimized for the specific case of replicating a transaction log. ZAB has two modes: recovery (leader election after a crash) and broadcast (normal operation). It predates Raft but solves the same core problem.

Quiz Time

Which systems use Raft as their consensus algorithm?

Select all that apply

Real-World Usage

etcd is the most prominent user of Raft. It's the backing store for Kubernetes. Cluster state, configuration, and coordination all flow through etcd. Every time Kubernetes schedules a pod or updates a service, that decision goes through Raft consensus.

CockroachDB uses Raft at the range level. Each range of keys is replicated across nodes using Raft. This gives it serializable transactions across a distributed cluster.

Consul (by HashiCorp) uses Raft for leader election and service catalog replication.

Google Spanner uses a Paxos variant for replication across its globally distributed nodes. Paxos is more flexible for the complex multi-region topology Spanner operates in.

Apache Kafka (KRaft mode) replaced Zookeeper with its own Raft-based controller quorum in Kafka 3.x, removing a major operational dependency.

What Consensus Is Actually Used For

It's worth being precise: consensus algorithms are used for coordination, not bulk data storage.

You don't store your entire database in etcd. etcd stores a few hundred MB to a few GB of metadata — configuration, leader election results, distributed locks, service discovery entries. Consensus is expensive (requires majority quorum per write), so you use it sparingly for the things that genuinely need strong consistency.

The pattern: Use a consensus-backed store (etcd, Zookeeper) for coordination: who's the leader, what's the configuration, who holds this lock. Use regular databases and caches for your application data.

Summary

The consensus problem asks how distributed nodes can agree on a value when nodes can fail. A correct consensus algorithm must guarantee agreement, validity, and termination — though FLP impossibility proves this is theoretically impossible in purely async systems, so real algorithms work under practical assumptions. Paxos is the original correct solution but is famously difficult to understand and implement. Raft was designed for understandability, breaking the problem into leader election, log replication, and safety guarantees — it's used by etcd, CockroachDB, and Consul. ZAB powers Zookeeper. Consensus is expensive, so it's used for coordination (leader election, distributed locks, configuration) rather than bulk data storage.

Paxos Algorithm

How helpful was this content?

Comments

0/2000

Saved on this device only