Split Brain Problem
Updated June 3, 2026
Imagine a pirate ship with two highly capable captains. They usually coordinate their commands. But one day, a cannonball destroys the ship's internal communication tubes.
Captain A in the bow thinks Captain B is dead. Captain B in the stern thinks Captain A is dead.
Suddenly, both captains assume full control. Captain A orders the ship to steer left to avoid a reef, while Captain B orders the ship to steer right to engage the enemy. The crew is confused, the ship spins out of control, and disaster strikes.
This is the Split Brain Problem.
The Core Concept
The split-brain problem occurs in a distributed system when a cluster of nodes divides into two or more independent, competing sub-clusters due to a network partition.
Because they can't communicate, each sub-cluster believes the other is dead. Therefore, each sub-cluster promotes itself as the "primary" or "master" and begins accepting writes from users.
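To make the failure mode concrete, here is a minimal sketch of two nodes that each promote themselves when the link between them drops. The `Node` class, its `on_peer_timeout` method, and the `simulate_partition` helper are hypothetical names invented for illustration, not part of any real clustering library.

```python
class Node:
    """Toy node that promotes itself when it stops hearing from its peer."""

    def __init__(self, name, role):
        self.name = name
        self.role = role

    def on_peer_timeout(self):
        # Naive rule: silence from the peer is treated as proof that it died.
        self.role = "primary"


def simulate_partition(nodes):
    """Cut the link: every node times out waiting for the other side."""
    for node in nodes:
        node.on_peer_timeout()
    return [node.name for node in nodes if node.role == "primary"]


nodes = [Node("A", "primary"), Node("B", "replica")]
primaries = simulate_partition(nodes)
print(primaries)  # both nodes now claim to be primary
```

The flaw is in the naive rule itself: a timeout cannot distinguish a dead peer from a broken link, so both sides reach the same conclusion.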
The Devastating Consequences
In a split-brain state, multiple masters modify data independently.
If one user updates their password on Sub-cluster A while another user updates their profile on Sub-cluster B, the two copies of the data diverge.
When the network is finally fixed and the two sub-clusters talk to each other again, you have a massive conflict. Which data is correct? How do you merge the conflicting updates? Often, you can't without manual human intervention or data loss.
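The merge problem can be sketched with plain dictionaries. This is an illustrative toy, assuming each sub-cluster started from the same snapshot (`base`) and that we can compare both sides against it; `find_conflicts` and the sample keys are hypothetical, not taken from any real database.

```python
def find_conflicts(base, side_a, side_b):
    """Return keys that both sides changed, to different values, since `base`."""
    conflicts = {}
    for key in set(side_a) | set(side_b):
        original = base.get(key)
        a_val = side_a.get(key)
        b_val = side_b.get(key)
        # A change on only one side can be merged automatically.
        # A different change on both sides has no obviously correct winner.
        if a_val != original and b_val != original and a_val != b_val:
            conflicts[key] = (a_val, b_val)
    return conflicts


base = {"password": "old-hash", "bio": "hello"}

side_a = dict(base)
side_a["password"] = "hash-from-A"   # written while partitioned, on Sub-cluster A

side_b = dict(base)
side_b["password"] = "hash-from-B"   # same key, written on Sub-cluster B
side_b["bio"] = "updated bio"        # only changed on B, so it merges cleanly

conflicts = find_conflicts(base, side_a, side_b)
print(conflicts)
```

Here the `bio` change merges without trouble, but the `password` key was written on both sides, so something (a rule such as last-write-wins, or a human) has to pick a winner and discard the other write.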
Real-World Examples
- Elasticsearch (Older versions): Historically struggled with split-brain issues if the cluster wasn't configured with a strict quorum, leading to corrupt or split search indexes.
- Database Replicas: If you have a primary MySQL database and a secondary replica, and the network between them fails, an automated failover script might incorrectly promote the replica to primary. Now you have two primaries accepting writes!
How to Prevent Split-Brain
To stop two captains from taking control, systems use a few clever mechanisms:
1. Quorum (Majority Rules)
The most common solution. You require a majority of nodes to agree before a primary can be elected or a write can be accepted. If you have 5 nodes, you need 3 to form a quorum. If the network splits into a group of 2 and a group of 3, only the group of 3 can elect a leader. The group of 2 realizes it's a minority and pauses operations.
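The majority rule is simple arithmetic, sketched below. The function names `quorum_size` and `has_quorum` are illustrative, not taken from a specific system.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that is a strict majority of the cluster."""
    return cluster_size // 2 + 1


def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True if this side of a partition may elect a leader or accept writes."""
    return reachable_nodes >= quorum_size(cluster_size)


# The 5-node example from the text: a 3/2 split.
print(quorum_size(5))    # 3
print(has_quorum(3, 5))  # True: the group of 3 can elect a leader
print(has_quorum(2, 5))  # False: the group of 2 pauses operations

# An even-sized cluster split down the middle leaves no side with a majority.
print(has_quorum(2, 4))  # False
```

Because two disjoint groups cannot both hold a strict majority, at most one side of any partition can keep operating. The last line also shows why clusters are usually sized with an odd number of nodes: a 2/2 split of a 4-node cluster halts both sides.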
2. STONITH (Shoot The Other Node In The Head)
Yes, this is a real technical term. In high-availability clusters (like Pacemaker), if Node A cannot reach Node B but can't tell whether Node B is actually dead, Node A signals a fencing device, such as a smart power switch, to literally cut the electricity to Node B. This guarantees Node B is dead before Node A takes over, eliminating the split-brain risk.
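The ordering is the important part: fence first, promote second. The sketch below uses a made-up `FakePowerSwitch` class as a stand-in for a real fencing device; it is not Pacemaker's API, and `take_over` is a hypothetical helper.

```python
class FakePowerSwitch:
    """Stand-in for a real fencing device (hypothetical, for illustration only)."""

    def __init__(self, working=True):
        self.working = working
        self.powered_on = {"node-a": True, "node-b": True}

    def power_off(self, node):
        # A real device can fail or time out; model that with `working`.
        if not self.working:
            return False
        self.powered_on[node] = False
        return True


def take_over(peer, switch):
    """Promote only after the peer is confirmed powered off."""
    if not switch.power_off(peer):
        # Fencing failed: the peer might still be alive and writing,
        # so promoting now would risk split-brain.
        return "standby"
    return "primary"


switch = FakePowerSwitch()
print(take_over("node-b", switch))       # primary
print(switch.powered_on["node-b"])       # False

broken = FakePowerSwitch(working=False)
print(take_over("node-b", broken))       # standby
```

If the fencing request cannot be confirmed, the safe choice in this sketch is to stay on standby and trade availability for consistency.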
Summary
- Split-brain happens when a network failure causes a system to split, and multiple nodes independently claim to be the primary leader.
- It leads to data corruption and conflicting writes.
- Quorums (requiring a strict majority vote) are the standard way to prevent it.
- Fencing or STONITH can physically ensure an isolated node cannot cause harm.