Heartbeats

Updated June 3, 2026

Magic Magnets Team

In a distributed system, nodes fail. Servers crash, network links go down, processes get stuck. The question isn't whether failures happen. It's how quickly you detect them and respond.

That's what heartbeats are for.

What Is a Heartbeat?

A heartbeat is a periodic signal that a node sends to indicate it's still alive and functioning. The name is deliberate — just like a pulse tells you a patient is alive, a heartbeat signal tells the system a node is still up.

The basic mechanism:

A node sends an "I'm alive" message every N seconds
A monitor (another node or a central coordinator) tracks when it last heard from each node
If a node goes silent for longer than a threshold, it's presumed dead

That's it. Conceptually, heartbeats are one of the simplest ideas in distributed systems. The subtleties are in the details.
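
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production monitor: the class name `HeartbeatMonitor`, its methods, and the injectable `clock` parameter are all assumptions made for this example.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat time per node and flags silent nodes."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the logic can be exercised without sleeping
        self.last_seen = {}  # node id -> time of the most recent heartbeat

    def record_heartbeat(self, node_id):
        """Call this whenever an "I'm alive" message arrives from node_id."""
        self.last_seen[node_id] = self.clock()

    def suspected_dead(self):
        """Return the nodes that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            node_id
            for node_id, seen_at in self.last_seen.items()
            if now - seen_at > self.timeout_seconds
        )
```

A monotonic clock is used here so that wall-clock adjustments on the monitor cannot make a healthy node look silent.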

Heartbeat Intervals and Failure Detection Speed

The interval between heartbeats determines how fast you detect a failure. Shorter intervals mean faster detection, but they also mean more network traffic and more CPU time spent on bookkeeping.

If you send a heartbeat every second and require 3 missed heartbeats before declaring a node dead, you detect failures in about 3 seconds. If you send a heartbeat every 10 seconds with the same threshold, failure detection takes up to 30 seconds.

The trade-off:

Interval	Time to Detect (3 missed heartbeats)	Network Overhead
100ms	Very fast (~300ms)	High
1s	Fast (~3s)	Moderate
10s	Slow (~30s)	Low

In practice, most systems use 1-5 second intervals with a timeout of 2-3 missed heartbeats. But the right values depend on your failure detection SLAs.
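
To make the arithmetic concrete, here is a small sketch that computes the detection window for a given interval and threshold. It assumes a simple model that the article does not spell out: the monitor declares a node dead once interval × threshold has passed since the last heartbeat it received, and network delay is ignored. The function name `detection_window` is made up for this example.

```python
def detection_window(interval_seconds, missed_threshold):
    """Return (best_case, worst_case) seconds between a crash and its detection.

    Assumes the monitor declares a node dead once
    interval_seconds * missed_threshold has passed since the last heartbeat
    it received, and ignores network delay.
    """
    timeout = interval_seconds * missed_threshold
    # Worst case: the node crashes right after sending a heartbeat,
    # so the full timeout has to run out.
    worst_case = timeout
    # Best case: the node crashes just before its next heartbeat was due,
    # so one interval of the timeout has already elapsed.
    best_case = timeout - interval_seconds
    return best_case, worst_case


for interval in (1, 10):
    print(interval, detection_window(interval, missed_threshold=3))
# 1 (2, 3)
# 10 (20, 30)
```

Under this model, "about 3 seconds" really means somewhere between 2 and 3 seconds, depending on where in the heartbeat cycle the node died.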

The False Positive Problem

Short intervals create another problem: false positives. A node might miss a heartbeat because of a temporary network blip, not because it's actually dead. If you declare it dead and trigger failover too aggressively, you might cause cascading disruption — new leader elections, traffic rerouting, state transfers — all for a node that recovers in 200ms.

Systems handle this by requiring multiple consecutive missed heartbeats before taking action, and by using adaptive timeouts that account for current network conditions.
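
One way to picture an adaptive timeout is to derive it from the heartbeat gaps actually observed, so that a jittery network automatically gets more slack. The sketch below is only one possible approach, and every name and default in it (`AdaptiveTimeout`, `safety_factor`, `floor_seconds`, the window size) is an assumption for illustration rather than something taken from a specific system.

```python
import statistics
from collections import deque


class AdaptiveTimeout:
    """Derives a timeout from recently observed gaps between heartbeats."""

    def __init__(self, window_size=100, safety_factor=4.0, floor_seconds=0.5):
        self.gaps = deque(maxlen=window_size)  # most recent inter-arrival gaps
        self.safety_factor = safety_factor
        self.floor_seconds = floor_seconds  # never go below this timeout
        self.last_arrival = None

    def record_heartbeat(self, arrival_time):
        if self.last_arrival is not None:
            self.gaps.append(arrival_time - self.last_arrival)
        self.last_arrival = arrival_time

    def current_timeout(self):
        # With too little history, fall back to the fixed floor.
        if len(self.gaps) < 2:
            return self.floor_seconds
        mean = statistics.fmean(self.gaps)
        spread = statistics.pstdev(self.gaps)
        # Typical gap plus a margin that grows with observed jitter.
        return max(self.floor_seconds, mean + self.safety_factor * spread)

    def is_suspect(self, now):
        if self.last_arrival is None:
            return False
        return now - self.last_arrival > self.current_timeout()
```

With perfectly regular 1-second heartbeats the timeout settles at 1 second. If gaps alternate between 1 and 2 seconds, it stretches to 3.5 seconds, so the same brief blip is less likely to trigger a failover.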

Centralized vs Gossip-Based Heartbeats

There are two common architectures for heartbeat monitoring: centralized and gossip-based.

Centralized Heartbeats

In a centralized model, all nodes send heartbeats to a single coordinator (or a small cluster of coordinators). The coordinator maintains a registry of all nodes and their last-seen timestamps.

Pros: Simple to implement, easy to reason about, central source of truth
Cons: The coordinator is a single point of failure. At very large scale, the coordinator becomes a bottleneck

This is the model ZooKeeper uses, with a replicated ensemble of servers playing the coordinator role. Every node in the cluster maintains a session with ZooKeeper. If a node's session expires (because heartbeats stopped), ZooKeeper marks it as unavailable and other nodes can react accordingly.

Gossip Protocol

In a gossip-based model, nodes share health information with each other — not with a central coordinator. Each node periodically picks a few random peers and shares its knowledge of which nodes are alive. Over multiple rounds, information propagates through the entire cluster, like a rumor spreading through a crowd.

Node A → tells Node C: "B and D are alive, E is dead"
Node C → tells Node F: "A, B, D alive, E dead"
Node F → tells Node B: "A, C, D alive, E dead"
// Within seconds, everyone knows

Pros: No single point of failure. Scales to thousands of nodes. Resilient to network partitions
Cons: Slightly less immediate (takes multiple rounds to propagate), more complex to implement

Cassandra uses gossip for cluster membership. Amazon's Dynamo and many peer-to-peer systems also use variants of gossip.
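
Here is a rough sketch of how gossip-style heartbeating can work. It assumes one common design, which is not necessarily what Cassandra or Dynamo do internally: each node keeps a heartbeat counter per peer, bumps its own counter every round, and keeps the higher counter when merging tables. A peer whose counter has not advanced for several rounds is suspected. All names (`GossipNode`, `gossip_round`, `fail_after_rounds`, `fanout`) are hypothetical.

```python
import random


class GossipNode:
    def __init__(self, node_id, fail_after_rounds=5):
        self.node_id = node_id
        self.fail_after_rounds = fail_after_rounds
        self.round = 0
        # peer id -> (highest counter seen, local round when it last increased)
        self.table = {node_id: (0, 0)}

    def tick(self):
        """Advance one round and bump our own heartbeat counter."""
        self.round += 1
        counter, _ = self.table[self.node_id]
        self.table[self.node_id] = (counter + 1, self.round)

    def digest(self):
        """The view we share with peers: peer id -> heartbeat counter."""
        return {peer: counter for peer, (counter, _) in self.table.items()}

    def merge(self, remote_digest):
        """Keep the freshest counter for every peer the sender knows about."""
        for peer, counter in remote_digest.items():
            known, _ = self.table.get(peer, (-1, 0))
            if counter > known:
                self.table[peer] = (counter, self.round)

    def suspected(self):
        """Peers whose counter has not advanced for too many rounds."""
        return sorted(
            peer
            for peer, (_, updated) in self.table.items()
            if self.round - updated > self.fail_after_rounds
        )


def gossip_round(nodes, fanout=2, rng=random):
    """Every live node ticks, then pushes its digest to a few random peers."""
    for node in nodes:
        node.tick()
    for node in nodes:
        peers = [p for p in nodes if p is not node]
        for peer in rng.sample(peers, min(fanout, len(peers))):
            peer.merge(node.digest())
```

Note that nobody ever announces "E is dead" in this design. Each node reaches that conclusion on its own when E's counter stops advancing, which is part of why there is no single point of failure.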

Real-World Usage

Kubernetes Liveness Probes

Kubernetes uses liveness probes as a heartbeat mechanism for application health. You configure a probe (an HTTP endpoint, a TCP connection, or a command run inside the container) and the kubelet runs it periodically. If the probe fails failureThreshold times in a row, Kubernetes restarts the container.

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3

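This is heartbeating from the orchestrator's perspective: a pull model rather than a push model. The application doesn't send heartbeats; Kubernetes actively probes it. With the settings above, probing starts after a 10-second delay, repeats every 5 seconds, and the container is restarted after 3 consecutive failures, so a hung process is caught in roughly 15 seconds.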
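
On the application side, all the probe needs is an endpoint that answers. The sketch below uses only Python's standard library and matches the /health path and port 8080 from the config above. The `healthy` flag and the `make_server` helper are assumptions for illustration; a real service would normally expose this through its existing web framework.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical flag: the application would set this to False
    # when it decides it is stuck and needs a restart.
    healthy = True

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = (200, b"ok") if self.healthy else (503, b"unhealthy")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging so probes do not flood the logs


def make_server(host="0.0.0.0", port=8080):
    """Build (but do not start) an HTTP server that answers the probe."""
    return ThreadingHTTPServer((host, port), HealthHandler)
```

Calling `make_server().serve_forever()` starts the endpoint. Kubernetes treats any HTTP status from 200 to 399 as success, so the 503 returned when `healthy` is False counts as a failed probe.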

ZooKeeper Sessions

ZooKeeper clients maintain a session with the server. The client must send heartbeats (called "pings") within the session timeout. If the session expires, ZooKeeper deletes the ephemeral znodes that the session created. Other clients watching those znodes are notified, which is a clean way to signal that a service is down and trigger leader re-election.

etcd

etcd (the backing store for Kubernetes) uses the Raft consensus algorithm, which has heartbeats built in. The Raft leader sends periodic heartbeat messages to followers to assert its authority. If a follower receives no heartbeat within its election timeout, it starts an election to pick a new leader. The heartbeat interval and the election timeout together control how quickly the cluster detects a leader failure and recovers.
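
The follower side of this can be sketched as a timer that is reset by every heartbeat. This follows the general Raft idea of a randomized election timeout, which makes it less likely that several followers start elections at the same moment. It is not etcd's actual code, and the class name `FollowerTimer` and its parameters are assumptions.

```python
import random


class FollowerTimer:
    """Decides when a follower should give up on the leader."""

    def __init__(self, timeout_min, timeout_max, now, rng=random):
        self.timeout_min = timeout_min
        self.timeout_max = timeout_max
        self.rng = rng
        self.deadline = None
        self.reset(now)

    def reset(self, now):
        # Pick a fresh random timeout each time so followers do not
        # all time out at the same instant.
        self.deadline = now + self.rng.uniform(self.timeout_min, self.timeout_max)

    def on_heartbeat(self, now):
        """The leader is alive: push the election deadline back."""
        self.reset(now)

    def should_start_election(self, now):
        """True once no heartbeat has arrived before the deadline."""
        return now >= self.deadline
```

For this to be stable, the election timeout has to be comfortably larger than the heartbeat interval. Otherwise a single delayed heartbeat is enough to trigger an unnecessary election.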

Heartbeats and Failure Detectors

In distributed systems theory, heartbeats are an implementation of a failure detector — a component that monitors nodes and outputs a list of nodes it suspects have failed.

The famous Chandra-Toueg paper (1996) classified failure detectors by two properties:

Completeness: is every failed node eventually suspected?
Accuracy: do we avoid suspecting nodes that are actually healthy (false positives)?

Perfect failure detectors (complete and always accurate) are impossible in asynchronous networks, because a crashed node cannot be told apart from a very slow one. The best real systems aim for "eventually perfect" detection: they may have brief false positives, but once message delays become predictable they stabilize to accurate detection.

This theoretical backdrop explains why getting heartbeat tuning right is non-trivial, and why production systems put care into their timeout configurations.
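
A rough way to see how "eventually perfect" behavior can arise: whenever a suspected node turns out to be alive, the detector admits the mistake and becomes more patient with that node. The sketch below is written in that spirit and is not a faithful reproduction of any published algorithm. The class name `BackoffDetector` and the doubling `growth_factor` are assumptions.

```python
class BackoffDetector:
    """Suspects silent nodes and grows a node's timeout after each false alarm."""

    def __init__(self, initial_timeout, growth_factor=2.0):
        self.initial_timeout = initial_timeout
        self.growth_factor = growth_factor
        self.timeout = {}    # node -> current timeout for that node
        self.last_seen = {}  # node -> time of its last heartbeat
        self.suspects = set()

    def heartbeat(self, node, now):
        self.timeout.setdefault(node, self.initial_timeout)
        if node in self.suspects:
            # False positive: the node was only slow, so wait longer next time.
            self.suspects.discard(node)
            self.timeout[node] *= self.growth_factor
        self.last_seen[node] = now

    def check(self, now):
        """Update and return the sorted list of currently suspected nodes."""
        for node, seen_at in self.last_seen.items():
            if now - seen_at > self.timeout[node]:
                self.suspects.add(node)
        return sorted(self.suspects)
```

If message delays eventually stay below some bound, each node's timeout only needs to grow a finite number of times before false suspicions stop. A node that has really crashed never sends another heartbeat, so it stays suspected.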

Summary

Heartbeats are periodic signals that nodes send to indicate they're alive. The interval between heartbeats, together with the number of missed heartbeats you tolerate, determines how quickly failures are detected: shorter intervals mean faster detection but more network overhead and more false positives. Centralized heartbeats (like ZooKeeper sessions) use a single coordinator and are simple but have a scaling ceiling. Gossip-based heartbeats propagate health information peer-to-peer and are more resilient at large scale. Kubernetes liveness probes, ZooKeeper sessions, and etcd's Raft heartbeats are real-world examples of the same underlying idea: periodic signals that let a distributed system know when a node has gone silent and action needs to be taken.
