Kappa Architecture

Updated June 8, 2026
M
Magic Magnets Team
8 min read

Lambda Architecture solves the batch-vs-stream problem by running both systems in parallel. But it creates a new problem: you write the same business logic twice, in two different systems, and keep them in sync forever.

Jay Kreps, one of the creators of Apache Kafka, proposed a different answer: what if the streaming layer was reliable enough and retained enough historical data that you could eliminate the batch layer entirely?

He called this the Kappa Architecture.

The Core Idea

Kappa Architecture has one principle: everything is a stream.

Historical data and real-time data are not treated differently. There is no batch job, no master dataset in HDFS, no nightly Spark run. There is a single durable log (Kafka) and a single stream processor (Flink or Kafka Streams) that reads from it.

The key insight is that a stream of historical events is just a stream that starts from the past. If Kafka retains all events indefinitely, then "reprocess all of history" is equivalent to "read from offset 0." The same streaming code that processes new events in real-time can also replay historical events to rebuild state from scratch.

The Three Components

The Durable Log

Kafka is the foundation. Every event is appended to a Kafka topic and retained. By default, Kafka retains messages for a fixed period (commonly 7 days). For Kappa Architecture to work, you configure topics with extended or indefinite retention.

Log compaction is an alternative for key-value style data: Kafka keeps only the most recent record per key, discarding older versions. This gives you a compact, full snapshot of current state without keeping years of every intermediate event. A user-profiles topic with compaction keeps the latest profile for each user ID, not every update they've ever made.

The Stream Processor

A single Flink job reads from Kafka and maintains state. This might mean:

  • Counting events per user in a sliding 30-minute window
  • Joining a transactions stream with a user-info stream to enrich events
  • Detecting anomalies in real-time by comparing against historical averages

The output goes to a serving database: Cassandra for high-throughput writes, Redis for low-latency reads, or a database appropriate for your query patterns.

This is the only aggregation codebase. There is no parallel batch system.

Reprocessing: The Critical Feature

What happens when you discover a bug in your aggregation logic? Or when business requirements change and you need to recompute historical metrics with new rules?

In Lambda Architecture, you re-run the batch job. In Kappa Architecture, you replay the Kafka log.

The process:

  1. Deploy a new version of your stream processor (call it v2)
  2. Point v2 at Kafka from offset 0 (the beginning of time for that topic)
  3. v2 reads historical events as fast as possible, rebuilding state into a new serving database
  4. v1 continues serving live traffic during this replay
  5. Once v2 catches up to the present moment, switch clients to the new database
  6. Shut down v1

The replay runs much faster than real-time because Flink isn't waiting for new events to arrive. It reads stored data at whatever speed the network and CPU allow.

algobase.dev
Kappa steady state: events are appended to Kafka and consumed by a single Flink stream processor. Results are written to a serving database. Kafka's long-term retention means historical events are available for replay at any time.
1 / 1

Kappa steady state: single Flink processor reading from Kafka and writing to serving DB

algobase.dev
Kappa reprocessing: when logic changes, a new Flink v2 is launched and reads the full Kafka history from offset 0 as fast as possible. Flink v1 keeps serving live traffic. When v2 catches up to the present, the client is switched to the new serving database. v1 is then shut down.
1 / 1

Kappa reprocessing: v2 replays full Kafka history while v1 continues serving. Clients switch when v2 catches up.

Exactly-Once Semantics

Kappa Architecture depends on the stream processor producing correct results despite failures. If a Flink job crashes and restarts, it must not double-count events or skip them.

Flink's exactly-once semantics work through distributed snapshots: Flink periodically takes a consistent snapshot of all operator state and writes it to durable storage (S3 or HDFS). On recovery, Flink restores from the last successful checkpoint and resumes from the corresponding Kafka offset. The same event is never processed twice, and no event is skipped.

Without exactly-once semantics, a crash-and-restart would produce inaccurate results: events before the crash might be counted again, corrupting all downstream metrics.

Kappa vs. Lambda: When Each Applies

Kappa is simpler when:

  • All your processing logic can be expressed as streaming transformations (stateful aggregations, joins, windowed counts)
  • Your event volume fits within Kafka's retention budget
  • Reprocessing via Kafka replay is fast enough for your recovery time requirements

Lambda is still valid when:

  • You need to run algorithms that don't fit streaming: ML model training, iterative graph algorithms, cross-join operations across the full dataset
  • Your event volume is so large that storing it all in Kafka would cost more than HDFS/S3
  • Regulatory requirements mandate a separate batch reconciliation pass

In practice, most modern teams start with Kappa. The operational simplicity of one codebase is worth the trade-off for the vast majority of use cases. Lambda is a specialization for workloads with genuine batch-only requirements.

Real-World Adoption

LinkedIn runs thousands of Kafka topics and Flink jobs in a Kappa-style architecture. Profile updates, connection events, job applications, and content engagement all flow through Kafka. Stream processors update search indexes, feed recommendation models, and trigger notifications, all from one unified streaming layer.

Uber uses a similar model for real-time pricing, driver matching, and fraud detection. The same event stream that powers the live product also feeds analytics and ML training pipelines.

Limitations

Kafka storage cost: storing months or years of events in Kafka costs more than in object storage like S3. Teams sometimes use tiered storage (Kafka with an S3 backend for older segments) to reduce cost while maintaining replay capability.

Replay time: replaying years of history through a stream processor takes time. If your Kafka topic holds 3 years of events and your processor handles 100,000 events per second, a full replay takes hours. This is acceptable for most reprocessing scenarios but means Kappa recovery is slower than Lambda for large datasets.

Stateful join complexity: joining two streams in Flink requires holding state for events that haven't been matched yet. For large, slow-moving streams, this state can grow large and expensive to checkpoint.

Summary

Kappa Architecture eliminates the batch layer by treating all data as a stream. Kafka serves as the durable, long-term event log. A single stream processor (Flink) reads from it to produce results, replacing both the batch job and the speed layer from Lambda Architecture. When logic changes, a new processor version replays the Kafka log from offset 0 into a new serving database while the old version continues serving traffic. Clients switch when the new version catches up. Exactly-once semantics ensure correct results after failures. Kappa is simpler to operate than Lambda for most workloads, though Lambda remains the right choice when batch-only algorithms are genuinely required.

Streaming Engines

How helpful was this content?

Comments

0/2000

Sign in to join the discussion

Saved on this device only

Sign in to sync progress across devices