Streaming Engines

Updated June 8, 2026
M
Magic Magnets Team
8 min read

A streaming engine is a system that processes data continuously as it arrives, rather than accumulating it and processing it later. Where a batch job asks "process all the data from the last 24 hours," a streaming engine asks "what do I do with this event, right now?"

The canonical use case is fraud detection. When you swipe a credit card, the bank has under 100ms to approve or decline. No batch job helps here. The transaction event must be evaluated the moment it arrives.

The Streaming Model

A streaming engine reads events from a source (usually Kafka), applies a sequence of operators to each event, and writes results to a sink. The data is never fully materialized in one place: it flows through the operator graph as it arrives.

The fundamental difference from batch is unbounded data. A batch job processes a finite dataset (yesterday's logs, last month's transactions) and terminates. A streaming job processes an infinite sequence of events and runs indefinitely.

This creates problems that batch processing doesn't have:

  • You can't "wait for all the data" before computing an aggregate because there is no end
  • Events can arrive out of order (a mobile app buffered events offline, then flushed them all at once)
  • The job must survive worker failures without losing state or reprocessing events incorrectly

Windows

How do you compute "page views in the last 5 minutes" on a continuous stream? You use a window, a bounded time range applied to the unbounded stream.

Tumbling windows: fixed-size, non-overlapping windows. A 5-minute tumbling window produces one result every 5 minutes: [00:00-00:05), [00:05-00:10), [00:10-00:15). Used for periodic summaries.

Sliding windows: fixed-size windows that advance by a smaller step. A 5-minute window sliding every 1 minute produces results at 00:01, 00:02, 00:03... Each window overlaps with the previous ones. Used for rolling averages.

Session windows: variable-size windows that close after a gap of inactivity. If a user doesn't produce any events for 30 minutes, the current session window closes. Used for user session analysis.

Event Time vs. Processing Time

Processing time is the wall clock time when the event arrives at the streaming engine. Simple to implement, but misleading: if events arrive late (a phone reconnected after being offline), they're assigned to the wrong window.

Event time is the timestamp embedded in the event itself, recording when it actually occurred. This is what you almost always want: aggregate events by when they happened, not when they arrived. The streaming engine must handle events that arrive after their window has already closed.

Watermarks

A watermark is the mechanism streaming engines use to handle late events. It's the engine's estimate of how far along event time has progressed.

When the watermark passes the end of a window, the engine considers that window "complete" and emits the result. Events that arrive after the watermark (late data) can be handled in several ways: discarded, merged into the next window, or routed to a side output for separate handling.

Setting the watermark is a trade-off:

  • A tight watermark (e.g., 5 seconds late) closes windows quickly, giving low latency results, but drops events that arrive more than 5 seconds late.
  • A loose watermark (e.g., 10 minutes late) tolerates more late data but delays results by 10 minutes.

State Management

Streaming jobs that go beyond stateless filtering need to maintain state: counts per user, running averages, fraud signals accumulated over a session.

Flink manages state through a configurable state backend:

  • RocksDB (the production default): stores state on local disk. Handles large state that doesn't fit in memory. Slower reads than heap but scales to terabytes per task.
  • Heap-based: stores state in JVM memory. Fast reads, but limited by available RAM and lost on crash (recovered from checkpoint).

State is periodically checkpointed to durable storage (S3 or HDFS). On failure, Flink restores state from the last checkpoint and replays events from the corresponding Kafka offset, achieving exactly-once semantics.

Exactly-Once Semantics

Exactly-once means each event affects the output exactly once, even if the system crashes and restarts.

Without it, a crash could cause two outcomes:

  • At-least-once: events replayed from Kafka get processed again. Counts become inflated.
  • At-most-once: events processed before the crash are lost. Counts are understated.

Flink achieves exactly-once through distributed snapshots (based on the Chandy-Lamport algorithm). It periodically injects checkpoint barriers into the event stream. When all operators have received a barrier, Flink atomically saves the state of every operator and the current Kafka offsets. On recovery, it resets both the state and the offsets together, so no event is counted twice.

algobase.dev
Flink job topology: events flow from Kafka through a filter/transform operator and into a 5-minute tumbling window aggregator. The window operator reads and writes state to RocksDB on local disk. At the end of each window, the result is emitted to a Kafka sink topic and written to Redis. The Kafka source and sink are the integration points with the rest of the system.
1 / 1

Flink job topology: Kafka source, filter/transform, windowed aggregation with RocksDB state, and dual sinks (Kafka and Redis)

The Main Streaming Engines

Apache Flink: the industry standard for stateful stream processing. True event-time processing, exactly-once semantics, RocksDB state backend, rich windowing API, and native support for both streaming and batch on the same engine. Used by Alibaba (thousands of Flink jobs), Uber, Netflix, and most large-scale streaming deployments.

Kafka Streams: a library (not a separate cluster) that runs inside your Java application. Simpler to operate because there's no external cluster to manage. Supports stateful operations, windowing, and stream-stream joins. Appropriate when your processing logic is moderate and you want to avoid the overhead of a separate Flink cluster.

Spark Structured Streaming: Spark's streaming API, which treats a stream as an unbounded table and lets you write SQL or DataFrame code. Familiar to teams already using Spark. Achieves exactly-once semantics. Micro-batch execution (not truly event-by-event) means latency is higher than Flink (typically seconds vs. milliseconds). Good for teams that want one engine for both batch and streaming.

FlinkKafka StreamsSpark Structured Streaming
LatencySub-millisecondSub-millisecondSeconds (micro-batch)
StateRocksDB, large stateLocal RocksDBExecutor memory
Exactly-onceYesYesYes
Cluster neededYes (Flink cluster)No (runs in app)Yes (Spark cluster)
Best forComplex stateful processingModerate complexity, ops simplicityTeams using Spark for batch

Real-World Examples

Visa / Mastercard Fraud Detection

Transaction events publish to Kafka. A Flink job reads each transaction, looks up the cardholder's historical spending pattern (stored in Flink state), and scores the transaction for fraud risk. The result is emitted within 50ms. If the score exceeds a threshold, a second downstream job blocks the transaction and triggers a notification.

Uber Surge Pricing

Driver location and ride request events publish to Kafka every few seconds. A Flink job aggregates supply (available drivers) and demand (active requests) per geographic cell using a sliding 5-minute window. The per-cell surplus/deficit is written to Redis every 10 seconds. The pricing model reads from Redis when a rider requests a trip.

Summary

Streaming engines process data continuously as it arrives. Windows (tumbling, sliding, session) let you compute aggregates over bounded time ranges in an unbounded stream. Watermarks handle late-arriving events by estimating how far event time has advanced. State management through backends like RocksDB lets jobs maintain per-user or per-session context across millions of events. Exactly-once semantics, achieved through distributed checkpointing, ensures correct results after failures. Apache Flink is the leading engine for complex stateful processing. Kafka Streams is simpler to operate for moderate workloads. Spark Structured Streaming suits teams already invested in Spark.

How helpful was this content?

Comments

0/2000

Sign in to join the discussion

Saved on this device only

Sign in to sync progress across devices