Batch vs Stream Processing

Updated June 8, 2026

Magic Magnets Team

Every system that processes data has to answer a fundamental question: when do you process it?

The two answers are: now (stream processing) or later, all at once (batch processing). Each has a different set of trade-offs in latency, cost, complexity, and correctness. The choice shapes your entire data architecture.

Batch Processing: Do It Later, Do It Big

Batch processing collects data over a period of time and then processes it all at once, typically on a schedule. Every night at 2 AM, run the job. Every hour, process the last hour's worth of data. Every week, recompute the entire dataset from scratch.

Think of it like doing laundry. You don't wash each piece of clothing the moment it gets dirty. You let it pile up, then run the washing machine once. It's more efficient per item, even though you wait longer to get clean clothes.

Characteristics:

High latency: data is processed minutes to hours (or days) after it arrives
High throughput: processes enormous volumes in each run, terabytes to petabytes
Bounded datasets: each job runs over a defined time window
Simpler correctness: the data is static when you process it, so aggregations are exact
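As a minimal sketch of the bounded-dataset idea, the plain-Python snippet below runs one "nightly" job over a fixed time window. The function name `run_batch_job` and the sample events are hypothetical, chosen only for illustration. A real job would read from storage rather than from an in-memory list.

```python
from collections import Counter
from datetime import datetime, timedelta


def run_batch_job(events, window_start, window_end):
    """Count events per user inside the half-open window [window_start, window_end)."""
    counts = Counter()
    for user_id, timestamp in events:
        if window_start <= timestamp < window_end:
            counts[user_id] += 1
    return dict(counts)


# Hypothetical (user_id, event_time) pairs that piled up before the job ran.
events = [
    ("alice", datetime(2026, 6, 7, 9, 15)),
    ("bob", datetime(2026, 6, 7, 13, 40)),
    ("alice", datetime(2026, 6, 7, 22, 5)),
    ("alice", datetime(2026, 6, 8, 0, 30)),  # falls into the next day's run
]

day = datetime(2026, 6, 7)
daily_counts = run_batch_job(events, day, day + timedelta(days=1))
print(daily_counts)  # {'alice': 2, 'bob': 1}
```

Because the window is closed before the job starts, the counts are exact: nothing new can arrive for June 7 while the job is running.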

Netflix and Batch Processing

Netflix's recommendation engine historically ran as a batch job. Every night, an enormous Hadoop job would run over the day's watch history for all users, billions of events, and recompute personalized recommendations for each user. The next morning, you'd see your updated recommendations.

This is fine for recommendations. Whether you get your "You might also like..." update at 2 AM or 4 AM doesn't matter to the user experience. The scale of the computation (billions of events, 200+ million users) is easier to handle as a batch job than in real time.

Netflix also uses batch processing for content analytics (how many people watched a show, completion rates, regional breakdowns), billing reconciliation, and A/B test analysis.

The Tools: Apache Spark

Apache Spark is the dominant batch processing engine. It largely displaced Hadoop MapReduce by keeping intermediate data in memory rather than writing it to disk between stages, which can make it 10-100x faster for iterative workloads like ML training.

A Spark job reads data from HDFS, S3, or another storage system, processes it using a DAG (directed acyclic graph) of transformations, and writes results back. The job can process petabytes across thousands of machines.
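To make the "graph of transformations" idea concrete, here is a toy sketch in plain Python. It is not Spark's API: `LazyDataset` and its methods are hypothetical names, and the sketch only models a linear chain of steps rather than a general DAG. What it does show is the core idea that transformations are recorded first and only executed when a result is requested.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Record the step; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Record the step; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Only now is the recorded plan executed, one step feeding the next.
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


# Hypothetical input: watch-session lengths in minutes.
watch_minutes = LazyDataset([12, 95, 3, 48, 130])

long_sessions_seconds = (
    watch_minutes
    .filter(lambda minutes: minutes >= 45)  # keep long sessions
    .map(lambda minutes: minutes * 60)      # convert to seconds
)

result = long_sessions_seconds.collect()
print(result)  # [5700, 2880, 7800]
```

Recording the plan before running it is what lets an engine look at the whole job at once and decide how to split it into stages across machines.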

Other batch tools: Hive (SQL over Hadoop), dbt (SQL transformations over a data warehouse), Airflow (orchestrating batch pipeline schedules), AWS Glue, and Google Cloud Dataflow in batch mode.

Figure: Batch pipeline. Events accumulate in object storage continuously, then a Spark job reads and processes them on a schedule (nightly or hourly). Results land in the data warehouse hours after the events arrived. High throughput, low operational cost, but high latency.

Stream Processing: Do It Now

Stream processing handles data in real time as it arrives. Instead of accumulating a day's worth of events and then processing them, you process each event (or a small micro-batch) within milliseconds to seconds of its arrival.

Characteristics:

Low latency: results within milliseconds to seconds
Unbounded datasets: the stream is conceptually infinite, it never ends
Higher complexity: late-arriving events, out-of-order data, stateful aggregations, and exactly-once semantics are all genuinely hard problems
Approximate results: real-time aggregations (like "events in the last 5 minutes") may use approximate algorithms
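The sketch below illustrates why late and out-of-order events make streaming harder than batch. It counts events in 60-second tumbling windows keyed by event time and uses a simplified watermark to decide when a window is final. The class name, the window size, and the 30-second allowed lateness are all assumptions made for illustration. Production engines such as Flink offer much richer semantics than this.

```python
from collections import defaultdict

WINDOW_SECONDS = 60     # assumed window size
ALLOWED_LATENESS = 30   # assumed: how far behind the newest event we still wait


class TumblingWindowCounter:
    """Counts events per event-time window; a simplified watermark closes windows."""

    def __init__(self):
        self.open_windows = defaultdict(int)  # window_start -> running count
        self.max_event_time = 0
        self.emitted = []                     # finalized (window_start, count) pairs
        self.dropped_late = 0

    def _watermark(self):
        return self.max_event_time - ALLOWED_LATENESS

    def on_event(self, event_time):
        window_start = event_time - event_time % WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= self._watermark():
            # The window was already finalized, so this event is too late to count.
            self.dropped_late += 1
            return
        self.open_windows[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Finalize every window that ends at or before the watermark.
        for start in sorted(self.open_windows):
            if start + WINDOW_SECONDS <= self._watermark():
                self.emitted.append((start, self.open_windows.pop(start)))


counter = TumblingWindowCounter()
# Event times in arrival order: 40 arrives out of order, 10 arrives too late.
for event_time in [5, 20, 50, 65, 40, 100, 10, 130]:
    counter.on_event(event_time)

print(counter.emitted)       # [(0, 4)]
print(counter.dropped_late)  # 1
```

The event at time 40 arrives after the event at time 65 but is still counted, because the first window has not been finalized yet. The event at time 10 arrives after that window was emitted and is dropped, which is one way a streaming result can differ from the exact count a batch job would produce later.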

Fraud Detection: The Classic Stream Processing Use Case

This is the textbook example of where stream processing is non-negotiable. When you swipe your credit card, the bank has milliseconds, not hours, to decide whether to approve the transaction. Running a batch job to detect fraud nightly would mean approving fraudulent transactions all day and catching them the next morning, after the damage is done.

The fraud detection model runs on a stream of transaction events. When your transaction arrives:

It's published to a Kafka topic
A stream processor consumes it immediately
The processor checks it against real-time signals: is this location unusual? Is this merchant category unusual? Has this card been used in two countries in the last hour?
A decision is returned within 100ms

This is only possible with stream processing.
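One of the real-time signals described above (has this card been used in two countries in the last hour?) can be sketched as a small stateful check. This is an illustration under stated assumptions, not a production fraud model: `CountryVelocityCheck` is a hypothetical name, state lives in process memory rather than in a stream processor's state store, and timestamps are assumed to arrive in order for each card.

```python
from collections import defaultdict, deque

ONE_HOUR = 3600  # seconds


class CountryVelocityCheck:
    """Flags a card that is seen in more than one country within one hour."""

    def __init__(self):
        # card_id -> deque of (timestamp, country), oldest first
        self._recent = defaultdict(deque)

    def evaluate(self, card_id, country, timestamp):
        history = self._recent[card_id]
        # Evict entries that are an hour old or older.
        while history and history[0][0] <= timestamp - ONE_HOUR:
            history.popleft()
        suspicious = any(seen != country for _, seen in history)
        history.append((timestamp, country))
        return "REVIEW" if suspicious else "APPROVE"


check = CountryVelocityCheck()
# Hypothetical transaction stream: (card_id, country, unix_seconds)
transactions = [
    ("card-1", "US", 0),
    ("card-1", "US", 600),
    ("card-1", "FR", 1200),  # different country 20 minutes after a US purchase
    ("card-1", "FR", 5000),  # earlier entries are more than an hour old by now
]
decisions = [check.evaluate(*tx) for tx in transactions]
print(decisions)  # ['APPROVE', 'APPROVE', 'REVIEW', 'APPROVE']
```

Each decision depends only on a small amount of per-card state that is updated as events arrive, which is what lets the check answer within a tight latency budget instead of waiting for a nightly scan.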

Other stream processing use cases: live dashboards, real-time alerting, live leaderboards in games, log anomaly detection, clickstream analysis for personalization.

The Tools: Kafka Streams, Apache Flink, Spark Structured Streaming

Apache Kafka is the ubiquitous event streaming platform. It stores and delivers events in real time with high throughput and durability. It's not a processing engine itself. It's the transport layer that everything else connects to.

Kafka Streams is a lightweight processing library that runs inside your application. No separate cluster needed. Good for simpler stream processing tasks: filtering, aggregating, joining streams, enriching events. Widely used because of its operational simplicity.

Apache Flink is the heavy-duty stream processing engine. Built from the ground up for stateful, fault-tolerant stream processing. It handles complex windowed aggregations, out-of-order events, exactly-once semantics, and massive parallelism. Used by Alibaba, Uber, Netflix, and many others for their most demanding real-time data pipelines.

Spark Structured Streaming extends Spark's DataFrame and SQL API to streaming. It treats the stream as an unbounded table and lets you write SQL-like queries over it. It is an easy fit for teams already using Spark, though Flink's streaming latency is typically lower because Structured Streaming processes data in micro-batches by default.
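The "unbounded table" model can be illustrated with a toy incremental query in plain Python. This is not Spark code: `RunningCountQuery` is a hypothetical class that mimics how a grouped count could be kept up to date as each micro-batch appends new rows to the input.

```python
from collections import Counter


class RunningCountQuery:
    """Toy incremental version of: SELECT page, COUNT(*) FROM clicks GROUP BY page."""

    def __init__(self):
        self.result = Counter()

    def on_micro_batch(self, new_rows):
        # Only the newly appended rows are scanned; earlier rows are
        # already folded into self.result.
        self.result.update(row["page"] for row in new_rows)
        return dict(self.result)


query = RunningCountQuery()
after_first = query.on_micro_batch([{"page": "/home"}, {"page": "/pricing"}])
after_second = query.on_micro_batch([{"page": "/home"}])

print(after_first)   # {'/home': 1, '/pricing': 1}
print(after_second)  # {'/home': 2, '/pricing': 1}
```

The query reads like a batch aggregation, but its result is a table that keeps changing as the input grows.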

Figure: Stream pipeline. Each event is published to Kafka and consumed by Flink within milliseconds. The result is written to a low-latency store like Redis immediately. Total end-to-end latency is under a second. Always-on infrastructure costs more, but time-sensitive workloads like fraud detection require it.

The Trade-offs Side by Side

Dimension	Batch	Stream
Latency	Minutes to hours	Milliseconds to seconds
Throughput	Very high (optimized for large volumes)	High (but per-event overhead adds up)
Complexity	Simpler (static data, exact aggregations)	Higher (state management, ordering, watermarks)
Cost	Lower per-event at scale	Higher (always-on infrastructure)
Correctness	Exact (data is complete before processing)	Often approximate (windows, late data)
Use cases	Analytics, reporting, ML training, ETL	Fraud detection, real-time monitoring, live dashboards

The Lambda and Kappa Architectures

As teams built systems that needed both low latency and complete accuracy, two patterns emerged:

Lambda Architecture: run both a batch layer and a stream layer simultaneously. The stream layer handles real-time queries with approximate results. The batch layer periodically recomputes accurate results over the complete dataset. The two sets of results are merged when a query is served. The downside is that you maintain two codebases doing similar things, which is complex and expensive.

Kappa Architecture: eliminate the batch layer entirely. Run everything through the stream processor. When you need to recompute, replay historical events through the stream processor. Simpler to maintain, but requires the stream processor to handle batch-style historical replay efficiently.

Modern stream engines like Flink handle both live streams and large historical replays well, making the Kappa architecture increasingly popular.
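The two patterns can be contrasted in a few lines of plain Python. This is a rough sketch with hypothetical names and data: `merge_views` stands in for a Lambda-style serving step that combines a batch view with a stream view, and `run` stands in for a Kappa-style processor whose single code path serves both live traffic and full replays of the log.

```python
def merge_views(batch_view, speed_view):
    """Lambda-style serving: exact batch totals plus counts seen since the last batch run."""
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


def process(state, event):
    """Kappa-style: the one processing function used for live events and replays."""
    user, amount = event
    state[user] = state.get(user, 0) + amount


def run(log, from_offset=0, state=None):
    """Apply process() to every event in the log starting at from_offset."""
    state = {} if state is None else state
    for event in log[from_offset:]:
        process(state, event)
    return state


# Lambda: two layers computed separately, merged at query time.
batch_view = {"show-a": 1000, "show-b": 250}  # exact, as of the last batch run
speed_view = {"show-a": 12, "show-c": 3}      # events since the batch cutoff
serving_view = merge_views(batch_view, speed_view)
print(serving_view)  # {'show-a': 1012, 'show-b': 250, 'show-c': 3}

# Kappa: one code path; recomputing means replaying the log from the start.
event_log = [("alice", 10), ("bob", 5), ("alice", 7)]
incremental = run(event_log[:2])                                # processed live
incremental = run(event_log, from_offset=2, state=incremental)  # next event arrives
rebuilt = run(event_log)                                        # full replay
print(incremental == rebuilt)  # True
```

In the Lambda half, the batch and speed views come from two separate pipelines that must agree on what a count means. In the Kappa half there is only `process`, so a logic change is rolled out by editing that one function and replaying the log.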

Choosing the Right Approach

Use batch processing when:

Latency of hours is acceptable
You need exact aggregations over large, complete datasets
The workload is periodic (reports, billing, ML training)
Cost efficiency at high volume is a priority

Use stream processing when:

Latency must be seconds or less (fraud detection, alerting)
The data is inherently time-sensitive (live dashboards, real-time personalization)
You need to react to events as they happen

Use both when:

You need real-time responses AND accurate historical reporting
Your analytics team needs batch SQL over historical data AND your ops team needs live dashboards

Summary

Batch processing is the pragmatic default when latency isn't critical. It's simpler, cheaper, and handles enormous volumes efficiently. Stream processing is the right choice when you need to react to events in real time: fraud detection, live monitoring, and real-time personalization all require it. Apache Spark dominates batch workloads. Kafka is the backbone of most streaming architectures, with Flink for complex stateful stream processing and Kafka Streams for simpler use cases. Most mature data platforms end up using both: batch for historical accuracy and stream for real-time responsiveness.
