Batch vs Stream Processing
Updated June 8, 2026
Every system that processes data has to answer a fundamental question: when do you process it?
The two answers are: now (stream processing) or later, all at once (batch processing). Each has a different set of trade-offs in latency, cost, complexity, and correctness. The choice shapes your entire data architecture.
Batch Processing: Do It Later, Do It Big
Batch processing collects data over a period of time and then processes it all at once, typically on a schedule. Every night at 2 AM, run the job. Every hour, process the last hour's worth of data. Every week, recompute the entire dataset from scratch.
Think of it like doing laundry. You don't wash each piece of clothing the moment it gets dirty. You let it pile up, then run the washing machine once. It's more efficient per item, even though you wait longer to get clean clothes.
Characteristics:
- High latency: data processed minutes to hours (or days) after it arrives
- High throughput: processes enormous volumes in each run, terabytes to petabytes
- Bounded datasets: each job runs over a defined time window
- Simpler correctness: the data is static when you process it, so aggregations are exact
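To make the "bounded dataset" idea concrete, here is a minimal sketch of a daily batch job in plain Python. The event list, the user names, and the `run_daily_batch` function are all hypothetical; a real job would read from object storage rather than an in-memory list.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical event log: (timestamp, user_id) pairs that piled up in storage.
EVENTS = [
    (datetime(2026, 6, 7, 23, 50), "alice"),
    (datetime(2026, 6, 8, 0, 5), "alice"),
    (datetime(2026, 6, 8, 9, 30), "bob"),
    (datetime(2026, 6, 8, 17, 45), "alice"),
    (datetime(2026, 6, 9, 0, 10), "bob"),
]

def run_daily_batch(events, day_start):
    """Count events per user for the closed window [day_start, day_start + 1 day)."""
    day_end = day_start + timedelta(days=1)
    in_window = [user for ts, user in events if day_start <= ts < day_end]
    return Counter(in_window)

# The job runs after the day is over, so the window's data no longer changes.
daily_counts = run_daily_batch(EVENTS, datetime(2026, 6, 8))
print(dict(daily_counts))  # {'alice': 2, 'bob': 1}
```

Because the window is fixed before the job starts, rerunning it over the same input gives the same counts, which is what makes batch aggregations easy to reason about.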
Netflix and Batch Processing
Netflix's recommendation engine historically ran as a batch job. Every night, an enormous Hadoop job would run over the day's watch history for all users (billions of events) and recompute personalized recommendations for each user. The next morning, you'd see your updated recommendations.
This works well for recommendations. Whether your "You might also like..." update lands at 2 AM or 4 AM makes no difference to the user experience, and a computation at this scale (billions of events, 200+ million users) is easier to run as a batch job than in real time.
Netflix also uses batch processing for content analytics (how many people watched a show, completion rates, regional breakdowns), billing reconciliation, and A/B test analysis.
The Tools: Apache Spark
Apache Spark is the dominant batch processing engine. It largely displaced Hadoop MapReduce by keeping intermediate data in memory where possible rather than writing it to disk between stages. That can make it 10-100x faster for iterative workloads like ML training, which reread the same data many times.
A Spark job reads data from HDFS, S3, or another storage system, processes it using a DAG (directed acyclic graph) of transformations, and writes results back. The job can process petabytes across thousands of machines.
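The idea of recording a graph of transformations first and executing it later can be illustrated with a toy model. This is not Spark's API: the `Dataset` class below is a hypothetical stand-in, it runs on one machine, and it only supports a linear chain of steps (the simplest possible DAG).

```python
class Dataset:
    """Toy stand-in for a distributed dataset: records a plan, runs it lazily."""

    def __init__(self, source, plan=()):
        self._source = source  # any iterable of records
        self._plan = plan      # tuple of (kind, function) steps, in order

    def map(self, fn):
        # Transformation: returns a new Dataset with one more planned step.
        return Dataset(self._source, self._plan + (("map", fn),))

    def filter(self, fn):
        return Dataset(self._source, self._plan + (("filter", fn),))

    def collect(self):
        """Action: only now is the recorded plan actually executed."""
        out = []
        for record in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                out.append(record)
        return out

# Hypothetical input: session lengths in minutes.
watch_minutes = Dataset([12, 95, 3, 48, 130])
long_sessions_seconds = (
    watch_minutes
    .filter(lambda m: m >= 45)   # nothing runs yet
    .map(lambda m: m * 60)       # still nothing runs
)
result = long_sessions_seconds.collect()
print(result)  # [5700, 2880, 7800]
```

Deferring execution until an action is requested is what lets an engine see the whole plan before running it, so it can decide how to split the work across machines.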
Other batch tools: Hive (SQL over Hadoop), dbt (SQL transformations over a data warehouse), Airflow (orchestrating batch pipeline schedules), AWS Glue, Google Dataflow in batch mode.
Batch pipeline: events accumulate in object storage, then Spark reads and processes them on a schedule
Stream Processing: Do It Now
Stream processing handles data in real time, as it arrives. Instead of accumulating a day's worth of events and then processing them, you process each event (or a small micro-batch) within milliseconds of its arrival.
Characteristics:
- Low latency: results within milliseconds to seconds
- Unbounded datasets: the stream is conceptually infinite, it never ends
- Higher complexity: late-arriving events, out-of-order data, stateful aggregations, and exactly-once semantics are all genuinely hard problems
- Approximate results: real-time aggregations (like "events in the last 5 minutes") may use approximate algorithms, or may miss events that arrive after a window has closed
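The interplay between out-of-order events, late data, and approximate results is easier to see in code. Below is a minimal single-process sketch of event-time tumbling windows with a watermark. The window size, the allowed lateness, and the `TumblingWindowCounter` class are assumptions for illustration, not any engine's actual implementation.

```python
from collections import defaultdict

WINDOW_SECONDS = 60    # tumbling window size (assumed)
ALLOWED_LATENESS = 30  # how far the watermark trails the newest event (assumed)

class TumblingWindowCounter:
    def __init__(self):
        self.open_windows = defaultdict(int)  # window start -> running count
        self.watermark = 0   # newest event time seen, minus allowed lateness
        self.emitted = []    # finalized (window_start, count) results
        self.dropped = 0     # events that arrived after their window closed

    def on_event(self, event_time):
        window_start = event_time - event_time % WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= self.watermark:
            self.dropped += 1  # too late: that window was already finalized
            return
        self.open_windows[window_start] += 1
        self.watermark = max(self.watermark, event_time - ALLOWED_LATENESS)
        # Finalize every window whose end is at or before the watermark.
        for start in sorted(self.open_windows):
            if start + WINDOW_SECONDS <= self.watermark:
                self.emitted.append((start, self.open_windows.pop(start)))

counter = TumblingWindowCounter()
# Event times in arrival order: 50 is out of order but tolerated, 40 is too late.
for event_time in [5, 20, 70, 50, 95, 130, 40]:
    counter.on_event(event_time)

print(counter.emitted)             # [(0, 3)]
print(counter.dropped)             # 1
print(dict(counter.open_windows))  # {60: 2, 120: 1}
```

The emitted count for the first window is 3 even though four events belonged to it. Waiting longer would catch more late events but delay every result, which is the latency versus completeness trade-off a batch job never has to make.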
Fraud Detection: The Classic Stream Processing Use Case
This is the textbook example of where stream processing is non-negotiable. When you swipe your credit card, the bank has milliseconds, not hours, to decide whether to approve the transaction. Running a batch job to detect fraud nightly would mean approving fraudulent transactions all day and catching them the next morning, after the damage is done.
The fraud detection model runs on a stream of transaction events. When your transaction arrives:
- It's published to a Kafka topic
- A stream processor consumes it immediately
- The processor checks it against real-time signals: is this location unusual? Is this merchant category unusual? Has this card been used in two countries in the last hour?
- A decision is returned within 100ms
A nightly batch job cannot meet that deadline. Only processing each event as it arrives can.
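One of the signals above, "has this card been used in two countries in the last hour?", can be sketched as a small piece of per-card state. This is an illustrative toy, not a real fraud model: the `CountryVelocityCheck` class, the card IDs, and the decision labels are hypothetical, a plain list stands in for the Kafka topic, and events are assumed to arrive in timestamp order per card.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # "the last hour" from the text

class CountryVelocityCheck:
    """Flags a card seen in more than one country within the sliding window."""

    def __init__(self):
        self.recent = defaultdict(deque)  # card_id -> deque of (timestamp, country)

    def evaluate(self, card_id, timestamp, country):
        history = self.recent[card_id]
        # Evict sightings that have fallen out of the one-hour window.
        while history and history[0][0] <= timestamp - WINDOW_SECONDS:
            history.popleft()
        suspicious = any(seen != country for _, seen in history)
        history.append((timestamp, country))
        return "review" if suspicious else "approve"

# Stand-in for a stream of transaction events: (card_id, unix_seconds, country).
transactions = [
    ("card-1", 0, "US"),
    ("card-1", 600, "US"),
    ("card-1", 1200, "FR"),   # second country within the hour
    ("card-2", 1300, "FR"),   # different card, no history
    ("card-1", 5000, "FR"),   # earlier sightings have expired by now
]

checker = CountryVelocityCheck()
decisions = [checker.evaluate(*tx) for tx in transactions]
print(decisions)  # ['approve', 'approve', 'review', 'approve', 'approve']
```

Each decision depends only on the current event plus a bounded amount of state, which is what makes a response within a tight latency budget feasible.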
Other stream processing use cases: live dashboards, real-time alerting, live leaderboards in games, log anomaly detection, clickstream analysis for personalization.
The Tools: Kafka Streams, Apache Flink, Spark Structured Streaming
Apache Kafka is the ubiquitous event streaming platform. It stores and delivers events in real time with high throughput and durability. It's not a processing engine itself but the transport layer that the processing engines read from and write to.
Kafka Streams is a lightweight processing library that runs inside your application. No separate cluster needed. Good for simpler stream processing tasks: filtering, aggregating, joining streams, enriching events. Widely used because of its operational simplicity.
Apache Flink is the heavy-duty stream processing engine. Built from the ground up for stateful, fault-tolerant stream processing. It handles complex windowed aggregations, out-of-order events, exactly-once semantics, and massive parallelism. Used by Alibaba, Uber, Netflix, and many others for their most demanding real-time data pipelines.
Spark Structured Streaming brings Spark's batch API to streaming. It treats the stream as an unbounded table and lets you write SQL-like queries over it. Easier for teams already using Spark, though Flink's streaming latency is typically lower.
Stream pipeline: events go through Kafka and Flink within milliseconds, results land in Redis immediately
The Trade-offs Side by Side
| Dimension | Batch | Stream |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (optimized for large volumes) | High (but per-event overhead adds up) |
| Complexity | Simpler (static data, exact aggregations) | Higher (state management, ordering, watermarks) |
| Cost | Lower per-event at scale | Higher (always-on infrastructure) |
| Correctness | Exact (data is complete before processing) | Often approximate (windows, late data) |
| Use cases | Analytics, reporting, ML training, ETL | Fraud detection, real-time monitoring, live dashboards |
The Lambda and Kappa Architectures
As teams built systems that needed both low latency and complete accuracy, two patterns emerged:
Lambda Architecture: run both a batch layer and a stream layer simultaneously. The stream layer handles real-time queries with approximate results. The batch layer periodically recomputes accurate results over the complete dataset. Results are merged. The downside is that you maintain two codebases doing similar things, which is complex and expensive.
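The merge step in a Lambda setup can be sketched in a few lines. The dictionaries and function names below are hypothetical, and the sketch glosses over a real difficulty: events that arrive while a batch run is in progress have to be accounted for in exactly one of the two views.

```python
# Batch view: exact counts from the most recent batch run (hypothetical values).
batch_view = {"page_a": 5200, "page_b": 310}

# Speed view: counts the stream layer has accumulated since that batch run.
speed_view = {}

def on_stream_event(page):
    """Stream layer: update the real-time delta as each event arrives."""
    speed_view[page] = speed_view.get(page, 0) + 1

def query(page):
    """Serving layer: exact batch result plus the not-yet-batched delta."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

def on_batch_complete(new_batch_view):
    """Batch layer finished a recompute that now covers the streamed events."""
    batch_view.clear()
    batch_view.update(new_batch_view)
    speed_view.clear()

for page in ["page_a", "page_c", "page_a"]:
    on_stream_event(page)

print(query("page_a"), query("page_b"), query("page_c"))  # 5202 310 1
```

Note that the counting logic effectively exists twice, once in whatever produced `batch_view` and once in `on_stream_event`. That duplication is the maintenance cost the paragraph above describes.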
Kappa Architecture: eliminate the batch layer entirely. Run everything through the stream processor. When you need to recompute, replay historical events through the stream processor. Simpler to maintain, but requires the stream processor to handle batch-style historical replay efficiently.
Modern stream engines like Flink can handle both live streams and large historical replays well, making the Kappa architecture increasingly popular.
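The core of the Kappa idea is that one processing function serves both the live path and reprocessing. Here is a minimal sketch, with a Python list standing in for a retained event log; the function name, its `min_amount` parameter, and the data are assumptions for illustration.

```python
def total_by_user(events, state=None, min_amount=0):
    """The single processing function, used for live and replayed events alike."""
    state = {} if state is None else state
    for user, amount in events:
        if amount >= min_amount:
            state[user] = state.get(user, 0) + amount
    return state

# Append-only log standing in for a retained event stream (hypothetical data).
log = [("alice", 10), ("bob", 5), ("alice", 7)]

# Live path: events are processed one at a time as they are appended.
live_state = {}
for event in log:
    total_by_user([event], live_state)

# Reprocessing path: replay the log from the beginning through the same code.
replayed_state = total_by_user(log)

# Changed business logic: replay again with the new rule, no separate batch job.
large_only_state = total_by_user(log, min_amount=6)

print(live_state)        # {'alice': 17, 'bob': 5}
print(replayed_state)    # {'alice': 17, 'bob': 5}
print(large_only_state)  # {'alice': 17}
```

The catch is that the replay has to chew through the full history, so this only works well if the log is retained long enough and the processor can replay it quickly.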
Choosing the Right Approach
Use batch processing when:
- Latency of hours is acceptable
- You need exact aggregations over large, complete datasets
- The workload is periodic (reports, billing, ML training)
- Cost efficiency at high volume is a priority
Use stream processing when:
- Latency must be seconds or less (fraud detection, alerting)
- The data is inherently time-sensitive (live dashboards, real-time personalization)
- You need to react to events as they happen
Use both when:
- You need real-time responses AND accurate historical reporting
- Your analytics team needs batch SQL over historical data AND your ops team needs live dashboards
Summary
Batch processing is the pragmatic default when latency isn't critical. It's simpler, cheaper, and handles enormous volumes efficiently. Stream processing is the right choice when you need to react to events in real time: fraud detection, live monitoring, and real-time personalization all require it. Apache Spark dominates batch workloads. Kafka is the backbone of most streaming architectures, with Flink for complex stateful stream processing and Kafka Streams for simpler use cases. Most mature data platforms end up using both: batch for historical accuracy and stream for real-time responsiveness.