MapReduce

Updated June 8, 2026

Magic Magnets Team

8 min read

When big data started growing in the early 2000s, companies hit a wall. A single computer couldn't process billions of records fast enough. Counting words across a million web pages would take months on one server. You needed a way to split the work across thousands of cheap machines and collect the results.

MapReduce was Google's answer, published in a 2004 paper. It became the foundation of Apache Hadoop and shaped how the industry thinks about distributed data processing.

The Core Idea

MapReduce breaks the problem into two programmer-defined functions: a map function and a reduce function. The framework handles everything else: splitting the data, distributing work, shuffling intermediate results, and recovering from failures.

Here's the classic example: counting word frequencies across a billion documents.

Map: For each document, emit one key-value pair per word.

Input:  "the quick brown fox"
Output: ("the", 1), ("quick", 1), ("brown", 1), ("fox", 1)

Reduce: For each unique key, sum all the values.

Input:  ("the", [1, 1, 1, ...])   ← all occurrences of "the" from all documents
Output: ("the", 948573)

Between map and reduce, the framework performs a shuffle and sort: it collects every pair with the same key from every mapper and routes them all to one reducer. This is the part you don't write but it's doing the most work.

The Three Phases in Detail

1. Map

The input data sits in HDFS (Hadoop Distributed File System). HDFS splits large files into 128 MB blocks and stores each block on multiple nodes for redundancy.

The framework assigns one mapper per input split. Each mapper reads its chunk, runs your map() function on each record, and writes intermediate key-value pairs to its local disk. Map tasks run in parallel across all nodes. A job processing 1 TB of data with 128 MB splits launches ~8,000 mapper tasks.

2. Shuffle and Sort

Once mappers finish, the framework moves intermediate data across the network. Every reducer pulls the key-value pairs for its assigned set of keys from all mappers. The framework sorts the pairs by key, so the reducer receives them in order: all ("apple", 1) pairs before all ("the", 1) pairs.

This phase is network-intensive and often the bottleneck. Large jobs can move terabytes across the cluster during shuffle.

3. Reduce

Each reducer receives a sorted list of values for each key it owns. It runs your reduce() function once per unique key. The output goes directly to HDFS.

Reducers can also run in parallel. You choose the number. More reducers means smaller output files and faster completion if your reduce function is the bottleneck.

algobase.dev

MapReduce pipeline: HDFS splits the input across mapper workers, each emitting key-value pairs. The shuffle phase groups all pairs with the same key and sends them to one reducer. Each reducer aggregates its group and writes the final results back to HDFS.

1 / 1

Full MapReduce pipeline: HDFS splits input across mappers, shuffle groups by key, reducers aggregate and write output

Fault Tolerance

MapReduce was designed for unreliable hardware. Nodes fail. When they do:

Failed mapper: the framework re-runs the task on another node. The input data is still in HDFS.
Failed reducer: the framework re-runs the task. It re-fetches intermediate data from the mappers.
Speculative execution: if a task is running slower than its peers, the framework launches a duplicate on another node and takes whichever finishes first. This addresses "straggler" nodes without killing the slow one.

This fault tolerance is what made MapReduce practical at scale. Google was running it on commodity hardware with failure rates of several machines per day.

Real-World Use Cases

Google Search Indexing

The original MapReduce paper describes Google's search index build. Crawlers produced billions of web pages. MapReduce extracted links and keywords (map), then aggregated them to compute PageRank and build the inverted index that powers search results (reduce).

The New York Times Archive

The New York Times converted 11 million articles from 1851 to 1980 into a digital format. The image processing was distributed across hundreds of virtual machines using Hadoop MapReduce on AWS. A task that would have taken months finished in under 24 hours.

Log Analysis

Web servers generate terabytes of access logs. MapReduce jobs were widely used to compute aggregates: page views per URL, error rates per endpoint, unique visitors per day. Each log line is a record; the job emits (url, 1) pairs from the mapper and sums them in the reducer.

Why Spark Replaced It

MapReduce writes intermediate results to disk after every stage. For a multi-stage job (like ML training, which may iterate over the same data hundreds of times), this is punishing. Every iteration means reading from disk, writing to disk, shuffling over the network, writing to disk again.

Apache Spark keeps intermediate data in memory (RAM). For iterative workloads, this makes it 10-100x faster. Spark also provides a richer API: dataframes, SQL, machine learning libraries, and graph processing, all on the same engine.

Writing raw Hadoop MapReduce jobs is rare today. But the MapReduce model is still foundational. Spark, Flink, and most distributed query engines implement the same logical pipeline: partition the data, apply a function in parallel, shuffle by key, aggregate. The implementation changed; the model didn't.

Summary

MapReduce made distributed batch processing practical by splitting large datasets across many machines, applying a user-defined map function in parallel, shuffling intermediate results by key, and running a user-defined reduce function on each group. Fault tolerance through task re-execution and speculative execution made it reliable on commodity hardware. Spark replaced it for most use cases by keeping intermediate data in memory, but the map-shuffle-reduce model remains the foundation of modern distributed data processing.

ETL Pipelines

How helpful was this content?

Comments

0/2000

Saved on this device only