Distributed File Systems
Updated June 3, 2026Imagine you've written a book that is so incredibly large that no single bookshelf in the world can hold it. To store it, you have to rip out the pages, bundle them into chapters, and store different chapters in different libraries across the country.
If someone wants to read your book, they can't just go to one library. They need a master index that tells them exactly which library holds Chapter 1, which holds Chapter 2, and so on.
This is the exact problem a Distributed File System (DFS) solves for data.
The Core Concept
When data grows beyond the capacity of a single hard drive (or even a single high-end server), you have no choice but to distribute it across multiple machines. A Distributed File System allows multiple servers (often hundreds or thousands) to act like one giant, unified file system.
To the user or application, it just looks like a regular folder on their computer. But under the hood, the files are chopped into blocks and scattered across a fleet of commodity servers.
Anatomy of a DFS
DFS read path — Name Node metadata lookup, then direct chunk reads
Most distributed file systems (like Hadoop HDFS or Google File System) use a similar architecture:
- The Master Node: This is the librarian. It doesn't hold the actual file data. Instead, it holds the metadata—the directory tree and the mapping of which chunks of data live on which servers.
- The Chunk Servers: These are the grunts. They store the actual pieces of the files.
When a client wants to read a file, it first asks the Master Node, "Where are the chunks for log_data.csv?" The Master Node replies with a list of IP addresses. The client then talks directly to those Chunk Servers to get the data.
What is the primary role of the Master Node in a distributed file system?
A client reading a file from a DFS must always route all data transfers through the Master Node.
Real-World Examples
- Google File System (GFS): Google created GFS to store the massive amounts of web crawling data needed for their search engine. They relied on cheap, easily replaceable hardware, relying on software to handle the inevitable hardware failures.
- Hadoop Distributed File System (HDFS): Inspired by GFS, HDFS became the open-source standard for big data. Companies like Yahoo and Facebook used HDFS extensively to store massive data lakes for analytics.
Which of the following best describes why GFS and HDFS are designed to tolerate hardware failures gracefully?
Dealing with Failures
Fault tolerance — 3x chunk replication across different racks
If your data is spread across 1,000 cheap servers, some of those servers will crash. By default, systems like HDFS replicate every chunk of data three times across different servers.
[!NOTE] This replication ensures high availability and fault tolerance, but it means a 1TB file actually consumes 3TB of physical disk space.
HDFS replicates every chunk 3 times by default. What is the direct storage cost of storing a 1 TB file with this default setting?
Trade-offs and Workloads
Distributed file systems are heavily optimized for throughput (reading massive amounts of data quickly) rather than latency (reading small amounts of data instantly). They are also optimized for append-only workloads.
Distributed file systems like HDFS are optimized for low-latency random reads of small records.
Summary
- A DFS allows hundreds of servers to act as one giant storage drive.
- Files are broken into chunks and distributed across "Chunk Servers."
- A "Master Node" keeps track of where all the chunks live.
- Perfect for Big Data and analytics; terrible for low-latency database storage.
How helpful was this content?
Comments
Sign in to join the discussion
Saved on this device only
Sign in to sync progress across devices