Data Structures for Scale: Introduction
Updated June 3, 2026You learned data structures in school: arrays, hash maps, trees. They're precise. A Set tells you exactly whether an element is in it. A Map stores exactly the value you put in. A sorted list gives you exact counts.
At a billion users, "exact" becomes the enemy.
Why Regular Data Structures Break at Scale
Let's say you're building the "unique visitors" counter for a website that gets 10 billion page views a day from 2 billion unique users. The naive approach: store every user ID in a HashSet. At 8 bytes per user ID, 2 billion users costs 16 GB of RAM. Per server. And you probably have hundreds of metrics like this running simultaneously.
Or consider spam filtering at Gmail's scale. You want to check if an email address has been flagged as spam before. A simple hash set of flagged addresses? Gmail processes 1.8 billion active users. A hash set exact-matching approach would require gigabytes of memory per lookup server, and you need many lookup servers for redundancy.
Or imagine a CDN like Cloudflare trying to check whether a URL is in its cache before doing a full lookup. Full cache lookup is expensive. You don't want to hit the backing store on every request just to find out "nope, not cached."
The pattern is always the same: you have a massive dataset, you need to query it constantly, and you can't afford the memory or latency of exact answers.
A website stores every unique visitor ID in a HashSet to count daily uniques. At 2 billion users (8 bytes each), approximately how much RAM does this require per server?
Enter Probabilistic Data Structures
Probabilistic data structures make a trade: give up a little accuracy, gain a lot of efficiency. Instead of telling you exactly how many unique users you had, they tell you "approximately 2.1 billion, ±2%." Instead of telling you definitely whether an element is in a set, they say "definitely NOT in the set" or "probably in the set."
That sounds risky. But think about the use cases:
- Counting unique visitors: Do you really care if it's 2,000,000,000 vs 2,001,432,789? The product decision is the same either way.
- Spam checking: A 1% false positive rate means 1% of non-spam gets flagged. That's acceptable if it means catching 99.9% of actual spam.
- Cache existence check: A false positive (thinking something is cached when it isn't) just means you do one unnecessary expensive lookup. No data is lost.
The tradeoff is deliberate. And in the right context, it's an excellent one.
Probabilistic data structures can return false negatives — telling you an element is NOT in the set when it actually is.
The Three Structures You Need to Know
The chapters that follow go deep on three probabilistic data structures that show up constantly in real production systems:
Bloom Filters: "Is this element in the set?"
A Bloom filter answers membership queries in O(1) time and constant space, regardless of how many elements are in the set. It has no false negatives (if it says "not in the set," it's definitely not in the set) but can have false positives (if it says "in the set," it might be wrong).
Cassandra uses Bloom filters to avoid disk reads. Chrome uses one for safe browsing. BigTable uses one to skip SSTable lookups. This is everywhere.
Which probabilistic data structure is best suited for answering "how many distinct users visited today?"
HyperLogLog: "How many distinct elements have I seen?"
HyperLogLog estimates the cardinality (number of unique elements) of a dataset using just a few kilobytes of memory, regardless of how many elements you've seen. The error rate is typically less than 1%.
Redis has HyperLogLog built in. Google used it to count distinct Google searches. Every analytics system doing "daily active users" at scale either uses HyperLogLog or something like it.
A CDN uses a probabilistic structure to check if a URL is cached before hitting the backing store. A false positive from this check means the CDN incorrectly serves stale data to the user.
Count-Min Sketch: "How many times have I seen this element?"
Count-Min Sketch is like a frequency table that uses sublinear space. You can ask "how many times has user X appeared in my stream?" and get an approximate answer using a fraction of the memory a full frequency map would require. The answer always over-counts, never under-counts.
Twitter uses Count-Min Sketch for trending topics. It's used in network traffic analysis, heavy hitter detection, and anywhere you need approximate frequency counts over a stream.
Count-Min Sketch is used for approximate frequency counts over a stream. How does it handle errors?
A Mental Model for All Three
Think of it this way: all three structures use hash functions to compress information, deliberately losing some precision in the process. They're one-way: you can ask summary questions, but you can't reconstruct the original data. That's fine. The summary questions are the ones you actually want to answer.
| Structure | Question It Answers | Error Type | Memory |
|---|---|---|---|
| Bloom Filter | Is X in the set? | False positives | Constant |
| HyperLogLog | How many distinct elements? | ±small % | ~12 KB |
| Count-Min Sketch | How often does X appear? | Over-counts | Sublinear |
All three probabilistic data structures discussed — Bloom Filter, HyperLogLog, and Count-Min Sketch — use hash functions to compress information, deliberately losing some precision.
Summary
At billion-scale, exact data structures often become impractical. The memory requirements are simply too large, and the latency requirements are too tight. Probabilistic data structures solve this by trading a small, bounded amount of accuracy for dramatic gains in space and speed. Bloom filters, HyperLogLog, and Count-Min Sketch are the core three, and they appear in production systems at Google, Amazon, Redis, Cassandra, and essentially every system you've heard of. Understanding how and why they work makes you a dramatically more effective distributed systems engineer.
Saved on this device only
Sign in to sync progress across devices