HyperLogLog
Updated June 3, 2026Imagine you run YouTube, and you want to display the number of unique viewers on the "Gangnam Style" video.
If you just increment a counter every time the video is played, you get total views. But to count unique views, you have to remember the user ID of every single person who watched it, so you don't count them twice.
If a billion unique people watch the video, storing a billion user IDs in a Hash Set would take gigabytes of RAM. Now multiply that by billions of videos on YouTube. It's an impossible amount of memory just to track a view count.
To solve this, system designers use an algorithm with a very silly name: HyperLogLog.
Why is storing every unique user ID in a Hash Set impractical for platforms like YouTube?
The Coin Flipping Game
To understand how it works, imagine you are locked in a room. Outside the room, your friend flips a coin. They flip it until they get "Heads", and then they yell the number of times they flipped it to you.
- "I flipped it 1 time!"
- "I flipped it 3 times!"
- "I flipped it 2 times!"
If your friend yells, "I flipped it 10 times!", you know they had to flip "Tails" 9 times in a row.
Statistically, getting 9 Tails in a row is incredibly rare (a 1 in 512 chance). So, just by hearing "10 times," you can confidently guess that your friend has been flipping the coin for a very, very long time.
You don't need to count every single flip. You just need to remember the longest streak of Tails they ever got, and you can mathematically estimate the total number of flips.
In the coin-flipping analogy, what piece of information does HyperLogLog actually store?
Hashing is the Coin Flip
Computers don't flip coins, but they generate Hashes.
When a user watches a YouTube video, HyperLogLog takes their User ID and hashes it into a long string of 0s and 1s (like 01001101...).
Instead of looking for streaks of "Tails," HyperLogLog looks at the hash and counts the streak of leading zeros (e.g., 0000101... has 4 leading zeros).
Just like the coin game, seeing a hash with a huge streak of leading zeros is statistically rare.
- For every unique User ID, HyperLogLog hashes it and counts the leading zeros.
- It simply remembers the highest number of leading zeros it has ever seen.
- Using that highest number, it can mathematically estimate how many unique users must have passed through the system to generate a hash that rare!
HyperLogLog uses leading zeros in a hash value the same way the coin-flipping game uses a streak of Tails.
The Catch: It's an Estimate
HyperLogLog doesn't give you the exact number. It gives you an estimate. Usually, it is accurate within 2%.
For YouTube, if a video has 10,000,000 unique views, showing 10,100,000 is perfectly fine. The users won't care about a 1% margin of error, but the engineers care deeply that they just tracked 10 million users using only 12 Kilobytes of memory.
Let that sink in: HyperLogLog can estimate the unique count of billions of items using just a few kilobytes of RAM.
What is the typical error margin of a HyperLogLog estimate?
HyperLogLog is only useful when exact counts are not required.
Real-World Examples
- Reddit & YouTube: Estimating unique views on posts and videos.
- Google Analytics: Quickly estimating the number of unique daily visitors to a website without scanning a massive database table.
- Redis: Has a built-in HyperLogLog data structure (
PFADDandPFCOUNTcommands) specifically for tracking unique metrics in real-time.
Which Redis commands are used to work with its built-in HyperLogLog data structure?
Summary
HyperLogLog is an incredibly memory-efficient algorithm used for the "Count-Distinct Problem." Instead of storing every unique item to count them, it hashes the items and looks for statistical anomalies (like long streaks of zeros) to estimate the total count. It trades a tiny bit of accuracy (usually ~2% error) for massive, gigabyte-saving memory reductions, making it essential for big data analytics.
How helpful was this content?
Comments
Sign in to join the discussion
Saved on this device only
Sign in to sync progress across devices