Bloom Filters

Updated June 3, 2026

Magic Magnets Team

10 min read

Here's a question that comes up everywhere in systems engineering: "Have I seen this thing before?"

Is this URL in my cache? Has this username been taken? Is this email address on the spam list? Has this document been indexed already?

The naive answer is to keep a set and check it. That works until your set gets large enough that it doesn't fit in memory. At that point, you need something smarter. That something is a Bloom filter.

What a Bloom Filter Does

A Bloom filter is a space-efficient data structure that answers the question: is element X in the set?

It has two possible answers:

Definitely not in the set — 100% accurate, zero false negatives
Probably in the set — could be wrong, small false positive rate

That asymmetry is the whole trick. It can never tell you "not in the set" when the element actually is. But it can occasionally tell you "in the set" when the element actually isn't.

For most use cases — "should I do an expensive lookup or skip it?" — this is perfectly acceptable. You pay a tiny false positive rate in exchange for massive memory savings.

Quiz Time

A Bloom filter says an element is "probably in the set." What does this guarantee?

How It Works: Bit Array + Hash Functions

A Bloom filter is just two things:

A bit array of size m, initialized to all zeros
k independent hash functions, each mapping an element to a position in the bit array

Inserting an element: Run the element through all k hash functions. Each function gives you an index. Set the bits at those k indices to 1.

Querying an element: Run the element through the same k hash functions. Get the k indices. If all those bits are 1, the element is probably in the set. If any bit is 0, the element is definitely not in the set.

Let's trace through a simple example with m=10 bits and k=3 hash functions:

Insert "alice":

h1("alice") = 2 → set bit 2
h2("alice") = 5 → set bit 5
h3("alice") = 8 → set bit 8

Insert "bob":

h1("bob") = 1 → set bit 1
h2("bob") = 5 → set bit 5 (already set)
h3("bob") = 9 → set bit 9

Query "carol":

h1("carol") = 2 → bit 2 is 1
h2("carol") = 5 → bit 5 is 1
h3("carol") = 9 → bit 9 is 1

All bits are set, so the filter says "carol is probably in the set." But carol was never inserted! This is a false positive — "carol" happened to hash to positions that were already set by alice and bob.

Quiz Time

When querying an element, a Bloom filter returns "definitely not in the set." How is this determined?

Why There Are No False Negatives

This is the elegant part. When you insert an element, you set k specific bits. Those bits are never cleared (Bloom filters don't support deletion). So when you query that element later, those same k bits are still set. The filter will always say "probably in the set" for elements that were actually inserted.

The only way to get a "definitely not in the set" answer is if at least one of the k bit positions is still 0 — which means that exact combination was never set by any inserted element.

Quiz Time

Bloom filters cannot produce false negatives.

The Math: Sizing Your Bloom Filter

You control three variables: the number of elements n you'll insert, the size of the bit array m, and the number of hash functions k. Together they determine your false positive rate p.

The optimal number of hash functions:

k = (m/n) × ln(2) ≈ 0.693 × (m/n)

The false positive probability given optimal k:

p ≈ (1 - e^(-kn/m))^k

Working backwards: if you want a 1% false positive rate for 1 million elements, you need:

m = -n × ln(p) / (ln(2))²
m = -1,000,000 × ln(0.01) / 0.480
m ≈ 9,585,000 bits ≈ 1.2 MB

1.2 MB to represent 1 million elements with 1% error. A hash set of 1 million 32-byte strings would cost ~30 MB. That's a 25× compression ratio.

For 0.1% error on 1 million elements: ~1.8 MB. Still tiny.

Elements	False Positive Rate	Bloom Filter Size	Hash Set Size
1M	1%	1.2 MB	~30 MB
1M	0.1%	1.8 MB	~30 MB
100M	1%	120 MB	~3 GB
1B	1%	1.2 GB	~30 GB

Quiz Time

Approximately how many bits per element does a Bloom filter need to achieve a 1% false positive rate?

Where Bloom Filters Show Up in Production

Cassandra

Cassandra stores data in SSTables on disk. When you read a key, Cassandra might need to check multiple SSTables to find it. Each SSTable has a Bloom filter. Before touching disk, Cassandra checks the Bloom filter: if it says "definitely not here," Cassandra skips that SSTable entirely. This dramatically reduces disk I/O for reads.

Google BigTable and LevelDB

Same idea — each Level in LevelDB has a Bloom filter. Looking up a key that doesn't exist becomes fast because you can eliminate almost every SSTable without a disk read.

Chrome's Safe Browsing

Chrome maintains a local Bloom filter of known malicious URLs. When you navigate to a URL, Chrome first checks the local filter. "Definitely not in the malicious list" → proceed with no network call. "Probably malicious" → do a full server-side check. This avoids sending every URL you visit to Google's servers while still catching most malicious URLs with minimal latency.

Akamai CDN

Akamai uses Bloom filters to avoid caching "one-hit wonders" — objects that are only requested once and would just pollute the cache. A Bloom filter tracks whether an object has been seen before. Only objects seen at least twice get added to the main cache.

Quiz Time

Cassandra uses Bloom filters to reduce disk I/O during reads.

Medium

Medium uses Bloom filters to filter out articles you've already read from your recommendation feed. They store your read history in a Bloom filter — if it says "probably read," the article is excluded. A false positive means you occasionally miss an article you haven't actually read. Acceptable tradeoff.

Limitations

No deletion — You can't remove elements from a standard Bloom filter (setting bits back to 0 would break other elements that share those bits). There are variants like Counting Bloom Filters that support deletion at the cost of more memory.
No enumeration — You can't list all elements in a Bloom filter. It's purely for membership queries.
Fixed capacity — The false positive rate increases as you add more elements beyond the designed capacity. You need to know your expected n upfront.

Quiz Time

Why can't you delete an element from a standard Bloom filter?

Summary

A Bloom filter is a bit array plus k hash functions that gives you fast, space-efficient membership testing with zero false negatives and a tunable false positive rate. The math is clean: for 1% false positives, you need about 9.6 bits per element regardless of element size. For 0.1%, about 14.4 bits. These are tiny compared to storing the actual elements. That's why Bloom filters appear everywhere you need to answer "have I seen this?" at scale — Cassandra, BigTable, Chrome, Akamai, and dozens of other systems all rely on them. Understand the trade-off (false positives are possible, false negatives are not), size it correctly for your expected load, and it's one of the most powerful tools in your distributed systems toolkit.

Cuckoo Filter

How helpful was this content?

Comments

0/2000

Saved on this device only