MinHash

Updated June 3, 2026

Magic Magnets Team

6 min read

Imagine you are building a search engine like Google, and you want to prevent identical (or nearly identical) articles from flooding your search results. If someone copies a Wikipedia article, changes a few words, and posts it on their blog, you want to recognize it as a duplicate.

Checking for exact duplicates is easy. Just hash the entire document and compare the hashes. But checking for near duplicates is brutally hard.

If two 500-page books differ by just one comma, their standard cryptographic hashes (like SHA-256) will look completely different. Comparing every document word-by-word against every other document on the internet is mathematically impossible.

To solve the "Near-Duplicate Problem," we use an elegant algorithm called MinHash.

Shingling the Roof

Before we can use MinHash, we have to turn a document into a set of pieces. Think of overlapping roof shingles.

Instead of looking at the whole article, we break it into overlapping chunks of 5 words (called n-grams or shingles). "The quick brown fox jumps" "quick brown fox jumps over" "brown fox jumps over the"

Now, Document A is just a massive bucket of these 5-word shingles. Document B is another bucket of shingles. If they share 95% of their shingles, they are near-duplicates.

But checking every shingle in Bucket A against every shingle in Bucket B is still too slow. We need a shortcut.

Quiz Time

What is the main problem that MinHash solves that a standard cryptographic hash (like SHA-256) cannot?

The MinHash Trick

Instead of comparing millions of shingles, what if we randomly pick just one shingle from Bucket A and one from Bucket B? If they match, there's a good chance the documents are similar.

But picking randomly is messy. We need a systematic way to pick the "same" random shingle if it exists in both documents.

Here is the magic of MinHash:

We take a Hash Function and run every single shingle in Document A through it.
The Hash Function spits out a bunch of random-looking numbers.
We find the absolute lowest number (the Minimum Hash).
We throw away everything else. We just save that one minimum number to represent Document A.

We do the same thing for Document B using the exact same Hash Function.

Here is the mathematical breakthrough: If Document A and Document B are 80% similar, there is exactly an 80% chance that their Minimum Hash will be identical.

Quiz Time

In the shingling step, a document is broken into overlapping chunks of words (n-grams). What is the purpose of this step?

Quiz Time

If two documents are 70% similar, the probability that a single MinHash function produces the same minimum value for both documents is 70%.

Scaling it up

Relying on just one Hash Function is risky. So, in the real world, we use 100 different Hash Functions.

For a 500-page book, instead of storing millions of shingles, we run all the shingles through the 100 Hash Functions, find the minimums, and we are left with a tiny array of exactly 100 numbers (called a Signature).

To compare any two books in the world, we just compare their 100-number Signatures. If 85 out of the 100 numbers match, the books are 85% similar.

We turned a massive text-comparison problem into a simple, lightning-fast integer comparison!

Quiz Time

Why does the real-world implementation of MinHash use 100 hash functions instead of just one?

Quiz Time

A MinHash signature stores all the shingle hashes produced by every hash function, not just the minimum values.

Real-World Examples

Google Search: Uses MinHash to detect scraped or plagiarized content so they only show the original source in search results.
Bioinformatics: Used to compare massive strings of DNA to find genetic similarities between organisms.
News Aggregators: Grouping together articles from different websites that are reporting on the exact same press release.

Quiz Time

Which of the following is NOT a real-world application of MinHash mentioned in the article?

Summary

MinHash is a technique for quickly estimating how similar two sets of data are, specifically solving the "Near-Duplicate" problem. By converting documents into sets of shingles, hashing them with multiple functions, and keeping only the minimum values, MinHash creates a tiny "Signature" for massive documents. Comparing these small signatures allows systems to detect plagiarism or grouping similar items across billions of records in milliseconds.

Skip Lists

How helpful was this content?

Comments

0/2000

Saved on this device only