19.9 MinHash
Many large-scale systems need to compare sets.
19.9 MinHash
Many large-scale systems need to compare sets. Search engines compare documents. Recommendation systems compare user interests. Security systems compare malware signatures. Data integration pipelines compare records from different sources.
The obvious approach is to compute the exact similarity between every pair of sets. Unfortunately, exact comparison becomes expensive as datasets grow. A search index may contain billions of documents. A recommendation platform may store billions of user interactions. Computing exact set intersections repeatedly can dominate system costs.
MinHash solves this problem by replacing large sets with compact probabilistic signatures. These signatures preserve similarity information and allow fast approximate comparisons. The remarkable property of MinHash is that the probability of two signatures matching equals the similarity of the original sets.
This transforms a potentially expensive set-comparison problem into a simple comparison of small arrays.
Problem
Suppose we have two sets:
A = {1, 2, 3, 4, 5}
B = {2, 3, 4, 5, 6}
How similar are they?
A common measure is the Jaccard similarity:
|A ∩ B|
--------
|A ∪ B|
Intersection:
{2, 3, 4, 5}
Size:
4
Union:
{1, 2, 3, 4, 5, 6}
Size:
6
Similarity:
4 / 6 = 0.667
For two small sets, this is easy.
For millions of large sets, repeated intersection and union computations become expensive.
Recipe: Random Permutations
The key insight behind MinHash begins with a random permutation.
Suppose we randomly reorder the universe:
Original:
1 2 3 4 5 6
Random permutation:
4 1 6 2 5 3
Now identify the first element of each set that appears in the permutation.
For:
A = {1,2,3,4,5}
the first element encountered is:
4
For:
B = {2,3,4,5,6}
the first element encountered is also:
4
The signatures match.
A different permutation may produce different results.
The remarkable fact is that the probability of a match equals the Jaccard similarity.
Why It Works
Consider:
A ∪ B
Under a random permutation, every element in the union is equally likely to appear first.
The first element must belong to one of two groups:
A ∩ B
or
(A ∪ B) - (A ∩ B)
A signature match occurs exactly when the first element belongs to:
A ∩ B
Therefore:
P(match)
=
|A ∩ B|
---------
|A ∪ B|
which is precisely the Jaccard similarity.
This observation is the foundation of MinHash.
Single Hash Estimate
Suppose:
J(A,B) = 0.75
A single MinHash comparison behaves like a coin flip:
| Event | Probability |
|---|---|
| Match | 75% |
| No Match | 25% |
One observation is noisy.
We need many observations.
Building a Signature
Instead of one random permutation, use many.
Example:
Hash 1 → first element = 5
Hash 2 → first element = 2
Hash 3 → first element = 9
Hash 4 → first element = 1
Hash 5 → first element = 7
Signature:
[5, 2, 9, 1, 7]
Each position represents one independent MinHash experiment.
Two sets produce two signatures.
Similarity is estimated by counting matching positions.
Example
Set A signature:
[5, 2, 9, 1, 7]
Set B signature:
[5, 2, 8, 1, 4]
Matches:
5
2
1
Three matches out of five positions.
Estimated similarity:
3 / 5 = 0.6
Increasing signature length improves accuracy.
Practical Hash Functions
True random permutations are impractical for large universes.
Instead, MinHash simulates permutations using randomized hash functions.
A common family is:
def hash_i(x, a, b, p):
return (a * x + b) % p
Each hash function uses different random coefficients.
For each set:
- Apply the hash function to every element.
- Keep the minimum value.
- Store that minimum.
Hence the name:
MinHash
The signature consists of multiple minimum hash values.
Implementation
import random
class MinHash:
def __init__(
self,
num_hashes: int,
max_value: int = 1_000_000
):
self.prime = 2_147_483_647
self.hash_functions = []
for _ in range(num_hashes):
a = random.randint(1, self.prime - 1)
b = random.randint(0, self.prime - 1)
self.hash_functions.append((a, b))
def signature(
self,
values: set[int]
) -> list[int]:
signature = []
for a, b in self.hash_functions:
minimum = min(
((a * x + b) % self.prime)
for x in values
)
signature.append(minimum)
return signature
Example:
mh = MinHash(100)
a = {1, 2, 3, 4, 5}
b = {2, 3, 4, 5, 6}
sig_a = mh.signature(a)
sig_b = mh.signature(b)
Estimating Similarity
def estimate_similarity(
sig_a: list[int],
sig_b: list[int]
) -> float:
matches = sum(
1
for x, y in zip(sig_a, sig_b)
if x == y
)
return matches / len(sig_a)
Usage:
estimate_similarity(sig_a, sig_b)
Possible output:
0.67
close to the exact Jaccard value.
Accuracy
Suppose the true similarity is:
0.6
Signature length:
k = 100
Expected matches:
60
The estimate fluctuates around the true value.
As:
k
increases:
Variance ↓
Accuracy ↑
Typical production signatures range from:
64
to
512
hashes.
Why MinHash Is Useful
Consider:
10 million documents
Each document contains:
10,000 shingles
Exact pairwise comparison requires huge amounts of memory and computation.
A MinHash signature might contain:
128 integers
instead.
Comparison becomes:
128 comparisons
rather than:
10,000 set operations
per document pair.
The savings are dramatic.
Application: Near-Duplicate Documents
Search engines frequently encounter duplicate content.
Example:
Document A:
"The quick brown fox jumps"
Document B:
"The quick brown fox jumps over"
The documents are nearly identical.
MinHash signatures become highly similar.
Documents with high estimated similarity can be grouped together before expensive exact processing.
This technique is widely used in web indexing.
Application: Recommendation Systems
Users can be represented as sets.
Example:
User A:
{movie1, movie2, movie3}
User B:
{movie2, movie3, movie4}
MinHash quickly identifies users with similar preferences.
The system can then perform more expensive analysis only on promising candidates.
Application: Data Deduplication
Large storage systems often contain:
- Duplicate files
- Duplicate records
- Similar records
- Versioned documents
MinHash helps identify likely duplicates without comparing every byte.
Locality-Sensitive Hashing
MinHash becomes even more powerful when combined with Locality-Sensitive Hashing (LSH).
Instead of comparing every signature pair:
O(n²)
LSH groups similar signatures into buckets.
Only nearby candidates are compared.
This technique enables similarity search at internet scale.
We examine LSH later in this chapter.
Complexity
Let:
n = set size
k = signature length
Building a signature:
O(kn)
Comparing two signatures:
O(k)
Storage:
O(k)
per set.
The comparison cost becomes independent of the original set size.
This is the key advantage.
Probabilistic Analysis
Each signature position is an independent Bernoulli trial.
Success probability:
p = Jaccard similarity
Expected matches:
kp
Variance:
kp(1-p)
Increasing:
k
reduces estimation error.
This follows directly from standard probability theory.
Common Mistakes
Using Too Few Hashes
Small signatures have high variance.
A signature length of:
4
or:
8
is usually insufficient.
Comparing Raw Sets Repeatedly
The value of MinHash comes from signature reuse.
Build once.
Compare many times.
Ignoring Universe Representation
Set construction often dominates overall cost.
Efficient preprocessing matters.
Confusing Similarity with Equality
A high similarity score does not imply identical sets.
MinHash estimates similarity, not exact equality.
Discussion
MinHash is one of the most elegant examples of randomized algorithm design. A simple probabilistic observation transforms expensive set comparisons into compact signatures that preserve similarity information. The resulting technique scales from thousands of sets to billions of documents while maintaining mathematically predictable accuracy.
The power of MinHash comes from replacing exact computation with a carefully designed estimator. Similar sets tend to produce similar signatures. Dissimilar sets tend to produce different signatures. This allows large systems to focus expensive computation only where it matters.
The next section examines HyperLogLog, another influential probabilistic algorithm that uses randomness to solve a different problem: estimating the number of distinct elements in massive datasets while using only a few kilobytes of memory.