19.9 MinHash

Many large-scale systems need to compare sets.

19.9 MinHash

Many large-scale systems need to compare sets. Search engines compare documents. Recommendation systems compare user interests. Security systems compare malware signatures. Data integration pipelines compare records from different sources.

The obvious approach is to compute the exact similarity between every pair of sets. Unfortunately, exact comparison becomes expensive as datasets grow. A search index may contain billions of documents. A recommendation platform may store billions of user interactions. Computing exact set intersections repeatedly can dominate system costs.

MinHash solves this problem by replacing large sets with compact probabilistic signatures. These signatures preserve similarity information and allow fast approximate comparisons. The remarkable property of MinHash is that the probability of two signatures matching equals the similarity of the original sets.

This transforms a potentially expensive set-comparison problem into a simple comparison of small arrays.

Problem

Suppose we have two sets:

A = {1, 2, 3, 4, 5}
B = {2, 3, 4, 5, 6}

How similar are they?

A common measure is the Jaccard similarity:

|A ∩ B|
--------
|A ∪ B|

Intersection:

{2, 3, 4, 5}

Size:

4

Union:

{1, 2, 3, 4, 5, 6}

Size:

6

Similarity:

4 / 6 = 0.667

For two small sets, this is easy.

For millions of large sets, repeated intersection and union computations become expensive.

Recipe: Random Permutations

The key insight behind MinHash begins with a random permutation.

Suppose we randomly reorder the universe:

Original:
1 2 3 4 5 6

Random permutation:

4 1 6 2 5 3

Now identify the first element of each set that appears in the permutation.

For:

A = {1,2,3,4,5}

the first element encountered is:

4

For:

B = {2,3,4,5,6}

the first element encountered is also:

4

The signatures match.

A different permutation may produce different results.

The remarkable fact is that the probability of a match equals the Jaccard similarity.

Why It Works

Consider:

A ∪ B

Under a random permutation, every element in the union is equally likely to appear first.

The first element must belong to one of two groups:

A ∩ B

or

(A ∪ B) - (A ∩ B)

A signature match occurs exactly when the first element belongs to:

A ∩ B

Therefore:

P(match)
=
|A ∩ B|
---------
|A ∪ B|

which is precisely the Jaccard similarity.

This observation is the foundation of MinHash.

Single Hash Estimate

Suppose:

J(A,B) = 0.75

A single MinHash comparison behaves like a coin flip:

Event Probability
Match 75%
No Match 25%

One observation is noisy.

We need many observations.

Building a Signature

Instead of one random permutation, use many.

Example:

Hash 1 → first element = 5
Hash 2 → first element = 2
Hash 3 → first element = 9
Hash 4 → first element = 1
Hash 5 → first element = 7

Signature:

[5, 2, 9, 1, 7]

Each position represents one independent MinHash experiment.

Two sets produce two signatures.

Similarity is estimated by counting matching positions.

Example

Set A signature:

[5, 2, 9, 1, 7]

Set B signature:

[5, 2, 8, 1, 4]

Matches:

5
2
1

Three matches out of five positions.

Estimated similarity:

3 / 5 = 0.6

Increasing signature length improves accuracy.

Practical Hash Functions

True random permutations are impractical for large universes.

Instead, MinHash simulates permutations using randomized hash functions.

A common family is:

def hash_i(x, a, b, p):
    return (a * x + b) % p

Each hash function uses different random coefficients.

For each set:

  1. Apply the hash function to every element.
  2. Keep the minimum value.
  3. Store that minimum.

Hence the name:

MinHash

The signature consists of multiple minimum hash values.

Implementation

import random

class MinHash:
    def __init__(
        self,
        num_hashes: int,
        max_value: int = 1_000_000
    ):
        self.prime = 2_147_483_647
        self.hash_functions = []

        for _ in range(num_hashes):
            a = random.randint(1, self.prime - 1)
            b = random.randint(0, self.prime - 1)

            self.hash_functions.append((a, b))

    def signature(
        self,
        values: set[int]
    ) -> list[int]:

        signature = []

        for a, b in self.hash_functions:
            minimum = min(
                ((a * x + b) % self.prime)
                for x in values
            )

            signature.append(minimum)

        return signature

Example:

mh = MinHash(100)

a = {1, 2, 3, 4, 5}
b = {2, 3, 4, 5, 6}

sig_a = mh.signature(a)
sig_b = mh.signature(b)

Estimating Similarity

def estimate_similarity(
    sig_a: list[int],
    sig_b: list[int]
) -> float:

    matches = sum(
        1
        for x, y in zip(sig_a, sig_b)
        if x == y
    )

    return matches / len(sig_a)

Usage:

estimate_similarity(sig_a, sig_b)

Possible output:

0.67

close to the exact Jaccard value.

Accuracy

Suppose the true similarity is:

0.6

Signature length:

k = 100

Expected matches:

60

The estimate fluctuates around the true value.

As:

k

increases:

Variance ↓
Accuracy ↑

Typical production signatures range from:

64

to

512

hashes.

Why MinHash Is Useful

Consider:

10 million documents

Each document contains:

10,000 shingles

Exact pairwise comparison requires huge amounts of memory and computation.

A MinHash signature might contain:

128 integers

instead.

Comparison becomes:

128 comparisons

rather than:

10,000 set operations

per document pair.

The savings are dramatic.

Application: Near-Duplicate Documents

Search engines frequently encounter duplicate content.

Example:

Document A:
"The quick brown fox jumps"

Document B:
"The quick brown fox jumps over"

The documents are nearly identical.

MinHash signatures become highly similar.

Documents with high estimated similarity can be grouped together before expensive exact processing.

This technique is widely used in web indexing.

Application: Recommendation Systems

Users can be represented as sets.

Example:

User A:
{movie1, movie2, movie3}

User B:
{movie2, movie3, movie4}

MinHash quickly identifies users with similar preferences.

The system can then perform more expensive analysis only on promising candidates.

Application: Data Deduplication

Large storage systems often contain:

  • Duplicate files
  • Duplicate records
  • Similar records
  • Versioned documents

MinHash helps identify likely duplicates without comparing every byte.

Locality-Sensitive Hashing

MinHash becomes even more powerful when combined with Locality-Sensitive Hashing (LSH).

Instead of comparing every signature pair:

O(n²)

LSH groups similar signatures into buckets.

Only nearby candidates are compared.

This technique enables similarity search at internet scale.

We examine LSH later in this chapter.

Complexity

Let:

n = set size
k = signature length

Building a signature:

O(kn)

Comparing two signatures:

O(k)

Storage:

O(k)

per set.

The comparison cost becomes independent of the original set size.

This is the key advantage.

Probabilistic Analysis

Each signature position is an independent Bernoulli trial.

Success probability:

p = Jaccard similarity

Expected matches:

kp

Variance:

kp(1-p)

Increasing:

k

reduces estimation error.

This follows directly from standard probability theory.

Common Mistakes

Using Too Few Hashes

Small signatures have high variance.

A signature length of:

4

or:

8

is usually insufficient.

Comparing Raw Sets Repeatedly

The value of MinHash comes from signature reuse.

Build once.

Compare many times.

Ignoring Universe Representation

Set construction often dominates overall cost.

Efficient preprocessing matters.

Confusing Similarity with Equality

A high similarity score does not imply identical sets.

MinHash estimates similarity, not exact equality.

Discussion

MinHash is one of the most elegant examples of randomized algorithm design. A simple probabilistic observation transforms expensive set comparisons into compact signatures that preserve similarity information. The resulting technique scales from thousands of sets to billions of documents while maintaining mathematically predictable accuracy.

The power of MinHash comes from replacing exact computation with a carefully designed estimator. Similar sets tend to produce similar signatures. Dissimilar sets tend to produce different signatures. This allows large systems to focus expensive computation only where it matters.

The next section examines HyperLogLog, another influential probabilistic algorithm that uses randomness to solve a different problem: estimating the number of distinct elements in massive datasets while using only a few kilobytes of memory.