MinHash Similarity Search

MinHash similarity search estimates the similarity between sets and retrieves candidates that are likely similar to a query. It is widely used for near duplicate detection when items are represented as sets, such as document shingles.

The key idea is that MinHash signatures preserve Jaccard similarity in expectation.

Problem

Given a collection of sets and a query set $Q$ , find sets $S$ such that:

\operatorname{Jaccard}(Q, S)

is high.

Model

The Jaccard similarity between two sets is:

J(A,B) = \frac{|A \cap B|}{|A \cup B|}

A MinHash signature of a set is constructed by applying multiple hash functions and taking the minimum hash value for each.

For $k$ hash functions:

\text{signature}(S) = [\min h_1(S), \min h_2(S), \dots, \min h_k(S)]

The probability that two sets have equal MinHash values at a position equals their Jaccard similarity.

Algorithm

minhash_search(index, query, bands, rows):
    sig = compute_minhash_signature(query)
    candidates = empty set

    for each band:
        key = band_hash(sig, band)
        for each item in index[band][key]:
            add item to candidates

    score each candidate using Jaccard or signature similarity
    return best matches

Example

Let:

Q = {a, b, c, d}

and:

S = {a, b, e, f}

Then:

$|Q \cap S| = 2$
$|Q \cup S| = 6$

So:

J(Q, S) = \frac{2}{6} = 0.33

Their MinHash signatures will match in approximately 33 percent of positions.

Using banding, if at least one band matches exactly, the pair becomes a candidate.

Correctness

MinHash provides an unbiased estimator of Jaccard similarity. For each hash function:

\Pr[\min h_i(Q) = \min h_i(S)] = J(Q, S)

Banding groups signature rows so that similar sets collide in at least one band with high probability, while dissimilar sets rarely collide.

The candidate generation is probabilistic. The final similarity check ensures correctness for the reported results.

Complexity

Let:

$k$ be signature length
$b$ be number of bands
$C$ be number of candidates

operation	time
signature computation	$O(k	Q	)$
band lookup	$O(b)$
candidate verification	$O(C)$

Space

O(nk)

Each item stores a signature of length $k$ plus band index entries.

When to Use

MinHash similarity search is appropriate when:

items are sets
Jaccard similarity is meaningful
approximate search is acceptable
large scale deduplication is needed
candidate generation must be fast

It is common in web crawling, document deduplication, recommendation systems, clustering, and search indexing.

Implementation

import random
from collections import defaultdict

class MinHashLSH:
    def __init__(self, num_hashes=64, bands=8):
        self.num_hashes = num_hashes
        self.bands = bands
        self.rows = num_hashes // bands
        self.hash_funcs = [
            (random.randint(1, 10**9), random.randint(0, 10**9))
            for _ in range(num_hashes)
        ]
        self.buckets = [defaultdict(list) for _ in range(bands)]

    def _hash(self, x, a, b):
        return (a * hash(x) + b) % (2**31 - 1)

    def signature(self, s):
        sig = []
        for a, b in self.hash_funcs:
            sig.append(min(self._hash(x, a, b) for x in s))
        return sig

    def add(self, item_id, s):
        sig = self.signature(s)
        for b in range(self.bands):
            start = b * self.rows
            band = tuple(sig[start:start+self.rows])
            self.buckets[b][band].append((item_id, sig))

    def search(self, query):
        sig = self.signature(query)
        candidates = {}

        for b in range(self.bands):
            start = b * self.rows
            band = tuple(sig[start:start+self.rows])
            for item_id, other_sig in self.buckets[b].get(band, []):
                candidates[item_id] = other_sig

        return list(candidates.keys())

type MinHash struct {
	HashA []uint64
	HashB []uint64
}

func (mh *MinHash) hash(x string, a, b uint64) uint64 {
	var h uint64 = 1469598103934665603
	for i := 0; i < len(x); i++ {
		h ^= uint64(x[i])
		h *= 1099511628211
	}
	return (a*h + b) % 2147483647
}

func (mh *MinHash) Signature(set []string) []uint64 {
	k := len(mh.HashA)
	sig := make([]uint64, k)

	for i := 0; i < k; i++ {
		minVal := uint64(^uint64(0))
		for _, x := range set {
			v := mh.hash(x, mh.HashA[i], mh.HashB[i])
			if v < minVal {
				minVal = v
			}
		}
		sig[i] = minVal
	}

	return sig
}