Estimate set similarity and find similar items using compact MinHash signatures and banding.
MinHash similarity search estimates the similarity between sets and retrieves candidates that are likely similar to a query. It is widely used for near duplicate detection when items are represented as sets, such as document shingles.
The key idea is that MinHash signatures preserve Jaccard similarity in expectation.
Problem
Given a collection of sets and a query set , find sets such that:
is high.
Model
The Jaccard similarity between two sets is:
J(A,B) = \frac{|A \cap B|}{|A \cup B|}
A MinHash signature of a set is constructed by applying multiple hash functions and taking the minimum hash value for each.
For hash functions:
The probability that two sets have equal MinHash values at a position equals their Jaccard similarity.
Algorithm
minhash_search(index, query, bands, rows):
sig = compute_minhash_signature(query)
candidates = empty set
for each band:
key = band_hash(sig, band)
for each item in index[band][key]:
add item to candidates
score each candidate using Jaccard or signature similarity
return best matchesExample
Let:
and:
Then:
So:
Their MinHash signatures will match in approximately 33 percent of positions.
Using banding, if at least one band matches exactly, the pair becomes a candidate.
Correctness
MinHash provides an unbiased estimator of Jaccard similarity. For each hash function:
Banding groups signature rows so that similar sets collide in at least one band with high probability, while dissimilar sets rarely collide.
The candidate generation is probabilistic. The final similarity check ensures correctness for the reported results.
Complexity
Let:
- be signature length
- be number of bands
- be number of candidates
| operation | time | ||
|---|---|---|---|
| signature computation | $O(k | Q | )$ |
| band lookup | |||
| candidate verification |
Space
Each item stores a signature of length plus band index entries.
When to Use
MinHash similarity search is appropriate when:
- items are sets
- Jaccard similarity is meaningful
- approximate search is acceptable
- large scale deduplication is needed
- candidate generation must be fast
It is common in web crawling, document deduplication, recommendation systems, clustering, and search indexing.
Implementation
import random
from collections import defaultdict
class MinHashLSH:
def __init__(self, num_hashes=64, bands=8):
self.num_hashes = num_hashes
self.bands = bands
self.rows = num_hashes // bands
self.hash_funcs = [
(random.randint(1, 10**9), random.randint(0, 10**9))
for _ in range(num_hashes)
]
self.buckets = [defaultdict(list) for _ in range(bands)]
def _hash(self, x, a, b):
return (a * hash(x) + b) % (2**31 - 1)
def signature(self, s):
sig = []
for a, b in self.hash_funcs:
sig.append(min(self._hash(x, a, b) for x in s))
return sig
def add(self, item_id, s):
sig = self.signature(s)
for b in range(self.bands):
start = b * self.rows
band = tuple(sig[start:start+self.rows])
self.buckets[b][band].append((item_id, sig))
def search(self, query):
sig = self.signature(query)
candidates = {}
for b in range(self.bands):
start = b * self.rows
band = tuple(sig[start:start+self.rows])
for item_id, other_sig in self.buckets[b].get(band, []):
candidates[item_id] = other_sig
return list(candidates.keys())type MinHash struct {
HashA []uint64
HashB []uint64
}
func (mh *MinHash) hash(x string, a, b uint64) uint64 {
var h uint64 = 1469598103934665603
for i := 0; i < len(x); i++ {
h ^= uint64(x[i])
h *= 1099511628211
}
return (a*h + b) % 2147483647
}
func (mh *MinHash) Signature(set []string) []uint64 {
k := len(mh.HashA)
sig := make([]uint64, k)
for i := 0; i < k; i++ {
minVal := uint64(^uint64(0))
for _, x := range set {
v := mh.hash(x, mh.HashA[i], mh.HashB[i])
if v < minVal {
minVal = v
}
}
sig[i] = minVal
}
return sig
}