Find near duplicate documents by comparing compact SimHash fingerprints under Hamming distance.
SimHash near duplicate search uses compact bit fingerprints to detect items with similar weighted features. It is commonly used for near duplicate documents, pages, records, and text fragments.
The main idea is to convert a high dimensional feature vector into a fixed width fingerprint, usually 64 bits. Similar inputs tend to produce fingerprints with small Hamming distance.
Problem
Given a collection of items and a query item , find stored items whose SimHash fingerprints differ from by at most bits.
Formally, find all items such that:
Model
Each item is represented as weighted features:
To compute a bit SimHash:
- Initialize an array of length with zeros.
- Hash each feature to a bit value.
- For each bit position, add the feature weight if the bit is 1, otherwise subtract the weight.
- Set output bit to 1 if , otherwise 0.
Algorithm
simhash_search(index, query, t):
fp = simhash(query)
candidates = lookup_candidate_buckets(index, fp)
result = empty list
for each item in candidates:
if hamming_distance(fp, item.fingerprint) <= t:
append item to result
return resultThe candidate lookup step usually partitions the fingerprint into bands. If two fingerprints are within a small Hamming distance, they must share at least one band under suitable partitioning.
Example
Suppose fingerprints are 8 bits:
| item | fingerprint |
|---|---|
| query | 11010110 |
| doc1 | 11010100 |
| doc2 | 01000111 |
| doc3 | 11011110 |
Compare by Hamming distance:
| item | distance from query |
|---|---|
| doc1 | 1 |
| doc2 | 3 |
| doc3 | 1 |
For threshold , return doc1 and doc3.
Correctness
SimHash search has two layers.
The verification layer is exact for the fingerprints: every candidate whose Hamming distance is at most is accepted, and every candidate above is rejected.
The candidate generation layer may be approximate, depending on the indexing scheme. A full scan over all fingerprints is exact for the chosen SimHash threshold. A banded index is faster but must be configured carefully to avoid missing valid near duplicates.
Complexity
Let:
- be the fingerprint width
- be the number of candidates
- be the number of stored items
| operation | time |
|---|---|
| full scan query | |
| indexed query | |
| Hamming check | for fixed width fingerprints |
Space
Each item stores a fixed width fingerprint plus optional bucket index entries.
When to Use
SimHash near duplicate search is appropriate when:
- items are represented by weighted features
- near duplicate detection is needed
- compact fingerprints are preferred
- small differences should preserve similarity
- approximate candidate generation is acceptable
It is common in web crawling, document deduplication, search indexing, spam detection, and content clustering.
Implementation
from collections import defaultdict
def feature_hash(feature):
return hash(feature) & ((1 << 64) - 1)
def simhash(features):
weights = [0] * 64
for feature, weight in features:
h = feature_hash(feature)
for bit in range(64):
if (h >> bit) & 1:
weights[bit] += weight
else:
weights[bit] -= weight
fp = 0
for bit, value in enumerate(weights):
if value >= 0:
fp |= 1 << bit
return fp
def hamming_distance(a, b):
return (a ^ b).bit_count()
class SimHashIndex:
def __init__(self, bands=4, band_bits=16):
self.bands = bands
self.band_bits = band_bits
self.buckets = defaultdict(list)
def _band_key(self, fp, band):
mask = (1 << self.band_bits) - 1
return (band, (fp >> (band * self.band_bits)) & mask)
def add(self, item_id, features):
fp = simhash(features)
for band in range(self.bands):
self.buckets[self._band_key(fp, band)].append((item_id, fp))
def search(self, features, threshold):
fp = simhash(features)
candidates = {}
for band in range(self.bands):
key = self._band_key(fp, band)
for item_id, other_fp in self.buckets.get(key, []):
candidates[item_id] = other_fp
result = []
for item_id, other_fp in candidates.items():
if hamming_distance(fp, other_fp) <= threshold:
result.append(item_id)
return resulttype Feature struct {
Name string
Weight int
}
type SimHashItem struct {
ID string
FP uint64
}
func hashFeature(s string) uint64 {
var h uint64 = 1469598103934665603
for i := 0; i < len(s); i++ {
h ^= uint64(s[i])
h *= 1099511628211
}
return h
}
func SimHash(features []Feature) uint64 {
var weights [64]int
for _, f := range features {
h := hashFeature(f.Name)
for bit := 0; bit < 64; bit++ {
if (h>>uint(bit))&1 == 1 {
weights[bit] += f.Weight
} else {
weights[bit] -= f.Weight
}
}
}
var fp uint64
for bit := 0; bit < 64; bit++ {
if weights[bit] >= 0 {
fp |= uint64(1) << uint(bit)
}
}
return fp
}
func HammingDistance(a, b uint64) int {
x := a ^ b
count := 0
for x != 0 {
x &= x - 1
count++
}
return count
}