Skip to content

SimHash Near Duplicate Search

Find near duplicate documents by comparing compact SimHash fingerprints under Hamming distance.

SimHash near duplicate search uses compact bit fingerprints to detect items with similar weighted features. It is commonly used for near duplicate documents, pages, records, and text fragments.

The main idea is to convert a high dimensional feature vector into a fixed width fingerprint, usually 64 bits. Similar inputs tend to produce fingerprints with small Hamming distance.

Problem

Given a collection of items and a query item qq, find stored items whose SimHash fingerprints differ from qq by at most tt bits.

Formally, find all items xx such that:

HammingDistance(simhash(q),simhash(x))t \operatorname{HammingDistance}(\operatorname{simhash}(q), \operatorname{simhash}(x)) \le t

Model

Each item is represented as weighted features:

(f1,w1),(f2,w2),,(fn,wn) (f_1, w_1), (f_2, w_2), \dots, (f_n, w_n)

To compute a bb bit SimHash:

  1. Initialize an array vv of length bb with zeros.
  2. Hash each feature to a bb bit value.
  3. For each bit position, add the feature weight if the bit is 1, otherwise subtract the weight.
  4. Set output bit jj to 1 if v[j]0v[j] \ge 0, otherwise 0.

Algorithm

simhash_search(index, query, t):
    fp = simhash(query)
    candidates = lookup_candidate_buckets(index, fp)

    result = empty list
    for each item in candidates:
        if hamming_distance(fp, item.fingerprint) <= t:
            append item to result

    return result

The candidate lookup step usually partitions the fingerprint into bands. If two fingerprints are within a small Hamming distance, they must share at least one band under suitable partitioning.

Example

Suppose fingerprints are 8 bits:

itemfingerprint
query11010110
doc111010100
doc201000111
doc311011110

Compare by Hamming distance:

itemdistance from query
doc11
doc23
doc31

For threshold t=1t = 1, return doc1 and doc3.

Correctness

SimHash search has two layers.

The verification layer is exact for the fingerprints: every candidate whose Hamming distance is at most tt is accepted, and every candidate above tt is rejected.

The candidate generation layer may be approximate, depending on the indexing scheme. A full scan over all fingerprints is exact for the chosen SimHash threshold. A banded index is faster but must be configured carefully to avoid missing valid near duplicates.

Complexity

Let:

  • bb be the fingerprint width
  • CC be the number of candidates
  • nn be the number of stored items
operationtime
full scan queryO(n)O(n)
indexed queryO(C)O(C)
Hamming checkO(1)O(1) for fixed width fingerprints

Space

O(n) O(n)

Each item stores a fixed width fingerprint plus optional bucket index entries.

When to Use

SimHash near duplicate search is appropriate when:

  • items are represented by weighted features
  • near duplicate detection is needed
  • compact fingerprints are preferred
  • small differences should preserve similarity
  • approximate candidate generation is acceptable

It is common in web crawling, document deduplication, search indexing, spam detection, and content clustering.

Implementation

from collections import defaultdict

def feature_hash(feature):
    return hash(feature) & ((1 << 64) - 1)

def simhash(features):
    weights = [0] * 64

    for feature, weight in features:
        h = feature_hash(feature)
        for bit in range(64):
            if (h >> bit) & 1:
                weights[bit] += weight
            else:
                weights[bit] -= weight

    fp = 0
    for bit, value in enumerate(weights):
        if value >= 0:
            fp |= 1 << bit

    return fp

def hamming_distance(a, b):
    return (a ^ b).bit_count()

class SimHashIndex:
    def __init__(self, bands=4, band_bits=16):
        self.bands = bands
        self.band_bits = band_bits
        self.buckets = defaultdict(list)

    def _band_key(self, fp, band):
        mask = (1 << self.band_bits) - 1
        return (band, (fp >> (band * self.band_bits)) & mask)

    def add(self, item_id, features):
        fp = simhash(features)
        for band in range(self.bands):
            self.buckets[self._band_key(fp, band)].append((item_id, fp))

    def search(self, features, threshold):
        fp = simhash(features)
        candidates = {}

        for band in range(self.bands):
            key = self._band_key(fp, band)
            for item_id, other_fp in self.buckets.get(key, []):
                candidates[item_id] = other_fp

        result = []
        for item_id, other_fp in candidates.items():
            if hamming_distance(fp, other_fp) <= threshold:
                result.append(item_id)

        return result
type Feature struct {
	Name   string
	Weight int
}

type SimHashItem struct {
	ID string
	FP uint64
}

func hashFeature(s string) uint64 {
	var h uint64 = 1469598103934665603
	for i := 0; i < len(s); i++ {
		h ^= uint64(s[i])
		h *= 1099511628211
	}
	return h
}

func SimHash(features []Feature) uint64 {
	var weights [64]int

	for _, f := range features {
		h := hashFeature(f.Name)
		for bit := 0; bit < 64; bit++ {
			if (h>>uint(bit))&1 == 1 {
				weights[bit] += f.Weight
			} else {
				weights[bit] -= f.Weight
			}
		}
	}

	var fp uint64
	for bit := 0; bit < 64; bit++ {
		if weights[bit] >= 0 {
			fp |= uint64(1) << uint(bit)
		}
	}

	return fp
}

func HammingDistance(a, b uint64) int {
	x := a ^ b
	count := 0
	for x != 0 {
		x &= x - 1
		count++
	}
	return count
}