Locality Sensitive Hashing Search

Search for approximate nearest neighbors by hashing similar items into the same buckets with high probability.

Locality sensitive hashing, often abbreviated as LSH, is a search technique for approximate nearest neighbor queries. It uses hash functions that are designed so that similar objects are more likely to collide than dissimilar objects.

You use it when exact nearest neighbor search is too expensive, especially in high dimensional spaces.

Problem

Given a dataset D and a query object q, find items in D that are likely to be close to q under a similarity or distance function.

Common examples include:

  • cosine similarity for vectors
  • Jaccard similarity for sets
  • Hamming distance for bit strings
  • Euclidean distance for points

Model

An LSH family is a family of hash functions where close objects collide with higher probability than far objects.

For a hash function h from the family:

\Pr[h(x) = h(y)]

is larger when x and y are similar.

To improve recall, LSH usually uses multiple hash tables. A query hashes into each table, retrieves candidates from matching buckets, then verifies candidates using the real distance function.
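To make the collision property concrete, here is a small sketch using one well-known LSH family for cosine similarity: the sign of a dot product with a random Gaussian hyperplane. The vectors and trial count are illustrative; the estimate approaches 1 - angle/pi for this family.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def collision_rate(x, y, trials=5000, seed=0):
    """Estimate Pr[h(x) = h(y)] for h(v) = sign(dot(v, plane)), plane ~ Gaussian."""
    rng = random.Random(seed)
    dim = len(x)
    hits = 0
    for _ in range(trials):
        plane = [rng.gauss(0, 1) for _ in range(dim)]
        if (dot(x, plane) >= 0) == (dot(y, plane) >= 0):
            hits += 1
    return hits / trials

close = collision_rate([1.0, 0.0], [0.9, 0.1])   # small angle between vectors
far = collision_rate([1.0, 0.0], [-1.0, 0.2])    # nearly opposite vectors
# close is near 1 - angle/pi (about 0.96 here); far is near 0.06
```

A similar pair collides on a single hash bit far more often than a dissimilar pair, which is exactly the property the multi-table scheme amplifies.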

Algorithm

lsh_search(tables, q, limit):
    candidates = empty set

    for each table in tables:
        key = table.hash(q)
        for each item in table.bucket[key]:
            add item to candidates

    score each candidate by true distance to q
    return best limit candidates
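The pseudocode above can be rendered as a short self-contained Python sketch. The table layout (a list of `(hash_fn, buckets)` pairs) and the toy sign-pattern hash are illustrative assumptions, not part of the algorithm itself:

```python
def lsh_search(tables, q, limit, distance):
    # tables: list of (hash_fn, buckets); buckets maps key -> [(item_id, vector)]
    candidates = {}
    for hash_fn, buckets in tables:
        key = hash_fn(q)                      # hash the query into this table
        for item_id, vec in buckets.get(key, []):
            candidates[item_id] = vec         # dedupe across tables
    # verify candidates with the true distance and return the best `limit`
    ranked = sorted(candidates.items(), key=lambda kv: distance(q, kv[1]))
    return [item_id for item_id, _ in ranked[:limit]]

# Toy setup: a single table whose hash is the sign pattern of the vector.
def sign_key(v):
    return tuple(x >= 0 for x in v)

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

buckets = {}
for item_id, vec in [("a", [1.0, 2.0]), ("b", [1.5, 1.5]), ("c", [-1.0, 2.0])]:
    buckets.setdefault(sign_key(vec), []).append((item_id, vec))

result = lsh_search([(sign_key, buckets)], [1.0, 1.8], limit=2, distance=euclidean)
# "c" falls in a different bucket, so only "a" and "b" are verified
```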

Example

Suppose we search for near duplicate documents using MinHash based LSH.

A document is represented as a set of shingles:

S_q = \{a, b, c, d, e\}

Each LSH table hashes one band of the query's MinHash signature. Documents whose corresponding band matches become candidates.

table  query bucket  candidates
1      18            doc2, doc9
2      44            doc9
3      7             doc1, doc2

The candidate set is:

\{\text{doc1}, \text{doc2}, \text{doc9}\}

Only these candidates are compared with the true similarity measure.
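A minimal sketch of the signature-and-bands step, assuming 12 MinHash functions split into 4 bands of 3 rows; the linear hash parameters and band sizes here are illustrative choices, not fixed by the technique:

```python
import random

def minhash_signature(shingles, num_hashes=12, seed=0):
    """One minimum per hash function over the shingle set."""
    rng = random.Random(seed)
    prime = 2**31 - 1
    # Illustrative linear hash functions; built-in hash() is stable within a run.
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    return [min((a * hash(s) + b) % prime for s in shingles) for a, b in params]

def bands(signature, rows_per_band=3):
    """Each band of the signature becomes a key into one LSH table."""
    return [tuple(signature[i:i + rows_per_band])
            for i in range(0, len(signature), rows_per_band)]

S_q = {"a", "b", "c", "d", "e"}
S_near = {"a", "b", "c", "d", "f"}   # Jaccard similarity 4/6
matching = sum(bq == bn for bq, bn in
               zip(bands(minhash_signature(S_q)), bands(minhash_signature(S_near))))
# S_near becomes a candidate for S_q if at least one band matches exactly.
```

Each MinHash value agrees with probability equal to the Jaccard similarity, so a whole band of 3 rows matches with probability (4/6)^3, and more tables (bands) raise the chance of at least one match.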

Correctness

LSH is an approximate algorithm. It does not guarantee that the exact nearest neighbor will be returned.

Its correctness is probabilistic: similar items have a high probability of colliding in at least one table, so they are likely to become candidates. Dissimilar items have a lower probability of collision, so the candidate set is usually much smaller than the full dataset.

The final verification step ranks candidates using the actual distance or similarity function.

Complexity

Let:

  • L be the number of hash tables
  • C be the number of retrieved candidates
  • d be the cost of computing true distance

operation               time
query hashing           O(L)
candidate verification  O(Cd)
total query             O(L + Cd)

The practical goal is to keep C \ll n, where n is the number of items in the dataset.

Space

O(Ln)

Each item is stored once per hash table, plus metadata for buckets and signatures.

When to Use

Locality sensitive hashing is appropriate when:

  • approximate search is acceptable
  • exact nearest neighbor search is too slow
  • data is high dimensional
  • candidate generation is needed before exact reranking
  • recall can be improved by using more tables

It is common in image search, near duplicate detection, recommendation systems, vector search, document retrieval, and clustering pipelines.

Implementation

import random
from collections import defaultdict

class RandomHyperplaneLSH:
    def __init__(self, dim, tables=8, bits=8):
        self.dim = dim
        self.tables = tables
        self.bits = bits
        self.planes = [
            [[random.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
            for _ in range(tables)
        ]
        self.buckets = [defaultdict(list) for _ in range(tables)]

    def _dot(self, a, b):
        return sum(x * y for x, y in zip(a, b))

    def _hash(self, table_id, vector):
        # Each bit records which side of one random hyperplane the vector lies on.
        key = 0
        for i, plane in enumerate(self.planes[table_id]):
            if self._dot(vector, plane) >= 0:
                key |= 1 << i
        return key

    def add(self, item_id, vector):
        for t in range(self.tables):
            key = self._hash(t, vector)
            self.buckets[t][key].append((item_id, vector))

    def search(self, query, limit=10):
        seen = {}
        for t in range(self.tables):
            key = self._hash(t, query)
            for item_id, vector in self.buckets[t].get(key, []):
                seen[item_id] = vector

        # Verification: rank candidates by dot product with the query
        # (proportional to cosine similarity when vectors are normalized).
        scored = []
        for item_id, vector in seen.items():
            score = self._dot(query, vector)
            scored.append((score, item_id))

        scored.sort(reverse=True)
        return [item_id for _, item_id in scored[:limit]]
The same structure in Go:

type Item struct {
	ID     string
	Vector []float64
}

type LSH struct {
	Buckets []map[uint64][]Item
	Planes  [][][]float64
}

func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

func (l *LSH) hash(table int, v []float64) uint64 {
	var key uint64
	for i, plane := range l.Planes[table] {
		if dot(v, plane) >= 0 {
			key |= uint64(1) << uint(i)
		}
	}
	return key
}

// Search collects the deduplicated candidate set; callers rerank
// the result with the true distance or similarity function.
func (l *LSH) Search(q []float64) []Item {
	seen := map[string]Item{}

	for t := range l.Buckets {
		key := l.hash(t, q)
		for _, item := range l.Buckets[t][key] {
			seen[item.ID] = item
		}
	}

	out := make([]Item, 0, len(seen))
	for _, item := range seen {
		out = append(out, item)
	}

	return out
}