Search for approximate nearest neighbors by hashing similar items into the same buckets with high probability.
Locality-sensitive hashing, often abbreviated as LSH, is a search technique for approximate nearest neighbor queries. It uses hash functions that are designed so that similar objects are more likely to collide than dissimilar objects.
You use it when exact nearest neighbor search is too expensive, especially in high dimensional spaces.
Problem
Given a dataset D and a query object q, find items in D that are likely to be close to q under a similarity or distance function.
Common examples include:
- cosine similarity for vectors
- Jaccard similarity for sets
- Hamming distance for bit strings
- Euclidean distance for points
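Each of these measures is simple to compute directly; a minimal reference implementation (the function names here are illustrative, not from any particular library):

```python
import math

def cosine(a, b):
    # cosine similarity: dot(a, b) / (|a| * |b|)
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def jaccard(a, b):
    # |A intersect B| / |A union B| for sets
    return len(a & b) / len(a | b)

def hamming(a, b):
    # number of differing positions in equal-length bit strings
    return sum(x != y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Each measure has a matching LSH family: random hyperplanes for cosine, MinHash for Jaccard, bit sampling for Hamming, and random projections for Euclidean distance.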
Model
An LSH family is a family of hash functions where close objects collide with higher probability than far objects.
For a hash function h drawn from the family, the collision probability P[h(x) = h(y)] is larger when x and y are similar.
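As a concrete instance, for the random-hyperplane family the single-hash collision probability for two vectors at angle theta is 1 - theta/pi. A quick Monte Carlo check of that formula (a sketch, using an illustrative 2-D setup and a fixed seed):

```python
import math
import random

def hyperplane_collision_prob(theta):
    # For sign-of-dot-product hashing, vectors at angle theta
    # collide on one random hyperplane with probability 1 - theta/pi.
    return 1 - theta / math.pi

def estimate(theta, trials=20000, seed=1):
    # Empirical estimate: draw random Gaussian hyperplanes and count
    # how often both vectors fall on the same side.
    rng = random.Random(seed)
    x = (1.0, 0.0)
    y = (math.cos(theta), math.sin(theta))
    hits = 0
    for _ in range(trials):
        p = (rng.gauss(0, 1), rng.gauss(0, 1))
        same = (p[0] * x[0] + p[1] * x[1] >= 0) == (p[0] * y[0] + p[1] * y[1] >= 0)
        hits += same
    return hits / trials
```

At theta = pi/4 the formula gives 0.75, and the empirical estimate lands close to it.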
To improve recall, LSH usually uses multiple hash tables. At query time, the query is hashed into each table, candidates are retrieved from the matching buckets, and the candidates are then verified with the real distance function.
Algorithm
```
lsh_search(tables, q, limit):
    candidates = empty set
    for each table in tables:
        key = table.hash(q)
        for each item in table.bucket[key]:
            add item to candidates
    score each candidate by true distance to q
    return best limit candidates
```
Example
Suppose we search for near duplicate documents using MinHash based LSH.
A document is represented as a set of shingles (short overlapping substrings of its text).
Each LSH table hashes one band of the query's MinHash signature. Documents whose band matches land in the same bucket and become candidates.
| table | query bucket | candidates |
|---|---|---|
| 1 | 18 | doc2, doc9 |
| 2 | 44 | doc9 |
| 3 | 7 | doc1, doc2 |
The candidate set is the union across tables: {doc1, doc2, doc9}.
Only these candidates are compared with the true similarity measure.
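The MinHash-and-banding step above can be sketched as follows. This is a minimal illustration, not a production signature scheme: the hash parameters, signature length, and band width are arbitrary choices, and it relies on Python's built-in `hash` for shingles.

```python
import random

def minhash_signature(shingles, num_hashes=12, seed=7):
    # One minimum per hash function approximates a random permutation
    # of the shingle universe; equal sets get equal signatures.
    rng = random.Random(seed)
    prime = 2**31 - 1
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    return [min((a * hash(s) + b) % prime for s in shingles)
            for a, b in params]

def bands(signature, rows_per_band=3):
    # Split the signature into bands; each band is one LSH table key.
    return [tuple(signature[i:i + rows_per_band])
            for i in range(0, len(signature), rows_per_band)]
```

Documents whose shingle sets overlap heavily tend to agree on entire bands, so they hash to the same bucket in at least one table.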
Correctness
LSH is an approximate algorithm. It does not guarantee that the exact nearest neighbor will be returned.
Its correctness is probabilistic: similar items have a high probability of colliding in at least one table, so they are likely to become candidates. Dissimilar items have a lower probability of collision, so the candidate set is usually much smaller than the full dataset.
The final verification step ranks candidates using the actual distance or similarity function.
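The amplification effect of multiple tables can be made concrete. If two items collide on a single hash with probability p, a table of `bits` concatenated hashes matches with probability p^bits, and at least one of `tables` independent tables matches with probability 1 - (1 - p^bits)^tables:

```python
def candidate_prob(p, bits, tables):
    # p: collision probability of one hash for a given pair of items.
    # A table matches only if all `bits` hashes agree: p ** bits.
    # The pair becomes a candidate if any of the `tables` tables matches.
    return 1 - (1 - p ** bits) ** tables
```

With 8 tables of 8 bits, a similar pair with p = 0.9 is almost certain to become a candidate, while a dissimilar pair with p = 0.5 rarely does. This gap is what keeps the candidate set small without losing too much recall.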
Complexity
Let:
- T be the number of hash tables
- C be the number of retrieved candidates
- d be the cost of computing true distance
| operation | time |
|---|---|
| query hashing | O(T) hash computations |
| candidate verification | O(C · d) |
| total query | O(T + C · d) |
The practical goal is to keep C far smaller than the size of the dataset, so that verification touches only a small candidate set.
Space
Each item is stored once per hash table, plus metadata for buckets and signatures.
When to Use
Locality-sensitive hashing is appropriate when:
- approximate search is acceptable
- exact nearest neighbor search is too slow
- data is high dimensional
- candidate generation is needed before exact reranking
- recall can be improved by using more tables
It is common in image search, near duplicate detection, recommendation systems, vector search, document retrieval, and clustering pipelines.
Implementation
```python
import random
from collections import defaultdict


class RandomHyperplaneLSH:
    """Cosine-style LSH: each bit of a bucket key is the sign of a
    dot product with a random Gaussian hyperplane."""

    def __init__(self, dim, tables=8, bits=8):
        self.dim = dim
        self.tables = tables
        self.bits = bits
        # One set of `bits` random hyperplanes per table.
        self.planes = [
            [[random.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
            for _ in range(tables)
        ]
        self.buckets = [defaultdict(list) for _ in range(tables)]

    def _dot(self, a, b):
        return sum(x * y for x, y in zip(a, b))

    def _hash(self, table_id, vector):
        # Pack the sign of each projection into one integer bucket key.
        key = 0
        for i, plane in enumerate(self.planes[table_id]):
            if self._dot(vector, plane) >= 0:
                key |= 1 << i
        return key

    def add(self, item_id, vector):
        for t in range(self.tables):
            key = self._hash(t, vector)
            self.buckets[t][key].append((item_id, vector))

    def search(self, query, limit=10):
        # Gather candidates from the matching bucket of every table.
        seen = {}
        for t in range(self.tables):
            key = self._hash(t, query)
            for item_id, vector in self.buckets[t].get(key, []):
                seen[item_id] = vector
        # Verify: rank candidates by true dot product with the query.
        scored = []
        for item_id, vector in seen.items():
            score = self._dot(query, vector)
            scored.append((score, item_id))
        scored.sort(reverse=True)
        return [item_id for _, item_id in scored[:limit]]
```

The same index in Go. Here Search returns the unranked candidate set, leaving verification and ranking to the caller:

```go
type Item struct {
	ID     string
	Vector []float64
}

type LSH struct {
	Buckets []map[uint64][]Item // one bucket map per table
	Planes  [][][]float64       // Planes[table][bit] is one hyperplane
}

func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// hash packs the sign of each projection into a bucket key.
func (l *LSH) hash(table int, v []float64) uint64 {
	var key uint64
	for i, plane := range l.Planes[table] {
		if dot(v, plane) >= 0 {
			key |= uint64(1) << uint(i)
		}
	}
	return key
}

// Search returns the unranked candidate set for q; callers rerank
// candidates with the true distance function.
func (l *LSH) Search(q []float64) []Item {
	seen := map[string]Item{}
	for t := range l.Buckets {
		key := l.hash(t, q)
		for _, item := range l.Buckets[t][key] {
			seen[item.ID] = item
		}
	}
	out := make([]Item, 0, len(seen))
	for _, item := range seen {
		out = append(out, item)
	}
	return out
}
```