Locality Sensitive Hashing Search

Search for approximate nearest neighbors by hashing similar items into the same buckets with high probability.

Locality sensitive hashing, often abbreviated as LSH, is a search technique for approximate nearest neighbor queries. It uses hash functions that are designed so that similar objects are more likely to collide than dissimilar objects.

You use it when exact nearest neighbor search is too expensive, especially in high dimensional spaces.

Problem

Given a dataset D and a query object q, find items in D that are likely to be close to q under a similarity or distance function.

Common examples include:

  • cosine similarity for vectors
  • Jaccard similarity for sets
  • Hamming distance for bit strings
  • Euclidean distance for points

Model

An LSH family is a family of hash functions where close objects collide with higher probability than far objects.

For a hash function h from the family:

\Pr[h(x) = h(y)]

is larger when x and y are similar.

To improve recall, LSH usually uses multiple hash tables. A query hashes into each table, retrieves candidates from matching buckets, then verifies candidates using the real distance function.
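To make the collision property concrete, here is a small sketch using one well-known LSH family for cosine similarity: the sign of a dot product with a random Gaussian hyperplane. The vectors and trial count are illustrative; the estimate approaches 1 - angle/pi for this family.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def collision_rate(x, y, trials=5000, seed=0):
    """Estimate Pr[h(x) = h(y)] for h(v) = sign(dot(v, plane)), plane ~ Gaussian."""
    rng = random.Random(seed)
    dim = len(x)
    hits = 0
    for _ in range(trials):
        plane = [rng.gauss(0, 1) for _ in range(dim)]
        if (dot(x, plane) >= 0) == (dot(y, plane) >= 0):
            hits += 1
    return hits / trials

close = collision_rate([1.0, 0.0], [0.9, 0.1])   # small angle between vectors
far = collision_rate([1.0, 0.0], [-1.0, 0.2])    # nearly opposite vectors
# close is near 1 - angle/pi (about 0.96 here); far is near 0.06
```

A similar pair collides on a single hash bit far more often than a dissimilar pair, which is exactly the property the multi-table scheme amplifies.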

Algorithm

lsh_search(tables, q, limit):
    candidates = empty set

    for each table in tables:
        key = table.hash(q)
        for each item in table.bucket[key]:
            add item to candidates

    score each candidate by true distance to q
    return best limit candidates
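The pseudocode above can be rendered as a short self-contained Python sketch. The table layout (a list of `(hash_fn, buckets)` pairs) and the toy sign-pattern hash are illustrative assumptions, not part of the algorithm itself:

```python
def lsh_search(tables, q, limit, distance):
    # tables: list of (hash_fn, buckets); buckets maps key -> [(item_id, vector)]
    candidates = {}
    for hash_fn, buckets in tables:
        key = hash_fn(q)                      # hash the query into this table
        for item_id, vec in buckets.get(key, []):
            candidates[item_id] = vec         # dedupe across tables
    # verify candidates with the true distance and return the best `limit`
    ranked = sorted(candidates.items(), key=lambda kv: distance(q, kv[1]))
    return [item_id for item_id, _ in ranked[:limit]]

# Toy setup: a single table whose hash is the sign pattern of the vector.
def sign_key(v):
    return tuple(x >= 0 for x in v)

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

buckets = {}
for item_id, vec in [("a", [1.0, 2.0]), ("b", [1.5, 1.5]), ("c", [-1.0, 2.0])]:
    buckets.setdefault(sign_key(vec), []).append((item_id, vec))

result = lsh_search([(sign_key, buckets)], [1.0, 1.8], limit=2, distance=euclidean)
# "c" falls in a different bucket, so only "a" and "b" are verified
```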

Example

Suppose we search for near duplicate documents using MinHash based LSH.

A document is represented as a set of shingles:

S_q = \{a, b, c, d, e\}

Each LSH table hashes one band of the query's MinHash signature. Documents whose corresponding band matches become candidates.

table  query bucket  candidates
1      18            doc2, doc9
2      44            doc9
3      7             doc1, doc2

The candidate set is:

\{\text{doc1}, \text{doc2}, \text{doc9}\}

Only these candidates are compared with the true similarity measure.
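A minimal sketch of the signature-and-bands step, assuming 12 MinHash functions split into 4 bands of 3 rows; the linear hash parameters and band sizes here are illustrative choices, not fixed by the technique:

```python
import random

def minhash_signature(shingles, num_hashes=12, seed=0):
    """One minimum per hash function over the shingle set."""
    rng = random.Random(seed)
    prime = 2**31 - 1
    # Illustrative linear hash functions; built-in hash() is stable within a run.
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    return [min((a * hash(s) + b) % prime for s in shingles) for a, b in params]

def bands(signature, rows_per_band=3):
    """Each band of the signature becomes a key into one LSH table."""
    return [tuple(signature[i:i + rows_per_band])
            for i in range(0, len(signature), rows_per_band)]

S_q = {"a", "b", "c", "d", "e"}
S_near = {"a", "b", "c", "d", "f"}   # Jaccard similarity 4/6
matching = sum(bq == bn for bq, bn in
               zip(bands(minhash_signature(S_q)), bands(minhash_signature(S_near))))
# S_near becomes a candidate for S_q if at least one band matches exactly.
```

Each MinHash value agrees with probability equal to the Jaccard similarity, so a whole band of 3 rows matches with probability (4/6)^3, and more tables (bands) raise the chance of at least one match.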

Correctness

LSH is an approximate algorithm. It does not guarantee that the exact nearest neighbor will be returned.

Its correctness is probabilistic: similar items have a high probability of colliding in at least one table, so they are likely to become candidates. Dissimilar items have a lower probability of collision, so the candidate set is usually much smaller than the full dataset.

The final verification step ranks candidates using the actual distance or similarity function.

Complexity

Let:

  • L be the number of hash tables
  • C be the number of retrieved candidates
  • d be the cost of computing true distance

operation               time
query hashing           O(L)
candidate verification  O(Cd)
total query             O(L + Cd)

The practical goal is to keep C \ll n, where n is the number of items in the dataset.

Space

O(Ln)

Each item is stored once per hash table, plus metadata for buckets and signatures.

When to Use

Locality sensitive hashing is appropriate when:

  • approximate search is acceptable
  • exact nearest neighbor search is too slow
  • data is high dimensional
  • candidate generation is needed before exact reranking
  • recall can be improved by using more tables

It is common in image search, near duplicate detection, recommendation systems, vector search, document retrieval, and clustering pipelines.

Implementation

import random
from collections import defaultdict

class RandomHyperplaneLSH:
    def __init__(self, dim, tables=8, bits=8):
        self.dim = dim
        self.tables = tables
        self.bits = bits
        self.planes = [
            [[random.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
            for _ in range(tables)
        ]
        self.buckets = [defaultdict(list) for _ in range(tables)]

    def _dot(self, a, b):
        return sum(x * y for x, y in zip(a, b))

    def _hash(self, table_id, vector):
        # Each bit records which side of one random hyperplane the vector lies on.
        key = 0
        for i, plane in enumerate(self.planes[table_id]):
            if self._dot(vector, plane) >= 0:
                key |= 1 << i
        return key

    def add(self, item_id, vector):
        for t in range(self.tables):
            key = self._hash(t, vector)
            self.buckets[t][key].append((item_id, vector))

    def search(self, query, limit=10):
        seen = {}
        for t in range(self.tables):
            key = self._hash(t, query)
            for item_id, vector in self.buckets[t].get(key, []):
                seen[item_id] = vector

        # Verification: rank candidates by dot product with the query
        # (proportional to cosine similarity when vectors are normalized).
        scored = []
        for item_id, vector in seen.items():
            score = self._dot(query, vector)
            scored.append((score, item_id))

        scored.sort(reverse=True)
        return [item_id for _, item_id in scored[:limit]]
The same structure in Go:

type Item struct {
	ID     string
	Vector []float64
}

type LSH struct {
	Buckets []map[uint64][]Item
	Planes  [][][]float64
}

func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

func (l *LSH) hash(table int, v []float64) uint64 {
	var key uint64
	for i, plane := range l.Planes[table] {
		if dot(v, plane) >= 0 {
			key |= uint64(1) << uint(i)
		}
	}
	return key
}

// Search collects the deduplicated candidate set; callers rerank
// the result with the true distance or similarity function.
func (l *LSH) Search(q []float64) []Item {
	seen := map[string]Item{}

	for t := range l.Buckets {
		key := l.hash(t, q)
		for _, item := range l.Buckets[t][key] {
			seen[item.ID] = item
		}
	}

	out := make([]Item, 0, len(seen))
	for _, item := range seen {
		out = append(out, item)
	}

	return out
}