19.13 Locality-Sensitive Hashing
Many applications need to find similar items inside very large collections.
19.13 Locality-Sensitive Hashing
Many applications need to find similar items inside very large collections. Search engines look for near-duplicate documents. Recommendation systems search for users with similar preferences. Image retrieval systems search for visually similar images. Fraud detection systems search for related transactions.
The obvious solution is to compare every item against every other item. Unfortunately, this approach scales poorly. A collection containing one million objects requires roughly:
500 billion
pairwise comparisons.
Even when a compact similarity measure such as MinHash exists, comparing every pair quickly becomes impractical.
Locality-Sensitive Hashing (LSH) addresses this problem. Instead of comparing all objects, it uses randomized hashing to place similar objects into the same buckets with high probability. Objects that land in the same bucket become candidates for detailed comparison. Objects in different buckets are ignored.
The result is a dramatic reduction in computational cost.
Problem
Suppose a collection contains:
10,000,000 documents
Each document has a MinHash signature:
[12, 55, 81, 4, ...]
To find near-duplicates, we could compare every pair:
O(n²)
comparisons.
For:
n = 10,000,000
this is impossible.
Can we identify likely matches without examining every pair?
Core Idea
A standard hash table tries to separate keys.
Locality-sensitive hashing does the opposite.
It tries to place similar items into the same bucket.
The goal is:
Similar items
→
High probability of collision
and:
Dissimilar items
→
Low probability of collision
This is fundamentally different from ordinary hashing.
Candidate Generation
Suppose we have four documents.
D1
D2
D3
D4
True similarities:
| Pair | Similarity |
|---|---|
| D1,D2 | 0.95 |
| D1,D3 | 0.20 |
| D1,D4 | 0.15 |
| D2,D3 | 0.18 |
| D2,D4 | 0.12 |
| D3,D4 | 0.85 |
Instead of comparing all six pairs, LSH attempts to generate:
Candidate pairs:
(D1,D2)
(D3,D4)
Only those candidates receive expensive similarity calculations.
MinHash Signatures
Suppose a MinHash signature contains:
100 values
Example:
[5, 18, 42, 7, 9, ...]
Two similar documents may agree on:
85 of 100 positions
Two unrelated documents may agree on:
10 of 100 positions
LSH exploits this difference.
Bands and Rows
Divide the signature into groups.
Example:
100 values
split into:
20 bands
of:
5 rows
each.
Visualization:
Band 1:
5 18 42 7 9
Band 2:
11 6 22 1 17
Band 3:
...
Each band becomes a separate hash key.
Hashing a Band
Consider:
Band:
5 18 42 7 9
Hash the entire band:
def hash_band(values):
return hash(tuple(values))
Result:
Bucket 8712
Documents with identical band values land in the same bucket.
Candidate Rule
If two documents match in at least one band:
Candidate pair
Otherwise:
Ignore pair
This simple rule eliminates the overwhelming majority of comparisons.
Example
Document A:
[1,2,3,4,5]
[6,7,8,9,10]
[11,12,13,14,15]
Document B:
[1,2,3,4,5]
[16,17,18,19,20]
[21,22,23,24,25]
Band 1 matches exactly.
Therefore:
A and B
become candidates.
A detailed similarity calculation follows.
Why It Works
Suppose similarity is:
s
For a band containing:
r
rows:
Probability that all rows match:
s^r
If there are:
b
bands:
Probability that at least one band matches:
1 - (1 - s^r)^b
This is the fundamental LSH probability curve.
Understanding the Probability Curve
Example parameters:
20 bands
5 rows per band
Similarity:
s = 0.9
Band match probability:
0.9^5
≈ 0.59
Candidate probability:
1 - (1 - 0.59)^20
≈ 0.99999998
Nearly certain.
Now consider:
s = 0.2
Band match probability:
0.2^5
=
0.00032
Candidate probability:
1 - (1 - 0.00032)^20
≈ 0.006
Very unlikely.
This separation is exactly what we want.
Building an LSH Index
A practical implementation stores bucket memberships.
from collections import defaultdict
class LSHIndex:
def __init__(
self,
bands: int,
rows_per_band: int,
):
self.bands = bands
self.rows_per_band = rows_per_band
self.tables = [
defaultdict(list)
for _ in range(bands)
]
Each band has its own hash table.
Inserting a Signature
def insert(
self,
doc_id,
signature,
):
for band in range(self.bands):
start = (
band
* self.rows_per_band
)
end = (
start
+ self.rows_per_band
)
key = tuple(
signature[start:end]
)
self.tables[band][key].append(
doc_id
)
Documents sharing a band enter the same bucket.
Candidate Retrieval
def candidates(
self,
signature,
):
result = set()
for band in range(self.bands):
start = (
band
* self.rows_per_band
)
end = (
start
+ self.rows_per_band
)
key = tuple(
signature[start:end]
)
result.update(
self.tables[band][key]
)
return result
Only returned candidates require expensive similarity calculations.
Document Deduplication
One of the classic uses of LSH is duplicate detection.
Consider web pages:
Page A:
"The quick brown fox jumps"
Page B:
"The quick brown fox jumps over"
Page C:
"Financial report Q4"
Pages A and B are nearly identical.
Page C is unrelated.
MinHash plus LSH quickly identifies:
(A, B)
as a candidate pair.
No exhaustive search is required.
Recommendation Systems
Represent users as sets.
User A:
{movie1,movie2,movie3}
User B:
{movie1,movie2,movie4}
User C:
{movie9,movie10}
LSH groups similar users.
Collaborative filtering can then focus on nearby candidates.
This dramatically reduces search cost.
Image Search
Images can be represented by feature vectors.
Examples:
- Color histograms
- Deep-learning embeddings
- Texture descriptors
LSH helps locate visually similar images.
A user uploads an image.
Instead of searching millions of images directly:
Generate candidates
then:
Compute exact similarity
only for those candidates.
Nearest Neighbor Search
High-dimensional nearest-neighbor search is notoriously difficult.
Exact algorithms often degrade as dimensionality increases.
LSH provides an approximate solution:
Fast
Scalable
Probabilistic
Approximate nearest-neighbor systems frequently use LSH variants.
Trade-Offs
Increasing:
Number of bands
creates:
More candidates
Higher recall
Increasing:
Rows per band
creates:
Fewer candidates
Higher precision
Parameter tuning depends on application requirements.
False Positives
Two unrelated documents may match in one band.
They become candidates unnecessarily.
This increases work.
The final exact similarity calculation removes them.
False positives affect efficiency, not correctness.
False Negatives
Two similar documents may fail to match in every band.
The pair is missed.
This affects recall.
The probability depends on chosen parameters.
LSH systems balance:
Precision
Recall
Cost
through parameter selection.
Complexity
Suppose:
n = documents
b = bands
Insertion:
O(b)
Candidate lookup:
O(b + candidates)
instead of:
O(n)
or:
O(n²)
for exhaustive comparison.
This scalability is the primary reason LSH exists.
Comparison with Exact Search
| Property | Exact Search | LSH |
|---|---|---|
| Accuracy | Exact | Approximate |
| Candidate Generation | None | Probabilistic |
| Scalability | Poor | Excellent |
| False Positives | No | Yes |
| False Negatives | No | Possible |
| Memory | Moderate | Additional index |
LSH sacrifices completeness for speed.
Common Mistakes
Skipping Exact Verification
LSH generates candidates.
It does not prove similarity.
Always perform a final similarity calculation.
Poor Parameter Selection
Too many bands:
Too many candidates
Too many rows:
Too many missed matches
Using LSH Without a Similarity Model
LSH requires a meaningful similarity measure.
The indexing strategy must match the data representation.
Expecting Exact Results
LSH is an approximate retrieval method.
Its value comes from scalability, not perfection.
Discussion
Locality-Sensitive Hashing is one of the most influential techniques in large-scale similarity search. It transforms a prohibitively expensive all-pairs comparison problem into a manageable candidate-generation problem. By deliberately using hash functions that preserve similarity rather than destroy it, LSH allows systems to focus computation where it is most likely to matter.
Combined with MinHash, LSH powers large-scale document deduplication, recommendation systems, image retrieval, clustering pipelines, and approximate nearest-neighbor search. The technique illustrates a recurring theme in randomized algorithms: a carefully designed probabilistic shortcut can reduce computational costs by orders of magnitude while preserving most of the useful information in the original problem.
The next section examines approximation algorithms and approximation ratios, shifting from probabilistic data structures toward optimization problems where exact solutions are computationally infeasible.