19.13 Locality-Sensitive Hashing

Many applications need to find similar items inside very large collections.

19.13 Locality-Sensitive Hashing

Many applications need to find similar items inside very large collections. Search engines look for near-duplicate documents. Recommendation systems search for users with similar preferences. Image retrieval systems search for visually similar images. Fraud detection systems search for related transactions.

The obvious solution is to compare every item against every other item. Unfortunately, this approach scales poorly. A collection containing one million objects requires roughly:

500 billion

pairwise comparisons.

Even when a compact similarity measure such as MinHash exists, comparing every pair quickly becomes impractical.

Locality-Sensitive Hashing (LSH) addresses this problem. Instead of comparing all objects, it uses randomized hashing to place similar objects into the same buckets with high probability. Objects that land in the same bucket become candidates for detailed comparison. Objects in different buckets are ignored.

The result is a dramatic reduction in computational cost.

Problem

Suppose a collection contains:

10,000,000 documents

Each document has a MinHash signature:

[12, 55, 81, 4, ...]

To find near-duplicates, we could compare every pair:

O(n²)

comparisons.

For:

n = 10,000,000

this is impossible.

Can we identify likely matches without examining every pair?

Core Idea

A standard hash table tries to separate keys.

Locality-sensitive hashing does the opposite.

It tries to place similar items into the same bucket.

The goal is:

Similar items
→
High probability of collision

and:

Dissimilar items
→
Low probability of collision

This is fundamentally different from ordinary hashing.

Candidate Generation

Suppose we have four documents.

D1
D2
D3
D4

True similarities:

Pair Similarity
D1,D2 0.95
D1,D3 0.20
D1,D4 0.15
D2,D3 0.18
D2,D4 0.12
D3,D4 0.85

Instead of comparing all six pairs, LSH attempts to generate:

Candidate pairs:
(D1,D2)
(D3,D4)

Only those candidates receive expensive similarity calculations.

MinHash Signatures

Suppose a MinHash signature contains:

100 values

Example:

[5, 18, 42, 7, 9, ...]

Two similar documents may agree on:

85 of 100 positions

Two unrelated documents may agree on:

10 of 100 positions

LSH exploits this difference.

Bands and Rows

Divide the signature into groups.

Example:

100 values

split into:

20 bands

of:

5 rows

each.

Visualization:

Band 1:
5 18 42 7 9

Band 2:
11 6 22 1 17

Band 3:
...

Each band becomes a separate hash key.

Hashing a Band

Consider:

Band:
5 18 42 7 9

Hash the entire band:

def hash_band(values):
    return hash(tuple(values))

Result:

Bucket 8712

Documents with identical band values land in the same bucket.

Candidate Rule

If two documents match in at least one band:

Candidate pair

Otherwise:

Ignore pair

This simple rule eliminates the overwhelming majority of comparisons.

Example

Document A:

[1,2,3,4,5]
[6,7,8,9,10]
[11,12,13,14,15]

Document B:

[1,2,3,4,5]
[16,17,18,19,20]
[21,22,23,24,25]

Band 1 matches exactly.

Therefore:

A and B

become candidates.

A detailed similarity calculation follows.

Why It Works

Suppose similarity is:

s

For a band containing:

r

rows:

Probability that all rows match:

s^r

If there are:

b

bands:

Probability that at least one band matches:

1 - (1 - s^r)^b

This is the fundamental LSH probability curve.

Understanding the Probability Curve

Example parameters:

20 bands
5 rows per band

Similarity:

s = 0.9

Band match probability:

0.9^5
≈ 0.59

Candidate probability:

1 - (1 - 0.59)^20
≈ 0.99999998

Nearly certain.

Now consider:

s = 0.2

Band match probability:

0.2^5
=
0.00032

Candidate probability:

1 - (1 - 0.00032)^20
≈ 0.006

Very unlikely.

This separation is exactly what we want.

Building an LSH Index

A practical implementation stores bucket memberships.

from collections import defaultdict

class LSHIndex:
    def __init__(
        self,
        bands: int,
        rows_per_band: int,
    ):
        self.bands = bands
        self.rows_per_band = rows_per_band

        self.tables = [
            defaultdict(list)
            for _ in range(bands)
        ]

Each band has its own hash table.

Inserting a Signature

def insert(
    self,
    doc_id,
    signature,
):
    for band in range(self.bands):

        start = (
            band
            * self.rows_per_band
        )

        end = (
            start
            + self.rows_per_band
        )

        key = tuple(
            signature[start:end]
        )

        self.tables[band][key].append(
            doc_id
        )

Documents sharing a band enter the same bucket.

Candidate Retrieval

def candidates(
    self,
    signature,
):
    result = set()

    for band in range(self.bands):

        start = (
            band
            * self.rows_per_band
        )

        end = (
            start
            + self.rows_per_band
        )

        key = tuple(
            signature[start:end]
        )

        result.update(
            self.tables[band][key]
        )

    return result

Only returned candidates require expensive similarity calculations.

Document Deduplication

One of the classic uses of LSH is duplicate detection.

Consider web pages:

Page A:
"The quick brown fox jumps"

Page B:
"The quick brown fox jumps over"

Page C:
"Financial report Q4"

Pages A and B are nearly identical.

Page C is unrelated.

MinHash plus LSH quickly identifies:

(A, B)

as a candidate pair.

No exhaustive search is required.

Recommendation Systems

Represent users as sets.

User A:
{movie1,movie2,movie3}

User B:
{movie1,movie2,movie4}

User C:
{movie9,movie10}

LSH groups similar users.

Collaborative filtering can then focus on nearby candidates.

This dramatically reduces search cost.

Images can be represented by feature vectors.

Examples:

  • Color histograms
  • Deep-learning embeddings
  • Texture descriptors

LSH helps locate visually similar images.

A user uploads an image.

Instead of searching millions of images directly:

Generate candidates

then:

Compute exact similarity

only for those candidates.

High-dimensional nearest-neighbor search is notoriously difficult.

Exact algorithms often degrade as dimensionality increases.

LSH provides an approximate solution:

Fast
Scalable
Probabilistic

Approximate nearest-neighbor systems frequently use LSH variants.

Trade-Offs

Increasing:

Number of bands

creates:

More candidates
Higher recall

Increasing:

Rows per band

creates:

Fewer candidates
Higher precision

Parameter tuning depends on application requirements.

False Positives

Two unrelated documents may match in one band.

They become candidates unnecessarily.

This increases work.

The final exact similarity calculation removes them.

False positives affect efficiency, not correctness.

False Negatives

Two similar documents may fail to match in every band.

The pair is missed.

This affects recall.

The probability depends on chosen parameters.

LSH systems balance:

Precision
Recall
Cost

through parameter selection.

Complexity

Suppose:

n = documents
b = bands

Insertion:

O(b)

Candidate lookup:

O(b + candidates)

instead of:

O(n)

or:

O(n²)

for exhaustive comparison.

This scalability is the primary reason LSH exists.

Property Exact Search LSH
Accuracy Exact Approximate
Candidate Generation None Probabilistic
Scalability Poor Excellent
False Positives No Yes
False Negatives No Possible
Memory Moderate Additional index

LSH sacrifices completeness for speed.

Common Mistakes

Skipping Exact Verification

LSH generates candidates.

It does not prove similarity.

Always perform a final similarity calculation.

Poor Parameter Selection

Too many bands:

Too many candidates

Too many rows:

Too many missed matches

Using LSH Without a Similarity Model

LSH requires a meaningful similarity measure.

The indexing strategy must match the data representation.

Expecting Exact Results

LSH is an approximate retrieval method.

Its value comes from scalability, not perfection.

Discussion

Locality-Sensitive Hashing is one of the most influential techniques in large-scale similarity search. It transforms a prohibitively expensive all-pairs comparison problem into a manageable candidate-generation problem. By deliberately using hash functions that preserve similarity rather than destroy it, LSH allows systems to focus computation where it is most likely to matter.

Combined with MinHash, LSH powers large-scale document deduplication, recommendation systems, image retrieval, clustering pipelines, and approximate nearest-neighbor search. The technique illustrates a recurring theme in randomized algorithms: a carefully designed probabilistic shortcut can reduce computational costs by orders of magnitude while preserving most of the useful information in the original problem.

The next section examines approximation algorithms and approximation ratios, shifting from probabilistic data structures toward optimization problems where exact solutions are computationally infeasible.