16.16 String Hashing

Many string algorithms repeatedly compare substrings.

16.16 String Hashing

Problem

Many string algorithms repeatedly compare substrings.

Consider:

Text = abracadabra

Suppose you want to determine whether:

abra

appears multiple times.

A direct comparison examines characters one by one.

abra
abra

This is inexpensive for a single query.

Now imagine:

  • Millions of substring comparisons
  • Large documents
  • Plagiarism detection
  • Search indexing
  • DNA analysis
  • Duplicate detection

Repeated character-by-character comparison becomes a bottleneck.

The challenge is to compare substrings quickly while preserving correctness.

Solution

String hashing converts a string into a numeric fingerprint.

Instead of comparing:

substring A
substring B

character by character, compare:

hash(A)
hash(B)

If the hashes differ, the strings must differ.

If the hashes match, the strings are probably equal.

The usefulness of hashing comes from two properties:

  1. Fast comparison
  2. Fast substring extraction

With proper preprocessing, the hash of any substring can be computed in constant time.

Basic Idea

Assign each character a numeric value.

Example:

a = 1
b = 2
c = 3
...

Choose a base:

p = 31

Compute:

abc
=
1×31² + 2×31¹ + 3×31⁰

Result:

961 + 62 + 3
=
1026

This number becomes the string fingerprint.

Different strings usually produce different fingerprints.

Polynomial Hash

The most common approach is the polynomial rolling hash.

For a string:

s[0..n-1]

the hash is:

Hash(s)
=
Σ(s[i] × pⁱ)

Often calculations are performed modulo a large prime:

Hash(s)
=
Σ(s[i] × pⁱ) mod M

where:

Symbol Meaning
p Base
M Large modulus

Typical values:

p = 31
M = 1,000,000,007

or:

p = 911382323
M = 972663749

depending on the implementation strategy.

Prefix Hashes

The real power appears when hashing entire strings.

Suppose:

abracadabra

Prefix hashes:

Position Prefix
0 a
1 ab
2 abr
3 abra
... ...

Store:

H[i]

representing the hash of:

s[0..i)

Once these values exist, arbitrary substring hashes become easy to compute.

Substring Hashes

Suppose:

s = abracadabra

and we want:

rac

which occupies:

s[2..5)

Using prefix hashes:

Hash(2,5)
=
H[5] - H[2]

after appropriate normalization.

The important result is:

O(1)

substring hashing.

No characters are examined.

No substring allocation occurs.

Only arithmetic operations are performed.

Why Normalization Matters

Consider:

abc

and:

abc

appearing at different locations.

Their raw polynomial values may be multiplied by different powers of the base.

To compare them correctly, both hashes must be normalized to the same scale.

This is typically achieved using:

power[i]

arrays storing:

pⁱ

modulo the chosen modulus.

Normalization ensures that equal substrings receive equal hash values regardless of position.

Example

String:

banana

Substrings:

ana

at positions:

1..4

and:

3..6

Visually:

banana
 ^^^

banana
   ^^^

Hashing allows comparison in constant time:

hash(1,4)
=
hash(3,6)

Therefore:

ana = ana

without examining any characters.

Lean Sketch

Precompute powers.

def buildPowers
  (n : Nat)
  (base modv : Nat)
  : Array Nat :=
by
  -- powers[i] = base^i mod modv
  sorry

Build prefix hashes.

def buildHashes
  (s : String)
  (base modv : Nat)
  : Array Nat :=
by
  -- Prefix hash table.
  sorry

Query a substring.

def substringHash
  (l r : Nat)
  (hashes powers : Array Nat)
  (modv : Nat)
  : Nat :=
by
  -- O(1) substring hash.
  sorry

The details vary, but the interface remains remarkably simple.

Double Hashing

A single hash can collide.

Two different strings may produce the same value.

Example:

hash(abc) = 12345
hash(xyz) = 12345

This is called a collision.

To reduce risk, use two independent hashes.

Hash₁
Hash₂

A match requires:

Hash₁ equal
AND
Hash₂ equal

Collision probability becomes extraordinarily small.

Many competitive-programming and production systems use double hashing by default.

Longest Common Prefix Queries

Hashing allows efficient prefix comparison.

Suppose:

suffix(i)
suffix(j)

You want the longest common prefix length.

Binary search on the answer:

mid

Check:

hash(i, i+mid)
=
hash(j, j+mid)

Each check costs:

O(1)

The total query becomes:

O(log n)

This technique appears frequently in suffix-array applications.

Detecting Repeated Substrings

Consider:

banana

To determine whether a length-3 substring repeats:

Compute hashes of:

ban
ana
nan
ana

Store hashes in a set.

When a duplicate hash appears:

ana

has already been observed.

Repeated-substring detection becomes a hash-table problem.

Many algorithms for longest repeated substring use this idea.

Palindrome Testing

Hashing can also test palindromes.

Build:

forward hash
reverse hash

Then compare:

substring

with:

reversed substring

using hashes.

If the values match:

Probably palindrome

This supports efficient palindrome queries after preprocessing.

Applications

Plagiarism Detection

Compare document fragments using fingerprints.

Duplicate Detection

Identify repeated text blocks.

DNA Analysis

Compare genetic sequences rapidly.

Search Engines

Index and compare terms.

Compression

Detect recurring patterns.

Suffix Arrays

Accelerate suffix comparisons.

Competitive Programming

Perform substring equality checks in constant time.

Complexity Analysis

Let:

Symbol Meaning
n String length

Preprocessing:

Operation Complexity
Power table O(n)
Prefix hashes O(n)

Queries:

Operation Complexity
Substring hash O(1)
Equality check O(1)
Longest common prefix O(log n)

Storage:

Resource Complexity
Powers O(n)
Hashes O(n)

The preprocessing cost is linear. Every subsequent substring query becomes constant time.

Correctness

If two hashes differ, the corresponding strings must differ.

If two hashes match, the strings are very likely equal.

Using double hashing reduces collision probability to the point that it is negligible for most applications.

The mathematical guarantee is probabilistic rather than absolute. This distinguishes hashing from suffix arrays, automata, and exact matching algorithms.

Common Pitfalls

Do not forget normalization when comparing substrings at different positions.

Do not assume a single hash is collision-free.

Be careful with modular arithmetic and negative intermediate values.

Use sufficiently large moduli and bases.

Remember that hashing proves inequality exactly but proves equality only probabilistically.

String Hashing vs Other Techniques

Technique Equality Check Exact Guarantee
Character comparison O(m) Yes
KMP O(n+m) search Yes
Suffix Array O(m log n) query Yes
Suffix Automaton O(m) query Yes
String Hashing O(1) comparison Probabilistic

Hashing trades a tiny probability of collision for dramatically faster comparisons.

Takeaway

String hashing transforms strings into compact fingerprints that can be compared, indexed, and manipulated efficiently. With prefix hashes and power tables, arbitrary substring comparisons become constant-time operations. This ability to replace expensive character-by-character comparisons with arithmetic computations makes hashing one of the most versatile techniques in modern string processing, appearing throughout search systems, text indexing, bioinformatics, compression, and large-scale data analysis.