16.16 String Hashing
Many string algorithms repeatedly compare substrings.
16.16 String Hashing
Problem
Many string algorithms repeatedly compare substrings.
Consider:
Text = abracadabra
Suppose you want to determine whether:
abra
appears multiple times.
A direct comparison examines characters one by one.
abra
abra
This is inexpensive for a single query.
Now imagine:
- Millions of substring comparisons
- Large documents
- Plagiarism detection
- Search indexing
- DNA analysis
- Duplicate detection
Repeated character-by-character comparison becomes a bottleneck.
The challenge is to compare substrings quickly while preserving correctness.
Solution
String hashing converts a string into a numeric fingerprint.
Instead of comparing:
substring A
substring B
character by character, compare:
hash(A)
hash(B)
If the hashes differ, the strings must differ.
If the hashes match, the strings are probably equal.
The usefulness of hashing comes from two properties:
- Fast comparison
- Fast substring extraction
With proper preprocessing, the hash of any substring can be computed in constant time.
Basic Idea
Assign each character a numeric value.
Example:
a = 1
b = 2
c = 3
...
Choose a base:
p = 31
Compute:
abc
=
1×31² + 2×31¹ + 3×31⁰
Result:
961 + 62 + 3
=
1026
This number becomes the string fingerprint.
Different strings usually produce different fingerprints.
Polynomial Hash
The most common approach is the polynomial rolling hash.
For a string:
s[0..n-1]
the hash is:
Hash(s)
=
Σ(s[i] × pⁱ)
Often calculations are performed modulo a large prime:
Hash(s)
=
Σ(s[i] × pⁱ) mod M
where:
| Symbol | Meaning |
|---|---|
| p | Base |
| M | Large modulus |
Typical values:
p = 31
M = 1,000,000,007
or:
p = 911382323
M = 972663749
depending on the implementation strategy.
Prefix Hashes
The real power appears when hashing entire strings.
Suppose:
abracadabra
Prefix hashes:
| Position | Prefix |
|---|---|
| 0 | a |
| 1 | ab |
| 2 | abr |
| 3 | abra |
| ... | ... |
Store:
H[i]
representing the hash of:
s[0..i)
Once these values exist, arbitrary substring hashes become easy to compute.
Substring Hashes
Suppose:
s = abracadabra
and we want:
rac
which occupies:
s[2..5)
Using prefix hashes:
Hash(2,5)
=
H[5] - H[2]
after appropriate normalization.
The important result is:
O(1)
substring hashing.
No characters are examined.
No substring allocation occurs.
Only arithmetic operations are performed.
Why Normalization Matters
Consider:
abc
and:
abc
appearing at different locations.
Their raw polynomial values may be multiplied by different powers of the base.
To compare them correctly, both hashes must be normalized to the same scale.
This is typically achieved using:
power[i]
arrays storing:
pⁱ
modulo the chosen modulus.
Normalization ensures that equal substrings receive equal hash values regardless of position.
Example
String:
banana
Substrings:
ana
at positions:
1..4
and:
3..6
Visually:
banana
^^^
banana
^^^
Hashing allows comparison in constant time:
hash(1,4)
=
hash(3,6)
Therefore:
ana = ana
without examining any characters.
Lean Sketch
Precompute powers.
def buildPowers
(n : Nat)
(base modv : Nat)
: Array Nat :=
by
-- powers[i] = base^i mod modv
sorry
Build prefix hashes.
def buildHashes
(s : String)
(base modv : Nat)
: Array Nat :=
by
-- Prefix hash table.
sorry
Query a substring.
def substringHash
(l r : Nat)
(hashes powers : Array Nat)
(modv : Nat)
: Nat :=
by
-- O(1) substring hash.
sorry
The details vary, but the interface remains remarkably simple.
Double Hashing
A single hash can collide.
Two different strings may produce the same value.
Example:
hash(abc) = 12345
hash(xyz) = 12345
This is called a collision.
To reduce risk, use two independent hashes.
Hash₁
Hash₂
A match requires:
Hash₁ equal
AND
Hash₂ equal
Collision probability becomes extraordinarily small.
Many competitive-programming and production systems use double hashing by default.
Longest Common Prefix Queries
Hashing allows efficient prefix comparison.
Suppose:
suffix(i)
suffix(j)
You want the longest common prefix length.
Binary search on the answer:
mid
Check:
hash(i, i+mid)
=
hash(j, j+mid)
Each check costs:
O(1)
The total query becomes:
O(log n)
This technique appears frequently in suffix-array applications.
Detecting Repeated Substrings
Consider:
banana
To determine whether a length-3 substring repeats:
Compute hashes of:
ban
ana
nan
ana
Store hashes in a set.
When a duplicate hash appears:
ana
has already been observed.
Repeated-substring detection becomes a hash-table problem.
Many algorithms for longest repeated substring use this idea.
Palindrome Testing
Hashing can also test palindromes.
Build:
forward hash
reverse hash
Then compare:
substring
with:
reversed substring
using hashes.
If the values match:
Probably palindrome
This supports efficient palindrome queries after preprocessing.
Applications
Plagiarism Detection
Compare document fragments using fingerprints.
Duplicate Detection
Identify repeated text blocks.
DNA Analysis
Compare genetic sequences rapidly.
Search Engines
Index and compare terms.
Compression
Detect recurring patterns.
Suffix Arrays
Accelerate suffix comparisons.
Competitive Programming
Perform substring equality checks in constant time.
Complexity Analysis
Let:
| Symbol | Meaning |
|---|---|
| n | String length |
Preprocessing:
| Operation | Complexity |
|---|---|
| Power table | O(n) |
| Prefix hashes | O(n) |
Queries:
| Operation | Complexity |
|---|---|
| Substring hash | O(1) |
| Equality check | O(1) |
| Longest common prefix | O(log n) |
Storage:
| Resource | Complexity |
|---|---|
| Powers | O(n) |
| Hashes | O(n) |
The preprocessing cost is linear. Every subsequent substring query becomes constant time.
Correctness
If two hashes differ, the corresponding strings must differ.
If two hashes match, the strings are very likely equal.
Using double hashing reduces collision probability to the point that it is negligible for most applications.
The mathematical guarantee is probabilistic rather than absolute. This distinguishes hashing from suffix arrays, automata, and exact matching algorithms.
Common Pitfalls
Do not forget normalization when comparing substrings at different positions.
Do not assume a single hash is collision-free.
Be careful with modular arithmetic and negative intermediate values.
Use sufficiently large moduli and bases.
Remember that hashing proves inequality exactly but proves equality only probabilistically.
String Hashing vs Other Techniques
| Technique | Equality Check | Exact Guarantee |
|---|---|---|
| Character comparison | O(m) | Yes |
| KMP | O(n+m) search | Yes |
| Suffix Array | O(m log n) query | Yes |
| Suffix Automaton | O(m) query | Yes |
| String Hashing | O(1) comparison | Probabilistic |
Hashing trades a tiny probability of collision for dramatically faster comparisons.
Takeaway
String hashing transforms strings into compact fingerprints that can be compared, indexed, and manipulated efficiently. With prefix hashes and power tables, arbitrary substring comparisons become constant-time operations. This ability to replace expensive character-by-character comparisons with arithmetic computations makes hashing one of the most versatile techniques in modern string processing, appearing throughout search systems, text indexing, bioinformatics, compression, and large-scale data analysis.