16.4 Rabin-Karp
Suppose you need to search for a pattern inside a large text.
16.4 Rabin-Karp
Problem
Suppose you need to search for a pattern inside a large text.
Text: THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG
Pattern: BROWN
A straightforward algorithm compares characters one by one.
Even efficient algorithms such as KMP and the Z Algorithm ultimately rely on direct character comparisons.
Now consider a different situation:
Text length = 10,000,000
Patterns = thousands
Searches/day = millions
Repeated character-by-character comparison becomes expensive.
The challenge is to compare substrings quickly without examining every character each time.
Solution
Rabin-Karp transforms strings into numeric fingerprints called hashes.
Instead of comparing:
BROWN
BROWN
character by character, we compare:
Hash(BROWN)
Hash(BROWN)
If the hashes differ, the strings definitely differ.
If the hashes match, the strings are probably identical and can be verified if necessary.
The key innovation is the rolling hash.
When moving from one substring to the next, the new hash is computed in constant time without recomputing the entire substring.
This reduces pattern matching from repeated string comparisons to arithmetic operations.
Polynomial Rolling Hash
Consider a string:
ABC
Assign numeric values:
A = 1
B = 2
C = 3
Choose:
Base = 31
Hash:
1×31² + 2×31¹ + 3×31⁰
Result:
961 + 62 + 3
= 1026
Different strings usually produce different values.
For practical implementations we compute modulo a large number.
General formula:
Hash(s) =
Σ(s[i] × pⁱ) mod M
where:
| Symbol | Meaning |
|---|---|
| p | Base |
| M | Large modulus |
Rolling Hash
Suppose:
Text = ABCDE
Pattern length = 3
Windows:
ABC
BCD
CDE
Naively computing each hash requires:
O(m)
work per window.
Instead, Rabin-Karp updates the hash incrementally.
Example:
ABC → BCD
Remove:
A
Shift remaining characters.
Add:
D
The new hash is obtained using arithmetic rather than recomputing the entire substring.
This is the rolling hash principle.
Visualizing the Slide
Current window:
A B C
^
remove
Next window:
B C D
^
add
Only two characters affect the update:
- Character leaving.
- Character entering.
Everything else is reused.
Search Algorithm
Step 1
Compute:
Hash(pattern)
Step 2
Compute hash of first text window.
Step 3
Compare hashes.
If:
hash(window) ≠ hash(pattern)
then:
No match
Move immediately to the next window.
Step 4
If hashes match:
hash(window) = hash(pattern)
perform character verification.
Step 5
Roll to the next window.
Repeat until the text ends.
Example
Text:
ABCCDABCDABCD
Pattern:
ABCD
Hashes:
ABCC → different
BCCD → different
CCDA → different
CDAB → different
DABC → different
ABCD → same
Only the matching hash triggers character verification.
Most windows are discarded immediately.
Lean Implementation
First, a simple polynomial hash:
def hashString
(s : String)
(base : Nat := 31)
: Nat :=
s.foldl
(fun acc ch =>
acc * base + ch.toNat)
0
A practical implementation uses modular arithmetic.
def modHash
(s : String)
(base modv : Nat)
: Nat :=
s.foldl
(fun acc ch =>
((acc * base) + ch.toNat) % modv)
0
The rolling version updates hashes incrementally.
def rollHash
(current : Nat)
(outgoing : Nat)
(incoming : Nat)
(power : Nat)
(base modv : Nat)
: Nat :=
(((current + modv
- outgoing * power % modv)
* base)
+ incoming) % modv
This update requires constant time.
Why Rolling Hash Works
Suppose:
Window = ABC
Hash:
A×p² + B×p + C
Move right:
BCD
Subtract:
A×p²
Multiply remaining value by:
p
Add:
D
The resulting value is exactly the hash of:
BCD
No character-by-character recomputation is necessary.
Hash Collisions
Different strings can produce identical hashes.
Example:
Hash(XYZ) = 12345
Hash(ABC) = 12345
This event is called a collision.
A collision can cause a false match.
Therefore Rabin-Karp usually performs final verification.
Hashes equal
↓
Compare characters
↓
Confirm match
Correctness remains guaranteed.
Double Hashing
Collision probability can be reduced dramatically.
Compute:
Hash1 using modulus M1
Hash2 using modulus M2
A match requires:
Hash1 equal
AND
Hash2 equal
Example:
| Hash Function | Value |
|---|---|
| Hash₁ | 8347821 |
| Hash₂ | 1927443 |
Both values must match.
Double hashing is common in competitive programming and large-scale text systems.
Multiple Pattern Matching
One advantage of Rabin-Karp is support for many patterns.
Suppose:
Patterns:
DOG
CAT
FOX
OWL
Store all pattern hashes:
Hash(DOG)
Hash(CAT)
Hash(FOX)
Hash(OWL)
Each text window computes one hash.
Lookup becomes:
Hash(window)
↓
Hash table lookup
This capability makes Rabin-Karp attractive when many patterns share the same length.
Applications
Plagiarism Detection
Compare document fingerprints rather than entire documents.
Duplicate File Detection
Compare chunk hashes.
DNA Matching
Search genomic sequences efficiently.
Malware Detection
Compare signature hashes.
Search Engines
Index and compare text fragments.
Distributed Storage
Detect repeated blocks using fingerprints.
Correctness
The algorithm maintains the hash of the current window exactly.
Every text window receives a hash value.
Whenever a window hash differs from the pattern hash, the strings cannot match.
Whenever hashes match, character verification confirms equality.
Therefore:
- No valid occurrence is missed.
- No false occurrence survives verification.
The algorithm returns exactly the correct match positions.
Complexity Analysis
Let:
| Symbol | Meaning |
|---|---|
| n | Text length |
| m | Pattern length |
Expected Case
| Operation | Complexity |
|---|---|
| Initial hashing | O(m) |
| Rolling updates | O(n) |
| Verification | Rare |
| Total | O(n + m) |
Worst Case
If every window collides:
O(nm)
This situation is extremely unlikely with good hash functions and large moduli.
Memory
| Resource | Complexity |
|---|---|
| Hash state | O(1) |
KMP vs Z vs Rabin-Karp
| Feature | KMP | Z | Rabin-Karp |
|---|---|---|---|
| Worst-case guarantee | O(n+m) | O(n+m) | O(nm) |
| Expected performance | Excellent | Excellent | Excellent |
| Rolling updates | No | No | Yes |
| Multiple patterns | Moderate | Moderate | Excellent |
| Hash-based | No | No | Yes |
| Collision risk | None | None | Yes |
KMP and Z rely on structural properties of strings.
Rabin-Karp relies on probabilistic fingerprints and arithmetic.
This shift from direct comparison to hashing introduces a powerful idea that appears throughout computer science: representing large objects with compact fingerprints that can be compared and manipulated efficiently. Subsequent algorithms for document similarity, distributed storage, databases, caching systems, and large-scale search all build upon this fundamental concept.