16.4 Rabin-Karp

Suppose you need to search for a pattern inside a large text.

16.4 Rabin-Karp

Problem

Suppose you need to search for a pattern inside a large text.

Text:    THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG
Pattern: BROWN

A straightforward algorithm compares characters one by one.

Even efficient algorithms such as KMP and the Z Algorithm ultimately rely on direct character comparisons.

Now consider a different situation:

Text length     = 10,000,000
Patterns        = thousands
Searches/day    = millions

Repeated character-by-character comparison becomes expensive.

The challenge is to compare substrings quickly without examining every character each time.

Solution

Rabin-Karp transforms strings into numeric fingerprints called hashes.

Instead of comparing:

BROWN
BROWN

character by character, we compare:

Hash(BROWN)
Hash(BROWN)

If the hashes differ, the strings definitely differ.

If the hashes match, the strings are probably identical and can be verified if necessary.

The key innovation is the rolling hash.

When moving from one substring to the next, the new hash is computed in constant time without recomputing the entire substring.

This reduces pattern matching from repeated string comparisons to arithmetic operations.

Polynomial Rolling Hash

Consider a string:

ABC

Assign numeric values:

A = 1
B = 2
C = 3

Choose:

Base = 31

Hash:

1×31² + 2×31¹ + 3×31⁰

Result:

961 + 62 + 3
= 1026

Different strings usually produce different values.

For practical implementations we compute modulo a large number.

General formula:

Hash(s) =
Σ(s[i] × pⁱ) mod M

where:

Symbol Meaning
p Base
M Large modulus

Rolling Hash

Suppose:

Text = ABCDE
Pattern length = 3

Windows:

ABC
BCD
CDE

Naively computing each hash requires:

O(m)

work per window.

Instead, Rabin-Karp updates the hash incrementally.

Example:

ABC → BCD

Remove:

A

Shift remaining characters.

Add:

D

The new hash is obtained using arithmetic rather than recomputing the entire substring.

This is the rolling hash principle.

Visualizing the Slide

Current window:

A B C
^
remove

Next window:

B C D
    ^
   add

Only two characters affect the update:

  1. Character leaving.
  2. Character entering.

Everything else is reused.

Search Algorithm

Step 1

Compute:

Hash(pattern)

Step 2

Compute hash of first text window.

Step 3

Compare hashes.

If:

hash(window) ≠ hash(pattern)

then:

No match

Move immediately to the next window.

Step 4

If hashes match:

hash(window) = hash(pattern)

perform character verification.

Step 5

Roll to the next window.

Repeat until the text ends.

Example

Text:

ABCCDABCDABCD

Pattern:

ABCD

Hashes:

ABCC → different
BCCD → different
CCDA → different
CDAB → different
DABC → different
ABCD → same

Only the matching hash triggers character verification.

Most windows are discarded immediately.

Lean Implementation

First, a simple polynomial hash:

def hashString
  (s : String)
  (base : Nat := 31)
  : Nat :=

  s.foldl
    (fun acc ch =>
      acc * base + ch.toNat)
    0

A practical implementation uses modular arithmetic.

def modHash
  (s : String)
  (base modv : Nat)
  : Nat :=

  s.foldl
    (fun acc ch =>
      ((acc * base) + ch.toNat) % modv)
    0

The rolling version updates hashes incrementally.

def rollHash
  (current : Nat)
  (outgoing : Nat)
  (incoming : Nat)
  (power : Nat)
  (base modv : Nat)
  : Nat :=

  (((current + modv
      - outgoing * power % modv)
      * base)
      + incoming) % modv

This update requires constant time.

Why Rolling Hash Works

Suppose:

Window = ABC

Hash:

A×p² + B×p + C

Move right:

BCD

Subtract:

A×p²

Multiply remaining value by:

p

Add:

D

The resulting value is exactly the hash of:

BCD

No character-by-character recomputation is necessary.

Hash Collisions

Different strings can produce identical hashes.

Example:

Hash(XYZ) = 12345
Hash(ABC) = 12345

This event is called a collision.

A collision can cause a false match.

Therefore Rabin-Karp usually performs final verification.

Hashes equal
    ↓
Compare characters
    ↓
Confirm match

Correctness remains guaranteed.

Double Hashing

Collision probability can be reduced dramatically.

Compute:

Hash1 using modulus M1
Hash2 using modulus M2

A match requires:

Hash1 equal
AND
Hash2 equal

Example:

Hash Function Value
Hash₁ 8347821
Hash₂ 1927443

Both values must match.

Double hashing is common in competitive programming and large-scale text systems.

Multiple Pattern Matching

One advantage of Rabin-Karp is support for many patterns.

Suppose:

Patterns:
DOG
CAT
FOX
OWL

Store all pattern hashes:

Hash(DOG)
Hash(CAT)
Hash(FOX)
Hash(OWL)

Each text window computes one hash.

Lookup becomes:

Hash(window)
    ↓
Hash table lookup

This capability makes Rabin-Karp attractive when many patterns share the same length.

Applications

Plagiarism Detection

Compare document fingerprints rather than entire documents.

Duplicate File Detection

Compare chunk hashes.

DNA Matching

Search genomic sequences efficiently.

Malware Detection

Compare signature hashes.

Search Engines

Index and compare text fragments.

Distributed Storage

Detect repeated blocks using fingerprints.

Correctness

The algorithm maintains the hash of the current window exactly.

Every text window receives a hash value.

Whenever a window hash differs from the pattern hash, the strings cannot match.

Whenever hashes match, character verification confirms equality.

Therefore:

  1. No valid occurrence is missed.
  2. No false occurrence survives verification.

The algorithm returns exactly the correct match positions.

Complexity Analysis

Let:

Symbol Meaning
n Text length
m Pattern length

Expected Case

Operation Complexity
Initial hashing O(m)
Rolling updates O(n)
Verification Rare
Total O(n + m)

Worst Case

If every window collides:

O(nm)

This situation is extremely unlikely with good hash functions and large moduli.

Memory

Resource Complexity
Hash state O(1)

KMP vs Z vs Rabin-Karp

Feature KMP Z Rabin-Karp
Worst-case guarantee O(n+m) O(n+m) O(nm)
Expected performance Excellent Excellent Excellent
Rolling updates No No Yes
Multiple patterns Moderate Moderate Excellent
Hash-based No No Yes
Collision risk None None Yes

KMP and Z rely on structural properties of strings.

Rabin-Karp relies on probabilistic fingerprints and arithmetic.

This shift from direct comparison to hashing introduces a powerful idea that appears throughout computer science: representing large objects with compact fingerprints that can be compared and manipulated efficiently. Subsequent algorithms for document similarity, distributed storage, databases, caching systems, and large-scale search all build upon this fundamental concept.