16.2 Knuth-Morris-Pratt (KMP)
The naive exact matching algorithm repeatedly compares the same characters after every mismatch.
16.2 Knuth-Morris-Pratt (KMP)
Problem
The naive exact matching algorithm repeatedly compares the same characters after every mismatch. When the text and pattern contain repeated prefixes, large amounts of work are duplicated.
Consider:
Text: AAAAAAAAAAAAAAAA
Pattern: AAAAAB
The naive algorithm repeatedly verifies the same sequence of A characters before discovering that the final character differs.
The challenge is to reuse information gained from previous comparisons instead of restarting from the beginning after every mismatch.
Solution
The Knuth-Morris-Pratt algorithm, commonly called KMP, preprocesses the pattern and computes a failure table that describes how much of the pattern remains useful after a mismatch.
Rather than moving the pattern one position to the right and starting over, KMP determines the longest prefix of the pattern that is also a suffix of the already matched characters.
This allows the search to continue without rechecking characters that are already known to match.
Core Idea
Suppose we are searching:
Pattern: ABABC
During matching we successfully compare:
ABAB
and then encounter a mismatch.
The prefix:
AB
is also a suffix of:
ABAB
Therefore the algorithm already knows that those characters match and can continue from that point.
Instead of restarting from position zero, KMP jumps directly to the largest valid prefix.
Prefix Function
For every position in the pattern, compute:
LPS[i]
where LPS stands for:
Longest Proper Prefix which is also a Suffix
Example:
Pattern: ABABCABAB
| Index | Character | LPS |
|---|---|---|
| 0 | A | 0 |
| 1 | B | 0 |
| 2 | A | 1 |
| 3 | B | 2 |
| 4 | C | 0 |
| 5 | A | 1 |
| 6 | B | 2 |
| 7 | A | 3 |
| 8 | B | 4 |
The table records how much of the pattern can be reused after failure.
Building the LPS Table
Consider:
Pattern: ABABACA
Processing proceeds left to right.
A → 0
AB → 0
ABA → 1
ABAB → 2
ABABA → 3
ABABAC → 0
ABABACA → 1
Result:
| Position | Value |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |
| 3 | 2 |
| 4 | 3 |
| 5 | 0 |
| 6 | 1 |
The preprocessing phase requires only one pass through the pattern.
Lean Implementation: LPS Construction
def buildLPS (pattern : Array Char) : Array Nat :=
Id.run do
let mut lps := Array.mkArray pattern.size 0
let mut len := 0
let mut i := 1
while i < pattern.size do
if pattern[i]! = pattern[len]! then
len := len + 1
lps := lps.set! i len
i := i + 1
else if len > 0 then
len := lps[len - 1]!
else
lps := lps.set! i 0
i := i + 1
return lps
The preprocessing cost is linear in the pattern length.
Search Phase
Suppose:
Text: ABABDABACDABABCABAB
Pattern: ABABCABAB
During matching:
ABABC
matches successfully.
A mismatch occurs later.
Instead of restarting from the beginning:
ABABCABAB
^
KMP consults the LPS table and jumps directly to the next valid state.
No character in the text is revisited.
This property gives KMP its linear running time.
Lean Implementation: Search
def kmpSearch
(text pattern : Array Char)
: List Nat :=
if pattern.isEmpty then
[0]
else
let lps := buildLPS pattern
let rec loop
(i j : Nat)
(acc : List Nat) :=
if h : i = text.size then
acc.reverse
else if text[i]! = pattern[j]! then
let i := i + 1
let j := j + 1
if j = pattern.size then
let pos := i - j
loop i lps[j - 1]! (pos :: acc)
else
loop i j acc
else if j > 0 then
loop i lps[j - 1]! acc
else
loop (i + 1) j acc
loop 0 0 []
Example Walkthrough
Search:
Text: ABABABABC
Pattern: ABABC
Matching progresses:
ABAB
ABAB
Mismatch:
ABABA
ABABC
^
The naive algorithm would restart.
KMP observes:
AB
is both a prefix and a suffix.
Search resumes immediately from the corresponding pattern position.
The previous work is preserved.
Why KMP Is Linear
The key observation is that neither text indices nor pattern indices move backward unnecessarily.
Each text character:
T[i]
is examined at most a constant number of times.
Each pattern character contributes to a bounded number of transitions.
Therefore:
Preprocessing = O(m)
Search = O(n)
Total:
O(n + m)
where:
| Symbol | Meaning |
|---|---|
| n | Text length |
| m | Pattern length |
Comparison with Naive Matching
| Algorithm | Worst Case |
|---|---|
| Naive | O(nm) |
| KMP | O(n + m) |
For:
n = 1,000,000
m = 1,000
the difference can be enormous.
The naive algorithm may perform billions of comparisons.
KMP remains proportional to the combined size of the text and pattern.
Correctness
KMP maintains the invariant:
Pattern[0..j-1]
matches:
Text[i-j..i-1]
at every step.
When a mismatch occurs, the LPS table identifies the largest prefix that can still match.
No valid occurrence is skipped because every shift preserves all information implied by previous comparisons.
Every reported position corresponds to a complete match, and every complete match is eventually discovered.
Complexity Analysis
| Operation | Complexity |
|---|---|
| Build LPS | O(m) |
| Search | O(n) |
| Total | O(n + m) |
| Extra memory | O(m) |
When to Use KMP
KMP is an excellent choice when:
- The pattern is searched many times.
- Worst-case guarantees matter.
- Deterministic performance is required.
- The pattern contains repeated prefixes.
- Memory usage should remain small.
KMP introduced one of the most important ideas in algorithm design: preprocessing the pattern so that information learned during matching is never wasted. This concept reappears throughout string algorithms, indexing structures, dynamic programming, and automata theory.