16.2 Knuth-Morris-Pratt (KMP)

Problem

The naive exact matching algorithm repeatedly compares the same characters after every mismatch. When the text and pattern contain repeated prefixes, large amounts of work are duplicated.

Consider:

Text:    AAAAAAAAAAAAAAAA
Pattern: AAAAAB

The naive algorithm repeatedly verifies the same sequence of A characters before discovering that the final character differs.

The challenge is to reuse information gained from previous comparisons instead of restarting from the beginning after every mismatch.

Solution

The Knuth-Morris-Pratt algorithm, commonly called KMP, preprocesses the pattern and computes a failure table that describes how much of the pattern remains useful after a mismatch.

Rather than moving the pattern one position to the right and starting over, KMP determines the longest prefix of the pattern that is also a suffix of the already matched characters.

This allows the search to continue without rechecking characters that are already known to match.

Core Idea

Suppose we are searching:

Pattern: ABABC

During matching we successfully compare:

ABAB

and then encounter a mismatch.

The prefix:

AB

is also a suffix of:

ABAB

Therefore the algorithm already knows that those characters match and can continue from that point.

Instead of restarting from position zero, KMP jumps directly to the largest valid prefix.

Prefix Function

For every position in the pattern, compute:

LPS[i]

where LPS stands for:

Longest Proper Prefix which is also a Suffix

Example:

Pattern: ABABCABAB

Index	Character	LPS
0	A	0
1	B	0
2	A	1
3	B	2
4	C	0
5	A	1
6	B	2
7	A	3
8	B	4

The table records how much of the pattern can be reused after failure.

Building the LPS Table

Consider:

Pattern: ABABACA

Processing proceeds left to right.

A       → 0
AB      → 0
ABA     → 1
ABAB    → 2
ABABA   → 3
ABABAC  → 0
ABABACA → 1

Result:

Position	Value
0	0
1	0
2	1
3	2
4	3
5	0
6	1

The preprocessing phase requires only one pass through the pattern.

Lean Implementation: LPS Construction

def buildLPS (pattern : Array Char) : Array Nat :=
  Id.run do
    let mut lps := Array.mkArray pattern.size 0
    let mut len := 0
    let mut i := 1

    while i < pattern.size do
      if pattern[i]! = pattern[len]! then
        len := len + 1
        lps := lps.set! i len
        i := i + 1
      else if len > 0 then
        len := lps[len - 1]!
      else
        lps := lps.set! i 0
        i := i + 1

    return lps

The preprocessing cost is linear in the pattern length.

Search Phase

Suppose:

Text:    ABABDABACDABABCABAB
Pattern: ABABCABAB

During matching:

ABABC

matches successfully.

A mismatch occurs later.

Instead of restarting from the beginning:

ABABCABAB
    ^

KMP consults the LPS table and jumps directly to the next valid state.

No character in the text is revisited.

This property gives KMP its linear running time.

Lean Implementation: Search

def kmpSearch
  (text pattern : Array Char)
  : List Nat :=

  if pattern.isEmpty then
    [0]
  else
    let lps := buildLPS pattern

    let rec loop
      (i j : Nat)
      (acc : List Nat) :=
      if h : i = text.size then
        acc.reverse
      else if text[i]! = pattern[j]! then
        let i := i + 1
        let j := j + 1

        if j = pattern.size then
          let pos := i - j
          loop i lps[j - 1]! (pos :: acc)
        else
          loop i j acc

      else if j > 0 then
        loop i lps[j - 1]! acc
      else
        loop (i + 1) j acc

    loop 0 0 []

Example Walkthrough

Search:

Text:    ABABABABC
Pattern: ABABC

Matching progresses:

ABAB
ABAB

Mismatch:

ABABA
ABABC
    ^

The naive algorithm would restart.

KMP observes:

AB

is both a prefix and a suffix.

Search resumes immediately from the corresponding pattern position.

The previous work is preserved.

Why KMP Is Linear

The key observation is that neither text indices nor pattern indices move backward unnecessarily.

Each text character:

T[i]

is examined at most a constant number of times.

Each pattern character contributes to a bounded number of transitions.

Therefore:

Preprocessing = O(m)
Search        = O(n)

Total:

O(n + m)

where:

Symbol	Meaning
n	Text length
m	Pattern length

Comparison with Naive Matching

Algorithm	Worst Case
Naive	O(nm)
KMP	O(n + m)

For:

n = 1,000,000
m = 1,000

the difference can be enormous.

The naive algorithm may perform billions of comparisons.

KMP remains proportional to the combined size of the text and pattern.

Correctness

KMP maintains the invariant:

Pattern[0..j-1]

matches:

Text[i-j..i-1]

at every step.

When a mismatch occurs, the LPS table identifies the largest prefix that can still match.

No valid occurrence is skipped because every shift preserves all information implied by previous comparisons.

Every reported position corresponds to a complete match, and every complete match is eventually discovered.

Complexity Analysis

Operation	Complexity
Build LPS	O(m)
Search	O(n)
Total	O(n + m)
Extra memory	O(m)

When to Use KMP

KMP is an excellent choice when:

The pattern is searched many times.
Worst-case guarantees matter.
Deterministic performance is required.
The pattern contains repeated prefixes.
Memory usage should remain small.

KMP introduced one of the most important ideas in algorithm design: preprocessing the pattern so that information learned during matching is never wasted. This concept reappears throughout string algorithms, indexing structures, dynamic programming, and automata theory.