16.5 Boyer-Moore Overview
Most exact matching algorithms process the text from left to right.
16.5 Boyer-Moore Overview
Problem
Most exact matching algorithms process the text from left to right.
Consider:
Text: HERE IS A SIMPLE EXAMPLE
Pattern: EXAMPLE
A typical algorithm begins by comparing:
E
E
then:
X
X
and continues forward.
This approach works well, but it often examines many characters that ultimately contribute nothing to the search.
Human readers rarely search this way. When looking for a word in a sentence, we often glance at distinctive characters near the end of the word and quickly eliminate impossible locations.
The challenge is to skip large portions of the text rather than advancing one position at a time.
Solution
The Boyer-Moore algorithm searches from right to left within the pattern.
Instead of aligning the beginning of the pattern with the current text position, Boyer-Moore compares the last pattern character first.
Example:
Text: HERE IS A SIMPLE EXAMPLE
Pattern: EXAMPLE
Alignment:
HERE IS A SIMPLE EXAMPLE
EXAMPLE
^
The comparison begins at:
E
the final character of the pattern.
If that comparison fails, Boyer-Moore often shifts the pattern several positions at once.
This ability to make large jumps is the reason Boyer-Moore is one of the fastest practical exact matching algorithms.
Historical Significance
Published by:
- entity["people","Robert S. Boyer","computer scientist"]
- entity["people","J Strother Moore","computer scientist"]
in 1977, Boyer-Moore became one of the most influential string matching algorithms ever developed.
Many text editors, search tools, databases, and operating systems have used Boyer-Moore variants because their real-world performance is often better than algorithms with similar theoretical complexity.
Right-to-Left Matching
Consider:
Text: ABCDABCDABCD
Pattern: ABCD
Alignment:
ABCDABCDABCD
ABCD
^
Comparison starts with:
D
D
then:
C
C
then:
B
B
then:
A
A
A complete match is found.
Now suppose:
ABCDABXDABCD
ABCD
^
Comparison:
D
D
Success.
Next:
C
X
Failure.
Instead of shifting by one position, Boyer-Moore analyzes the mismatch and often jumps several characters.
Bad Character Rule
The first optimization is the Bad Character Rule.
Suppose:
Pattern: EXAMPLE
Alignment:
TEXTEXZMPLE
EXAMPLE
^
Comparison:
M
Z
Mismatch.
The character:
Z
does not appear in the pattern.
Therefore the pattern cannot possibly match any alignment containing this character in the current position.
The algorithm shifts past it immediately.
Large portions of the text are skipped.
Example
Text: ABCDEFGHIJ
Pattern: XYZ
First comparison:
Z
C
Mismatch.
Since:
C
does not appear anywhere in:
XYZ
the entire pattern jumps beyond that position.
No further analysis is needed.
Bad Character Table
Before searching, construct a table:
| Character | Last Position |
|---|---|
| E | 6 |
| X | 1 |
| A | 2 |
| M | 3 |
| P | 4 |
| L | 5 |
for:
EXAMPLE
When a mismatch occurs, the table determines the nearest possible alignment.
Lean Example
def buildBadCharacterTable
(pattern : Array Char)
: Std.HashMap Char Nat :=
Id.run do
let mut table := {}
for i in [:pattern.size] do
table := table.insert pattern[i]! i
return table
The table records the rightmost occurrence of every character.
Good Suffix Rule
The second optimization is more sophisticated.
Suppose:
Pattern: ABCABC
and:
ABCXBC
ABCABC
Comparison begins at the end:
BC
BC
These characters already match.
The mismatch occurs earlier.
The suffix:
BC
is known to be correct.
Instead of discarding this information, Boyer-Moore searches for another occurrence of:
BC
inside the pattern.
The pattern is shifted so the known suffix remains aligned.
This often produces even larger jumps than the bad character rule.
Example of Good Suffix
Pattern:
ABCDABC
Matched suffix:
ABC
The algorithm searches the pattern for another occurrence of:
ABC
and realigns accordingly.
If no occurrence exists, the pattern may jump beyond the matched suffix entirely.
This produces dramatic reductions in comparisons.
Combining Both Rules
At every mismatch:
- Compute shift from Bad Character Rule.
- Compute shift from Good Suffix Rule.
- Use the larger shift.
shift =
max(
badCharacterShift,
goodSuffixShift
)
The larger jump guarantees progress while preserving correctness.
Example Walkthrough
Text:
HERE IS A SIMPLE EXAMPLE
Pattern:
EXAMPLE
Initial alignment:
HERE IS A SIMPLE EXAMPLE
EXAMPLE
Mismatch occurs immediately.
Bad character analysis suggests a large shift.
Several characters are skipped.
After a few jumps:
HERE IS A SIMPLE EXAMPLE
EXAMPLE
The match is found.
Notice that many characters were never examined.
Why It Is Fast
KMP and Z achieve efficiency by avoiding repeated work.
Boyer-Moore achieves efficiency by avoiding work entirely.
In large texts:
Text length = millions
Pattern length = dozens
the algorithm frequently jumps multiple characters after every mismatch.
Average behavior is often substantially better than linear scanning.
This makes Boyer-Moore particularly effective for:
- Text editors
- Search utilities
- Document indexing
- Log analysis
- DNA sequence search
Simplified Lean Search
A complete Boyer-Moore implementation is lengthy, but the central idea is straightforward.
def shiftAmount
(table : Std.HashMap Char Nat)
(c : Char)
(j : Nat)
: Nat :=
match table.get? c with
| none => j + 1
| some pos =>
if pos < j then
j - pos
else
1
The shift depends on the mismatching character and its position within the pattern.
Correctness
Boyer-Moore only shifts to positions that could still produce a valid match.
The Bad Character Rule preserves alignments consistent with the observed character.
The Good Suffix Rule preserves alignments consistent with the already matched suffix.
Therefore:
- No valid occurrence is skipped.
- Every reported occurrence is a genuine match.
Correctness follows from the fact that every discarded alignment is provably impossible.
Complexity Analysis
Let:
| Symbol | Meaning |
|---|---|
| n | Text length |
| m | Pattern length |
Preprocessing
| Operation | Complexity |
|---|---|
| Bad character table | O(m) |
| Good suffix table | O(m) |
Search
| Case | Complexity |
|---|---|
| Best case | O(n / m) |
| Average case | Sublinear |
| Worst case | O(nm) |
Modern variants improve worst-case guarantees.
Boyer-Moore Variants
Several important descendants exist:
| Algorithm | Purpose |
|---|---|
| Boyer-Moore-Horspool | Simpler implementation |
| Sunday Algorithm | Improved practical performance |
| Turbo Boyer-Moore | Better reuse of matches |
| Apostolico-Giancarlo | Enhanced suffix handling |
Many production systems use these variants rather than the original algorithm.
Comparison
| Algorithm | Key Idea |
|---|---|
| KMP | Reuse matched prefixes |
| Z Algorithm | Reuse prefix similarities |
| Rabin-Karp | Compare hashes |
| Boyer-Moore | Skip impossible regions |
KMP and Z avoid redundant comparisons.
Rabin-Karp replaces comparisons with fingerprints.
Boyer-Moore takes a different path: it attempts to avoid visiting large parts of the text altogether. This idea of using mismatch information to guide aggressive skipping has influenced search systems, indexing structures, database query engines, and many later string matching algorithms.