16.5 Boyer-Moore Overview

Most exact matching algorithms process the text from left to right.

16.5 Boyer-Moore Overview

Problem

Most exact matching algorithms process the text from left to right.

Consider:

Text:    HERE IS A SIMPLE EXAMPLE
Pattern: EXAMPLE

A typical algorithm begins by comparing:

E
E

then:

X
X

and continues forward.

This approach works well, but it often examines many characters that ultimately contribute nothing to the search.

Human readers rarely search this way. When looking for a word in a sentence, we often glance at distinctive characters near the end of the word and quickly eliminate impossible locations.

The challenge is to skip large portions of the text rather than advancing one position at a time.

Solution

The Boyer-Moore algorithm searches from right to left within the pattern.

Instead of aligning the beginning of the pattern with the current text position, Boyer-Moore compares the last pattern character first.

Example:

Text:    HERE IS A SIMPLE EXAMPLE
Pattern: EXAMPLE

Alignment:

HERE IS A SIMPLE EXAMPLE
EXAMPLE
      ^

The comparison begins at:

E

the final character of the pattern.

If that comparison fails, Boyer-Moore often shifts the pattern several positions at once.

This ability to make large jumps is the reason Boyer-Moore is one of the fastest practical exact matching algorithms.

Historical Significance

Published by:

  • entity["people","Robert S. Boyer","computer scientist"]
  • entity["people","J Strother Moore","computer scientist"]

in 1977, Boyer-Moore became one of the most influential string matching algorithms ever developed.

Many text editors, search tools, databases, and operating systems have used Boyer-Moore variants because their real-world performance is often better than algorithms with similar theoretical complexity.

Right-to-Left Matching

Consider:

Text:    ABCDABCDABCD
Pattern: ABCD

Alignment:

ABCDABCDABCD
ABCD
   ^

Comparison starts with:

D
D

then:

C
C

then:

B
B

then:

A
A

A complete match is found.

Now suppose:

ABCDABXDABCD
ABCD
   ^

Comparison:

D
D

Success.

Next:

C
X

Failure.

Instead of shifting by one position, Boyer-Moore analyzes the mismatch and often jumps several characters.

Bad Character Rule

The first optimization is the Bad Character Rule.

Suppose:

Pattern: EXAMPLE

Alignment:

TEXTEXZMPLE
EXAMPLE
      ^

Comparison:

M
Z

Mismatch.

The character:

Z

does not appear in the pattern.

Therefore the pattern cannot possibly match any alignment containing this character in the current position.

The algorithm shifts past it immediately.

Large portions of the text are skipped.

Example

Text:    ABCDEFGHIJ
Pattern: XYZ

First comparison:

Z
C

Mismatch.

Since:

C

does not appear anywhere in:

XYZ

the entire pattern jumps beyond that position.

No further analysis is needed.

Bad Character Table

Before searching, construct a table:

Character Last Position
E 6
X 1
A 2
M 3
P 4
L 5

for:

EXAMPLE

When a mismatch occurs, the table determines the nearest possible alignment.

Lean Example

def buildBadCharacterTable
  (pattern : Array Char)
  : Std.HashMap Char Nat :=

  Id.run do
    let mut table := {}
    for i in [:pattern.size] do
      table := table.insert pattern[i]! i
    return table

The table records the rightmost occurrence of every character.

Good Suffix Rule

The second optimization is more sophisticated.

Suppose:

Pattern: ABCABC

and:

ABCXBC
ABCABC

Comparison begins at the end:

BC
BC

These characters already match.

The mismatch occurs earlier.

The suffix:

BC

is known to be correct.

Instead of discarding this information, Boyer-Moore searches for another occurrence of:

BC

inside the pattern.

The pattern is shifted so the known suffix remains aligned.

This often produces even larger jumps than the bad character rule.

Example of Good Suffix

Pattern:

ABCDABC

Matched suffix:

ABC

The algorithm searches the pattern for another occurrence of:

ABC

and realigns accordingly.

If no occurrence exists, the pattern may jump beyond the matched suffix entirely.

This produces dramatic reductions in comparisons.

Combining Both Rules

At every mismatch:

  1. Compute shift from Bad Character Rule.
  2. Compute shift from Good Suffix Rule.
  3. Use the larger shift.
shift =
max(
    badCharacterShift,
    goodSuffixShift
)

The larger jump guarantees progress while preserving correctness.

Example Walkthrough

Text:

HERE IS A SIMPLE EXAMPLE

Pattern:

EXAMPLE

Initial alignment:

HERE IS A SIMPLE EXAMPLE
EXAMPLE

Mismatch occurs immediately.

Bad character analysis suggests a large shift.

Several characters are skipped.

After a few jumps:

HERE IS A SIMPLE EXAMPLE
                EXAMPLE

The match is found.

Notice that many characters were never examined.

Why It Is Fast

KMP and Z achieve efficiency by avoiding repeated work.

Boyer-Moore achieves efficiency by avoiding work entirely.

In large texts:

Text length = millions
Pattern length = dozens

the algorithm frequently jumps multiple characters after every mismatch.

Average behavior is often substantially better than linear scanning.

This makes Boyer-Moore particularly effective for:

  • Text editors
  • Search utilities
  • Document indexing
  • Log analysis
  • DNA sequence search

A complete Boyer-Moore implementation is lengthy, but the central idea is straightforward.

def shiftAmount
  (table : Std.HashMap Char Nat)
  (c : Char)
  (j : Nat)
  : Nat :=

  match table.get? c with
  | none => j + 1
  | some pos =>
      if pos < j then
        j - pos
      else
        1

The shift depends on the mismatching character and its position within the pattern.

Correctness

Boyer-Moore only shifts to positions that could still produce a valid match.

The Bad Character Rule preserves alignments consistent with the observed character.

The Good Suffix Rule preserves alignments consistent with the already matched suffix.

Therefore:

  • No valid occurrence is skipped.
  • Every reported occurrence is a genuine match.

Correctness follows from the fact that every discarded alignment is provably impossible.

Complexity Analysis

Let:

Symbol Meaning
n Text length
m Pattern length

Preprocessing

Operation Complexity
Bad character table O(m)
Good suffix table O(m)
Case Complexity
Best case O(n / m)
Average case Sublinear
Worst case O(nm)

Modern variants improve worst-case guarantees.

Boyer-Moore Variants

Several important descendants exist:

Algorithm Purpose
Boyer-Moore-Horspool Simpler implementation
Sunday Algorithm Improved practical performance
Turbo Boyer-Moore Better reuse of matches
Apostolico-Giancarlo Enhanced suffix handling

Many production systems use these variants rather than the original algorithm.

Comparison

Algorithm Key Idea
KMP Reuse matched prefixes
Z Algorithm Reuse prefix similarities
Rabin-Karp Compare hashes
Boyer-Moore Skip impossible regions

KMP and Z avoid redundant comparisons.

Rabin-Karp replaces comparisons with fingerprints.

Boyer-Moore takes a different path: it attempts to avoid visiting large parts of the text altogether. This idea of using mismatch information to guide aggressive skipping has influenced search systems, indexing structures, database query engines, and many later string matching algorithms.