16.7 Aho-Corasick

Suppose you need to search a text for thousands of patterns simultaneously.

16.7 Aho-Corasick

Problem

Suppose you need to search a text for thousands of patterns simultaneously.

Consider a dictionary:

he
her
hers
hero
help
hello
world
search
engine
algorithm

and a text stream:

the hero helped her while the search engine processed the world

A straightforward solution runs a separate search for every pattern.

search("he")
search("her")
search("hers")
search("hero")
...

Even if each search uses KMP or Boyer-Moore, the text must be processed repeatedly.

For a text of length n and k patterns, the cost quickly becomes substantial.

The challenge is to process the text once while simultaneously detecting every occurrence of every pattern.

Solution

Aho-Corasick combines two ideas:

  1. A trie for sharing common prefixes.
  2. Failure links inspired by KMP.

The result is a finite-state machine that processes the text in a single pass.

Instead of restarting whenever a mismatch occurs, the automaton follows failure transitions that preserve previously matched information.

Conceptually:

Trie
  +
Failure Links
  +
Output Lists
  =
Aho-Corasick Automaton

The automaton reads one character at a time and continuously maintains the longest matching prefix among all patterns.

Building the Trie

Begin with a trie containing every pattern.

For:

he
her
hers
hero

the trie looks like:

(root)
   |
   h
   |
   e*
   |
   r*
  / \\n s*  o*

Each terminal node records one or more matching patterns.

At this stage we have efficient prefix lookup but no efficient recovery after mismatches.

Suppose the automaton has matched:

her

and the next character is:

p

The path:

h → e → r → p

does not exist.

A naive trie search would restart from the root.

However, some information has already been learned.

The suffix:

er

may still participate in another match.

The failure link identifies the longest suffix that is also a valid trie prefix.

Instead of restarting completely, the automaton jumps directly to that state.

Consider the patterns:

he
her
hers
she

The trie contains:

(root)
├── h
│   └── e
│       └── r
│           └── s
└── s
    └── h
        └── e

Suppose we are currently at:

hers

and encounter a mismatch.

The failure link points to the longest valid suffix that is also present elsewhere in the trie.

This allows matching to continue immediately.

The idea mirrors the KMP prefix function, except it applies to an entire trie rather than a single pattern.

Failure links are computed using breadth-first search.

The root's children receive:

failure = root

For deeper nodes:

  1. Follow the parent's failure link.
  2. Look for a matching transition.
  3. Continue until a valid state is found.
  4. Use the root if none exists.

This procedure guarantees that every failure link points to the longest possible reusable prefix.

Example

Patterns:

he
she
his
hers

Failure relationships become:

she → he
hers → s
his → s

These links encode reusable matching information discovered during construction.

A node may correspond to multiple matches.

Suppose the patterns are:

he
she

When matching:

she

the final node represents:

she

but also contains the suffix:

he

Therefore both patterns should be reported.

Output links propagate these relationships.

When a state is reached, every pattern associated with that state and its output chain is emitted.

This allows overlapping matches to be discovered automatically.

Automaton Traversal

Once construction is complete, searching becomes simple.

Initialize:

state = root

For each character:

  1. Follow a matching edge if available.
  2. Otherwise follow failure links.
  3. Continue until a valid transition exists.
  4. Enter the new state.
  5. Report all outputs.

The text is scanned exactly once.

No backtracking occurs.

No characters are revisited.

Example

Patterns:

he
her
hero
help

Text:

hero helped her

Processing:

h
e
r
o

reaches:

hero

Match reported:

hero
her
he

depending on the output structure.

Continuing:

help

produces another match.

Later:

her

produces additional matches.

All occurrences are discovered during a single pass.

Lean Representation

A simplified node definition:

structure ACNode where
  children : Std.HashMap Char Nat
  failure  : Nat
  outputs  : Array String

Each node stores:

  • Outgoing transitions
  • Failure destination
  • Patterns ending at the node

The automaton itself is typically represented as an array of nodes.

abbrev Automaton := Array ACNode

Node identifiers become state numbers.

A breadth-first traversal constructs failure transitions.

partial def buildFailureLinks
  (automaton : Automaton)
  : Automaton :=
by
  -- BFS construction
  sorry

Production implementations typically use a queue and process nodes level by level.

The overall construction remains linear in the total size of the pattern set.

Searching

Search maintains a current state.

partial def search
  (automaton : Automaton)
  (text : String)
  : List (Nat × String) :=
by
  -- Traverse automaton
  sorry

Each character causes a small number of state transitions.

Every reported match contains:

(position, pattern)

indicating where the pattern ends.

Example Walkthrough

Patterns:

he
she
his
hers

Text:

ushers

Traversal:

u
s
h
e
r
s

Matches discovered:

she
he
hers

Notice that:

she

and:

he

overlap.

The automaton reports both automatically.

A naive approach would require multiple independent searches.

Complexity Analysis

Let:

Symbol Meaning
n Text length
m Total pattern length
z Number of matches

Construction:

Operation Complexity
Trie build O(m)
Failure links O(m)
Total O(m)

Search:

Operation Complexity
Text scan O(n)
Output reporting O(z)
Total O(n + z)

Memory:

Resource Complexity
Automaton O(m)

This efficiency is the defining achievement of Aho-Corasick.

Thousands of patterns can be searched simultaneously with a single linear pass through the text.

Real-World Applications

Search Engines

Large dictionaries of keywords are matched against documents.

Intrusion Detection

Network packets are scanned for thousands of attack signatures.

Antivirus Software

Malware signatures are searched within files and memory regions.

DNA Analysis

Large collections of genetic markers are matched against genomic sequences.

Log Processing

Event streams are scanned for many patterns simultaneously.

Content Filtering

Blocked phrases and keywords are detected in real time.

Comparison with Earlier Algorithms

Algorithm Patterns Complexity
KMP 1 O(n + m)
Z Algorithm 1 O(n + m)
Rabin-Karp Many (same length) O(n + m) expected
Boyer-Moore 1 Excellent practical performance
Aho-Corasick Many O(n + z)

KMP generalizes information reuse within a single pattern.

The Z Algorithm generalizes prefix similarity across a string.

Boyer-Moore exploits mismatches to skip large regions.

Aho-Corasick extends these ideas into a full automaton capable of matching an entire dictionary at once.

The transition from individual pattern matching to automata-based matching is one of the major conceptual shifts in string processing. Rather than asking whether a particular pattern appears in a text, we construct a machine that recognizes every pattern simultaneously and then stream the text through that machine. This automaton perspective will appear repeatedly throughout the remainder of the chapter, especially in suffix structures and advanced text indexing systems.