16.7 Aho-Corasick
Suppose you need to search a text for thousands of patterns simultaneously.
16.7 Aho-Corasick
Problem
Suppose you need to search a text for thousands of patterns simultaneously.
Consider a dictionary:
he
her
hers
hero
help
hello
world
search
engine
algorithm
and a text stream:
the hero helped her while the search engine processed the world
A straightforward solution runs a separate search for every pattern.
search("he")
search("her")
search("hers")
search("hero")
...
Even if each search uses KMP or Boyer-Moore, the text must be processed repeatedly.
For a text of length n and k patterns, the cost quickly becomes substantial.
The challenge is to process the text once while simultaneously detecting every occurrence of every pattern.
Solution
Aho-Corasick combines two ideas:
- A trie for sharing common prefixes.
- Failure links inspired by KMP.
The result is a finite-state machine that processes the text in a single pass.
Instead of restarting whenever a mismatch occurs, the automaton follows failure transitions that preserve previously matched information.
Conceptually:
Trie
+
Failure Links
+
Output Lists
=
Aho-Corasick Automaton
The automaton reads one character at a time and continuously maintains the longest matching prefix among all patterns.
Building the Trie
Begin with a trie containing every pattern.
For:
he
her
hers
hero
the trie looks like:
(root)
|
h
|
e*
|
r*
/ \\n s* o*
Each terminal node records one or more matching patterns.
At this stage we have efficient prefix lookup but no efficient recovery after mismatches.
Why Failure Links Are Needed
Suppose the automaton has matched:
her
and the next character is:
p
The path:
h → e → r → p
does not exist.
A naive trie search would restart from the root.
However, some information has already been learned.
The suffix:
er
may still participate in another match.
The failure link identifies the longest suffix that is also a valid trie prefix.
Instead of restarting completely, the automaton jumps directly to that state.
Understanding Failure Links
Consider the patterns:
he
her
hers
she
The trie contains:
(root)
├── h
│ └── e
│ └── r
│ └── s
└── s
└── h
└── e
Suppose we are currently at:
hers
and encounter a mismatch.
The failure link points to the longest valid suffix that is also present elsewhere in the trie.
This allows matching to continue immediately.
The idea mirrors the KMP prefix function, except it applies to an entire trie rather than a single pattern.
Constructing Failure Links
Failure links are computed using breadth-first search.
The root's children receive:
failure = root
For deeper nodes:
- Follow the parent's failure link.
- Look for a matching transition.
- Continue until a valid state is found.
- Use the root if none exists.
This procedure guarantees that every failure link points to the longest possible reusable prefix.
Example
Patterns:
he
she
his
hers
Failure relationships become:
she → he
hers → s
his → s
These links encode reusable matching information discovered during construction.
Output Links
A node may correspond to multiple matches.
Suppose the patterns are:
he
she
When matching:
she
the final node represents:
she
but also contains the suffix:
he
Therefore both patterns should be reported.
Output links propagate these relationships.
When a state is reached, every pattern associated with that state and its output chain is emitted.
This allows overlapping matches to be discovered automatically.
Automaton Traversal
Once construction is complete, searching becomes simple.
Initialize:
state = root
For each character:
- Follow a matching edge if available.
- Otherwise follow failure links.
- Continue until a valid transition exists.
- Enter the new state.
- Report all outputs.
The text is scanned exactly once.
No backtracking occurs.
No characters are revisited.
Example
Patterns:
he
her
hero
help
Text:
hero helped her
Processing:
h
e
r
o
reaches:
hero
Match reported:
hero
her
he
depending on the output structure.
Continuing:
help
produces another match.
Later:
her
produces additional matches.
All occurrences are discovered during a single pass.
Lean Representation
A simplified node definition:
structure ACNode where
children : Std.HashMap Char Nat
failure : Nat
outputs : Array String
Each node stores:
- Outgoing transitions
- Failure destination
- Patterns ending at the node
The automaton itself is typically represented as an array of nodes.
abbrev Automaton := Array ACNode
Node identifiers become state numbers.
Building Failure Links
A breadth-first traversal constructs failure transitions.
partial def buildFailureLinks
(automaton : Automaton)
: Automaton :=
by
-- BFS construction
sorry
Production implementations typically use a queue and process nodes level by level.
The overall construction remains linear in the total size of the pattern set.
Searching
Search maintains a current state.
partial def search
(automaton : Automaton)
(text : String)
: List (Nat × String) :=
by
-- Traverse automaton
sorry
Each character causes a small number of state transitions.
Every reported match contains:
(position, pattern)
indicating where the pattern ends.
Example Walkthrough
Patterns:
he
she
his
hers
Text:
ushers
Traversal:
u
s
h
e
r
s
Matches discovered:
she
he
hers
Notice that:
she
and:
he
overlap.
The automaton reports both automatically.
A naive approach would require multiple independent searches.
Complexity Analysis
Let:
| Symbol | Meaning |
|---|---|
| n | Text length |
| m | Total pattern length |
| z | Number of matches |
Construction:
| Operation | Complexity |
|---|---|
| Trie build | O(m) |
| Failure links | O(m) |
| Total | O(m) |
Search:
| Operation | Complexity |
|---|---|
| Text scan | O(n) |
| Output reporting | O(z) |
| Total | O(n + z) |
Memory:
| Resource | Complexity |
|---|---|
| Automaton | O(m) |
This efficiency is the defining achievement of Aho-Corasick.
Thousands of patterns can be searched simultaneously with a single linear pass through the text.
Real-World Applications
Search Engines
Large dictionaries of keywords are matched against documents.
Intrusion Detection
Network packets are scanned for thousands of attack signatures.
Antivirus Software
Malware signatures are searched within files and memory regions.
DNA Analysis
Large collections of genetic markers are matched against genomic sequences.
Log Processing
Event streams are scanned for many patterns simultaneously.
Content Filtering
Blocked phrases and keywords are detected in real time.
Comparison with Earlier Algorithms
| Algorithm | Patterns | Complexity |
|---|---|---|
| KMP | 1 | O(n + m) |
| Z Algorithm | 1 | O(n + m) |
| Rabin-Karp | Many (same length) | O(n + m) expected |
| Boyer-Moore | 1 | Excellent practical performance |
| Aho-Corasick | Many | O(n + z) |
KMP generalizes information reuse within a single pattern.
The Z Algorithm generalizes prefix similarity across a string.
Boyer-Moore exploits mismatches to skip large regions.
Aho-Corasick extends these ideas into a full automaton capable of matching an entire dictionary at once.
The transition from individual pattern matching to automata-based matching is one of the major conceptual shifts in string processing. Rather than asking whether a particular pattern appears in a text, we construct a machine that recognizes every pattern simultaneously and then stream the text through that machine. This automaton perspective will appear repeatedly throughout the remainder of the chapter, especially in suffix structures and advanced text indexing systems.