16.10 Suffix Automata
Suppose you need to answer questions such as: - Does a substring occur in the text?
16.10 Suffix Automata
Problem
Suppose you need to answer questions such as:
- Does a substring occur in the text?
- How many distinct substrings exist?
- What is the longest repeated substring?
- How many times does a substring appear?
- What is the longest common substring between two strings?
A suffix array and LCP array can answer many of these questions efficiently, but they are fundamentally indexing structures. Queries often require binary searches, range queries, or additional data structures.
The challenge is to build a structure that directly represents every substring of a text and supports many string operations through simple graph traversal.
Solution
A suffix automaton is a compressed finite-state machine that represents all substrings of a string.
For a text:
banana
the automaton compactly encodes:
b
a
n
ba
an
na
ban
ana
nan
...
banana
Rather than storing every substring explicitly, equivalent substrings are grouped into states.
The resulting structure contains only:
O(n)
states and transitions.
This compression is the remarkable property that makes suffix automata practical.
From Tries to Automata
Consider building a trie of all suffixes:
banana
anana
nana
ana
na
a
This structure contains every substring implicitly.
Unfortunately, it can require:
O(n²)
space.
Many paths represent equivalent information.
A suffix automaton merges equivalent states, producing a compact representation.
Conceptually:
Suffix Trie
↓
State Merging
↓
Suffix Automaton
The automaton preserves all substring information while dramatically reducing storage.
Understanding States
A state does not represent a single substring.
Instead, it represents a set of substrings sharing the same continuation behavior.
For example:
ana
nana
may belong to the same equivalence class if every future extension behaves identically.
This is the key abstraction.
The automaton stores behavior rather than individual strings.
State Structure
Each state contains:
- Outgoing transitions.
- A suffix link.
- The length of the longest represented substring.
A simplified Lean representation:
structure State where
length : Nat
link : Nat
next : Std.HashMap Char Nat
The automaton itself is an array of states.
abbrev SuffixAutomaton := Array State
Each state receives an integer identifier.
Suffix Links
Suffix links play a role similar to failure links in Aho-Corasick and prefix links in KMP.
Suppose a state represents:
banana
Its suffix link points toward the state representing the longest proper suffix:
anana
which itself links to:
nana
and so on.
These links form a hierarchy of suffix relationships.
banana
↓
anana
↓
nana
↓
ana
↓
na
↓
a
Many important algorithms traverse these links.
Incremental Construction
One of the most elegant properties of a suffix automaton is that it can be built online.
Characters are processed one at a time.
b
ba
ban
bana
banan
banana
After each new character:
- Create a new state.
- Add transitions.
- Update suffix links.
- Clone states when necessary.
The entire automaton is built in a single pass.
Example
Process:
ababa
After reading:
a
the automaton contains substrings:
a
After reading:
ab
it contains:
a
b
ab
After reading:
aba
it contains:
a
b
ab
ba
aba
The structure grows incrementally while maintaining compactness.
Cloning States
The most subtle part of construction involves cloning.
Suppose adding a character would violate automaton invariants.
Instead of modifying an existing state directly, the algorithm creates a clone.
A clone:
- Copies transitions.
- Receives a different length.
- Preserves correctness.
Conceptually:
Original State
↓
Clone
↓
Rewire Links
Without cloning, some substrings would be represented incorrectly.
Although initially surprising, cloning is what allows linear-time construction.
Lean Sketch
Construction typically exposes a single operation:
def extend
(automaton : SuffixAutomaton)
(c : Char)
: SuffixAutomaton :=
by
-- Create state.
-- Update transitions.
-- Maintain suffix links.
-- Clone when needed.
sorry
Building the full automaton becomes:
def build
(text : String)
: SuffixAutomaton :=
by
-- Repeatedly extend.
sorry
The implementation is intricate, but the interface remains simple.
Substring Membership
One of the easiest applications is substring lookup.
Suppose the text is:
banana
Query:
ana
Begin at the start state.
Follow transitions:
a
↓
n
↓
a
If every transition exists:
Match found
Otherwise:
Substring absent
No suffix array search is required.
Membership becomes simple graph traversal.
Counting Distinct Substrings
A remarkable result is that the number of distinct substrings can be computed directly from state lengths.
For each state:
length(state)
-
length(link(state))
contributes the number of new substrings introduced by that state.
Summing over all states yields:
Distinct Substrings
For:
banana
this calculation produces:
15
without explicitly enumerating any substrings.
Longest Repeated Substring
Repeated substrings correspond to states visited by multiple suffixes.
By tracking occurrence counts during construction, the automaton can identify:
Longest Repeated Substring
efficiently.
For:
banana
the answer is:
ana
The same result obtained through suffix arrays and LCP arrays emerges naturally from automaton structure.
Occurrence Counting
After construction, occurrence counts can be propagated through suffix links.
Suppose:
banana
Query:
ana
The automaton can determine:
Occurrences = 2
without rescanning the original text.
This capability makes suffix automata attractive for frequency-analysis problems.
Longest Common Substring
Given:
String A = banana
String B = ananas
Build a suffix automaton for:
banana
Then process:
ananas
through the automaton.
Track the longest successful traversal.
Result:
anana
This computes the longest common substring in linear time.
Complexity Analysis
Let:
| Symbol | Meaning |
|---|---|
| n | Text length |
Construction:
| Operation | Complexity |
|---|---|
| Build automaton | O(n) |
| States | O(n) |
| Transitions | O(n) |
Queries:
| Operation | Complexity |
|---|---|
| Substring membership | O(m) |
| Occurrence counting | O(m) |
| Longest common substring | O(n + m) |
where m is the query length.
These bounds make suffix automata among the most powerful linear-time string structures.
Suffix Automata vs Suffix Arrays
| Feature | Suffix Array | Suffix Automaton |
|---|---|---|
| Construction | O(n log n) typical | O(n) |
| Memory | O(n) | O(n) |
| Substring lookup | O(m log n) | O(m) |
| Distinct substring count | Requires LCP | Direct |
| Occurrence counting | Additional structures | Natural |
| Implementation difficulty | Moderate | High |
Suffix arrays are often simpler to understand and debug.
Suffix automata provide more direct support for substring-oriented queries.
The choice depends on workload and implementation constraints.
Correctness
Every path from the start state corresponds to a substring of the original text.
Every substring of the original text corresponds to a path in the automaton.
State merging preserves continuation behavior, ensuring no information is lost.
Suffix links maintain relationships between substrings and their longest proper suffixes.
Together, these invariants guarantee that the automaton represents exactly the set of all substrings.
Common Pitfalls
Do not think of states as individual substrings.
A state may represent many substrings simultaneously.
Be careful when implementing cloning logic. Most bugs arise there.
Remember that occurrence counts usually require a separate propagation phase after construction.
Avoid storing explicit substring data inside states. The automaton's efficiency depends on representing behavior rather than materialized strings.
Takeaway
A suffix automaton is one of the most elegant structures in string processing. It represents every substring of a text in linear space while supporting a surprisingly large collection of queries. Where suffix arrays organize suffixes and LCP arrays measure similarity, a suffix automaton captures the entire substring universe of a text as a compact graph. For substring-centric problems, it is often the most powerful tool available.