16.10 Suffix Automata

Problem

Suppose you need to answer questions such as:

Does a substring occur in the text?
How many distinct substrings exist?
What is the longest repeated substring?
How many times does a substring appear?
What is the longest common substring between two strings?

A suffix array and LCP array can answer many of these questions efficiently, but they are fundamentally indexing structures. Queries often require binary searches, range queries, or additional data structures.

The challenge is to build a structure that directly represents every substring of a text and supports many string operations through simple graph traversal.

Solution

A suffix automaton is a compressed finite-state machine that represents all substrings of a string.

For a text:

banana

the automaton compactly encodes:

b
a
n
ba
an
na
ban
ana
nan
...
banana

Rather than storing every substring explicitly, equivalent substrings are grouped into states.

The resulting structure contains only:

O(n)

states and transitions.

This compression is the remarkable property that makes suffix automata practical.

From Tries to Automata

Consider building a trie of all suffixes:

banana
anana
nana
ana
na
a

This structure contains every substring implicitly.

Unfortunately, it can require:

O(n²)

space.

Many paths represent equivalent information.

A suffix automaton merges equivalent states, producing a compact representation.

Conceptually:

Suffix Trie
      ↓
State Merging
      ↓
Suffix Automaton

The automaton preserves all substring information while dramatically reducing storage.

Understanding States

A state does not represent a single substring.

Instead, it represents a set of substrings sharing the same continuation behavior.

For example:

ana
nana

may belong to the same equivalence class if every future extension behaves identically.

This is the key abstraction.

The automaton stores behavior rather than individual strings.

State Structure

Each state contains:

Outgoing transitions.
A suffix link.
The length of the longest represented substring.

A simplified Lean representation:

structure State where
  length : Nat
  link : Nat
  next : Std.HashMap Char Nat

The automaton itself is an array of states.

abbrev SuffixAutomaton := Array State

Each state receives an integer identifier.

Suffix Links

Suffix links play a role similar to failure links in Aho-Corasick and prefix links in KMP.

Suppose a state represents:

banana

Its suffix link points toward the state representing the longest proper suffix:

anana

which itself links to:

nana

and so on.

These links form a hierarchy of suffix relationships.

banana
  ↓
anana
  ↓
nana
  ↓
ana
  ↓
na
  ↓
a

Many important algorithms traverse these links.

Incremental Construction

One of the most elegant properties of a suffix automaton is that it can be built online.

Characters are processed one at a time.

b
ba
ban
bana
banan
banana

After each new character:

Create a new state.
Add transitions.
Update suffix links.
Clone states when necessary.

The entire automaton is built in a single pass.

Example

Process:

ababa

After reading:

the automaton contains substrings:

After reading:

ab

it contains:

a
b
ab

After reading:

aba

it contains:

a
b
ab
ba
aba

The structure grows incrementally while maintaining compactness.

Cloning States

The most subtle part of construction involves cloning.

Suppose adding a character would violate automaton invariants.

Instead of modifying an existing state directly, the algorithm creates a clone.

A clone:

Copies transitions.
Receives a different length.
Preserves correctness.

Conceptually:

Original State
       ↓
    Clone
       ↓
Rewire Links

Without cloning, some substrings would be represented incorrectly.

Although initially surprising, cloning is what allows linear-time construction.

Lean Sketch

Construction typically exposes a single operation:

def extend
  (automaton : SuffixAutomaton)
  (c : Char)
  : SuffixAutomaton :=
by
  -- Create state.
  -- Update transitions.
  -- Maintain suffix links.
  -- Clone when needed.
  sorry

Building the full automaton becomes:

def build
  (text : String)
  : SuffixAutomaton :=
by
  -- Repeatedly extend.
  sorry

The implementation is intricate, but the interface remains simple.

Substring Membership

One of the easiest applications is substring lookup.

Suppose the text is:

banana

Query:

ana

Begin at the start state.

Follow transitions:

a
↓
n
↓
a

If every transition exists:

Match found

Otherwise:

Substring absent

No suffix array search is required.

Membership becomes simple graph traversal.

Counting Distinct Substrings

A remarkable result is that the number of distinct substrings can be computed directly from state lengths.

For each state:

length(state)
-
length(link(state))

contributes the number of new substrings introduced by that state.

Summing over all states yields:

Distinct Substrings

For:

banana

this calculation produces:

without explicitly enumerating any substrings.

Longest Repeated Substring

Repeated substrings correspond to states visited by multiple suffixes.

By tracking occurrence counts during construction, the automaton can identify:

Longest Repeated Substring

efficiently.

For:

banana

the answer is:

ana

The same result obtained through suffix arrays and LCP arrays emerges naturally from automaton structure.

Occurrence Counting

After construction, occurrence counts can be propagated through suffix links.

Suppose:

banana

Query:

ana

The automaton can determine:

Occurrences = 2

without rescanning the original text.

This capability makes suffix automata attractive for frequency-analysis problems.

Longest Common Substring

Given:

String A = banana
String B = ananas

Build a suffix automaton for:

banana

Then process:

ananas

through the automaton.

Track the longest successful traversal.

Result:

anana

This computes the longest common substring in linear time.

Complexity Analysis

Let:

Symbol	Meaning
n	Text length

Construction:

Operation	Complexity
Build automaton	O(n)
States	O(n)
Transitions	O(n)

Queries:

Operation	Complexity
Substring membership	O(m)
Occurrence counting	O(m)
Longest common substring	O(n + m)

where m is the query length.

These bounds make suffix automata among the most powerful linear-time string structures.

Suffix Automata vs Suffix Arrays

Feature	Suffix Array	Suffix Automaton
Construction	O(n log n) typical	O(n)
Memory	O(n)	O(n)
Substring lookup	O(m log n)	O(m)
Distinct substring count	Requires LCP	Direct
Occurrence counting	Additional structures	Natural
Implementation difficulty	Moderate	High

Suffix arrays are often simpler to understand and debug.

Suffix automata provide more direct support for substring-oriented queries.

The choice depends on workload and implementation constraints.

Correctness

Every path from the start state corresponds to a substring of the original text.

Every substring of the original text corresponds to a path in the automaton.

State merging preserves continuation behavior, ensuring no information is lost.

Suffix links maintain relationships between substrings and their longest proper suffixes.

Together, these invariants guarantee that the automaton represents exactly the set of all substrings.

Common Pitfalls

Do not think of states as individual substrings.

A state may represent many substrings simultaneously.

Be careful when implementing cloning logic. Most bugs arise there.

Remember that occurrence counts usually require a separate propagation phase after construction.

Avoid storing explicit substring data inside states. The automaton's efficiency depends on representing behavior rather than materialized strings.

Takeaway

A suffix automaton is one of the most elegant structures in string processing. It represents every substring of a text in linear space while supporting a surprisingly large collection of queries. Where suffix arrays organize suffixes and LCP arrays measure similarity, a suffix automaton captures the entire substring universe of a text as a compact graph. For substring-centric problems, it is often the most powerful tool available.