16.8 Suffix Arrays

Problem

Suppose you need to answer many substring queries against the same text.

For one query, a linear scan is acceptable:

Text:    bananabandana
Pattern: ana

But if the text is large and the number of queries is high, scanning from the beginning each time is wasteful.

You want a structure that preprocesses the text once and then answers substring queries quickly.

Solution

A suffix array stores all suffixes of a string in sorted order.

For the string:

banana

the suffixes are:

Index	Suffix
0	banana
1	anana
2	nana
3	ana
4	na
5	a

Sorted lexicographically:

Rank	Start	Suffix
0	5	a
1	3	ana
2	1	anana
3	0	banana
4	4	na
5	2	nana

The suffix array is therefore:

[5, 3, 1, 0, 4, 2]

Each number is the starting position of a suffix in the original text.

Why This Helps

A pattern occurs in the text if it is a prefix of some suffix.

Searching for:

ana

means looking for suffixes that begin with:

ana

In the sorted suffix array, all matching suffixes are adjacent:

a
ana       ← match
anana     ← match
banana
na
nana

This means pattern search becomes binary search over sorted suffixes.

Naive Construction

The simplest construction is:

Generate every suffix.
Sort the suffixes.
Store their starting positions.

In Lean-like pseudocode:

def suffixes (s : String) : List (Nat × String) :=
  (List.range s.length).map
    (fun i => (i, s.drop i))

def naiveSuffixArray (s : String) : List Nat :=
  ((suffixes s).mergeSort
    (fun a b => a.snd < b.snd)).map
    (fun pair => pair.fst)

This is easy to understand but inefficient. Each suffix may be length O(n), and sorting n suffixes can cost O(n² log n) if comparisons repeatedly inspect long strings.

For teaching and testing, however, this version is useful. It gives a reference implementation that is hard to get wrong.

Pattern Search

Once the suffix array exists, search uses binary search.

At each midpoint, compare the pattern with the suffix beginning at that suffix-array position.

Text: banana
SA:   [5, 3, 1, 0, 4, 2]

Search: ana

The matching suffixes occupy a contiguous interval. One binary search finds the left boundary. Another finds the right boundary.

The result is all starting positions inside that interval.

Prefix Comparison

You do not need to compare the pattern with the full suffix. You only need to compare against the first m characters, where m is the pattern length.

Pattern: ana
Suffix:  anana
Compare: ana

The pattern matches because the suffix begins with the pattern.

This detail matters. Comparing full suffixes does extra work and can produce incorrect ordering logic when the pattern is shorter than the suffix.

A practical comparison returns one of three results:

suffix-prefix < pattern
suffix-prefix = pattern
suffix-prefix > pattern

Lean Sketch

A compact comparison helper can be described this way:

inductive Cmp where
  | lt
  | eq
  | gt

def comparePrefix
  (text pattern : String)
  (start : Nat)
  : Cmp :=
by
  -- Compare pattern with text[start .. start + pattern.length).
  -- Return eq only when the whole pattern matches.
  sorry

Then the binary search can use comparePrefix as its ordering predicate.

def suffixArraySearch
  (text pattern : String)
  (sa : Array Nat)
  : List Nat :=
by
  -- Find the lower and upper suffix-array bounds
  -- whose suffixes begin with pattern.
  sorry

The important design point is not the syntax. It is the interface: the suffix array is independent of the query pattern, and the search logic only needs indexed prefix comparisons.

Example

Text:

mississippi

Suffixes include:

mississippi
ississippi
ssissippi
sissippi
issippi
ssippi
sippi
ippi
ppi
pi
i

A suffix array places these in lexicographic order.

Searching for:

issi

finds suffixes beginning at positions:

1
4

because:

mississippi
 ^^^^
    ^^^^

Both occurrences are discovered through the same sorted suffix structure.

Faster Construction

Production suffix arrays are not built by explicitly sorting full suffix strings.

Common construction methods include:

Method	Idea
Prefix doubling	Sort suffixes by increasingly large prefixes
SA-IS	Linear-time induced sorting
DC3 / skew	Divide suffixes into classes and merge
Radix-based methods	Exploit integer ranks for fast sorting

A common educational implementation uses prefix doubling.

The idea is to assign each suffix a rank based on its first character, then repeatedly sort by pairs:

(rank[i], rank[i + k])

where k doubles each round:

1, 2, 4, 8, ...

After enough rounds, the ranks describe the full suffix order.

Prefix Doubling Outline

For each suffix starting at i, maintain:

rank[i]

At step k, sort suffixes by:

(rank[i], rank[i + k])

Then assign new ranks based on sorted order.

Example progression:

k = 1  compare first 2 characters
k = 2  compare first 4 characters
k = 4  compare first 8 characters

After O(log n) rounds, every suffix has a final rank.

With efficient sorting, this gives:

O(n log n)

construction.

Complexity Analysis

Let n be the text length and m be the pattern length.

Operation	Complexity
Naive construction	O(n² log n)
Prefix-doubling construction	O(n log n)
Linear construction algorithms	O(n)
Pattern search	O(m log n)
Reporting matches	O(z)
Storage	O(n)

Here z is the number of matches reported.

The common query cost is:

O(m log n + z)

With an LCP structure, some repeated comparison work can be reduced further.

Correctness

A suffix array contains every suffix of the text exactly once.

A pattern occurs at position i exactly when the suffix starting at i begins with that pattern.

Because suffixes are sorted lexicographically, all suffixes with a common prefix form one contiguous interval.

Binary search finds the boundaries of that interval. Reporting the suffix starts inside the interval gives exactly the occurrence positions.

When to Use It

Use a suffix array when:

The text is fixed.
Many substring queries are expected.
You need all occurrence positions.
Memory matters more than pointer-heavy suffix trees.
You want a practical full-text index.

Suffix arrays are especially useful for search systems, compression tools, bioinformatics, plagiarism detection, and repeated substring analysis.

Common Pitfalls

Do not store full suffix strings in production. Store starting positions.

Do not compare entire suffixes during query search. Compare only the pattern-length prefix.

Handle empty patterns explicitly.

Be careful with Unicode. A suffix array over bytes, code points, and grapheme clusters gives different semantics.

Keep the suffix array sorted by the same representation used during search.

Takeaway

A suffix array turns substring search into ordered lookup. Instead of scanning the text for every query, you sort all suffixes once, then use binary search to find the interval of suffixes that begin with the pattern. It is one of the most useful bridges between simple string matching and full-text indexing.