SA IS is a linear-time suffix array construction algorithm. It sorts a carefully chosen subset of suffixes, called LMS suffixes, then induces the order of all remaining suffixes from them.
It is one of the standard practical algorithms for building suffix arrays in linear time.
Problem
Given a string of length , construct its suffix array:
A sentinel character smaller than every other character is usually appended to the string.
Idea
SA IS works in four main phases:
- classify suffixes as S-type or L-type
- identify LMS positions
- sort LMS suffixes using induced sorting
- reduce the LMS order to a smaller string and recurse if needed
After LMS suffixes are correctly ordered, the full suffix array can be induced.
Types
A position is S-type if:
It is L-type if:
For equal adjacent characters, the type is inherited from the next position.
classify(S):
type[n - 1] = S
for i from n - 2 down to 0:
if S[i] < S[i + 1]:
type[i] = S
else if S[i] > S[i + 1]:
type[i] = L
else:
type[i] = type[i + 1]LMS Positions
A position is LMS if:
and
These positions mark leftmost S-type suffixes.
Algorithm
sa_is(S):
classify all positions as S or L
find LMS positions
place LMS suffixes at bucket ends
induce L-type suffixes
induce S-type suffixes
extract LMS suffix order
name LMS substrings
if names are not unique:
recurse on the reduced string
place LMS suffixes in final LMS order
induce L-type suffixes
induce S-type suffixes
return SAExample
Let:
Suffixes:
| index | suffix |
|---|---|
| 0 | banana$ |
| 1 | anana$ |
| 2 | nana$ |
| 3 | ana$ |
| 4 | na$ |
| 5 | a$ |
| 6 | $ |
The final suffix array is:
Correctness
SA IS first determines the relative order of LMS suffixes. LMS suffixes are enough to divide the string into substrings whose names form a smaller suffix array problem.
If all LMS substrings have unique names, their order is known directly. Otherwise, recursion computes their order. Once LMS suffixes are placed correctly, induced sorting places L-type and S-type suffixes in lexicographic order using bucket boundaries.
Thus every suffix receives the same position it would have under direct lexicographic sorting.
Complexity
Let be the length of the string.
Time:
Each induced sorting phase scans arrays linearly. Recursive reduction shrinks the problem to the number of LMS substrings, which is at most about .
Space:
Properties
| property | value |
|---|---|
| stable | yes, within induced phases |
| in place | no |
| comparison based | no |
| time | linear |
| output | suffix array |
When to Use
Use SA IS when building suffix arrays for large strings where linear-time construction matters. It is suitable for text indexing, compressed indexes, bioinformatics strings, and suffix-based search systems.
Use suffix array doubling instead when implementation simplicity matters more than asymptotic performance.
Implementation Notes
- Append a sentinel smaller than all other characters
- Compress alphabet symbols to integer ranks
- Maintain bucket starts for L-type induction
- Maintain bucket ends for S-type induction
- Compare LMS substrings carefully when assigning names