Linear time suffix array construction using induced sorting of LMS substrings.
SA-IS constructs a suffix array in linear time by classifying characters and recursively sorting a reduced problem of LMS substrings.
It is one of the most efficient practical algorithms for suffix array construction.
Problem
Given a string of length , construct its suffix array:
such that suffixes of are in lexicographic order.
Key Idea
Each position in the string is classified as:
| type | meaning |
|---|---|
| S-type | suffix is lexicographically smaller than next suffix |
| L-type | suffix is lexicographically larger than next suffix |
| LMS | Leftmost S-type preceded by L-type |
SA-IS works by:
- classifying suffix types
- extracting LMS substrings
- sorting LMS substrings
- inducing order of all suffixes
Algorithm
Step 1: Classify S and L types
S[i] = (T[i] < T[i+1]) or (T[i] == T[i+1] and S[i+1] == true)
L[i] = not S[i]Step 2: Identify LMS positions
A position is LMS if:
- it is S-type
- previous position is L-type
Step 3: Induced sorting
Place LMS substrings into buckets, sort them, then induce L and S suffixes.
SA-IS(T):
classify S and L types
find LMS positions
place LMS suffixes into buckets
induce sort L-type suffixes
induce sort S-type suffixes
if LMS names are not unique:
recurse on reduced string
return SAExample
Text: T = "banana$"
Suffixes:
| suffix |
|---|
| $ |
| a$ |
| ana$ |
| anana$ |
| banana$ |
| na$ |
| nana$ |
SA-IS processes:
- classify types
- extract LMS substrings
- sort LMS
- induce full SA
Final suffix array:
| SA index | suffix |
|---|---|
| 0 | $ |
| 1 | a$ |
| 2 | ana$ |
| 3 | anana$ |
| 4 | banana$ |
| 5 | na$ |
| 6 | nana$ |
Correctness
SA-IS relies on three invariants:
- LMS substrings uniquely encode structural boundaries in suffix ordering
- induced sorting preserves relative order of L-type and S-type suffixes
- recursion ensures correctness when LMS names are not unique
The final induced array contains all suffixes in lexicographic order because every suffix is inserted exactly once in a position consistent with its lexicographic constraints.
Complexity
Let .
| phase | cost |
|---|---|
| classification | |
| induced sorting | |
| recursion depth | worst, typically small |
Total time:
Space complexity:
When to Use
SA-IS is used when:
- linear time suffix array construction is required
- large scale text indexing is needed
- memory usage must remain linear
- high performance string processing is required
It is widely used in modern suffix array libraries due to its speed and practical efficiency.
Implementation
SA-IS implementations are complex and typically optimized in systems code. A simplified structure is shown below.
def sais(T):
# Placeholder structure for SA-IS pipeline
def classify_types(T):
pass
def find_lms(T):
pass
def induced_sort(T, lms):
pass
S, L = classify_types(T)
lms = find_lms(T)
SA = induced_sort(T, lms)
return SAfunc SAIS(text []int) []int {
// Simplified skeleton only
classify := func() {}
findLMS := func() []int { return nil }
induce := func() []int { return nil }
classify()
lms := findLMS()
_ = lms
sa := induce()
return sa
}