Skip to content

SA-IS Suffix Array Construction

Linear time suffix array construction using induced sorting of LMS substrings.

SA-IS constructs a suffix array in linear time by classifying characters and recursively sorting a reduced problem of LMS substrings.

It is one of the most efficient practical algorithms for suffix array construction.

Problem

Given a string TT of length nn, construct its suffix array:

SA[0n1] SA[0 \dots n-1]

such that suffixes of TT are in lexicographic order.

Key Idea

Each position in the string is classified as:

typemeaning
S-typesuffix is lexicographically smaller than next suffix
L-typesuffix is lexicographically larger than next suffix
LMSLeftmost S-type preceded by L-type

SA-IS works by:

  1. classifying suffix types
  2. extracting LMS substrings
  3. sorting LMS substrings
  4. inducing order of all suffixes

Algorithm

Step 1: Classify S and L types

S[i] = (T[i] < T[i+1]) or (T[i] == T[i+1] and S[i+1] == true)
L[i] = not S[i]

Step 2: Identify LMS positions

A position is LMS if:

  • it is S-type
  • previous position is L-type

Step 3: Induced sorting

Place LMS substrings into buckets, sort them, then induce L and S suffixes.

SA-IS(T):
    classify S and L types
    find LMS positions

    place LMS suffixes into buckets
    induce sort L-type suffixes
    induce sort S-type suffixes

    if LMS names are not unique:
        recurse on reduced string

    return SA

Example

Text: T = "banana$"

Suffixes:

suffix
$
a$
ana$
anana$
banana$
na$
nana$

SA-IS processes:

  1. classify types
  2. extract LMS substrings
  3. sort LMS
  4. induce full SA

Final suffix array:

SA indexsuffix
0$
1a$
2ana$
3anana$
4banana$
5na$
6nana$

Correctness

SA-IS relies on three invariants:

  1. LMS substrings uniquely encode structural boundaries in suffix ordering
  2. induced sorting preserves relative order of L-type and S-type suffixes
  3. recursion ensures correctness when LMS names are not unique

The final induced array contains all suffixes in lexicographic order because every suffix is inserted exactly once in a position consistent with its lexicographic constraints.

Complexity

Let n=Tn = |T|.

phasecost
classificationO(n)O(n)
induced sortingO(n)O(n)
recursion depthO(n)O(n) worst, typically small

Total time:

O(n) O(n)

Space complexity:

O(n) O(n)

When to Use

SA-IS is used when:

  • linear time suffix array construction is required
  • large scale text indexing is needed
  • memory usage must remain linear
  • high performance string processing is required

It is widely used in modern suffix array libraries due to its speed and practical efficiency.

Implementation

SA-IS implementations are complex and typically optimized in systems code. A simplified structure is shown below.

def sais(T):
    # Placeholder structure for SA-IS pipeline

    def classify_types(T):
        pass

    def find_lms(T):
        pass

    def induced_sort(T, lms):
        pass

    S, L = classify_types(T)
    lms = find_lms(T)

    SA = induced_sort(T, lms)

    return SA
func SAIS(text []int) []int {
	// Simplified skeleton only

	classify := func() {}
	findLMS := func() []int { return nil }
	induce := func() []int { return nil }

	classify()
	lms := findLMS()
	_ = lms

	sa := induce()

	return sa
}