# SA-IS Suffix Array Construction

# SA-IS Suffix Array Construction

SA-IS constructs a suffix array in linear time by classifying characters and recursively sorting a reduced problem of LMS substrings.

It is one of the most efficient practical algorithms for suffix array construction.

## Problem

Given a string $T$ of length $n$, construct its suffix array:

$$
SA[0 \dots n-1]
$$

such that suffixes of $T$ are in lexicographic order.

## Key Idea

Each position in the string is classified as:

| type   | meaning                                              |
| ------ | ---------------------------------------------------- |
| S-type | suffix is lexicographically smaller than next suffix |
| L-type | suffix is lexicographically larger than next suffix  |
| LMS    | Leftmost S-type preceded by L-type                   |

SA-IS works by:

1. classifying suffix types
2. extracting LMS substrings
3. sorting LMS substrings
4. inducing order of all suffixes

## Algorithm

### Step 1: Classify S and L types

```text id="sais0k"
S[i] = (T[i] < T[i+1]) or (T[i] == T[i+1] and S[i+1] == true)
L[i] = not S[i]
```

### Step 2: Identify LMS positions

A position is LMS if:

* it is S-type
* previous position is L-type

### Step 3: Induced sorting

Place LMS substrings into buckets, sort them, then induce L and S suffixes.

```text id="sais1m"
SA-IS(T):
    classify S and L types
    find LMS positions

    place LMS suffixes into buckets
    induce sort L-type suffixes
    induce sort S-type suffixes

    if LMS names are not unique:
        recurse on reduced string

    return SA
```

## Example

Text: `T = "banana$"`

Suffixes:

| suffix  |
| ------- |
| $       |
| a$      |
| ana$    |
| anana$  |
| banana$ |
| na$     |
| nana$   |

SA-IS processes:

1. classify types
2. extract LMS substrings
3. sort LMS
4. induce full SA

Final suffix array:

| SA index | suffix  |
| -------: | ------- |
|        0 | $       |
|        1 | a$      |
|        2 | ana$    |
|        3 | anana$  |
|        4 | banana$ |
|        5 | na$     |
|        6 | nana$   |

## Correctness

SA-IS relies on three invariants:

1. LMS substrings uniquely encode structural boundaries in suffix ordering
2. induced sorting preserves relative order of L-type and S-type suffixes
3. recursion ensures correctness when LMS names are not unique

The final induced array contains all suffixes in lexicographic order because every suffix is inserted exactly once in a position consistent with its lexicographic constraints.

## Complexity

Let $n = |T|$.

| phase           | cost                          |
| --------------- | ----------------------------- |
| classification  | $O(n)$                        |
| induced sorting | $O(n)$                        |
| recursion depth | $O(n)$ worst, typically small |

Total time:

$$
O(n)
$$

Space complexity:

$$
O(n)
$$

## When to Use

SA-IS is used when:

* linear time suffix array construction is required
* large scale text indexing is needed
* memory usage must remain linear
* high performance string processing is required

It is widely used in modern suffix array libraries due to its speed and practical efficiency.

## Implementation

SA-IS implementations are complex and typically optimized in systems code. A simplified structure is shown below.

```python id="sais2n"
def sais(T):
    # Placeholder structure for SA-IS pipeline

    def classify_types(T):
        pass

    def find_lms(T):
        pass

    def induced_sort(T, lms):
        pass

    S, L = classify_types(T)
    lms = find_lms(T)

    SA = induced_sort(T, lms)

    return SA
```

```go id="sais3v"
func SAIS(text []int) []int {
	// Simplified skeleton only

	classify := func() {}
	findLMS := func() []int { return nil }
	induce := func() []int { return nil }

	classify()
	lms := findLMS()
	_ = lms

	sa := induce()

	return sa
}
```

