# SA IS

# SA IS

SA IS is a linear-time suffix array construction algorithm. It sorts a carefully chosen subset of suffixes, called LMS suffixes, then induces the order of all remaining suffixes from them.

It is one of the standard practical algorithms for building suffix arrays in linear time.

## Problem

Given a string $S$ of length $n$, construct its suffix array:

$$
SA = \text{sorted indices of all suffixes of } S
$$

A sentinel character smaller than every other character is usually appended to the string.

## Idea

SA IS works in four main phases:

1. classify suffixes as S-type or L-type
2. identify LMS positions
3. sort LMS suffixes using induced sorting
4. reduce the LMS order to a smaller string and recurse if needed

After LMS suffixes are correctly ordered, the full suffix array can be induced.

## Types

A position $i$ is S-type if:

$$
S[i:] < S[i+1:]
$$

It is L-type if:

$$
S[i:] > S[i+1:]
$$

For equal adjacent characters, the type is inherited from the next position.

```text id="q8s2km"
classify(S):
    type[n - 1] = S

    for i from n - 2 down to 0:
        if S[i] < S[i + 1]:
            type[i] = S
        else if S[i] > S[i + 1]:
            type[i] = L
        else:
            type[i] = type[i + 1]
```

## LMS Positions

A position $i$ is LMS if:

$$
type[i] = S
$$

and

$$
type[i - 1] = L
$$

These positions mark leftmost S-type suffixes.

## Algorithm

```text id="z5k7pa"
sa_is(S):
    classify all positions as S or L
    find LMS positions

    place LMS suffixes at bucket ends
    induce L-type suffixes
    induce S-type suffixes

    extract LMS suffix order
    name LMS substrings

    if names are not unique:
        recurse on the reduced string

    place LMS suffixes in final LMS order
    induce L-type suffixes
    induce S-type suffixes

    return SA
```

## Example

Let:

$$
S = \text{"banana\$"}
$$

Suffixes:

| index | suffix  |
| ----- | ------- |
| 0     | banana$ |
| 1     | anana$  |
| 2     | nana$   |
| 3     | ana$    |
| 4     | na$     |
| 5     | a$      |
| 6     | $       |

The final suffix array is:

$$
[6, 5, 3, 1, 0, 4, 2]
$$

## Correctness

SA IS first determines the relative order of LMS suffixes. LMS suffixes are enough to divide the string into substrings whose names form a smaller suffix array problem.

If all LMS substrings have unique names, their order is known directly. Otherwise, recursion computes their order. Once LMS suffixes are placed correctly, induced sorting places L-type and S-type suffixes in lexicographic order using bucket boundaries.

Thus every suffix receives the same position it would have under direct lexicographic sorting.

## Complexity

Let $n$ be the length of the string.

Time:

$$
O(n)
$$

Each induced sorting phase scans arrays linearly. Recursive reduction shrinks the problem to the number of LMS substrings, which is at most about $n/2$.

Space:

$$
O(n)
$$

## Properties

| property         | value                      |
| ---------------- | -------------------------- |
| stable           | yes, within induced phases |
| in place         | no                         |
| comparison based | no                         |
| time             | linear                     |
| output           | suffix array               |

## When to Use

Use SA IS when building suffix arrays for large strings where linear-time construction matters. It is suitable for text indexing, compressed indexes, bioinformatics strings, and suffix-based search systems.

Use suffix array doubling instead when implementation simplicity matters more than asymptotic performance.

## Implementation Notes

* Append a sentinel smaller than all other characters
* Compress alphabet symbols to integer ranks
* Maintain bucket starts for L-type induction
* Maintain bucket ends for S-type induction
* Compare LMS substrings carefully when assigning names

