# Suffix Array Binary Search

# Suffix Array Binary Search

Suffix array binary search finds a pattern in a text by searching over the sorted list of suffixes.

A suffix array stores the starting positions of all suffixes of a text in lexicographic order. Since the suffixes are sorted, binary search can locate the range of suffixes that begin with a given pattern.

## Problem

Given a text $T$, its suffix array `SA`, and a pattern $P$, determine whether $P$ occurs in $T$.

Return true if at least one suffix starts with $P$, otherwise false.

## Structure

For a text $T$ of length $n$, the suffix array is an array of indices:

$$
SA[0], SA[1], \ldots, SA[n - 1]
$$

such that

$$
T[SA[0]..] < T[SA[1]..] < \cdots < T[SA[n - 1]..]
$$

Each entry points to one suffix of the text.

## Algorithm

Binary search over suffixes. At each midpoint, compare the pattern with the suffix beginning at `SA[mid]`.

```text id="sa0k8q"
suffix_array_contains(T, SA, P):
    lo = 0
    hi = length(SA)

    while lo < hi:
        mid = (lo + hi) // 2

        if suffix_at(T, SA[mid]) < P:
            lo = mid + 1
        else:
            hi = mid

    if lo == length(SA):
        return false

    return suffix_at(T, SA[lo]) starts with P
```

## Example

Text:

$$
T = "banana"
$$

Suffixes in sorted order:

| suffix array index | text index | suffix |
| -----------------: | ---------: | ------ |
|                  0 |          5 | a      |
|                  1 |          3 | ana    |
|                  2 |          1 | banana |
|                  3 |          0 | banana |
|                  4 |          4 | na     |
|                  5 |          2 | nana   |

For the usual suffixes of `"banana"`, the corrected sorted order is:

| suffix array index | text index | suffix |
| -----------------: | ---------: | ------ |
|                  0 |          5 | a      |
|                  1 |          3 | ana    |
|                  2 |          1 | anana  |
|                  3 |          0 | banana |
|                  4 |          4 | na     |
|                  5 |          2 | nana   |

Search for pattern `"ana"`.

The lower bound over suffixes returns the suffix `"ana"` at text index `3`. Since that suffix starts with `"ana"`, the pattern occurs.

## Correctness

The suffix array is sorted lexicographically. All suffixes that begin with pattern $P$ form one contiguous interval in this sorted order.

Binary search finds the first suffix that is not lexicographically smaller than $P$. If that suffix starts with $P$, then $P$ occurs in the text. If it does not, no later suffix can start with $P$ without violating lexicographic order of the lower bound.

Therefore the algorithm returns true exactly when the pattern appears as a substring of the text.

## Complexity

Let $n = |T|$ and $m = |P|$.

Each binary search step may compare up to $m$ characters.

| operation                    | time            |
| ---------------------------- | --------------- |
| search                       | $O(m \log n)$   |
| search with LCP acceleration | $O(m + \log n)$ |

Space used by the suffix array:

$$
O(n)
$$

Extra space for iterative search:

$$
O(1)
$$

## When to Use

Suffix array binary search is useful when:

* many substring queries are run against the same text
* memory use should be lower than a suffix tree
* the text is static
* sorted suffix order is useful for related queries

Compared with suffix trees, suffix arrays are simpler and more compact, but basic search costs an extra logarithmic factor.

## Implementation

```python id="sa1m7v"
def compare_suffix_with_pattern(text, start, pattern):
    i = 0

    while start + i < len(text) and i < len(pattern):
        if text[start + i] < pattern[i]:
            return -1
        if text[start + i] > pattern[i]:
            return 1
        i += 1

    if i == len(pattern):
        return 0

    return -1

def suffix_array_contains(text, sa, pattern):
    lo = 0
    hi = len(sa)

    while lo < hi:
        mid = (lo + hi) // 2
        cmp = compare_suffix_with_pattern(text, sa[mid], pattern)

        if cmp < 0:
            lo = mid + 1
        else:
            hi = mid

    if lo == len(sa):
        return False

    return compare_suffix_with_pattern(text, sa[lo], pattern) == 0
```

```go id="sa2q9n"
func compareSuffixWithPattern(text string, start int, pattern string) int {
	i := 0

	for start+i < len(text) && i < len(pattern) {
		if text[start+i] < pattern[i] {
			return -1
		}
		if text[start+i] > pattern[i] {
			return 1
		}
		i++
	}

	if i == len(pattern) {
		return 0
	}

	return -1
}

func SuffixArrayContains(text string, sa []int, pattern string) bool {
	lo, hi := 0, len(sa)

	for lo < hi {
		mid := lo + (hi-lo)/2
		cmp := compareSuffixWithPattern(text, sa[mid], pattern)

		if cmp < 0 {
			lo = mid + 1
		} else {
			hi = mid
		}
	}

	if lo == len(sa) {
		return false
	}

	return compareSuffixWithPattern(text, sa[lo], pattern) == 0
}
```

