Skip to content

Suffix Array Binary Search

Search for a pattern in a text by binary searching the lexicographically sorted suffix array.

Suffix array binary search finds a pattern in a text by searching over the sorted list of suffixes.

A suffix array stores the starting positions of all suffixes of a text in lexicographic order. Since the suffixes are sorted, binary search can locate the range of suffixes that begin with a given pattern.

Problem

Given a text TT, its suffix array SA, and a pattern PP, determine whether PP occurs in TT.

Return true if at least one suffix starts with PP, otherwise false.

Structure

For a text TT of length nn, the suffix array is an array of indices:

SA[0],SA[1],,SA[n1] SA[0], SA[1], \ldots, SA[n - 1]

such that

T[SA[0]..]<T[SA[1]..]<<T[SA[n1]..] T[SA[0]..] < T[SA[1]..] < \cdots < T[SA[n - 1]..]

Each entry points to one suffix of the text.

Algorithm

Binary search over suffixes. At each midpoint, compare the pattern with the suffix beginning at SA[mid].

suffix_array_contains(T, SA, P):
    lo = 0
    hi = length(SA)

    while lo < hi:
        mid = (lo + hi) // 2

        if suffix_at(T, SA[mid]) < P:
            lo = mid + 1
        else:
            hi = mid

    if lo == length(SA):
        return false

    return suffix_at(T, SA[lo]) starts with P

Example

Text:

T="banana" T = "banana"

Suffixes in sorted order:

suffix array indextext indexsuffix
05a
13ana
21banana
30banana
44na
52nana

For the usual suffixes of "banana", the corrected sorted order is:

suffix array indextext indexsuffix
05a
13ana
21anana
30banana
44na
52nana

Search for pattern "ana".

The lower bound over suffixes returns the suffix "ana" at text index 3. Since that suffix starts with "ana", the pattern occurs.

Correctness

The suffix array is sorted lexicographically. All suffixes that begin with pattern PP form one contiguous interval in this sorted order.

Binary search finds the first suffix that is not lexicographically smaller than PP. If that suffix starts with PP, then PP occurs in the text. If it does not, no later suffix can start with PP without violating lexicographic order of the lower bound.

Therefore the algorithm returns true exactly when the pattern appears as a substring of the text.

Complexity

Let n=Tn = |T| and m=Pm = |P|.

Each binary search step may compare up to mm characters.

operationtime
searchO(mlogn)O(m \log n)
search with LCP accelerationO(m+logn)O(m + \log n)

Space used by the suffix array:

O(n) O(n)

Extra space for iterative search:

O(1) O(1)

When to Use

Suffix array binary search is useful when:

  • many substring queries are run against the same text
  • memory use should be lower than a suffix tree
  • the text is static
  • sorted suffix order is useful for related queries

Compared with suffix trees, suffix arrays are simpler and more compact, but basic search costs an extra logarithmic factor.

Implementation

def compare_suffix_with_pattern(text, start, pattern):
    i = 0

    while start + i < len(text) and i < len(pattern):
        if text[start + i] < pattern[i]:
            return -1
        if text[start + i] > pattern[i]:
            return 1
        i += 1

    if i == len(pattern):
        return 0

    return -1

def suffix_array_contains(text, sa, pattern):
    lo = 0
    hi = len(sa)

    while lo < hi:
        mid = (lo + hi) // 2
        cmp = compare_suffix_with_pattern(text, sa[mid], pattern)

        if cmp < 0:
            lo = mid + 1
        else:
            hi = mid

    if lo == len(sa):
        return False

    return compare_suffix_with_pattern(text, sa[lo], pattern) == 0
func compareSuffixWithPattern(text string, start int, pattern string) int {
	i := 0

	for start+i < len(text) && i < len(pattern) {
		if text[start+i] < pattern[i] {
			return -1
		}
		if text[start+i] > pattern[i] {
			return 1
		}
		i++
	}

	if i == len(pattern) {
		return 0
	}

	return -1
}

func SuffixArrayContains(text string, sa []int, pattern string) bool {
	lo, hi := 0, len(sa)

	for lo < hi {
		mid := lo + (hi-lo)/2
		cmp := compareSuffixWithPattern(text, sa[mid], pattern)

		if cmp < 0 {
			lo = mid + 1
		} else {
			hi = mid
		}
	}

	if lo == len(sa) {
		return false
	}

	return compareSuffixWithPattern(text, sa[lo], pattern) == 0
}