Search for a pattern in a text by binary searching the lexicographically sorted suffix array.
Suffix array binary search finds a pattern in a text by searching over the sorted list of suffixes.
A suffix array stores the starting positions of all suffixes of a text in lexicographic order. Since the suffixes are sorted, binary search can locate the range of suffixes that begin with a given pattern.
Problem
Given a text , its suffix array SA, and a pattern , determine whether occurs in .
Return true if at least one suffix starts with , otherwise false.
Structure
For a text of length , the suffix array is an array of indices:
such that
Each entry points to one suffix of the text.
Algorithm
Binary search over suffixes. At each midpoint, compare the pattern with the suffix beginning at SA[mid].
suffix_array_contains(T, SA, P):
lo = 0
hi = length(SA)
while lo < hi:
mid = (lo + hi) // 2
if suffix_at(T, SA[mid]) < P:
lo = mid + 1
else:
hi = mid
if lo == length(SA):
return false
return suffix_at(T, SA[lo]) starts with PExample
Text:
Suffixes in sorted order:
| suffix array index | text index | suffix |
|---|---|---|
| 0 | 5 | a |
| 1 | 3 | ana |
| 2 | 1 | banana |
| 3 | 0 | banana |
| 4 | 4 | na |
| 5 | 2 | nana |
For the usual suffixes of "banana", the corrected sorted order is:
| suffix array index | text index | suffix |
|---|---|---|
| 0 | 5 | a |
| 1 | 3 | ana |
| 2 | 1 | anana |
| 3 | 0 | banana |
| 4 | 4 | na |
| 5 | 2 | nana |
Search for pattern "ana".
The lower bound over suffixes returns the suffix "ana" at text index 3. Since that suffix starts with "ana", the pattern occurs.
Correctness
The suffix array is sorted lexicographically. All suffixes that begin with pattern form one contiguous interval in this sorted order.
Binary search finds the first suffix that is not lexicographically smaller than . If that suffix starts with , then occurs in the text. If it does not, no later suffix can start with without violating lexicographic order of the lower bound.
Therefore the algorithm returns true exactly when the pattern appears as a substring of the text.
Complexity
Let and .
Each binary search step may compare up to characters.
| operation | time |
|---|---|
| search | |
| search with LCP acceleration |
Space used by the suffix array:
Extra space for iterative search:
When to Use
Suffix array binary search is useful when:
- many substring queries are run against the same text
- memory use should be lower than a suffix tree
- the text is static
- sorted suffix order is useful for related queries
Compared with suffix trees, suffix arrays are simpler and more compact, but basic search costs an extra logarithmic factor.
Implementation
def compare_suffix_with_pattern(text, start, pattern):
i = 0
while start + i < len(text) and i < len(pattern):
if text[start + i] < pattern[i]:
return -1
if text[start + i] > pattern[i]:
return 1
i += 1
if i == len(pattern):
return 0
return -1
def suffix_array_contains(text, sa, pattern):
lo = 0
hi = len(sa)
while lo < hi:
mid = (lo + hi) // 2
cmp = compare_suffix_with_pattern(text, sa[mid], pattern)
if cmp < 0:
lo = mid + 1
else:
hi = mid
if lo == len(sa):
return False
return compare_suffix_with_pattern(text, sa[lo], pattern) == 0func compareSuffixWithPattern(text string, start int, pattern string) int {
i := 0
for start+i < len(text) && i < len(pattern) {
if text[start+i] < pattern[i] {
return -1
}
if text[start+i] > pattern[i] {
return 1
}
i++
}
if i == len(pattern) {
return 0
}
return -1
}
func SuffixArrayContains(text string, sa []int, pattern string) bool {
lo, hi := 0, len(sa)
for lo < hi {
mid := lo + (hi-lo)/2
cmp := compareSuffixWithPattern(text, sa[mid], pattern)
if cmp < 0 {
lo = mid + 1
} else {
hi = mid
}
}
if lo == len(sa) {
return false
}
return compareSuffixWithPattern(text, sa[lo], pattern) == 0
}