16.9 Longest Common Prefix (LCP) Arrays
In the previous recipe, you built a suffix array and used it to perform efficient substring searches.
16.9 Longest Common Prefix (LCP) Arrays
Problem
In the previous recipe, you built a suffix array and used it to perform efficient substring searches.
Consider the suffix array for:
banana
| Rank | Position | Suffix |
|---|---|---|
| 0 | 5 | a |
| 1 | 3 | ana |
| 2 | 1 | anana |
| 3 | 0 | banana |
| 4 | 4 | na |
| 5 | 2 | nana |
Suppose you want to answer questions such as:
- What is the longest repeated substring?
- How many distinct substrings exist?
- Which suffixes share the longest common prefix?
- What is the longest common substring between related texts?
Using only the suffix array, these questions require repeated character comparisons.
The challenge is to avoid recomputing the same prefix matches over and over again.
Solution
The Longest Common Prefix (LCP) array stores the length of the common prefix between adjacent suffixes in suffix-array order.
For:
banana
the sorted suffixes are:
a
ana
anana
banana
na
nana
Comparing adjacent pairs:
a vs ana → 1
ana vs anana → 3
anana vs banana → 0
banana vs na → 0
na vs nana → 2
The LCP array becomes:
[0, 1, 3, 0, 0, 2]
The first value is conventionally zero because no preceding suffix exists.
Once computed, the LCP array provides information that would otherwise require repeated string comparisons.
Why Adjacent Suffixes Matter
Suppose two suffixes share a long prefix:
anana
ana
They appear next to each other in sorted order.
This is not an accident.
Lexicographic sorting guarantees that suffixes with similar prefixes become neighbors.
As a result, many string-analysis problems reduce to examining adjacent suffixes and their LCP values.
This observation is one of the key insights behind modern text indexing.
Visualizing the Array
For:
banana
we have:
| Rank | Suffix | LCP |
|---|---|---|
| 0 | a | 0 |
| 1 | ana | 1 |
| 2 | anana | 3 |
| 3 | banana | 0 |
| 4 | na | 0 |
| 5 | nana | 2 |
Notice the large value:
3
between:
ana
anana
This immediately reveals a repeated substring:
ana
Without the LCP array, this information would require explicit comparison.
Naive Construction
The simplest construction compares neighboring suffixes directly.
def commonPrefixLength
(a b : String)
: Nat :=
by
-- Compare characters until mismatch.
sorry
Then:
def naiveLCP
(text : String)
(sa : Array Nat)
: Array Nat :=
by
-- Compare adjacent suffixes.
sorry
For each adjacent pair:
suffix[i-1]
suffix[i]
compute the common prefix length.
Although easy to understand, this approach may require:
O(n²)
character comparisons.
Example
Text:
mississippi
Some sorted suffixes:
i
ippi
issippi
ississippi
mississippi
pi
ppi
...
LCP values:
0
1
1
4
0
1
0
...
The value:
4
indicates that:
issippi
ississippi
share:
issi
as a common prefix.
Kasai's Algorithm
A remarkable result is that the entire LCP array can be constructed in linear time.
The most widely used method is Kasai's algorithm.
The key idea is simple:
If two suffixes share a common prefix of length:
k
then the suffixes beginning one character later must share at least:
k - 1
characters.
Example:
banana
anana
share:
0
characters.
But:
anana
nana
do not need to restart comparisons from zero every time.
Previously computed information can be reused.
This idea transforms construction from quadratic to linear.
Rank Array
Kasai's algorithm first constructs an inverse mapping.
Given:
SA = [5, 3, 1, 0, 4, 2]
the rank array becomes:
Rank[5] = 0
Rank[3] = 1
Rank[1] = 2
Rank[0] = 3
Rank[4] = 4
Rank[2] = 5
Or:
[3, 2, 5, 1, 4, 0]
The rank array answers:
Where does this suffix appear in sorted order?
in constant time.
Core Insight
Suppose:
suffix(i)
has LCP:
k
with its predecessor.
Then:
suffix(i+1)
already shares at least:
k-1
characters with its own predecessor.
Therefore the next comparison begins from:
k-1
instead of:
0
This dramatically reduces work.
Across the entire construction, each character is compared only a small number of times.
Lean Sketch
A simplified structure:
def kasai
(text : String)
(sa : Array Nat)
: Array Nat :=
by
-- Build rank array.
-- Reuse previous prefix lengths.
-- Produce LCP values.
sorry
The implementation details are secondary.
The important invariant is:
Current LCP ≥ Previous LCP - 1
This allows previously discovered matches to be carried forward.
Longest Repeated Substring
One of the most useful applications is finding repeated substrings.
Observation:
The maximum value in the LCP array equals the length of the longest repeated substring.
Example:
banana
LCP:
[0, 1, 3, 0, 0, 2]
Maximum:
3
Corresponding substring:
ana
No repeated substring is longer.
The answer appears immediately after LCP construction.
Counting Distinct Substrings
A string of length:
n
contains:
n(n+1)/2
total substrings.
Many are duplicates.
The LCP array tells us how many duplicates exist.
Formula:
Distinct Substrings =
n(n+1)/2 - ΣLCP
Example:
banana
Total substrings:
21
Sum of LCP values:
6
Distinct substrings:
15
Without LCP, this computation is significantly more difficult.
Range Queries
The LCP array also supports efficient suffix comparisons.
Suppose you want the common prefix length between:
suffix(i)
suffix(j)
The answer becomes a range minimum query over the LCP array.
This connection leads directly to advanced text-indexing structures.
Many modern suffix-array implementations augment LCP with:
- Sparse tables
- Segment trees
- RMQ structures
to answer such queries in constant or logarithmic time.
Complexity Analysis
Let:
| Symbol | Meaning |
|---|---|
| n | Text length |
| Operation | Complexity |
|---|---|
| Naive construction | O(n²) |
| Kasai construction | O(n) |
| Longest repeated substring | O(n) |
| Distinct substring count | O(n) |
| Storage | O(n) |
The linear construction cost is one of the reasons suffix arrays are practical for large texts.
Correctness
For every position in suffix-array order, the LCP value records the exact common prefix length shared with the preceding suffix.
Kasai's algorithm never skips a possible match. It merely begins comparisons from information already known to be correct.
Every character that contributes to an LCP value is eventually counted, and every mismatch terminates the comparison at the correct point.
Therefore the resulting array contains exactly the longest common prefix lengths between adjacent suffixes.
Common Pitfalls
Do not confuse:
Text position
with:
Suffix-array rank
These are different coordinate systems.
Be careful with the first suffix. It has no predecessor, so its LCP value is conventionally zero.
Avoid recomputing common prefixes directly when Kasai's algorithm is available.
When using LCP for repeated substring analysis, remember that the maximum LCP value identifies a length. Additional information is needed to recover the actual substring.
Takeaway
A suffix array organizes suffixes. An LCP array measures how similar neighboring suffixes are. Together they form one of the most important combinations in string processing.
The suffix array provides order. The LCP array provides structure. Many seemingly difficult problems, including longest repeated substrings, distinct substring counting, suffix comparison, and advanced indexing, become straightforward once both structures are available. In practice, most serious suffix-array applications are built on top of the suffix-array-plus-LCP pair rather than the suffix array alone.