16.9 Longest Common Prefix (LCP) Arrays

Problem

In the previous recipe, you built a suffix array and used it to perform efficient substring searches.

Consider the suffix array for:

banana

Rank	Position	Suffix
0	5	a
1	3	ana
2	1	anana
3	0	banana
4	4	na
5	2	nana

Suppose you want to answer questions such as:

What is the longest repeated substring?
How many distinct substrings exist?
Which suffixes share the longest common prefix?
What is the longest common substring between related texts?

Using only the suffix array, these questions require repeated character comparisons.

The challenge is to avoid recomputing the same prefix matches over and over again.

Solution

The Longest Common Prefix (LCP) array stores the length of the common prefix between adjacent suffixes in suffix-array order.

For:

banana

the sorted suffixes are:

a
ana
anana
banana
na
nana

Comparing adjacent pairs:

a      vs ana      → 1
ana    vs anana    → 3
anana  vs banana   → 0
banana vs na       → 0
na     vs nana     → 2

The LCP array becomes:

[0, 1, 3, 0, 0, 2]

The first value is conventionally zero because no preceding suffix exists.

Once computed, the LCP array provides information that would otherwise require repeated string comparisons.

Why Adjacent Suffixes Matter

Suppose two suffixes share a long prefix:

anana
ana

They appear next to each other in sorted order.

This is not an accident.

Lexicographic sorting guarantees that suffixes with similar prefixes become neighbors.

As a result, many string-analysis problems reduce to examining adjacent suffixes and their LCP values.

This observation is one of the key insights behind modern text indexing.

Visualizing the Array

For:

banana

we have:

Rank	Suffix	LCP
0	a	0
1	ana	1
2	anana	3
3	banana	0
4	na	0
5	nana	2

Notice the large value:

between:

ana
anana

This immediately reveals a repeated substring:

ana

Without the LCP array, this information would require explicit comparison.

Naive Construction

The simplest construction compares neighboring suffixes directly.

def commonPrefixLength
  (a b : String)
  : Nat :=
by
  -- Compare characters until mismatch.
  sorry

Then:

def naiveLCP
  (text : String)
  (sa : Array Nat)
  : Array Nat :=
by
  -- Compare adjacent suffixes.
  sorry

For each adjacent pair:

suffix[i-1]
suffix[i]

compute the common prefix length.

Although easy to understand, this approach may require:

O(n²)

character comparisons.

Example

Text:

mississippi

Some sorted suffixes:

i
ippi
issippi
ississippi
mississippi
pi
ppi
...

LCP values:

The value:

indicates that:

issippi
ississippi

issi

as a common prefix.

Kasai's Algorithm

A remarkable result is that the entire LCP array can be constructed in linear time.

The most widely used method is Kasai's algorithm.

The key idea is simple:

If two suffixes share a common prefix of length:

then the suffixes beginning one character later must share at least:

k - 1

characters.

Example:

banana
anana

characters.

But:

anana
nana

do not need to restart comparisons from zero every time.

Previously computed information can be reused.

This idea transforms construction from quadratic to linear.

Rank Array

Kasai's algorithm first constructs an inverse mapping.

Given:

SA = [5, 3, 1, 0, 4, 2]

the rank array becomes:

Rank[5] = 0
Rank[3] = 1
Rank[1] = 2
Rank[0] = 3
Rank[4] = 4
Rank[2] = 5

Or:

[3, 2, 5, 1, 4, 0]

The rank array answers:

Where does this suffix appear in sorted order?

in constant time.

Core Insight

Suppose:

suffix(i)

has LCP:

with its predecessor.

Then:

suffix(i+1)

already shares at least:

k-1

characters with its own predecessor.

Therefore the next comparison begins from:

k-1

instead of:

This dramatically reduces work.

Across the entire construction, each character is compared only a small number of times.

Lean Sketch

A simplified structure:

def kasai
  (text : String)
  (sa : Array Nat)
  : Array Nat :=
by
  -- Build rank array.
  -- Reuse previous prefix lengths.
  -- Produce LCP values.
  sorry

The implementation details are secondary.

The important invariant is:

Current LCP ≥ Previous LCP - 1

This allows previously discovered matches to be carried forward.

Longest Repeated Substring

One of the most useful applications is finding repeated substrings.

Observation:

The maximum value in the LCP array equals the length of the longest repeated substring.

Example:

banana

LCP:

[0, 1, 3, 0, 0, 2]

Maximum:

Corresponding substring:

ana

No repeated substring is longer.

The answer appears immediately after LCP construction.

Counting Distinct Substrings

A string of length:

contains:

n(n+1)/2

total substrings.

Many are duplicates.

The LCP array tells us how many duplicates exist.

Formula:

Distinct Substrings =
n(n+1)/2 - ΣLCP

Example:

banana

Total substrings:

Sum of LCP values:

Distinct substrings:

Without LCP, this computation is significantly more difficult.

Range Queries

The LCP array also supports efficient suffix comparisons.

Suppose you want the common prefix length between:

suffix(i)
suffix(j)

The answer becomes a range minimum query over the LCP array.

This connection leads directly to advanced text-indexing structures.

Many modern suffix-array implementations augment LCP with:

Sparse tables
Segment trees
RMQ structures

to answer such queries in constant or logarithmic time.

Complexity Analysis

Let:

Symbol	Meaning
n	Text length

Operation	Complexity
Naive construction	O(n²)
Kasai construction	O(n)
Longest repeated substring	O(n)
Distinct substring count	O(n)
Storage	O(n)

The linear construction cost is one of the reasons suffix arrays are practical for large texts.

Correctness

For every position in suffix-array order, the LCP value records the exact common prefix length shared with the preceding suffix.

Kasai's algorithm never skips a possible match. It merely begins comparisons from information already known to be correct.

Every character that contributes to an LCP value is eventually counted, and every mismatch terminates the comparison at the correct point.

Therefore the resulting array contains exactly the longest common prefix lengths between adjacent suffixes.

Common Pitfalls

Do not confuse:

Text position

with:

Suffix-array rank

These are different coordinate systems.

Be careful with the first suffix. It has no predecessor, so its LCP value is conventionally zero.

Avoid recomputing common prefixes directly when Kasai's algorithm is available.

When using LCP for repeated substring analysis, remember that the maximum LCP value identifies a length. Additional information is needed to recover the actual substring.

Takeaway

A suffix array organizes suffixes. An LCP array measures how similar neighboring suffixes are. Together they form one of the most important combinations in string processing.

The suffix array provides order. The LCP array provides structure. Many seemingly difficult problems, including longest repeated substrings, distinct substring counting, suffix comparison, and advanced indexing, become straightforward once both structures are available. In practice, most serious suffix-array applications are built on top of the suffix-array-plus-LCP pair rather than the suffix array alone.