16.14 Longest Common Substring

Given two strings, find the longest contiguous block of characters that appears in both.

16.14 Longest Common Substring

Problem

Given two strings, find the longest contiguous block of characters that appears in both.

For example:

A = banana
B = ananas

The longest common substring is:

anana

This differs from the longest common subsequence. A substring must be contiguous. A subsequence may skip characters.

substring:   consecutive characters
subsequence: characters in order, with gaps allowed

The challenge is to find the longest shared contiguous region efficiently.

Solution

Use dynamic programming.

Define:

dp[i][j]

as the length of the longest common suffix of:

A[0..i)
B[0..j)

That means dp[i][j] answers:

How many characters match ending at A[i-1] and B[j-1]?

If the characters match:

A[i-1] = B[j-1]

then the common suffix extends by one:

dp[i][j] = dp[i-1][j-1] + 1

If they differ, the common suffix ends:

dp[i][j] = 0

The answer is the maximum value anywhere in the table.

Why the State Uses Suffixes

For substrings, contiguity matters.

Suppose:

A = abcd
B = xbcy

The characters:

b
c

match contiguously, so the common substring is:

bc

When c matches c, the previous characters must also have matched immediately before them. That is why the recurrence uses only the diagonal cell:

dp[i-1][j-1]

Unlike edit distance, there is no insert/delete/substitute choice. A mismatch resets the current contiguous match to zero.

Example

Compare:

A = banana
B = ananas

The shared substring:

anana

appears at:

banana
 ^^^^^

ananas
^^^^^

The dynamic programming table contains increasing diagonal values:

a  n  a  n  a
1  2  3  4  5

The maximum value is:

5

So the longest common substring has length 5.

Recovering the Substring

The maximum value gives the length. To recover the substring, also store where the maximum ended.

If:

bestLength = 5
bestEnd    = 6

in string A, then the substring is:

A[bestEnd - bestLength .. bestEnd)

For banana, that gives:

anana

Lean Sketch

A direct implementation uses a two-dimensional table.

def longestCommonSubstring
  (a b : String)
  : String :=
by
  -- Convert strings to arrays of characters.
  -- Fill dp[i][j].
  -- Track best length and ending position.
  sorry

For algorithm design, the recurrence is the important part:

if a[i-1] = b[j-1]:
    dp[i][j] = dp[i-1][j-1] + 1
else:
    dp[i][j] = 0

The table is filled row by row.

Space Optimization

The recurrence depends only on the previous row.

Therefore the full table is unnecessary if you only need the answer.

Use two rows:

previous
current

or one row updated carefully from right to left.

Memory drops from:

O(nm)

to:

O(min(n,m))

The runtime remains:

O(nm)

Suffix Automaton Alternative

For large strings, dynamic programming may be too expensive.

A suffix automaton gives a linear-time approach.

Build a suffix automaton for A. Then scan B through the automaton, maintaining the length of the current match.

When a transition exists, extend the match. When it fails, follow suffix links until a valid transition is found.

This computes the longest common substring in:

O(n + m)

time.

The suffix automaton version is more complex to implement, but it is the preferred approach when both strings are large.

Suffix Array Alternative

Another approach is to concatenate the strings with a separator:

A + "$" + B

Build a suffix array and LCP array.

The longest common substring is the largest LCP value between adjacent suffixes that originate from different strings.

Example:

banana$ananas

Adjacent suffixes from different sides reveal shared prefixes. The maximum such prefix is the answer.

This method is useful when suffix arrays are already available.

Complexity Analysis

Let:

Symbol Meaning
n Length of first string
m Length of second string
Method Time Space
Dynamic programming O(nm) O(nm)
DP with rolling rows O(nm) O(min(n,m))
Suffix automaton O(n + m) O(n)
Suffix array plus LCP O((n+m) log(n+m)) typical O(n+m)

Correctness

The dynamic programming state records the exact length of the longest common suffix ending at each pair of positions.

If the current characters match, any common substring ending there must extend the common suffix ending at the previous pair of positions.

If the characters differ, no nonempty common substring can end at that pair, so the value is zero.

Taking the maximum over all positions therefore yields the longest common substring.

Common Pitfalls

Do not confuse longest common substring with longest common subsequence.

For:

A = abcdef
B = ace

the longest common subsequence is:

ace

but the longest common substring has length 1.

Reset the DP cell to zero on mismatch. Carrying values from neighboring cells solves a different problem.

Use a separator character that cannot occur in either input when using suffix arrays.

Define string indexing carefully for Unicode text.

Takeaway

Longest common substring is the contiguous version of string similarity. The dynamic programming solution is simple and reliable for moderate inputs. Suffix automata and suffix arrays provide faster alternatives when the strings are large or when many related queries must be answered.