13.8 Longest Common Subsequence
The Longest Common Subsequence (LCS) problem is one of the most influential dynamic programming problems ever studied.
13.8 Longest Common Subsequence
The Longest Common Subsequence (LCS) problem is one of the most influential dynamic programming problems ever studied. It appears in version control systems, file comparison tools, DNA sequence analysis, natural language processing, plagiarism detection, and data synchronization systems.
Unlike knapsack, which focuses on resource allocation, LCS focuses on structural similarity. Given two sequences, the goal is to identify the longest sequence of elements that appears in both while preserving relative order.
LCS is particularly valuable because it introduces a recurring dynamic programming pattern: solving problems involving two sequences simultaneously. Many advanced string algorithms, including edit distance, sequence alignment, and diff engines, are direct descendants of the ideas developed here.
Problem
Given two sequences:
A = ABCBDAB
B = BDCABA
Find the longest subsequence that appears in both sequences.
A subsequence is formed by removing zero or more elements without changing the order of the remaining elements.
For example:
BCBA
is a valid subsequence of both strings.
The answer length is:
4
Subsequence Versus Substring
A common source of confusion is the difference between subsequences and substrings.
Subsequence:
A B C D E
may become:
A C E
Characters may be skipped.
Substring:
A B C D E
may become:
B C D
Characters must remain contiguous.
LCS deals with subsequences.
This flexibility makes the problem significantly more challenging.
Why Brute Force Fails
Suppose a string contains:
n
characters.
Every character may either:
- appear in a subsequence
- not appear in a subsequence
The number of possible subsequences becomes:
$$ 2^n $$
For two strings, comparing all subsequences quickly becomes infeasible.
A more systematic approach is required.
Designing the State
The key observation is that the problem becomes smaller when prefixes are considered.
Define:
$$ dp[i][j] $$
as:
The length of the longest common subsequence between the first
icharacters of string A and the firstjcharacters of string B.
For example:
dp[4][3]
asks:
Longest common subsequence between:
A[0...3]
B[0...2]
This creates a family of smaller problems that closely resemble the original problem.
Understanding the Final Decision
Consider the last characters:
A[i-1]
B[j-1]
Two situations are possible.
Characters Match
Suppose:
A[i-1] == B[j-1]
Example:
ABCD
ABCD
Both strings end with:
D
Any common subsequence can safely include this character.
The problem reduces to:
first i-1 characters
first j-1 characters
Recurrence:
$$ dp[i][j] = dp[i-1][j-1] + 1 $$
The matching character extends an existing solution.
Characters Differ
Suppose:
A[i-1] != B[j-1]
Example:
ABCD
ABCE
The final character cannot belong to both sequences simultaneously.
One of them must be discarded.
Two possibilities exist:
Remove the last character from A:
$$ dp[i-1][j] $$
Remove the last character from B:
$$ dp[i][j-1] $$
Choose the better result:
$$ dp[i][j] = \max ( dp[i-1][j], dp[i][j-1] ) $$
This exhausts all possibilities.
Complete Recurrence
Combining both cases:
$$ dp[i][j] = \begin{cases} dp[i-1][j-1]+1 & \text{if } A[i-1]=B[j-1] \\n\max(dp[i-1][j],dp[i][j-1]) & \text{otherwise} \end{cases} $$
This recurrence completely characterizes the problem.
Base Cases
If either string is empty:
no common subsequence exists
Therefore:
$$ dp[i][0]=0 $$
and
$$ dp[0][j]=0 $$
for all valid values of (i) and (j).
The first row and first column of the table contain zeros.
Building the Table
Consider:
A = ABCD
B = ACBD
The table begins:
| Ø | A | C | B | D | |
|---|---|---|---|---|---|
| Ø | 0 | 0 | 0 | 0 | 0 |
| A | 0 | 1 | 1 | 1 | 1 |
| B | 0 | 1 | 1 | 2 | 2 |
| C | 0 | 1 | 2 | 2 | 2 |
| D | 0 | 1 | 2 | 2 | 3 |
The final answer appears in:
$$ dp[n][m] $$
which equals:
$$ 3 $$
One valid LCS is:
ABD
Implementation
func LCS(a, b string) int {
n := len(a)
m := len(b)
dp := make([][]int, n+1)
for i := range dp {
dp[i] = make([]int, m+1)
}
for i := 1; i <= n; i++ {
for j := 1; j <= m; j++ {
if a[i-1] == b[j-1] {
dp[i][j] =
dp[i-1][j-1] + 1
} else {
dp[i][j] =
max(
dp[i-1][j],
dp[i][j-1],
)
}
}
}
return dp[n][m]
}
The code follows the mathematical recurrence directly.
Reconstructing the Subsequence
Often the actual subsequence is more useful than its length.
Start from:
$$ dp[n][m] $$
and walk backward.
Match Case
If:
A[i-1] == B[j-1]
then:
character belongs to the answer
Move diagonally:
(i-1, j-1)
Upward Move
If:
$$ dp[i-1][j]
dp[i][j-1] $$
move upward:
(i-1, j)
Left Move
Otherwise move left:
(i, j-1)
Collected characters, when reversed, produce one valid longest common subsequence.
Visualizing Dependencies
Each cell depends on:
up
left
upper-left
Graphically:
↖ ↑
← current
This dependency pattern appears repeatedly in sequence alignment algorithms.
Whenever two sequences are compared, a similar structure often emerges.
Space Optimization
Observe the recurrence.
Each row depends only on:
- current row
- previous row
The entire table is unnecessary if only the length is required.
Instead:
previous := make([]int, m+1)
current := make([]int, m+1)
Memory becomes:
$$ O(m) $$
instead of:
$$ O(nm) $$
This optimization is common when processing very large strings.
Applications
LCS appears in many practical systems.
Version Control
Tools such as diff and merge systems compare file versions using concepts closely related to LCS.
Bioinformatics
DNA and protein sequences are compared through alignment algorithms derived from LCS principles.
Data Synchronization
Synchronization engines identify shared structure between versions of documents and datasets.
Plagiarism Detection
Common subsequences can reveal structural similarity between texts.
Natural Language Processing
Sequence comparison frequently appears in translation and text analysis pipelines.
Common Mistakes
Confusing Subsequence and Substring
The recurrence for longest common substring is completely different.
Always verify the problem statement carefully.
Incorrect Indexing
The table uses:
n+1
m+1
dimensions.
Mixing character indices and DP indices creates frequent off-by-one errors.
Wrong Match Logic
The match case uses:
$$ dp[i-1][j-1] $$
not:
$$ dp[i-1][j] $$
or:
$$ dp[i][j-1] $$
A small indexing mistake changes the algorithm entirely.
Forgetting Reconstruction Requirements
Space optimization removes information needed for reconstruction.
If the actual subsequence is required, retain sufficient state.
Complexity Analysis
State count:
$$ (n+1)(m+1) $$
Each state performs constant work.
Time complexity:
$$ O(nm) $$
Memory complexity:
$$ O(nm) $$
or:
$$ O(m) $$
with row compression.
Takeaway
The Longest Common Subsequence problem demonstrates how dynamic programming can transform an exponential search problem into a polynomial-time solution by identifying overlapping subproblems among string prefixes. The resulting two-dimensional state space, dependency structure, and reconstruction process form the foundation for many important sequence-processing algorithms. Once you understand LCS, more advanced problems such as edit distance, sequence alignment, and diff generation become natural extensions of the same core ideas.