13.8 Longest Common Subsequence

The Longest Common Subsequence (LCS) problem is one of the most influential dynamic programming problems ever studied.

13.8 Longest Common Subsequence

The Longest Common Subsequence (LCS) problem is one of the most influential dynamic programming problems ever studied. It appears in version control systems, file comparison tools, DNA sequence analysis, natural language processing, plagiarism detection, and data synchronization systems.

Unlike knapsack, which focuses on resource allocation, LCS focuses on structural similarity. Given two sequences, the goal is to identify the longest sequence of elements that appears in both while preserving relative order.

LCS is particularly valuable because it introduces a recurring dynamic programming pattern: solving problems involving two sequences simultaneously. Many advanced string algorithms, including edit distance, sequence alignment, and diff engines, are direct descendants of the ideas developed here.

Problem

Given two sequences:

A = ABCBDAB
B = BDCABA

Find the longest subsequence that appears in both sequences.

A subsequence is formed by removing zero or more elements without changing the order of the remaining elements.

For example:

BCBA

is a valid subsequence of both strings.

The answer length is:

4

Subsequence Versus Substring

A common source of confusion is the difference between subsequences and substrings.

Subsequence:

A B C D E

may become:

A C E

Characters may be skipped.

Substring:

A B C D E

may become:

B C D

Characters must remain contiguous.

LCS deals with subsequences.

This flexibility makes the problem significantly more challenging.

Why Brute Force Fails

Suppose a string contains:

n

characters.

Every character may either:

  • appear in a subsequence
  • not appear in a subsequence

The number of possible subsequences becomes:

$$ 2^n $$

For two strings, comparing all subsequences quickly becomes infeasible.

A more systematic approach is required.

Designing the State

The key observation is that the problem becomes smaller when prefixes are considered.

Define:

$$ dp[i][j] $$

as:

The length of the longest common subsequence between the first i characters of string A and the first j characters of string B.

For example:

dp[4][3]

asks:

Longest common subsequence between:

A[0...3]
B[0...2]

This creates a family of smaller problems that closely resemble the original problem.

Understanding the Final Decision

Consider the last characters:

A[i-1]
B[j-1]

Two situations are possible.

Characters Match

Suppose:

A[i-1] == B[j-1]

Example:

ABCD
ABCD

Both strings end with:

D

Any common subsequence can safely include this character.

The problem reduces to:

first i-1 characters
first j-1 characters

Recurrence:

$$ dp[i][j] = dp[i-1][j-1] + 1 $$

The matching character extends an existing solution.

Characters Differ

Suppose:

A[i-1] != B[j-1]

Example:

ABCD
ABCE

The final character cannot belong to both sequences simultaneously.

One of them must be discarded.

Two possibilities exist:

Remove the last character from A:

$$ dp[i-1][j] $$

Remove the last character from B:

$$ dp[i][j-1] $$

Choose the better result:

$$ dp[i][j] = \max ( dp[i-1][j], dp[i][j-1] ) $$

This exhausts all possibilities.

Complete Recurrence

Combining both cases:

$$ dp[i][j] = \begin{cases} dp[i-1][j-1]+1 & \text{if } A[i-1]=B[j-1] \\n\max(dp[i-1][j],dp[i][j-1]) & \text{otherwise} \end{cases} $$

This recurrence completely characterizes the problem.

Base Cases

If either string is empty:

no common subsequence exists

Therefore:

$$ dp[i][0]=0 $$

and

$$ dp[0][j]=0 $$

for all valid values of (i) and (j).

The first row and first column of the table contain zeros.

Building the Table

Consider:

A = ABCD
B = ACBD

The table begins:

Ø A C B D
Ø 0 0 0 0 0
A 0 1 1 1 1
B 0 1 1 2 2
C 0 1 2 2 2
D 0 1 2 2 3

The final answer appears in:

$$ dp[n][m] $$

which equals:

$$ 3 $$

One valid LCS is:

ABD

Implementation

func LCS(a, b string) int {

    n := len(a)
    m := len(b)

    dp := make([][]int, n+1)

    for i := range dp {
        dp[i] = make([]int, m+1)
    }

    for i := 1; i <= n; i++ {

        for j := 1; j <= m; j++ {

            if a[i-1] == b[j-1] {

                dp[i][j] =
                    dp[i-1][j-1] + 1

            } else {

                dp[i][j] =
                    max(
                        dp[i-1][j],
                        dp[i][j-1],
                    )
            }
        }
    }

    return dp[n][m]
}

The code follows the mathematical recurrence directly.

Reconstructing the Subsequence

Often the actual subsequence is more useful than its length.

Start from:

$$ dp[n][m] $$

and walk backward.

Match Case

If:

A[i-1] == B[j-1]

then:

character belongs to the answer

Move diagonally:

(i-1, j-1)

Upward Move

If:

$$ dp[i-1][j]

dp[i][j-1] $$

move upward:

(i-1, j)

Left Move

Otherwise move left:

(i, j-1)

Collected characters, when reversed, produce one valid longest common subsequence.

Visualizing Dependencies

Each cell depends on:

up
left
upper-left

Graphically:

↖  ↑

←  current

This dependency pattern appears repeatedly in sequence alignment algorithms.

Whenever two sequences are compared, a similar structure often emerges.

Space Optimization

Observe the recurrence.

Each row depends only on:

  • current row
  • previous row

The entire table is unnecessary if only the length is required.

Instead:

previous := make([]int, m+1)
current  := make([]int, m+1)

Memory becomes:

$$ O(m) $$

instead of:

$$ O(nm) $$

This optimization is common when processing very large strings.

Applications

LCS appears in many practical systems.

Version Control

Tools such as diff and merge systems compare file versions using concepts closely related to LCS.

Bioinformatics

DNA and protein sequences are compared through alignment algorithms derived from LCS principles.

Data Synchronization

Synchronization engines identify shared structure between versions of documents and datasets.

Plagiarism Detection

Common subsequences can reveal structural similarity between texts.

Natural Language Processing

Sequence comparison frequently appears in translation and text analysis pipelines.

Common Mistakes

Confusing Subsequence and Substring

The recurrence for longest common substring is completely different.

Always verify the problem statement carefully.

Incorrect Indexing

The table uses:

n+1
m+1

dimensions.

Mixing character indices and DP indices creates frequent off-by-one errors.

Wrong Match Logic

The match case uses:

$$ dp[i-1][j-1] $$

not:

$$ dp[i-1][j] $$

or:

$$ dp[i][j-1] $$

A small indexing mistake changes the algorithm entirely.

Forgetting Reconstruction Requirements

Space optimization removes information needed for reconstruction.

If the actual subsequence is required, retain sufficient state.

Complexity Analysis

State count:

$$ (n+1)(m+1) $$

Each state performs constant work.

Time complexity:

$$ O(nm) $$

Memory complexity:

$$ O(nm) $$

or:

$$ O(m) $$

with row compression.

Takeaway

The Longest Common Subsequence problem demonstrates how dynamic programming can transform an exponential search problem into a polynomial-time solution by identifying overlapping subproblems among string prefixes. The resulting two-dimensional state space, dependency structure, and reconstruction process form the foundation for many important sequence-processing algorithms. Once you understand LCS, more advanced problems such as edit distance, sequence alignment, and diff generation become natural extensions of the same core ideas.