16.13 Edit Distance
Exact matching assumes that strings must be identical.
16.13 Edit Distance
Problem
Exact matching assumes that strings must be identical.
Consider:
kitten
sitting
An exact matcher reports failure immediately because the strings differ.
In many applications, however, similarity matters more than equality.
Examples include:
- Spell checking
- Search suggestions
- OCR correction
- DNA sequence alignment
- Plagiarism detection
- Data deduplication
- Natural language processing
The challenge is to measure how different two strings are and to do so efficiently.
Solution
Edit distance measures the minimum number of editing operations required to transform one string into another.
The most common form is the Levenshtein distance, which allows:
- Insertion
- Deletion
- Substitution
Each operation has cost 1.
For example:
kitten
sitten
requires:
k → s
one substitution.
Then:
sitten
sittin
requires:
e → i
another substitution.
Finally:
sittin
sitting
requires:
insert g
Total cost:
3
Therefore:
distance(kitten, sitting) = 3
Dynamic Programming Formulation
Suppose:
A = kitten
B = sitting
Define:
dp[i][j]
as the minimum edit distance between:
A[0..i)
and:
B[0..j)
The answer becomes:
dp[n][m]
where:
n = length(A)
m = length(B)
This formulation converts the problem into a grid.
Base Cases
Transforming an empty string into a string of length j requires:
j
insertions.
Therefore:
dp[0][j] = j
Similarly:
dp[i][0] = i
because deleting all characters requires i operations.
For example:
| A | B | Distance |
|---|---|---|
| "" | abc | 3 |
| abc | "" | 3 |
These boundaries initialize the table.
Recurrence Relation
At position:
dp[i][j]
three possibilities exist.
Delete
Remove a character from A.
dp[i-1][j] + 1
Insert
Insert a character into A.
dp[i][j-1] + 1
Substitute
Replace one character.
dp[i-1][j-1] + cost
where:
cost =
0 if characters match
1 otherwise
Therefore:
dp[i][j] =
min(
delete,
insert,
substitute
)
This single recurrence drives the entire algorithm.
Example
Compute:
cat
cut
The table begins:
| "" | c | u | t | |
|---|---|---|---|---|
| "" | 0 | 1 | 2 | 3 |
| c | 1 | ? | ? | ? |
| a | 2 | ? | ? | ? |
| t | 3 | ? | ? | ? |
Filling the table:
| "" | c | u | t | |
|---|---|---|---|---|
| "" | 0 | 1 | 2 | 3 |
| c | 1 | 0 | 1 | 2 |
| a | 2 | 1 | 1 | 2 |
| t | 3 | 2 | 2 | 1 |
Result:
distance(cat, cut) = 1
Only one substitution is required.
Visualizing the Grid
The table can be interpreted as a graph.
Each cell represents a state:
(i, j)
Transitions correspond to editing operations.
delete ↓
insert →
substitute ↘
Finding the edit distance becomes a shortest-path problem in this implicit graph.
This perspective is useful when extending the algorithm.
Lean Implementation
A straightforward dynamic programming implementation:
def editDistance
(a b : String)
: Nat :=
by
let n := a.length
let m := b.length
-- Build DP table.
-- Initialize boundaries.
-- Fill using recurrence.
sorry
The implementation details are routine once the recurrence is understood.
The important part is identifying the three possible operations.
Reconstructing the Edits
The final distance tells us the cost.
Often we also want the sequence of edits.
For:
kitten
sitting
one valid transformation is:
kitten
↓
sitten
↓
sittin
↓
sitting
Backtracking through the DP table recovers this path.
The process is similar to reconstructing a shortest path after running a graph algorithm.
Space Optimization
The recurrence uses only:
previous row
current row
at any moment.
Therefore the full table is unnecessary.
Instead of:
O(nm)
space, we can use:
O(min(n,m))
space.
This optimization is important when processing long strings.
The time complexity remains unchanged.
Damerau-Levenshtein Distance
Some applications treat adjacent transpositions as a single edit.
Example:
form
from
Standard edit distance:
2
(transpose requires delete + insert or two substitutions)
Damerau-Levenshtein:
1
because:
or
ro
is treated as a single operation.
This variant often produces better results for typographical errors.
Weighted Edit Distance
Not all edits have equal importance.
Example:
colour
color
Removing:
u
may be considered less significant than replacing several unrelated characters.
Weighted versions assign custom costs:
insert cost
delete cost
substitute cost
The dynamic programming framework remains identical.
Only the recurrence changes.
Applications
Spell Checking
Find dictionary entries with minimum edit distance.
algoritm
suggests:
algorithm
because the distance is small.
Search Suggestions
Search engines tolerate minor typing mistakes.
gogle
may match:
google
OCR Correction
Misread characters are repaired using nearby words.
DNA Alignment
Insertions, deletions, and mutations correspond naturally to edit operations.
Record Matching
Detect near-duplicate names:
Jon Smith
John Smith
Plagiarism Detection
Similarity measures often begin with edit-distance concepts.
Complexity Analysis
Let:
| Symbol | Meaning |
|---|---|
| n | Length of first string |
| m | Length of second string |
| Operation | Complexity |
|---|---|
| DP table construction | O(nm) |
| Space (full table) | O(nm) |
| Space (optimized) | O(min(n,m)) |
The quadratic runtime is acceptable for moderate strings but becomes expensive for very large inputs.
Later algorithms and indexing techniques address this limitation.
Edit Distance vs Longest Common Subsequence
These two problems are closely related.
For insertions and deletions only:
distance =
n + m - 2 × LCS
where:
LCS
is the length of the longest common subsequence.
Both problems rely on nearly identical dynamic programming structures.
The difference lies in what the recurrence measures.
Correctness
Every transformation from one string to another ends with one of three operations:
- Insert
- Delete
- Substitute (or match)
The recurrence considers all three possibilities and chooses the minimum cost.
Because every valid edit sequence corresponds to a path through the table, and every table entry records the optimal cost for its prefixes, the final value represents the minimum possible number of edits.
This is a classic optimal-substructure argument.
Common Pitfalls
Do not confuse substrings with subsequences.
Edit distance operates on complete prefixes during dynamic programming.
Be careful when indexing strings. Most implementation bugs arise from off-by-one errors in table construction.
Remember that edit distance is not always symmetric when custom weights are introduced.
For Unicode text, define whether operations apply to bytes, code points, or grapheme clusters.
Different choices produce different distances.
Takeaway
Edit distance replaces the binary notion of match versus mismatch with a quantitative notion of similarity. By modeling insertions, deletions, and substitutions as transitions in a dynamic programming table, it provides a principled way to compare strings that are almost, but not exactly, the same. This idea appears throughout information retrieval, computational biology, machine learning, and natural language processing, where imperfect matches are often more important than exact ones.