16.13 Edit Distance

Exact matching assumes that strings must be identical.

16.13 Edit Distance

Problem

Exact matching assumes that strings must be identical.

Consider:

kitten
sitting

An exact matcher reports failure immediately because the strings differ.

In many applications, however, similarity matters more than equality.

Examples include:

  • Spell checking
  • Search suggestions
  • OCR correction
  • DNA sequence alignment
  • Plagiarism detection
  • Data deduplication
  • Natural language processing

The challenge is to measure how different two strings are and to do so efficiently.

Solution

Edit distance measures the minimum number of editing operations required to transform one string into another.

The most common form is the Levenshtein distance, which allows:

  1. Insertion
  2. Deletion
  3. Substitution

Each operation has cost 1.

For example:

kitten
sitten

requires:

k → s

one substitution.

Then:

sitten
sittin

requires:

e → i

another substitution.

Finally:

sittin
sitting

requires:

insert g

Total cost:

3

Therefore:

distance(kitten, sitting) = 3

Dynamic Programming Formulation

Suppose:

A = kitten
B = sitting

Define:

dp[i][j]

as the minimum edit distance between:

A[0..i)

and:

B[0..j)

The answer becomes:

dp[n][m]

where:

n = length(A)
m = length(B)

This formulation converts the problem into a grid.

Base Cases

Transforming an empty string into a string of length j requires:

j

insertions.

Therefore:

dp[0][j] = j

Similarly:

dp[i][0] = i

because deleting all characters requires i operations.

For example:

A B Distance
"" abc 3
abc "" 3

These boundaries initialize the table.

Recurrence Relation

At position:

dp[i][j]

three possibilities exist.

Delete

Remove a character from A.

dp[i-1][j] + 1

Insert

Insert a character into A.

dp[i][j-1] + 1

Substitute

Replace one character.

dp[i-1][j-1] + cost

where:

cost =
0 if characters match
1 otherwise

Therefore:

dp[i][j] =
min(
    delete,
    insert,
    substitute
)

This single recurrence drives the entire algorithm.

Example

Compute:

cat
cut

The table begins:

"" c u t
"" 0 1 2 3
c 1 ? ? ?
a 2 ? ? ?
t 3 ? ? ?

Filling the table:

"" c u t
"" 0 1 2 3
c 1 0 1 2
a 2 1 1 2
t 3 2 2 1

Result:

distance(cat, cut) = 1

Only one substitution is required.

Visualizing the Grid

The table can be interpreted as a graph.

Each cell represents a state:

(i, j)

Transitions correspond to editing operations.

delete      ↓
insert      →
substitute  ↘

Finding the edit distance becomes a shortest-path problem in this implicit graph.

This perspective is useful when extending the algorithm.

Lean Implementation

A straightforward dynamic programming implementation:

def editDistance
  (a b : String)
  : Nat :=
by
  let n := a.length
  let m := b.length

  -- Build DP table.
  -- Initialize boundaries.
  -- Fill using recurrence.
  sorry

The implementation details are routine once the recurrence is understood.

The important part is identifying the three possible operations.

Reconstructing the Edits

The final distance tells us the cost.

Often we also want the sequence of edits.

For:

kitten
sitting

one valid transformation is:

kitten
↓
sitten
↓
sittin
↓
sitting

Backtracking through the DP table recovers this path.

The process is similar to reconstructing a shortest path after running a graph algorithm.

Space Optimization

The recurrence uses only:

previous row
current row

at any moment.

Therefore the full table is unnecessary.

Instead of:

O(nm)

space, we can use:

O(min(n,m))

space.

This optimization is important when processing long strings.

The time complexity remains unchanged.

Damerau-Levenshtein Distance

Some applications treat adjacent transpositions as a single edit.

Example:

form
from

Standard edit distance:

2

(transpose requires delete + insert or two substitutions)

Damerau-Levenshtein:

1

because:

or
ro

is treated as a single operation.

This variant often produces better results for typographical errors.

Weighted Edit Distance

Not all edits have equal importance.

Example:

colour
color

Removing:

u

may be considered less significant than replacing several unrelated characters.

Weighted versions assign custom costs:

insert cost
delete cost
substitute cost

The dynamic programming framework remains identical.

Only the recurrence changes.

Applications

Spell Checking

Find dictionary entries with minimum edit distance.

algoritm

suggests:

algorithm

because the distance is small.

Search Suggestions

Search engines tolerate minor typing mistakes.

gogle

may match:

google

OCR Correction

Misread characters are repaired using nearby words.

DNA Alignment

Insertions, deletions, and mutations correspond naturally to edit operations.

Record Matching

Detect near-duplicate names:

Jon Smith
John Smith

Plagiarism Detection

Similarity measures often begin with edit-distance concepts.

Complexity Analysis

Let:

Symbol Meaning
n Length of first string
m Length of second string
Operation Complexity
DP table construction O(nm)
Space (full table) O(nm)
Space (optimized) O(min(n,m))

The quadratic runtime is acceptable for moderate strings but becomes expensive for very large inputs.

Later algorithms and indexing techniques address this limitation.

Edit Distance vs Longest Common Subsequence

These two problems are closely related.

For insertions and deletions only:

distance =
n + m - 2 × LCS

where:

LCS

is the length of the longest common subsequence.

Both problems rely on nearly identical dynamic programming structures.

The difference lies in what the recurrence measures.

Correctness

Every transformation from one string to another ends with one of three operations:

  • Insert
  • Delete
  • Substitute (or match)

The recurrence considers all three possibilities and chooses the minimum cost.

Because every valid edit sequence corresponds to a path through the table, and every table entry records the optimal cost for its prefixes, the final value represents the minimum possible number of edits.

This is a classic optimal-substructure argument.

Common Pitfalls

Do not confuse substrings with subsequences.

Edit distance operates on complete prefixes during dynamic programming.

Be careful when indexing strings. Most implementation bugs arise from off-by-one errors in table construction.

Remember that edit distance is not always symmetric when custom weights are introduced.

For Unicode text, define whether operations apply to bytes, code points, or grapheme clusters.

Different choices produce different distances.

Takeaway

Edit distance replaces the binary notion of match versus mismatch with a quantitative notion of similarity. By modeling insertions, deletions, and substitutions as transitions in a dynamic programming table, it provides a principled way to compare strings that are almost, but not exactly, the same. This idea appears throughout information retrieval, computational biology, machine learning, and natural language processing, where imperfect matches are often more important than exact ones.