20.5 Building a Text Diff Tool

Design a text diff tool that compares two versions of a document and reports what changed.

20.5 Building a Text Diff Tool

Problem

Design a text diff tool that compares two versions of a document and reports what changed.

A diff tool appears in version control systems, code review tools, document editors, database migration systems, configuration management tools, and synchronization engines. The user wants a readable answer to a simple question:

What changed between version A and version B?

Given two texts:

A:
the quick brown fox
jumps over the dog

B:
the quick brown fox
jumps over the lazy dog

A useful diff should report:

 the quick brown fox
-jumps over the dog
+jumps over the lazy dog

The algorithmic problem is sequence comparison. The engineering problem is readability. A technically minimal edit script may still produce confusing output if it splits lines, tokens, or blocks in unnatural places.

Solution

Model each document as a sequence. For a line-oriented diff, each line is an item. Compute the longest common subsequence, then use it to identify inserted and deleted lines.

The longest common subsequence, or LCS, finds the longest ordered sequence that appears in both inputs.

Example:

A: A B C D
B: A C E D

One LCS is:

A C D

This means B was deleted from the first sequence, and E was inserted in the second sequence.

Splitting Text into Lines

Start with line-level comparison.

def split_lines(text):
    return text.splitlines()

Example:

old = """the quick brown fox
jumps over the dog"""

new = """the quick brown fox
jumps over the lazy dog"""

print(split_lines(old))
print(split_lines(new))

Output:

['the quick brown fox', 'jumps over the dog']
['the quick brown fox', 'jumps over the lazy dog']

Line-level diff is simple, stable, and usually readable for source code and prose.

Dynamic Programming for LCS

Let a and b be two sequences. Define:

dp[i][j] = length of the LCS of a[i:] and b[j:]

Recurrence:

if a[i] == b[j]:
    dp[i][j] = 1 + dp[i + 1][j + 1]
else:
    dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

Implementation:

def lcs_table(a, b):
    rows = len(a)
    cols = len(b)

    dp = [
        [0] * (cols + 1)
        for _ in range(rows + 1)
    ]

    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(
                    dp[i + 1][j],
                    dp[i][j + 1],
                )

    return dp

This table records enough information to reconstruct a diff.

Building an Edit Script

Walk through the table from the top-left corner.

def diff_sequences(a, b):
    dp = lcs_table(a, b)

    i = 0
    j = 0
    operations = []

    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            operations.append(("equal", a[i]))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            operations.append(("delete", a[i]))
            i += 1
        else:
            operations.append(("insert", b[j]))
            j += 1

    while i < len(a):
        operations.append(("delete", a[i]))
        i += 1

    while j < len(b):
        operations.append(("insert", b[j]))
        j += 1

    return operations

Format the result:

def format_unified_style(operations):
    lines = []

    for kind, value in operations:
        if kind == "equal":
            lines.append(" " + value)
        elif kind == "delete":
            lines.append("-" + value)
        elif kind == "insert":
            lines.append("+" + value)

    return "\n".join(lines)

Use it:

old_lines = split_lines(old)
new_lines = split_lines(new)

ops = diff_sequences(old_lines, new_lines)

print(format_unified_style(ops))

Output:

 the quick brown fox
-jumps over the dog
+jumps over the lazy dog

Complexity

The basic LCS dynamic programming algorithm uses:

Time:  O(nm)
Space: O(nm)

where n and m are the lengths of the two sequences.

This is acceptable for small documents. It becomes expensive for large files, especially if the diff is line-level over tens of thousands of lines.

To reduce memory, store only two rows when computing the LCS length. To reconstruct the full edit script with lower memory, use Hirschberg's algorithm. Many production diff tools use variants of Myers' diff algorithm, which is efficient when the edit distance is small.

Why LCS Gives a Diff

A diff can be seen as preserving the common parts and marking everything else as inserted or deleted.

The LCS gives the largest preserved subsequence. Once those common items are fixed, items appearing only in the old sequence become deletions, and items appearing only in the new sequence become insertions.

For example:

old: A B C D
new: A C E D

LCS:

A C D

Diff:

 A
-B
 C
+E
 D

The LCS does not directly model substitution. A changed line appears as one deletion plus one insertion. This is appropriate for text diff because a changed line has both an old form and a new form.

Adding Context

Large diffs should not print every unchanged line. Instead, show a few unchanged lines around changes.

def add_context(operations, context=2):
    keep = set()

    for index, (kind, _) in enumerate(operations):
        if kind == "equal":
            continue

        start = max(0, index - context)
        end = min(len(operations), index + context + 1)

        for i in range(start, end):
            keep.add(i)

    result = []
    hidden = False

    for index, operation in enumerate(operations):
        if index in keep:
            if hidden:
                result.append(("skip", "..."))
                hidden = False
            result.append(operation)
        else:
            hidden = True

    return result

Update the formatter:

def format_with_context(operations):
    lines = []

    for kind, value in operations:
        if kind == "equal":
            lines.append(" " + value)
        elif kind == "delete":
            lines.append("-" + value)
        elif kind == "insert":
            lines.append("+" + value)
        elif kind == "skip":
            lines.append(" " + value)

    return "\n".join(lines)

Context keeps output readable while preserving nearby evidence.

Word-Level Diff

Line-level diff can hide small edits inside a long line.

Old:

jumps over the dog

New:

jumps over the lazy dog

A word-level diff shows the inserted word:

jumps over the +lazy dog

Reuse the same sequence algorithm, but split by words:

def split_words(text):
    return text.split()

Then run:

old_words = split_words("jumps over the dog")
new_words = split_words("jumps over the lazy dog")

ops = diff_sequences(old_words, new_words)

print(ops)

Output:

[('equal', 'jumps'), ('equal', 'over'), ('equal', 'the'), ('insert', 'lazy'), ('equal', 'dog')]

A practical diff tool often uses two levels:

  1. Compare lines.
  2. For changed line pairs, compare words or characters.

This gives readable output without performing expensive fine-grained comparison across the whole document.

Pairing Changed Lines

When a deletion is followed by an insertion, it often represents a modified line.

-old line
+new line

You can run a word-level diff on that pair.

def refine_changed_lines(operations):
    refined = []
    i = 0

    while i < len(operations):
        if (
            i + 1 < len(operations)
            and operations[i][0] == "delete"
            and operations[i + 1][0] == "insert"
        ):
            old_line = operations[i][1]
            new_line = operations[i + 1][1]

            word_ops = diff_sequences(
                split_words(old_line),
                split_words(new_line),
            )

            refined.append(("replace_words", word_ops))
            i += 2
        else:
            refined.append(operations[i])
            i += 1

    return refined

This layered approach gives a more helpful result for edited prose and source code.

Stable Output

A diff tool should produce stable output. Small unrelated changes should not cause the entire file to appear different.

Common stability techniques:

Technique Purpose
Line-level matching first Preserves structure
Context windows Avoids huge unchanged output
Tie-breaking rules Makes output deterministic
Token refinement Shows small changes clearly
Whitespace options Avoids noise when formatting changes

Tie-breaking matters in LCS. When both dp[i + 1][j] and dp[i][j + 1] are equal, choosing deletion first produces different output than choosing insertion first. Pick one policy and keep it consistent.

Whitespace Handling

Users often want options:

normal diff
ignore trailing whitespace
ignore all whitespace
ignore blank lines

Normalize lines before comparison, but keep original lines for output.

def normalize_line(line, mode):
    if mode == "exact":
        return line

    if mode == "ignore_trailing":
        return line.rstrip()

    if mode == "ignore_all":
        return "".join(line.split())

    raise ValueError(f"unknown whitespace mode: {mode}")

Build comparison sequences:

def prepare_lines(text, whitespace_mode):
    original = split_lines(text)

    normalized = [
        normalize_line(line, whitespace_mode)
        for line in original
    ]

    return original, normalized

When normalized lines match, display original lines according to your formatting policy.

Move Detection

LCS-based diff treats moved blocks as deletion plus insertion.

Old:

A
B
C
D

New:

A
D
B
C

Diff may show:

 A
+D
 B
 C
-D

or a similar edit script.

Move detection requires an additional pass. One approach:

  1. Find deleted blocks.
  2. Find inserted blocks.
  3. Hash block contents.
  4. Match equal deleted and inserted blocks as moves.

This is useful in code review tools, but it complicates the model. Keep move detection optional.

Testing

A diff tool needs tests for both correctness and readability.

Case Expected behavior
Equal files All lines equal or empty diff
Pure insertion Only inserted lines marked
Pure deletion Only deleted lines marked
One-line replacement Delete old, insert new
Repeated lines Stable deterministic output
Empty old file All new lines inserted
Empty new file All old lines deleted
Whitespace-only change Depends on whitespace mode
Large unchanged blocks Context hides unchanged sections
Moved block Delete plus insert unless move detection is enabled

Repeated lines are especially important because they create multiple valid LCS choices.

Example:

old: A B A C
new: A A B C

There may be more than one valid common subsequence. Your tie-breaking rule controls the displayed diff.

Common Bugs

The most common LCS implementation bug is incorrect table boundaries. The table should have one extra row and one extra column so that dp[i + 1][j] and dp[i][j + 1] are always valid near the end.

Another common bug is losing final insertions or deletions after the main loop. Always drain the remaining items from both sequences.

Word-level refinement can accidentally remove whitespace. If exact formatting matters, tokenize with separators preserved rather than using simple split.

Context rendering can print repeated skip markers. Track whether you are already inside a hidden region.

Whitespace normalization can produce misleading output if normalized lines match but original lines differ. Decide whether ignored changes should be hidden or shown as low-priority changes.

Recipe

Build the tool in layers.

Start with line splitting. Compute an LCS table. Walk the table to produce equal, delete, and insert operations. Format operations with +, -, and space prefixes. Add context windows for large files. Add word-level refinement for changed lines. Add whitespace modes. Add move detection only after the basic output is stable.

The main algorithmic lesson is that diff is sequence alignment. The main engineering lesson is that the best edit script is not always the most readable edit script, so the implementation should optimize for stable and understandable output, not only for minimality.