20.5 Building a Text Diff Tool
Design a text diff tool that compares two versions of a document and reports what changed.
20.5 Building a Text Diff Tool
Problem
Design a text diff tool that compares two versions of a document and reports what changed.
A diff tool appears in version control systems, code review tools, document editors, database migration systems, configuration management tools, and synchronization engines. The user wants a readable answer to a simple question:
What changed between version A and version B?
Given two texts:
A:
the quick brown fox
jumps over the dog
B:
the quick brown fox
jumps over the lazy dog
A useful diff should report:
the quick brown fox
-jumps over the dog
+jumps over the lazy dog
The algorithmic problem is sequence comparison. The engineering problem is readability. A technically minimal edit script may still produce confusing output if it splits lines, tokens, or blocks in unnatural places.
Solution
Model each document as a sequence. For a line-oriented diff, each line is an item. Compute the longest common subsequence, then use it to identify inserted and deleted lines.
The longest common subsequence, or LCS, finds the longest ordered sequence that appears in both inputs.
Example:
A: A B C D
B: A C E D
One LCS is:
A C D
This means B was deleted from the first sequence, and E was inserted in the second sequence.
Splitting Text into Lines
Start with line-level comparison.
def split_lines(text):
return text.splitlines()
Example:
old = """the quick brown fox
jumps over the dog"""
new = """the quick brown fox
jumps over the lazy dog"""
print(split_lines(old))
print(split_lines(new))
Output:
['the quick brown fox', 'jumps over the dog']
['the quick brown fox', 'jumps over the lazy dog']
Line-level diff is simple, stable, and usually readable for source code and prose.
Dynamic Programming for LCS
Let a and b be two sequences. Define:
dp[i][j] = length of the LCS of a[i:] and b[j:]
Recurrence:
if a[i] == b[j]:
dp[i][j] = 1 + dp[i + 1][j + 1]
else:
dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
Implementation:
def lcs_table(a, b):
rows = len(a)
cols = len(b)
dp = [
[0] * (cols + 1)
for _ in range(rows + 1)
]
for i in range(rows - 1, -1, -1):
for j in range(cols - 1, -1, -1):
if a[i] == b[j]:
dp[i][j] = 1 + dp[i + 1][j + 1]
else:
dp[i][j] = max(
dp[i + 1][j],
dp[i][j + 1],
)
return dp
This table records enough information to reconstruct a diff.
Building an Edit Script
Walk through the table from the top-left corner.
def diff_sequences(a, b):
dp = lcs_table(a, b)
i = 0
j = 0
operations = []
while i < len(a) and j < len(b):
if a[i] == b[j]:
operations.append(("equal", a[i]))
i += 1
j += 1
elif dp[i + 1][j] >= dp[i][j + 1]:
operations.append(("delete", a[i]))
i += 1
else:
operations.append(("insert", b[j]))
j += 1
while i < len(a):
operations.append(("delete", a[i]))
i += 1
while j < len(b):
operations.append(("insert", b[j]))
j += 1
return operations
Format the result:
def format_unified_style(operations):
lines = []
for kind, value in operations:
if kind == "equal":
lines.append(" " + value)
elif kind == "delete":
lines.append("-" + value)
elif kind == "insert":
lines.append("+" + value)
return "\n".join(lines)
Use it:
old_lines = split_lines(old)
new_lines = split_lines(new)
ops = diff_sequences(old_lines, new_lines)
print(format_unified_style(ops))
Output:
the quick brown fox
-jumps over the dog
+jumps over the lazy dog
Complexity
The basic LCS dynamic programming algorithm uses:
Time: O(nm)
Space: O(nm)
where n and m are the lengths of the two sequences.
This is acceptable for small documents. It becomes expensive for large files, especially if the diff is line-level over tens of thousands of lines.
To reduce memory, store only two rows when computing the LCS length. To reconstruct the full edit script with lower memory, use Hirschberg's algorithm. Many production diff tools use variants of Myers' diff algorithm, which is efficient when the edit distance is small.
Why LCS Gives a Diff
A diff can be seen as preserving the common parts and marking everything else as inserted or deleted.
The LCS gives the largest preserved subsequence. Once those common items are fixed, items appearing only in the old sequence become deletions, and items appearing only in the new sequence become insertions.
For example:
old: A B C D
new: A C E D
LCS:
A C D
Diff:
A
-B
C
+E
D
The LCS does not directly model substitution. A changed line appears as one deletion plus one insertion. This is appropriate for text diff because a changed line has both an old form and a new form.
Adding Context
Large diffs should not print every unchanged line. Instead, show a few unchanged lines around changes.
def add_context(operations, context=2):
keep = set()
for index, (kind, _) in enumerate(operations):
if kind == "equal":
continue
start = max(0, index - context)
end = min(len(operations), index + context + 1)
for i in range(start, end):
keep.add(i)
result = []
hidden = False
for index, operation in enumerate(operations):
if index in keep:
if hidden:
result.append(("skip", "..."))
hidden = False
result.append(operation)
else:
hidden = True
return result
Update the formatter:
def format_with_context(operations):
lines = []
for kind, value in operations:
if kind == "equal":
lines.append(" " + value)
elif kind == "delete":
lines.append("-" + value)
elif kind == "insert":
lines.append("+" + value)
elif kind == "skip":
lines.append(" " + value)
return "\n".join(lines)
Context keeps output readable while preserving nearby evidence.
Word-Level Diff
Line-level diff can hide small edits inside a long line.
Old:
jumps over the dog
New:
jumps over the lazy dog
A word-level diff shows the inserted word:
jumps over the +lazy dog
Reuse the same sequence algorithm, but split by words:
def split_words(text):
return text.split()
Then run:
old_words = split_words("jumps over the dog")
new_words = split_words("jumps over the lazy dog")
ops = diff_sequences(old_words, new_words)
print(ops)
Output:
[('equal', 'jumps'), ('equal', 'over'), ('equal', 'the'), ('insert', 'lazy'), ('equal', 'dog')]
A practical diff tool often uses two levels:
- Compare lines.
- For changed line pairs, compare words or characters.
This gives readable output without performing expensive fine-grained comparison across the whole document.
Pairing Changed Lines
When a deletion is followed by an insertion, it often represents a modified line.
-old line
+new line
You can run a word-level diff on that pair.
def refine_changed_lines(operations):
refined = []
i = 0
while i < len(operations):
if (
i + 1 < len(operations)
and operations[i][0] == "delete"
and operations[i + 1][0] == "insert"
):
old_line = operations[i][1]
new_line = operations[i + 1][1]
word_ops = diff_sequences(
split_words(old_line),
split_words(new_line),
)
refined.append(("replace_words", word_ops))
i += 2
else:
refined.append(operations[i])
i += 1
return refined
This layered approach gives a more helpful result for edited prose and source code.
Stable Output
A diff tool should produce stable output. Small unrelated changes should not cause the entire file to appear different.
Common stability techniques:
| Technique | Purpose |
|---|---|
| Line-level matching first | Preserves structure |
| Context windows | Avoids huge unchanged output |
| Tie-breaking rules | Makes output deterministic |
| Token refinement | Shows small changes clearly |
| Whitespace options | Avoids noise when formatting changes |
Tie-breaking matters in LCS. When both dp[i + 1][j] and dp[i][j + 1] are equal, choosing deletion first produces different output than choosing insertion first. Pick one policy and keep it consistent.
Whitespace Handling
Users often want options:
normal diff
ignore trailing whitespace
ignore all whitespace
ignore blank lines
Normalize lines before comparison, but keep original lines for output.
def normalize_line(line, mode):
if mode == "exact":
return line
if mode == "ignore_trailing":
return line.rstrip()
if mode == "ignore_all":
return "".join(line.split())
raise ValueError(f"unknown whitespace mode: {mode}")
Build comparison sequences:
def prepare_lines(text, whitespace_mode):
original = split_lines(text)
normalized = [
normalize_line(line, whitespace_mode)
for line in original
]
return original, normalized
When normalized lines match, display original lines according to your formatting policy.
Move Detection
LCS-based diff treats moved blocks as deletion plus insertion.
Old:
A
B
C
D
New:
A
D
B
C
Diff may show:
A
+D
B
C
-D
or a similar edit script.
Move detection requires an additional pass. One approach:
- Find deleted blocks.
- Find inserted blocks.
- Hash block contents.
- Match equal deleted and inserted blocks as moves.
This is useful in code review tools, but it complicates the model. Keep move detection optional.
Testing
A diff tool needs tests for both correctness and readability.
| Case | Expected behavior |
|---|---|
| Equal files | All lines equal or empty diff |
| Pure insertion | Only inserted lines marked |
| Pure deletion | Only deleted lines marked |
| One-line replacement | Delete old, insert new |
| Repeated lines | Stable deterministic output |
| Empty old file | All new lines inserted |
| Empty new file | All old lines deleted |
| Whitespace-only change | Depends on whitespace mode |
| Large unchanged blocks | Context hides unchanged sections |
| Moved block | Delete plus insert unless move detection is enabled |
Repeated lines are especially important because they create multiple valid LCS choices.
Example:
old: A B A C
new: A A B C
There may be more than one valid common subsequence. Your tie-breaking rule controls the displayed diff.
Common Bugs
The most common LCS implementation bug is incorrect table boundaries. The table should have one extra row and one extra column so that dp[i + 1][j] and dp[i][j + 1] are always valid near the end.
Another common bug is losing final insertions or deletions after the main loop. Always drain the remaining items from both sequences.
Word-level refinement can accidentally remove whitespace. If exact formatting matters, tokenize with separators preserved rather than using simple split.
Context rendering can print repeated skip markers. Track whether you are already inside a hidden region.
Whitespace normalization can produce misleading output if normalized lines match but original lines differ. Decide whether ignored changes should be hidden or shown as low-priority changes.
Recipe
Build the tool in layers.
Start with line splitting. Compute an LCS table. Walk the table to produce equal, delete, and insert operations. Format operations with +, -, and space prefixes. Add context windows for large files. Add word-level refinement for changed lines. Add whitespace modes. Add move detection only after the basic output is stable.
The main algorithmic lesson is that diff is sequence alignment. The main engineering lesson is that the best edit script is not always the most readable edit script, so the implementation should optimize for stable and understandable output, not only for minimality.