20.10 Building a Spell Checker
Design a spell checker that detects misspelled words and suggests likely corrections.
20.10 Building a Spell Checker
Problem
Design a spell checker that detects misspelled words and suggests likely corrections.
Spell checking appears in search boxes, document editors, messaging apps, code editors, command-line tools, OCR cleanup systems, and data-entry workflows. The user types:
algoritm
The system should suggest:
algorithm
A useful spell checker must do more than find dictionary words. It must rank plausible corrections, handle common typing errors, and avoid replacing valid domain-specific words.
Solution
Use a dictionary for valid words, generate candidate corrections within a small edit distance, and rank candidates by word frequency.
The core idea is:
input word
-> generate nearby words
-> keep dictionary words
-> rank by likelihood
-> return suggestions
For a first implementation, use edit distance one and two. Most spelling mistakes are close to the intended word.
Dictionary and Frequency Table
A dictionary answers whether a word is known. A frequency table estimates how likely a word is.
from collections import Counter
def build_frequency_table(corpus):
words = []
for line in corpus:
for token in tokenize(line):
words.append(token)
return Counter(words)
Tokenization:
import re
WORD_RE = re.compile(r"[a-z]+")
def tokenize(text):
return WORD_RE.findall(text.lower())
Example:
corpus = [
"algorithm design",
"algorithm analysis",
"graph algorithm",
"dynamic programming",
]
frequencies = build_frequency_table(corpus)
print(frequencies)
Output:
Counter({'algorithm': 3, 'design': 1, 'analysis': 1, 'graph': 1, 'dynamic': 1, 'programming': 1})
The word algorithm should rank higher than a rare alternative if both are plausible corrections.
Generating Edits
An edit operation transforms one word into another.
Common edit operations:
| Operation | Example |
|---|---|
| Delete | algoritm -> algorim |
| Insert | algoritm -> algorithm |
| Replace | algoritm -> algorita |
| Transpose | algroithm -> algorithm |
Generate all words at edit distance one:
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
def edits1(word):
splits = [
(word[:i], word[i:])
for i in range(len(word) + 1)
]
deletes = [
left + right[1:]
for left, right in splits
if right
]
transposes = [
left + right[1] + right[0] + right[2:]
for left, right in splits
if len(right) > 1
]
replaces = [
left + ch + right[1:]
for left, right in splits
if right
for ch in ALPHABET
]
inserts = [
left + ch + right
for left, right in splits
for ch in ALPHABET
]
return set(deletes + transposes + replaces + inserts)
This function is compact, but it can generate many candidates. For a word of length n, edit distance one produces O(26n) candidates.
Filtering Known Words
Only keep candidates that appear in the dictionary.
def known(words, frequencies):
return {
word
for word in words
if word in frequencies
}
Candidate generation:
def candidates(word, frequencies):
if word in frequencies:
return {word}
one_edit = known(edits1(word), frequencies)
if one_edit:
return one_edit
two_edits = known(
(
edit2
for edit1 in edits1(word)
for edit2 in edits1(edit1)
),
frequencies,
)
if two_edits:
return two_edits
return {word}
This follows a practical ranking order:
- Known original word.
- Known words one edit away.
- Known words two edits away.
- Original word if no correction is known.
Ranking Suggestions
Rank candidates by frequency.
def correction(word, frequencies):
return max(
candidates(word, frequencies),
key=lambda candidate: frequencies[candidate],
)
Return multiple suggestions:
def suggestions(word, frequencies, limit=5):
ranked = sorted(
candidates(word, frequencies),
key=lambda candidate: frequencies[candidate],
reverse=True,
)
return ranked[:limit]
Example:
corpus = [
"algorithm algorithm algorithm",
"algorithms",
"altruism",
"rhythm",
]
frequencies = build_frequency_table(corpus)
print(suggestions("algoritm", frequencies))
Expected output:
['algorithm']
This is the basic spell checker popularized by simple probabilistic spelling-correction examples: generate candidates, filter known words, and choose the most frequent.
Why Frequency Matters
Edit distance alone cannot choose between plausible words.
Input:
form
Possible corrections:
form
from
farm
firm
The correct answer depends on context. Without context, frequency is the simplest useful signal. If from appears far more often than farm, it may be the better correction for many inputs.
But frequency should not override exact matches. If the user typed a known word, keep it unless there is a strong reason to autocorrect.
Computing Edit Distance Directly
Candidate generation is fast for short words, but sometimes you need the actual edit distance between two words.
Use dynamic programming.
def edit_distance(a, b):
rows = len(a)
cols = len(b)
dp = [
[0] * (cols + 1)
for _ in range(rows + 1)
]
for i in range(rows + 1):
dp[i][0] = i
for j in range(cols + 1):
dp[0][j] = j
for i in range(1, rows + 1):
for j in range(1, cols + 1):
cost = 0 if a[i - 1] == b[j - 1] else 1
dp[i][j] = min(
dp[i - 1][j] + 1,
dp[i][j - 1] + 1,
dp[i - 1][j - 1] + cost,
)
return dp[rows][cols]
Example:
print(edit_distance("algoritm", "algorithm"))
Output:
1
Direct edit distance is useful for ranking candidates from a restricted dictionary, but computing it against every dictionary word is expensive.
Avoiding Full Dictionary Scans
A naive spell checker compares the input word against every dictionary word.
def slow_suggestions(word, dictionary):
return [
candidate
for candidate in dictionary
if edit_distance(word, candidate) <= 2
]
This has poor scaling:
O(dictionary_size × word_length²)
Candidate generation reverses the direction. Instead of testing every dictionary word, generate nearby strings and check them in a hash set.
This makes lookup fast for short words.
BK-Trees
For larger dictionaries or non-English alphabets, a BK-tree can index words by edit distance.
A BK-tree stores each word as a node. Edges are labeled by edit distance from the parent word.
When searching for words within distance d, the triangle inequality prunes branches.
Simplified implementation:
class BKNode:
def __init__(self, word):
self.word = word
self.children = {}
class BKTree:
def __init__(self, distance):
self.root = None
self.distance = distance
def add(self, word):
if self.root is None:
self.root = BKNode(word)
return
node = self.root
while True:
dist = self.distance(word, node.word)
if dist in node.children:
node = node.children[dist]
else:
node.children[dist] = BKNode(word)
return
def search(self, word, max_distance):
results = []
def visit(node):
dist = self.distance(word, node.word)
if dist <= max_distance:
results.append((node.word, dist))
low = dist - max_distance
high = dist + max_distance
for edge_distance, child in node.children.items():
if low <= edge_distance <= high:
visit(child)
if self.root is not None:
visit(self.root)
return results
BK-trees are useful when direct candidate generation becomes awkward, especially for large alphabets or custom distance functions.
Context-Aware Correction
Single-word correction fails when the misspelled word is also a valid word.
Example:
I will reed the book.
The word reed is valid, but context suggests read.
A context-aware spell checker scores sequences rather than isolated words.
A simple bigram model estimates:
P(current word | previous word)
Example:
will read
will reed
If will read appears much more often in the corpus, choose read.
Build bigram counts:
def build_bigram_counts(corpus):
counts = Counter()
for line in corpus:
tokens = tokenize(line)
for i in range(len(tokens) - 1):
counts[(tokens[i], tokens[i + 1])] += 1
return counts
Score candidate by previous word:
def contextual_suggestions(previous, word, frequencies, bigrams, limit=5):
ranked = sorted(
candidates(word, frequencies),
key=lambda candidate: (
bigrams[(previous, candidate)],
frequencies[candidate],
),
reverse=True,
)
return ranked[:limit]
This is still crude, but it shows the structure: local candidate generation plus language-model ranking.
Domain Dictionaries
Generic dictionaries are not enough for technical text.
A programming document may contain:
Dijkstra
heapq
idempotency
PostgreSQL
Kubernetes
A generic spell checker may mark all of these as errors.
Support a custom dictionary:
def merge_dictionaries(base_frequencies, custom_words):
merged = Counter(base_frequencies)
for word in custom_words:
merged[word.lower()] += 1_000_000
return merged
Boosting custom words makes them valid and prevents unwanted correction.
For code editors, dictionaries often include language keywords, package names, identifiers from the current workspace, and words from comments.
Case Preservation
If the user types:
Algoritm
the correction should be:
Algorithm
If the user types:
ALGORITM
the correction may be:
ALGORITHM
Preserve casing:
def match_case(original, corrected):
if original.isupper():
return corrected.upper()
if original[:1].isupper():
return corrected.capitalize()
return corrected
Use it:
def correction_preserve_case(word, frequencies):
corrected = correction(word.lower(), frequencies)
return match_case(word, corrected)
Autocorrect vs Suggestions
Spell checkers have two modes.
| Mode | Behavior |
|---|---|
| Suggestion | Show likely corrections |
| Autocorrect | Replace automatically |
Autocorrect requires a higher confidence threshold. Replacing a user's valid word is worse than offering no correction.
A simple confidence rule:
def should_autocorrect(word, best, frequencies):
if word.lower() in frequencies:
return False
word_candidates = suggestions(word.lower(), frequencies, limit=2)
if len(word_candidates) < 2:
return True
first = frequencies[word_candidates[0]]
second = frequencies[word_candidates[1]]
return first >= 5 * second
This avoids automatic replacement when the top two candidates are close.
Tokenization and Punctuation
A spell checker should not destroy punctuation.
Input:
Algoritm, design!
Desired output:
Algorithm, design!
Separate words from punctuation, correct only word tokens, then preserve punctuation.
TOKEN_RE = re.compile(r"[A-Za-z]+|[^A-Za-z]+")
def correct_text(text, frequencies):
parts = []
for token in TOKEN_RE.findall(text):
if token.isalpha():
parts.append(correction_preserve_case(token, frequencies))
else:
parts.append(token)
return "".join(parts)
This keeps spaces, commas, periods, and other punctuation intact.
Complexity
For candidate generation:
| Operation | Complexity |
|---|---|
| Generate edit-distance-one candidates | O(a × n) |
| Dictionary membership check | O(1) average per candidate |
| Generate edit-distance-two candidates | Much larger, roughly O((a × n)²) |
| Rank candidates | O(k log k) |
Where:
ais alphabet size.nis word length.kis number of known candidates.
For most English words, this is acceptable. For long identifiers or large alphabets, use indexing strategies such as BK-trees, deletion indexes, n-gram indexes, or trie-based search.
Testing
Test exact known words first.
def test_known_word_is_kept():
frequencies = Counter({
"algorithm": 10,
"altruism": 1,
})
assert correction("algorithm", frequencies) == "algorithm"
Test simple correction.
def test_one_edit_correction():
frequencies = Counter({
"algorithm": 10,
})
assert correction("algoritm", frequencies) == "algorithm"
Test casing.
def test_case_preservation():
frequencies = Counter({
"algorithm": 10,
})
assert correction_preserve_case("Algoritm", frequencies) == "Algorithm"
Test punctuation preservation.
def test_punctuation_preservation():
frequencies = Counter({
"algorithm": 10,
"design": 5,
})
assert correct_text("Algoritm, design!", frequencies) == "Algorithm, design!"
These tests define the behavior users notice most.
Common Bugs
The most common bug is correcting words that are already valid. Exact dictionary matches should usually be preserved.
Another common bug is ranking only by edit distance. Without frequency or context, the system often chooses obscure words.
Candidate explosion can make edit-distance-two generation expensive. Do it only after edit-distance-one candidates fail.
Tokenization bugs can remove punctuation, change whitespace, or corrupt mixed text.
Case handling can produce strange output, especially for acronyms.
Domain words are often incorrectly marked as misspellings. Provide custom dictionaries and user dictionaries early.
Autocorrect can be harmful when confidence is low. Prefer suggestions unless the top candidate is clearly dominant.
Recipe
Build the spell checker in layers.
Start with a frequency dictionary. Generate one-edit candidates. Filter by known words. Rank by frequency. Add two-edit candidates only as fallback. Preserve casing and punctuation. Add custom dictionaries for domain terms. Add context-aware ranking with bigrams when isolated-word correction is not enough. Use conservative thresholds for autocorrect. Move to BK-trees or n-gram indexes when dictionary size or alphabet complexity requires it.
The main lesson is that spell checking is approximate search plus ranking. Edit distance generates plausible candidates, but frequency, context, domain knowledge, and confidence thresholds determine whether the result is useful.