20.10 Building a Spell Checker

Problem

Design a spell checker that detects misspelled words and suggests likely corrections.

Spell checking appears in search boxes, document editors, messaging apps, code editors, command-line tools, OCR cleanup systems, and data-entry workflows. The user types:

algoritm

The system should suggest:

algorithm

A useful spell checker must do more than find dictionary words. It must rank plausible corrections, handle common typing errors, and avoid replacing valid domain-specific words.

Solution

Use a dictionary for valid words, generate candidate corrections within a small edit distance, and rank candidates by word frequency.

The core idea is:

input word
  -> generate nearby words
  -> keep dictionary words
  -> rank by likelihood
  -> return suggestions

For a first implementation, use edit distance one and two. Most spelling mistakes are close to the intended word.

Dictionary and Frequency Table

A dictionary answers whether a word is known. A frequency table estimates how likely a word is.

from collections import Counter

def build_frequency_table(corpus):
    words = []

    for line in corpus:
        for token in tokenize(line):
            words.append(token)

    return Counter(words)

Tokenization:

import re

WORD_RE = re.compile(r"[a-z]+")

def tokenize(text):
    return WORD_RE.findall(text.lower())

Example:

corpus = [
    "algorithm design",
    "algorithm analysis",
    "graph algorithm",
    "dynamic programming",
]

frequencies = build_frequency_table(corpus)

print(frequencies)

Output:

Counter({'algorithm': 3, 'design': 1, 'analysis': 1, 'graph': 1, 'dynamic': 1, 'programming': 1})

The word algorithm should rank higher than a rare alternative if both are plausible corrections.

Generating Edits

An edit operation transforms one word into another.

Common edit operations:

Operation	Example
Delete	`algoritm` -> `algorim`
Insert	`algoritm` -> `algorithm`
Replace	`algoritm` -> `algorita`
Transpose	`algroithm` -> `algorithm`

Generate all words at edit distance one:

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    splits = [
        (word[:i], word[i:])
        for i in range(len(word) + 1)
    ]

    deletes = [
        left + right[1:]
        for left, right in splits
        if right
    ]

    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]

    replaces = [
        left + ch + right[1:]
        for left, right in splits
        if right
        for ch in ALPHABET
    ]

    inserts = [
        left + ch + right
        for left, right in splits
        for ch in ALPHABET
    ]

    return set(deletes + transposes + replaces + inserts)

This function is compact, but it can generate many candidates. For a word of length n, edit distance one produces O(26n) candidates.

Filtering Known Words

Only keep candidates that appear in the dictionary.

def known(words, frequencies):
    return {
        word
        for word in words
        if word in frequencies
    }

Candidate generation:

def candidates(word, frequencies):
    if word in frequencies:
        return {word}

    one_edit = known(edits1(word), frequencies)

    if one_edit:
        return one_edit

    two_edits = known(
        (
            edit2
            for edit1 in edits1(word)
            for edit2 in edits1(edit1)
        ),
        frequencies,
    )

    if two_edits:
        return two_edits

    return {word}

This follows a practical ranking order:

Known original word.
Known words one edit away.
Known words two edits away.
Original word if no correction is known.

Ranking Suggestions

Rank candidates by frequency.

def correction(word, frequencies):
    return max(
        candidates(word, frequencies),
        key=lambda candidate: frequencies[candidate],
    )

Return multiple suggestions:

def suggestions(word, frequencies, limit=5):
    ranked = sorted(
        candidates(word, frequencies),
        key=lambda candidate: frequencies[candidate],
        reverse=True,
    )

    return ranked[:limit]

Example:

corpus = [
    "algorithm algorithm algorithm",
    "algorithms",
    "altruism",
    "rhythm",
]

frequencies = build_frequency_table(corpus)

print(suggestions("algoritm", frequencies))

Expected output:

['algorithm']

This is the basic spell checker popularized by simple probabilistic spelling-correction examples: generate candidates, filter known words, and choose the most frequent.

Why Frequency Matters

Edit distance alone cannot choose between plausible words.

Input:

form

Possible corrections:

form
from
farm
firm

The correct answer depends on context. Without context, frequency is the simplest useful signal. If from appears far more often than farm, it may be the better correction for many inputs.

But frequency should not override exact matches. If the user typed a known word, keep it unless there is a strong reason to autocorrect.

Computing Edit Distance Directly

Candidate generation is fast for short words, but sometimes you need the actual edit distance between two words.

Use dynamic programming.

def edit_distance(a, b):
    rows = len(a)
    cols = len(b)

    dp = [
        [0] * (cols + 1)
        for _ in range(rows + 1)
    ]

    for i in range(rows + 1):
        dp[i][0] = i

    for j in range(cols + 1):
        dp[0][j] = j

    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1

            dp[i][j] = min(
                dp[i - 1][j] + 1,
                dp[i][j - 1] + 1,
                dp[i - 1][j - 1] + cost,
            )

    return dp[rows][cols]

Example:

print(edit_distance("algoritm", "algorithm"))

Output:

Direct edit distance is useful for ranking candidates from a restricted dictionary, but computing it against every dictionary word is expensive.

Avoiding Full Dictionary Scans

A naive spell checker compares the input word against every dictionary word.

def slow_suggestions(word, dictionary):
    return [
        candidate
        for candidate in dictionary
        if edit_distance(word, candidate) <= 2
    ]

This has poor scaling:

O(dictionary_size × word_length²)

Candidate generation reverses the direction. Instead of testing every dictionary word, generate nearby strings and check them in a hash set.

This makes lookup fast for short words.

BK-Trees

For larger dictionaries or non-English alphabets, a BK-tree can index words by edit distance.

A BK-tree stores each word as a node. Edges are labeled by edit distance from the parent word.

When searching for words within distance d, the triangle inequality prunes branches.

Simplified implementation:

class BKNode:
    def __init__(self, word):
        self.word = word
        self.children = {}

class BKTree:
    def __init__(self, distance):
        self.root = None
        self.distance = distance

    def add(self, word):
        if self.root is None:
            self.root = BKNode(word)
            return

        node = self.root

        while True:
            dist = self.distance(word, node.word)

            if dist in node.children:
                node = node.children[dist]
            else:
                node.children[dist] = BKNode(word)
                return

    def search(self, word, max_distance):
        results = []

        def visit(node):
            dist = self.distance(word, node.word)

            if dist <= max_distance:
                results.append((node.word, dist))

            low = dist - max_distance
            high = dist + max_distance

            for edge_distance, child in node.children.items():
                if low <= edge_distance <= high:
                    visit(child)

        if self.root is not None:
            visit(self.root)

        return results

BK-trees are useful when direct candidate generation becomes awkward, especially for large alphabets or custom distance functions.

Context-Aware Correction

Single-word correction fails when the misspelled word is also a valid word.

Example:

I will reed the book.

The word reed is valid, but context suggests read.

A context-aware spell checker scores sequences rather than isolated words.

A simple bigram model estimates:

P(current word | previous word)

Example:

will read
will reed

If will read appears much more often in the corpus, choose read.

Build bigram counts:

def build_bigram_counts(corpus):
    counts = Counter()

    for line in corpus:
        tokens = tokenize(line)

        for i in range(len(tokens) - 1):
            counts[(tokens[i], tokens[i + 1])] += 1

    return counts

Score candidate by previous word:

def contextual_suggestions(previous, word, frequencies, bigrams, limit=5):
    ranked = sorted(
        candidates(word, frequencies),
        key=lambda candidate: (
            bigrams[(previous, candidate)],
            frequencies[candidate],
        ),
        reverse=True,
    )

    return ranked[:limit]

This is still crude, but it shows the structure: local candidate generation plus language-model ranking.

Domain Dictionaries

Generic dictionaries are not enough for technical text.

A programming document may contain:

Dijkstra
heapq
idempotency
PostgreSQL
Kubernetes

A generic spell checker may mark all of these as errors.

Support a custom dictionary:

def merge_dictionaries(base_frequencies, custom_words):
    merged = Counter(base_frequencies)

    for word in custom_words:
        merged[word.lower()] += 1_000_000

    return merged

Boosting custom words makes them valid and prevents unwanted correction.

For code editors, dictionaries often include language keywords, package names, identifiers from the current workspace, and words from comments.

Case Preservation

If the user types:

Algoritm

the correction should be:

Algorithm

If the user types:

ALGORITM

the correction may be:

ALGORITHM

Preserve casing:

def match_case(original, corrected):
    if original.isupper():
        return corrected.upper()

    if original[:1].isupper():
        return corrected.capitalize()

    return corrected

Use it:

def correction_preserve_case(word, frequencies):
    corrected = correction(word.lower(), frequencies)
    return match_case(word, corrected)

Autocorrect vs Suggestions

Spell checkers have two modes.

Mode	Behavior
Suggestion	Show likely corrections
Autocorrect	Replace automatically

Autocorrect requires a higher confidence threshold. Replacing a user's valid word is worse than offering no correction.

A simple confidence rule:

def should_autocorrect(word, best, frequencies):
    if word.lower() in frequencies:
        return False

    word_candidates = suggestions(word.lower(), frequencies, limit=2)

    if len(word_candidates) < 2:
        return True

    first = frequencies[word_candidates[0]]
    second = frequencies[word_candidates[1]]

    return first >= 5 * second

This avoids automatic replacement when the top two candidates are close.

Tokenization and Punctuation

A spell checker should not destroy punctuation.

Input:

Algoritm, design!

Desired output:

Algorithm, design!

Separate words from punctuation, correct only word tokens, then preserve punctuation.

TOKEN_RE = re.compile(r"[A-Za-z]+|[^A-Za-z]+")

def correct_text(text, frequencies):
    parts = []

    for token in TOKEN_RE.findall(text):
        if token.isalpha():
            parts.append(correction_preserve_case(token, frequencies))
        else:
            parts.append(token)

    return "".join(parts)

This keeps spaces, commas, periods, and other punctuation intact.

Complexity

For candidate generation:

Operation	Complexity
Generate edit-distance-one candidates	O(a × n)
Dictionary membership check	O(1) average per candidate
Generate edit-distance-two candidates	Much larger, roughly O((a × n)²)
Rank candidates	O(k log k)

Where:

a is alphabet size.
n is word length.
k is number of known candidates.

For most English words, this is acceptable. For long identifiers or large alphabets, use indexing strategies such as BK-trees, deletion indexes, n-gram indexes, or trie-based search.

Testing

Test exact known words first.

def test_known_word_is_kept():
    frequencies = Counter({
        "algorithm": 10,
        "altruism": 1,
    })

    assert correction("algorithm", frequencies) == "algorithm"

Test simple correction.

def test_one_edit_correction():
    frequencies = Counter({
        "algorithm": 10,
    })

    assert correction("algoritm", frequencies) == "algorithm"

Test casing.

def test_case_preservation():
    frequencies = Counter({
        "algorithm": 10,
    })

    assert correction_preserve_case("Algoritm", frequencies) == "Algorithm"

Test punctuation preservation.

def test_punctuation_preservation():
    frequencies = Counter({
        "algorithm": 10,
        "design": 5,
    })

    assert correct_text("Algoritm, design!", frequencies) == "Algorithm, design!"

These tests define the behavior users notice most.

Common Bugs

The most common bug is correcting words that are already valid. Exact dictionary matches should usually be preserved.

Another common bug is ranking only by edit distance. Without frequency or context, the system often chooses obscure words.

Candidate explosion can make edit-distance-two generation expensive. Do it only after edit-distance-one candidates fail.

Tokenization bugs can remove punctuation, change whitespace, or corrupt mixed text.

Case handling can produce strange output, especially for acronyms.

Domain words are often incorrectly marked as misspellings. Provide custom dictionaries and user dictionaries early.

Autocorrect can be harmful when confidence is low. Prefer suggestions unless the top candidate is clearly dominant.

Recipe

Build the spell checker in layers.

Start with a frequency dictionary. Generate one-edit candidates. Filter by known words. Rank by frequency. Add two-edit candidates only as fallback. Preserve casing and punctuation. Add custom dictionaries for domain terms. Add context-aware ranking with bigrams when isolated-word correction is not enough. Use conservative thresholds for autocorrect. Move to BK-trees or n-gram indexes when dictionary size or alphabet complexity requires it.

The main lesson is that spell checking is approximate search plus ranking. Edit distance generates plausible candidates, but frequency, context, domain knowledge, and confidence thresholds determine whether the result is useful.