16.22 Building a Search Engine

Problem

Throughout this chapter, we studied individual string-processing techniques:

Tokenization
Tries
Hashing
Aho-Corasick
Suffix Arrays
Edit Distance
Text Indexes

Real systems rarely use these algorithms in isolation.

Consider a search engine.

Users expect:

Fast queries
Typo tolerance
Phrase search
Prefix completion
Relevant ranking
Large-scale indexing

No single algorithm provides all of these capabilities.

The challenge is to combine multiple string-processing techniques into a practical search architecture.

Solution

A search engine is fundamentally a pipeline.

Documents
    ↓
Preprocessing
    ↓
Index Construction
    ↓
Query Processing
    ↓
Ranking
    ↓
Results

Each stage applies one or more algorithms from this chapter.

The individual algorithms are important, but the architecture is what makes the system useful.

Step 1: Document Ingestion

Documents arrive from many sources:

Web pages
Books
Logs
Source code
PDF files
Emails

Raw text often contains:

HTML
Formatting
Metadata
Noise

The first step extracts textual content.

Example:

<h1>String Algorithms</h1>

becomes:

String Algorithms

This process is usually called document parsing.

Step 2: Unicode Normalization

Before indexing:

José

and:

José

must become consistent.

Apply:

Unicode normalization

typically NFC.

Pipeline:

Raw Text
   ↓
Unicode Normalize
   ↓
Canonical Text

Without normalization, equivalent words may be indexed separately.

Step 3: Tokenization

Convert text into tokens.

Input:

String algorithms are useful.

Output:

string
algorithms
are
useful

Possible transformations:

Lowercasing
Stemming
Stop-word removal
Number normalization

Example:

Running
Runs
Runner

may become:

run

depending on the indexing policy.

Step 4: Building the Dictionary

The vocabulary consists of all unique terms.

Example:

algorithm
search
string
tree
graph

The dictionary supports:

Lookup
Prefix search
Statistics

Common implementations:

Structure	Purpose
Hash table	Fast lookup
Sorted array	Range queries
Trie	Prefix search

A trie is particularly useful for autocomplete.

Step 5: Constructing the Inverted Index

The inverted index maps:

term

to:

documents

Example:

Term	Documents
algorithm	1, 4, 7
search	2, 4, 8
string	1, 2, 3

With positions:

Term	Postings
string	(1,4), (3,18), (8,2)

The index becomes the primary search structure.

Most query processing begins here.

Step 6: Compression

Indexes become large.

Suppose:

string

appears in:

Store gaps:

instead of full IDs.

Compression reduces:

Storage
Memory usage
Disk I/O

Many search systems spend significant engineering effort optimizing postings storage.

Step 7: Query Processing

A query arrives:

string algorithms

The query undergoes the same preprocessing pipeline:

Normalize
Tokenize
Lowercase
Stem

Consistency is critical.

The query must be transformed exactly as documents were transformed.

Otherwise matches are lost.

Single-Term Queries

Example:

algorithm

Lookup:

dictionary["algorithm"]

Result:

posting list

Complexity is effectively constant.

Boolean Queries

Example:

string AND algorithm

Retrieve:

posting(string)
posting(algorithm)

Intersect:

doc1
doc5
doc10

Similarly:

OR
NOT

become union and difference operations.

The inverted index naturally supports Boolean retrieval.

Phrase Queries

Example:

"string algorithm"

Documents must contain:

string
algorithm

adjacent to one another.

Positional postings enable this.

Example:

string    → [4]
algorithm → [5]

Since:

4 + 1 = 5

the phrase exists.

Phrase search is one reason positional indexes are valuable.

Prefix Queries

Example:

alg*

Use a trie:

alg
├── algebra
├── algorithm
├── algorithms
└── algae

Retrieve matching terms.

Merge their posting lists.

Autocomplete systems often use this approach.

Typo-Tolerant Search

Suppose a user enters:

algoritm

Dictionary lookup fails.

Apply edit distance.

Possible candidates:

algorithm
algorithms

Distance:

The closest candidate becomes the corrected query.

Large systems typically use:

Edit distance
N-gram indexes
Phonetic matching

together.

Substring Search

Query:

gram

inside:

algorithm

An inverted index alone cannot answer this efficiently.

Options:

Structure	Supports
N-gram index	Substrings
Suffix Array	Full substring search
Suffix Automaton	Advanced substring queries

Many search engines build specialized secondary indexes for this purpose.

Ranking Results

Finding matches is only the first step.

A search engine must decide:

Which result comes first?

Simple ranking:

Term Frequency

Documents containing the term more often receive higher scores.

Better ranking:

TF-IDF

where:

TF

measures frequency inside a document and:

IDF

rewards rare terms.

Rare terms are often more informative.

Example

Query:

suffix array

Document A:

suffix array suffix array suffix array

Document B:

suffix array

Both match.

Document A may receive a higher score because the query terms appear more frequently.

Ranking determines presentation order.

Fuzzy Ranking

Suppose:

algoritm

matches:

algorithm

and:

algorithms

Possible ranking signals:

Edit distance
Popularity
Frequency
User behavior

The search engine combines these signals into a final score.

Caching

Popular queries repeat.

Examples:

weather
news
python

Store previous results:

Query
   ↓
Cache
   ↓
Result

A cache avoids repeating expensive processing.

This can dramatically improve throughput.

Distributed Search

Large collections exceed the capacity of a single machine.

Documents are partitioned:

Shard 1
Shard 2
Shard 3

A query executes on every shard:

Search
   ↓
Shard Queries
   ↓
Merge Results

This architecture powers large-scale web search systems.

The underlying string-processing algorithms remain unchanged.

Where Chapter Algorithms Appear

Component	Algorithm
Tokenization	Token Streams
Normalization	Unicode Processing
Dictionary	Trie
Exact Search	KMP, Boyer-Moore
Multi-pattern Search	Aho-Corasick
Typo Correction	Edit Distance
Prefix Search	Trie
Substring Search	Suffix Array
Duplicate Detection	String Hashing
Compression	Compressed Strings

A search engine is essentially a composition of many string-processing techniques.

Lean Sketch

A simplified architecture:

structure SearchEngine where
  dictionary : Trie
  index : InvertedIndex

Indexing:

def addDocument
  (engine : SearchEngine)
  (text : String)
  : SearchEngine :=
by
  -- Normalize.
  -- Tokenize.
  -- Update dictionary.
  -- Update postings.
  sorry

Querying:

def search
  (engine : SearchEngine)
  (query : String)
  : List Nat :=
by
  -- Normalize.
  -- Tokenize.
  -- Retrieve postings.
  -- Rank results.
  sorry

The details may vary, but the overall architecture remains surprisingly stable across implementations.

Complexity Analysis

Let:

Symbol	Meaning
N	Total indexed tokens
V	Vocabulary size
P	Postings examined

Operation	Complexity
Index construction	O(N)
Dictionary lookup	O(1) average
Trie prefix lookup	O(prefix length)
Posting intersection	O(P)
Phrase verification	O(P)
Edit-distance correction	Depends on candidate set

Most query time is spent processing postings lists and ranking results.

Correctness

The search engine is correct if:

Documents and queries use identical normalization.
Every indexed term appears in the correct postings list.
Boolean operations correctly manipulate document sets.
Phrase queries verify positional adjacency.
Ranking only reorders matching results rather than inventing matches.

Each stage preserves the information needed by later stages.

Common Pitfalls

Do not normalize queries differently from documents.

Do not remove position information if phrase search is required.

Do not use edit distance over the entire vocabulary without filtering.

Do not rely on a single ranking signal.

Do not assume substring search can be handled efficiently by an inverted index alone.

Takeaway

A search engine is one of the best examples of applied string processing. Nearly every major technique in this chapter appears somewhere in the pipeline: normalization prepares text, tokenization creates structure, tries support prefixes, inverted indexes support retrieval, edit distance handles mistakes, suffix structures support substrings, and hashing accelerates comparison. The individual algorithms are valuable, but their real power emerges when they work together as a coherent system.