16.22 Building a Search Engine
Throughout this chapter, we studied individual string-processing techniques: - Tokenization - Tries - Hashing
16.22 Building a Search Engine
Problem
Throughout this chapter, we studied individual string-processing techniques:
- Tokenization
- Tries
- Hashing
- Aho-Corasick
- Suffix Arrays
- Edit Distance
- Text Indexes
Real systems rarely use these algorithms in isolation.
Consider a search engine.
Users expect:
Fast queries
Typo tolerance
Phrase search
Prefix completion
Relevant ranking
Large-scale indexing
No single algorithm provides all of these capabilities.
The challenge is to combine multiple string-processing techniques into a practical search architecture.
Solution
A search engine is fundamentally a pipeline.
Documents
↓
Preprocessing
↓
Index Construction
↓
Query Processing
↓
Ranking
↓
Results
Each stage applies one or more algorithms from this chapter.
The individual algorithms are important, but the architecture is what makes the system useful.
Step 1: Document Ingestion
Documents arrive from many sources:
Web pages
Books
Logs
Source code
PDF files
Emails
Raw text often contains:
HTML
Formatting
Metadata
Noise
The first step extracts textual content.
Example:
<h1>String Algorithms</h1>
becomes:
String Algorithms
This process is usually called document parsing.
Step 2: Unicode Normalization
Before indexing:
José
and:
José
must become consistent.
Apply:
Unicode normalization
typically NFC.
Pipeline:
Raw Text
↓
Unicode Normalize
↓
Canonical Text
Without normalization, equivalent words may be indexed separately.
Step 3: Tokenization
Convert text into tokens.
Input:
String algorithms are useful.
Output:
string
algorithms
are
useful
Possible transformations:
- Lowercasing
- Stemming
- Stop-word removal
- Number normalization
Example:
Running
Runs
Runner
may become:
run
depending on the indexing policy.
Step 4: Building the Dictionary
The vocabulary consists of all unique terms.
Example:
algorithm
search
string
tree
graph
The dictionary supports:
- Lookup
- Prefix search
- Statistics
Common implementations:
| Structure | Purpose |
|---|---|
| Hash table | Fast lookup |
| Sorted array | Range queries |
| Trie | Prefix search |
A trie is particularly useful for autocomplete.
Step 5: Constructing the Inverted Index
The inverted index maps:
term
to:
documents
Example:
| Term | Documents |
|---|---|
| algorithm | 1, 4, 7 |
| search | 2, 4, 8 |
| string | 1, 2, 3 |
With positions:
| Term | Postings |
|---|---|
| string | (1,4), (3,18), (8,2) |
The index becomes the primary search structure.
Most query processing begins here.
Step 6: Compression
Indexes become large.
Suppose:
string
appears in:
100
105
109
120
Store gaps:
100
5
4
11
instead of full IDs.
Compression reduces:
- Storage
- Memory usage
- Disk I/O
Many search systems spend significant engineering effort optimizing postings storage.
Step 7: Query Processing
A query arrives:
string algorithms
The query undergoes the same preprocessing pipeline:
Normalize
Tokenize
Lowercase
Stem
Consistency is critical.
The query must be transformed exactly as documents were transformed.
Otherwise matches are lost.
Single-Term Queries
Example:
algorithm
Lookup:
dictionary["algorithm"]
Result:
posting list
Complexity is effectively constant.
Boolean Queries
Example:
string AND algorithm
Retrieve:
posting(string)
posting(algorithm)
Intersect:
doc1
doc5
doc10
Similarly:
OR
NOT
become union and difference operations.
The inverted index naturally supports Boolean retrieval.
Phrase Queries
Example:
"string algorithm"
Documents must contain:
string
algorithm
adjacent to one another.
Positional postings enable this.
Example:
string → [4]
algorithm → [5]
Since:
4 + 1 = 5
the phrase exists.
Phrase search is one reason positional indexes are valuable.
Prefix Queries
Example:
alg*
Use a trie:
alg
├── algebra
├── algorithm
├── algorithms
└── algae
Retrieve matching terms.
Merge their posting lists.
Autocomplete systems often use this approach.
Typo-Tolerant Search
Suppose a user enters:
algoritm
Dictionary lookup fails.
Apply edit distance.
Possible candidates:
algorithm
algorithms
Distance:
1
The closest candidate becomes the corrected query.
Large systems typically use:
- Edit distance
- N-gram indexes
- Phonetic matching
together.
Substring Search
Query:
gram
inside:
algorithm
An inverted index alone cannot answer this efficiently.
Options:
| Structure | Supports |
|---|---|
| N-gram index | Substrings |
| Suffix Array | Full substring search |
| Suffix Automaton | Advanced substring queries |
Many search engines build specialized secondary indexes for this purpose.
Ranking Results
Finding matches is only the first step.
A search engine must decide:
Which result comes first?
Simple ranking:
Term Frequency
Documents containing the term more often receive higher scores.
Better ranking:
TF-IDF
where:
TF
measures frequency inside a document and:
IDF
rewards rare terms.
Rare terms are often more informative.
Example
Query:
suffix array
Document A:
suffix array suffix array suffix array
Document B:
suffix array
Both match.
Document A may receive a higher score because the query terms appear more frequently.
Ranking determines presentation order.
Fuzzy Ranking
Suppose:
algoritm
matches:
algorithm
and:
algorithms
Possible ranking signals:
- Edit distance
- Popularity
- Frequency
- User behavior
The search engine combines these signals into a final score.
Caching
Popular queries repeat.
Examples:
weather
news
python
Store previous results:
Query
↓
Cache
↓
Result
A cache avoids repeating expensive processing.
This can dramatically improve throughput.
Distributed Search
Large collections exceed the capacity of a single machine.
Documents are partitioned:
Shard 1
Shard 2
Shard 3
A query executes on every shard:
Search
↓
Shard Queries
↓
Merge Results
This architecture powers large-scale web search systems.
The underlying string-processing algorithms remain unchanged.
Where Chapter Algorithms Appear
| Component | Algorithm |
|---|---|
| Tokenization | Token Streams |
| Normalization | Unicode Processing |
| Dictionary | Trie |
| Exact Search | KMP, Boyer-Moore |
| Multi-pattern Search | Aho-Corasick |
| Typo Correction | Edit Distance |
| Prefix Search | Trie |
| Substring Search | Suffix Array |
| Duplicate Detection | String Hashing |
| Compression | Compressed Strings |
A search engine is essentially a composition of many string-processing techniques.
Lean Sketch
A simplified architecture:
structure SearchEngine where
dictionary : Trie
index : InvertedIndex
Indexing:
def addDocument
(engine : SearchEngine)
(text : String)
: SearchEngine :=
by
-- Normalize.
-- Tokenize.
-- Update dictionary.
-- Update postings.
sorry
Querying:
def search
(engine : SearchEngine)
(query : String)
: List Nat :=
by
-- Normalize.
-- Tokenize.
-- Retrieve postings.
-- Rank results.
sorry
The details may vary, but the overall architecture remains surprisingly stable across implementations.
Complexity Analysis
Let:
| Symbol | Meaning |
|---|---|
| N | Total indexed tokens |
| V | Vocabulary size |
| P | Postings examined |
| Operation | Complexity |
|---|---|
| Index construction | O(N) |
| Dictionary lookup | O(1) average |
| Trie prefix lookup | O(prefix length) |
| Posting intersection | O(P) |
| Phrase verification | O(P) |
| Edit-distance correction | Depends on candidate set |
Most query time is spent processing postings lists and ranking results.
Correctness
The search engine is correct if:
- Documents and queries use identical normalization.
- Every indexed term appears in the correct postings list.
- Boolean operations correctly manipulate document sets.
- Phrase queries verify positional adjacency.
- Ranking only reorders matching results rather than inventing matches.
Each stage preserves the information needed by later stages.
Common Pitfalls
Do not normalize queries differently from documents.
Do not remove position information if phrase search is required.
Do not use edit distance over the entire vocabulary without filtering.
Do not rely on a single ranking signal.
Do not assume substring search can be handled efficiently by an inverted index alone.
Takeaway
A search engine is one of the best examples of applied string processing. Nearly every major technique in this chapter appears somewhere in the pipeline: normalization prepares text, tokenization creates structure, tries support prefixes, inverted indexes support retrieval, edit distance handles mistakes, suffix structures support substrings, and hashing accelerates comparison. The individual algorithms are valuable, but their real power emerges when they work together as a coherent system.