16.22 Building a Search Engine

Throughout this chapter, we studied individual string-processing techniques: - Tokenization - Tries - Hashing

16.22 Building a Search Engine

Problem

Throughout this chapter, we studied individual string-processing techniques:

  • Tokenization
  • Tries
  • Hashing
  • Aho-Corasick
  • Suffix Arrays
  • Edit Distance
  • Text Indexes

Real systems rarely use these algorithms in isolation.

Consider a search engine.

Users expect:

Fast queries
Typo tolerance
Phrase search
Prefix completion
Relevant ranking
Large-scale indexing

No single algorithm provides all of these capabilities.

The challenge is to combine multiple string-processing techniques into a practical search architecture.

Solution

A search engine is fundamentally a pipeline.

Documents
    ↓
Preprocessing
    ↓
Index Construction
    ↓
Query Processing
    ↓
Ranking
    ↓
Results

Each stage applies one or more algorithms from this chapter.

The individual algorithms are important, but the architecture is what makes the system useful.

Step 1: Document Ingestion

Documents arrive from many sources:

Web pages
Books
Logs
Source code
PDF files
Emails

Raw text often contains:

HTML
Formatting
Metadata
Noise

The first step extracts textual content.

Example:

<h1>String Algorithms</h1>

becomes:

String Algorithms

This process is usually called document parsing.

Step 2: Unicode Normalization

Before indexing:

José

and:

José

must become consistent.

Apply:

Unicode normalization

typically NFC.

Pipeline:

Raw Text
   ↓
Unicode Normalize
   ↓
Canonical Text

Without normalization, equivalent words may be indexed separately.

Step 3: Tokenization

Convert text into tokens.

Input:

String algorithms are useful.

Output:

string
algorithms
are
useful

Possible transformations:

  • Lowercasing
  • Stemming
  • Stop-word removal
  • Number normalization

Example:

Running
Runs
Runner

may become:

run

depending on the indexing policy.

Step 4: Building the Dictionary

The vocabulary consists of all unique terms.

Example:

algorithm
search
string
tree
graph

The dictionary supports:

  • Lookup
  • Prefix search
  • Statistics

Common implementations:

Structure Purpose
Hash table Fast lookup
Sorted array Range queries
Trie Prefix search

A trie is particularly useful for autocomplete.

Step 5: Constructing the Inverted Index

The inverted index maps:

term

to:

documents

Example:

Term Documents
algorithm 1, 4, 7
search 2, 4, 8
string 1, 2, 3

With positions:

Term Postings
string (1,4), (3,18), (8,2)

The index becomes the primary search structure.

Most query processing begins here.

Step 6: Compression

Indexes become large.

Suppose:

string

appears in:

100
105
109
120

Store gaps:

100
5
4
11

instead of full IDs.

Compression reduces:

  • Storage
  • Memory usage
  • Disk I/O

Many search systems spend significant engineering effort optimizing postings storage.

Step 7: Query Processing

A query arrives:

string algorithms

The query undergoes the same preprocessing pipeline:

Normalize
Tokenize
Lowercase
Stem

Consistency is critical.

The query must be transformed exactly as documents were transformed.

Otherwise matches are lost.

Single-Term Queries

Example:

algorithm

Lookup:

dictionary["algorithm"]

Result:

posting list

Complexity is effectively constant.

Boolean Queries

Example:

string AND algorithm

Retrieve:

posting(string)
posting(algorithm)

Intersect:

doc1
doc5
doc10

Similarly:

OR
NOT

become union and difference operations.

The inverted index naturally supports Boolean retrieval.

Phrase Queries

Example:

"string algorithm"

Documents must contain:

string
algorithm

adjacent to one another.

Positional postings enable this.

Example:

string    → [4]
algorithm → [5]

Since:

4 + 1 = 5

the phrase exists.

Phrase search is one reason positional indexes are valuable.

Prefix Queries

Example:

alg*

Use a trie:

alg
├── algebra
├── algorithm
├── algorithms
└── algae

Retrieve matching terms.

Merge their posting lists.

Autocomplete systems often use this approach.

Suppose a user enters:

algoritm

Dictionary lookup fails.

Apply edit distance.

Possible candidates:

algorithm
algorithms

Distance:

1

The closest candidate becomes the corrected query.

Large systems typically use:

  • Edit distance
  • N-gram indexes
  • Phonetic matching

together.

Query:

gram

inside:

algorithm

An inverted index alone cannot answer this efficiently.

Options:

Structure Supports
N-gram index Substrings
Suffix Array Full substring search
Suffix Automaton Advanced substring queries

Many search engines build specialized secondary indexes for this purpose.

Ranking Results

Finding matches is only the first step.

A search engine must decide:

Which result comes first?

Simple ranking:

Term Frequency

Documents containing the term more often receive higher scores.

Better ranking:

TF-IDF

where:

TF

measures frequency inside a document and:

IDF

rewards rare terms.

Rare terms are often more informative.

Example

Query:

suffix array

Document A:

suffix array suffix array suffix array

Document B:

suffix array

Both match.

Document A may receive a higher score because the query terms appear more frequently.

Ranking determines presentation order.

Fuzzy Ranking

Suppose:

algoritm

matches:

algorithm

and:

algorithms

Possible ranking signals:

  • Edit distance
  • Popularity
  • Frequency
  • User behavior

The search engine combines these signals into a final score.

Caching

Popular queries repeat.

Examples:

weather
news
python

Store previous results:

Query
   ↓
Cache
   ↓
Result

A cache avoids repeating expensive processing.

This can dramatically improve throughput.

Large collections exceed the capacity of a single machine.

Documents are partitioned:

Shard 1
Shard 2
Shard 3

A query executes on every shard:

Search
   ↓
Shard Queries
   ↓
Merge Results

This architecture powers large-scale web search systems.

The underlying string-processing algorithms remain unchanged.

Where Chapter Algorithms Appear

Component Algorithm
Tokenization Token Streams
Normalization Unicode Processing
Dictionary Trie
Exact Search KMP, Boyer-Moore
Multi-pattern Search Aho-Corasick
Typo Correction Edit Distance
Prefix Search Trie
Substring Search Suffix Array
Duplicate Detection String Hashing
Compression Compressed Strings

A search engine is essentially a composition of many string-processing techniques.

Lean Sketch

A simplified architecture:

structure SearchEngine where
  dictionary : Trie
  index : InvertedIndex

Indexing:

def addDocument
  (engine : SearchEngine)
  (text : String)
  : SearchEngine :=
by
  -- Normalize.
  -- Tokenize.
  -- Update dictionary.
  -- Update postings.
  sorry

Querying:

def search
  (engine : SearchEngine)
  (query : String)
  : List Nat :=
by
  -- Normalize.
  -- Tokenize.
  -- Retrieve postings.
  -- Rank results.
  sorry

The details may vary, but the overall architecture remains surprisingly stable across implementations.

Complexity Analysis

Let:

Symbol Meaning
N Total indexed tokens
V Vocabulary size
P Postings examined
Operation Complexity
Index construction O(N)
Dictionary lookup O(1) average
Trie prefix lookup O(prefix length)
Posting intersection O(P)
Phrase verification O(P)
Edit-distance correction Depends on candidate set

Most query time is spent processing postings lists and ranking results.

Correctness

The search engine is correct if:

  1. Documents and queries use identical normalization.
  2. Every indexed term appears in the correct postings list.
  3. Boolean operations correctly manipulate document sets.
  4. Phrase queries verify positional adjacency.
  5. Ranking only reorders matching results rather than inventing matches.

Each stage preserves the information needed by later stages.

Common Pitfalls

Do not normalize queries differently from documents.

Do not remove position information if phrase search is required.

Do not use edit distance over the entire vocabulary without filtering.

Do not rely on a single ranking signal.

Do not assume substring search can be handled efficiently by an inverted index alone.

Takeaway

A search engine is one of the best examples of applied string processing. Nearly every major technique in this chapter appears somewhere in the pipeline: normalization prepares text, tokenization creates structure, tries support prefixes, inverted indexes support retrieval, edit distance handles mistakes, suffix structures support substrings, and hashing accelerates comparison. The individual algorithms are valuable, but their real power emerges when they work together as a coherent system.