16.21 Text Indexing
A single pattern search scans one text once.
16.21 Text Indexing
Problem
A single pattern search scans one text once.
A text index solves a different problem: one or more large texts are stored, and many queries arrive later.
Examples:
Find all documents containing "algorithm"
Find every occurrence of "banana"
Find phrases beginning with "dynamic programming"
Find logs containing both "timeout" and "database"
Scanning every document for every query is too slow.
The challenge is to preprocess text so queries can be answered quickly.
Solution
Build an index.
A text index stores derived information about the input text. Instead of searching raw text directly, queries consult the index.
The simplest useful index is an inverted index:
term → list of documents or positions
For example:
algorithm → [(doc1, 14), (doc3, 88), (doc9, 5)]
search → [(doc2, 7), (doc3, 19)]
string → [(doc1, 4), (doc5, 31)]
This representation turns term lookup into a map query.
Inverted Index
An inverted index maps each token to the places where it appears.
Given documents:
doc1: string algorithms are useful
doc2: graph algorithms are useful
doc3: string search uses indexes
The index contains:
| Term | Postings |
|---|---|
| algorithms | doc1, doc2 |
| graph | doc2 |
| indexes | doc3 |
| search | doc3 |
| string | doc1, doc3 |
| useful | doc1, doc2 |
A query for:
string
returns:
doc1, doc3
without scanning every document.
Positional Indexes
For phrase queries, document-level postings are not enough.
Query:
string algorithms
requires the words to appear next to each other.
Store positions:
| Term | Postings |
|---|---|
| string | doc1:[0], doc3:[0] |
| algorithms | doc1:[1], doc2:[1] |
Now the phrase query checks whether:
position(string) + 1 = position(algorithms)
In doc1, this holds:
string algorithms
0 1
So doc1 matches.
Building an Index
A simple pipeline:
Document
↓
Normalize
↓
Tokenize
↓
Assign document ID
↓
Record term positions
↓
Store postings lists
The normalization and tokenization policy must match query processing. If documents are lowercased during indexing, queries must also be lowercased.
Lean Sketch
A minimal posting can be represented as:
structure Posting where
docId : Nat
positions : List Nat
deriving Repr
An inverted index:
abbrev InvertedIndex :=
Std.HashMap String (List Posting)
Indexing a token stream:
def addDocument
(index : InvertedIndex)
(docId : Nat)
(tokens : Array String)
: InvertedIndex :=
by
-- For each token position:
-- insert docId and position into postings.
sorry
The core invariant is straightforward:
For every term, its postings list records exactly the documents and positions where that term occurs.
Query Processing
A single-term query is direct:
lookup(term)
A multi-term query combines postings lists.
For:
string AND algorithms
compute the intersection:
postings(string) ∩ postings(algorithms)
For:
string OR algorithms
compute the union.
For:
string NOT algorithms
subtract the second postings list from the first.
Most search engines spend significant effort making these postings operations fast.
Intersecting Postings Lists
Postings lists are usually stored sorted by document ID.
Example:
string → [1, 3, 5, 10]
algorithm → [2, 3, 10, 11]
Intersection can be computed with two pointers:
1 < 2 advance first
3 > 2 advance second
3 = 3 emit 3
5 < 10 advance first
10 = 10 emit 10
Result:
[3, 10]
This is linear in the combined postings lengths.
Phrase Queries
For a phrase:
dynamic programming
first intersect documents containing both terms.
Then check positions.
dynamic → doc7:[4, 20]
programming → doc7:[5, 21]
Both positions satisfy adjacency:
4 + 1 = 5
20 + 1 = 21
So doc7 contains the phrase twice.
For longer phrases, the same principle extends across multiple term-position lists.
Prefix Queries
A query such as:
alg*
requires finding all indexed terms beginning with:
alg
A trie is a natural companion structure.
alg
├── algebra
├── algorithm
├── algorithms
└── algae
The trie returns matching terms. Their postings lists are then merged.
A lexicographically sorted term dictionary can also answer prefix queries using binary search over a contiguous range.
Substring Queries
Inverted indexes work well for token search, but not arbitrary substring search.
Query:
"gram"
inside:
programming
requires a different structure.
Options include:
| Query Type | Structure |
|---|---|
| Token lookup | Inverted index |
| Prefix lookup | Trie or sorted dictionary |
| Substring lookup | Suffix array, suffix automaton, n-gram index |
| Approximate lookup | Edit distance plus filtering |
A practical search system often combines several indexes.
N-Gram Indexes
An n-gram index stores fixed-length character fragments.
For:
algorithm
with trigrams:
alg
lgo
gor
ori
rit
ith
thm
The index maps each trigram to terms or documents containing it.
A query for:
gori
is decomposed into:
gor
ori
Candidate documents are found by intersecting postings, then verified.
N-gram indexes are useful for substring and approximate search.
Compression
Large indexes are dominated by postings lists.
A postings list such as:
[100, 105, 109, 130]
is often stored as gaps:
[100, 5, 4, 21]
Small gaps compress better than large absolute IDs.
Common compression techniques include:
- Variable-byte encoding
- Elias gamma coding
- Frame-of-reference encoding
- Bit packing
Compression reduces storage and often improves cache behavior.
Ranking
An index can return matching documents, but real search systems must rank them.
Common ranking signals include:
- Term frequency
- Inverse document frequency
- Field weights
- Phrase proximity
- Document freshness
- Link or authority scores
- User behavior signals
The algorithmic foundation remains the same: the index narrows the candidate set, then ranking orders the results.
Complexity Analysis
Let:
| Symbol | Meaning |
|---|---|
| N | Total number of tokens |
| V | Vocabulary size |
| p | Total postings inspected |
| Operation | Complexity |
|---|---|
| Build inverted index | O(N) average |
| Single-term lookup | O(1) average map lookup |
| AND query | O(p) |
| Phrase query | O(p + position checks) |
| Prefix query | O(length(prefix) + matches) with trie |
| Storage | O(N) postings |
The exact performance depends on postings compression, memory layout, and query distribution.
Correctness
An inverted index is correct when each term maps to exactly the documents and positions where it occurs after normalization and tokenization.
Single-term lookup follows directly from this invariant.
Boolean queries are correct because union, intersection, and difference on postings lists correspond exactly to OR, AND, and NOT over document sets.
Phrase queries are correct because positional adjacency checks enforce contiguity in the original token stream.
Common Pitfalls
Do not index documents with one tokenizer and query with another.
Do not discard positions if phrase queries are required.
Do not use an inverted index for arbitrary substring search unless you also build an n-gram or suffix-based structure.
Do not ignore normalization. Case folding, Unicode normalization, stemming, and stop-word handling all change query semantics.
Do not rank before filtering. Use the index to find candidates first.
Takeaway
Text indexing moves work from query time to build time. An inverted index makes token search fast by mapping each term to the places where it appears. Positional postings support phrase queries. Tries and sorted dictionaries support prefix queries. N-grams and suffix structures support substring queries. The central design task is choosing the index that matches the query language you need to support.