16.21 Text Indexing

A single pattern search scans one text once.

16.21 Text Indexing

Problem

A single pattern search scans one text once.

A text index solves a different problem: one or more large texts are stored, and many queries arrive later.

Examples:

Find all documents containing "algorithm"
Find every occurrence of "banana"
Find phrases beginning with "dynamic programming"
Find logs containing both "timeout" and "database"

Scanning every document for every query is too slow.

The challenge is to preprocess text so queries can be answered quickly.

Solution

Build an index.

A text index stores derived information about the input text. Instead of searching raw text directly, queries consult the index.

The simplest useful index is an inverted index:

term → list of documents or positions

For example:

algorithm → [(doc1, 14), (doc3, 88), (doc9, 5)]
search    → [(doc2, 7), (doc3, 19)]
string    → [(doc1, 4), (doc5, 31)]

This representation turns term lookup into a map query.

Inverted Index

An inverted index maps each token to the places where it appears.

Given documents:

doc1: string algorithms are useful
doc2: graph algorithms are useful
doc3: string search uses indexes

The index contains:

Term Postings
algorithms doc1, doc2
graph doc2
indexes doc3
search doc3
string doc1, doc3
useful doc1, doc2

A query for:

string

returns:

doc1, doc3

without scanning every document.

Positional Indexes

For phrase queries, document-level postings are not enough.

Query:

string algorithms

requires the words to appear next to each other.

Store positions:

Term Postings
string doc1:[0], doc3:[0]
algorithms doc1:[1], doc2:[1]

Now the phrase query checks whether:

position(string) + 1 = position(algorithms)

In doc1, this holds:

string algorithms
0      1

So doc1 matches.

Building an Index

A simple pipeline:

Document
  ↓
Normalize
  ↓
Tokenize
  ↓
Assign document ID
  ↓
Record term positions
  ↓
Store postings lists

The normalization and tokenization policy must match query processing. If documents are lowercased during indexing, queries must also be lowercased.

Lean Sketch

A minimal posting can be represented as:

structure Posting where
  docId : Nat
  positions : List Nat
deriving Repr

An inverted index:

abbrev InvertedIndex :=
  Std.HashMap String (List Posting)

Indexing a token stream:

def addDocument
  (index : InvertedIndex)
  (docId : Nat)
  (tokens : Array String)
  : InvertedIndex :=
by
  -- For each token position:
  --   insert docId and position into postings.
  sorry

The core invariant is straightforward:

For every term, its postings list records exactly the documents and positions where that term occurs.

Query Processing

A single-term query is direct:

lookup(term)

A multi-term query combines postings lists.

For:

string AND algorithms

compute the intersection:

postings(string) ∩ postings(algorithms)

For:

string OR algorithms

compute the union.

For:

string NOT algorithms

subtract the second postings list from the first.

Most search engines spend significant effort making these postings operations fast.

Intersecting Postings Lists

Postings lists are usually stored sorted by document ID.

Example:

string     → [1, 3, 5, 10]
algorithm  → [2, 3, 10, 11]

Intersection can be computed with two pointers:

1 < 2  advance first
3 > 2  advance second
3 = 3  emit 3
5 < 10 advance first
10 = 10 emit 10

Result:

[3, 10]

This is linear in the combined postings lengths.

Phrase Queries

For a phrase:

dynamic programming

first intersect documents containing both terms.

Then check positions.

dynamic     → doc7:[4, 20]
programming → doc7:[5, 21]

Both positions satisfy adjacency:

4 + 1 = 5
20 + 1 = 21

So doc7 contains the phrase twice.

For longer phrases, the same principle extends across multiple term-position lists.

Prefix Queries

A query such as:

alg*

requires finding all indexed terms beginning with:

alg

A trie is a natural companion structure.

alg
├── algebra
├── algorithm
├── algorithms
└── algae

The trie returns matching terms. Their postings lists are then merged.

A lexicographically sorted term dictionary can also answer prefix queries using binary search over a contiguous range.

Substring Queries

Inverted indexes work well for token search, but not arbitrary substring search.

Query:

"gram"

inside:

programming

requires a different structure.

Options include:

Query Type Structure
Token lookup Inverted index
Prefix lookup Trie or sorted dictionary
Substring lookup Suffix array, suffix automaton, n-gram index
Approximate lookup Edit distance plus filtering

A practical search system often combines several indexes.

N-Gram Indexes

An n-gram index stores fixed-length character fragments.

For:

algorithm

with trigrams:

alg
lgo
gor
ori
rit
ith
thm

The index maps each trigram to terms or documents containing it.

A query for:

gori

is decomposed into:

gor
ori

Candidate documents are found by intersecting postings, then verified.

N-gram indexes are useful for substring and approximate search.

Compression

Large indexes are dominated by postings lists.

A postings list such as:

[100, 105, 109, 130]

is often stored as gaps:

[100, 5, 4, 21]

Small gaps compress better than large absolute IDs.

Common compression techniques include:

  • Variable-byte encoding
  • Elias gamma coding
  • Frame-of-reference encoding
  • Bit packing

Compression reduces storage and often improves cache behavior.

Ranking

An index can return matching documents, but real search systems must rank them.

Common ranking signals include:

  • Term frequency
  • Inverse document frequency
  • Field weights
  • Phrase proximity
  • Document freshness
  • Link or authority scores
  • User behavior signals

The algorithmic foundation remains the same: the index narrows the candidate set, then ranking orders the results.

Complexity Analysis

Let:

Symbol Meaning
N Total number of tokens
V Vocabulary size
p Total postings inspected
Operation Complexity
Build inverted index O(N) average
Single-term lookup O(1) average map lookup
AND query O(p)
Phrase query O(p + position checks)
Prefix query O(length(prefix) + matches) with trie
Storage O(N) postings

The exact performance depends on postings compression, memory layout, and query distribution.

Correctness

An inverted index is correct when each term maps to exactly the documents and positions where it occurs after normalization and tokenization.

Single-term lookup follows directly from this invariant.

Boolean queries are correct because union, intersection, and difference on postings lists correspond exactly to OR, AND, and NOT over document sets.

Phrase queries are correct because positional adjacency checks enforce contiguity in the original token stream.

Common Pitfalls

Do not index documents with one tokenizer and query with another.

Do not discard positions if phrase queries are required.

Do not use an inverted index for arbitrary substring search unless you also build an n-gram or suffix-based structure.

Do not ignore normalization. Case folding, Unicode normalization, stemming, and stop-word handling all change query semantics.

Do not rank before filtering. Use the index to find candidates first.

Takeaway

Text indexing moves work from query time to build time. An inverted index makes token search fast by mapping each term to the places where it appears. Positional postings support phrase queries. Tries and sorted dictionaries support prefix queries. N-grams and suffix structures support substring queries. The central design task is choosing the index that matches the query language you need to support.