20.2 Building a Search Ranking Pipeline

Problem

Design a search ranking pipeline that receives a user query, retrieves candidate documents, scores them, and returns a ranked result list.

Given a query:

graph shortest path

the system should return documents that discuss graph algorithms, shortest path algorithms, Dijkstra's algorithm, Bellman-Ford, A* search, and related topics. A naive implementation might scan every document and count query words, but that becomes too slow once the corpus grows.

A practical search system separates the work into stages:

query
  -> parse
  -> retrieve candidates
  -> score candidates
  -> rank
  -> return results

The core algorithmic idea is simple: avoid scoring every document. First use an index to find a much smaller candidate set, then rank only those candidates carefully.

Solution

Use an inverted index.

Instead of storing documents as the primary lookup structure, store a mapping from each term to the documents that contain it.

Example corpus:

D1: graph shortest path algorithms
D2: dynamic programming cookbook
D3: dijkstra shortest path graph
D4: sorting and searching algorithms

The inverted index is:

Term	Posting list
algorithms	D1, D4
cookbook	D2
dijkstra	D3
dynamic	D2
graph	D1, D3
path	D1, D3
programming	D2
searching	D4
shortest	D1, D3
sorting	D4

For the query:

graph shortest path

the search engine reads the posting lists for graph, shortest, and path, merges them, and produces candidate documents:

D1, D3

Only those documents need detailed scoring.

Building the Index

A minimal indexing pipeline tokenizes documents, normalizes terms, and records where each term occurs.

from collections import defaultdict

def tokenize(text):
    return [
        token.lower()
        for token in text.split()
        if token.strip()
    ]

def build_inverted_index(documents):
    index = defaultdict(list)

    for document_id, text in documents.items():
        seen = set()

        for token in tokenize(text):
            if token in seen:
                continue

            index[token].append(document_id)
            seen.add(token)

    return dict(index)

Use it like this:

documents = {
    "D1": "graph shortest path algorithms",
    "D2": "dynamic programming cookbook",
    "D3": "dijkstra shortest path graph",
    "D4": "sorting and searching algorithms",
}

index = build_inverted_index(documents)

print(index["graph"])
print(index["shortest"])

Output:

['D1', 'D3']
['D1', 'D3']

This simple index stores only document membership. A production index also stores term frequency, field information, positions, payloads, and compression metadata.

Retrieving Candidates

For an AND query, intersect posting lists.

def intersect_sorted(left, right):
    result = []
    i = 0
    j = 0

    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1

    return result

For query terms:

def retrieve_and(index, query):
    terms = tokenize(query)

    if not terms:
        return []

    posting_lists = [
        index.get(term, [])
        for term in terms
    ]

    posting_lists.sort(key=len)

    candidates = posting_lists[0]

    for posting_list in posting_lists[1:]:
        candidates = intersect_sorted(candidates, posting_list)

    return candidates

Sorting posting lists by length matters. Intersecting the rarest term first usually reduces the candidate set quickly.

For an OR query, merge posting lists instead.

def retrieve_or(index, query):
    candidates = set()

    for term in tokenize(query):
        candidates.update(index.get(term, []))

    return sorted(candidates)

AND queries are precise. OR queries have higher recall. Many systems use a hybrid strategy: retrieve broadly, then rely on scoring to rank the best results first.

Adding Term Frequency

A term that appears several times in a document may indicate higher relevance.

Instead of storing:

graph -> [D1, D3]

store:

graph -> [(D1, 1), (D3, 1)]

For a document:

D5: graph graph graph shortest path

store:

graph -> [(D5, 3)]

Implementation:

from collections import Counter, defaultdict

def build_frequency_index(documents):
    index = defaultdict(list)

    for document_id, text in documents.items():
        counts = Counter(tokenize(text))

        for term, frequency in counts.items():
            index[term].append((document_id, frequency))

    return dict(index)

This prepares the system for relevance scoring.

Scoring Candidates

A simple scoring function adds term frequencies across query terms.

def score_candidates(index, query):
    scores = defaultdict(int)

    for term in tokenize(query):
        for document_id, frequency in index.get(term, []):
            scores[document_id] += frequency

    return dict(scores)

Example:

documents = {
    "D1": "graph shortest path algorithms",
    "D2": "dynamic programming cookbook",
    "D3": "dijkstra shortest path graph",
    "D5": "graph graph graph shortest path",
}

index = build_frequency_index(documents)
scores = score_candidates(index, "graph shortest path")

print(scores)

Output:

{'D1': 3, 'D3': 3, 'D5': 5}

D5 receives the highest score because graph appears three times. This can be useful, but raw term frequency is easy to abuse. Long documents and repetitive documents can dominate rankings.

Normalizing by Document Length

A long document naturally contains more terms. Without normalization, it may rank higher simply because it has more words.

Store document length:

def compute_lengths(documents):
    return {
        document_id: len(tokenize(text))
        for document_id, text in documents.items()
    }

Normalize scores:

def score_with_length_normalization(index, lengths, query):
    scores = defaultdict(float)

    for term in tokenize(query):
        for document_id, frequency in index.get(term, []):
            scores[document_id] += frequency / lengths[document_id]

    return dict(scores)

This rewards concentration rather than raw repetition.

Using Inverse Document Frequency

Some terms are more informative than others. In a cookbook corpus, words like algorithm, data, and method may appear everywhere. A rarer term like dijkstra carries more signal.

Use inverse document frequency:

import math

def compute_idf(index, document_count):
    idf = {}

    for term, postings in index.items():
        document_frequency = len(postings)
        idf[term] = math.log((document_count + 1) / (document_frequency + 1)) + 1

    return idf

Then combine term frequency and inverse document frequency.

def score_tfidf(index, lengths, idf, query):
    scores = defaultdict(float)

    for term in tokenize(query):
        for document_id, frequency in index.get(term, []):
            normalized_tf = frequency / lengths[document_id]
            scores[document_id] += normalized_tf * idf[term]

    return dict(scores)

This gives high weight to terms that are frequent in a document but rare across the corpus.

Ranking Results

Ranking sorts candidates by score.

def rank(scores, limit=10):
    return sorted(
        scores.items(),
        key=lambda item: item[1],
        reverse=True
    )[:limit]

End-to-end:

documents = {
    "D1": "graph shortest path algorithms",
    "D2": "dynamic programming cookbook",
    "D3": "dijkstra shortest path graph",
    "D4": "sorting and searching algorithms",
    "D5": "graph graph graph shortest path",
}

index = build_frequency_index(documents)
lengths = compute_lengths(documents)
idf = compute_idf(index, len(documents))

scores = score_tfidf(index, lengths, idf, "dijkstra graph shortest path")
results = rank(scores)

print(results)

The document containing dijkstra should receive a strong boost because dijkstra is rare.

Adding Phrase Matching

Users often care about phrase order.

Query:

shortest path

should prefer documents containing the exact phrase over documents where shortest and path occur far apart.

To support this, store positions.

def build_positional_index(documents):
    index = defaultdict(lambda: defaultdict(list))

    for document_id, text in documents.items():
        for position, token in enumerate(tokenize(text)):
            index[token][document_id].append(position)

    return {
        term: dict(postings)
        for term, postings in index.items()
    }

Example:

D1: graph shortest path algorithms

Positions:

Term	Positions
graph	0
shortest	1
path	2
algorithms	3

Phrase detection checks whether positions are adjacent.

def contains_phrase(positional_index, document_id, phrase):
    terms = tokenize(phrase)

    if not terms:
        return False

    position_sets = [
        set(positional_index.get(term, {}).get(document_id, []))
        for term in terms
    ]

    if any(not positions for positions in position_sets):
        return False

    for start in position_sets[0]:
        if all((start + offset) in position_sets[offset]
               for offset in range(1, len(terms))):
            return True

    return False

Use phrase matches as a ranking boost:

def apply_phrase_boost(scores, positional_index, phrase, boost):
    boosted = dict(scores)

    for document_id in scores:
        if contains_phrase(positional_index, document_id, phrase):
            boosted[document_id] += boost

    return boosted

Field-Aware Ranking

Documents usually have fields:

title
description
body
tags

A match in the title should count more than a match buried in the body.

Example weights:

Field	Weight
title	5.0
tags	3.0
description	2.0
body	1.0

Represent each document as structured data:

documents = {
    "D1": {
        "title": "Shortest Path Algorithms",
        "body": "Dijkstra and Bellman-Ford solve graph path problems.",
        "tags": "graph algorithms"
    }
}

Index each field separately or store field metadata in postings.

A simple scorer:

FIELD_WEIGHTS = {
    "title": 5.0,
    "tags": 3.0,
    "body": 1.0,
}

def score_structured_documents(documents, query):
    terms = tokenize(query)
    scores = defaultdict(float)

    for document_id, fields in documents.items():
        for field_name, field_text in fields.items():
            weight = FIELD_WEIGHTS.get(field_name, 1.0)
            counts = Counter(tokenize(field_text))

            for term in terms:
                scores[document_id] += weight * counts.get(term, 0)

    return dict(scores)

This version scans all documents, so it is useful for understanding field weights but not suitable for large corpora. A production implementation pushes field weights into the inverted index.

Query Processing

A query should be normalized before retrieval.

Common query processing steps:

Step	Example
Lowercase	`Graph` -> `graph`
Punctuation cleanup	`path?` -> `path`
Stopword removal	`the graph path` -> `graph path`
Stemming	`searching` -> `search`
Synonym expansion	`shortest route` -> `shortest path`
Spell correction	`dijsktra` -> `dijkstra`

Be careful with aggressive normalization. Removing too much information can harm precision.

For example, stopword removal may break phrase queries:

to be or not to be

In general search systems, stopwords may be useful. In code search, symbols and punctuation may be essential.

Candidate Generation and Re-Ranking

A common production pipeline uses two ranking stages.

First stage: fast candidate generation.

query -> inverted index -> top 1000 candidates

Second stage: expensive re-ranking.

top 1000 -> feature extraction -> ranker -> top 10

The first stage optimizes recall and speed. The second stage optimizes precision.

Re-ranking features may include:

Feature	Meaning
BM25 score	Lexical relevance
Exact phrase match	Query phrase appears directly
Title match	Query terms appear in title
Freshness	Recently updated document
Popularity	Clicks or views
Authority	Links or citations
Personalization	User-specific preference
Embedding similarity	Semantic similarity

This is where classic algorithms meet machine learning and product logic.

Complexity Summary

Let:

N be the number of documents.
V be the vocabulary size.
q be the number of query terms.
P be the total size of relevant posting lists.
K be the number of candidates.

Operation	Complexity
Build inverted index	O(total tokens)
Lookup query terms	O(q) average
Merge posting lists	O(P)
Score candidates	O(Kq) or O(P)
Sort candidates	O(K log K)
Top-k selection	O(K log k)

For large systems, avoid sorting all candidates when only the top results are needed. Use a heap or specialized top-k retrieval algorithm.

Common Bugs

A search pipeline often fails in subtle ways.

Duplicate document IDs in posting lists cause inflated scores.

Unsorted posting lists break intersection algorithms.

Tokenization mismatch between indexing and querying causes missing results.

Raw term frequency over-ranks long or spammy documents.

Over-aggressive stemming merges unrelated words.

Phrase matching fails when punctuation or stopwords are removed inconsistently.

Ranking changes without test queries make relevance regressions hard to detect.

Recipe

Build a search ranking pipeline in layers.

Start with an inverted index. Add term frequencies. Normalize by document length. Add inverse document frequency. Store positions for phrase matching. Add field weights for structured documents. Split retrieval and ranking into separate stages. Cache frequent queries. Measure ranking quality with fixed test queries.

The essential pattern is candidate reduction followed by increasingly precise scoring. Search becomes scalable when cheap algorithms quickly remove irrelevant documents and expensive ranking logic is reserved for a small candidate set.