Information Retrieval

Information retrieval is the task of finding relevant items from a collection in response to a query. The collection may contain web pages, documents, passages, emails, tickets, code files, papers, products, images, or database records. The query may be a few keywords, a natural language question, a document, or an embedding.

In natural language processing systems, information retrieval is used for search, question answering, recommendation, document discovery, retrieval-augmented generation, duplicate detection, and semantic navigation.

A retrieval system learns or defines a scoring function

s(q,d)

where q is the query and d is a candidate document. The system ranks documents by score and returns the top results.

Retrieval as Ranking

Unlike classification, retrieval usually chooses from a large collection. The model does not merely assign one of a fixed set of labels. It must rank many candidates.

Given a query q and a document collection

\mathcal{D} = \{d_1, d_2, \ldots, d_N\},

retrieval returns an ordered list

d_{(1)}, d_{(2)}, \ldots, d_{(k)}

where d_{(1)} should be the most relevant document.

The retrieval objective is not just to find any relevant result. The best results should appear early in the ranking. Users rarely inspect every result, and downstream systems usually consume only the top few passages.

Sparse Retrieval

Sparse retrieval represents text using high-dimensional sparse vectors. Each dimension usually corresponds to a term or token in the vocabulary.

A simple representation is bag-of-words. A document is represented by the terms it contains, ignoring word order.

The classic sparse scoring method is TF-IDF. It combines term frequency and inverse document frequency.

Term frequency measures how often a term appears in a document. Inverse document frequency gives higher weight to rare terms.

\text{idf}(t) = \log \frac{N}{\text{df}(t)}

where N is the number of documents and df(t) is the number of documents containing term t.

A common TF-IDF score is

s(q,d) = \sum_{t \in q \cap d} \text{tf}(t,d) \cdot \text{idf}(t).
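The two formulas above can be implemented directly. A minimal sketch over pre-tokenized documents (build_idf and tfidf_score are illustrative helpers, not library functions):

```python
import math
from collections import Counter

def build_idf(docs):
    # docs: list of token lists. idf(t) = log(N / df(t)).
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts a term at most once
    return {t: math.log(N / df[t]) for t in df}

def tfidf_score(query, doc, idf):
    # Sum tf(t, d) * idf(t) over terms shared by the query and document.
    tf = Counter(doc)
    return sum(tf[t] * idf.get(t, 0.0) for t in set(query) if t in tf)
```

Scoring every document this way and sorting by score yields the ranked list described earlier.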

Sparse retrieval is strong when exact terms matter. It works well for names, identifiers, product codes, error messages, legal citations, scientific terms, and rare phrases.

BM25

BM25 is a widely used sparse retrieval function. It improves simple TF-IDF by adding term-frequency saturation and document-length normalization.

A simplified BM25 score is

s(q,d) = \sum_{t \in q} \text{idf}(t) \cdot \frac{\text{tf}(t,d)\,(k_1+1)}{\text{tf}(t,d) + k_1\left(1 - b + b\,|d|/\text{avgdl}\right)}.

Here k_1 controls term-frequency saturation, b controls length normalization, |d| is document length, and avgdl is average document length.
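A direct implementation of this simplified formula, reusing a precomputed idf table (bm25_score is an illustrative helper; k1 = 1.5 and b = 0.75 are commonly used defaults):

```python
from collections import Counter

def bm25_score(query, doc, idf, avgdl, k1=1.5, b=0.75):
    # Simplified BM25: term-frequency saturation (k1) and
    # document-length normalization (b).
    tf = Counter(doc)
    norm = k1 * (1 - b + b * len(doc) / avgdl)
    score = 0.0
    for t in query:
        if t in tf:
            score += idf.get(t, 0.0) * tf[t] * (k1 + 1) / (tf[t] + norm)
    return score
```

As tf(t, d) grows, the fraction approaches k1 + 1, which is the saturation effect that plain TF-IDF lacks.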

BM25 remains a strong baseline. Many neural retrieval systems are compared against it. In production search systems, BM25 is often used as a first-stage retriever because it is fast, interpretable, and effective.

Dense Retrieval

Dense retrieval represents queries and documents as dense vectors in a learned embedding space.

A query encoder maps the query to a vector:

h_q = f_\theta(q)

and a document encoder maps the document to a vector:

h_d = g_\theta(d).

The score is often a dot product:

s(q,d) = h_q^\top h_d.

If the vectors are normalized, the dot product is equivalent to cosine similarity.
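As a quick numerical check of this equivalence (a self-contained sketch using NumPy):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product over the product of the norms.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

q = np.array([3.0, 4.0])
d = np.array([4.0, 3.0])

# After L2-normalization, the plain dot product equals cosine similarity.
q_hat = q / np.linalg.norm(q)
d_hat = d / np.linalg.norm(d)
assert abs(float(q_hat @ d_hat) - cosine(q, d)) < 1e-9
```

This is why many embedding models normalize their outputs: the index can then use the cheaper dot product.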

Dense retrieval can find semantically related text even when the exact words differ. For example, a dense retriever may match:

query: how to reset my password
document: account recovery instructions

Sparse retrieval may miss this match if there is little lexical overlap. Dense retrieval is therefore useful for semantic search, natural language questions, paraphrases, and cross-lingual retrieval.

Bi-Encoders

Most dense retrieval systems use a bi-encoder. The query and document are encoded separately.

h_q = f_\theta(q), \quad h_d = g_\theta(d).

This design is efficient because document embeddings can be precomputed and stored in a vector index.

At query time, the system computes only the query embedding and searches for nearest document embeddings.

query_vec = query_encoder(query)      # [D]
doc_vecs = index.search(query_vec)    # top-k nearest vectors

The limitation is that query-document interaction is compressed into one vector per side. This can miss fine-grained relevance signals.

Cross-Encoders

A cross-encoder reads the query and document together.

[CLS] query [SEP] document [SEP]

The model outputs a relevance score:

s(q,d) = f_\theta(q,d).

Because the query and document attend to each other token by token, cross-encoders are often more accurate than bi-encoders. However, they are much more expensive. A cross-encoder must run once for every query-document pair.

For this reason, cross-encoders are commonly used as rerankers. A fast retriever first selects a few hundred candidates. The cross-encoder reranks those candidates.

Two-Stage Retrieval

A practical neural search system often uses multiple stages.

The first stage emphasizes recall. It retrieves candidates quickly from a large collection.

The second stage emphasizes precision. It reranks a smaller candidate set using a more expensive model.

A common pipeline is:

  1. Use BM25 or dense retrieval to fetch top 100 to 1000 candidates.
  2. Use a cross-encoder reranker to score the candidates.
  3. Return the top 10 to 50 results.
  4. Optionally pass top passages to a QA or generation model.

This architecture balances speed and quality. The first stage makes search scalable. The second stage improves ranking quality.
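The pipeline above can be sketched as a single function. Here first_stage and reranker are hypothetical callables standing in for, say, BM25 search and a cross-encoder:

```python
def two_stage_search(query, first_stage, reranker, k_first=200, k_final=10):
    # first_stage: callable returning (doc, score) pairs, recall-oriented.
    # reranker: callable scoring a (query, doc) pair, precision-oriented.
    candidates = first_stage(query, k=k_first)
    reranked = sorted(candidates,
                      key=lambda pair: reranker(query, pair[0]),
                      reverse=True)
    return [doc for doc, _ in reranked[:k_final]]
```

The first-stage scores are discarded here; some systems instead combine them with the reranker scores.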

Hybrid Retrieval

Hybrid retrieval combines sparse and dense methods.

Sparse retrieval is good at exact matching. Dense retrieval is good at semantic matching. Combining them often improves robustness.

A simple hybrid score is

s(q,d) = \alpha\, s_{\text{sparse}}(q,d) + (1-\alpha)\, s_{\text{dense}}(q,d),

where α controls the balance.

Another common method is rank fusion. Reciprocal rank fusion combines ranked lists without requiring comparable score scales:

\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}.

Here R is the set of ranking systems and k is a smoothing constant.

Hybrid retrieval is often the safest default for technical domains. Exact terms matter, but users also ask semantic questions.
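Reciprocal rank fusion is simple to implement. A minimal sketch, with k = 60 as a commonly used default:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: list of ranked doc-id lists, one per retrieval system.
    # RRF(d) = sum over systems of 1 / (k + rank of d), ranks starting at 1.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, it needs no score calibration between the sparse and dense systems, which is its main practical advantage over the weighted sum.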

Contrastive Training

Dense retrievers are commonly trained with contrastive objectives. The model receives positive query-document pairs and negative examples.

For a query q_i, a positive document d_i^+, and negative documents d_{i1}^-, …, d_{in}^-, the loss encourages the positive document to score higher than the negatives.

A common softmax contrastive loss is

\mathcal{L}_i = -\log \frac{\exp(s(q_i, d_i^+)/\tau)}{\exp(s(q_i, d_i^+)/\tau) + \sum_j \exp(s(q_i, d_{ij}^-)/\tau)}.

The temperature τ controls the sharpness of the distribution.

In-batch negatives are widely used. Other examples in the same batch serve as negatives, which makes training efficient.

import torch
import torch.nn.functional as F

def contrastive_retrieval_loss(query_emb, doc_emb, temperature=0.05):
    # query_emb: [B, D]
    # doc_emb: [B, D]
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)

    scores = query_emb @ doc_emb.T          # [B, B]
    scores = scores / temperature

    labels = torch.arange(scores.size(0), device=scores.device)

    loss = F.cross_entropy(scores, labels)
    return loss

This assumes that query ii matches document ii, and all other documents in the batch are negatives.

Negative Sampling

The quality of negative examples strongly affects retrieval training.

Negative type                      Description
Random negatives                   Documents sampled randomly from the corpus
In-batch negatives                 Other documents in the same training batch
Hard negatives                     Irrelevant documents that look relevant
Mined negatives                    Negatives found by an earlier retriever
Cross-encoder-filtered negatives   Candidate negatives checked by a stronger model

Random negatives are often too easy. The model learns little from them because they are obviously unrelated. Hard negatives are more useful because they force the model to learn fine distinctions.

Example:

Query: symptoms of vitamin B12 deficiency
Positive: article about B12 deficiency symptoms
Hard negative: article about vitamin D deficiency symptoms

The hard negative shares many terms and topic structure, but it does not answer the query.

Passage Retrieval

Long documents are often split into passages before indexing. A passage may be a paragraph, a fixed-size token window, or a semantic section.

Passage retrieval has several advantages:

Advantage                    Reason
Better relevance             A short passage can match a query more precisely
Better generation grounding  RAG models receive focused context
Better vector search         Embeddings represent one local topic
Lower context cost           Fewer irrelevant tokens are passed downstream

The tradeoff is that the system must map passages back to documents. It must also avoid returning many near-duplicate passages from the same document.

A common structure is:

{
  "doc_id": "doc_123",
  "passage_id": "doc_123_004",
  "text": "The relevant paragraph...",
  "title": "Document title",
  "offset_start": 1830,
  "offset_end": 2410
}

Offsets allow the system to display snippets, citations, and highlights.
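A minimal chunker along these lines splits on whitespace into overlapping word windows while recording character offsets (split_into_passages is an illustrative helper; window and stride are in words):

```python
import re

def split_into_passages(doc_id, text, window=100, stride=80):
    # Sliding word windows; offsets are character positions in the
    # original text, so snippets and highlights can be reconstructed.
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    passages = []
    i, n = 0, 0
    while i < len(spans):
        chunk = spans[i:i + window]
        start, end = chunk[0][0], chunk[-1][1]
        passages.append({
            "doc_id": doc_id,
            "passage_id": f"{doc_id}_{n:03d}",
            "text": text[start:end],
            "offset_start": start,
            "offset_end": end,
        })
        if i + window >= len(spans):
            break  # last window reached the end of the document
        i += stride
        n += 1
    return passages
```

With stride smaller than window, consecutive passages overlap, which reduces the risk of splitting evidence across chunk boundaries.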

Vector Indexes

Dense retrieval requires nearest neighbor search over many vectors. Exact search compares the query vector with every document vector. This is expensive for large collections.

Approximate nearest neighbor, or ANN, indexes accelerate vector search.

Common ANN ideas include clustering, graph search, product quantization, and hierarchical navigable small-world graphs.

At a high level, an index stores document embeddings:

H = \{h_{d_1}, h_{d_2}, \ldots, h_{d_N}\}.

At query time, it returns approximate top matches:

\operatorname{ANN}(h_q, H) \rightarrow \{d_{(1)}, \ldots, d_{(k)}\}.

ANN search trades some accuracy for speed and memory efficiency. For many systems, this tradeoff is necessary.
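Exact search is the baseline that ANN indexes approximate. A brute-force sketch with NumPy:

```python
import numpy as np

def exact_top_k(query_vec, doc_vecs, k=5):
    # Exact nearest neighbor search by dot product: score every
    # document vector, then take the k highest scores. ANN indexes
    # (clustering, graphs, quantization) avoid this full scan.
    scores = doc_vecs @ query_vec        # [N]
    top = np.argsort(-scores)[:k]        # indices of the k best scores
    return top, scores[top]
```

This runs in O(N·D) per query, which is why collections beyond a few million vectors usually need an approximate index.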

Evaluation Metrics

Retrieval systems are evaluated by ranked-list metrics.

Metric        Meaning
Recall@k      Fraction of queries with a relevant item in the top k
Precision@k   Fraction of top k results that are relevant
MRR           Mean reciprocal rank of the first relevant result
MAP           Mean average precision across queries
nDCG          Ranking quality with graded relevance
Hit rate      Whether any relevant item appears in top k

Recall@k is important when retrieval feeds a downstream reader or generator. If the correct evidence is absent from top kk, the downstream model cannot answer correctly.

MRR is useful when the first correct result matters. nDCG is useful when relevance has grades, such as perfect, partial, and irrelevant.
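Both metrics are easy to compute from ranked lists and relevance judgments. A minimal sketch, where ranked is a list of ranked doc-id lists and relevant a parallel list of relevant-id sets:

```python
def recall_at_k(ranked, relevant, k):
    # Fraction of queries with at least one relevant item in the top k.
    hits = sum(1 for r, rel in zip(ranked, relevant) if set(r[:k]) & rel)
    return hits / len(ranked)

def mrr(ranked, relevant):
    # Mean reciprocal rank of the first relevant result (0 if none found).
    total = 0.0
    for r, rel in zip(ranked, relevant):
        for i, doc in enumerate(r, start=1):
            if doc in rel:
                total += 1.0 / i
                break
    return total / len(ranked)
```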

Retrieval for RAG

Retrieval-augmented generation, or RAG, uses retrieved passages as context for a language model.

The retrieval system affects the final answer. Poor retrieval causes unsupported or incomplete generation. Good retrieval reduces hallucination by giving the model relevant evidence.

A typical RAG prompt contains:

Question:
...

Relevant passages:
[1] ...
[2] ...
[3] ...

Answer using only the passages above.

The retriever should optimize for answer support, not just keyword relevance. A passage can share words with the question but fail to contain the answer.

For RAG, useful retrieval properties include:

Property              Why it matters
High recall           Evidence must reach the generator
Low redundancy        Context budget should not be wasted
Source metadata       Citations require document IDs and offsets
Freshness             Answers may depend on current data
Permission filtering  Results must respect access control
Latency               Retrieval is on the user path

Failure Modes

Information retrieval systems fail in several common ways.

Failure mode            Description
Vocabulary mismatch     Relevant document uses different words
Semantic drift          Dense retriever returns topically similar but wrong documents
Exact-match failure     Dense retriever misses rare identifiers
Long-document dilution  Document embedding averages over too many topics
Stale index             New or updated documents absent from search
Duplicate flooding      Results dominated by near-duplicate passages
Poor chunking           Relevant evidence split across chunks
Metadata loss           Results lack source, date, or offsets
Permission leak         User sees documents they should not access

Hybrid search, good chunking, reranking, metadata filters, and evaluation sets reduce these failures.

Summary

Information retrieval ranks documents or passages for a query. Sparse methods such as TF-IDF and BM25 rely on lexical overlap. Dense methods use learned embeddings to retrieve semantically related text. Cross-encoders improve precision by scoring query-document pairs jointly, but they are too expensive for first-stage search over large collections.

Practical systems often use hybrid retrieval, passage indexing, ANN vector search, and reranking. For question answering and RAG, retrieval quality is a core model quality factor. If the evidence is not retrieved, the downstream model cannot reliably produce the right answer.