Information Retrieval

Information retrieval is the task of finding relevant items from a collection in response to a query. The collection may contain web pages, documents, passages, emails, tickets, code files, papers, products, images, or database records. The query may be a few keywords, a natural language question, a document, or an embedding.

In natural language processing systems, information retrieval is used for search, question answering, recommendation, document discovery, retrieval-augmented generation, duplicate detection, and semantic navigation.

A retrieval system learns or defines a scoring function

s(q,d)

where q is the query and d is a candidate document. The system ranks documents by score and returns the top results.

Retrieval as Ranking

Unlike classification, retrieval usually chooses from a large collection. The model does not merely assign one of a fixed set of labels. It must rank many candidates.

Given a query q and a document collection

\mathcal{D} = \{d_1, d_2, \ldots, d_N\},

retrieval returns an ordered list

d_{(1)}, d_{(2)}, \ldots, d_{(k)}

where d_{(1)} should be the most relevant document.

The retrieval objective is not just to find any relevant result. The best results should appear early in the ranking. Users rarely inspect every result, and downstream systems usually consume only the top few passages.

Sparse Retrieval

Sparse retrieval represents text using high-dimensional sparse vectors. Each dimension usually corresponds to a term or token in the vocabulary.

A simple representation is bag-of-words. A document is represented by the terms it contains, ignoring word order.

The classic sparse scoring method is TF-IDF. It combines term frequency and inverse document frequency.

Term frequency measures how often a term appears in a document. Inverse document frequency gives higher weight to rare terms.

\text{idf}(t) = \log \frac{N}{\text{df}(t)}

where N is the number of documents and df(t) is the number of documents containing term t.

A common TF-IDF score is

s(q,d) = \sum_{t \in q \cap d} \text{tf}(t,d) \cdot \text{idf}(t).
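The two formulas above can be implemented directly. A minimal sketch over pre-tokenized documents (build_idf and tfidf_score are illustrative helpers, not library functions):

```python
import math
from collections import Counter

def build_idf(docs):
    # docs: list of token lists. idf(t) = log(N / df(t)).
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts a term at most once
    return {t: math.log(N / df[t]) for t in df}

def tfidf_score(query, doc, idf):
    # Sum tf(t, d) * idf(t) over terms shared by the query and document.
    tf = Counter(doc)
    return sum(tf[t] * idf.get(t, 0.0) for t in set(query) if t in tf)
```

Scoring every document this way and sorting by score yields the ranked list described earlier.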

Sparse retrieval is strong when exact terms matter. It works well for names, identifiers, product codes, error messages, legal citations, scientific terms, and rare phrases.

BM25

BM25 is a widely used sparse retrieval function. It improves simple TF-IDF by adding term-frequency saturation and document-length normalization.

A simplified BM25 score is

s(q,d) = \sum_{t \in q} \text{idf}(t) \cdot \frac{\text{tf}(t,d)\,(k_1+1)}{\text{tf}(t,d) + k_1\left(1 - b + b\,|d|/\text{avgdl}\right)}.

Here k_1 controls term-frequency saturation, b controls length normalization, |d| is document length, and avgdl is average document length.
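A direct implementation of this simplified formula, reusing a precomputed idf table (bm25_score is an illustrative helper; k1 = 1.5 and b = 0.75 are commonly used defaults):

```python
from collections import Counter

def bm25_score(query, doc, idf, avgdl, k1=1.5, b=0.75):
    # Simplified BM25: term-frequency saturation (k1) and
    # document-length normalization (b).
    tf = Counter(doc)
    norm = k1 * (1 - b + b * len(doc) / avgdl)
    score = 0.0
    for t in query:
        if t in tf:
            score += idf.get(t, 0.0) * tf[t] * (k1 + 1) / (tf[t] + norm)
    return score
```

As tf(t, d) grows, the fraction approaches k1 + 1, which is the saturation effect that plain TF-IDF lacks.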

BM25 remains a strong baseline. Many neural retrieval systems are compared against it. In production search systems, BM25 is often used as a first-stage retriever because it is fast, interpretable, and effective.

Dense Retrieval

Dense retrieval represents queries and documents as dense vectors in a learned embedding space.

A query encoder maps the query to a vector:

h_q = f_\theta(q)

and a document encoder maps the document to a vector:

h_d = g_\theta(d).

The score is often a dot product:

s(q,d) = h_q^\top h_d.

If the vectors are normalized, the dot product is equivalent to cosine similarity.
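As a quick numerical check of this equivalence (a self-contained sketch using NumPy):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product over the product of the norms.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

q = np.array([3.0, 4.0])
d = np.array([4.0, 3.0])

# After L2-normalization, the plain dot product equals cosine similarity.
q_hat = q / np.linalg.norm(q)
d_hat = d / np.linalg.norm(d)
assert abs(float(q_hat @ d_hat) - cosine(q, d)) < 1e-9
```

This is why many embedding models normalize their outputs: the index can then use the cheaper dot product.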

Dense retrieval can find semantically related text even when the exact words differ. For example, a dense retriever may match:

query: how to reset my password
document: account recovery instructions

Sparse retrieval may miss this match if there is little lexical overlap. Dense retrieval is therefore useful for semantic search, natural language questions, paraphrases, and cross-lingual retrieval.

Bi-Encoders

Most dense retrieval systems use a bi-encoder. The query and document are encoded separately.

h_q = f_\theta(q), \quad h_d = g_\theta(d).

This design is efficient because document embeddings can be precomputed and stored in a vector index.

At query time, the system computes only the query embedding and searches for nearest document embeddings.

query_vec = query_encoder(query)      # [D]
doc_vecs = index.search(query_vec)    # top-k nearest vectors

The limitation is that query-document interaction is compressed into one vector per side. This can miss fine-grained relevance signals.

Cross-Encoders

A cross-encoder reads the query and document together.

[CLS] query [SEP] document [SEP]

The model outputs a relevance score:

s(q,d) = f_\theta(q,d).

Because the query and document attend to each other token by token, cross-encoders are often more accurate than bi-encoders. However, they are much more expensive. A cross-encoder must run once for every query-document pair.

For this reason, cross-encoders are commonly used as rerankers. A fast retriever first selects a few hundred candidates. The cross-encoder reranks those candidates.

Two-Stage Retrieval

A practical neural search system often uses multiple stages.

The first stage emphasizes recall. It retrieves candidates quickly from a large collection.

The second stage emphasizes precision. It reranks a smaller candidate set using a more expensive model.

A common pipeline is:

  1. Use BM25 or dense retrieval to fetch top 100 to 1000 candidates.
  2. Use a cross-encoder reranker to score the candidates.
  3. Return the top 10 to 50 results.
  4. Optionally pass top passages to a QA or generation model.

This architecture balances speed and quality. The first stage makes search scalable. The second stage improves ranking quality.
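The pipeline above can be sketched as a single function. Here first_stage and reranker are hypothetical callables standing in for, say, BM25 search and a cross-encoder:

```python
def two_stage_search(query, first_stage, reranker, k_first=200, k_final=10):
    # first_stage: callable returning (doc, score) pairs, recall-oriented.
    # reranker: callable scoring a (query, doc) pair, precision-oriented.
    candidates = first_stage(query, k=k_first)
    reranked = sorted(candidates,
                      key=lambda pair: reranker(query, pair[0]),
                      reverse=True)
    return [doc for doc, _ in reranked[:k_final]]
```

The first-stage scores are discarded here; some systems instead combine them with the reranker scores.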

Hybrid Retrieval

Hybrid retrieval combines sparse and dense methods.

Sparse retrieval is good at exact matching. Dense retrieval is good at semantic matching. Combining them often improves robustness.

A simple hybrid score is

s(q,d) = \alpha\, s_{\text{sparse}}(q,d) + (1-\alpha)\, s_{\text{dense}}(q,d),

where α controls the balance.

Another common method is rank fusion. Reciprocal rank fusion combines ranked lists without requiring comparable score scales:

\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}.

Here R is the set of ranking systems and k is a smoothing constant.

Hybrid retrieval is often the safest default for technical domains. Exact terms matter, but users also ask semantic questions.
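Reciprocal rank fusion is simple to implement. A minimal sketch, with k = 60 as a commonly used default:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: list of ranked doc-id lists, one per retrieval system.
    # RRF(d) = sum over systems of 1 / (k + rank of d), ranks starting at 1.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, it needs no score calibration between the sparse and dense systems, which is its main practical advantage over the weighted sum.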

Contrastive Training

Dense retrievers are commonly trained with contrastive objectives. The model receives positive query-document pairs and negative examples.

For a query q_i, a positive document d_i^+, and negative documents d_{i1}^-, …, d_{in}^-, the loss encourages the positive document to score higher than the negatives.

A common softmax contrastive loss is

\mathcal{L}_i = -\log \frac{\exp(s(q_i, d_i^+)/\tau)}{\exp(s(q_i, d_i^+)/\tau) + \sum_j \exp(s(q_i, d_{ij}^-)/\tau)}.

The temperature τ controls the sharpness of the distribution.

In-batch negatives are widely used. Other examples in the same batch serve as negatives, which makes training efficient.

import torch
import torch.nn.functional as F

def contrastive_retrieval_loss(query_emb, doc_emb, temperature=0.05):
    # query_emb: [B, D]
    # doc_emb: [B, D]
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)

    scores = query_emb @ doc_emb.T          # [B, B]
    scores = scores / temperature

    labels = torch.arange(scores.size(0), device=scores.device)

    loss = F.cross_entropy(scores, labels)
    return loss

This assumes that query ii matches document ii, and all other documents in the batch are negatives.

Negative Sampling

The quality of negative examples strongly affects retrieval training.

Negative type                      Description
Random negatives                   Documents sampled randomly from the corpus
In-batch negatives                 Other documents in the same training batch
Hard negatives                     Irrelevant documents that look relevant
Mined negatives                    Negatives found by an earlier retriever
Cross-encoder-filtered negatives   Candidate negatives checked by a stronger model

Random negatives are often too easy. The model learns little from them because they are obviously unrelated. Hard negatives are more useful because they force the model to learn fine distinctions.

Example:

Query: symptoms of vitamin B12 deficiency
Positive: article about B12 deficiency symptoms
Hard negative: article about vitamin D deficiency symptoms

The hard negative shares many terms and topic structure, but it does not answer the query.

Passage Retrieval

Long documents are often split into passages before indexing. A passage may be a paragraph, a fixed-size token window, or a semantic section.

Passage retrieval has several advantages:

Advantage                    Reason
Better relevance             A short passage can match a query more precisely
Better generation grounding  RAG models receive focused context
Better vector search         Embeddings represent one local topic
Lower context cost           Fewer irrelevant tokens are passed downstream

The tradeoff is that the system must map passages back to documents. It must also avoid returning many near-duplicate passages from the same document.

A common structure is:

{
  "doc_id": "doc_123",
  "passage_id": "doc_123_004",
  "text": "The relevant paragraph...",
  "title": "Document title",
  "offset_start": 1830,
  "offset_end": 2410
}

Offsets allow the system to display snippets, citations, and highlights.
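A minimal chunker along these lines splits on whitespace into overlapping word windows while recording character offsets (split_into_passages is an illustrative helper; window and stride are in words):

```python
import re

def split_into_passages(doc_id, text, window=100, stride=80):
    # Sliding word windows; offsets are character positions in the
    # original text, so snippets and highlights can be reconstructed.
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    passages = []
    i, n = 0, 0
    while i < len(spans):
        chunk = spans[i:i + window]
        start, end = chunk[0][0], chunk[-1][1]
        passages.append({
            "doc_id": doc_id,
            "passage_id": f"{doc_id}_{n:03d}",
            "text": text[start:end],
            "offset_start": start,
            "offset_end": end,
        })
        if i + window >= len(spans):
            break  # last window reached the end of the document
        i += stride
        n += 1
    return passages
```

With stride smaller than window, consecutive passages overlap, which reduces the risk of splitting evidence across chunk boundaries.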

Vector Indexes

Dense retrieval requires nearest neighbor search over many vectors. Exact search compares the query vector with every document vector. This is expensive for large collections.

Approximate nearest neighbor, or ANN, indexes accelerate vector search.

Common ANN ideas include clustering, graph search, product quantization, and hierarchical navigable small-world graphs.

At a high level, an index stores document embeddings:

H = \{h_{d_1}, h_{d_2}, \ldots, h_{d_N}\}.

At query time, it returns approximate top matches:

\operatorname{ANN}(h_q, H) \rightarrow \{d_{(1)}, \ldots, d_{(k)}\}.

ANN search trades some accuracy for speed and memory efficiency. For many systems, this tradeoff is necessary.
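Exact search is the baseline that ANN indexes approximate. A brute-force sketch with NumPy:

```python
import numpy as np

def exact_top_k(query_vec, doc_vecs, k=5):
    # Exact nearest neighbor search by dot product: score every
    # document vector, then take the k highest scores. ANN indexes
    # (clustering, graphs, quantization) avoid this full scan.
    scores = doc_vecs @ query_vec        # [N]
    top = np.argsort(-scores)[:k]        # indices of the k best scores
    return top, scores[top]
```

This runs in O(N·D) per query, which is why collections beyond a few million vectors usually need an approximate index.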

Evaluation Metrics

Retrieval systems are evaluated by ranked-list metrics.

Metric        Meaning
Recall@k      Fraction of queries with a relevant item in the top k
Precision@k   Fraction of top k results that are relevant
MRR           Mean reciprocal rank of the first relevant result
MAP           Mean average precision across queries
nDCG          Ranking quality with graded relevance
Hit rate      Whether any relevant item appears in top k

Recall@k is important when retrieval feeds a downstream reader or generator. If the correct evidence is absent from top kk, the downstream model cannot answer correctly.

MRR is useful when the first correct result matters. nDCG is useful when relevance has grades, such as perfect, partial, and irrelevant.
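Both metrics are easy to compute from ranked lists and relevance judgments. A minimal sketch, where ranked is a list of ranked doc-id lists and relevant a parallel list of relevant-id sets:

```python
def recall_at_k(ranked, relevant, k):
    # Fraction of queries with at least one relevant item in the top k.
    hits = sum(1 for r, rel in zip(ranked, relevant) if set(r[:k]) & rel)
    return hits / len(ranked)

def mrr(ranked, relevant):
    # Mean reciprocal rank of the first relevant result (0 if none found).
    total = 0.0
    for r, rel in zip(ranked, relevant):
        for i, doc in enumerate(r, start=1):
            if doc in rel:
                total += 1.0 / i
                break
    return total / len(ranked)
```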

Retrieval for RAG

Retrieval-augmented generation, or RAG, uses retrieved passages as context for a language model.

The retrieval system affects the final answer. Poor retrieval causes unsupported or incomplete generation. Good retrieval reduces hallucination by giving the model relevant evidence.

A typical RAG prompt contains:

Question:
...

Relevant passages:
[1] ...
[2] ...
[3] ...

Answer using only the passages above.

The retriever should optimize for answer support, not just keyword relevance. A passage can share words with the question but fail to contain the answer.

For RAG, useful retrieval properties include:

Property              Why it matters
High recall           Evidence must reach the generator
Low redundancy        Context budget should not be wasted
Source metadata       Citations require document IDs and offsets
Freshness             Answers may depend on current data
Permission filtering  Results must respect access control
Latency               Retrieval is on the user path

Failure Modes

Information retrieval systems fail in several common ways.

Failure mode            Description
Vocabulary mismatch     Relevant document uses different words
Semantic drift          Dense retriever returns topically similar but wrong documents
Exact-match failure     Dense retriever misses rare identifiers
Long-document dilution  Document embedding averages over too many topics
Stale index             New or updated documents absent from search
Duplicate flooding      Results dominated by near-duplicate passages
Poor chunking           Relevant evidence split across chunks
Metadata loss           Results lack source, date, or offsets
Permission leak         User sees documents they should not access

Hybrid search, good chunking, reranking, metadata filters, and evaluation sets reduce these failures.

Summary

Information retrieval ranks documents or passages for a query. Sparse methods such as TF-IDF and BM25 rely on lexical overlap. Dense methods use learned embeddings to retrieve semantically related text. Cross-encoders improve precision by scoring query-document pairs jointly, but they are too expensive for first-stage search over large collections.

Practical systems often use hybrid retrieval, passage indexing, ANN vector search, and reranking. For question answering and RAG, retrieval quality is a core model quality factor. If the evidence is not retrieved, the downstream model cannot reliably produce the right answer.