Information retrieval is the task of finding relevant items from a collection in response to a query. The collection may contain web pages, documents, passages, emails, tickets, code files, papers, products, images, or database records. The query may be a few keywords, a natural language question, a document, or an embedding.
In natural language processing systems, information retrieval is used for search, question answering, recommendation, document discovery, retrieval-augmented generation, duplicate detection, and semantic navigation.
A retrieval system learns or defines a scoring function

$$s(q, d),$$

where $q$ is the query and $d$ is a candidate document. The system ranks documents by score and returns the top results.
Retrieval as Ranking
Unlike classification, retrieval usually chooses from a large collection. The model does not merely assign one of a fixed set of labels. It must rank many candidates.
Given a query $q$ and a document collection

$$D = \{d_1, d_2, \dots, d_N\},$$

retrieval returns an ordered list

$$(d_{(1)}, d_{(2)}, \dots, d_{(k)}),$$

where $d_{(1)}$ should be the most relevant document.
The retrieval objective is not just to find any relevant result. The best results should appear early in the ranking. Users rarely inspect every result, and downstream systems usually consume only the top few passages.
Sparse Retrieval
Sparse retrieval represents text using high-dimensional sparse vectors. Each dimension usually corresponds to a term or token in the vocabulary.
A simple representation is bag-of-words. A document is represented by the terms it contains, ignoring word order.
The classic sparse scoring method is TF-IDF. It combines term frequency and inverse document frequency.
Term frequency $\mathrm{tf}(t, d)$ measures how often term $t$ appears in document $d$. Inverse document frequency gives higher weight to rare terms:

$$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)},$$

where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing term $t$.

A common TF-IDF score is

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t).$$
Sparse retrieval is strong when exact terms matter. It works well for names, identifiers, product codes, error messages, legal citations, scientific terms, and rare phrases.
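The TF-IDF scoring described above can be sketched in a few lines. This is a minimal illustration, not a production implementation; it assumes documents are already tokenized into term lists, and the function name `tf_idf_scores` is ours.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """Score each document against a query with a basic TF-IDF sum.

    docs: list of tokenized documents (each a list of terms).
    Returns one score per document.
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term absent from the collection
            idf = math.log(n_docs / df[t])
            score += tf[t] * idf
        scores.append(score)
    return scores
```

Documents sharing rare query terms (such as an error code) score higher than documents sharing only common ones, which is exactly the behavior that makes sparse retrieval strong on identifiers.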
BM25
BM25 is a widely used sparse retrieval function. It improves simple TF-IDF by adding term-frequency saturation and document-length normalization.
A simplified BM25 score is

$$\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t, d)\,(k_1 + 1)}{\mathrm{tf}(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}.$$

Here $k_1$ controls term-frequency saturation, $b$ controls length normalization, $|d|$ is document length, and $\mathrm{avgdl}$ is average document length.
BM25 remains a strong baseline. Many neural retrieval systems are compared against it. In production search systems, BM25 is often used as a first-stage retriever because it is fast, interpretable, and effective.
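The simplified BM25 formula can be sketched directly from its definition. This is an illustrative sketch, assuming tokenized documents and the common defaults $k_1 = 1.5$, $b = 0.75$; real systems use inverted indexes rather than scoring every document.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, docs, k1=1.5, b=0.75):
    """Simplified BM25 score of one tokenized document for a query.

    docs: the full tokenized collection, used for idf and average length.
    """
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = Counter()
    for d in docs:
        df.update(set(d))
    tf = Counter(doc)
    score = 0.0
    for t in query_terms:
        if df[t] == 0:
            continue
        idf = math.log(n_docs / df[t])
        # Saturating term frequency with length normalization.
        denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score
```

Note how repeated occurrences of a query term raise the score, but with diminishing returns controlled by $k_1$.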
Dense Retrieval
Dense retrieval represents queries and documents as dense vectors in a learned embedding space.
A query encoder maps the query to a vector:

$$\mathbf{q} = E_Q(q),$$

and a document encoder maps the document to a vector:

$$\mathbf{d} = E_D(d).$$

The score is often a dot product:

$$s(q, d) = \mathbf{q} \cdot \mathbf{d}.$$

If the vectors are normalized, the dot product is equivalent to cosine similarity.
Dense retrieval can find semantically related text even when the exact words differ. For example, a dense retriever may match:
```
query:    how to reset my password
document: account recovery instructions
```

Sparse retrieval may miss this match if there is little lexical overlap. Dense retrieval is therefore useful for semantic search, natural language questions, paraphrases, and cross-lingual retrieval.
Bi-Encoders
Most dense retrieval systems use a bi-encoder. The query and document are encoded separately.
This design is efficient because document embeddings can be precomputed and stored in a vector index.
At query time, the system computes only the query embedding and searches for nearest document embeddings.
```python
query_vec = query_encoder(query)    # [D]
doc_vecs = index.search(query_vec)  # top-k nearest vectors
```

The limitation is that query-document interaction is compressed into one vector per side. This can miss fine-grained relevance signals.
Cross-Encoders
A cross-encoder reads the query and document together.
```
[CLS] query [SEP] document [SEP]
```

The model outputs a relevance score:

$$s(q, d) = f_\theta(q, d),$$

where $f_\theta$ is the cross-encoder.
Because the query and document attend to each other token by token, cross-encoders are often more accurate than bi-encoders. However, they are much more expensive. A cross-encoder must run once for every query-document pair.
For this reason, cross-encoders are commonly used as rerankers. A fast retriever first selects a few hundred candidates. The cross-encoder reranks those candidates.
Two-Stage Retrieval
A practical neural search system often uses multiple stages.
The first stage emphasizes recall. It retrieves candidates quickly from a large collection.
The second stage emphasizes precision. It reranks a smaller candidate set using a more expensive model.
A common pipeline is:
- Use BM25 or dense retrieval to fetch top 100 to 1000 candidates.
- Use a cross-encoder reranker to score the candidates.
- Return the top 10 to 50 results.
- Optionally pass top passages to a QA or generation model.
This architecture balances speed and quality. The first stage makes search scalable. The second stage improves ranking quality.
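The two-stage pattern can be sketched with the stages passed in as callables. The function names `retrieve` and `rerank_score` are placeholders for whatever first-stage retriever and reranker a system actually uses; this is a structural sketch, not a specific library API.

```python
def two_stage_search(query, collection, retrieve, rerank_score, k1=100, k2=10):
    """Two-stage retrieval: fast candidate fetch, then expensive reranking.

    retrieve(query, collection, k) -> list of candidate documents.
    rerank_score(query, doc) -> relevance score from a stronger model.
    """
    candidates = retrieve(query, collection, k1)       # stage 1: recall
    scored = [(rerank_score(query, d), d) for d in candidates]
    scored.sort(key=lambda x: x[0], reverse=True)      # stage 2: precision
    return [d for _, d in scored[:k2]]
```

The expensive scorer runs only `k1` times per query, regardless of collection size, which is what makes the pattern scale.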
Hybrid Retrieval
Hybrid retrieval combines sparse and dense methods.
Sparse retrieval is good at exact matching. Dense retrieval is good at semantic matching. Combining them often improves robustness.
A simple hybrid score is

$$s_{\mathrm{hybrid}}(q, d) = \lambda \, s_{\mathrm{sparse}}(q, d) + (1 - \lambda) \, s_{\mathrm{dense}}(q, d),$$

where $\lambda$ controls the balance.
Another common method is rank fusion. Reciprocal rank fusion combines ranked lists without requiring comparable score scales:

$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}.$$

Here $R$ is the set of ranking systems and $k$ is a smoothing constant.
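Reciprocal rank fusion is short enough to sketch in full. This version assumes each input is a list of document ids ordered best-first; $k = 60$ is a commonly used default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    rankings: list of ranked lists, each ordered best-first.
    Returns doc ids ordered by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are used, a BM25 list and a dense-retrieval list can be fused even though their raw scores live on different scales.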
Hybrid retrieval is often the safest default for technical domains. Exact terms matter, but users also ask semantic questions.
Contrastive Training
Dense retrievers are commonly trained with contrastive objectives. The model receives positive query-document pairs and negative examples.
For a query $q$, a positive document $d^+$, and negative documents $d_1^-, \dots, d_n^-$, the loss encourages the positive document to score higher than the negatives.

A common softmax contrastive loss is

$$\mathcal{L} = -\log \frac{\exp(s(q, d^+)/\tau)}{\exp(s(q, d^+)/\tau) + \sum_{i=1}^{n} \exp(s(q, d_i^-)/\tau)}.$$

The temperature $\tau$ controls the sharpness of the distribution.
In-batch negatives are widely used. Other examples in the same batch serve as negatives, which makes training efficient.
```python
import torch
import torch.nn.functional as F

def contrastive_retrieval_loss(query_emb, doc_emb, temperature=0.05):
    # query_emb: [B, D]
    # doc_emb: [B, D]
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    scores = query_emb @ doc_emb.T  # [B, B]
    scores = scores / temperature
    labels = torch.arange(scores.size(0), device=scores.device)
    loss = F.cross_entropy(scores, labels)
    return loss
```

This assumes that query $i$ matches document $i$, and all other documents in the batch are negatives.
Negative Sampling
The quality of negative examples strongly affects retrieval training.
| Negative type | Description |
|---|---|
| Random negatives | Documents sampled randomly from the corpus |
| In-batch negatives | Other documents in the same training batch |
| Hard negatives | Irrelevant documents that look relevant |
| Mined negatives | Negatives found by an earlier retriever |
| Cross-encoder-filtered negatives | Candidate negatives checked by a stronger model |
Random negatives are often too easy. The model learns little from them because they are obviously unrelated. Hard negatives are more useful because they force the model to learn fine distinctions.
Example:
```
Query:         symptoms of vitamin B12 deficiency
Positive:      article about B12 deficiency symptoms
Hard negative: article about vitamin D deficiency symptoms
```

The hard negative shares many terms and topic structure, but it does not answer the query.
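Mined negatives, as described in the table above, can be sketched as a filter over an earlier retriever's output. The function name `mine_hard_negatives` and the `retrieve` callable are ours; note the stated assumption that unlabeled retrieved documents are negatives, which is noisy in practice (some may be unjudged positives).

```python
def mine_hard_negatives(query, positives, retrieve, k=50, n_neg=4):
    """Mine hard negatives: top retrieved docs that are not known positives.

    retrieve(query, k) -> ranked doc ids from an earlier retriever (e.g. BM25).
    Assumes unlabeled retrieved docs are negatives, which can be noisy.
    """
    positives = set(positives)
    candidates = retrieve(query, k)
    return [d for d in candidates if d not in positives][:n_neg]
```

The highest-ranked non-positives are exactly the documents that "look relevant" to the earlier retriever, which is what makes them hard.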
Passage Retrieval
Long documents are often split into passages before indexing. A passage may be a paragraph, a fixed-size token window, or a semantic section.
Passage retrieval has several advantages:
| Advantage | Reason |
|---|---|
| Better relevance | A short passage can match a query more precisely |
| Better generation grounding | RAG models receive focused context |
| Better vector search | Embeddings represent one local topic |
| Lower context cost | Fewer irrelevant tokens are passed downstream |
The tradeoff is that the system must map passages back to documents. It must also avoid returning many near-duplicate passages from the same document.
A common structure is:
```json
{
  "doc_id": "doc_123",
  "passage_id": "doc_123_004",
  "text": "The relevant paragraph...",
  "title": "Document title",
  "offset_start": 1830,
  "offset_end": 2410
}
```

Offsets allow the system to display snippets, citations, and highlights.
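A chunker that produces records in this shape can be sketched with fixed-size overlapping windows. This version splits by characters to stay dependency-free; real systems more often chunk by tokens or by semantic sections. The function name and parameter defaults are illustrative.

```python
def chunk_document(doc_id, text, title, window=800, overlap=200):
    """Split a document into overlapping character windows with offsets."""
    passages = []
    start = 0
    idx = 0
    while start < len(text):
        end = min(start + window, len(text))
        passages.append({
            "doc_id": doc_id,
            "passage_id": f"{doc_id}_{idx:03d}",
            "text": text[start:end],
            "title": title,
            "offset_start": start,
            "offset_end": end,
        })
        if end == len(text):
            break
        start = end - overlap  # overlap keeps evidence from being split cleanly
        idx += 1
    return passages
```

The overlap reduces the chance that a relevant sentence is cut in half at a chunk boundary, at the cost of some near-duplicate text across adjacent passages.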
Vector Indexes
Dense retrieval requires nearest neighbor search over many vectors. Exact search compares the query vector with every document vector. This is expensive for large collections.
Approximate nearest neighbor, or ANN, indexes accelerate vector search.
Common ANN ideas include clustering, graph search, product quantization, and hierarchical navigable small-world graphs.
At a high level, an index stores document embeddings:

$$\{\mathbf{d}_1, \mathbf{d}_2, \dots, \mathbf{d}_N\}.$$

At query time, it returns approximate top matches:

$$\operatorname*{top\text{-}k}_{i} \; \mathbf{q} \cdot \mathbf{d}_i.$$
ANN search trades some accuracy for speed and memory efficiency. For many systems, this tradeoff is necessary.
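For reference, the exact search that ANN indexes approximate is a single matrix-vector product followed by a sort. This sketch uses NumPy; production systems typically use a dedicated ANN library instead of scoring all $N$ vectors.

```python
import numpy as np

def exact_top_k(query_vec, doc_vecs, k=5):
    """Exact dot-product search: the baseline that ANN indexes approximate.

    doc_vecs: [N, D] array of document embeddings.
    Returns (indices, scores) of the top-k documents by score.
    """
    scores = doc_vecs @ query_vec   # [N] dot products
    top = np.argsort(-scores)[:k]   # highest scores first
    return top, scores[top]
```

Exact search is a useful evaluation baseline: comparing ANN results against it measures how much recall the approximation gives up.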
Evaluation Metrics
Retrieval systems are evaluated by ranked-list metrics.
| Metric | Meaning |
|---|---|
| Recall@k | Fraction of queries with a relevant item in the top $k$ |
| Precision@k | Fraction of the top $k$ results that are relevant |
| MRR | Mean reciprocal rank of the first relevant result |
| MAP | Mean average precision across queries |
| nDCG | Ranking quality with graded relevance |
| Hit rate | Whether any relevant item appears in the top $k$ |
Recall@k is important when retrieval feeds a downstream reader or generator. If the correct evidence is absent from the top $k$, the downstream model cannot answer correctly.
MRR is useful when the first correct result matters. nDCG is useful when relevance has grades, such as perfect, partial, and irrelevant.
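Recall@k and MRR are simple enough to sketch directly from their definitions in the table above. Both functions here take doc ids; the per-query Recall@k values would be averaged across a query set.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant item appears in the top k, else 0.0 (per query)."""
    return 1.0 if set(ranked_ids[:k]) & set(relevant_ids) else 0.0

def mean_reciprocal_rank(all_ranked, all_relevant):
    """Average 1/rank of the first relevant result over all queries."""
    total = 0.0
    for ranked, relevant in zip(all_ranked, all_relevant):
        relevant = set(relevant)
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant result counts for MRR
    return total / len(all_ranked)
```

A query whose first relevant document sits at rank 2 contributes $1/2$ to MRR, so the metric rewards systems that surface evidence early.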
Retrieval for RAG
Retrieval-augmented generation, or RAG, uses retrieved passages as context for a language model.
The retrieval system affects the final answer. Poor retrieval causes unsupported or incomplete generation. Good retrieval reduces hallucination by giving the model relevant evidence.
A typical RAG prompt contains:
```
Question:
...

Relevant passages:
[1] ...
[2] ...
[3] ...

Answer using only the passages above.
```

The retriever should optimize for answer support, not just keyword relevance. A passage can share words with the question but fail to contain the answer.
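Assembling this kind of prompt is a small formatting step. This sketch follows the template shown above; the function name and exact wording are illustrative, and real systems also enforce a token budget when selecting how many passages to include.

```python
def build_rag_prompt(question, passages):
    """Format retrieved passages into a numbered RAG prompt.

    passages: list of passage texts, best first.
    """
    lines = ["Question:", question, "", "Relevant passages:"]
    for i, p in enumerate(passages, start=1):
        lines.append(f"[{i}] {p}")
    lines.append("")
    lines.append("Answer using only the passages above.")
    return "\n".join(lines)
```

Numbering the passages lets the generator cite them as `[1]`, `[2]`, and so on, which ties answers back to the source metadata discussed below.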
For RAG, useful retrieval properties include:
| Property | Why it matters |
|---|---|
| High recall | Evidence must reach the generator |
| Low redundancy | Context budget should not be wasted |
| Source metadata | Citations require document IDs and offsets |
| Freshness | Answers may depend on current data |
| Permission filtering | Results must respect access control |
| Latency | Retrieval is on the user path |
Failure Modes
Information retrieval systems fail in several common ways.
| Failure mode | Description |
|---|---|
| Vocabulary mismatch | Relevant document uses different words |
| Semantic drift | Dense retriever returns topically similar but wrong documents |
| Exact-match failure | Dense retriever misses rare identifiers |
| Long-document dilution | Document embedding averages over too many topics |
| Stale index | New or updated documents absent from search |
| Duplicate flooding | Results dominated by near-duplicate passages |
| Poor chunking | Relevant evidence split across chunks |
| Metadata loss | Results lack source, date, or offsets |
| Permission leak | User sees documents they should not access |
Hybrid search, good chunking, reranking, metadata filters, and evaluation sets reduce these failures.
Summary
Information retrieval ranks documents or passages for a query. Sparse methods such as TF-IDF and BM25 rely on lexical overlap. Dense methods use learned embeddings to retrieve semantically related text. Cross-encoders improve precision by scoring query-document pairs jointly, but they are too expensive for first-stage search over large collections.
Practical systems often use hybrid retrieval, passage indexing, ANN vector search, and reranking. For question answering and RAG, retrieval quality is a core model quality factor. If the evidence is not retrieved, the downstream model cannot reliably produce the right answer.