# Information Retrieval

Information retrieval is the task of finding relevant items from a collection in response to a query. The collection may contain web pages, documents, passages, emails, tickets, code files, papers, products, images, or database records. The query may be a few keywords, a natural language question, a document, or an embedding.

In natural language processing systems, information retrieval is used for search, question answering, recommendation, document discovery, retrieval-augmented generation, duplicate detection, and semantic navigation.

A retrieval system learns or defines a scoring function

$$
s(q,d)
$$

where $q$ is the query and $d$ is a candidate document. The system ranks documents by score and returns the top results.

### Retrieval as Ranking

Unlike classification, retrieval usually chooses from a large collection. The model does not merely assign one of a fixed set of labels. It must rank many candidates.

Given a query $q$ and a document collection

$$
\mathcal{D} = \{d_1,d_2,\ldots,d_N\},
$$

retrieval returns an ordered list

$$
d_{(1)}, d_{(2)}, \ldots, d_{(k)}
$$

where $d_{(1)}$ should be the most relevant document.

The retrieval objective is not just to find any relevant result. The best results should appear early in the ranking. Users rarely inspect every result, and downstream systems usually consume only the top few passages.
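
As a minimal sketch, assuming a generic `score` function supplied by the caller, ranking reduces to scoring every candidate and keeping the top $k$:

```python
def rank(query, docs, score, k=10):
    """Score every document against the query and return the top-k (doc, score) pairs."""
    scored = [(d, score(query, d)) for d in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```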

### Sparse Retrieval

Sparse retrieval represents text using high-dimensional sparse vectors. Each dimension usually corresponds to a term or token in the vocabulary.

A simple representation is bag-of-words. A document is represented by the terms it contains, ignoring word order.

The classic sparse scoring method is TF-IDF. It combines term frequency and inverse document frequency.

Term frequency measures how often a term appears in a document. Inverse document frequency gives higher weight to rare terms.

$$
\text{idf}(t) =
\log \frac{N}{\text{df}(t)}
$$

where $N$ is the number of documents and $\text{df}(t)$ is the number of documents containing term $t$.

A common TF-IDF score is

$$
s(q,d) =
\sum_{t \in q \cap d}
\text{tf}(t,d)\cdot \text{idf}(t).
$$
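
A toy sketch of this score, assuming documents have already been tokenized into lists of terms (a real system would use an inverted index rather than scanning the corpus):

```python
import math
from collections import Counter

def idf(term, docs):
    """idf(t) = log(N / df(t)), following the formula above."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_score(query_terms, doc_terms, docs):
    """Sum tf(t, d) * idf(t) over terms shared by the query and the document."""
    tf = Counter(doc_terms)
    return sum(tf[t] * idf(t, docs) for t in set(query_terms) if t in tf)
```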

Sparse retrieval is strong when exact terms matter. It works well for names, identifiers, product codes, error messages, legal citations, scientific terms, and rare phrases.

### BM25

BM25 is a widely used sparse retrieval function. It improves simple TF-IDF by adding term-frequency saturation and document-length normalization.

A simplified BM25 score is

$$
s(q,d) =
\sum_{t \in q}
\text{idf}(t)
\cdot
\frac{
\text{tf}(t,d)(k_1+1)
}{
\text{tf}(t,d) + k_1(1-b+b|d|/\text{avgdl})
}.
$$

Here $k_1$ controls term-frequency saturation, $b$ controls length normalization, $|d|$ is document length, and $\text{avgdl}$ is average document length.
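
A toy sketch of this formula under the same tokenized-document assumption, using the common defaults $k_1 = 1.5$ and $b = 0.75$:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, docs, k1=1.5, b=0.75):
    """Simplified BM25 over tokenized documents, matching the formula above."""
    tf = Counter(doc_terms)
    avgdl = sum(len(d) for d in docs) / len(docs)
    norm = k1 * (1 - b + b * len(doc_terms) / avgdl)   # length-normalized k_1 term
    score = 0.0
    for t in set(query_terms):
        df = sum(1 for d in docs if t in d)
        if t not in tf or df == 0:
            continue
        score += math.log(len(docs) / df) * tf[t] * (k1 + 1) / (tf[t] + norm)
    return score
```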

BM25 remains a strong baseline. Many neural retrieval systems are compared against it. In production search systems, BM25 is often used as a first-stage retriever because it is fast, interpretable, and effective.

### Dense Retrieval

Dense retrieval represents queries and documents as dense vectors in a learned embedding space.

A query encoder maps the query to a vector:

$$
h_q = f_\theta(q)
$$

and a document encoder maps the document to a vector:

$$
h_d = g_\theta(d).
$$

The score is often a dot product:

$$
s(q,d) = h_q^\top h_d.
$$

If the vectors are normalized, the dot product is equivalent to cosine similarity.
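
A small numeric check of that equivalence, using random vectors as stand-in embeddings:

```python
import numpy as np

h_q = np.random.randn(768)   # toy query embedding
h_d = np.random.randn(768)   # toy document embedding

cosine = h_q @ h_d / (np.linalg.norm(h_q) * np.linalg.norm(h_d))

# After L2 normalization, the plain dot product gives the same value.
q_hat = h_q / np.linalg.norm(h_q)
d_hat = h_d / np.linalg.norm(h_d)
assert np.isclose(q_hat @ d_hat, cosine)
```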

Dense retrieval can find semantically related text even when the exact words differ. For example, a dense retriever may match:

```text
query: how to reset my password
document: account recovery instructions
```

Sparse retrieval may miss this match if there is little lexical overlap. Dense retrieval is therefore useful for semantic search, natural language questions, paraphrases, and cross-lingual retrieval.

### Bi-Encoders

Most dense retrieval systems use a bi-encoder. The query and document are encoded separately.

$$
h_q = f_\theta(q), \quad h_d = g_\theta(d).
$$

This design is efficient because document embeddings can be precomputed and stored in a vector index.

At query time, the system computes only the query embedding and searches for nearest document embeddings.

```python
query_vec = query_encoder(query)      # [D]
doc_vecs = index.search(query_vec)    # top-k nearest vectors
```

The limitation is that all query-document interaction is reduced to a single similarity between one vector per side. This can miss fine-grained relevance signals.

### Cross-Encoders

A cross-encoder reads the query and document together.

```text
[CLS] query [SEP] document [SEP]
```

The model outputs a relevance score:

$$
s(q,d) = f_\theta(q,d).
$$

Because the query and document attend to each other token by token, cross-encoders are often more accurate than bi-encoders. However, they are much more expensive. A cross-encoder must run once for every query-document pair.

For this reason, cross-encoders are commonly used as rerankers. A fast retriever first selects a few hundred candidates. The cross-encoder reranks those candidates.
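
A sketch of cross-encoder reranking with the Hugging Face `transformers` API; the checkpoint name is only an example of a publicly available relevance-tuned cross-encoder and can be swapped for any similar model:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # example relevance cross-encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def rerank(query, passages, top_k=10):
    """Score each (query, passage) pair jointly and return passages sorted by score."""
    inputs = tokenizer([query] * len(passages), passages,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = model(**inputs).logits.squeeze(-1)   # assumes a single-logit relevance head
    order = scores.argsort(descending=True)[:top_k].tolist()
    return [(passages[i], scores[i].item()) for i in order]
```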

### Two-Stage Retrieval

A practical neural search system often uses multiple stages.

The first stage emphasizes recall. It retrieves candidates quickly from a large collection.

The second stage emphasizes precision. It reranks a smaller candidate set using a more expensive model.

A common pipeline is:

1. Use BM25 or dense retrieval to fetch top 100 to 1000 candidates.  
2. Use a cross-encoder reranker to score the candidates.  
3. Return the top 10 to 50 results.  
4. Optionally pass top passages to a QA or generation model.

This architecture balances speed and quality. The first stage makes search scalable. The second stage improves ranking quality.
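
The glue between the stages is mostly plumbing. A sketch with hypothetical `first_stage` and `reranker` callables standing in for the components described above:

```python
def two_stage_search(query, first_stage, reranker, k_candidates=200, k_final=20):
    """Recall-oriented retrieval followed by precision-oriented reranking.

    `first_stage(query, k)` returns candidate passages (e.g. BM25 or a bi-encoder);
    `reranker(query, passages, top_k)` rescores them (e.g. a cross-encoder).
    """
    candidates = first_stage(query, k_candidates)    # stage 1: cheap, high recall
    return reranker(query, candidates, k_final)      # stage 2: expensive, high precision
```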

### Hybrid Retrieval

Hybrid retrieval combines sparse and dense methods.

Sparse retrieval is good at exact matching. Dense retrieval is good at semantic matching. Combining them often improves robustness.

A simple hybrid score is

$$
s(q,d) =
\alpha s_{\text{sparse}}(q,d)
+
(1-\alpha)s_{\text{dense}}(q,d),
$$

where $\alpha$ controls the balance.

Another common method is rank fusion. Reciprocal rank fusion combines ranked lists without requiring comparable score scales:

$$
\text{RRF}(d) =
\sum_{r \in R}
\frac{1}{k + \text{rank}_r(d)}.
$$

Here $R$ is the set of ranking systems, $\text{rank}_r(d)$ is the rank of document $d$ in system $r$, and $k$ is a smoothing constant (commonly around 60), distinct from the top-$k$ cutoff.
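
A minimal sketch of reciprocal rank fusion over ranked lists of document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs with RRF; k=60 is a common default."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a sparse ranking and a dense ranking.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])
```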

Hybrid retrieval is often the safest default for technical domains. Exact terms matter, but users also ask semantic questions.

### Contrastive Training

Dense retrievers are commonly trained with contrastive objectives. The model receives positive query-document pairs and negative examples.

For a query $q_i$, a positive document $d_i^+$, and negative documents $d_{i1}^-, \ldots, d_{in}^-$, the loss encourages the positive document to score higher than the negatives.

A common softmax contrastive loss is

$$
\mathcal{L}_i =
-\log
\frac{
\exp(s(q_i,d_i^+)/\tau)
}{
\exp(s(q_i,d_i^+)/\tau)
+
\sum_j \exp(s(q_i,d_{ij}^-)/\tau)
}.
$$

The temperature $\tau$ controls the sharpness of the distribution.

In-batch negatives are widely used. Other examples in the same batch serve as negatives, which makes training efficient.

```python
import torch
import torch.nn.functional as F

def contrastive_retrieval_loss(query_emb, doc_emb, temperature=0.05):
    # query_emb: [B, D]
    # doc_emb: [B, D]
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)

    scores = query_emb @ doc_emb.T          # [B, B]
    scores = scores / temperature

    labels = torch.arange(scores.size(0), device=scores.device)

    loss = F.cross_entropy(scores, labels)
    return loss
```

This assumes that query $i$ matches document $i$, and all other documents in the batch are negatives.

### Negative Sampling

The quality of negative examples strongly affects retrieval training.

| Negative type | Description |
|---|---|
| Random negatives | Documents sampled randomly from the corpus |
| In-batch negatives | Other documents in the same training batch |
| Hard negatives | Irrelevant documents that look relevant |
| Mined negatives | Negatives found by an earlier retriever |
| Cross-encoder-filtered negatives | Candidate negatives checked by a stronger model |

Random negatives are often too easy. The model learns little from them because they are obviously unrelated. Hard negatives are more useful because they force the model to learn fine distinctions.

Example:

```text
Query: symptoms of vitamin B12 deficiency
Positive: article about B12 deficiency symptoms
Hard negative: article about vitamin D deficiency symptoms
```

The hard negative shares many terms and topic structure, but it does not answer the query.
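
Mined negatives are often produced by running an earlier retriever and keeping highly ranked passages that are not labeled positive. A sketch with a hypothetical `retriever` callable:

```python
def mine_hard_negatives(query, positive_id, retriever, k=50, n_negatives=4):
    """Keep top-ranked passages from a first-stage retriever that are not the positive.

    `retriever(query, k)` is assumed to return a ranked list of passage IDs.
    """
    ranked = retriever(query, k)
    return [pid for pid in ranked if pid != positive_id][:n_negatives]
```

In practice the mined candidates are often filtered with a stronger model, since some of them may be unlabeled positives rather than true negatives.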

### Passage Retrieval

Long documents are often split into passages before indexing. A passage may be a paragraph, a fixed-size token window, or a semantic section.

Passage retrieval has several advantages:

| Advantage | Reason |
|---|---|
| Better relevance | A short passage can match a query more precisely |
| Better generation grounding | RAG models receive focused context |
| Better vector search | Embeddings represent one local topic |
| Lower context cost | Fewer irrelevant tokens are passed downstream |

The tradeoff is that the system must map passages back to documents. It must also avoid returning many near-duplicate passages from the same document.

A common structure is:

```json
{
  "doc_id": "doc_123",
  "passage_id": "doc_123_004",
  "text": "The relevant paragraph...",
  "title": "Document title",
  "offset_start": 1830,
  "offset_end": 2410
}
```

Offsets allow the system to display snippets, citations, and highlights.
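
A sketch of a character-window chunker that emits records in this shape; the window and overlap sizes are arbitrary, and production systems often split on sentence or section boundaries instead:

```python
def chunk_document(doc_id, text, title, window=800, overlap=200):
    """Split a document into overlapping character windows, preserving offsets."""
    passages, start, idx = [], 0, 0
    while start < len(text):
        end = min(start + window, len(text))
        passages.append({
            "doc_id": doc_id,
            "passage_id": f"{doc_id}_{idx:03d}",
            "text": text[start:end],
            "title": title,
            "offset_start": start,
            "offset_end": end,
        })
        if end == len(text):
            break
        start = end - overlap   # overlap so evidence is not cut at a boundary
        idx += 1
    return passages
```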

### Vector Indexes

Dense retrieval requires nearest neighbor search over many vectors. Exact search compares the query vector with every document vector. This is expensive for large collections.

Approximate nearest neighbor, or ANN, indexes accelerate vector search.

Common ANN techniques include clustering into inverted lists, product quantization, and graph-based search such as hierarchical navigable small-world (HNSW) graphs.

At a high level, an index stores document embeddings:

$$
H = \{h_{d_1}, h_{d_2}, \ldots, h_{d_N}\}.
$$

At query time, it returns approximate top matches:

$$
\operatorname{ANN}(h_q, H) \rightarrow \{d_{(1)},\ldots,d_{(k)}\}.
$$

ANN search trades some accuracy for speed and memory efficiency. For many systems, this tradeoff is necessary.
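
A sketch using FAISS as one example ANN library, with random vectors standing in for document embeddings:

```python
import numpy as np
import faiss  # assumes the faiss library (faiss-cpu or faiss-gpu) is installed

dim = 768
doc_vecs = np.random.rand(10_000, dim).astype("float32")  # stand-in document embeddings
faiss.normalize_L2(doc_vecs)  # with unit vectors, nearest-by-L2 matches nearest-by-cosine

index = faiss.IndexHNSWFlat(dim, 32)  # graph-based ANN index (HNSW)
index.add(doc_vecs)

query_vec = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query_vec)
distances, doc_ids = index.search(query_vec, 10)  # approximate top-10 document indices
```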

### Evaluation Metrics

Retrieval systems are evaluated by ranked-list metrics.

| Metric | Meaning |
|---|---|
| Recall@k | Fraction of queries with a relevant item in the top $k$ |
| Precision@k | Fraction of top $k$ results that are relevant |
| MRR | Mean reciprocal rank of the first relevant result |
| MAP | Mean average precision across queries |
| nDCG | Ranking quality with graded relevance |
| Hit rate | Whether any relevant item appears in top $k$ |

Recall@k is important when retrieval feeds a downstream reader or generator. If the correct evidence is absent from top $k$, the downstream model cannot answer correctly.

MRR is useful when the first correct result matters. nDCG is useful when relevance has grades, such as perfect, partial, and irrelevant.
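
Per-query sketches of two of these metrics, following the definitions in the table (averaging the per-query values over an evaluation set gives the reported numbers):

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return float(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```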

### Retrieval for RAG

Retrieval-augmented generation, or RAG, uses retrieved passages as context for a language model.

The retrieval system affects the final answer. Poor retrieval causes unsupported or incomplete generation. Good retrieval reduces hallucination by giving the model relevant evidence.

A typical RAG prompt contains:

```text
Question:
...

Relevant passages:
[1] ...
[2] ...
[3] ...

Answer using only the passages above.
```

The retriever should optimize for answer support, not just keyword relevance. A passage can share words with the question but fail to contain the answer.

For RAG, useful retrieval properties include:

| Property | Why it matters |
|---|---|
| High recall | Evidence must reach the generator |
| Low redundancy | Context budget should not be wasted |
| Source metadata | Citations require document IDs and offsets |
| Freshness | Answers may depend on current data |
| Permission filtering | Results must respect access control |
| Latency | Retrieval is on the user path |

### Failure Modes

Information retrieval systems fail in several common ways.

| Failure mode | Description |
|---|---|
| Vocabulary mismatch | Relevant document uses different words |
| Semantic drift | Dense retriever returns topically similar but wrong documents |
| Exact-match failure | Dense retriever misses rare identifiers |
| Long-document dilution | Document embedding averages over too many topics |
| Stale index | New or updated documents absent from search |
| Duplicate flooding | Results dominated by near-duplicate passages |
| Poor chunking | Relevant evidence split across chunks |
| Metadata loss | Results lack source, date, or offsets |
| Permission leak | User sees documents they should not access |

Hybrid search, good chunking, reranking, metadata filters, and evaluation sets reduce these failures.

### Summary

Information retrieval ranks documents or passages for a query. Sparse methods such as TF-IDF and BM25 rely on lexical overlap. Dense methods use learned embeddings to retrieve semantically related text. Cross-encoders improve precision by scoring query-document pairs jointly, but they are too expensive for first-stage search over large collections.

Practical systems often use hybrid retrieval, passage indexing, ANN vector search, and reranking. For question answering and RAG, retrieval quality is a core model quality factor. If the evidence is not retrieved, the downstream model cannot reliably produce the right answer.

