# Retrieval Systems

A retrieval system finds relevant information from an external memory source. Instead of storing all knowledge directly inside neural network parameters, the model searches a database, vector index, document collection, or memory store during inference.

Retrieval systems are fundamental to modern foundation models because parametric memory is limited. A model’s weights cannot reliably store all facts, documents, conversations, codebases, or world knowledge. Retrieval provides dynamic access to external information.

The central idea is simple:

$$
\text{query}
\longrightarrow
\text{retrieve relevant context}
\longrightarrow
\text{generate or reason}.
$$

A retrieval system therefore extends a model’s effective memory and grounding ability.

### Parametric and Nonparametric Memory

Neural networks contain parametric memory inside their weights. During training, the model compresses statistical information into parameters.

For example, a language model may memorize:

- grammar
- common facts
- programming syntax
- semantic associations

However, weights are difficult to update. Retraining is expensive, and memorized information may become outdated.

Retrieval systems provide nonparametric memory. Knowledge exists outside the model and can be updated independently.

| Memory type | Storage location | Update method |
|---|---|---|
| Parametric memory | Model weights | Retraining |
| Nonparametric memory | External database | Index updates |

Modern systems increasingly combine both forms.

### Retrieval-Augmented Generation

Retrieval-augmented generation combines retrieval with language modeling.

Instead of generating only from the prompt, the model first retrieves supporting information.

The conditional distribution becomes

$$
p(y \mid x, r),
$$

where:

| Symbol | Meaning |
|---|---|
| $x$ | Input query |
| $r$ | Retrieved context |
| $y$ | Generated output |

The retrieved context may contain:

- documents
- passages
- code snippets
- database entries
- conversation history
- API outputs

The model conditions generation on both the original query and retrieved information.

### Basic Retrieval Pipeline

A retrieval system usually contains several stages.

| Stage | Purpose |
|---|---|
| Document ingestion | Store knowledge |
| Chunking | Split documents into units |
| Embedding | Convert chunks into vectors |
| Indexing | Build searchable structure |
| Query encoding | Convert query into vector |
| Similarity search | Find nearest neighbors |
| Reranking | Improve result ordering |
| Generation | Produce final response |

The pipeline is:

```text id="r2q7wo"
Documents
    ↓
Chunking
    ↓
Embeddings
    ↓
Vector Index
    ↓
Query Embedding
    ↓
Similarity Search
    ↓
Retrieved Context
    ↓
Language Model
```

This architecture is widely used in modern AI assistants and search systems.
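
These stages can be wired together in a few lines. The sketch below is minimal and assumes the `sentence-transformers` library and its `all-MiniLM-L6-v2` checkpoint; any embedding model could stand in:

```python id="p4m2kd"
# End-to-end pipeline sketch, assuming sentence-transformers is installed.
from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

documents = [
    "Batch normalization normalizes activations using batch statistics.",
    "Stochastic gradient descent updates parameters with noisy gradients.",
    "Attention computes a weighted sum of value vectors.",
]

# Ingestion + embedding: each document becomes a dense vector.
doc_embs = model.encode(documents, convert_to_tensor=True)

# Query encoding, similarity search, and top-k selection.
query_emb = model.encode("How does SGD work?", convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_embs)       # shape (1, N)
best = torch.topk(scores.squeeze(0), k=2)

retrieved = [documents[i] for i in best.indices]
# `retrieved` is then inserted into the language model's prompt.
```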

### Embedding Models

Retrieval systems usually represent text as dense vectors.

An encoder maps text into an embedding:

$$
z = f_{\theta}(x),
$$

where:

$$
z \in \mathbb{R}^d.
$$

Semantically similar texts should produce nearby embeddings.

For example:

| Query | Relevant document |
|---|---|
| “How does SGD work?” | Optimization explanation |
| “PyTorch tensor broadcasting” | Tensor shape article |
| “Transformer attention mechanism” | Attention paper |

The embeddings may be produced using:

- transformer encoders
- contrastive learning models
- sentence embedding models
- multimodal encoders
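
As an illustration of the first option, here is a minimal mean-pooled transformer encoder, assuming the Hugging Face `transformers` library and the `sentence-transformers/all-MiniLM-L6-v2` checkpoint:

```python id="en8v3q"
# Sketch of a sentence encoder: mean-pool token states into one vector.
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (B, T, d)
    mask = batch["attention_mask"].unsqueeze(-1)    # (B, T, 1)
    # Average token states, ignoring padding positions.
    return (hidden * mask).sum(1) / mask.sum(1)

z = embed(["How does SGD work?", "An optimization explanation"])
print(z.shape)  # (2, 384) for this checkpoint
```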

### Similarity Search

After embedding documents, retrieval becomes a nearest-neighbor search problem.

Suppose:

$$
q \in \mathbb{R}^d
$$

is a query embedding and

$$
D = \{d_1,d_2,\ldots,d_n\}
$$

is a set of document embeddings.

The system retrieves documents maximizing similarity:

$$
d^* =
\arg\max_{d_i}
s(q,d_i).
$$

A common similarity metric is cosine similarity:

$$
s(q,d) =
\frac{q^\top d}
{\|q\|\|d\|}.
$$

This measures the cosine of the angle between the two vectors, ignoring their magnitudes.
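
A toy example of this selection rule in PyTorch:

```python id="c9t1sm"
import torch
import torch.nn.functional as F

q = torch.tensor([1.0, 0.0])                 # query embedding
D = torch.tensor([[0.9, 0.1],                # d_1: nearly parallel to q
                  [0.0, 1.0],                # d_2: orthogonal to q
                  [-1.0, 0.0]])              # d_3: opposite direction

# Cosine similarity between q and every row of D.
scores = F.cosine_similarity(q.unsqueeze(0), D, dim=-1)
print(scores)           # tensor([ 0.9939,  0.0000, -1.0000])
print(scores.argmax())  # tensor(0) -> d_1 maximizes s(q, d)
```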

### Dense Retrieval

Dense retrieval uses continuous embeddings.

Each document chunk becomes a vector; stacking the $N$ chunk embeddings row-wise yields a matrix:

$$
E \in \mathbb{R}^{N \times d}.
$$

Answering a query then reduces to nearest-neighbor search over the rows of $E$.

Advantages:

| Property | Benefit |
|---|---|
| Semantic matching | Handles paraphrases |
| Generalization | Learns conceptual similarity |
| Compact representation | Efficient storage |
| Differentiable training | End-to-end optimization |

Dense retrieval can retrieve semantically related text even when exact keywords differ.

Example:

| Query | Retrieved concept |
|---|---|
| “car engine” | “automobile motor” |
| “SGD instability” | “optimization divergence” |

### Sparse Retrieval

Sparse retrieval uses symbolic features such as keywords or term frequencies.

Traditional systems include:

- TF-IDF
- BM25
- inverted indexes

A sparse vector may contain one dimension per vocabulary term:

$$
x \in \mathbb{R}^{|V|}.
$$

Most dimensions are zero.
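
A toy sketch of such a representation, using raw term counts over a tiny vocabulary (a real system would apply TF-IDF or BM25 weighting):

```python id="sp6w2r"
from collections import Counter

vocab = {"car": 0, "engine": 1, "motor": 2, "repair": 3}

def sparse_vector(text):
    # One dimension per vocabulary term; most entries stay zero.
    counts = Counter(text.lower().split())
    return [counts.get(term, 0) for term in vocab]

print(sparse_vector("engine repair engine"))  # [0, 2, 0, 1]
```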

Sparse retrieval is highly effective for:

- exact matching
- rare terms
- identifiers
- code symbols
- names

Dense and sparse retrieval are often combined in hybrid systems.

### Hybrid Retrieval

Hybrid systems combine semantic and lexical retrieval.

The final score may be:

$$
s(q,d) =
\lambda s_{\text{dense}}
+
(1-\lambda)s_{\text{sparse}}.
$$

This improves robustness because:

- dense retrieval captures semantics
- sparse retrieval captures exact matches

Hybrid retrieval is widely used in production systems.
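
A minimal sketch of this interpolation, assuming dense and sparse scores were computed for the same candidate set and need rescaling to comparable ranges:

```python id="hy3n8b"
import torch

def hybrid_scores(dense, sparse, lam=0.5):
    # Min-max rescale each score vector so the two are comparable,
    # then interpolate with weight lambda.
    def rescale(s):
        return (s - s.min()) / (s.max() - s.min() + 1e-8)
    return lam * rescale(dense) + (1 - lam) * rescale(sparse)

dense = torch.tensor([0.82, 0.75, 0.10])   # cosine similarities
sparse = torch.tensor([1.2, 7.9, 0.0])     # e.g. BM25 scores
print(hybrid_scores(dense, sparse, lam=0.6).argmax())
```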

### Document Chunking

Large documents are usually divided into smaller chunks before indexing.

Chunking matters because transformer context windows are finite.

A document:

```text id="qqbphw"
100-page PDF
```

may be split into:

- paragraphs
- sections
- sliding windows
- semantic blocks

Chunk size affects retrieval quality.

| Chunk size | Effect |
|---|---|
| Too small | Missing context |
| Too large | Reduced precision |

Overlap is often used:

```text id="h5ajqf"
Chunk 1: sentences 1–10
Chunk 2: sentences 8–18
```

This preserves continuity across boundaries.
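
A minimal sliding-window chunker over sentences, reproducing the overlap pattern above:

```python id="ck5u7d"
def chunk(sentences, size=10, overlap=2):
    # Step forward by (size - overlap) so consecutive chunks share
    # `overlap` sentences across each boundary.
    step = size - overlap
    return [sentences[i:i + size]
            for i in range(0, max(len(sentences) - overlap, 1), step)]

sents = [f"s{i}" for i in range(1, 19)]   # s1 .. s18
for c in chunk(sents):
    print(c[0], "...", c[-1])             # s1 ... s10, then s9 ... s18
```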

### Vector Databases

Embeddings are stored inside vector indexes or vector databases.

The index supports approximate nearest-neighbor search.

Given:

$$
E \in \mathbb{R}^{N \times d},
$$

the goal is efficient retrieval even when $N$ is extremely large.

Common indexing structures include:

| Structure | Purpose |
|---|---|
| Flat index | Exact search |
| IVF | Clustered search |
| HNSW | Graph-based search |
| PQ | Quantized compression |

Approximate methods trade small accuracy loss for large speed improvements.
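
A minimal sketch with the faiss library, assuming it is installed; the same vectors can back an exact flat index or an approximate HNSW index:

```python id="vx2j9f"
import faiss
import numpy as np

N, d = 100_000, 128
xb = np.random.rand(N, d).astype("float32")   # document embeddings
xq = np.random.rand(1, d).astype("float32")   # query embedding

flat = faiss.IndexFlatL2(d)                   # exact exhaustive search
flat.add(xb)

hnsw = faiss.IndexHNSWFlat(d, 32)             # graph-based ANN (M=32)
hnsw.add(xb)

D, I = flat.search(xq, 5)     # exact top-5 neighbors
D2, I2 = hnsw.search(xq, 5)   # approximate top-5, much faster at scale
```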

### Approximate Nearest Neighbor Search

Exact search requires comparing the query against every document vector:

$$
O(Nd).
$$

For billions of embeddings, this is too expensive.

Approximate nearest-neighbor methods reduce computation.

The system searches only promising regions of the vector space.

A good ANN system preserves:

- high recall
- low latency
- low memory usage

This becomes critical for large-scale AI systems.
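
As one example, an IVF index clusters the vectors and searches only a few clusters per query. A minimal faiss sketch, assuming the library is installed; `nprobe` controls the recall/latency trade-off:

```python id="iv7p4t"
# IVF sketch: cluster the vectors, then search only `nprobe` clusters.
import faiss
import numpy as np

N, d, nlist = 100_000, 128, 1024
xb = np.random.rand(N, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)               # coarse cluster centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                # learn the clustering
index.add(xb)

index.nprobe = 8                               # clusters searched per query
xq = np.random.rand(1, d).astype("float32")
D, I = index.search(xq, 5)                     # approximate top-5
```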

### Reranking

Initial retrieval may produce imperfect rankings.

A reranker improves quality using a more expensive model.

Pipeline:

```text id="xz5v8j"
Query
  ↓
Fast retriever
  ↓
Top-k candidates
  ↓
Cross-encoder reranker
  ↓
Final ranking
```

A reranker jointly processes query and document:

$$
s(q,d) =
f_{\theta}(q,d).
$$

Cross-attention often improves ranking accuracy because the model directly compares tokens from both sequences.
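
A minimal reranking sketch, assuming the `sentence-transformers` library and its `cross-encoder/ms-marco-MiniLM-L-6-v2` checkpoint:

```python id="rr1z6h"
# Rerank candidates with a cross-encoder that reads the query and each
# document jointly rather than as separate embeddings.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How does batch normalization work?"
candidates = [
    "Batch normalization normalizes activations using batch statistics.",
    "Dropout randomly zeroes activations during training.",
]

scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(scores, candidates), reverse=True)
```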

### Retrieval and Attention

Retrieval can be interpreted as external attention.

Transformer attention selects relevant tokens from internal context:

$$
\text{softmax}
\left(
\frac{QK^\top}{\sqrt{d}}
\right)V.
$$

Retrieval selects relevant external memory entries.

Conceptually:

| Internal attention | External retrieval |
|---|---|
| Context window | Database |
| Token keys | Document embeddings |
| Attention weights | Similarity scores |
| Hidden states | Retrieved chunks |

Retrieval extends attention beyond the fixed context length.
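
The analogy can be made concrete: applying a softmax to similarity scores turns retrieval into attention weights over external memory. A toy sketch:

```python id="ea8q5w"
import torch
import torch.nn.functional as F

d = 64
q = torch.randn(d)             # query embedding
M = torch.randn(1000, d)       # external memory: document embeddings

# Similarity scores play the role of attention logits...
logits = M @ q / d**0.5
# ...and a softmax over them yields "attention" weights over memory.
weights = F.softmax(logits, dim=-1)
read = weights @ M             # soft read from external memory

# Hard retrieval instead keeps only the top-k entries.
topk = torch.topk(logits, k=5).indices
```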

### Retrieval-Augmented Transformers

A retrieval-augmented transformer may operate as follows:

1. Encode query.
2. Retrieve documents.
3. Insert retrieved context into prompt.
4. Generate output.

Example prompt:

```text id="df1k7s"
Question:
How does batch normalization work?

Retrieved context:
Batch normalization normalizes activations using batch statistics...

Answer:
```

The model can then generate grounded responses using retrieved information.
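
A sketch of the prompt-assembly step; the `retrieve` helper here is a hypothetical stand-in for any retriever from the earlier sections:

```python id="pb2x7n"
def retrieve(question, k=3):
    # Hypothetical stand-in for the retrieval pipeline described above.
    corpus = [
        "Batch normalization normalizes activations using batch statistics.",
        "It stabilizes training by keeping activation distributions steady.",
    ]
    return corpus[:k]

def build_prompt(question, chunks):
    # Insert retrieved context between the question and the answer slot.
    context = "\n\n".join(chunks)
    return f"Question:\n{question}\n\nRetrieved context:\n{context}\n\nAnswer:"

question = "How does batch normalization work?"
prompt = build_prompt(question, retrieve(question))
```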

### Memory Systems

Retrieval systems can support long-term memory.

Examples include:

| Memory type | Description |
|---|---|
| Episodic memory | Previous conversations |
| Semantic memory | Facts and documents |
| Working memory | Current context |
| Tool memory | API outputs |
| Agent memory | Plans and actions |

During reasoning, an AI assistant may retrieve:

- previous chats
- user preferences
- uploaded files
- search results
- code repositories

### Multimodal Retrieval

Retrieval is no longer limited to text.

Modern systems retrieve:

| Query modality | Retrieved modality |
|---|---|
| Text | Images |
| Image | Text |
| Audio | Video |
| Video | Documents |

A multimodal embedding model maps different modalities into a shared vector space.

For example:

$$
z_x = f_{\theta}(x),
\quad
z_t = g_{\phi}(t).
$$

Image and text embeddings become directly comparable.

This enables:

- image search
- video retrieval
- caption search
- multimodal recommendation
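
A toy sketch of the shared space, with random linear maps standing in for trained image and text encoders such as CLIP:

```python id="mm9k3c"
import torch
import torch.nn as nn
import torch.nn.functional as F

d_shared = 256

# Stand-ins for trained encoders: f_theta for images, g_phi for text.
f_theta = nn.Linear(2048, d_shared)   # pooled image features -> shared space
g_phi = nn.Linear(768, d_shared)      # pooled text features  -> shared space

image_feats = torch.randn(4, 2048)
text_feats = torch.randn(3, 768)

z_x = F.normalize(f_theta(image_feats), dim=-1)
z_t = F.normalize(g_phi(text_feats), dim=-1)

# Cross-modal similarity: every image scored against every caption.
sim = z_x @ z_t.T                     # shape (4, 3)
```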

### Retrieval for Agents

Agents use retrieval to support planning and tool use.

Examples:

| Agent capability | Retrieval target |
|---|---|
| Coding assistant | Source files |
| Research assistant | Web documents |
| Robot | Environment memory |
| Personal assistant | Calendar and notes |

Retrieval enables stateful behavior across long horizons.

Without retrieval, a model is constrained by finite context windows and static parameters.

### PyTorch Example

A simple dense retrieval system:

```python id="k7wrfv"
import torch
import torch.nn.functional as F

def cosine_similarity(query, docs):
    query = F.normalize(query, dim=-1)
    docs = F.normalize(docs, dim=-1)

    return query @ docs.T
```

Retrieval:

```python id="ev0q4g"
query_emb = encoder(query_text)
doc_embs = encoder(document_texts)

scores = cosine_similarity(query_emb, doc_embs)

topk = torch.topk(scores, k=5)
indices = topk.indices
```

The indices identify the most similar documents.

### Failure Modes

Retrieval systems have several failure modes.

| Failure | Description |
|---|---|
| Embedding collapse | Poor semantic separation |
| Retrieval drift | Wrong semantic neighborhood |
| Hallucinated grounding | Model ignores or contradicts retrieved context |
| Context overload | Too many retrieved chunks |
| Outdated indexes | Stale knowledge |
| Adversarial retrieval | Malicious retrieved content |

A retrieval system is only as reliable as:

- its embeddings
- its index quality
- its chunking strategy
- its reranking pipeline

### Retrieval at Scale

Large systems may store billions of embeddings.

Important engineering concerns include:

| Concern | Description |
|---|---|
| Compression | Reduce memory footprint |
| Sharding | Distributed indexes |
| Streaming updates | Dynamic insertion |
| Latency | Fast search |
| Recall | Retrieval accuracy |
| Filtering | Metadata constraints |

Production retrieval systems therefore combine:

- vector search
- distributed storage
- metadata filtering
- caching
- ranking pipelines

### Summary

Retrieval systems extend neural networks with external memory. The central ideas are embedding representations, similarity search, vector indexing, reranking, and retrieval-augmented generation.

Modern foundation models increasingly depend on retrieval because parametric memory alone is insufficient for scalable reasoning and factual grounding. Retrieval transforms a static neural network into a dynamic information-processing system capable of accessing large external knowledge stores during inference.

