
Retrieval Systems

A retrieval system finds relevant information from an external memory source. Instead of storing all knowledge directly inside neural network parameters, the model searches a database, vector index, document collection, or memory store during inference.

Retrieval systems are fundamental to modern foundation models because parametric memory is limited. A model’s weights cannot reliably store all facts, documents, conversations, codebases, or world knowledge. Retrieval provides dynamic access to external information.

The central idea is simple:

\text{query} \longrightarrow \text{retrieve relevant context} \longrightarrow \text{generate or reason}.

A retrieval system therefore extends a model’s effective memory and grounding ability.

Parametric and Nonparametric Memory

Neural networks contain parametric memory inside their weights. During training, the model compresses statistical information into parameters.

For example, a language model may memorize:

  • grammar
  • common facts
  • programming syntax
  • semantic associations

However, weights are difficult to update. Retraining is expensive, and memorized information may become outdated.

Retrieval systems provide nonparametric memory. Knowledge exists outside the model and can be updated independently.

| Memory type | Storage location | Update method |
| --- | --- | --- |
| Parametric memory | Model weights | Retraining |
| Nonparametric memory | External database | Index updates |

Modern systems increasingly combine both forms.

Retrieval-Augmented Generation

Retrieval-augmented generation combines retrieval with language modeling.

Instead of generating only from the prompt, the model first retrieves supporting information.

The conditional distribution becomes

p(y \mid x, r),

where:

| Symbol | Meaning |
| --- | --- |
| x | Input query |
| r | Retrieved context |
| y | Generated output |

The retrieved context may contain:

  • documents
  • passages
  • code snippets
  • database entries
  • conversation history
  • API outputs

The model conditions generation on both the original query and retrieved information.

Basic Retrieval Pipeline

A retrieval system usually contains several stages.

| Stage | Purpose |
| --- | --- |
| Document ingestion | Store knowledge |
| Chunking | Split documents into units |
| Embedding | Convert chunks into vectors |
| Indexing | Build searchable structure |
| Query encoding | Convert query into vector |
| Similarity search | Find nearest neighbors |
| Reranking | Improve result ordering |
| Generation | Produce final response |

The pipeline is:

Documents → Chunking → Embeddings → Vector Index
Query → Query Embedding → Similarity Search → Retrieved Context → Language Model

This architecture is widely used in modern AI assistants and search systems.

Embedding Models

Retrieval systems usually represent text as dense vectors.

An encoder maps text into an embedding:

z = f_{\theta}(x),

where:

z \in \mathbb{R}^d.

Semantically similar texts should produce nearby embeddings.

For example:

| Query | Relevant document |
| --- | --- |
| “How does SGD work?” | Optimization explanation |
| “PyTorch tensor broadcasting” | Tensor shape article |
| “Transformer attention mechanism” | Attention paper |

The embeddings may be produced using:

  • transformer encoders
  • contrastive learning models
  • sentence embedding models
  • multimodal encoders
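As a concrete illustration, the sketch below builds embeddings with the hashing trick on bag-of-words counts, a crude stand-in for a learned encoder. The `toy_embed` and `dot` helpers are hypothetical names invented here; a real system would use a trained transformer encoder.

```python
def toy_embed(text, dim=64):
    """Hashing-trick bag-of-words vector: a crude stand-in for a learned encoder."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0  # map each token to a fixed dimension
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]     # unit length, so dot product = cosine

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

e_query = toy_embed("transformer attention mechanism")
e_doc = toy_embed("attention mechanism in transformers")
e_other = toy_embed("tensor broadcasting rules")
# Shared tokens give e_query and e_doc a positive similarity.
```

Unlike a learned encoder, this toy version only captures token overlap, not meaning, but it shows the interface: text in, fixed-dimensional unit vector out.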

Similarity Search

After embedding documents, retrieval becomes a nearest-neighbor search problem.

Suppose:

q \in \mathbb{R}^d

is a query embedding and

D = \{d_1, d_2, \ldots, d_n\}

is a set of document embeddings.

The system retrieves documents maximizing similarity:

d^* = \arg\max_{d_i} s(q, d_i).

A common similarity metric is cosine similarity:

s(q,d) = \frac{q^\top d}{\|q\| \, \|d\|}.

This measures angular similarity between vectors.
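The formula can be checked numerically. This is a minimal pure-Python sketch; the `cosine` helper is illustrative, not a library function:

```python
import math

def cosine(q, d):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d)))

parallel = cosine([1.0, 0.0], [2.0, 0.0])    # same direction -> 1.0
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])  # perpendicular -> 0.0
```

Note that scaling a vector does not change the score: only direction matters.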

Dense Retrieval

Dense retrieval uses continuous embeddings.

Each document chunk becomes a vector:

E \in \mathbb{R}^{N \times d}.

Query retrieval becomes vector search.

Advantages:

| Property | Benefit |
| --- | --- |
| Semantic matching | Handles paraphrases |
| Generalization | Learns conceptual similarity |
| Compact representation | Efficient storage |
| Differentiable training | End-to-end optimization |

Dense retrieval can retrieve semantically related text even when exact keywords differ.

Example:

| Query | Retrieved concept |
| --- | --- |
| “car engine” | “automobile motor” |
| “SGD instability” | “optimization divergence” |

Sparse Retrieval

Sparse retrieval uses symbolic features such as keywords or term frequencies.

Traditional systems include:

  • TF-IDF
  • BM25
  • inverted indexes

A sparse vector may contain one dimension per vocabulary term:

x \in \mathbb{R}^{|V|}.

Most dimensions are zero.

Sparse retrieval is highly effective for:

  • exact matching
  • rare terms
  • identifiers
  • code symbols
  • names

Dense and sparse retrieval are often combined in hybrid systems.
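To make the sparse representation concrete, here is a minimal TF-IDF sketch in pure Python on a toy corpus. The helper names and the add-one IDF smoothing are illustrative choices, not a fixed standard; production systems typically use BM25 with an inverted index.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight dicts) for a toy corpus."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))  # document frequency
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}  # smoothed IDF
    vecs = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vecs, idf

def sparse_score(query, doc_vec, idf):
    """Dot product between the query's TF-IDF vector and a document vector."""
    q_tf = Counter(query.lower().split())
    return sum(tf * idf.get(t, 0.0) * doc_vec.get(t, 0.0) for t, tf in q_tf.items())

docs = [
    "stochastic gradient descent optimizer",
    "transformer attention mechanism",
    "tensor broadcasting rules",
]
vecs, idf = tfidf_vectors(docs)
scores = [sparse_score("gradient descent", v, idf) for v in vecs]
best = max(range(len(docs)), key=lambda i: scores[i])  # index of the best match
```

Only the first document shares any terms with the query, so the other two score exactly zero, which is the defining behavior of sparse lexical matching.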

Hybrid Retrieval

Hybrid systems combine semantic and lexical retrieval.

The final score may be:

s(q,d) = \lambda \, s_{\text{dense}} + (1-\lambda) \, s_{\text{sparse}}.

This improves robustness because:

  • dense retrieval captures semantics
  • sparse retrieval captures exact matches

Hybrid retrieval is widely used in production systems.
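The weighted combination can be sketched directly. The scores below are made-up, already-normalized values, and λ = 0.7 is an arbitrary illustrative weight:

```python
def hybrid_scores(dense, sparse, lam=0.5):
    """Blend per-document scores: lam * dense + (1 - lam) * sparse.
    Assumes both score lists are already normalized to comparable ranges."""
    return [lam * d + (1 - lam) * s for d, s in zip(dense, sparse)]

dense = [0.9, 0.2, 0.4]   # semantic similarity per document
sparse = [0.1, 0.8, 0.3]  # lexical (e.g. BM25) score per document
combined = hybrid_scores(dense, sparse, lam=0.7)
# Document 0 wins on semantics, document 1 on keywords; the blend trades off.
```

In practice the two score distributions must be normalized to a common scale before blending, otherwise one retriever silently dominates.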

Document Chunking

Large documents are usually divided into smaller chunks before indexing.

Chunking matters because transformer context windows are finite.

A document such as a 100-page PDF may be split into:

  • paragraphs
  • sections
  • sliding windows
  • semantic blocks

Chunk size affects retrieval quality.

| Chunk size | Effect |
| --- | --- |
| Too small | Missing context |
| Too large | Reduced precision |

Overlap is often used:

Chunk 1: sentences 1–10
Chunk 2: sentences 8–18

This preserves continuity across boundaries.
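A sliding-window chunker along these lines might look as follows. `chunk_sentences` is a hypothetical helper, and the size and overlap values roughly mirror the toy example above:

```python
def chunk_sentences(sentences, size=10, overlap=3):
    """Split sentences into overlapping chunks; consecutive chunks
    share `overlap` sentences to preserve continuity across boundaries."""
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break  # the last chunk already reaches the end
    return chunks

sentences = [f"sentence {i}" for i in range(1, 19)]  # 18 toy sentences
chunks = chunk_sentences(sentences)
# The first chunk covers sentences 1-10; the second starts at sentence 8,
# repeating the last three sentences of the first.
```

Real chunkers often split on token counts or semantic boundaries rather than sentence counts, but the overlap logic is the same.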

Vector Databases

Embeddings are stored inside vector indexes or vector databases.

The index supports approximate nearest-neighbor search.

Given:

E \in \mathbb{R}^{N \times d},

the goal is efficient retrieval even when N is extremely large.

Common indexing structures include:

| Structure | Purpose |
| --- | --- |
| Flat index | Exact search |
| IVF | Clustered search |
| HNSW | Graph-based search |
| PQ | Quantized compression |

Approximate methods trade small accuracy loss for large speed improvements.

Approximate Nearest Neighbor Search

Exact search requires comparing the query against every document vector:

O(Nd).

For billions of embeddings, this is too expensive.

Approximate nearest-neighbor methods reduce computation.

The system searches only promising regions of the vector space.

A good ANN system preserves:

  • high recall
  • low latency
  • low memory usage

This becomes critical for large-scale AI systems.
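An IVF-style index can be sketched in a few lines of NumPy: cluster the vectors with a few k-means iterations, then score only the clusters closest to the query. This is a toy illustration on random unit vectors with invented helper names, not a production ANN implementation (libraries such as FAISS implement this far more carefully):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_ivf(vectors, n_clusters=4, n_iters=10):
    """Toy IVF index: k-means centroids plus an inverted list of member ids."""
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
    for _ in range(n_iters):
        assign = np.argmax(vectors @ centroids.T, axis=1)  # nearest centroid
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centroid = members.mean(axis=0)
                centroids[c] = centroid / np.linalg.norm(centroid)
    assign = np.argmax(vectors @ centroids.T, axis=1)
    lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, n_probe=1, k=3):
    """Score only the vectors in the n_probe closest clusters, not all N."""
    probes = np.argsort(query @ centroids.T)[::-1][:n_probe]
    candidates = np.concatenate([lists[c] for c in probes])
    scores = vectors[candidates] @ query
    order = np.argsort(scores)[::-1][:k]
    return candidates[order]

# Random unit vectors as a stand-in for document embeddings.
vectors = rng.normal(size=(200, 16))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
centroids, lists = build_ivf(vectors)
query = vectors[0]  # query identical to document 0
hits = ivf_search(query, vectors, centroids, lists, n_probe=1, k=3)
```

With one probed cluster, the search touches roughly N / n_clusters vectors instead of N; raising `n_probe` trades speed back for recall.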

Reranking

Initial retrieval may produce imperfect rankings.

A reranker improves quality using a more expensive model.

Pipeline:

Query → Fast retriever → Top-k candidates → Cross-encoder reranker → Final ranking

A reranker jointly processes query and document:

s(q,d) = f_{\theta}(q,d).

Cross-attention often improves ranking accuracy because the model directly compares tokens from both sequences.
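The two-stage shape can be sketched as follows. Note that a trivially cheap token-overlap scorer stands in here for a real cross-encoder, purely to show where the reranker slots into the pipeline; `rerank` and `overlap_score` are hypothetical helpers:

```python
def rerank(query, candidates, score_fn, k=3):
    """Second stage: rescore retrieved candidates with a more expensive
    scoring function and return the k best."""
    scored = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:k]

def overlap_score(query, doc):
    """Toy stand-in for a cross-encoder: count shared tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

candidates = [  # imagined top-k output of a fast first-stage retriever
    "attention is all you need",
    "batch normalization in deep networks",
    "transformer attention mechanism explained",
]
top = rerank("transformer attention", candidates, overlap_score, k=2)
```

Because the reranker only sees the top-k candidates, it can afford per-pair computation (such as cross-attention) that would be far too expensive to run over the whole corpus.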

Retrieval and Attention

Retrieval can be interpreted as external attention.

Transformer attention selects relevant tokens from internal context:

\text{softmax}\!\left( \frac{QK^\top}{\sqrt{d}} \right) V.

Retrieval selects relevant external memory entries.

Conceptually:

| Internal attention | External retrieval |
| --- | --- |
| Context window | Database |
| Token keys | Document embeddings |
| Attention weights | Similarity scores |
| Hidden states | Retrieved chunks |

Retrieval extends attention beyond the fixed context length.

Retrieval-Augmented Transformers

A retrieval-augmented transformer may operate as follows:

  1. Encode query.
  2. Retrieve documents.
  3. Insert retrieved context into prompt.
  4. Generate output.

Example prompt:

Question:
How does batch normalization work?

Retrieved context:
Batch normalization normalizes activations using batch statistics...

Answer:

The model can then generate grounded responses using retrieved information.
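Assembling such a prompt is straightforward. `build_prompt` is a hypothetical helper, and the template simply mirrors the example above:

```python
def build_prompt(question, chunks):
    """Assemble a RAG prompt: question, retrieved chunks, then an answer slot."""
    context = "\n\n".join(chunks)
    return (
        f"Question:\n{question}\n\n"
        f"Retrieved context:\n{context}\n\n"
        "Answer:"
    )

prompt = build_prompt(
    "How does batch normalization work?",
    ["Batch normalization normalizes activations using batch statistics..."],
)
```

In practice the template also needs a truncation policy, since the retrieved chunks must fit within the model's context window alongside the question.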

Memory Systems

Retrieval systems can support long-term memory.

Examples include:

| Memory type | Description |
| --- | --- |
| Episodic memory | Previous conversations |
| Semantic memory | Facts and documents |
| Working memory | Current context |
| Tool memory | API outputs |
| Agent memory | Plans and actions |

An AI assistant may retrieve:

  • previous chats
  • user preferences
  • uploaded files
  • search results
  • code repositories

during reasoning.

Multimodal Retrieval

Retrieval is no longer limited to text.

Modern systems retrieve:

| Query modality | Retrieved modality |
| --- | --- |
| Text | Images |
| Image | Text |
| Audio | Video |
| Video | Documents |

A multimodal embedding model maps different modalities into shared vector spaces.

For example:

z_x = f_{\theta}(x), \quad z_t = g_{\phi}(t).

Image and text embeddings become directly comparable.

This enables:

  • image search
  • video retrieval
  • caption search
  • multimodal recommendation

Retrieval for Agents

Agents use retrieval to support planning and tool use.

Examples:

| Agent capability | Retrieval target |
| --- | --- |
| Coding assistant | Source files |
| Research assistant | Web documents |
| Robot | Environment memory |
| Personal assistant | Calendar and notes |

Retrieval enables stateful behavior across long horizons.

Without retrieval, a model is constrained by finite context windows and static parameters.

PyTorch Example

A simple dense retrieval system:

import torch
import torch.nn.functional as F

def cosine_similarity(query, docs):
    # Normalize rows so the matrix product yields cosine similarities.
    query = F.normalize(query, dim=-1)
    docs = F.normalize(docs, dim=-1)
    return query @ docs.T

Retrieval:

# `encoder` is a placeholder for any text embedding model.
query_emb = encoder(query_text)
doc_embs = encoder(document_texts)

scores = cosine_similarity(query_emb, doc_embs)

# Indices of the five highest-scoring documents.
topk = torch.topk(scores, k=5)
indices = topk.indices

The indices identify the most similar documents.

Failure Modes

Retrieval systems have several failure modes.

| Failure | Description |
| --- | --- |
| Embedding collapse | Poor semantic separation |
| Retrieval drift | Wrong semantic neighborhood |
| Hallucinated grounding | Model ignores retrieved context |
| Context overload | Too many retrieved chunks |
| Outdated indexes | Stale knowledge |
| Adversarial retrieval | Malicious retrieved content |

A retrieval system is only as reliable as:

  • its embeddings
  • its index quality
  • its chunking strategy
  • its reranking pipeline

Retrieval at Scale

Large systems may store billions of embeddings.

Important engineering concerns include:

| Concern | Description |
| --- | --- |
| Compression | Reduce memory footprint |
| Sharding | Distributed indexes |
| Streaming updates | Dynamic insertion |
| Latency | Fast search |
| Recall | Retrieval accuracy |
| Filtering | Metadata constraints |

Production retrieval systems therefore combine:

  • vector search
  • distributed storage
  • metadata filtering
  • caching
  • ranking pipelines

Summary

Retrieval systems extend neural networks with external memory. The central ideas are embedding representations, similarity search, vector indexing, reranking, and retrieval-augmented generation.

Modern foundation models increasingly depend on retrieval because parametric memory alone is insufficient for scalable reasoning and factual grounding. Retrieval transforms a static neural network into a dynamic information-processing system capable of accessing large external knowledge stores during inference.