A retrieval system finds relevant information from an external memory source. Instead of storing all knowledge directly inside neural network parameters, the model searches a database, vector index, document collection, or memory store during inference.
Retrieval systems are fundamental to modern foundation models because parametric memory is limited. A model’s weights cannot reliably store all facts, documents, conversations, codebases, or world knowledge. Retrieval provides dynamic access to external information.
The central idea is simple: rather than recalling everything from its weights, the model looks up relevant information at inference time and conditions its output on what it finds.
A retrieval system therefore extends a model’s effective memory and grounding ability.
Parametric and Nonparametric Memory
Neural networks contain parametric memory inside their weights. During training, the model compresses statistical information into parameters.
For example, a language model may memorize:
- grammar
- common facts
- programming syntax
- semantic associations
However, weights are difficult to update. Retraining is expensive, and memorized information may become outdated.
Retrieval systems provide nonparametric memory. Knowledge exists outside the model and can be updated independently.
| Memory type | Storage location | Update method |
|---|---|---|
| Parametric memory | Model weights | Retraining |
| Nonparametric memory | External database | Index updates |
Modern systems increasingly combine both forms.
Retrieval-Augmented Generation
Retrieval-augmented generation combines retrieval with language modeling.
Instead of generating only from the prompt, the model first retrieves supporting information.
The conditional distribution becomes

$$p(y \mid x, z)$$

where:

| Symbol | Meaning |
|---|---|
| $x$ | Input query |
| $z$ | Retrieved context |
| $y$ | Generated output |
The retrieved context may contain:
- documents
- passages
- code snippets
- database entries
- conversation history
- API outputs
The model conditions generation on both the original query and retrieved information.
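In the simplest implementations, this conditioning is just prompt concatenation; a minimal sketch (the prompt template and function name are illustrative):

```python
def build_rag_prompt(query: str, retrieved: list[str]) -> str:
    """Concatenate retrieved chunks into the prompt so generation is conditioned on them."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved))
    return (
        "Retrieved context:\n"
        f"{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

prompt = build_rag_prompt(
    "How does batch normalization work?",
    ["Batch normalization normalizes activations using batch statistics."],
)
print(prompt)
```

The language model then completes the text after `Answer:`, grounded in the inserted context.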
Basic Retrieval Pipeline
A retrieval system usually contains several stages.
| Stage | Purpose |
|---|---|
| Document ingestion | Store knowledge |
| Chunking | Split documents into units |
| Embedding | Convert chunks into vectors |
| Indexing | Build searchable structure |
| Query encoding | Convert query into vector |
| Similarity search | Find nearest neighbors |
| Reranking | Improve result ordering |
| Generation | Produce final response |
The pipeline is:

```
Documents
   ↓
Chunking
   ↓
Embeddings
   ↓
Vector Index
   ↓
Query Embedding
   ↓
Similarity Search
   ↓
Retrieved Context
   ↓
Language Model
```

This architecture is widely used in modern AI assistants and search systems.
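The stages above can be sketched end-to-end with a toy bag-of-words encoder standing in for a learned embedding model (the encoder and scoring here are illustrative, not a production design):

```python
import math

def build_vocab(texts):
    """Indexing stage: assign each vocabulary term a dimension."""
    vocab = {}
    for t in texts:
        for tok in t.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def embed(text, vocab):
    # Toy bag-of-words encoder, L2-normalized; a real system uses a learned model.
    v = [0.0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def search(query, docs, vocab, k=2):
    # Similarity search: rank documents by dot product with the query embedding.
    q = embed(query, vocab)
    scored = sorted(docs, key=lambda d: -sum(a * b for a, b in zip(q, embed(d, vocab))))
    return scored[:k]

docs = [
    "stochastic gradient descent updates weights",
    "attention compares queries and keys",
    "gradient descent minimizes a loss",
]
vocab = build_vocab(docs)
results = search("gradient descent", docs, vocab)
```

The top-k results would then be passed to the language model as retrieved context.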
Embedding Models
Retrieval systems usually represent text as dense vectors.
An encoder $f_\theta$ maps text $t$ into an embedding:

$$e = f_\theta(t), \quad e \in \mathbb{R}^d$$

where $d$ is the embedding dimension.
Semantically similar texts should produce nearby embeddings.
For example:
| Query | Relevant document |
|---|---|
| “How does SGD work?” | Optimization explanation |
| “PyTorch tensor broadcasting” | Tensor shape article |
| “Transformer attention mechanism” | Attention paper |
The embeddings may be produced using:
- transformer encoders
- contrastive learning models
- sentence embedding models
- multimodal encoders
Similarity Search
After embedding documents, retrieval becomes a nearest-neighbor search problem.
Suppose $q \in \mathbb{R}^d$ is a query embedding and $\{d_1, \dots, d_N\} \subset \mathbb{R}^d$ is a set of document embeddings.

The system retrieves the documents maximizing similarity:

$$i^* = \arg\max_i \; \mathrm{sim}(q, d_i)$$

A common similarity metric is cosine similarity:

$$\mathrm{sim}(q, d_i) = \frac{q \cdot d_i}{\lVert q \rVert \, \lVert d_i \rVert}$$

This measures angular similarity between vectors.
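A quick numeric check of the cosine metric in plain Python, showing that it depends only on direction, not magnitude:

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

print(cosine([1.0, 0.0], [2.0, 0.0]))  # same direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

Because of this scale invariance, embeddings are often L2-normalized before indexing so that a plain dot product gives the cosine score.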
Dense Retrieval
Dense retrieval uses continuous embeddings.
Each document chunk $c_i$ becomes a vector:

$$d_i = f_\theta(c_i) \in \mathbb{R}^d$$
Query retrieval becomes vector search.
Advantages:
| Property | Benefit |
|---|---|
| Semantic matching | Handles paraphrases |
| Generalization | Learns conceptual similarity |
| Compact representation | Efficient storage |
| Differentiable training | End-to-end optimization |
Dense retrieval can retrieve semantically related text even when exact keywords differ.
Example:
| Query | Retrieved concept |
|---|---|
| “car engine” | “automobile motor” |
| “SGD instability” | “optimization divergence” |
Sparse Retrieval
Sparse retrieval uses symbolic features such as keywords or term frequencies.
Traditional systems include:
- TF-IDF
- BM25
- inverted indexes
A sparse vector may contain one dimension per vocabulary term:

$$v \in \mathbb{R}^{|V|}, \quad v_t = \text{weight of term } t$$

Most dimensions are zero.
Sparse retrieval is highly effective for:
- exact matching
- rare terms
- identifiers
- code symbols
- names
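A minimal TF-IDF sketch illustrates the sparse representation (the weighting here is the textbook tf × log(N/df) variant; BM25 refines this):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One dimension per vocabulary term; most entries stay zero."""
    vocab = sorted({t for d in docs for t in d.lower().split()})
    n = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(t for d in docs for t in set(d.lower().split()))
    idf = {t: math.log(n / df[t]) for t in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d.lower().split())
        vecs.append([tf[t] * idf[t] for t in vocab])
    return vocab, vecs

vocab, vecs = tfidf_vectors([
    "error code e403 in parser",
    "parser handles syntax errors",
    "training loss diverges",
])
```

Rare terms such as the identifier `e403` receive high idf weight, which is why sparse retrieval excels at exact matching.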
Dense and sparse retrieval are often combined in hybrid systems.
Hybrid Retrieval
Hybrid systems combine semantic and lexical retrieval.
The final score may be:

$$\mathrm{score}(q, d) = \alpha \, s_{\text{dense}}(q, d) + (1 - \alpha) \, s_{\text{sparse}}(q, d)$$

where $\alpha \in [0, 1]$ balances the two signals.
This improves robustness because:
- dense retrieval captures semantics
- sparse retrieval captures exact matches
Hybrid retrieval is widely used in production systems.
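A minimal combination function under this weighting scheme (the score values are illustrative; real systems must also normalize the two score scales before mixing them):

```python
def hybrid_score(dense_score, sparse_score, alpha=0.5):
    # Convex combination of semantic and lexical evidence.
    return alpha * dense_score + (1 - alpha) * sparse_score

# A document with a strong exact keyword match can outrank a purely
# semantic match when alpha gives sparse evidence enough weight.
doc_a = hybrid_score(dense_score=0.9, sparse_score=0.1, alpha=0.3)
doc_b = hybrid_score(dense_score=0.4, sparse_score=0.9, alpha=0.3)
```

Tuning `alpha` per domain (e.g., higher sparse weight for code search, where identifiers matter) is a common practice.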
Document Chunking
Large documents are usually divided into smaller chunks before indexing.
Chunking matters because transformer context windows are finite.
A document such as a 100-page PDF may be split into:
- paragraphs
- sections
- sliding windows
- semantic blocks
Chunk size affects retrieval quality.
| Chunk size | Effect |
|---|---|
| Too small | Missing context |
| Too large | Reduced precision |
Overlap is often used:
```
Chunk 1: sentences 1–10
Chunk 2: sentences 8–18
```

This preserves continuity across boundaries.
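A sliding-window chunker with overlap can be sketched as follows (chunk size and overlap are illustrative parameters):

```python
def chunk_sentences(sentences, size=10, overlap=2):
    """Sliding-window chunking: consecutive chunks share `overlap` sentences."""
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break  # last window already reaches the end
    return chunks

sents = [f"s{i}" for i in range(1, 19)]  # sentences 1..18
chunks = chunk_sentences(sents, size=10, overlap=2)
```

With these parameters the second chunk repeats the last two sentences of the first, so a fact spanning the boundary is still retrievable from at least one chunk.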
Vector Databases
Embeddings are stored inside vector indexes or vector databases.
The index supports approximate nearest-neighbor search.
Given $N$ stored embeddings

$$\{d_1, \dots, d_N\} \subset \mathbb{R}^d$$

the goal is efficient retrieval even when $N$ is extremely large.
Common indexing structures include:
| Structure | Purpose |
|---|---|
| Flat index | Exact search |
| IVF | Clustered search |
| HNSW | Graph-based search |
| PQ | Quantized compression |
Approximate methods trade small accuracy loss for large speed improvements.
Approximate Nearest Neighbor Search
Exact search requires comparing the query against every document vector, at cost $O(Nd)$ per query for $N$ vectors of dimension $d$.
For billions of embeddings, this is too expensive.
Approximate nearest-neighbor methods reduce computation.
The system searches only promising regions of the vector space.
A good ANN system preserves:
- high recall
- low latency
- low memory usage
This becomes critical for large-scale AI systems.
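One way to search only promising regions is an IVF-style index: cluster the vectors, then probe only the nearest clusters at query time. A toy sketch (the k-means step and all parameters are simplified for illustration):

```python
import math

def build_ivf(vectors, n_clusters=2, iters=10):
    """Toy IVF index: k-means centroids plus a posting list of vector ids per cluster."""
    centroids = [list(v) for v in vectors[:n_clusters]]  # naive initialization
    lists = [[] for _ in centroids]
    for _ in range(iters):
        lists = [[] for _ in centroids]
        for i, v in enumerate(vectors):
            nearest = min(range(n_clusters), key=lambda c: math.dist(v, centroids[c]))
            lists[nearest].append(i)
        for c, members in enumerate(lists):
            if members:  # recompute centroid as the mean of its members
                dim = len(vectors[0])
                centroids[c] = [sum(vectors[i][d] for i in members) / len(members)
                                for d in range(dim)]
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, nprobe=1):
    """Probe only the nprobe nearest clusters instead of scanning every vector."""
    probed = sorted(range(len(centroids)),
                    key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probed for i in lists[c]]
    return min(candidates, key=lambda i: math.dist(query, vectors[i]))

points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
centroids, lists = build_ivf(points)
best = ivf_search([5.1, 5.05], points, centroids, lists)
```

With `nprobe=1`, only half the points are compared against the query; increasing `nprobe` trades latency for recall, which is the core ANN tradeoff.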
Reranking
Initial retrieval may produce imperfect rankings.
A reranker improves quality using a more expensive model.
Pipeline:

```
Query
   ↓
Fast retriever
   ↓
Top-k candidates
   ↓
Cross-encoder reranker
   ↓
Final ranking
```

A reranker jointly processes query and document:

$$s = g_\phi(q, d)$$

where $g_\phi$ is a cross-encoder scoring the pair together.
Cross-attention often improves ranking accuracy because the model directly compares tokens from both sequences.
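The two-stage pattern can be sketched with stand-in scorers (`slow_score` here is a toy proxy for a cross-encoder, not a real model):

```python
def retrieve_then_rerank(query, docs, fast_score, slow_score, k=2):
    # Stage 1: cheap score over all documents, keep top-k.
    candidates = sorted(docs, key=lambda d: -fast_score(query, d))[:k]
    # Stage 2: expensive score over the small candidate set only.
    return sorted(candidates, key=lambda d: -slow_score(query, d))

def fast_score(q, d):
    # Cheap lexical overlap (stand-in for a fast retriever).
    return len(set(q.split()) & set(d.split()))

def slow_score(q, d):
    # Toy stand-in for a cross-encoder: overlap normalized by document length.
    return fast_score(q, d) / len(d.split())

docs = [
    "batch norm in deep networks with many extra words",
    "batch norm explained",
    "unrelated text",
]
ranked = retrieve_then_rerank("batch norm", docs, fast_score, slow_score)
```

The expensive scorer runs on only `k` pairs instead of the full corpus, which is why this pattern scales.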
Retrieval and Attention
Retrieval can be interpreted as external attention.
Transformer attention selects relevant tokens from internal context:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
Retrieval selects relevant external memory entries.
Conceptually:
| Internal attention | External retrieval |
|---|---|
| Context window | Database |
| Token keys | Document embeddings |
| Attention weights | Similarity scores |
| Hidden states | Retrieved chunks |
Retrieval extends attention beyond the fixed context length.
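The analogy can be made concrete: similarity scores play the role of attention logits over external memory, and a softmax turns them into weights over retrieved chunks (this soft weighting is illustrative; most RAG systems simply concatenate the top-k chunks instead):

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Similarity scores over three retrieved chunks -> attention-like weights.
weights = softmax([2.0, 0.5, -1.0])
```

As with attention weights, the values are non-negative and sum to one, concentrating mass on the most similar memory entries.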
Retrieval-Augmented Transformers
A retrieval-augmented transformer may operate as follows:
- Encode query.
- Retrieve documents.
- Insert retrieved context into prompt.
- Generate output.
Example prompt:

```
Question:
How does batch normalization work?

Retrieved context:
Batch normalization normalizes activations using batch statistics...

Answer:
```

The model can then generate grounded responses using retrieved information.
Memory Systems
Retrieval systems can support long-term memory.
Examples include:
| Memory type | Description |
|---|---|
| Episodic memory | Previous conversations |
| Semantic memory | Facts and documents |
| Working memory | Current context |
| Tool memory | API outputs |
| Agent memory | Plans and actions |
An AI assistant may retrieve:
- previous chats
- user preferences
- uploaded files
- search results
- code repositories
during reasoning.
Multimodal Retrieval
Retrieval is no longer limited to text.
Modern systems retrieve:
| Query modality | Retrieved modality |
|---|---|
| Text | Images |
| Image | Text |
| Audio | Video |
| Video | Documents |
A multimodal embedding model maps different modalities into shared vector spaces.
For example, an image encoder $f_{\text{img}}$ and a text encoder $f_{\text{text}}$ may map into the same space $\mathbb{R}^d$, so image and text embeddings become directly comparable.
This enables:
- image search
- video retrieval
- caption search
- multimodal recommendation
Retrieval for Agents
Agents use retrieval to support planning and tool use.
Examples:
| Agent capability | Retrieval target |
|---|---|
| Coding assistant | Source files |
| Research assistant | Web documents |
| Robot | Environment memory |
| Personal assistant | Calendar and notes |
Retrieval enables stateful behavior across long horizons.
Without retrieval, a model is constrained by finite context windows and static parameters.
PyTorch Example
A simple dense retrieval system:

```python
import torch
import torch.nn.functional as F

def cosine_similarity(query, docs):
    # Normalize both sides so the dot product equals cosine similarity.
    query = F.normalize(query, dim=-1)   # (d,)
    docs = F.normalize(docs, dim=-1)     # (N, d)
    return query @ docs.T                # (N,) similarity scores
```

Retrieval (`encoder` stands for any embedding model that returns tensors):

```python
query_emb = encoder(query_text)        # (d,) query embedding
doc_embs = encoder(document_texts)     # (N, d) document embeddings
scores = cosine_similarity(query_emb, doc_embs)
topk = torch.topk(scores, k=5)
indices = topk.indices
```

The indices identify the most similar documents.
Failure Modes
Retrieval systems have several failure modes.
| Failure | Description |
|---|---|
| Embedding collapse | Poor semantic separation |
| Retrieval drift | Wrong semantic neighborhood |
| Hallucinated grounding | Model ignores retrieved context |
| Context overload | Too many retrieved chunks |
| Outdated indexes | Stale knowledge |
| Adversarial retrieval | Malicious retrieved content |
A retrieval system is only as reliable as:
- its embeddings
- its index quality
- its chunking strategy
- its reranking pipeline
Retrieval at Scale
Large systems may store billions of embeddings.
Important engineering concerns include:
| Concern | Description |
|---|---|
| Compression | Reduce memory footprint |
| Sharding | Distributed indexes |
| Streaming updates | Dynamic insertion |
| Latency | Fast search |
| Recall | Retrieval accuracy |
| Filtering | Metadata constraints |
Production retrieval systems therefore combine:
- vector search
- distributed storage
- metadata filtering
- caching
- ranking pipelines
Summary
Retrieval systems extend neural networks with external memory. The central ideas are embedding representations, similarity search, vector indexing, reranking, and retrieval-augmented generation.
Modern foundation models increasingly depend on retrieval because parametric memory alone is insufficient for scalable reasoning and factual grounding. Retrieval transforms a static neural network into a dynamic information-processing system capable of accessing large external knowledge stores during inference.