
Retrieval Systems

A retrieval system finds relevant information from an external memory source. Instead of storing all knowledge directly inside neural network parameters, the model searches a database, vector index, document collection, or memory store during inference.

Retrieval systems are fundamental to modern foundation models because parametric memory is limited. A model’s weights cannot reliably store all facts, documents, conversations, codebases, or world knowledge. Retrieval provides dynamic access to external information.

The central idea is simple:

\text{query} \longrightarrow \text{retrieve relevant context} \longrightarrow \text{generate or reason}.

A retrieval system therefore extends a model’s effective memory and grounding ability.

Parametric and Nonparametric Memory

Neural networks contain parametric memory inside their weights. During training, the model compresses statistical information into parameters.

For example, a language model may memorize:

  • grammar
  • common facts
  • programming syntax
  • semantic associations

However, weights are difficult to update. Retraining is expensive, and memorized information may become outdated.

Retrieval systems provide nonparametric memory. Knowledge exists outside the model and can be updated independently.

| Memory type | Storage location | Update method |
| --- | --- | --- |
| Parametric memory | Model weights | Retraining |
| Nonparametric memory | External database | Index updates |

Modern systems increasingly combine both forms.

Retrieval-Augmented Generation

Retrieval-augmented generation combines retrieval with language modeling.

Instead of generating only from the prompt, the model first retrieves supporting information.

The conditional distribution becomes

p(y \mid x, r),

where:

| Symbol | Meaning |
| --- | --- |
| x | Input query |
| r | Retrieved context |
| y | Generated output |

The retrieved context may contain:

  • documents
  • passages
  • code snippets
  • database entries
  • conversation history
  • API outputs

The model conditions generation on both the original query and retrieved information.

Basic Retrieval Pipeline

A retrieval system usually contains several stages.

| Stage | Purpose |
| --- | --- |
| Document ingestion | Store knowledge |
| Chunking | Split documents into units |
| Embedding | Convert chunks into vectors |
| Indexing | Build searchable structure |
| Query encoding | Convert query into vector |
| Similarity search | Find nearest neighbors |
| Reranking | Improve result ordering |
| Generation | Produce final response |

The pipeline is:

Documents → Chunking → Embeddings → Vector Index
Query → Query Embedding → Similarity Search → Retrieved Context → Language Model

This architecture is widely used in modern AI assistants and search systems.

Embedding Models

Retrieval systems usually represent text as dense vectors.

An encoder maps text into an embedding:

z = f_{\theta}(x),

where:

z \in \mathbb{R}^d.

Semantically similar texts should produce nearby embeddings.

For example:

| Query | Relevant document |
| --- | --- |
| “How does SGD work?” | Optimization explanation |
| “PyTorch tensor broadcasting” | Tensor shape article |
| “Transformer attention mechanism” | Attention paper |

The embeddings may be produced using:

  • transformer encoders
  • contrastive learning models
  • sentence embedding models
  • multimodal encoders
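As a concrete illustration, the sketch below builds embeddings with the hashing trick on bag-of-words counts, a crude stand-in for a learned encoder. The `toy_embed` and `dot` helpers are hypothetical names invented here; a real system would use a trained transformer encoder.

```python
def toy_embed(text, dim=64):
    """Hashing-trick bag-of-words vector: a crude stand-in for a learned encoder."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0  # map each token to a fixed dimension
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]     # unit length, so dot product = cosine

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

e_query = toy_embed("transformer attention mechanism")
e_doc = toy_embed("attention mechanism in transformers")
e_other = toy_embed("tensor broadcasting rules")
# Shared tokens give e_query and e_doc a positive similarity.
```

Unlike a learned encoder, this toy version only captures token overlap, not meaning, but it shows the interface: text in, fixed-dimensional unit vector out.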

Similarity Search

After embedding documents, retrieval becomes a nearest-neighbor search problem.

Suppose:

q \in \mathbb{R}^d

is a query embedding and

D = \{d_1, d_2, \ldots, d_n\}

is a set of document embeddings.

The system retrieves documents maximizing similarity:

d^* = \arg\max_{d_i} s(q, d_i).

A common similarity metric is cosine similarity:

s(q,d) = \frac{q^\top d}{\|q\| \, \|d\|}.

This measures angular similarity between vectors.
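The formula can be checked numerically. This is a minimal pure-Python sketch; the `cosine` helper is illustrative, not a library function:

```python
import math

def cosine(q, d):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d)))

parallel = cosine([1.0, 0.0], [2.0, 0.0])    # same direction -> 1.0
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])  # perpendicular -> 0.0
```

Note that scaling a vector does not change the score: only direction matters.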

Dense Retrieval

Dense retrieval uses continuous embeddings.

Each document chunk becomes a vector:

E \in \mathbb{R}^{N \times d}.

Query retrieval becomes vector search.

Advantages:

| Property | Benefit |
| --- | --- |
| Semantic matching | Handles paraphrases |
| Generalization | Learns conceptual similarity |
| Compact representation | Efficient storage |
| Differentiable training | End-to-end optimization |

Dense retrieval can retrieve semantically related text even when exact keywords differ.

Example:

| Query | Retrieved concept |
| --- | --- |
| “car engine” | “automobile motor” |
| “SGD instability” | “optimization divergence” |

Sparse Retrieval

Sparse retrieval uses symbolic features such as keywords or term frequencies.

Traditional systems include:

  • TF-IDF
  • BM25
  • inverted indexes

A sparse vector may contain one dimension per vocabulary term:

x \in \mathbb{R}^{|V|}.

Most dimensions are zero.

Sparse retrieval is highly effective for:

  • exact matching
  • rare terms
  • identifiers
  • code symbols
  • names

Dense and sparse retrieval are often combined in hybrid systems.
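To make the sparse representation concrete, here is a minimal TF-IDF sketch in pure Python on a toy corpus. The helper names and the add-one IDF smoothing are illustrative choices, not a fixed standard; production systems typically use BM25 with an inverted index.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight dicts) for a toy corpus."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))  # document frequency
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}  # smoothed IDF
    vecs = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vecs, idf

def sparse_score(query, doc_vec, idf):
    """Dot product between the query's TF-IDF vector and a document vector."""
    q_tf = Counter(query.lower().split())
    return sum(tf * idf.get(t, 0.0) * doc_vec.get(t, 0.0) for t, tf in q_tf.items())

docs = [
    "stochastic gradient descent optimizer",
    "transformer attention mechanism",
    "tensor broadcasting rules",
]
vecs, idf = tfidf_vectors(docs)
scores = [sparse_score("gradient descent", v, idf) for v in vecs]
best = max(range(len(docs)), key=lambda i: scores[i])  # index of the best match
```

Only the first document shares any terms with the query, so the other two score exactly zero, which is the defining behavior of sparse lexical matching.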

Hybrid Retrieval

Hybrid systems combine semantic and lexical retrieval.

The final score may be:

s(q,d) = \lambda \, s_{\text{dense}} + (1-\lambda) \, s_{\text{sparse}}.

This improves robustness because:

  • dense retrieval captures semantics
  • sparse retrieval captures exact matches

Hybrid retrieval is widely used in production systems.
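The weighted combination can be sketched directly. The scores below are made-up, already-normalized values, and λ = 0.7 is an arbitrary illustrative weight:

```python
def hybrid_scores(dense, sparse, lam=0.5):
    """Blend per-document scores: lam * dense + (1 - lam) * sparse.
    Assumes both score lists are already normalized to comparable ranges."""
    return [lam * d + (1 - lam) * s for d, s in zip(dense, sparse)]

dense = [0.9, 0.2, 0.4]   # semantic similarity per document
sparse = [0.1, 0.8, 0.3]  # lexical (e.g. BM25) score per document
combined = hybrid_scores(dense, sparse, lam=0.7)
# Document 0 wins on semantics, document 1 on keywords; the blend trades off.
```

In practice the two score distributions must be normalized to a common scale before blending, otherwise one retriever silently dominates.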

Document Chunking

Large documents are usually divided into smaller chunks before indexing.

Chunking matters because transformer context windows are finite.

A document such as a 100-page PDF may be split into:

  • paragraphs
  • sections
  • sliding windows
  • semantic blocks

Chunk size affects retrieval quality.

| Chunk size | Effect |
| --- | --- |
| Too small | Missing context |
| Too large | Reduced precision |

Overlap is often used:

Chunk 1: sentences 1–10
Chunk 2: sentences 8–18

This preserves continuity across boundaries.
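A sliding-window chunker along these lines might look as follows. `chunk_sentences` is a hypothetical helper, and the size and overlap values roughly mirror the toy example above:

```python
def chunk_sentences(sentences, size=10, overlap=3):
    """Split sentences into overlapping chunks; consecutive chunks
    share `overlap` sentences to preserve continuity across boundaries."""
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break  # the last chunk already reaches the end
    return chunks

sentences = [f"sentence {i}" for i in range(1, 19)]  # 18 toy sentences
chunks = chunk_sentences(sentences)
# The first chunk covers sentences 1-10; the second starts at sentence 8,
# repeating the last three sentences of the first.
```

Real chunkers often split on token counts or semantic boundaries rather than sentence counts, but the overlap logic is the same.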

Vector Databases

Embeddings are stored inside vector indexes or vector databases.

The index supports approximate nearest-neighbor search.

Given:

E \in \mathbb{R}^{N \times d},

the goal is efficient retrieval even when N is extremely large.

Common indexing structures include:

| Structure | Purpose |
| --- | --- |
| Flat index | Exact search |
| IVF | Clustered search |
| HNSW | Graph-based search |
| PQ | Quantized compression |

Approximate methods trade small accuracy loss for large speed improvements.

Approximate Nearest Neighbor Search

Exact search requires comparing the query against every document vector:

O(Nd).

For billions of embeddings, this is too expensive.

Approximate nearest-neighbor methods reduce computation.

The system searches only promising regions of the vector space.

A good ANN system preserves:

  • high recall
  • low latency
  • low memory usage

This becomes critical for large-scale AI systems.
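An IVF-style index can be sketched in a few lines of NumPy: cluster the vectors with a few k-means iterations, then score only the clusters closest to the query. This is a toy illustration on random unit vectors with invented helper names, not a production ANN implementation (libraries such as FAISS implement this far more carefully):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_ivf(vectors, n_clusters=4, n_iters=10):
    """Toy IVF index: k-means centroids plus an inverted list of member ids."""
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
    for _ in range(n_iters):
        assign = np.argmax(vectors @ centroids.T, axis=1)  # nearest centroid
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centroid = members.mean(axis=0)
                centroids[c] = centroid / np.linalg.norm(centroid)
    assign = np.argmax(vectors @ centroids.T, axis=1)
    lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, n_probe=1, k=3):
    """Score only the vectors in the n_probe closest clusters, not all N."""
    probes = np.argsort(query @ centroids.T)[::-1][:n_probe]
    candidates = np.concatenate([lists[c] for c in probes])
    scores = vectors[candidates] @ query
    order = np.argsort(scores)[::-1][:k]
    return candidates[order]

# Random unit vectors as a stand-in for document embeddings.
vectors = rng.normal(size=(200, 16))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
centroids, lists = build_ivf(vectors)
query = vectors[0]  # query identical to document 0
hits = ivf_search(query, vectors, centroids, lists, n_probe=1, k=3)
```

With one probed cluster, the search touches roughly N / n_clusters vectors instead of N; raising `n_probe` trades speed back for recall.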

Reranking

Initial retrieval may produce imperfect rankings.

A reranker improves quality using a more expensive model.

Pipeline:

Query → Fast retriever → Top-k candidates → Cross-encoder reranker → Final ranking

A reranker jointly processes query and document:

s(q,d) = f_{\theta}(q,d).

Cross-attention often improves ranking accuracy because the model directly compares tokens from both sequences.
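The two-stage shape can be sketched as follows. Note that a trivially cheap token-overlap scorer stands in here for a real cross-encoder, purely to show where the reranker slots into the pipeline; `rerank` and `overlap_score` are hypothetical helpers:

```python
def rerank(query, candidates, score_fn, k=3):
    """Second stage: rescore retrieved candidates with a more expensive
    scoring function and return the k best."""
    scored = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:k]

def overlap_score(query, doc):
    """Toy stand-in for a cross-encoder: count shared tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

candidates = [  # imagined top-k output of a fast first-stage retriever
    "attention is all you need",
    "batch normalization in deep networks",
    "transformer attention mechanism explained",
]
top = rerank("transformer attention", candidates, overlap_score, k=2)
```

Because the reranker only sees the top-k candidates, it can afford per-pair computation (such as cross-attention) that would be far too expensive to run over the whole corpus.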

Retrieval and Attention

Retrieval can be interpreted as external attention.

Transformer attention selects relevant tokens from internal context:

\text{softmax}\!\left( \frac{QK^\top}{\sqrt{d}} \right) V.

Retrieval selects relevant external memory entries.

Conceptually:

| Internal attention | External retrieval |
| --- | --- |
| Context window | Database |
| Token keys | Document embeddings |
| Attention weights | Similarity scores |
| Hidden states | Retrieved chunks |

Retrieval extends attention beyond the fixed context length.

Retrieval-Augmented Transformers

A retrieval-augmented transformer may operate as follows:

  1. Encode query.
  2. Retrieve documents.
  3. Insert retrieved context into prompt.
  4. Generate output.

Example prompt:

Question:
How does batch normalization work?

Retrieved context:
Batch normalization normalizes activations using batch statistics...

Answer:

The model can then generate grounded responses using retrieved information.
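Assembling such a prompt is straightforward. `build_prompt` is a hypothetical helper, and the template simply mirrors the example above:

```python
def build_prompt(question, chunks):
    """Assemble a RAG prompt: question, retrieved chunks, then an answer slot."""
    context = "\n\n".join(chunks)
    return (
        f"Question:\n{question}\n\n"
        f"Retrieved context:\n{context}\n\n"
        "Answer:"
    )

prompt = build_prompt(
    "How does batch normalization work?",
    ["Batch normalization normalizes activations using batch statistics..."],
)
```

In practice the template also needs a truncation policy, since the retrieved chunks must fit within the model's context window alongside the question.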

Memory Systems

Retrieval systems can support long-term memory.

Examples include:

| Memory type | Description |
| --- | --- |
| Episodic memory | Previous conversations |
| Semantic memory | Facts and documents |
| Working memory | Current context |
| Tool memory | API outputs |
| Agent memory | Plans and actions |

An AI assistant may retrieve:

  • previous chats
  • user preferences
  • uploaded files
  • search results
  • code repositories

during reasoning.

Multimodal Retrieval

Retrieval is no longer limited to text.

Modern systems retrieve:

| Query modality | Retrieved modality |
| --- | --- |
| Text | Images |
| Image | Text |
| Audio | Video |
| Video | Documents |

A multimodal embedding model maps different modalities into shared vector spaces.

For example:

z_x = f_{\theta}(x), \quad z_t = g_{\phi}(t).

Image and text embeddings become directly comparable.

This enables:

  • image search
  • video retrieval
  • caption search
  • multimodal recommendation

Retrieval for Agents

Agents use retrieval to support planning and tool use.

Examples:

| Agent capability | Retrieval target |
| --- | --- |
| Coding assistant | Source files |
| Research assistant | Web documents |
| Robot | Environment memory |
| Personal assistant | Calendar and notes |

Retrieval enables stateful behavior across long horizons.

Without retrieval, a model is constrained by finite context windows and static parameters.

PyTorch Example

A simple dense retrieval system:

import torch
import torch.nn.functional as F

def cosine_similarity(query, docs):
    # Normalize rows so the matrix product yields cosine similarities.
    query = F.normalize(query, dim=-1)
    docs = F.normalize(docs, dim=-1)
    return query @ docs.T

Retrieval:

# `encoder` is a placeholder for any text embedding model.
query_emb = encoder(query_text)
doc_embs = encoder(document_texts)

scores = cosine_similarity(query_emb, doc_embs)

# Indices of the five highest-scoring documents.
topk = torch.topk(scores, k=5)
indices = topk.indices

The indices identify the most similar documents.

Failure Modes

Retrieval systems have several failure modes.

| Failure | Description |
| --- | --- |
| Embedding collapse | Poor semantic separation |
| Retrieval drift | Wrong semantic neighborhood |
| Hallucinated grounding | Model ignores retrieved context |
| Context overload | Too many retrieved chunks |
| Outdated indexes | Stale knowledge |
| Adversarial retrieval | Malicious retrieved content |

A retrieval system is only as reliable as:

  • its embeddings
  • its index quality
  • its chunking strategy
  • its reranking pipeline

Retrieval at Scale

Large systems may store billions of embeddings.

Important engineering concerns include:

| Concern | Description |
| --- | --- |
| Compression | Reduce memory footprint |
| Sharding | Distributed indexes |
| Streaming updates | Dynamic insertion |
| Latency | Fast search |
| Recall | Retrieval accuracy |
| Filtering | Metadata constraints |

Production retrieval systems therefore combine:

  • vector search
  • distributed storage
  • metadata filtering
  • caching
  • ranking pipelines

Summary

Retrieval systems extend neural networks with external memory. The central ideas are embedding representations, similarity search, vector indexing, reranking, and retrieval-augmented generation.

Modern foundation models increasingly depend on retrieval because parametric memory alone is insufficient for scalable reasoning and factual grounding. Retrieval transforms a static neural network into a dynamic information-processing system capable of accessing large external knowledge stores during inference.