
Retrieval-Augmented Generation

Retrieval-augmented generation, usually abbreviated RAG, combines a language model with an external information retrieval system. Instead of relying only on knowledge stored in model parameters, the system retrieves relevant documents at inference time and places them into the model’s context.

The core pattern is:

user question
  -> retrieve relevant evidence
  -> condition the model on evidence
  -> generate answer

RAG is useful because pretrained models have limited, static, and imperfect memory. Retrieval gives the model access to current information, private documents, large corpora, and evidence that can be cited or inspected.

Why Retrieval Is Needed

A language model trained on a fixed corpus has several limits.

Limit                  Example
Stale knowledge        Recent software releases or laws
Missing private data   Company documents, emails, tickets
Hallucination          Fluent but unsupported claims
Context limit          Large documents cannot fit fully
Poor attribution       Hard to know where an answer came from

Retrieval reduces these problems by separating knowledge storage from generation.

The retrieval system stores documents. The language model reads selected passages and synthesizes an answer. This makes the system more updateable and auditable.

The RAG Pipeline

A typical RAG system has two phases: indexing and inference.

During indexing:

  1. Collect documents.
  2. Split documents into chunks.
  3. Embed or index chunks.
  4. Store chunks and metadata.
  5. Build search structures.

During inference:

  1. Receive a user query.
  2. Retrieve candidate chunks.
  3. Optionally rerank candidates.
  4. Insert selected evidence into the prompt.
  5. Generate an answer.
  6. Optionally cite sources.

The inference path can be summarized as:

query -> retriever -> chunks -> reranker -> prompt -> generator -> answer

Each stage affects final answer quality.
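
As a rough illustration, the whole inference path can be written as a short function. The retrieve, rerank, build_prompt, and generate callables below are hypothetical placeholders for whatever retriever, reranker, prompt builder, and language model a given system uses.

def answer_question(query, retrieve, rerank, build_prompt, generate, k=5):
    # Retrieve a broad candidate set of chunks for the query.
    candidates = retrieve(query)

    # Narrow the candidates to the k chunks most relevant to the query.
    evidence = rerank(query, candidates)[:k]

    # Condition the model on the selected evidence and generate an answer.
    prompt = build_prompt(query, evidence)
    return generate(prompt), evidence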

Documents, Chunks, and Metadata

Most documents are too long to retrieve as a single unit. They are split into chunks.

A chunk should be large enough to contain useful context but small enough to retrieve precisely.

Chunk size         Tradeoff
Too small          Missing context
Too large          Irrelevant text enters prompt
Overlapping        Better continuity, more storage
Semantic chunks    Better coherence, harder preprocessing
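
As a concrete starting point, a fixed-size chunker with overlap can be sketched as below. The chunk size and overlap values are illustrative defaults, not recommendations.

def chunk_text(text, chunk_size=800, overlap=100):
    # Split text into overlapping character windows.
    # The overlap preserves continuity across chunk boundaries.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks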

Metadata is also important. Useful metadata includes:

Metadata               Purpose
Document title         Source identification
URL or file path       Traceability
Author                 Provenance
Timestamp              Recency
Section heading        Local context
Access control tags    Permission filtering
Document type          Search filtering

A retrieval system without metadata is hard to debug and hard to trust.

Sparse Retrieval

Sparse retrieval uses lexical matching. It retrieves documents based on shared words or terms.

Classic methods include TF-IDF and BM25.

Sparse retrieval is strong when the query and document use the same vocabulary.

Example:

Query: PyTorch DistributedDataParallel gradient synchronization

A sparse retriever can find documents containing those exact terms.
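
A compact sketch of BM25 scoring over pre-tokenized documents is shown below; k1 and b are the usual default parameters, and the tokenization itself is assumed to happen elsewhere.

import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    # docs is a list of token lists; returns one BM25 score per document.
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))   # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

A document that repeats the query terms scores higher, while the length normalization keeps very long documents from dominating.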

Advantages:

Advantage                 Description
Fast                      Mature inverted-index systems
Interpretable             Match terms are visible
Strong for exact terms    APIs, names, error messages
Cheap                     No embedding model required

Limitations:

Limitation                    Description
Vocabulary mismatch           “car” versus “automobile”
Weak semantic matching        Meaning may differ from words
Poor multilingual transfer    Unless indexed carefully

Sparse retrieval remains important, especially for code, logs, names, legal text, and exact technical queries.

Dense Retrieval

Dense retrieval uses neural embeddings.

A query encoder maps the query into a vector. A document encoder maps chunks into vectors. Retrieval uses vector similarity.

Let:

q = f_\theta(\text{query}), \quad d_i = g_\theta(\text{chunk}_i).

A common similarity function is dot product or cosine similarity:

s(q, d_i) = q^\top d_i.

Dense retrieval can find semantically related text even when exact words differ.

Example:

Query: how to make model answers cite documents

A dense retriever may find chunks about retrieval-augmented generation, source grounding, and attribution.

Advantages:

Advantage                       Description
Semantic matching               Handles paraphrase
Cross-lingual potential         With multilingual embeddings
Useful for natural questions    Good for broad queries
Compact scoring                 Vector similarity

Limitations:

Limitation                  Description
Harder to inspect           Similarity is less transparent
Can miss exact terms        Poor for rare names or IDs
Requires embedding model    Extra training or API cost
Index maintenance           Vectors must be recomputed after model changes

Dense retrieval is powerful, but it should often be combined with sparse retrieval.

Hybrid Retrieval

Hybrid retrieval combines sparse and dense search.

Sparse retrieval catches exact terms. Dense retrieval catches semantic matches.

A hybrid system may:

  1. Run BM25.
  2. Run vector search.
  3. Merge candidate lists.
  4. Rerank the combined set.

This is often stronger than either method alone.
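
One common way to merge the candidate lists from step 3 is reciprocal rank fusion. The sketch below assumes each retriever returns a ranked list of chunk IDs, best first.

def reciprocal_rank_fusion(ranked_lists, k=60):
    # Each chunk earns 1 / (k + rank) from every list it appears in;
    # the scores are summed and the merged list is sorted by total score.
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: merge a BM25 ranking with a vector-search ranking.
merged = reciprocal_rank_fusion([["c3", "c1", "c7"], ["c1", "c9", "c3"]])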

Query type                   Best signal
Error messages               Sparse
API names                    Sparse
Conceptual questions         Dense
Synonym-heavy queries        Dense
Mixed technical questions    Hybrid

Hybrid retrieval is a practical default for serious RAG systems.

Reranking

The retriever usually returns a candidate set. A reranker then scores candidates more carefully.

A retriever may return 50 or 100 chunks. A reranker selects the top 5 to 10 for the prompt.

Rerankers are often cross-encoders. They read the query and candidate chunk together, producing a relevance score.

Stage        Cost      Quality
Retriever    Low       Medium
Reranker     Higher    Higher

Reranking improves precision. It is especially useful when the first-stage retriever returns noisy results.
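
A reranking step can be sketched as below. The cross_encoder model and its tokenizer are assumptions, standing in for any pair-scoring model that reads the query and a chunk together and returns a single relevance logit.

import torch

def rerank(query, chunks, cross_encoder, tokenizer, top_k=5):
    # Score each (query, chunk) pair jointly with the cross-encoder.
    scores = []
    for chunk in chunks:
        # Assumes a tokenizer that encodes a text pair into model inputs.
        inputs = tokenizer(query, chunk["text"], return_tensors="pt", truncation=True)
        with torch.no_grad():
            # Assumes the model returns one relevance logit per pair.
            scores.append(cross_encoder(**inputs).logits.squeeze().item())
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order[:top_k]]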

Prompt Construction

Once chunks are selected, they must be inserted into the prompt.

A simple prompt structure:

Use the following sources to answer the question.

Source 1:
...

Source 2:
...

Question:
...

Answer:

Prompt construction should preserve source boundaries.

Each chunk should include citation metadata, such as title, section, page, or URL.

Bad prompt construction can cause:

Problem                     Effect
Mixed sources               Citation confusion
Missing metadata            Poor attribution
Too much context            Model ignores relevant text
No instruction hierarchy    Prompt injection risk
Truncated chunks            Missing evidence

The prompt should tell the model how to use evidence and what to do when evidence is insufficient.
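
Put together, a minimal prompt builder that preserves source boundaries and citation metadata might look like the sketch below; the title and url fields are illustrative metadata keys.

def build_prompt(question, chunks):
    # Keep every chunk in its own labeled block with citation metadata.
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        blocks.append(
            f"Source {i} (title: {chunk['title']}, url: {chunk['url']}):\n{chunk['text']}"
        )
    return (
        "Answer only using the provided sources. "
        "If the sources do not contain the answer, say that the sources are insufficient.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion:\n{question}\n\nAnswer:"
    )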

Grounded Generation

Grounded generation means the answer should be supported by retrieved evidence.

A grounded answer should:

Requirement                 Meaning
Use sources                 Claims come from retrieved text
Avoid unsupported claims    No invented details
Cite evidence               Point to source chunks
Admit insufficiency         Say when evidence is missing
Preserve uncertainty        Avoid overclaiming

RAG does not automatically guarantee grounding. The model may still hallucinate or blend retrieved evidence with parametric memory.

Good systems therefore include instructions such as:

Answer only using the provided sources.
If the sources do not contain the answer, say that the sources are insufficient.

Even then, evaluation is required.

Retrieval Failure Modes

Many RAG errors are retrieval errors, not generation errors.

Common failures include:

Failure                        Description
No relevant chunk retrieved    Answer lacks evidence
Wrong chunk retrieved          Model answers wrong question
Partial evidence               Important context missing
Outdated evidence              Older document overrides newer one
Access-control failure         Unauthorized document retrieved
Duplicate chunks               Prompt wastes space
Low-quality source             Weak evidence
Conflicting sources            Model fails to resolve disagreement

When debugging RAG, inspect retrieved chunks before inspecting the model output.

If the retrieved evidence is wrong, no amount of generation quality can reliably fix the answer.

Chunking Failure Modes

Chunking strongly affects retrieval.

Bad chunking can split definitions from their explanations, separate tables from captions, or remove headings needed for context.

For example, this chunk is weak:

It must be enabled before deployment.

The pronoun “it” has no context.

A better chunk includes the heading or surrounding sentence:

Feature Flag: Audit Logging
Audit logging must be enabled before deployment.

Good chunking preserves semantic units:

Document type    Chunking strategy
Markdown         Headings and sections
PDFs             Pages plus layout-aware blocks
Code             Functions and classes
Legal text       Clauses and sections
Tickets          Thread messages and metadata
Tables           Table plus caption and headers

Chunking is not a minor preprocessing detail. It controls what the model can see.
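
For Markdown, a heading-aware splitter along the lines of the table above can be sketched as follows; it simply starts a new chunk at each heading so the heading stays with its section.

def chunk_markdown(text):
    # Start a new chunk at every heading so each chunk keeps its heading.
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]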

Query Rewriting

User queries are often underspecified. Query rewriting improves retrieval by expanding or clarifying the query before search.

Example:

User query:
"How do I fix it?"

Conversation context:
Previous message mentioned CUDA out-of-memory during PyTorch training.

Rewritten query:
"PyTorch CUDA out of memory training batch size gradient checkpointing mixed precision"

Query rewriting may include:

Rewrite type            Example
Context expansion       Add previous conversation details
Acronym expansion       “DDP” to “DistributedDataParallel”
Entity normalization    Product names and versions
Decomposition           Split complex question into subqueries
HyDE                    Generate hypothetical answer, then retrieve

Query rewriting is useful, but it can introduce errors. The rewritten query should remain faithful to the user’s intent.
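
A very small rewrite step can be sketched as below. The acronym map and the shape of the conversation history are illustrative assumptions; real systems often use a language model for this step instead.

ACRONYMS = {"DDP": "DistributedDataParallel", "OOM": "out of memory"}

def rewrite_query(query, conversation_history, acronyms=ACRONYMS):
    # Expand known acronyms without changing the user's intent.
    rewritten = " ".join(acronyms.get(word, word) for word in query.split())
    # Add context terms from the most recent conversation turn.
    if conversation_history:
        rewritten += " " + conversation_history[-1]
    return rewritten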

Multi-Hop Retrieval

Some questions require multiple retrieval steps.

Example:

Which model mentioned in the March roadmap later failed the April evaluation?

The system must retrieve the March roadmap, identify the model, then retrieve the April evaluation.

A multi-hop RAG system may:

  1. Retrieve initial documents.
  2. Extract entities.
  3. Form a second query.
  4. Retrieve follow-up documents.
  5. Synthesize across evidence.

Multi-hop retrieval is important for research, legal analysis, debugging, and enterprise knowledge systems.

It is harder than single-shot retrieval because errors compound across steps.
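
A two-hop loop can be sketched as below; retrieve, extract_entities, build_prompt, and generate are placeholders for the system's actual retriever, entity extractor, prompt builder, and language model.

def two_hop_answer(question, retrieve, extract_entities, build_prompt, generate):
    # Hop 1: retrieve documents for the original question.
    first_hop = retrieve(question)

    # Form a follow-up query from entities found in the first-hop evidence.
    entities = extract_entities(first_hop)
    follow_up = question + " " + " ".join(entities)

    # Hop 2: retrieve follow-up documents and synthesize across both hops.
    second_hop = retrieve(follow_up)
    return generate(build_prompt(question, first_hop + second_hop))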

RAG Versus Fine-Tuning

RAG and fine-tuning solve different problems.

Need                            Better approach
Current facts                   RAG
Private documents               RAG
Source citations                RAG
Stable behavior style           Fine-tuning
Domain-specific language        Fine-tuning
Repeated task format            Fine-tuning
Lower prompt cost               Fine-tuning
Knowledge that changes often    RAG

Fine-tuning changes model parameters. RAG changes the context.

For factual knowledge that updates frequently, RAG is usually preferable. For behavior and style, fine-tuning is often better.

Many production systems use both.

RAG Evaluation

RAG evaluation should measure each stage.

Component            Evaluation question
Retriever            Did it find relevant evidence?
Reranker             Did it rank the best chunks highest?
Generator            Did it answer correctly from evidence?
Citation system      Are citations accurate?
End-to-end system    Did the user get a correct answer?

Important metrics include:

Metric                Meaning
Recall@k              Relevant document appears in top k
Precision@k           Top k documents are relevant
MRR                   Reciprocal rank of first relevant item
nDCG                  Ranking quality with graded relevance
Faithfulness          Answer supported by sources
Answer correctness    Final answer is right
Citation accuracy     Cited source supports claim

End-to-end accuracy alone is insufficient. A model may answer correctly from memorized knowledge while retrieval failed. Conversely, retrieval may succeed but the generator may ignore evidence.
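
The retrieval-side metrics are easy to compute directly from ranked results; a minimal sketch of Recall@k and reciprocal rank is shown below.

def recall_at_k(ranked_ids, relevant_ids, k):
    # 1.0 if any relevant document appears in the top k results, else 0.0.
    return float(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))

def reciprocal_rank(ranked_ids, relevant_ids):
    # 1 / rank of the first relevant document, or 0.0 if none is retrieved.
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Averaging reciprocal_rank over a set of evaluation queries gives MRR.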

Security and Access Control

RAG systems often retrieve private data. Security must be part of the architecture.

Key requirements:

Requirement                   Purpose
Document-level permissions    Only retrieve authorized documents
Chunk-level filtering         Prevent leakage from mixed documents
Tenant isolation              Separate users and organizations
Audit logs                    Track access
Redaction                     Remove sensitive fields
Prompt injection defense      Treat documents as untrusted
Source visibility             Let users inspect provenance

Access control must happen before generation. Filtering after the model sees unauthorized text is too late.
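
A minimal pre-generation permission filter over chunk metadata might look like the sketch below; the access_tags field is an illustrative assumption about how access-control tags are stored.

def filter_authorized(chunks, user_permissions):
    # Keep only chunks whose access-control tags are all granted to the user.
    # This filter must run before any chunk reaches the prompt.
    return [
        chunk for chunk in chunks
        if set(chunk["access_tags"]).issubset(user_permissions)
    ]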

PyTorch View: Dense Retrieval

A dense retriever can be implemented with two encoders.

import torch
import torch.nn.functional as F

# query_encoder and doc_encoder are assumed to be trained encoders that
# map token IDs to pooled embeddings of dimension D.
query_emb = query_encoder(query_input_ids)      # [B, D]
doc_emb = doc_encoder(doc_input_ids)            # [N, D]

# Normalize so the dot product equals cosine similarity.
query_emb = F.normalize(query_emb, dim=-1)
doc_emb = F.normalize(doc_emb, dim=-1)

# Score every query against every document chunk, then keep the top 5.
scores = query_emb @ doc_emb.T                  # [B, N]
topk_scores, topk_indices = scores.topk(k=5, dim=-1)

The result topk_indices identifies the most similar document chunks.

In production, document embeddings are precomputed and stored in a vector index. The code above shows the mathematical core, not a scalable serving system.

PyTorch View: Contrastive Retriever Training

A retriever can be trained using contrastive learning.

Suppose each query has one positive document in the same batch.

import torch
import torch.nn.functional as F

# Encode a batch of queries and their paired positive documents.
query_emb = query_encoder(query_input_ids)  # [B, D]
doc_emb = doc_encoder(doc_input_ids)        # [B, D]

query_emb = F.normalize(query_emb, dim=-1)
doc_emb = F.normalize(doc_emb, dim=-1)

# Similarity of every query to every document in the batch.
logits = query_emb @ doc_emb.T              # [B, B]

# The correct document for query i sits at batch index i.
labels = torch.arange(logits.size(0), device=logits.device)

loss = F.cross_entropy(logits, labels)

Each query should match the document at the same batch index. Other documents in the batch act as negatives.

This is the standard in-batch negative training pattern.

Practical Design Rules

A strong RAG system follows several rules.

First, preserve provenance. Every chunk should carry source metadata.

Second, use hybrid retrieval for mixed corpora. Exact terms and semantic meaning both matter.

Third, rerank before generation when precision matters.

Fourth, evaluate retrieval separately. Many failures happen before the language model sees the prompt.

Fifth, make the model admit insufficient evidence. A grounded system should not guess.

Sixth, enforce access control before retrieval results reach the model.

Seventh, treat retrieved text as untrusted data. It can contain malicious instructions.

Eighth, log retrieved chunks for debugging and audit.

Summary

Retrieval-augmented generation connects language models to external knowledge.

The retrieval system stores and searches documents. The language model reads retrieved evidence and generates an answer.

RAG improves factual grounding, recency, personalization, and source attribution. It is especially important for enterprise search, research assistants, customer support, legal analysis, codebase assistance, and any domain where the answer depends on external documents.

The core challenge is system quality. Good RAG requires careful chunking, indexing, hybrid retrieval, reranking, prompt construction, access control, citation design, and evaluation.