
Retrieval-Augmented Generation

Retrieval-augmented generation, usually abbreviated RAG, combines a language model with an external information retrieval system. Instead of relying only on knowledge stored in model parameters, the system retrieves relevant documents at inference time and places them into the model’s context.

The core pattern is:

user question
  -> retrieve relevant evidence
  -> condition the model on evidence
  -> generate answer

RAG is useful because pretrained models have limited, static, and imperfect memory. Retrieval gives the model access to current information, private documents, large corpora, and evidence that can be cited or inspected.

Why Retrieval Is Needed

A language model trained on a fixed corpus has several limits.

Limit                  Example
Stale knowledge        Recent software releases or laws
Missing private data   Company documents, emails, tickets
Hallucination          Fluent but unsupported claims
Context limit          Large documents cannot fit fully
Poor attribution       Hard to know where an answer came from

Retrieval reduces these problems by separating knowledge storage from generation.

The retrieval system stores documents. The language model reads selected passages and synthesizes an answer. This makes the system more updateable and auditable.

The RAG Pipeline

A typical RAG system has two phases: indexing and inference.

During indexing:

  1. Collect documents.
  2. Split documents into chunks.
  3. Embed or index chunks.
  4. Store chunks and metadata.
  5. Build search structures.

During inference:

  1. Receive a user query.
  2. Retrieve candidate chunks.
  3. Optionally rerank candidates.
  4. Insert selected evidence into the prompt.
  5. Generate an answer.
  6. Optionally cite sources.

The inference path can be summarized as:

query -> retriever -> chunks -> reranker -> prompt -> generator -> answer

Each stage affects final answer quality.
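
As a rough illustration, the whole inference path can be written as a short function. The retrieve, rerank, build_prompt, and generate callables below are hypothetical placeholders for whatever retriever, reranker, prompt builder, and language model a given system uses.

def answer_question(query, retrieve, rerank, build_prompt, generate, k=5):
    # Retrieve a broad candidate set of chunks for the query.
    candidates = retrieve(query)

    # Narrow the candidates to the k chunks most relevant to the query.
    evidence = rerank(query, candidates)[:k]

    # Condition the model on the selected evidence and generate an answer.
    prompt = build_prompt(query, evidence)
    return generate(prompt), evidence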

Documents, Chunks, and Metadata

Most documents are too long to retrieve as a single unit. They are split into chunks.

A chunk should be large enough to contain useful context but small enough to retrieve precisely.

Chunk size         Tradeoff
Too small          Missing context
Too large          Irrelevant text enters prompt
Overlapping        Better continuity, more storage
Semantic chunks    Better coherence, harder preprocessing
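
As a concrete starting point, a fixed-size chunker with overlap can be sketched as below. The chunk size and overlap values are illustrative defaults, not recommendations.

def chunk_text(text, chunk_size=800, overlap=100):
    # Split text into overlapping character windows.
    # The overlap preserves continuity across chunk boundaries.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks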

Metadata is also important. Useful metadata includes:

Metadata               Purpose
Document title         Source identification
URL or file path       Traceability
Author                 Provenance
Timestamp              Recency
Section heading        Local context
Access control tags    Permission filtering
Document type          Search filtering

A retrieval system without metadata is hard to debug and hard to trust.

Sparse Retrieval

Sparse retrieval uses lexical matching. It retrieves documents based on shared words or terms.

Classic methods include TF-IDF and BM25.

Sparse retrieval is strong when the query and document use the same vocabulary.

Example:

Query: PyTorch DistributedDataParallel gradient synchronization

A sparse retriever can find documents containing those exact terms.
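
A compact sketch of BM25 scoring over pre-tokenized documents is shown below; k1 and b are the usual default parameters, and the tokenization itself is assumed to happen elsewhere.

import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    # docs is a list of token lists; returns one BM25 score per document.
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))   # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

A document that repeats the query terms scores higher, while the length normalization keeps very long documents from dominating.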

Advantages:

Advantage                 Description
Fast                      Mature inverted-index systems
Interpretable             Match terms are visible
Strong for exact terms    APIs, names, error messages
Cheap                     No embedding model required

Limitations:

Limitation                    Description
Vocabulary mismatch           “car” versus “automobile”
Weak semantic matching        Meaning may differ from words
Poor multilingual transfer    Unless indexed carefully

Sparse retrieval remains important, especially for code, logs, names, legal text, and exact technical queries.

Dense Retrieval

Dense retrieval uses neural embeddings.

A query encoder maps the query into a vector. A document encoder maps chunks into vectors. Retrieval uses vector similarity.

Let:

q = f_\theta(\text{query}), \quad d_i = g_\theta(\text{chunk}_i).

A common similarity function is dot product or cosine similarity:

s(q, d_i) = q^\top d_i.

Dense retrieval can find semantically related text even when exact words differ.

Example:

Query: how to make model answers cite documents

A dense retriever may find chunks about retrieval-augmented generation, source grounding, and attribution.

Advantages:

Advantage                       Description
Semantic matching               Handles paraphrase
Cross-lingual potential         With multilingual embeddings
Useful for natural questions    Good for broad queries
Compact scoring                 Vector similarity

Limitations:

Limitation                  Description
Harder to inspect           Similarity is less transparent
Can miss exact terms        Poor for rare names or IDs
Requires embedding model    Extra training or API cost
Index maintenance           Vectors must be recomputed after model changes

Dense retrieval is powerful, but it should often be combined with sparse retrieval.

Hybrid Retrieval

Hybrid retrieval combines sparse and dense search.

Sparse retrieval catches exact terms. Dense retrieval catches semantic matches.

A hybrid system may:

  1. Run BM25.
  2. Run vector search.
  3. Merge candidate lists.
  4. Rerank the combined set.

This is often stronger than either method alone.
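
One common way to merge the candidate lists from step 3 is reciprocal rank fusion. The sketch below assumes each retriever returns a ranked list of chunk IDs, best first.

def reciprocal_rank_fusion(ranked_lists, k=60):
    # Each chunk earns 1 / (k + rank) from every list it appears in;
    # the scores are summed and the merged list is sorted by total score.
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: merge a BM25 ranking with a vector-search ranking.
merged = reciprocal_rank_fusion([["c3", "c1", "c7"], ["c1", "c9", "c3"]])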

Query type                   Best signal
Error messages               Sparse
API names                    Sparse
Conceptual questions         Dense
Synonym-heavy queries        Dense
Mixed technical questions    Hybrid

Hybrid retrieval is a practical default for serious RAG systems.

Reranking

The retriever usually returns a candidate set. A reranker then scores candidates more carefully.

A retriever may return 50 or 100 chunks. A reranker selects the top 5 to 10 for the prompt.

Rerankers are often cross-encoders. They read the query and candidate chunk together, producing a relevance score.

Stage        Cost      Quality
Retriever    Low       Medium
Reranker     Higher    Higher

Reranking improves precision. It is especially useful when the first-stage retriever returns noisy results.
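
A reranking step can be sketched as below. The cross_encoder model and its tokenizer are assumptions, standing in for any pair-scoring model that reads the query and a chunk together and returns a single relevance logit.

import torch

def rerank(query, chunks, cross_encoder, tokenizer, top_k=5):
    # Score each (query, chunk) pair jointly with the cross-encoder.
    scores = []
    for chunk in chunks:
        # Assumes a tokenizer that encodes a text pair into model inputs.
        inputs = tokenizer(query, chunk["text"], return_tensors="pt", truncation=True)
        with torch.no_grad():
            # Assumes the model returns one relevance logit per pair.
            scores.append(cross_encoder(**inputs).logits.squeeze().item())
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order[:top_k]]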

Prompt Construction

Once chunks are selected, they must be inserted into the prompt.

A simple prompt structure:

Use the following sources to answer the question.

Source 1:
...

Source 2:
...

Question:
...

Answer:

Prompt construction should preserve source boundaries.

Each chunk should include citation metadata, such as title, section, page, or URL.

Bad prompt construction can cause:

Problem                     Effect
Mixed sources               Citation confusion
Missing metadata            Poor attribution
Too much context            Model ignores relevant text
No instruction hierarchy    Prompt injection risk
Truncated chunks            Missing evidence

The prompt should tell the model how to use evidence and what to do when evidence is insufficient.
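
Put together, a minimal prompt builder that preserves source boundaries and citation metadata might look like the sketch below; the title and url fields are illustrative metadata keys.

def build_prompt(question, chunks):
    # Keep every chunk in its own labeled block with citation metadata.
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        blocks.append(
            f"Source {i} (title: {chunk['title']}, url: {chunk['url']}):\n{chunk['text']}"
        )
    return (
        "Answer only using the provided sources. "
        "If the sources do not contain the answer, say that the sources are insufficient.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion:\n{question}\n\nAnswer:"
    )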

Grounded Generation

Grounded generation means the answer should be supported by retrieved evidence.

A grounded answer should:

Requirement                 Meaning
Use sources                 Claims come from retrieved text
Avoid unsupported claims    No invented details
Cite evidence               Point to source chunks
Admit insufficiency         Say when evidence is missing
Preserve uncertainty        Avoid overclaiming

RAG does not automatically guarantee grounding. The model may still hallucinate or blend retrieved evidence with parametric memory.

Good systems therefore include instructions such as:

Answer only using the provided sources.
If the sources do not contain the answer, say that the sources are insufficient.

Even then, evaluation is required.

Retrieval Failure Modes

Many RAG errors are retrieval errors, not generation errors.

Common failures include:

Failure                        Description
No relevant chunk retrieved    Answer lacks evidence
Wrong chunk retrieved          Model answers wrong question
Partial evidence               Important context missing
Outdated evidence              Older document overrides newer one
Access-control failure         Unauthorized document retrieved
Duplicate chunks               Prompt wastes space
Low-quality source             Weak evidence
Conflicting sources            Model fails to resolve disagreement

When debugging RAG, inspect retrieved chunks before inspecting the model output.

If the retrieved evidence is wrong, no amount of generation quality can reliably fix the answer.

Chunking Failure Modes

Chunking strongly affects retrieval.

Bad chunking can split definitions from their explanations, separate tables from captions, or remove headings needed for context.

For example, this chunk is weak:

It must be enabled before deployment.

The pronoun “it” has no context.

A better chunk includes the heading or surrounding sentence:

Feature Flag: Audit Logging
Audit logging must be enabled before deployment.

Good chunking preserves semantic units:

Document type    Chunking strategy
Markdown         Headings and sections
PDFs             Pages plus layout-aware blocks
Code             Functions and classes
Legal text       Clauses and sections
Tickets          Thread messages and metadata
Tables           Table plus caption and headers

Chunking is not a minor preprocessing detail. It controls what the model can see.
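
For Markdown, a heading-aware splitter along the lines of the table above can be sketched as follows; it simply starts a new chunk at each heading so the heading stays with its section.

def chunk_markdown(text):
    # Start a new chunk at every heading so each chunk keeps its heading.
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]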

Query Rewriting

User queries are often underspecified. Query rewriting improves retrieval by expanding or clarifying the query before search.

Example:

User query:
"How do I fix it?"

Conversation context:
Previous message mentioned CUDA out-of-memory during PyTorch training.

Rewritten query:
"PyTorch CUDA out of memory training batch size gradient checkpointing mixed precision"

Query rewriting may include:

Rewrite type            Example
Context expansion       Add previous conversation details
Acronym expansion       “DDP” to “DistributedDataParallel”
Entity normalization    Product names and versions
Decomposition           Split complex question into subqueries
HyDE                    Generate hypothetical answer, then retrieve

Query rewriting is useful, but it can introduce errors. The rewritten query should remain faithful to the user’s intent.
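
A very small rewrite step can be sketched as below. The acronym map and the shape of the conversation history are illustrative assumptions; real systems often use a language model for this step instead.

ACRONYMS = {"DDP": "DistributedDataParallel", "OOM": "out of memory"}

def rewrite_query(query, conversation_history, acronyms=ACRONYMS):
    # Expand known acronyms without changing the user's intent.
    rewritten = " ".join(acronyms.get(word, word) for word in query.split())
    # Add context terms from the most recent conversation turn.
    if conversation_history:
        rewritten += " " + conversation_history[-1]
    return rewritten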

Multi-Hop Retrieval

Some questions require multiple retrieval steps.

Example:

Which model mentioned in the March roadmap later failed the April evaluation?

The system must retrieve the March roadmap, identify the model, then retrieve the April evaluation.

A multi-hop RAG system may:

  1. Retrieve initial documents.
  2. Extract entities.
  3. Form a second query.
  4. Retrieve follow-up documents.
  5. Synthesize across evidence.

Multi-hop retrieval is important for research, legal analysis, debugging, and enterprise knowledge systems.

It is harder than single-shot retrieval because errors compound across steps.
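
A two-hop loop can be sketched as below; retrieve, extract_entities, build_prompt, and generate are placeholders for the system's actual retriever, entity extractor, prompt builder, and language model.

def two_hop_answer(question, retrieve, extract_entities, build_prompt, generate):
    # Hop 1: retrieve documents for the original question.
    first_hop = retrieve(question)

    # Form a follow-up query from entities found in the first-hop evidence.
    entities = extract_entities(first_hop)
    follow_up = question + " " + " ".join(entities)

    # Hop 2: retrieve follow-up documents and synthesize across both hops.
    second_hop = retrieve(follow_up)
    return generate(build_prompt(question, first_hop + second_hop))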

RAG Versus Fine-Tuning

RAG and fine-tuning solve different problems.

Need                            Better approach
Current facts                   RAG
Private documents               RAG
Source citations                RAG
Stable behavior style           Fine-tuning
Domain-specific language        Fine-tuning
Repeated task format            Fine-tuning
Lower prompt cost               Fine-tuning
Knowledge that changes often    RAG

Fine-tuning changes model parameters. RAG changes the context.

For factual knowledge that updates frequently, RAG is usually preferable. For behavior and style, fine-tuning is often better.

Many production systems use both.

RAG Evaluation

RAG evaluation should measure each stage.

Component            Evaluation question
Retriever            Did it find relevant evidence?
Reranker             Did it rank the best chunks highest?
Generator            Did it answer correctly from evidence?
Citation system      Are citations accurate?
End-to-end system    Did the user get a correct answer?

Important metrics include:

Metric                Meaning
Recall@k              Relevant document appears in top k
Precision@k           Top k documents are relevant
MRR                   Reciprocal rank of first relevant item
nDCG                  Ranking quality with graded relevance
Faithfulness          Answer supported by sources
Answer correctness    Final answer is right
Citation accuracy     Cited source supports claim

End-to-end accuracy alone is insufficient. A model may answer correctly from memorized knowledge while retrieval failed. Conversely, retrieval may succeed but the generator may ignore evidence.
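
The retrieval-side metrics are easy to compute directly from ranked results; a minimal sketch of Recall@k and reciprocal rank is shown below.

def recall_at_k(ranked_ids, relevant_ids, k):
    # 1.0 if any relevant document appears in the top k results, else 0.0.
    return float(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))

def reciprocal_rank(ranked_ids, relevant_ids):
    # 1 / rank of the first relevant document, or 0.0 if none is retrieved.
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Averaging reciprocal_rank over a set of evaluation queries gives MRR.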

Security and Access Control

RAG systems often retrieve private data. Security must be part of the architecture.

Key requirements:

Requirement                   Purpose
Document-level permissions    Only retrieve authorized documents
Chunk-level filtering         Prevent leakage from mixed documents
Tenant isolation              Separate users and organizations
Audit logs                    Track access
Redaction                     Remove sensitive fields
Prompt injection defense      Treat documents as untrusted
Source visibility             Let users inspect provenance

Access control must happen before generation. Filtering after the model sees unauthorized text is too late.
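
A minimal pre-generation permission filter over chunk metadata might look like the sketch below; the access_tags field is an illustrative assumption about how access-control tags are stored.

def filter_authorized(chunks, user_permissions):
    # Keep only chunks whose access-control tags are all granted to the user.
    # This filter must run before any chunk reaches the prompt.
    return [
        chunk for chunk in chunks
        if set(chunk["access_tags"]).issubset(user_permissions)
    ]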

PyTorch View: Dense Retrieval

A dense retriever can be implemented with two encoders.

import torch
import torch.nn.functional as F

# query_encoder and doc_encoder are assumed to be trained encoders that
# map token IDs to pooled embeddings of dimension D.
query_emb = query_encoder(query_input_ids)      # [B, D]
doc_emb = doc_encoder(doc_input_ids)            # [N, D]

# Normalize so the dot product equals cosine similarity.
query_emb = F.normalize(query_emb, dim=-1)
doc_emb = F.normalize(doc_emb, dim=-1)

# Score every query against every document chunk, then keep the top 5.
scores = query_emb @ doc_emb.T                  # [B, N]
topk_scores, topk_indices = scores.topk(k=5, dim=-1)

The result topk_indices identifies the most similar document chunks.

In production, document embeddings are precomputed and stored in a vector index. The code above shows the mathematical core, not a scalable serving system.

PyTorch View: Contrastive Retriever Training

A retriever can be trained using contrastive learning.

Suppose each query has one positive document in the same batch.

import torch
import torch.nn.functional as F

# Encode a batch of queries and their paired positive documents.
query_emb = query_encoder(query_input_ids)  # [B, D]
doc_emb = doc_encoder(doc_input_ids)        # [B, D]

query_emb = F.normalize(query_emb, dim=-1)
doc_emb = F.normalize(doc_emb, dim=-1)

# Similarity of every query to every document in the batch.
logits = query_emb @ doc_emb.T              # [B, B]

# The correct document for query i sits at batch index i.
labels = torch.arange(logits.size(0), device=logits.device)

loss = F.cross_entropy(logits, labels)

Each query should match the document at the same batch index. Other documents in the batch act as negatives.

This is the standard in-batch negative training pattern.

Practical Design Rules

A strong RAG system follows several rules.

First, preserve provenance. Every chunk should carry source metadata.

Second, use hybrid retrieval for mixed corpora. Exact terms and semantic meaning both matter.

Third, rerank before generation when precision matters.

Fourth, evaluate retrieval separately. Many failures happen before the language model sees the prompt.

Fifth, make the model admit insufficient evidence. A grounded system should not guess.

Sixth, enforce access control before retrieval results reach the model.

Seventh, treat retrieved text as untrusted data. It can contain malicious instructions.

Eighth, log retrieved chunks for debugging and audit.

Summary

Retrieval-augmented generation connects language models to external knowledge.

The retrieval system stores and searches documents. The language model reads retrieved evidence and generates an answer.

RAG improves factual grounding, recency, personalization, and source attribution. It is especially important for enterprise search, research assistants, customer support, legal analysis, codebase assistance, and any domain where the answer depends on external documents.

The core challenge is system quality. Good RAG requires careful chunking, indexing, hybrid retrieval, reranking, prompt construction, access control, citation design, and evaluation.