# Retrieval-Augmented Generation

Retrieval-augmented generation, usually abbreviated RAG, combines a language model with an external information retrieval system. Instead of relying only on knowledge stored in model parameters, the system retrieves relevant documents at inference time and places them into the model’s context.

The core pattern is:

```text
user question
  -> retrieve relevant evidence
  -> condition the model on evidence
  -> generate answer
```

RAG is useful because pretrained models have limited, static, and imperfect memory. Retrieval gives the model access to current information, private documents, large corpora, and evidence that can be cited or inspected.

### Why Retrieval Is Needed

A language model trained on a fixed corpus has several limits.

| Limit | Example |
|---|---|
| Stale knowledge | Recent software releases or laws |
| Missing private data | Company documents, emails, tickets |
| Hallucination | Fluent but unsupported claims |
| Context limit | Large documents cannot fit fully |
| Poor attribution | Hard to know where an answer came from |

Retrieval reduces these problems by separating knowledge storage from generation.

The retrieval system stores documents. The language model reads selected passages and synthesizes an answer. This separation makes the system easier to update and audit.

### The RAG Pipeline

A typical RAG system has two phases: indexing and inference.

During indexing (a minimal code sketch follows this list):

1. Collect documents.
2. Split documents into chunks.
3. Embed or index chunks.
4. Store chunks and metadata.
5. Build search structures.
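
A minimal sketch of this phase, assuming hypothetical `split_into_chunks` and `embed` helpers; the in-memory list stands in for a real search index.

```python
# A sketch of the indexing phase; `split_into_chunks` and `embed` are
# hypothetical helpers, and the in-memory list stands in for a real index.
def build_index(documents, split_into_chunks, embed):
    index = []
    for doc in documents:                             # 1. collect documents
        for chunk in split_into_chunks(doc["text"]):  # 2. split documents into chunks
            index.append({
                "text": chunk,
                "embedding": embed(chunk),            # 3. embed or index chunks
                "metadata": doc.get("metadata", {}),  # 4. store chunks and metadata
            })
    return index                                      # 5. search structures are built over this
```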

During inference:

1. Receive a user query.
2. Retrieve candidate chunks.
3. Optionally rerank candidates.
4. Insert selected evidence into the prompt.
5. Generate an answer.
6. Optionally cite sources.

The inference path can be summarized as:

```text
query -> retriever -> chunks -> reranker -> prompt -> generator -> answer
```

Each stage affects final answer quality.
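
A minimal sketch of this path, with `retrieve`, `rerank`, `build_prompt`, and `generate` as hypothetical stand-ins for the components above:

```python
# A sketch of the inference path; all four helpers are hypothetical stand-ins.
def answer_question(query, retrieve, rerank, build_prompt, generate, k=50, top_k=5):
    candidates = retrieve(query, k=k)             # retrieve candidate chunks
    evidence = rerank(query, candidates)[:top_k]  # optionally rerank, keep the best few
    prompt = build_prompt(query, evidence)        # insert selected evidence into the prompt
    return generate(prompt), evidence             # answer plus the chunks used as sources
```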

### Documents, Chunks, and Metadata

Most documents are too long to retrieve as a single unit. They are split into chunks.

A chunk should be large enough to contain useful context but small enough to retrieve precisely.

| Chunking choice | Tradeoff |
|---|---|
| Too small | Missing context |
| Too large | Irrelevant text enters prompt |
| Overlapping | Better continuity, more storage |
| Semantic chunks | Better coherence, harder preprocessing |

Metadata is also important. Useful metadata includes:

| Metadata | Purpose |
|---|---|
| Document title | Source identification |
| URL or file path | Traceability |
| Author | Provenance |
| Timestamp | Recency |
| Section heading | Local context |
| Access control tags | Permission filtering |
| Document type | Search filtering |

A retrieval system without metadata is hard to debug and hard to trust.
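
A sketch of a simple chunker that attaches metadata to every chunk; the character counts and metadata fields are illustrative choices, not requirements.

```python
# A sketch of fixed-size chunking with overlap; sizes and metadata fields are illustrative.
def chunk_document(text, title, url, chunk_chars=1000, overlap=200):
    chunks = []
    step = chunk_chars - overlap
    for start in range(0, max(len(text), 1), step):
        piece = text[start:start + chunk_chars]
        if not piece:
            break
        chunks.append({
            "text": piece,
            "metadata": {"title": title, "url": url, "char_offset": start},
        })
    return chunks
```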

### Sparse Retrieval

Sparse retrieval uses lexical matching. It retrieves documents based on shared words or terms.

Classic methods include TF-IDF and BM25.

Sparse retrieval is strong when the query and document use the same vocabulary.

Example:

```text
Query: PyTorch DistributedDataParallel gradient synchronization
```

A sparse retriever can find documents containing those exact terms.

Advantages:

| Advantage | Description |
|---|---|
| Fast | Mature inverted-index systems |
| Interpretable | Match terms are visible |
| Strong for exact terms | APIs, names, error messages |
| Cheap | No embedding model required |

Limitations:

| Limitation | Description |
|---|---|
| Vocabulary mismatch | “car” versus “automobile” |
| Weak semantic matching | Meaning may differ from words |
| Poor multilingual transfer | Cross-language matching needs careful index design |

Sparse retrieval remains important, especially for code, logs, names, legal text, and exact technical queries.
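
A compact, self-contained BM25 scorer makes the lexical-matching idea concrete. This is one common formulation; the `k1` and `b` values are typical defaults, and whitespace tokenization keeps the sketch short.

```python
import math
from collections import Counter

# A self-contained BM25 sketch; not a production indexing system.
def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [doc.lower().split() for doc in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    df = Counter(term for d in tokenized for term in set(d))   # document frequency per term
    n = len(tokenized)
    scores = []
    for d in tokenized:
        tf = Counter(d)                                        # term frequency in this document
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```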

### Dense Retrieval

Dense retrieval uses neural embeddings.

A query encoder maps the query into a vector. A document encoder maps chunks into vectors. Retrieval uses vector similarity.

Let:

$$
q = f_\theta(\text{query}),
\quad
d_i = g_\theta(\text{chunk}_i).
$$

A common similarity function is the dot product, which equals cosine similarity when the embeddings are normalized:

$$
s(q,d_i)=q^\top d_i.
$$

Dense retrieval can find semantically related text even when exact words differ.

Example:

```text
Query: how to make model answers cite documents
```

A dense retriever may find chunks about retrieval-augmented generation, source grounding, and attribution.

Advantages:

| Advantage | Description |
|---|---|
| Semantic matching | Handles paraphrase |
| Cross-lingual potential | With multilingual embeddings |
| Useful for natural questions | Good for broad queries |
| Compact scoring | Vector similarity |

Limitations:

| Limitation | Description |
|---|---|
| Harder to inspect | Similarity is less transparent |
| Can miss exact terms | Poor for rare names or IDs |
| Requires embedding model | Extra training or API cost |
| Index maintenance | Vectors must be recomputed after model changes |

Dense retrieval is powerful, but it should often be combined with sparse retrieval.

### Hybrid Retrieval

Hybrid retrieval combines sparse and dense search.

Sparse retrieval catches exact terms. Dense retrieval catches semantic matches.

A hybrid system may:

1. Run BM25.
2. Run vector search.
3. Merge candidate lists.
4. Rerank the combined set.

This is often stronger than either method alone.

| Query type | Best signal |
|---|---|
| Error messages | Sparse |
| API names | Sparse |
| Conceptual questions | Dense |
| Synonym-heavy queries | Dense |
| Mixed technical questions | Hybrid |

Hybrid retrieval is a practical default for serious RAG systems.
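
One common way to merge the two candidate lists is reciprocal rank fusion; the sketch below assumes each input list is already ordered from most to least relevant.

```python
# A sketch of reciprocal rank fusion over ranked lists of chunk ids.
def reciprocal_rank_fusion(ranked_lists, k=60):
    fused = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

merged = reciprocal_rank_fusion([
    ["d3", "d1", "d7"],   # e.g. BM25 ranking
    ["d1", "d9", "d3"],   # e.g. vector-search ranking
])
```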

### Reranking

The retriever usually returns a candidate set. A reranker then scores candidates more carefully.

A retriever may return 50 or 100 chunks. A reranker selects the top 5 to 10 for the prompt.

Rerankers are often cross-encoders. They read the query and candidate chunk together, producing a relevance score.

| Stage | Cost | Quality |
|---|---:|---:|
| Retriever | Low | Medium |
| Reranker | Higher | Higher |

Reranking improves precision. It is especially useful when the first-stage retriever returns noisy results.
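
A sketch of the reranking step, assuming a hypothetical `cross_encoder` model and `tokenize_pair` helper that jointly encode the query and one candidate chunk:

```python
import torch

# A sketch of cross-encoder reranking; `cross_encoder` and `tokenize_pair`
# are hypothetical stand-ins for a trained model and its tokenizer.
def rerank(query, candidates, cross_encoder, tokenize_pair, top_k=5):
    scores = []
    with torch.no_grad():
        for chunk in candidates:
            inputs = tokenize_pair(query, chunk["text"])   # joint query + chunk encoding
            scores.append(cross_encoder(**inputs).item())  # scalar relevance score
    order = sorted(range(len(candidates)), key=scores.__getitem__, reverse=True)
    return [candidates[i] for i in order[:top_k]]
```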

### Prompt Construction

Once chunks are selected, they must be inserted into the prompt.

A simple prompt structure:

```text
Use the following sources to answer the question.

Source 1:
...

Source 2:
...

Question:
...

Answer:
```

Prompt construction should preserve source boundaries.

Each chunk should include citation metadata, such as title, section, page, or URL.

Bad prompt construction can cause:

| Problem | Effect |
|---|---|
| Mixed sources | Citation confusion |
| Missing metadata | Poor attribution |
| Too much context | Model ignores relevant text |
| No instruction hierarchy | Prompt injection risk |
| Truncated chunks | Missing evidence |

The prompt should tell the model how to use evidence and what to do when evidence is insufficient.
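
A sketch of prompt assembly that preserves source boundaries and carries citation metadata; the template wording and the `title` and `url` fields are illustrative assumptions.

```python
# A sketch of prompt assembly; the template wording and metadata fields are illustrative.
def build_prompt(question, chunks):
    parts = [
        "Use the following sources to answer the question.",
        "If the sources do not contain the answer, say that the sources are insufficient.",
        "",
    ]
    for i, chunk in enumerate(chunks, start=1):
        meta = chunk["metadata"]
        parts.append(f"Source {i} ({meta['title']}, {meta['url']}):")
        parts.append(chunk["text"])
        parts.append("")                       # keep source boundaries visible
    parts.append(f"Question:\n{question}")
    parts.append("\nAnswer:")
    return "\n".join(parts)
```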

### Grounded Generation

Grounded generation means the answer should be supported by retrieved evidence.

A grounded answer should:

| Requirement | Meaning |
|---|---|
| Use sources | Claims come from retrieved text |
| Avoid unsupported claims | No invented details |
| Cite evidence | Point to source chunks |
| Admit insufficiency | Say when evidence is missing |
| Preserve uncertainty | Avoid overclaiming |

RAG does not automatically guarantee grounding. The model may still hallucinate or blend retrieved evidence with parametric memory.

Good systems therefore include instructions such as:

```text
Answer only using the provided sources.
If the sources do not contain the answer, say that the sources are insufficient.
```

Even then, evaluation is required.

### Retrieval Failure Modes

Many RAG errors are retrieval errors, not generation errors.

Common failures include:

| Failure | Description |
|---|---|
| No relevant chunk retrieved | Answer lacks evidence |
| Wrong chunk retrieved | Model answers wrong question |
| Partial evidence | Important context missing |
| Outdated evidence | Older document overrides newer one |
| Access-control failure | Unauthorized document retrieved |
| Duplicate chunks | Prompt wastes space |
| Low-quality source | Weak evidence |
| Conflicting sources | Model fails to resolve disagreement |

When debugging RAG, inspect retrieved chunks before inspecting the model output.

If the retrieved evidence is wrong, a stronger generator cannot reliably fix the answer.

### Chunking Failure Modes

Chunking strongly affects retrieval.

Bad chunking can split definitions from their explanations, separate tables from captions, or remove headings needed for context.

For example, this chunk is weak:

```text
It must be enabled before deployment.
```

The pronoun “it” has no context.

A better chunk includes the heading or surrounding sentence:

```text
Feature Flag: Audit Logging
Audit logging must be enabled before deployment.
```

Good chunking preserves semantic units:

| Document type | Chunking strategy |
|---|---|
| Markdown | Headings and sections |
| PDFs | Pages plus layout-aware blocks |
| Code | Functions and classes |
| Legal text | Clauses and sections |
| Tickets | Thread messages and metadata |
| Tables | Table plus caption and headers |

Chunking is not a minor preprocessing detail. It controls what the model can see.

### Query Rewriting

User queries are often underspecified. Query rewriting improves retrieval by expanding or clarifying the query before search.

Example:

```text
User query:
"How do I fix it?"

Conversation context:
Previous message mentioned CUDA out-of-memory during PyTorch training.

Rewritten query:
"PyTorch CUDA out of memory training batch size gradient checkpointing mixed precision"
```

Query rewriting may include:

| Rewrite type | Example |
|---|---|
| Context expansion | Add previous conversation details |
| Acronym expansion | “DDP” to “DistributedDataParallel” |
| Entity normalization | Product names and versions |
| Decomposition | Split complex question into subqueries |
| HyDE | Generate hypothetical answer, then retrieve |

Query rewriting is useful, but it can introduce errors. The rewritten query should remain faithful to the user’s intent.
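
A minimal rule-based sketch of two rewrite types, acronym expansion and context expansion; the acronym map and context terms are illustrative assumptions, and many systems use a model-based rewriter instead.

```python
# A minimal sketch of rule-based query rewriting; the acronym map and the
# context terms are illustrative assumptions.
ACRONYMS = {"DDP": "DistributedDataParallel", "OOM": "out of memory"}

def rewrite_query(query, context_terms=()):
    tokens = []
    for tok in query.split():
        tokens.append(tok)
        if tok in ACRONYMS:
            tokens.append(ACRONYMS[tok])   # acronym expansion
    tokens.extend(context_terms)           # context expansion from earlier turns
    return " ".join(tokens)

print(rewrite_query("DDP OOM during training", context_terms=("PyTorch", "batch size")))
```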

### Multi-Hop Retrieval

Some questions require multiple retrieval steps.

Example:

```text
Which model mentioned in the March roadmap later failed the April evaluation?
```

The system must retrieve the March roadmap, identify the model, then retrieve the April evaluation.

A multi-hop RAG system may:

1. Retrieve initial documents.
2. Extract entities.
3. Form a second query.
4. Retrieve follow-up documents.
5. Synthesize across evidence.

Multi-hop retrieval is important for research, legal analysis, debugging, and enterprise knowledge systems.

It is harder than single-shot retrieval because errors compound across steps.
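
A sketch of a two-hop loop following the steps above, with `retrieve`, `extract_entities`, and `synthesize` as hypothetical helpers:

```python
# A sketch of two-hop retrieval; all three helpers are hypothetical.
def multi_hop_answer(question, retrieve, extract_entities, synthesize, k=5):
    first_hop = retrieve(question, k=k)                   # hop 1: initial documents
    entities = extract_entities(first_hop)                # pull entities for the next query
    second_query = f"{question} {' '.join(entities)}"     # form a follow-up query
    second_hop = retrieve(second_query, k=k)              # hop 2: follow-up documents
    return synthesize(question, first_hop + second_hop)   # synthesize across both hops
```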

### RAG Versus Fine-Tuning

RAG and fine-tuning solve different problems.

| Need | Better approach |
|---|---|
| Current facts | RAG |
| Private documents | RAG |
| Source citations | RAG |
| Stable behavior style | Fine-tuning |
| Domain-specific language | Fine-tuning |
| Repeated task format | Fine-tuning |
| Lower prompt cost | Fine-tuning |
| Knowledge that changes often | RAG |

Fine-tuning changes model parameters. RAG changes the context.

For factual knowledge that updates frequently, RAG is usually preferable. For behavior and style, fine-tuning is often better.

Many production systems use both.

### RAG Evaluation

RAG evaluation should measure each stage.

| Component | Evaluation question |
|---|---|
| Retriever | Did it find relevant evidence? |
| Reranker | Did it rank the best chunks highest? |
| Generator | Did it answer correctly from evidence? |
| Citation system | Are citations accurate? |
| End-to-end system | Did the user get a correct answer? |

Important metrics include:

| Metric | Meaning |
|---|---|
| Recall@k | Relevant document appears in top k |
| Precision@k | Top k documents are relevant |
| MRR | Reciprocal rank of first relevant item |
| nDCG | Ranking quality with graded relevance |
| Faithfulness | Answer supported by sources |
| Answer correctness | Final answer is right |
| Citation accuracy | Cited source supports claim |

End-to-end accuracy alone is insufficient. A model may answer correctly from memorized knowledge while retrieval failed. Conversely, retrieval may succeed but the generator may ignore evidence.
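
A minimal sketch of Recall@k and MRR over a set of evaluated queries; `results` and `relevant` are illustrative structures mapping query ids to ranked retrieved chunk ids and to gold chunk ids.

```python
# A sketch of retrieval metrics; `results` maps query id -> ranked chunk ids,
# `relevant` maps query id -> set of gold chunk ids (illustrative structures).
def recall_at_k(results, relevant, k=5):
    hits = [bool(set(ranked[:k]) & relevant[qid]) for qid, ranked in results.items()]
    return sum(hits) / len(hits)

def mean_reciprocal_rank(results, relevant):
    reciprocal_ranks = []
    for qid, ranked in results.items():
        rank = next((i + 1 for i, cid in enumerate(ranked) if cid in relevant[qid]), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```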

### Security and Access Control

RAG systems often retrieve private data. Security must be part of the architecture.

Key requirements:

| Requirement | Purpose |
|---|---|
| Document-level permissions | Only retrieve authorized documents |
| Chunk-level filtering | Prevent leakage from mixed documents |
| Tenant isolation | Separate users and organizations |
| Audit logs | Track access |
| Redaction | Remove sensitive fields |
| Prompt injection defense | Treat documents as untrusted |
| Source visibility | Let users inspect provenance |

Access control must happen before generation. Filtering after the model sees unauthorized text is too late.
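
A sketch of permission filtering applied to retrieved candidates before prompt construction; the `acl_tags` metadata field and the group model are illustrative assumptions.

```python
# A sketch of pre-generation permission filtering; the `acl_tags` field and
# the group model are illustrative assumptions.
def authorized_chunks(candidates, user_groups):
    allowed = set(user_groups)
    return [
        c for c in candidates
        if set(c["metadata"].get("acl_tags", [])) & allowed
    ]
```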

### PyTorch View: Dense Retrieval

A dense retriever can be implemented with two encoders.

```python
import torch
import torch.nn.functional as F
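
# Assumed: `query_encoder` and `doc_encoder` are trained encoder modules, and
# the *_input_ids tensors are already tokenized batches.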

query_emb = query_encoder(query_input_ids)      # [B, D]
doc_emb = doc_encoder(doc_input_ids)            # [N, D]

query_emb = F.normalize(query_emb, dim=-1)
doc_emb = F.normalize(doc_emb, dim=-1)

scores = query_emb @ doc_emb.T                  # [B, N]
topk_scores, topk_indices = scores.topk(k=5, dim=-1)
```

The result `topk_indices` identifies the most similar document chunks.

In production, document embeddings are precomputed and stored in a vector index. The code above shows the mathematical core, not a scalable serving system.

### PyTorch View: Contrastive Retriever Training

A retriever can be trained using contrastive learning.

Suppose each query has one positive document in the same batch.

```python
import torch
import torch.nn.functional as F

query_emb = query_encoder(query_input_ids)  # [B, D]
doc_emb = doc_encoder(doc_input_ids)        # [B, D]

query_emb = F.normalize(query_emb, dim=-1)
doc_emb = F.normalize(doc_emb, dim=-1)

logits = query_emb @ doc_emb.T              # [B, B]
labels = torch.arange(logits.size(0), device=logits.device)

loss = F.cross_entropy(logits, labels)
```

Each query should match the document at the same batch index. Other documents in the batch act as negatives.

This is the standard in-batch negative training pattern.

### Practical Design Rules

A strong RAG system follows several rules.

First, preserve provenance. Every chunk should carry source metadata.

Second, use hybrid retrieval for mixed corpora. Exact terms and semantic meaning both matter.

Third, rerank before generation when precision matters.

Fourth, evaluate retrieval separately. Many failures happen before the language model sees the prompt.

Fifth, make the model admit insufficient evidence. A grounded system should not guess.

Sixth, enforce access control before retrieval results reach the model.

Seventh, treat retrieved text as untrusted data. It can contain malicious instructions.

Eighth, log retrieved chunks for debugging and audit.

### Summary

Retrieval-augmented generation connects language models to external knowledge.

The retrieval system stores and searches documents. The language model reads retrieved evidence and generates an answer.

RAG improves factual grounding, recency, personalization, and source attribution. It is especially important for enterprise search, research assistants, customer support, legal analysis, codebase assistance, and any domain where the answer depends on external documents.

The core challenge is system quality. Good RAG requires careful chunking, indexing, hybrid retrieval, reranking, prompt construction, access control, citation design, and evaluation.

