# Question Answering

Question answering, often abbreviated **QA**, is the task of producing an answer to a question. The input may contain only a question, or it may contain both a question and a context passage. The output may be a span from the passage, a generated sentence, a choice from several candidates, or a structured value.

Examples:

| Question | Context | Answer |
|---|---|---|
| `Who invented the World Wide Web?` | `Tim Berners-Lee invented the World Wide Web in 1989.` | `Tim Berners-Lee` |
| `When was PyTorch released?` | `PyTorch was first released in 2016.` | `2016` |
| `Is CUDA required for PyTorch?` | `PyTorch can run on CPUs and GPUs.` | `No` |

QA is a useful benchmark because it tests several capabilities at once: language understanding, retrieval, entity recognition, reasoning over context, and answer generation.

### Forms of Question Answering

There are several common QA settings.

| Type | Input | Output |
|---|---|---|
| Extractive QA | Question plus passage | Text span from passage |
| Generative QA | Question plus optional context | Generated answer text |
| Multiple-choice QA | Question plus answer options | Selected option |
| Open-domain QA | Question plus large corpus | Retrieved evidence plus answer |
| Conversational QA | Dialogue history plus question | Context-dependent answer |
| Table QA | Question plus table | Cell, row, or computed value |

The model architecture depends on the setting. Extractive QA uses span prediction. Generative QA uses sequence generation. Open-domain QA combines retrieval and reading.

### Extractive Question Answering

In extractive QA, the answer must appear as a contiguous span in the context passage.

Example:

```text
Question:
Who created Python?

Context:
Python was created by Guido van Rossum and first released in 1991.

Answer:
Guido van Rossum
```

The model receives the question and context together. It predicts two positions: the start index and the end index of the answer span.

If the tokenized context is:

| Index | Token |
|---:|---|
| 0 | Python |
| 1 | was |
| 2 | created |
| 3 | by |
| 4 | Guido |
| 5 | van |
| 6 | Rossum |
| 7 | and |
| 8 | first |
| 9 | released |
| 10 | in |
| 11 | 1991 |

then the answer span is:

$$
\text{start} = 4,\quad \text{end} = 6.
$$

The model learns to assign high probability to these two positions.
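Continuing the example, the span $(4, 6)$ selects the answer tokens directly. A minimal sketch, using the whitespace tokens from the table above (note that the end index is inclusive, so slicing needs `end + 1`):

```python
# Tokenized context from the table above (whitespace tokens for illustration).
tokens = ["Python", "was", "created", "by", "Guido", "van", "Rossum",
          "and", "first", "released", "in", "1991"]

start, end = 4, 6  # inclusive span indices

# Python slicing is end-exclusive, so add 1 to include the end token.
answer = " ".join(tokens[start:end + 1])
print(answer)  # Guido van Rossum
```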

### Input Format for Extractive QA

Transformer QA models commonly concatenate the question and context into one sequence:

```text
[CLS] question tokens [SEP] context tokens [SEP]
```

For example:

```text
[CLS] Who created Python ? [SEP] Python was created by Guido van Rossum . [SEP]
```

The input IDs have shape:

```text
[B, T]
```

The model produces hidden states:

```text
[B, T, D]
```

A span prediction head maps each token representation to two logits:

```text
start_logits: [B, T]
end_logits:   [B, T]
```

The highest-scoring start and end positions define the predicted answer span.

### Span Prediction Head

A simple extractive QA head is a linear layer that maps each token vector to two scores: one start score and one end score.

```python
import torch
import torch.nn as nn

class ExtractiveQAHead(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_dim, 2)

    def forward(self, hidden_states):
        # hidden_states: [B, T, D]
        logits = self.qa_outputs(hidden_states)
        # logits: [B, T, 2]

        start_logits = logits[:, :, 0]
        end_logits = logits[:, :, 1]
        # start_logits: [B, T]
        # end_logits: [B, T]

        return start_logits, end_logits
```

The model does not directly generate the answer string. It selects token positions. The selected tokens are then decoded back into text.
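As a quick sanity check, the head above can be exercised on random hidden states. This sketch applies the same `nn.Linear(hidden_dim, 2)` layer directly and takes independent argmaxes; proper constrained decoding is covered in the decoding section below.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, D = 2, 12, 16  # batch size, sequence length, hidden dim

hidden_states = torch.randn(B, T, D)      # [B, T, D]
qa_outputs = nn.Linear(D, 2)              # same layer as in ExtractiveQAHead

logits = qa_outputs(hidden_states)        # [B, T, 2]
start_logits = logits[:, :, 0]            # [B, T]
end_logits = logits[:, :, 1]              # [B, T]

# Naive prediction: independent argmax over token positions.
start_pred = start_logits.argmax(dim=-1)  # [B]
end_pred = end_logits.argmax(dim=-1)      # [B]
```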

### Training Loss for Extractive QA

The training target consists of two integer positions:

```text
start_positions: [B]
end_positions:   [B]
```

The loss is the sum or average of two cross-entropy losses:

```python
loss_fn = nn.CrossEntropyLoss()

# start_logits, end_logits: [B, T]
# start_positions, end_positions: [B]
start_loss = loss_fn(start_logits, start_positions)
end_loss = loss_fn(end_logits, end_positions)

loss = (start_loss + end_loss) / 2
```

The start and end positions are predicted independently, but decoding usually enforces constraints such as:

```text
start <= end
answer_length <= max_answer_length
```

These constraints prevent invalid spans.

### Decoding Answer Spans

At inference time, the model produces start and end logits. The simplest method selects:

$$
\hat{s} = \arg\max_i z_i^{start},
\quad
\hat{e} = \arg\max_j z_j^{end}.
$$

However, this can produce invalid spans if $\hat{e} < \hat{s}$. A better method searches over valid pairs:

$$
(\hat{s}, \hat{e}) =
\arg\max_{s \le e}
\left(z_s^{start} + z_e^{end}\right).
$$

A practical decoder also limits span length:

```python
def decode_span(start_logits, end_logits, max_answer_length=30):
    # start_logits, end_logits: [T] for a single example
    best_score = None
    best_span = (0, 0)

    T = start_logits.size(0)

    for start in range(T):
        # Enforce start <= end and answer_length <= max_answer_length.
        max_end = min(T, start + max_answer_length)
        for end in range(start, max_end):
            score = start_logits[start] + end_logits[end]
            if best_score is None or score > best_score:
                best_score = score
                best_span = (start, end)

    return best_span
```

This procedure is simple and reliable for short contexts.

### Handling Long Contexts

Many passages are longer than a transformer’s maximum sequence length. A common solution is sliding-window chunking.

The context is split into overlapping windows:

```text
question + context window 1
question + context window 2
question + context window 3
```

Each window is scored independently. The answer span with the highest score across all windows is selected.

Overlap is important because an answer may cross a chunk boundary. For example, if the maximum context window is 384 tokens, a stride of 128 tokens allows neighboring windows to share content.

This approach increases compute cost because one question may create several model inputs.
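The chunking step can be sketched in a few lines. This is a minimal version that works on token IDs directly; the window size and stride values mirror the numbers mentioned above:

```python
def make_windows(context_ids, max_context_len=384, stride=128):
    """Split token IDs into overlapping windows (a minimal sketch).

    Each window starts `stride` tokens after the previous one, so
    neighboring windows share max_context_len - stride tokens.
    """
    windows = []
    start = 0
    while True:
        windows.append(context_ids[start:start + max_context_len])
        if start + max_context_len >= len(context_ids):
            break
        start += stride
    return windows

# 1000 dummy token IDs -> overlapping windows of at most 384 tokens each.
windows = make_windows(list(range(1000)))
print(len(windows))  # 6
```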

### No-Answer Questions

Some QA datasets include questions whose answer does not appear in the context.

Example:

```text
Question:
Who founded Rust?

Context:
Python was created by Guido van Rossum.

Answer:
No answer
```

A common approach uses the `[CLS]` token as the no-answer position. If the best span score is lower than the no-answer score, the model predicts no answer.

The model must learn not only where the answer is, but whether the passage contains enough evidence to answer.

This is important in real systems. A QA model that always returns an answer may hallucinate or select irrelevant text.
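The comparison between the best span score and the no-answer score can be sketched as follows. This is a minimal version that assumes the `[CLS]` token sits at position 0 and omits the answer-length cap from the decoder above; the `null_threshold` parameter is illustrative and is usually tuned on validation data.

```python
import torch

def predict_with_no_answer(start_logits, end_logits, null_threshold=0.0):
    """Predict a span or no answer (a sketch; position 0 is [CLS])."""
    T = start_logits.size(0)
    no_answer_score = start_logits[0] + end_logits[0]

    best_score, best_span = None, None
    for s in range(1, T):          # skip [CLS] when searching real spans
        for e in range(s, T):
            score = start_logits[s] + end_logits[e]
            if best_score is None or score > best_score:
                best_score, best_span = score, (s, e)

    # Predict "no answer" when the null score wins by at least the threshold.
    if no_answer_score - best_score >= null_threshold:
        return None
    return best_span

# Toy logits where the [CLS] position dominates, so no answer is predicted.
start = torch.tensor([5.0, 0.1, 0.2, 0.1])
end = torch.tensor([5.0, 0.1, 0.1, 0.2])
print(predict_with_no_answer(start, end))  # None
```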

### Generative Question Answering

In generative QA, the model writes the answer rather than selecting a span. The answer may copy words from the context, paraphrase them, combine multiple facts, or produce a short explanation.

Example:

```text
Question:
Why do transformers use positional encodings?

Answer:
Because self-attention has no built-in notion of token order, so positional encodings provide information about sequence position.
```

Generative QA is usually modeled as conditional generation:

$$
P(a \mid q, c) =
\prod_{t=1}^{T}
P(a_t \mid a_{<t}, q, c),
$$

where $q$ is the question, $c$ is the context, and $a$ is the answer.

Encoder-decoder models and decoder-only language models can both be used.
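The factorization above is trained with teacher forcing: at each position, the model is scored on the next gold answer token. A minimal sketch of the per-token cross-entropy loss, using random logits in place of a real model's output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, V = 2, 5, 100  # batch size, answer length, vocabulary size

# Pretend these are the model's next-token logits for the answer tokens,
# conditioned on the question, context, and previous answer tokens.
logits = torch.randn(B, T, V)           # [B, T, V]
labels = torch.randint(0, V, (B, T))    # [B, T] gold answer token IDs

loss_fn = nn.CrossEntropyLoss()
# Flatten so each answer position is one classification over the vocabulary.
loss = loss_fn(logits.reshape(B * T, V), labels.reshape(B * T))
```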

### Multiple-Choice QA

Multiple-choice QA gives a fixed set of candidate answers.

Example:

```text
Question:
Which operation computes gradients in PyTorch?

A. optimizer.step()
B. loss.backward()
C. model.eval()
D. torch.no_grad()
```

The correct answer is B.

A common architecture scores each candidate independently. The input is constructed as:

```text
question + candidate answer
```

For $K$ candidates, the model produces $K$ scores. A softmax cross-entropy loss over the scores trains the model to rank the correct candidate highest.

```python
class MultipleChoiceHead(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_states):
        # pooled_states: [B, K, D]
        scores = self.scorer(pooled_states).squeeze(-1)
        # scores: [B, K]
        return scores
```
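A usage sketch of this head, applying the same `nn.Linear(hidden_dim, 1)` scorer to random pooled states and training with cross-entropy over the $K$ candidates:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, K, D = 4, 4, 16  # batch size, number of candidates, hidden dim

scorer = nn.Linear(D, 1)            # same layer as in MultipleChoiceHead
pooled = torch.randn(B, K, D)       # one pooled vector per candidate

scores = scorer(pooled).squeeze(-1)  # [B, K]
labels = torch.tensor([1, 0, 3, 2])  # index of the correct candidate

loss = nn.CrossEntropyLoss()(scores, labels)
pred = scores.argmax(dim=-1)         # [B] predicted candidate per example
```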

Multiple-choice QA is easier to evaluate than free-form QA, but it can reward test-taking shortcuts if the candidates contain artifacts.

### Open-Domain Question Answering

Open-domain QA answers questions using a large corpus rather than a provided passage.

The system usually has two stages:

```text
question
-> retriever
-> relevant passages
-> reader or generator
-> answer
```

The retriever finds candidate documents or passages. The reader extracts or generates the answer using those passages.

Retrieval methods include sparse lexical retrieval, dense vector retrieval, and hybrid retrieval.

| Retriever type | Main signal |
|---|---|
| Sparse retrieval | Keyword overlap |
| Dense retrieval | Embedding similarity |
| Hybrid retrieval | Both lexical and dense signals |

Open-domain QA depends heavily on retrieval quality. If the correct evidence is missing from retrieved passages, even a strong reader may fail.
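The dense retrieval row above can be made concrete. A minimal sketch, using random vectors in place of embeddings from a real question/passage encoder and ranking by cosine similarity:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D = 8  # embedding dimension

# Pretend embeddings from some encoder (random here for illustration).
question_emb = torch.randn(D)
passage_embs = torch.randn(5, D)  # 5 candidate passages

# Dense retrieval ranks passages by embedding similarity (cosine here).
sims = F.cosine_similarity(question_emb.unsqueeze(0), passage_embs, dim=-1)
topk = sims.topk(k=2)
print(topk.indices)  # indices of the 2 most similar passages
```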

### Retrieval-Augmented QA

Retrieval-augmented generation, often abbreviated RAG, combines retrieved evidence with a generative model.

A typical RAG pipeline:

```text
question
-> embed question
-> retrieve top-k passages
-> concatenate passages into context
-> generate answer with citations or evidence
```

The model conditions on retrieved text instead of relying only on its parameters.
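The concatenation step can be sketched as a plain prompt template. This version is illustrative, not a standard format; the passage numbering is one common way to let the generator cite evidence by index:

```python
def build_rag_prompt(question, passages):
    """Concatenate retrieved passages into one generator input (a sketch)."""
    lines = []
    for i, passage in enumerate(passages, start=1):
        lines.append(f"[{i}] {passage}")
    lines.append(f"Question: {question}")
    lines.append("Answer using the numbered passages above:")
    return "\n".join(lines)

prompt = build_rag_prompt(
    "Who created Python?",
    ["Python was created by Guido van Rossum.",
     "PyTorch was first released in 2016."],
)
print(prompt)
```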

RAG is useful when the answer depends on private documents, recent information, or large knowledge bases. The model’s parametric memory is fixed after training, but retrieval can access updated content.

However, RAG introduces new failure modes:

| Failure mode | Description |
|---|---|
| Retrieval miss | Correct evidence is not retrieved |
| Context overload | Too much irrelevant context distracts the model |
| Citation mismatch | Answer cites text that does not support it |
| Conflicting evidence | Retrieved passages disagree |
| Stale index | Corpus has changed but index has not |

### Conversational Question Answering

Conversational QA uses dialogue history. A user may ask follow-up questions that depend on previous turns.

Example:

```text
User: Who created Python?
Assistant: Guido van Rossum.
User: When was it first released?
```

The second question contains the pronoun `it`, which refers to Python. The system must resolve the reference using dialogue context.

A conversational QA input may include:

```text
previous turns + current question + retrieved context
```

The model must decide which prior information is relevant.
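The input construction above can be sketched as a simple flattening of the dialogue history. The speaker-prefix template here is illustrative; real systems vary in how they mark turns and where retrieved context is placed:

```python
def build_conversational_input(history, question, context=None):
    """Flatten dialogue history into one model input (a minimal sketch).

    `history` is a list of (speaker, utterance) pairs.
    """
    parts = [f"{speaker}: {utterance}" for speaker, utterance in history]
    parts.append(f"User: {question}")
    if context is not None:
        parts.append(f"Context: {context}")
    return "\n".join(parts)

text = build_conversational_input(
    [("User", "Who created Python?"), ("Assistant", "Guido van Rossum.")],
    "When was it first released?",
)
print(text)
```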

Conversational QA is harder than single-turn QA because questions may be underspecified, elliptical, or dependent on user intent.

### Table Question Answering

Some questions require reasoning over structured tables.

Example:

| Year | Revenue |
|---:|---:|
| 2022 | 10 |
| 2023 | 14 |
| 2024 | 17 |

Question:

```text
What was the revenue increase from 2022 to 2024?
```

Answer:

```text
7
```

This requires selecting two cells and computing a difference. Table QA may involve lookup, filtering, aggregation, comparison, and arithmetic.

Models for table QA may serialize tables as text, use table-aware encoders, or generate executable programs such as SQL.
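The serialization option can be sketched directly. This linearization pairs each cell with its column header so a sequence model can locate values; the separator choices are illustrative:

```python
def serialize_table(headers, rows):
    """Serialize a table into flat text for a sequence model (a sketch)."""
    parts = []
    for row in rows:
        cells = [f"{h}: {v}" for h, v in zip(headers, row)]
        parts.append("; ".join(cells))
    return " | ".join(parts)

text = serialize_table(["Year", "Revenue"], [[2022, 10], [2023, 14], [2024, 17]])
print(text)
# Year: 2022; Revenue: 10 | Year: 2023; Revenue: 14 | Year: 2024; Revenue: 17
```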

### Evaluation Metrics

QA evaluation depends on the task type.

Extractive QA often uses exact match and token-level F1.

| Metric | Meaning |
|---|---|
| Exact match | Predicted answer exactly matches a reference answer |
| Token F1 | Overlap between predicted and reference answer tokens |
| Accuracy | Used for multiple-choice QA |
| Faithfulness | Whether answer is supported by context |
| Citation accuracy | Whether cited evidence supports the answer |

Exact match can be too strict. For example:

```text
Guido van Rossum
van Rossum
```

may refer to the same answer, but exact match treats them as different.
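A minimal sketch of the two metrics makes the difference concrete; real evaluation scripts also strip articles and punctuation before comparing, which this version omits:

```python
def exact_match(pred, gold):
    """1.0 if the normalized strings match exactly, else 0.0 (a sketch).

    Normalization here is only lowercasing and whitespace collapsing.
    """
    norm = lambda s: " ".join(s.lower().split())
    return float(norm(pred) == norm(gold))

def token_f1(pred, gold):
    """Token-level F1 between predicted and reference answers."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()

    # Count overlapping tokens, respecting multiplicity.
    gold_counts = {}
    for t in gold_tokens:
        gold_counts[t] = gold_counts.get(t, 0) + 1
    common = 0
    for t in pred_tokens:
        if gold_counts.get(t, 0) > 0:
            common += 1
            gold_counts[t] -= 1

    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("van Rossum", "Guido van Rossum"))  # 0.0
print(token_f1("van Rossum", "Guido van Rossum"))     # 0.8
```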

Generative QA requires more careful evaluation. A fluent answer may be unsupported. A short answer may be correct but phrased differently from the reference. Human evaluation or evidence-based checks are often needed.

### Calibration and Abstention

A QA system should sometimes say that it cannot answer. This is especially important when the context lacks evidence or when the retrieved passages are weak.

Abstention can be based on:

| Signal | Example |
|---|---|
| No-answer score | Extractive model predicts `[CLS]` |
| Low span confidence | Best span score is weak |
| Retrieval score | Retrieved evidence is poor |
| Entailment check | Answer unsupported by passage |
| Uncertainty estimate | Model distribution is diffuse |

A reliable QA system should prefer abstention over unsupported answers in high-stakes settings.

### Common Errors

Question answering systems commonly fail in several ways.

| Error type | Example |
|---|---|
| Wrong span | Selects nearby but incorrect phrase |
| Boundary error | Misses part of the answer |
| No-answer failure | Answers when context lacks evidence |
| Multi-hop failure | Fails to combine facts from multiple passages |
| Retrieval failure | Correct document is not retrieved |
| Hallucination | Generates unsupported answer |
| Temporal error | Uses outdated information |
| Coreference error | Misunderstands pronouns or follow-ups |

For extractive QA, many errors come from span boundaries. For generative QA, many errors come from unsupported synthesis.

### Practical PyTorch Dataset Format

A QA dataset usually stores tokenized inputs and answer positions.

For extractive QA:

```python
example = {
    "input_ids": torch.tensor([...], dtype=torch.long),
    "attention_mask": torch.tensor([...], dtype=torch.long),
    "start_positions": torch.tensor(14, dtype=torch.long),
    "end_positions": torch.tensor(17, dtype=torch.long),
}
```

For generative QA:

```python
example = {
    "input_ids": torch.tensor([...], dtype=torch.long),
    "attention_mask": torch.tensor([...], dtype=torch.long),
    "labels": torch.tensor([...], dtype=torch.long),
}
```

In both cases, padding and truncation must be handled carefully. For extractive QA, if the answer span falls outside a truncated window, that example must be dropped, marked as no-answer, or represented in another window.
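The last point can be sketched with token-level positions. This helper maps a global answer span into one context window (window bounds inclusive), returning `None` when the answer does not fit, in which case the example is dropped or labeled as no-answer:

```python
def label_window(answer_start, answer_end, window_start, window_end):
    """Map a global token span into window-relative positions (a sketch).

    All positions are inclusive token indices. Returns None when the
    answer does not lie fully inside the window.
    """
    if answer_start >= window_start and answer_end <= window_end:
        return answer_start - window_start, answer_end - window_start
    return None

print(label_window(100, 103, 0, 383))    # (100, 103)
print(label_window(400, 403, 0, 383))    # None
print(label_window(400, 403, 128, 511))  # (272, 275)
```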

### Summary

Question answering maps questions to answers. Extractive QA predicts answer spans from a passage. Generative QA writes answer text. Multiple-choice QA selects among candidates. Open-domain QA adds retrieval over a large corpus.

Modern QA systems are usually transformer-based. Extractive models use start and end position heads. Generative models use sequence generation. Retrieval-augmented systems combine search with neural readers or language models.

A useful QA system must handle long contexts, missing evidence, ambiguous questions, and evaluation beyond surface string match.

