Question answering, often abbreviated QA, is the task of producing an answer to a question. The input may contain only a question, or it may contain both a question and a context passage. The output may be a span from the passage, a generated sentence, a choice from several candidates, or a structured value.
Examples:
| Question | Context | Answer |
|---|---|---|
| Who invented the World Wide Web? | Tim Berners-Lee invented the World Wide Web in 1989. | Tim Berners-Lee |
| When was PyTorch released? | PyTorch was first released in 2016. | 2016 |
| Is CUDA required for PyTorch? | PyTorch can run on CPUs and GPUs. | No |
QA is a useful benchmark because it tests several capabilities at once: language understanding, retrieval, entity recognition, reasoning over context, and answer generation.
Forms of Question Answering
There are several common QA settings.
| Type | Input | Output |
|---|---|---|
| Extractive QA | Question plus passage | Text span from passage |
| Generative QA | Question plus optional context | Generated answer text |
| Multiple-choice QA | Question plus answer options | Selected option |
| Open-domain QA | Question plus large corpus | Retrieved evidence plus answer |
| Conversational QA | Dialogue history plus question | Context-dependent answer |
| Table QA | Question plus table | Cell, row, or computed value |
The model architecture depends on the setting. Extractive QA uses span prediction. Generative QA uses sequence generation. Open-domain QA combines retrieval and reading.
Extractive Question Answering
In extractive QA, the answer must appear as a contiguous span in the context passage.
Example:
Question:
Who created Python?
Context:
Python was created by Guido van Rossum and first released in 1991.
Answer:
Guido van Rossum

The model receives the question and context together. It predicts two positions: the start index and the end index of the answer span.
If the tokenized context is:
| Index | Token |
|---|---|
| 0 | Python |
| 1 | was |
| 2 | created |
| 3 | by |
| 4 | Guido |
| 5 | van |
| 6 | Rossum |
| 7 | and |
| 8 | first |
| 9 | released |
| 10 | in |
| 11 | 1991 |
then the answer span is:

```
start_position = 4
end_position = 6
```

The model learns to assign high probability to these two positions.
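As a quick sanity check, the gold span indices can be mapped back to text in plain Python (the token list below just restates the table above):

```python
tokens = ["Python", "was", "created", "by", "Guido", "van", "Rossum",
          "and", "first", "released", "in", "1991"]
start, end = 4, 6  # gold start and end indices from the table above
print(" ".join(tokens[start:end + 1]))  # -> Guido van Rossum
```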
Input Format for Extractive QA
Transformer QA models commonly concatenate the question and context into one sequence:
```
[CLS] question tokens [SEP] context tokens [SEP]
```

For example:

```
[CLS] Who created Python ? [SEP] Python was created by Guido van Rossum . [SEP]
```

The input IDs have shape:

```
[B, T]
```

The model produces hidden states:

```
[B, T, D]
```

A span prediction head maps each token representation to two logits:

```
start_logits: [B, T]
end_logits: [B, T]
```

The highest-scoring start and end positions define the predicted answer span.
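For concreteness, here is a minimal sketch of building such an input with the Hugging Face tokenizer API; the checkpoint name is only illustrative, and any BERT-style model with [CLS]/[SEP] tokens behaves similarly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
encoded = tokenizer(
    "Who created Python?",                      # question
    "Python was created by Guido van Rossum.",  # context
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # [1, T]: [CLS] question [SEP] context [SEP]
```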
Span Prediction Head
A simple extractive QA head is a linear layer that maps each token vector to two scores: one start score and one end score.
```python
import torch
import torch.nn as nn

class ExtractiveQAHead(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_dim, 2)

    def forward(self, hidden_states):
        # hidden_states: [B, T, D]
        logits = self.qa_outputs(hidden_states)
        # logits: [B, T, 2]
        start_logits = logits[:, :, 0]
        end_logits = logits[:, :, 1]
        # start_logits: [B, T]
        # end_logits: [B, T]
        return start_logits, end_logits
```

The model does not directly generate the answer string. It selects token positions. The selected tokens are then decoded back into text.
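A quick usage sketch with dummy encoder outputs (shapes are illustrative):

```python
hidden_states = torch.randn(2, 16, 768)        # [B=2, T=16, D=768] dummy encoder output
head = ExtractiveQAHead(hidden_dim=768)
start_logits, end_logits = head(hidden_states)
print(start_logits.shape, end_logits.shape)    # torch.Size([2, 16]) torch.Size([2, 16])
```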
Training Loss for Extractive QA
The training target consists of two integer positions:
```
start_positions: [B]
end_positions: [B]
```

The loss is the sum or average of two cross-entropy losses:

```python
loss_fn = nn.CrossEntropyLoss()
start_loss = loss_fn(start_logits, start_positions)
end_loss = loss_fn(end_logits, end_positions)
loss = (start_loss + end_loss) / 2
```

The start position and end position are learned independently, but decoding usually enforces constraints such as:

```
start <= end
answer_length <= max_answer_length
```

These constraints prevent invalid spans.
Decoding Answer Spans
At inference time, the model produces start and end logits. The simplest method selects each position independently:

$$\hat{s} = \arg\max_i \, \text{start\_logits}[i], \qquad \hat{e} = \arg\max_j \, \text{end\_logits}[j]$$

However, this can produce invalid spans if $\hat{e} < \hat{s}$. A better method searches over valid pairs:

$$(\hat{s}, \hat{e}) = \arg\max_{s \le e} \; \text{start\_logits}[s] + \text{end\_logits}[e]$$
A practical decoder also limits span length:
```python
def decode_span(start_logits, end_logits, max_answer_length=30):
    best_score = None
    best_span = (0, 0)
    T = start_logits.size(0)
    for start in range(T):
        max_end = min(T, start + max_answer_length)
        for end in range(start, max_end):
            score = start_logits[start] + end_logits[end]
            if best_score is None or score > best_score:
                best_score = score
                best_span = (start, end)
    return best_span
```

This procedure is simple and reliable for short contexts.
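For longer contexts, the same search can be vectorized. The sketch below (an alternative to the loop above, not a drop-in from any library) builds the full [T, T] score matrix and masks out invalid pairs:

```python
def decode_span_vectorized(start_logits, end_logits, max_answer_length=30):
    T = start_logits.size(0)
    scores = start_logits[:, None] + end_logits[None, :]        # [T, T] pair scores
    valid = torch.triu(torch.ones(T, T, dtype=torch.bool))      # keep end >= start
    valid &= ~torch.triu(torch.ones(T, T, dtype=torch.bool),
                         diagonal=max_answer_length)            # keep end - start < max length
    scores = scores.masked_fill(~valid, float("-inf"))
    best = scores.argmax().item()
    return divmod(best, T)                                      # (start, end)
```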
Handling Long Contexts
Many passages are longer than a transformer’s maximum sequence length. A common solution is sliding-window chunking.
The context is split into overlapping windows:
```
question + context window 1
question + context window 2
question + context window 3
```

Each window is scored independently. The answer span with the highest score across all windows is selected.
Overlap is important because an answer may cross a chunk boundary. For example, if the maximum context window is 384 tokens, a stride of 128 tokens allows neighboring windows to share content.
This approach increases compute cost because one question may create several model inputs.
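A minimal sketch of the windowing step, where `stride` is taken to mean the number of tokens shared between neighboring windows (conventions differ between libraries):

```python
def make_windows(context_ids, max_context_len=384, stride=128):
    # Window starts advance by (max_context_len - stride), so neighboring
    # windows overlap by `stride` tokens.
    step = max_context_len - stride
    windows = []
    for start in range(0, len(context_ids), step):
        windows.append(context_ids[start:start + max_context_len])
        if start + max_context_len >= len(context_ids):
            break
    return windows
```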
No-Answer Questions
Some QA datasets include questions whose answer does not appear in the context.
Example:
Question:
Who founded Rust?
Context:
Python was created by Guido van Rossum.
Answer:
No answer

A common approach uses the [CLS] token as the no-answer position. If the best span score is lower than the no-answer score, the model predicts no answer.
The model must learn not only where the answer is, but whether the passage contains enough evidence to answer.
This is important in real systems. A QA model that always returns an answer may hallucinate or select irrelevant text.
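A minimal decision rule, assuming index 0 is the [CLS] position and `best_span` comes from a decoder like the one above (the threshold would be tuned on validation data):

```python
def predict_or_abstain(start_logits, end_logits, best_span, null_threshold=0.0):
    null_score = start_logits[0] + end_logits[0]                       # score of the [CLS] "span"
    span_score = start_logits[best_span[0]] + end_logits[best_span[1]]
    if null_score - span_score > null_threshold:
        return None                                                    # predict "no answer"
    return best_span
```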
Generative Question Answering
In generative QA, the model writes the answer rather than selecting a span. The answer may copy words from the context, paraphrase them, combine multiple facts, or produce a short explanation.
Example:
Question:
Why do transformers use positional encodings?
Answer:
Because self-attention has no built-in notion of token order, so positional encodings provide information about sequence position.

Generative QA is usually modeled as conditional generation:

$$p(a \mid q, c)$$

where $q$ is the question, $c$ is the context, and $a$ is the answer.
Encoder-decoder models and decoder-only language models can both be used.
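As a rough sketch of the generative setting with the Hugging Face API (the checkpoint and prompt format are only illustrative, and a small model like this will not necessarily answer well):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")        # illustrative checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

prompt = ("question: Why do transformers use positional encodings? "
          "context: Self-attention has no built-in notion of token order.")
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```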
Multiple-Choice QA
Multiple-choice QA gives a fixed set of candidate answers.
Example:
Question:
Which operation computes gradients in PyTorch?
A. optimizer.step()
B. loss.backward()
C. model.eval()
D. torch.no_grad()

The correct answer is B.
A common architecture scores each candidate independently. The input is constructed as:
```
question + candidate answer
```

For $K$ candidates, the model produces $K$ scores. Cross-entropy loss selects the correct candidate.
```python
class MultipleChoiceHead(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_states):
        # pooled_states: [B, K, D]
        scores = self.scorer(pooled_states).squeeze(-1)
        # scores: [B, K]
        return scores
```

Multiple-choice QA is easier to evaluate than free-form QA, but it can reward test-taking shortcuts if the candidates contain artifacts.
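A usage sketch with dummy pooled representations, showing how cross-entropy over the K scores selects the correct candidate:

```python
pooled_states = torch.randn(8, 4, 768)        # [B=8, K=4 candidates, D=768]
head = MultipleChoiceHead(hidden_dim=768)
scores = head(pooled_states)                  # [8, 4]
labels = torch.randint(0, 4, (8,))            # index of the correct candidate per example
loss = nn.CrossEntropyLoss()(scores, labels)
```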
Open-Domain Question Answering
Open-domain QA answers questions using a large corpus rather than a provided passage.
The system usually has two stages:
```
question
-> retriever
-> relevant passages
-> reader or generator
-> answer
```

The retriever finds candidate documents or passages. The reader extracts or generates the answer using those passages.
Retrieval methods include sparse lexical retrieval, dense vector retrieval, and hybrid retrieval.
| Retriever type | Main signal |
|---|---|
| Sparse retrieval | Keyword overlap |
| Dense retrieval | Embedding similarity |
| Hybrid retrieval | Both lexical and dense signals |
Open-domain QA depends heavily on retrieval quality. If the correct evidence is missing from retrieved passages, even a strong reader may fail.
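A minimal sketch of the dense-retrieval scoring step, assuming question and passage embeddings already come from some encoder (the shapes and the cosine-similarity choice are illustrative):

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(question_emb, passage_embs, k=5):
    # question_emb: [D], passage_embs: [N, D]
    q = F.normalize(question_emb, dim=-1)
    p = F.normalize(passage_embs, dim=-1)
    scores = p @ q                                    # [N] cosine similarities
    return torch.topk(scores, k=min(k, p.size(0)))    # (values, passage indices)
```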
Retrieval-Augmented QA
Retrieval-augmented generation, often abbreviated RAG, combines retrieved evidence with a generative model.
A typical RAG pipeline:
```
question
-> embed question
-> retrieve top-k passages
-> concatenate passages into context
-> generate answer with citations or evidence
```

The model conditions on retrieved text instead of relying only on its parameters.
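One simple way to assemble the generator's input from retrieved passages is sketched below; the numbering scheme is a hypothetical convention that lets the answer cite passages by index:

```python
def build_rag_prompt(question, passages):
    # Number each retrieved passage so the generated answer can cite its evidence.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```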
RAG is useful when the answer depends on private documents, recent information, or large knowledge bases. The model’s parametric memory is fixed after training, but retrieval can access updated content.
However, RAG introduces new failure modes:
| Failure mode | Description |
|---|---|
| Retrieval miss | Correct evidence is not retrieved |
| Context overload | Too much irrelevant context distracts the model |
| Citation mismatch | Answer cites text that does not support it |
| Conflicting evidence | Retrieved passages disagree |
| Stale index | Corpus has changed but index has not |
Conversational Question Answering
Conversational QA uses dialogue history. A user may ask follow-up questions that depend on previous turns.
Example:
```
User: Who created Python?
Assistant: Guido van Rossum.
User: When was it first released?
```

The second question contains the pronoun *it*, which refers to Python. The system must resolve the reference using dialogue context.
A conversational QA input may include:
```
previous turns + current question + retrieved context
```

The model must decide which prior information is relevant.
Conversational QA is harder than single-turn QA because questions may be underspecified, elliptical, or dependent on user intent.
Table Question Answering
Some questions require reasoning over structured tables.
Example:
| Year | Revenue |
|---|---|
| 2022 | 10 |
| 2023 | 14 |
| 2024 | 17 |
Question:
What was the revenue increase from 2022 to 2024?

Answer:
7

This requires selecting two cells and computing a difference. Table QA may involve lookup, filtering, aggregation, comparison, and arithmetic.
Models for table QA may serialize tables as text, use table-aware encoders, or generate executable programs such as SQL.
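A minimal serialization sketch (one of many possible linearizations):

```python
def serialize_table(headers, rows):
    # Flatten a table into pipe-separated lines that a text model can read.
    lines = [" | ".join(headers)]
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)

table_text = serialize_table(["Year", "Revenue"], [[2022, 10], [2023, 14], [2024, 17]])
```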
Evaluation Metrics
QA evaluation depends on the task type.
Extractive QA often uses exact match and token-level F1.
| Metric | Meaning |
|---|---|
| Exact match | Predicted answer exactly matches a reference answer |
| Token F1 | Overlap between predicted and reference answer tokens |
| Accuracy | Used for multiple-choice QA |
| Faithfulness | Whether answer is supported by context |
| Citation accuracy | Whether cited evidence supports the answer |
Exact match can be too strict. For example:
```
Guido van Rossum
van Rossum
```

may refer to the same answer, but exact match treats them as different.
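Token-level F1 gives partial credit in such cases. A simplified SQuAD-style sketch (real implementations also lowercase and strip punctuation and articles before comparing):

```python
def token_f1(prediction, reference):
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # Count reference tokens, then consume them as prediction tokens match.
    ref_counts = {}
    for t in ref_tokens:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in pred_tokens:
        if ref_counts.get(t, 0) > 0:
            common += 1
            ref_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("van Rossum", "Guido van Rossum"))  # 0.8, whereas exact match gives 0
```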
Generative QA requires more careful evaluation. A fluent answer may be unsupported. A short answer may be correct but phrased differently from the reference. Human evaluation or evidence-based checks are often needed.
Calibration and Abstention
A QA system should sometimes say that it cannot answer. This is especially important when the context lacks evidence or when the retrieved passages are weak.
Abstention can be based on:
| Signal | Example |
|---|---|
| No-answer score | Extractive model predicts [CLS] |
| Low span confidence | Best span score is weak |
| Retrieval score | Retrieved evidence is poor |
| Entailment check | Answer unsupported by passage |
| Uncertainty estimate | Model distribution is diffuse |
A reliable QA system should prefer abstention over unsupported answers in high-stakes settings.
Common Errors
Question answering systems commonly fail in several ways.
| Error type | Example |
|---|---|
| Wrong span | Selects nearby but incorrect phrase |
| Boundary error | Misses part of the answer |
| No-answer failure | Answers when context lacks evidence |
| Multi-hop failure | Fails to combine facts from multiple passages |
| Retrieval failure | Correct document is not retrieved |
| Hallucination | Generates unsupported answer |
| Temporal error | Uses outdated information |
| Coreference error | Misunderstands pronouns or follow-ups |
For extractive QA, many errors come from span boundaries. For generative QA, many errors come from unsupported synthesis.
Practical PyTorch Dataset Format
A QA dataset usually stores tokenized inputs and answer positions.
For extractive QA:
```python
example = {
    "input_ids": torch.tensor([...], dtype=torch.long),
    "attention_mask": torch.tensor([...], dtype=torch.long),
    "start_positions": torch.tensor(14, dtype=torch.long),
    "end_positions": torch.tensor(17, dtype=torch.long),
}
```

For generative QA:
```python
example = {
    "input_ids": torch.tensor([...], dtype=torch.long),
    "attention_mask": torch.tensor([...], dtype=torch.long),
    "labels": torch.tensor([...], dtype=torch.long),
}
```

In both cases, padding and truncation must be handled carefully. For extractive QA, if the answer span falls outside a truncated window, that example must be dropped, marked as no-answer, or represented in another window.
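A minimal Dataset wrapper around such pre-tokenized examples (a sketch; real pipelines often tokenize and window examples inside the dataset or a collate function):

```python
from torch.utils.data import Dataset

class ExtractiveQADataset(Dataset):
    def __init__(self, examples):
        # examples: a list of dicts shaped like the extractive example above
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]
```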
Summary
Question answering maps questions to answers. Extractive QA predicts answer spans from a passage. Generative QA writes answer text. Multiple-choice QA selects among candidates. Open-domain QA adds retrieval over a large corpus.
Modern QA systems are usually transformer-based. Extractive models use start and end position heads. Generative models use sequence generation. Retrieval-augmented systems combine search with neural readers or language models.
A useful QA system must handle long contexts, missing evidence, ambiguous questions, and evaluation beyond surface string match.