Question answering, often abbreviated QA, is the task of producing an answer to a question. The input may contain only a question, or it may contain both a question and a context passage. The output may be a span from the passage, a generated sentence, a choice from several candidates, or a structured value.
Examples:
| Question | Context | Answer |
|---|---|---|
| Who invented the World Wide Web? | Tim Berners-Lee invented the World Wide Web in 1989. | Tim Berners-Lee |
| When was PyTorch released? | PyTorch was first released in 2016. | 2016 |
| Is CUDA required for PyTorch? | PyTorch can run on CPUs and GPUs. | No |
QA is a useful benchmark because it tests several capabilities at once: language understanding, retrieval, entity recognition, reasoning over context, and answer generation.
Forms of Question Answering
There are several common QA settings.
| Type | Input | Output |
|---|---|---|
| Extractive QA | Question plus passage | Text span from passage |
| Generative QA | Question plus optional context | Generated answer text |
| Multiple-choice QA | Question plus answer options | Selected option |
| Open-domain QA | Question plus large corpus | Retrieved evidence plus answer |
| Conversational QA | Dialogue history plus question | Context-dependent answer |
| Table QA | Question plus table | Cell, row, or computed value |
The model architecture depends on the setting. Extractive QA uses span prediction. Generative QA uses sequence generation. Open-domain QA combines retrieval and reading.
Extractive Question Answering
In extractive QA, the answer must appear as a contiguous span in the context passage.
Example:
Question:
Who created Python?
Context:
Python was created by Guido van Rossum and first released in 1991.
Answer:
Guido van Rossum

The model receives the question and context together. It predicts two positions: the start index and the end index of the answer span.
If the tokenized context is:
| Index | Token |
|---|---|
| 0 | Python |
| 1 | was |
| 2 | created |
| 3 | by |
| 4 | Guido |
| 5 | van |
| 6 | Rossum |
| 7 | and |
| 8 | first |
| 9 | released |
| 10 | in |
| 11 | 1991 |
then the answer span is:

```
start_position = 4
end_position = 6
```

The model learns to assign high probability to these two positions.
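As a quick sanity check, the gold span indices can be mapped back to text in plain Python (the token list below just restates the table above):

```python
tokens = ["Python", "was", "created", "by", "Guido", "van", "Rossum",
          "and", "first", "released", "in", "1991"]
start, end = 4, 6  # gold start and end indices from the table above
print(" ".join(tokens[start:end + 1]))  # -> Guido van Rossum
```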
Input Format for Extractive QA
Transformer QA models commonly concatenate the question and context into one sequence:
```
[CLS] question tokens [SEP] context tokens [SEP]
```

For example:

```
[CLS] Who created Python ? [SEP] Python was created by Guido van Rossum . [SEP]
```

The input IDs have shape:

```
[B, T]
```

The model produces hidden states:

```
[B, T, D]
```

A span prediction head maps each token representation to two logits:

```
start_logits: [B, T]
end_logits: [B, T]
```

The highest-scoring start and end positions define the predicted answer span.
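For concreteness, here is a minimal sketch of building such an input with the Hugging Face tokenizer API; the checkpoint name is only illustrative, and any BERT-style model with [CLS]/[SEP] tokens behaves similarly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
encoded = tokenizer(
    "Who created Python?",                      # question
    "Python was created by Guido van Rossum.",  # context
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # [1, T]: [CLS] question [SEP] context [SEP]
```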
Span Prediction Head
A simple extractive QA head is a linear layer that maps each token vector to two scores: one start score and one end score.
```python
import torch
import torch.nn as nn

class ExtractiveQAHead(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_dim, 2)

    def forward(self, hidden_states):
        # hidden_states: [B, T, D]
        logits = self.qa_outputs(hidden_states)
        # logits: [B, T, 2]
        start_logits = logits[:, :, 0]
        end_logits = logits[:, :, 1]
        # start_logits: [B, T]
        # end_logits: [B, T]
        return start_logits, end_logits
```

The model does not directly generate the answer string. It selects token positions. The selected tokens are then decoded back into text.
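A quick usage sketch with dummy encoder outputs (shapes are illustrative):

```python
hidden_states = torch.randn(2, 16, 768)        # [B=2, T=16, D=768] dummy encoder output
head = ExtractiveQAHead(hidden_dim=768)
start_logits, end_logits = head(hidden_states)
print(start_logits.shape, end_logits.shape)    # torch.Size([2, 16]) torch.Size([2, 16])
```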
Training Loss for Extractive QA
The training target consists of two integer positions:
```
start_positions: [B]
end_positions: [B]
```

The loss is the sum or average of two cross-entropy losses:

```python
loss_fn = nn.CrossEntropyLoss()
start_loss = loss_fn(start_logits, start_positions)
end_loss = loss_fn(end_logits, end_positions)
loss = (start_loss + end_loss) / 2
```

The start position and end position are learned independently, but decoding usually enforces constraints such as:

```
start <= end
answer_length <= max_answer_length
```

These constraints prevent invalid spans.
Decoding Answer Spans
At inference time, the model produces start and end logits. The simplest method selects each position independently:

$$\hat{s} = \arg\max_i \, \text{start\_logits}[i], \qquad \hat{e} = \arg\max_j \, \text{end\_logits}[j]$$

However, this can produce invalid spans if $\hat{e} < \hat{s}$. A better method searches over valid pairs:

$$(\hat{s}, \hat{e}) = \arg\max_{s \le e} \; \text{start\_logits}[s] + \text{end\_logits}[e]$$
A practical decoder also limits span length:
```python
def decode_span(start_logits, end_logits, max_answer_length=30):
    best_score = None
    best_span = (0, 0)
    T = start_logits.size(0)
    for start in range(T):
        max_end = min(T, start + max_answer_length)
        for end in range(start, max_end):
            score = start_logits[start] + end_logits[end]
            if best_score is None or score > best_score:
                best_score = score
                best_span = (start, end)
    return best_span
```

This procedure is simple and reliable for short contexts.
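For longer contexts, the same search can be vectorized. The sketch below (an alternative to the loop above, not a drop-in from any library) builds the full [T, T] score matrix and masks out invalid pairs:

```python
def decode_span_vectorized(start_logits, end_logits, max_answer_length=30):
    T = start_logits.size(0)
    scores = start_logits[:, None] + end_logits[None, :]        # [T, T] pair scores
    valid = torch.triu(torch.ones(T, T, dtype=torch.bool))      # keep end >= start
    valid &= ~torch.triu(torch.ones(T, T, dtype=torch.bool),
                         diagonal=max_answer_length)            # keep end - start < max length
    scores = scores.masked_fill(~valid, float("-inf"))
    best = scores.argmax().item()
    return divmod(best, T)                                      # (start, end)
```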
Handling Long Contexts
Many passages are longer than a transformer’s maximum sequence length. A common solution is sliding-window chunking.
The context is split into overlapping windows:
```
question + context window 1
question + context window 2
question + context window 3
```

Each window is scored independently. The answer span with the highest score across all windows is selected.
Overlap is important because an answer may cross a chunk boundary. For example, if the maximum context window is 384 tokens, a stride of 128 tokens allows neighboring windows to share content.
This approach increases compute cost because one question may create several model inputs.
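A minimal sketch of the windowing step, where `stride` is taken to mean the number of tokens shared between neighboring windows (conventions differ between libraries):

```python
def make_windows(context_ids, max_context_len=384, stride=128):
    # Window starts advance by (max_context_len - stride), so neighboring
    # windows overlap by `stride` tokens.
    step = max_context_len - stride
    windows = []
    for start in range(0, len(context_ids), step):
        windows.append(context_ids[start:start + max_context_len])
        if start + max_context_len >= len(context_ids):
            break
    return windows
```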
No-Answer Questions
Some QA datasets include questions whose answer does not appear in the context.
Example:
Question:
Who founded Rust?
Context:
Python was created by Guido van Rossum.
Answer:
No answer

A common approach uses the [CLS] token as the no-answer position. If the best span score is lower than the no-answer score, the model predicts no answer.
The model must learn not only where the answer is, but whether the passage contains enough evidence to answer.
This is important in real systems. A QA model that always returns an answer may hallucinate or select irrelevant text.
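A minimal decision rule, assuming index 0 is the [CLS] position and `best_span` comes from a decoder like the one above (the threshold would be tuned on validation data):

```python
def predict_or_abstain(start_logits, end_logits, best_span, null_threshold=0.0):
    null_score = start_logits[0] + end_logits[0]                       # score of the [CLS] "span"
    span_score = start_logits[best_span[0]] + end_logits[best_span[1]]
    if null_score - span_score > null_threshold:
        return None                                                    # predict "no answer"
    return best_span
```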
Generative Question Answering
In generative QA, the model writes the answer rather than selecting a span. The answer may copy words from the context, paraphrase them, combine multiple facts, or produce a short explanation.
Example:
Question:
Why do transformers use positional encodings?
Answer:
Because self-attention has no built-in notion of token order, so positional encodings provide information about sequence position.

Generative QA is usually modeled as conditional generation:

$$p(a \mid q, c)$$

where $q$ is the question, $c$ is the context, and $a$ is the answer.
Encoder-decoder models and decoder-only language models can both be used.
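As a rough sketch of the generative setting with the Hugging Face API (the checkpoint and prompt format are only illustrative, and a small model like this will not necessarily answer well):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")        # illustrative checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

prompt = ("question: Why do transformers use positional encodings? "
          "context: Self-attention has no built-in notion of token order.")
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```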
Multiple-Choice QA
Multiple-choice QA gives a fixed set of candidate answers.
Example:
Question:
Which operation computes gradients in PyTorch?
A. optimizer.step()
B. loss.backward()
C. model.eval()
D. torch.no_grad()

The correct answer is B.
A common architecture scores each candidate independently. The input is constructed as:
```
question + candidate answer
```

For $K$ candidates, the model produces $K$ scores. Cross-entropy loss selects the correct candidate.
```python
class MultipleChoiceHead(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_states):
        # pooled_states: [B, K, D]
        scores = self.scorer(pooled_states).squeeze(-1)
        # scores: [B, K]
        return scores
```

Multiple-choice QA is easier to evaluate than free-form QA, but it can reward test-taking shortcuts if the candidates contain artifacts.
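A usage sketch with dummy pooled representations, showing how cross-entropy over the K scores selects the correct candidate:

```python
pooled_states = torch.randn(8, 4, 768)        # [B=8, K=4 candidates, D=768]
head = MultipleChoiceHead(hidden_dim=768)
scores = head(pooled_states)                  # [8, 4]
labels = torch.randint(0, 4, (8,))            # index of the correct candidate per example
loss = nn.CrossEntropyLoss()(scores, labels)
```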
Open-Domain Question Answering
Open-domain QA answers questions using a large corpus rather than a provided passage.
The system usually has two stages:
```
question
-> retriever
-> relevant passages
-> reader or generator
-> answer
```

The retriever finds candidate documents or passages. The reader extracts or generates the answer using those passages.
Retrieval methods include sparse lexical retrieval, dense vector retrieval, and hybrid retrieval.
| Retriever type | Main signal |
|---|---|
| Sparse retrieval | Keyword overlap |
| Dense retrieval | Embedding similarity |
| Hybrid retrieval | Both lexical and dense signals |
Open-domain QA depends heavily on retrieval quality. If the correct evidence is missing from retrieved passages, even a strong reader may fail.
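A minimal sketch of the dense-retrieval scoring step, assuming question and passage embeddings already come from some encoder (the shapes and the cosine-similarity choice are illustrative):

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(question_emb, passage_embs, k=5):
    # question_emb: [D], passage_embs: [N, D]
    q = F.normalize(question_emb, dim=-1)
    p = F.normalize(passage_embs, dim=-1)
    scores = p @ q                                    # [N] cosine similarities
    return torch.topk(scores, k=min(k, p.size(0)))    # (values, passage indices)
```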
Retrieval-Augmented QA
Retrieval-augmented generation, often abbreviated RAG, combines retrieved evidence with a generative model.
A typical RAG pipeline:
```
question
-> embed question
-> retrieve top-k passages
-> concatenate passages into context
-> generate answer with citations or evidence
```

The model conditions on retrieved text instead of relying only on its parameters.
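One simple way to assemble the generator's input from retrieved passages is sketched below; the numbering scheme is a hypothetical convention that lets the answer cite passages by index:

```python
def build_rag_prompt(question, passages):
    # Number each retrieved passage so the generated answer can cite its evidence.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```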
RAG is useful when the answer depends on private documents, recent information, or large knowledge bases. The model’s parametric memory is fixed after training, but retrieval can access updated content.
However, RAG introduces new failure modes:
| Failure mode | Description |
|---|---|
| Retrieval miss | Correct evidence is not retrieved |
| Context overload | Too much irrelevant context distracts the model |
| Citation mismatch | Answer cites text that does not support it |
| Conflicting evidence | Retrieved passages disagree |
| Stale index | Corpus has changed but index has not |
Conversational Question Answering
Conversational QA uses dialogue history. A user may ask follow-up questions that depend on previous turns.
Example:
```
User: Who created Python?
Assistant: Guido van Rossum.
User: When was it first released?
```

The second question contains the pronoun *it*, which refers to Python. The system must resolve the reference using dialogue context.
A conversational QA input may include:
```
previous turns + current question + retrieved context
```

The model must decide which prior information is relevant.
Conversational QA is harder than single-turn QA because questions may be underspecified, elliptical, or dependent on user intent.
Table Question Answering
Some questions require reasoning over structured tables.
Example:
| Year | Revenue |
|---|---|
| 2022 | 10 |
| 2023 | 14 |
| 2024 | 17 |
Question:
What was the revenue increase from 2022 to 2024?

Answer:
7

This requires selecting two cells and computing a difference. Table QA may involve lookup, filtering, aggregation, comparison, and arithmetic.
Models for table QA may serialize tables as text, use table-aware encoders, or generate executable programs such as SQL.
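A minimal serialization sketch (one of many possible linearizations):

```python
def serialize_table(headers, rows):
    # Flatten a table into pipe-separated lines that a text model can read.
    lines = [" | ".join(headers)]
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)

table_text = serialize_table(["Year", "Revenue"], [[2022, 10], [2023, 14], [2024, 17]])
```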
Evaluation Metrics
QA evaluation depends on the task type.
Extractive QA often uses exact match and token-level F1.
| Metric | Meaning |
|---|---|
| Exact match | Predicted answer exactly matches a reference answer |
| Token F1 | Overlap between predicted and reference answer tokens |
| Accuracy | Used for multiple-choice QA |
| Faithfulness | Whether answer is supported by context |
| Citation accuracy | Whether cited evidence supports the answer |
Exact match can be too strict. For example:
```
Guido van Rossum
van Rossum
```

may refer to the same answer, but exact match treats them as different.
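Token-level F1 gives partial credit in such cases. A simplified SQuAD-style sketch (real implementations also lowercase and strip punctuation and articles before comparing):

```python
def token_f1(prediction, reference):
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # Count reference tokens, then consume them as prediction tokens match.
    ref_counts = {}
    for t in ref_tokens:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in pred_tokens:
        if ref_counts.get(t, 0) > 0:
            common += 1
            ref_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("van Rossum", "Guido van Rossum"))  # 0.8, whereas exact match gives 0
```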
Generative QA requires more careful evaluation. A fluent answer may be unsupported. A short answer may be correct but phrased differently from the reference. Human evaluation or evidence-based checks are often needed.
Calibration and Abstention
A QA system should sometimes say that it cannot answer. This is especially important when the context lacks evidence or when the retrieved passages are weak.
Abstention can be based on:
| Signal | Example |
|---|---|
| No-answer score | Extractive model predicts [CLS] |
| Low span confidence | Best span score is weak |
| Retrieval score | Retrieved evidence is poor |
| Entailment check | Answer unsupported by passage |
| Uncertainty estimate | Model distribution is diffuse |
A reliable QA system should prefer abstention over unsupported answers in high-stakes settings.
Common Errors
Question answering systems commonly fail in several ways.
| Error type | Example |
|---|---|
| Wrong span | Selects nearby but incorrect phrase |
| Boundary error | Misses part of the answer |
| No-answer failure | Answers when context lacks evidence |
| Multi-hop failure | Fails to combine facts from multiple passages |
| Retrieval failure | Correct document is not retrieved |
| Hallucination | Generates unsupported answer |
| Temporal error | Uses outdated information |
| Coreference error | Misunderstands pronouns or follow-ups |
For extractive QA, many errors come from span boundaries. For generative QA, many errors come from unsupported synthesis.
Practical PyTorch Dataset Format
A QA dataset usually stores tokenized inputs and answer positions.
For extractive QA:
```python
example = {
    "input_ids": torch.tensor([...], dtype=torch.long),
    "attention_mask": torch.tensor([...], dtype=torch.long),
    "start_positions": torch.tensor(14, dtype=torch.long),
    "end_positions": torch.tensor(17, dtype=torch.long),
}
```

For generative QA:
```python
example = {
    "input_ids": torch.tensor([...], dtype=torch.long),
    "attention_mask": torch.tensor([...], dtype=torch.long),
    "labels": torch.tensor([...], dtype=torch.long),
}
```

In both cases, padding and truncation must be handled carefully. For extractive QA, if the answer span falls outside a truncated window, that example must be dropped, marked as no-answer, or represented in another window.
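A minimal Dataset wrapper around such pre-tokenized examples (a sketch; real pipelines often tokenize and window examples inside the dataset or a collate function):

```python
from torch.utils.data import Dataset

class ExtractiveQADataset(Dataset):
    def __init__(self, examples):
        # examples: a list of dicts shaped like the extractive example above
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]
```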
Summary
Question answering maps questions to answers. Extractive QA predicts answer spans from a passage. Generative QA writes answer text. Multiple-choice QA selects among candidates. Open-domain QA adds retrieval over a large corpus.
Modern QA systems are usually transformer-based. Extractive models use start and end position heads. Generative models use sequence generation. Retrieval-augmented systems combine search with neural readers or language models.
A useful QA system must handle long contexts, missing evidence, ambiguous questions, and evaluation beyond surface string match.