Question answering is the task of producing an answer to a question. The input may be the question alone, or the question together with one or more passages that may contain the answer.
Examples:
Question: Who wrote The Origin of Species?
Answer: Charles Darwin

Question: What does dropout do?
Answer: It randomly disables units during training to reduce co-adaptation and improve generalization.

Question answering systems are used in search, documentation assistants, customer support, education, legal research, biomedical search, and retrieval-augmented generation.
There are several forms of question answering:
| Type | Input | Output |
|---|---|---|
| Closed-book QA | Question only | Answer from model parameters |
| Open-book QA | Question plus context | Answer from provided context |
| Extractive QA | Question plus passage | Span copied from passage |
| Generative QA | Question plus optional context | Generated text |
| Multiple-choice QA | Question plus options | Selected option |
| Retrieval QA | Question plus document collection | Retrieved evidence plus answer |
Modern systems often combine retrieval and generation. The retriever finds relevant passages. The reader or generator uses those passages to produce the answer.
Extractive Question Answering
Extractive QA assumes that the answer appears as a contiguous span in the context passage.
Input:
Question: Where was Alan Turing born?
Context: Alan Turing was born in Maida Vale, London, and studied at King’s College, Cambridge.

Output:
Maida Vale, London

The model does not generate arbitrary text. It selects a start token and an end token from the context.
Let the tokenized input be $x = (x_1, \dots, x_T)$.

The model predicts two distributions over token positions:

$$p_{\text{start}}(i) = \mathrm{softmax}(s)_i$$

and

$$p_{\text{end}}(j) = \mathrm{softmax}(e)_j,$$

where $s$ and $e$ are vectors of start and end logits. The predicted answer span is

$$(\hat{i}, \hat{j}) = \operatorname*{arg\,max}_{i \le j} \; p_{\text{start}}(i)\, p_{\text{end}}(j).$$

The constraint $i \le j$ prevents invalid spans.
Encoding Question and Context
For transformer encoders such as BERT, the question and context are usually concatenated into one sequence:
```
[CLS] question tokens [SEP] context tokens [SEP]
```

The model receives token IDs, segment IDs, and an attention mask.
For example:
```python
input_ids        # [B, T]
token_type_ids   # [B, T]
attention_mask   # [B, T]
```

The token_type_ids tensor distinguishes the question segment from the context segment in models that use segment embeddings.
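As a concrete sketch, a Hugging Face tokenizer can build this paired encoding; the checkpoint name below is only an illustrative choice.

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any BERT-style model with segment embeddings works.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "Where was Alan Turing born?"
context = "Alan Turing was born in Maida Vale, London."

# Passing the texts as a pair inserts [CLS] and [SEP] automatically and sets
# token_type_ids to 0 for question tokens and 1 for context tokens.
enc = tokenizer(question, context, return_tensors="pt")
print(enc["input_ids"].shape)    # [1, T]
print(enc["token_type_ids"][0])  # 0s for the question, then 1s for the context
print(enc["attention_mask"][0])  # 1s for real tokens
```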
The transformer produces hidden states

$$h_1, \dots, h_T \in \mathbb{R}^{D},$$

one vector per input token.
A linear head maps each token representation to two logits:

$$[s_i, e_i] = W h_i + b, \qquad W \in \mathbb{R}^{2 \times D}.$$

The first logit is the start score and the second is the end score.
```python
import torch
import torch.nn as nn


class ExtractiveQA(nn.Module):
    def __init__(self, encoder, hidden_dim: int):
        super().__init__()
        self.encoder = encoder
        # One linear layer produces two logits per token: start and end.
        self.qa_outputs = nn.Linear(hidden_dim, 2)

    def forward(
        self,
        input_ids,
        attention_mask,
        token_type_ids=None,
        start_positions=None,
        end_positions=None,
    ):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        hidden = outputs.last_hidden_state  # [B, T, D]
        logits = self.qa_outputs(hidden)    # [B, T, 2]
        start_logits = logits[..., 0]       # [B, T]
        end_logits = logits[..., 1]         # [B, T]

        # Without gold positions, return raw logits for inference.
        if start_positions is None or end_positions is None:
            return start_logits, end_logits

        # Cross-entropy over token positions for each boundary.
        loss_fn = nn.CrossEntropyLoss()
        start_loss = loss_fn(start_logits, start_positions)
        end_loss = loss_fn(end_logits, end_positions)
        loss = (start_loss + end_loss) / 2
        return loss, start_logits, end_logits
```

This is the standard architecture for span extraction.
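A minimal usage sketch, assuming the encoder comes from the Hugging Face transformers library (the checkpoint name is illustrative):

```python
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-uncased")
model = ExtractiveQA(encoder, hidden_dim=encoder.config.hidden_size)

enc = tokenizer(
    "Where was Alan Turing born?",
    "Alan Turing was born in Maida Vale, London.",
    return_tensors="pt",
)
# No gold positions supplied, so the model returns logits only.
start_logits, end_logits = model(**enc)  # each [1, T]
```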
Training Objective
For each training example, the dataset provides the correct start and end positions.
The loss is the average of two cross-entropy terms:

$$\mathcal{L} = -\tfrac{1}{2}\left(\log p_{\text{start}}(i^*) + \log p_{\text{end}}(j^*)\right),$$

where $i^*$ and $j^*$ are the gold start and end positions.
In PyTorch, nn.CrossEntropyLoss expects logits of shape [B, T] and labels of shape [B].
```python
loss_fn = nn.CrossEntropyLoss()
start_loss = loss_fn(start_logits, start_positions)
end_loss = loss_fn(end_logits, end_positions)
loss = (start_loss + end_loss) / 2
```

The model learns to assign high probability to the correct span boundaries.
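Put together, one training step might look like the following sketch; the batch dictionary and its keys are assumptions about the data pipeline, not a fixed API.

```python
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def train_step(batch):
    # batch is assumed to hold the tensors described above.
    loss, _, _ = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        token_type_ids=batch.get("token_type_ids"),
        start_positions=batch["start_positions"],
        end_positions=batch["end_positions"],
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```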
Decoding Answer Spans
At inference time, the model produces start and end logits. We must convert them into a valid span.
A simple decoder chooses the best pair of positions:

$$(\hat{i}, \hat{j}) = \operatorname*{arg\,max}_{i \le j} \,(s_i + e_j),$$

subject to the constraint that the span length $\hat{j} - \hat{i} + 1$ does not exceed a maximum length.
```python
def decode_span(start_logits, end_logits, max_answer_len=30):
    best_score = None
    best_span = (0, 0)
    for start in range(len(start_logits)):
        max_end = min(len(end_logits), start + max_answer_len)
        for end in range(start, max_end):
            score = start_logits[start].item() + end_logits[end].item()
            if best_score is None or score > best_score:
                best_score = score
                best_span = (start, end)
    return best_span
```

In practice, systems often consider only the top start positions and top end positions rather than all pairs. This is faster and usually sufficient.
The predicted token span must then be converted back into text using tokenizer offsets.
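One way to do this conversion, assuming a fast Hugging Face tokenizer that can return character offsets:

```python
enc = tokenizer(
    question,
    context,
    return_tensors="pt",
    return_offsets_mapping=True,
)
offsets = enc["offset_mapping"][0]  # [T, 2] character offsets

start_tok, end_tok = decode_span(start_logits[0], end_logits[0])
# Assumes the predicted span lies in the context segment; a full
# implementation would exclude question and special tokens from the search.
char_start = offsets[start_tok][0].item()
char_end = offsets[end_tok][1].item()
answer = context[char_start:char_end]
```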
Handling Long Contexts
Transformer encoders have a maximum sequence length. If the context is longer than the model limit, we split it into overlapping windows.
For example, a document may be divided into windows of 384 tokens with a stride of 128 tokens. Each window is paired with the same question.
```
question + context window 1
question + context window 2
question + context window 3
```

The model predicts an answer span for each window. The system selects the highest-scoring span across all windows.
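Assuming a Hugging Face fast tokenizer, the overlapping windows can be produced directly; the parameter values mirror the example above.

```python
enc = tokenizer(
    question,
    context,
    max_length=384,
    stride=128,
    truncation="only_second",         # split only the context, never the question
    return_overflowing_tokens=True,   # one row per window
    padding="max_length",             # so all windows stack into one tensor
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # [num_windows, 384]
```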
This approach is simple and effective, but it has limitations. If the answer requires evidence from distant parts of the document, a fixed window may miss the necessary context. Long-context models, retrieval systems, and hierarchical encoders help with this problem.
No-Answer Questions
Some datasets include questions that cannot be answered from the provided passage.
Example:
Question: What year did Ada Lovelace win the Turing Award?
Context: Ada Lovelace wrote notes on Charles Babbage’s Analytical Engine in the 1840s.

There is no valid answer in the context.
A common method is to let the special [CLS] token represent “no answer.” The model predicts start and end positions at [CLS] when no answer exists.
The system compares the best non-empty span score with the no-answer score. If the no-answer score is higher by a threshold, it returns no answer.
```python
def should_answer(best_span_score, no_answer_score, threshold=0.0):
    return best_span_score > no_answer_score + threshold
```

The threshold is tuned on a validation set.
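Since [CLS] sits at position 0 in the encoding above, a minimal sketch of the comparison looks like this; the threshold value is illustrative.

```python
# start_logits, end_logits: 1-D logits for a single example.
# Position 0 is [CLS]; its logits act as the no-answer score.
no_answer_score = start_logits[0].item() + end_logits[0].item()

start_tok, end_tok = decode_span(start_logits, end_logits)
best_span_score = start_logits[start_tok].item() + end_logits[end_tok].item()
# (A full implementation would exclude [CLS] and question tokens
# from the span search.)

if not should_answer(best_span_score, no_answer_score, threshold=0.5):
    answer = ""  # abstain: no answer in the passage
```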
Generative Question Answering
Generative QA produces answer text directly. Instead of selecting a span, the model generates a sequence of tokens.
The input may be:
```
question: What is dropout?
context: Dropout randomly masks hidden units during training...
answer:
```

The output may be:

```
Dropout randomly disables units during training to improve generalization.
```

Encoder-decoder models and decoder-only language models are commonly used for generative QA.
For an answer sequence $y = (y_1, \dots, y_n)$, the model defines

$$p(y \mid x) = \prod_{t=1}^{n} p(y_t \mid y_{<t}, x).$$

Training uses teacher forcing and token-level cross-entropy:

$$\mathcal{L} = -\sum_{t=1}^{n} \log p(y_t \mid y_{<t}, x).$$
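A sketch of this objective in PyTorch, assuming the model has already produced logits of shape [B, T, V] under teacher forcing:

```python
import torch.nn.functional as F

def generative_qa_loss(logits, target_ids, pad_token_id):
    # logits: [B, T, V]; target_ids: [B, T] gold answer tokens.
    B, T, V = logits.shape
    return F.cross_entropy(
        logits.reshape(B * T, V),
        target_ids.reshape(B * T),
        ignore_index=pad_token_id,  # padding positions contribute no loss
    )
```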
Generative QA is more flexible than extractive QA. It can summarize, synthesize, and rephrase. It can also hallucinate, so evidence grounding and evaluation become more important.
Retrieval-Augmented Question Answering
Retrieval-augmented QA uses an external document collection. The system first retrieves relevant passages, then answers using those passages.
A typical pipeline, sketched in code after this list, is:
- Receive a question.
- Retrieve candidate passages.
- Rerank passages.
- Feed top passages to a reader or generator.
- Produce an answer with citations or evidence.
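A sketch of this flow; retrieve, rerank, and generate are hypothetical placeholders for whatever components a real system uses.

```python
def answer_question(question, k=20, n_context=5):
    # retrieve, rerank, and generate are hypothetical helpers, not a real API.
    candidates = retrieve(question, k=k)       # sparse, dense, or hybrid search
    ranked = rerank(question, candidates)      # e.g., a cross-encoder reranker
    top_passages = ranked[:n_context]
    answer = generate(question, top_passages)  # reader or generator
    return answer, top_passages                # passages double as evidence
```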
The retriever may use sparse search, dense search, or a hybrid method.
Sparse retrieval uses lexical matching, such as BM25. Dense retrieval embeds questions and passages into vectors and compares them by dot product or cosine similarity.
Let $q$ be the question embedding and $p_i$ be the embedding of passage $i$. Dense retrieval ranks passages by the score

$$\text{score}(q, p_i) = q^\top p_i.$$
Hybrid retrieval often works better than either sparse or dense retrieval alone, especially in technical domains where exact terms matter.
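A minimal dense-scoring sketch; how the embeddings are produced is left open, and with L2-normalized vectors the dot product equals cosine similarity.

```python
import torch

def rank_passages(question_emb, passage_embs, k=5):
    # question_emb: [D]; passage_embs: [N, D].
    scores = passage_embs @ question_emb  # [N] dot-product scores
    top_scores, top_idx = torch.topk(scores, k)
    return top_idx.tolist(), top_scores.tolist()
```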
Multiple-Choice Question Answering
In multiple-choice QA, the model selects one answer from a fixed set of options.
Example:
Question: Which model architecture uses self-attention as its central operation?
A. Decision tree
B. Transformer
C. Naive Bayes
D. k-means

The model scores each option:

$$s_c = f(q, o_c),$$

where $q$ is the question and $o_c$ is option $c$. The probability of option $c$ is

$$p(c \mid q) = \frac{\exp(s_c)}{\sum_{c'} \exp(s_{c'})}.$$
Training uses cross-entropy over the answer choices.
A common implementation concatenates the question with each option, encodes each pair, and applies a scalar scoring head.
```python
class MultipleChoiceQA(nn.Module):
    def __init__(self, encoder, hidden_dim: int):
        super().__init__()
        self.encoder = encoder
        # Scalar score per (question, option) pair.
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, input_ids, attention_mask, labels=None):
        # input_ids: [B, C, T], where C is the number of choices.
        B, C, T = input_ids.shape
        flat_input_ids = input_ids.reshape(B * C, T)
        flat_attention_mask = attention_mask.reshape(B * C, T)
        outputs = self.encoder(
            input_ids=flat_input_ids,
            attention_mask=flat_attention_mask,
        )
        cls_state = outputs.last_hidden_state[:, 0, :]  # [B*C, D]
        scores = self.scorer(cls_state).reshape(B, C)   # [B, C]

        if labels is None:
            return scores

        # Cross-entropy over the C answer choices.
        loss_fn = nn.CrossEntropyLoss()
        loss = loss_fn(scores, labels)
        return loss, scores
```
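A usage sketch, reusing the illustrative tokenizer and encoder from earlier:

```python
question = "Which model architecture uses self-attention as its central operation?"
options = ["Decision tree", "Transformer", "Naive Bayes", "k-means"]

# Encode each (question, option) pair, then stack into [B=1, C, T].
enc = tokenizer([question] * len(options), options,
                padding=True, return_tensors="pt")
input_ids = enc["input_ids"].unsqueeze(0)            # [1, C, T]
attention_mask = enc["attention_mask"].unsqueeze(0)  # [1, C, T]

mc_model = MultipleChoiceQA(encoder, hidden_dim=encoder.config.hidden_size)
scores = mc_model(input_ids, attention_mask)         # [1, C]
print(options[scores.argmax(dim=-1).item()])         # "Transformer", once trained
```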
Evaluation
Question answering evaluation depends on the task.
For extractive QA, common metrics are exact match and token-level F1.
Exact match requires the predicted answer string to match a gold answer after normalization.
Token-level F1 compares overlapping tokens between the predicted answer and gold answer.
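A minimal sketch of both metrics, using lowercasing and whitespace tokenization as a stand-in for the full normalization rules that datasets such as SQuAD specify:

```python
from collections import Counter

def normalize(text):
    # Stand-in normalization; real evaluations also strip punctuation and articles.
    return text.lower().split()

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    pred_toks, gold_toks = normalize(pred), normalize(gold)
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```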
For generative QA, exact string matching is often too strict. Valid answers may be phrased differently. Metrics may include semantic similarity, human judgment, factual consistency checks, citation accuracy, and task-specific scoring.
For retrieval QA, evaluation should measure both retrieval quality and answer quality.
| Component | Common metrics |
|---|---|
| Retriever | Recall@k, MRR, nDCG |
| Reader | Exact match, F1 |
| Generator | Human preference, factuality, citation support |
| End-to-end system | Answer correctness, evidence quality, latency |
A system can fail because the retriever missed the evidence, because the reader selected the wrong span, or because the generator ignored the evidence. Evaluating each component separately makes debugging easier.
Common Failure Modes
Question answering systems have several recurring failure modes.
| Failure mode | Description |
|---|---|
| Retrieval miss | Relevant evidence does not reach the model |
| Boundary error | Extracted span is too short or too long |
| Entity confusion | Model selects the wrong person, date, or organization |
| Negation error | Model ignores “not,” “except,” or “unless” |
| Multi-hop failure | Answer requires combining evidence across passages |
| Hallucination | Generator produces unsupported claims |
| Temporal error | Model uses stale or wrong time context |
| Ambiguous question | Several answers are plausible |
| Unanswerable question | Model answers despite insufficient evidence |
In high-stakes systems, the model should be allowed to abstain. Returning “not enough information” is often better than producing a confident unsupported answer.
Practical Design Choices
The right QA architecture depends on the setting.
| Setting | Suitable approach |
|---|---|
| Answer appears in a short passage | Extractive QA |
| Need fluent explanation | Generative QA |
| Large document collection | Retrieval-augmented QA |
| Fixed answer options | Multiple-choice QA |
| Legal or medical evidence | Retrieval plus extractive or citation-grounded generation |
| Customer support | Retrieval plus generative answer with source links |
Extractive systems are easier to constrain because answers come from the passage. Generative systems are more flexible but need stronger grounding. Retrieval-augmented systems are usually the best default when the answer should depend on external documents.
Summary
Question answering maps questions to answers. Extractive QA selects a span from a context passage. Generative QA produces answer text token by token. Multiple-choice QA scores candidate answers. Retrieval-augmented QA first finds evidence, then answers from that evidence.
In PyTorch, extractive QA is typically implemented with a transformer encoder and a span prediction head. The model predicts start and end logits over token positions. Training uses cross-entropy for both boundaries.
Reliable QA depends on more than the model. Long contexts, unanswerable questions, retrieval quality, answer calibration, and evidence grounding are central design concerns.