Question Answering

Question answering, often abbreviated QA, is the task of producing an answer to a question. The input may contain only a question, or it may contain both a question and a context passage. The output may be a span from the passage, a generated sentence, a choice from several candidates, or a structured value.

Examples:

| Question | Context | Answer |
| --- | --- | --- |
| Who invented the World Wide Web? | Tim Berners-Lee invented the World Wide Web in 1989. | Tim Berners-Lee |
| When was PyTorch released? | PyTorch was first released in 2016. | 2016 |
| Is CUDA required for PyTorch? | PyTorch can run on CPUs and GPUs. | No |

QA is a useful benchmark because it tests several capabilities at once: language understanding, retrieval, entity recognition, reasoning over context, and answer generation.

Forms of Question Answering

There are several common QA settings.

| Type | Input | Output |
| --- | --- | --- |
| Extractive QA | Question plus passage | Text span from passage |
| Generative QA | Question plus optional context | Generated answer text |
| Multiple-choice QA | Question plus answer options | Selected option |
| Open-domain QA | Question plus large corpus | Retrieved evidence plus answer |
| Conversational QA | Dialogue history plus question | Context-dependent answer |
| Table QA | Question plus table | Cell, row, or computed value |

The model architecture depends on the setting. Extractive QA uses span prediction. Generative QA uses sequence generation. Open-domain QA combines retrieval and reading.

Extractive Question Answering

In extractive QA, the answer must appear as a contiguous span in the context passage.

Example:

Question:
Who created Python?

Context:
Python was created by Guido van Rossum and first released in 1991.

Answer:
Guido van Rossum

The model receives the question and context together. It predicts two positions: the start index and the end index of the answer span.

If the tokenized context is:

| Index | Token |
| --- | --- |
| 0 | Python |
| 1 | was |
| 2 | created |
| 3 | by |
| 4 | Guido |
| 5 | van |
| 6 | Rossum |
| 7 | and |
| 8 | first |
| 9 | released |
| 10 | in |
| 11 | 1991 |

then the answer span is:

\text{start} = 4, \quad \text{end} = 6.

The model learns to assign high probability to these two positions.

Input Format for Extractive QA

Transformer QA models commonly concatenate the question and context into one sequence:

[CLS] question tokens [SEP] context tokens [SEP]

For example:

[CLS] Who created Python ? [SEP] Python was created by Guido van Rossum . [SEP]

The input IDs have shape:

[B, T]

The model produces hidden states:

[B, T, D]

A span prediction head maps each token representation to two logits:

start_logits: [B, T]
end_logits:   [B, T]

The highest-scoring start and end positions define the predicted answer span.

Span Prediction Head

A simple extractive QA head is a linear layer that maps each token vector to two scores: one start score and one end score.

import torch
import torch.nn as nn

class ExtractiveQAHead(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_dim, 2)

    def forward(self, hidden_states):
        # hidden_states: [B, T, D]
        logits = self.qa_outputs(hidden_states)
        # logits: [B, T, 2]

        start_logits = logits[:, :, 0]
        end_logits = logits[:, :, 1]
        # start_logits: [B, T]
        # end_logits: [B, T]

        return start_logits, end_logits

The model does not directly generate the answer string. It selects token positions. The selected tokens are then decoded back into text.
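That decoding step can be sketched as follows, assuming simple whitespace tokenization for clarity (real systems map subword tokens back to the original text through character offsets):

```python
# Minimal sketch: map a predicted token span back to an answer string,
# assuming whitespace tokenization.
def span_to_text(tokens, start, end):
    return " ".join(tokens[start:end + 1])

tokens = "Python was created by Guido van Rossum and first released in 1991".split()
print(span_to_text(tokens, 4, 6))  # Guido van Rossum
```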

Training Loss for Extractive QA

The training target consists of two integer positions:

start_positions: [B]
end_positions:   [B]

The loss is the sum or average of two cross-entropy losses:

loss_fn = nn.CrossEntropyLoss()

start_loss = loss_fn(start_logits, start_positions)
end_loss = loss_fn(end_logits, end_positions)

loss = (start_loss + end_loss) / 2

The start position and end position are learned independently, but decoding usually enforces constraints such as:

start <= end
answer_length <= max_answer_length

These constraints prevent invalid spans.

Decoding Answer Spans

At inference time, the model produces start and end logits. The simplest method selects:

\hat{s} = \arg\max_i z_i^{start}, \quad \hat{e} = \arg\max_j z_j^{end}.

However, this can produce invalid spans if \hat{e} < \hat{s}. A better method searches over valid pairs:

(\hat{s}, \hat{e}) = \arg\max_{s \le e} \left(z_s^{start} + z_e^{end}\right).

A practical decoder also limits span length:

def decode_span(start_logits, end_logits, max_answer_length=30):
    best_score = None
    best_span = (0, 0)

    T = start_logits.size(0)

    for start in range(T):
        max_end = min(T, start + max_answer_length)
        for end in range(start, max_end):
            score = start_logits[start] + end_logits[end]
            if best_score is None or score > best_score:
                best_score = score
                best_span = (start, end)

    return best_span

This procedure is simple and reliable for short contexts.

Handling Long Contexts

Many passages are longer than a transformer’s maximum sequence length. A common solution is sliding-window chunking.

The context is split into overlapping windows:

question + context window 1
question + context window 2
question + context window 3

Each window is scored independently. The answer span with the highest score across all windows is selected.

Overlap is important because an answer may cross a chunk boundary. For example, if the maximum context window is 384 tokens, a stride of 128 tokens allows neighboring windows to share content.

This approach increases compute cost because one question may create several model inputs.
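The window boundaries can be sketched as follows; the window size of 384 and stride of 128 come from the example above:

```python
# Sketch of sliding-window chunking over a long context.
# Each window starts `stride` tokens after the previous one, so
# neighboring windows overlap and an answer near a boundary still
# falls fully inside at least one window.
def make_windows(num_context_tokens, max_length=384, stride=128):
    windows = []
    start = 0
    while True:
        end = min(start + max_length, num_context_tokens)
        windows.append((start, end))
        if end == num_context_tokens:
            break
        start += stride
    return windows

print(make_windows(600))  # [(0, 384), (128, 512), (256, 600)]
```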

No-Answer Questions

Some QA datasets include questions whose answer does not appear in the context.

Example:

Question:
Who founded Rust?

Context:
Python was created by Guido van Rossum.

Answer:
No answer

A common approach uses the [CLS] token as the no-answer position. If the best span score is lower than the no-answer score, the model predicts no answer.

The model must learn not only where the answer is, but whether the passage contains enough evidence to answer.

This is important in real systems. A QA model that always returns an answer may hallucinate or select irrelevant text.
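A minimal sketch of this comparison, assuming the [CLS] token sits at position 0 and using a hypothetical threshold parameter to trade precision against recall:

```python
import torch

# Sketch of no-answer prediction: compare the best valid span score
# against the [CLS] no-answer score and abstain when the latter wins.
def predict_with_no_answer(start_logits, end_logits, threshold=0.0):
    no_answer_score = start_logits[0] + end_logits[0]
    best_score, best_span = None, None
    T = start_logits.size(0)
    for s in range(1, T):          # skip position 0, reserved for no-answer
        for e in range(s, T):
            score = start_logits[s] + end_logits[e]
            if best_score is None or score > best_score:
                best_score, best_span = score, (s, e)
    if best_score is None or no_answer_score + threshold > best_score:
        return None  # abstain
    return best_span
```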

Generative Question Answering

In generative QA, the model writes the answer rather than selecting a span. The answer may copy words from the context, paraphrase them, combine multiple facts, or produce a short explanation.

Example:

Question:
Why do transformers use positional encodings?

Answer:
Because self-attention has no built-in notion of token order, so positional encodings provide information about sequence position.

Generative QA is usually modeled as conditional generation:

P(a \mid q, c) = \prod_{t=1}^{T} P(a_t \mid a_{<t}, q, c),

where q is the question, c is the context, and a is the answer.

Encoder-decoder models and decoder-only language models can both be used.
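Training usually maximizes this likelihood with teacher forcing, which reduces to next-token cross-entropy over the answer tokens. A minimal sketch with illustrative shapes and a stand-in for real model outputs:

```python
import torch
import torch.nn as nn

# Sketch of the generative QA training loss: cross-entropy between the
# model's per-step vocabulary distributions and the gold answer tokens.
B, T, V = 2, 5, 100                     # batch, answer length, vocab size
logits = torch.randn(B, T, V)           # stand-in for model outputs
labels = torch.randint(0, V, (B, T))    # gold answer token ids

loss = nn.CrossEntropyLoss()(logits.reshape(B * T, V), labels.reshape(B * T))
```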

Multiple-Choice QA

Multiple-choice QA gives a fixed set of candidate answers.

Example:

Question:
Which operation computes gradients in PyTorch?

A. optimizer.step()
B. loss.backward()
C. model.eval()
D. torch.no_grad()

The correct answer is B.

A common architecture scores each candidate independently. The input is constructed as:

question + candidate answer

For K candidates, the model produces K scores. Cross-entropy loss selects the correct candidate.

class MultipleChoiceHead(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, pooled_states):
        # pooled_states: [B, K, D]
        scores = self.scorer(pooled_states).squeeze(-1)
        # scores: [B, K]
        return scores

Multiple-choice QA is easier to evaluate than free-form QA, but it can reward test-taking shortcuts if the candidates contain artifacts.
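Training the head above can be sketched as follows; the pooled encodings are random stand-ins for a real encoder's output, and the label values are illustrative:

```python
import torch
import torch.nn as nn

# Sketch of multiple-choice training: one score per candidate,
# then cross-entropy over the K candidates.
B, K, D = 2, 4, 8
pooled = torch.randn(B, K, D)        # pooled encoding of question + each candidate
scorer = nn.Linear(D, 1)

scores = scorer(pooled).squeeze(-1)  # [B, K]
labels = torch.tensor([1, 3])        # index of the correct candidate per example
loss = nn.CrossEntropyLoss()(scores, labels)
```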

Open-Domain Question Answering

Open-domain QA answers questions using a large corpus rather than a provided passage.

The system usually has two stages:

question
-> retriever
-> relevant passages
-> reader or generator
-> answer

The retriever finds candidate documents or passages. The reader extracts or generates the answer using those passages.

Retrieval methods include sparse lexical retrieval, dense vector retrieval, and hybrid retrieval.

| Retriever type | Main signal |
| --- | --- |
| Sparse retrieval | Keyword overlap |
| Dense retrieval | Embedding similarity |
| Hybrid retrieval | Both lexical and dense signals |

Open-domain QA depends heavily on retrieval quality. If the correct evidence is missing from retrieved passages, even a strong reader may fail.

Retrieval-Augmented QA

Retrieval-augmented generation, often abbreviated RAG, combines retrieved evidence with a generative model.

A typical RAG pipeline:

question
-> embed question
-> retrieve top-k passages
-> concatenate passages into context
-> generate answer with citations or evidence

The model conditions on retrieved text instead of relying only on its parameters.

RAG is useful when the answer depends on private documents, recent information, or large knowledge bases. The model’s parametric memory is fixed after training, but retrieval can access updated content.
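The context-assembly step of the pipeline can be sketched as plain string construction; the prompt format below is one common choice, not a specific library's API:

```python
# Sketch of RAG context assembly: number the retrieved passages and
# concatenate them ahead of the question so the generator can cite them.
def build_prompt(question, passages):
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```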

However, RAG introduces new failure modes:

| Failure mode | Description |
| --- | --- |
| Retrieval miss | Correct evidence is not retrieved |
| Context overload | Too much irrelevant context distracts the model |
| Citation mismatch | Answer cites text that does not support it |
| Conflicting evidence | Retrieved passages disagree |
| Stale index | Corpus has changed but index has not |

Conversational Question Answering

Conversational QA uses dialogue history. A user may ask follow-up questions that depend on previous turns.

Example:

User: Who created Python?
Assistant: Guido van Rossum.
User: When was it first released?

The second question contains the pronoun "it", which refers to Python. The system must resolve the reference using dialogue context.

A conversational QA input may include:

previous turns + current question + retrieved context

The model must decide which prior information is relevant.

Conversational QA is harder than single-turn QA because questions may be underspecified, elliptical, or dependent on user intent.

Table Question Answering

Some questions require reasoning over structured tables.

Example:

| Year | Revenue |
| --- | --- |
| 2022 | 10 |
| 2023 | 14 |
| 2024 | 17 |

Question:

What was the revenue increase from 2022 to 2024?

Answer:

7

This requires selecting two cells and computing a difference. Table QA may involve lookup, filtering, aggregation, comparison, and arithmetic.

Models for table QA may serialize tables as text, use table-aware encoders, or generate executable programs such as SQL.
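The simplest of these options, text serialization, can be sketched as follows; the row format is one common choice, not a standard:

```python
# Sketch of table serialization: flatten a table into text so a
# sequence model can read it alongside the question.
def serialize_table(headers, rows):
    lines = [" | ".join(headers)]
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)

print(serialize_table(["Year", "Revenue"], [[2022, 10], [2023, 14], [2024, 17]]))
# Year | Revenue
# 2022 | 10
# 2023 | 14
# 2024 | 17
```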

Evaluation Metrics

QA evaluation depends on the task type.

Extractive QA often uses exact match and token-level F1.

| Metric | Meaning |
| --- | --- |
| Exact match | Predicted answer exactly matches a reference answer |
| Token F1 | Overlap between predicted and reference answer tokens |
| Accuracy | Used for multiple-choice QA |
| Faithfulness | Whether answer is supported by context |
| Citation accuracy | Whether cited evidence supports the answer |

Exact match can be too strict. For example:

Guido van Rossum
van Rossum

may refer to the same answer, but exact match treats them as different.

Generative QA requires more careful evaluation. A fluent answer may be unsupported. A short answer may be correct but phrased differently from the reference. Human evaluation or evidence-based checks are often needed.
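Token-level F1 softens exact match by crediting partial overlap. A minimal sketch, using whitespace tokenization and omitting the normalization (lowercasing, punctuation and article stripping) that benchmark scripts typically apply:

```python
from collections import Counter

# Sketch of token-level F1 between a predicted and a reference answer.
def token_f1(prediction, reference):
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("van Rossum", "Guido van Rossum"))  # ≈ 0.8
```

On the example above, exact match scores 0 but token F1 scores 0.8, which better reflects the partial correctness.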

Calibration and Abstention

A QA system should sometimes say that it cannot answer. This is especially important when the context lacks evidence or when the retrieved passages are weak.

Abstention can be based on:

| Signal | Example |
| --- | --- |
| No-answer score | Extractive model predicts [CLS] |
| Low span confidence | Best span score is weak |
| Retrieval score | Retrieved evidence is poor |
| Entailment check | Answer unsupported by passage |
| Uncertainty estimate | Model distribution is diffuse |

A reliable QA system should prefer abstention over unsupported answers in high-stakes settings.

Common Errors

Question answering systems commonly fail in several ways.

| Error type | Example |
| --- | --- |
| Wrong span | Selects nearby but incorrect phrase |
| Boundary error | Misses part of the answer |
| No-answer failure | Answers when context lacks evidence |
| Multi-hop failure | Fails to combine facts from multiple passages |
| Retrieval failure | Correct document is not retrieved |
| Hallucination | Generates unsupported answer |
| Temporal error | Uses outdated information |
| Coreference error | Misunderstands pronouns or follow-ups |

For extractive QA, many errors come from span boundaries. For generative QA, many errors come from unsupported synthesis.

Practical PyTorch Dataset Format

A QA dataset usually stores tokenized inputs and answer positions.

For extractive QA:

example = {
    "input_ids": torch.tensor([...], dtype=torch.long),
    "attention_mask": torch.tensor([...], dtype=torch.long),
    "start_positions": torch.tensor(14, dtype=torch.long),
    "end_positions": torch.tensor(17, dtype=torch.long),
}

For generative QA:

example = {
    "input_ids": torch.tensor([...], dtype=torch.long),
    "attention_mask": torch.tensor([...], dtype=torch.long),
    "labels": torch.tensor([...], dtype=torch.long),
}

In both cases, padding and truncation must be handled carefully. For extractive QA, if the answer span falls outside a truncated window, that example must be dropped, marked as no-answer, or represented in another window.
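The padding step can be sketched with a collate function; the field names match the extractive example above, and zero-padding is an assumption (real pipelines pad with the tokenizer's pad token id):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Sketch of a QA collate function: pad variable-length examples into one
# batch and stack the scalar answer positions.
def collate(batch):
    return {
        "input_ids": pad_sequence([ex["input_ids"] for ex in batch], batch_first=True),
        "attention_mask": pad_sequence([ex["attention_mask"] for ex in batch], batch_first=True),
        "start_positions": torch.stack([ex["start_positions"] for ex in batch]),
        "end_positions": torch.stack([ex["end_positions"] for ex in batch]),
    }
```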

Summary

Question answering maps questions to answers. Extractive QA predicts answer spans from a passage. Generative QA writes answer text. Multiple-choice QA selects among candidates. Open-domain QA adds retrieval over a large corpus.

Modern QA systems are usually transformer-based. Extractive models use start and end position heads. Generative models use sequence generation. Retrieval-augmented systems combine search with neural readers or language models.

A useful QA system must handle long contexts, missing evidence, ambiguous questions, and evaluation beyond surface string match.