Question Answering

Question answering is the task of producing an answer to a question. The input may contain only the question, or it may contain both a question and one or more passages that may contain the answer.

Examples:

Question: Who wrote The Origin of Species?
Answer: Charles Darwin
Question: What does dropout do?
Answer: It randomly disables units during training to reduce co-adaptation and improve generalization.

Question answering systems are used in search, documentation assistants, customer support, education, legal research, biomedical search, and retrieval-augmented generation.

There are several forms of question answering:

| Type | Input | Output |
| --- | --- | --- |
| Closed-book QA | Question only | Answer from model parameters |
| Open-book QA | Question plus context | Answer from provided context |
| Extractive QA | Question plus passage | Span copied from passage |
| Generative QA | Question plus optional context | Generated text |
| Multiple-choice QA | Question plus options | Selected option |
| Retrieval QA | Question plus document collection | Retrieved evidence plus answer |

Modern systems often combine retrieval and generation. The retriever finds relevant passages. The reader or generator uses those passages to produce the answer.

Extractive Question Answering

Extractive QA assumes that the answer appears as a contiguous span in the context passage.

Input:

Question: Where was Alan Turing born?

Context: Alan Turing was born in Maida Vale, London, and studied at King’s College, Cambridge.

Output:

Maida Vale, London

The model does not generate arbitrary text. It selects a start token and an end token from the context.

Let the tokenized input be

x = (x_1, x_2, \ldots, x_T).

The model predicts two distributions:

p_{\text{start}}(i \mid x)

and

p_{\text{end}}(j \mid x).

The predicted answer span is

(\hat{i}, \hat{j}) = \arg\max_{i \leq j} p_{\text{start}}(i \mid x)\, p_{\text{end}}(j \mid x).

The constraint i \leq j prevents invalid spans.

Encoding Question and Context

For transformer encoders such as BERT, the question and context are usually concatenated into one sequence:

[CLS] question tokens [SEP] context tokens [SEP]

The model receives token IDs, segment IDs, and an attention mask.

For example:

input_ids      # [B, T]
token_type_ids # [B, T]
attention_mask # [B, T]

The token_type_ids tensor distinguishes the question segment from the context segment in models that use segment embeddings.
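As a concrete sketch, a Hugging Face-style tokenizer can build these tensors from a question and context pair. The checkpoint name and max_length below are only illustrative assumptions:

from transformers import AutoTokenizer

# Example only: any BERT-style checkpoint with segment embeddings works similarly.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    "Where was Alan Turing born?",                                   # question
    "Alan Turing was born in Maida Vale, London, and studied "
    "at King's College, Cambridge.",                                 # context
    padding="max_length",
    truncation="only_second",   # truncate the context, never the question
    max_length=384,
    return_tensors="pt",
)

input_ids = encoded["input_ids"]            # [1, 384]
token_type_ids = encoded["token_type_ids"]  # [1, 384]; 0 = question, 1 = context
attention_mask = encoded["attention_mask"]  # [1, 384]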

The transformer produces hidden states:

H \in \mathbb{R}^{B \times T \times D}.

A linear head maps each token representation to two logits:

Z \in \mathbb{R}^{B \times T \times 2}.

The first logit is the start score. The second logit is the end score.

import torch
import torch.nn as nn

class ExtractiveQA(nn.Module):
    def __init__(self, encoder, hidden_dim: int):
        super().__init__()
        self.encoder = encoder
        self.qa_outputs = nn.Linear(hidden_dim, 2)

    def forward(
        self,
        input_ids,
        attention_mask,
        token_type_ids=None,
        start_positions=None,
        end_positions=None,
    ):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )

        hidden = outputs.last_hidden_state       # [B, T, D]
        logits = self.qa_outputs(hidden)         # [B, T, 2]

        start_logits = logits[..., 0]            # [B, T]
        end_logits = logits[..., 1]              # [B, T]

        if start_positions is None or end_positions is None:
            return start_logits, end_logits

        loss_fn = nn.CrossEntropyLoss()

        start_loss = loss_fn(start_logits, start_positions)
        end_loss = loss_fn(end_logits, end_positions)

        loss = (start_loss + end_loss) / 2
        return loss, start_logits, end_logits

This is the standard architecture for span extraction.
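As a usage sketch, the module can wrap a pretrained encoder. The checkpoint, hidden size, and dummy tensors below are illustrative assumptions, not part of the original example:

import torch
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-uncased")  # assumed example checkpoint
model = ExtractiveQA(encoder, hidden_dim=768)             # 768 = BERT-base hidden size

# Dummy batch for illustration; in practice these come from the tokenizer shown above.
input_ids = torch.randint(0, 30522, (2, 384))
attention_mask = torch.ones(2, 384, dtype=torch.long)
token_type_ids = torch.zeros(2, 384, dtype=torch.long)

# Inference: only logits are returned.
start_logits, end_logits = model(input_ids, attention_mask, token_type_ids)

# Training: also pass gold boundary positions of shape [B].
start_positions = torch.tensor([17, 42])
end_positions = torch.tensor([19, 45])
loss, _, _ = model(
    input_ids,
    attention_mask,
    token_type_ids,
    start_positions=start_positions,
    end_positions=end_positions,
)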

Training Objective

For each training example, the dataset provides the correct start and end positions.

The loss is the average of two cross-entropy terms:

\mathcal{L} = \frac{1}{2} \left( -\log p_{\text{start}}(i^\star \mid x) - \log p_{\text{end}}(j^\star \mid x) \right),

where i^\star and j^\star are the gold start and end positions.

In PyTorch, nn.CrossEntropyLoss expects logits of shape [B, T] and labels of shape [B].

loss_fn = nn.CrossEntropyLoss()

start_loss = loss_fn(start_logits, start_positions)
end_loss = loss_fn(end_logits, end_positions)

loss = (start_loss + end_loss) / 2

The model learns to assign high probability to the correct span boundaries.

Decoding Answer Spans

At inference time, the model produces start and end logits. We must convert them into a valid span.

A simple decoder chooses the best pair (i, j) such that:

i \leq j

and the span length does not exceed a fixed maximum.

def decode_span(start_logits, end_logits, max_answer_len=30):
    best_score = None
    best_span = (0, 0)

    for start in range(len(start_logits)):
        max_end = min(len(end_logits), start + max_answer_len)

        for end in range(start, max_end):
            score = start_logits[start].item() + end_logits[end].item()

            if best_score is None or score > best_score:
                best_score = score
                best_span = (start, end)

    return best_span

In practice, systems often consider the top k start positions and top k end positions rather than all pairs. This is faster and usually sufficient.
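A sketch of that top-k variant, assuming start_logits and end_logits are one-dimensional tensors for a single example:

import torch

def decode_span_topk(start_logits, end_logits, k=20, max_answer_len=30):
    # Consider only the k highest-scoring start and end positions.
    top_starts = torch.topk(start_logits, k=min(k, start_logits.numel())).indices
    top_ends = torch.topk(end_logits, k=min(k, end_logits.numel())).indices

    best_score = None
    best_span = (0, 0)

    for start in top_starts.tolist():
        for end in top_ends.tolist():
            if end < start or end - start + 1 > max_answer_len:
                continue  # skip invalid or overly long spans
            score = start_logits[start].item() + end_logits[end].item()
            if best_score is None or score > best_score:
                best_score = score
                best_span = (start, end)

    return best_span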

The predicted token span must then be converted back into text using tokenizer offsets.
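One way to do this, assuming the tokenizer also returned a per-token offset mapping into the original context string (question and special tokens mapping to (0, 0)), is a minimal sketch like the following; the helper name is hypothetical:

def span_to_text(context, offsets, start, end):
    # offsets[i] = (char_start, char_end) of token i in the original context string.
    # Question and special tokens are assumed to map to (0, 0) and are never selected.
    char_start = offsets[start][0]
    char_end = offsets[end][1]
    return context[char_start:char_end]

# Example usage:
# start, end = decode_span(start_logits, end_logits)
# answer = span_to_text(context, offsets, start, end)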

Handling Long Contexts

Transformer encoders have a maximum sequence length. If the context is longer than the model limit, we split it into overlapping windows.

For example, a document may be divided into windows of 384 tokens with a stride of 128 tokens. Each window is paired with the same question.

question + context window 1
question + context window 2
question + context window 3

The model predicts an answer span for each window. The system selects the highest-scoring span across all windows.
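A minimal sketch of the windowing step over an already tokenized context, using the common convention that the stride is the number of overlapping tokens between consecutive windows:

def split_into_windows(context_token_ids, window_size=384, stride=128):
    # Consecutive windows share `stride` tokens, so each window start
    # advances by window_size - stride. In practice window_size must also
    # leave room for the question and special tokens.
    step = window_size - stride
    windows = []
    start = 0
    while True:
        window = context_token_ids[start:start + window_size]
        windows.append((start, window))  # keep the offset to map spans back
        if start + window_size >= len(context_token_ids):
            break
        start += step
    return windows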

This approach is simple and effective, but it has limitations. If the answer requires evidence from distant parts of the document, a fixed window may miss the necessary context. Long-context models, retrieval systems, and hierarchical encoders help with this problem.

No-Answer Questions

Some datasets include questions that cannot be answered from the provided passage.

Example:

Question: What year did Ada Lovelace win the Turing Award?

Context: Ada Lovelace wrote notes on Charles Babbage’s Analytical Engine in the 1840s.

There is no valid answer in the context.

A common method is to let the special [CLS] token represent “no answer.” The model predicts start and end positions at [CLS] when no answer exists.

The system compares the best non-empty span score with the no-answer score. If the no-answer score is higher by a threshold, it returns no answer.

def should_answer(best_span_score, no_answer_score, threshold=0.0):
    return best_span_score > no_answer_score + threshold

The threshold is tuned on a validation set.
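A sketch that combines these pieces, reusing decode_span from above and assuming position 0 is the [CLS] token:

def best_span_and_no_answer(start_logits, end_logits, max_answer_len=30):
    # No-answer score: start and end both at the [CLS] position (index 0).
    no_answer_score = start_logits[0].item() + end_logits[0].item()

    # Best non-empty span, searched over positions 1..T-1.
    start, end = decode_span(start_logits[1:], end_logits[1:], max_answer_len)
    start, end = start + 1, end + 1  # shift back to full-sequence indices
    best_span_score = start_logits[start].item() + end_logits[end].item()

    return (start, end), best_span_score, no_answer_score

# span, span_score, na_score = best_span_and_no_answer(start_logits, end_logits)
# answer = span if should_answer(span_score, na_score) else None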

Generative Question Answering

Generative QA produces answer text directly. Instead of selecting a span, the model generates a sequence of tokens.

The input may be:

question: What is dropout?
context: Dropout randomly masks hidden units during training...
answer:

The output may be:

Dropout randomly disables units during training to improve generalization.

Encoder-decoder models and decoder-only language models are commonly used for generative QA.

For an answer sequence

y = (y_1, y_2, \ldots, y_M),

the model defines

p(y \mid x) = \prod_{m=1}^{M} p(y_m \mid x, y_{<m}).

Training uses teacher forcing and token-level cross-entropy:

\mathcal{L} = -\sum_{m=1}^{M} \log p_\theta(y_m^\star \mid x, y_{<m}^\star).
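A sketch of this objective in PyTorch, assuming the model already produced logits of shape [B, M, V] for the teacher-forced answer positions; the function name and padding convention are illustrative assumptions:

import torch.nn as nn

def generative_qa_loss(logits, target_ids, pad_token_id=0):
    # logits: [B, M, V] scores for each answer position (teacher forcing means the
    # decoder saw the gold tokens y_{<m} when predicting position m).
    # target_ids: [B, M] gold answer tokens; padding positions are ignored.
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_token_id)
    B, M, V = logits.shape
    return loss_fn(logits.reshape(B * M, V), target_ids.reshape(B * M))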

Generative QA is more flexible than extractive QA. It can summarize, synthesize, and rephrase. It can also hallucinate, so evidence grounding and evaluation become more important.

Retrieval-Augmented Question Answering

Retrieval-augmented QA uses an external document collection. The system first retrieves relevant passages, then answers using those passages.

A typical pipeline is:

  1. Receive a question.
  2. Retrieve candidate passages.
  3. Rerank passages.
  4. Feed top passages to a reader or generator.
  5. Produce an answer with citations or evidence.

The retriever may use sparse search, dense search, or a hybrid method.

Sparse retrieval uses lexical matching, such as BM25. Dense retrieval embeds questions and passages into vectors and compares them by dot product or cosine similarity.

Let q be the question embedding and d_i be the embedding of passage i. Dense retrieval ranks passages by

s(q, d_i) = q^\top d_i.
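With precomputed passage embeddings stacked into a matrix, the top passages can be found with a single matrix product. A minimal sketch, assuming the embeddings are already computed:

import torch

def retrieve_top_k(question_emb, passage_embs, k=5):
    # question_emb: [D], passage_embs: [N, D]
    # Computes s(q, d_i) = q^T d_i for every passage at once.
    # Normalizing both embeddings first would give cosine similarity instead.
    scores = passage_embs @ question_emb              # [N]
    top = torch.topk(scores, k=min(k, scores.numel()))
    return top.indices.tolist(), top.values.tolist()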

Hybrid retrieval often works better than either sparse or dense retrieval alone, especially in technical domains where exact terms matter.

Multiple-Choice Question Answering

In multiple-choice QA, the model selects one answer from a fixed set of options.

Example:

Question: Which model architecture uses self-attention as its central operation?

A. Decision tree
B. Transformer
C. Naive Bayes
D. k-means

The model scores each option:

s_k = f_\theta(q, a_k),

where q is the question and a_k is option k.

The probability of option k is

p(k \mid q) = \frac{\exp(s_k)}{\sum_j \exp(s_j)}.

Training uses cross-entropy over the answer choices.

A common implementation concatenates the question with each option, encodes each pair, and applies a scalar scoring head.

class MultipleChoiceQA(nn.Module):
    def __init__(self, encoder, hidden_dim: int):
        super().__init__()
        self.encoder = encoder
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, input_ids, attention_mask, labels=None):
        # input_ids: [B, C, T]
        # C is the number of choices.

        B, C, T = input_ids.shape

        flat_input_ids = input_ids.reshape(B * C, T)
        flat_attention_mask = attention_mask.reshape(B * C, T)

        outputs = self.encoder(
            input_ids=flat_input_ids,
            attention_mask=flat_attention_mask,
        )

        cls_state = outputs.last_hidden_state[:, 0, :]  # [B*C, D]
        scores = self.scorer(cls_state).reshape(B, C)   # [B, C]

        if labels is None:
            return scores

        loss_fn = nn.CrossEntropyLoss()
        loss = loss_fn(scores, labels)
        return loss, scores

Evaluation

Question answering evaluation depends on the task.

For extractive QA, common metrics are exact match and token-level F1.

Exact match requires the predicted answer string to match a gold answer after normalization.

Token-level F1 compares overlapping tokens between the predicted answer and gold answer.
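A common normalization lowercases the text, removes punctuation and articles, and collapses whitespace. A sketch of exact match and token-level F1 under that scheme:

import re
import string
from collections import Counter

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)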

For generative QA, exact string matching is often too strict. Valid answers may be phrased differently. Metrics may include semantic similarity, human judgment, factual consistency checks, citation accuracy, and task-specific scoring.

For retrieval QA, evaluation should measure both retrieval quality and answer quality.

| Component | Common metrics |
| --- | --- |
| Retriever | Recall@k, MRR, nDCG |
| Reader | Exact match, F1 |
| Generator | Human preference, factuality, citation support |
| End-to-end system | Answer correctness, evidence quality, latency |

A system can fail because the retriever missed the evidence, because the reader selected the wrong span, or because the generator ignored the evidence. Evaluating each component separately makes debugging easier.

Common Failure Modes

Question answering systems have several recurring failure modes.

| Failure mode | Description |
| --- | --- |
| Retrieval miss | Relevant evidence does not reach the model |
| Boundary error | Extracted span is too short or too long |
| Entity confusion | Model selects the wrong person, date, or organization |
| Negation error | Model ignores “not,” “except,” or “unless” |
| Multi-hop failure | Answer requires combining evidence across passages |
| Hallucination | Generator produces unsupported claims |
| Temporal error | Model uses stale or wrong time context |
| Ambiguous question | Several answers are plausible |
| Unanswerable question | Model answers despite insufficient evidence |

In high-stakes systems, the model should be allowed to abstain. Returning “not enough information” is often better than producing a confident unsupported answer.

Practical Design Choices

The right QA architecture depends on the setting.

| Setting | Suitable approach |
| --- | --- |
| Answer appears in a short passage | Extractive QA |
| Need fluent explanation | Generative QA |
| Large document collection | Retrieval-augmented QA |
| Fixed answer options | Multiple-choice QA |
| Legal or medical evidence | Retrieval plus extractive or citation-grounded generation |
| Customer support | Retrieval plus generative answer with source links |

Extractive systems are easier to constrain because answers come from the passage. Generative systems are more flexible but need stronger grounding. Retrieval-augmented systems are usually the best default when the answer should depend on external documents.

Summary

Question answering maps questions to answers. Extractive QA selects a span from a context passage. Generative QA produces answer text token by token. Multiple-choice QA scores candidate answers. Retrieval-augmented QA first finds evidence, then answers from that evidence.

In PyTorch, extractive QA is typically implemented with a transformer encoder and a span prediction head. The model predicts start and end logits over token positions. Training uses cross-entropy for both boundaries.

Reliable QA depends on more than the model. Long contexts, unanswerable questions, retrieval quality, answer calibration, and evidence grounding are central design concerns.