Summarization is the task of producing a shorter version of one or more source texts while preserving the important information. The input may be a news article, a scientific paper, a legal document, a support thread, a meeting transcript, a code review, or a set of retrieved passages. The output is a compact text that should be faithful, readable, and appropriate for the user’s purpose.
Examples:
Input: A long news article about an election result.
Output: A paragraph describing who won, by what margin, and what happens next.

Input: A meeting transcript.
Output: Decisions, open questions, and action items.

Summarization is a sequence-to-sequence problem. The model receives a source sequence $x = (x_1, \dots, x_n)$ and produces a target sequence $y = (y_1, \dots, y_m)$, where usually $m \ll n$.
The goal is not merely to shorten text. The goal is to select, compress, organize, and express the relevant content.
Extractive and Abstractive Summarization
There are two main forms of summarization.
| Type | Description |
|---|---|
| Extractive summarization | Selects sentences or spans from the source |
| Abstractive summarization | Generates new wording based on the source |
Extractive summarization is closer to ranking. The system chooses important units from the original document. For example, it may select three sentences from a news article.
Abstractive summarization is closer to conditional generation. The system writes a new summary that may paraphrase, combine, or reorder information.
Extractive systems are easier to constrain because every selected sentence comes from the source. Abstractive systems are more flexible but can introduce unsupported claims.
Encoder-Decoder Formulation
Modern abstractive summarization often uses an encoder-decoder transformer.
The encoder reads the source document and produces hidden states:

$$h_{1:n} = \mathrm{Encoder}(x_{1:n})$$
The decoder generates the summary one token at a time:

$$p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, x)$$
Training usually uses teacher forcing. At step $t$, the decoder receives the gold previous tokens $y_{<t}$ and predicts the next gold token $y_t$.
The loss is token-level negative log-likelihood:

$$\mathcal{L} = -\sum_{t=1}^{m} \log p(y_t \mid y_{<t}, x)$$
This objective teaches the model to imitate reference summaries. It does not directly optimize factuality, coverage, usefulness, or brevity. Those properties must be handled through data, decoding, evaluation, and system design.
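Concretely, this objective is ordinary cross-entropy over target tokens. A minimal sketch in PyTorch, assuming logits already aligned with the labels and the `-100` masking convention described in the batching section below:

```python
import torch.nn.functional as F

def summarization_nll(logits, labels):
    # logits: [B, T_tgt, V]; labels: [B, T_tgt], with -100 at positions
    # (such as padding) that should not contribute to the loss.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # [B * T_tgt, V]
        labels.reshape(-1),                   # [B * T_tgt]
        ignore_index=-100,
    )
```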
Decoder-Only Summarization
Decoder-only language models can also perform summarization. The source text and instruction are placed in the prompt, and the model continues with the summary.
Example prompt:
```
Summarize the following article in five bullet points.

Article:
...
```

The model defines the same autoregressive factorization:

$$p(y \mid c) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, c)$$

where $c$ is the prompt containing the instruction and source text.
Decoder-only models are convenient for instruction-following summarization. They can adapt the output format using natural language instructions. Encoder-decoder models are often more efficient when the task is fixed and the source text is long relative to the output.
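As an illustration, prompt-based summarization with a Hugging Face-style causal LM might look like the sketch below. The prompt template, the `Summary:` cue, and the token budget are illustrative assumptions, not fixed conventions.

```python
import torch

@torch.no_grad()
def prompt_summarize(lm, tokenizer, article, device):
    # Hypothetical template; chat models may require their own format.
    prompt = (
        "Summarize the following article in five bullet points.\n\n"
        f"Article:\n{article}\n\nSummary:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output_ids = lm.generate(
        **inputs,
        max_new_tokens=160,
        do_sample=False,  # deterministic decoding for stability
    )
    # Only the continuation after the prompt is the summary.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```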
Building a Summarization Dataset
A supervised summarization dataset contains source-summary pairs:
```json
{
  "source": "Long document text...",
  "summary": "Short summary..."
}
```

The quality of this dataset determines the behavior of the model. If the references are brief, the model learns brief summaries. If the references include opinions, the model learns opinions. If references contain unsupported details, the model learns to hallucinate.
Important dataset properties include:
| Property | Why it matters |
|---|---|
| Domain | News, legal, medical, code, meetings, and research papers require different summaries |
| Compression ratio | Controls how much shorter the summary should be |
| Reference style | Bullets, paragraph, headline, abstract, action items |
| Factual alignment | Reference should be supported by the source |
| Recency | For time-sensitive domains, summaries must reflect current conventions |
| Length distribution | Training length should match deployment length |
A dataset for meeting summarization may need decisions and action items. A dataset for scientific summarization may need methods, results, and limitations. A dataset for legal summarization may need parties, claims, holdings, and procedural posture.
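Several of these properties can be audited before training. A small sketch, assuming the `source`/`summary` fields from the example above and any function that maps text to a token list:

```python
import statistics

def dataset_stats(pairs, tokenize):
    src_lens = [len(tokenize(ex["source"])) for ex in pairs]
    tgt_lens = [len(tokenize(ex["summary"])) for ex in pairs]
    ratios = [t / max(s, 1) for s, t in zip(src_lens, tgt_lens)]
    return {
        "median_source_len": statistics.median(src_lens),
        "median_summary_len": statistics.median(tgt_lens),
        "median_compression_ratio": statistics.median(ratios),
    }
```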
Tokenization and Batching
Summarization uses both input tokenization and target tokenization.
For a batch of examples:
```python
input_ids        # [B, T_src]
attention_mask   # [B, T_src]
labels           # [B, T_tgt]
```

The source length `T_src` and the target length `T_tgt` may differ.
Padding is used to make examples in a batch the same length. Loss should ignore padding tokens in the labels. In PyTorch and Hugging Face-style training, ignored target positions are often set to -100.
```python
labels[labels == tokenizer.pad_token_id] = -100
```

This prevents padding tokens from contributing to the loss.
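Putting these pieces together, a collate function might look like the following sketch, assuming a Hugging Face-style tokenizer and the `source`/`summary` field names used earlier:

```python
def collate_batch(examples, tokenizer, max_src_len=1024, max_tgt_len=160):
    inputs = tokenizer(
        [ex["source"] for ex in examples],
        truncation=True, max_length=max_src_len,
        padding=True, return_tensors="pt",
    )
    labels = tokenizer(
        [ex["summary"] for ex in examples],
        truncation=True, max_length=max_tgt_len,
        padding=True, return_tensors="pt",
    )["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100  # mask padding in the loss
    inputs["labels"] = labels
    return inputs
```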
A Minimal PyTorch Wrapper
A practical summarization model often wraps a pretrained encoder-decoder model.
```python
import torch
import torch.nn as nn

class Summarizer(nn.Module):
    def __init__(self, seq2seq_model):
        super().__init__()
        self.model = seq2seq_model

    def forward(self, input_ids, attention_mask, labels=None):
        # The pretrained model computes the token-level loss internally
        # when labels are provided.
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,
        )
        return outputs.loss, outputs.logits
```

This looks small because the pretrained model contains the encoder, decoder, attention layers, embeddings, language modeling head, and generation utilities.
A training step is similar to other supervised sequence tasks:
```python
def train_step(model, batch, optimizer, device):
    model.train()
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["labels"].to(device)

    optimizer.zero_grad(set_to_none=True)
    loss, logits = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        labels=labels,
    )
    loss.backward()
    optimizer.step()
    return loss.item()
```

Decoding Summaries
At inference time, the model must decode a sequence. Common decoding methods include greedy search, beam search, top-k sampling, nucleus sampling, and constrained decoding.
For summarization, deterministic decoding is often preferred. Greedy search and beam search are common because summaries should be stable and faithful.
| Method | Behavior |
|---|---|
| Greedy search | Chooses the highest-probability token at each step |
| Beam search | Keeps several candidate sequences |
| Top-k sampling | Samples from the top $k$ tokens |
| Nucleus sampling | Samples from the smallest set whose probability mass exceeds $p$ |
| Length penalty | Adjusts preference for shorter or longer outputs |
| Repetition penalty | Reduces repeated phrases |
Beam search can improve fluency, but large beam sizes may produce generic summaries. Sampling can produce varied summaries, but it can also increase hallucination. For factual summarization, lower-temperature decoding is usually safer.
Example generation call:
```python
@torch.no_grad()
def summarize(model, tokenizer, text, device):
    model.eval()
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=1024,
    )
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)
    output_ids = model.model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=160,
        num_beams=4,
        length_penalty=1.0,
        no_repeat_ngram_size=3,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Long-Document Summarization
Many documents exceed the context length of a standard model. Long-document summarization needs special handling.
Common strategies include:
| Strategy | Description |
|---|---|
| Truncation | Keep only the beginning of the document |
| Sliding windows | Summarize overlapping chunks |
| Map-reduce summarization | Summarize chunks, then summarize the summaries |
| Hierarchical encoding | Encode chunks, then aggregate at document level |
| Retrieval-based summarization | Select relevant passages before summarizing |
| Long-context models | Use models designed for long inputs |
Naive truncation is dangerous. In many documents, important information appears near the end: conclusions, decisions, risks, action items, or exceptions.
A simple map-reduce pipeline:
```python
def map_reduce_summarize(summarize_fn, chunks):
    # Map: summarize each chunk independently.
    chunk_summaries = []
    for chunk in chunks:
        chunk_summaries.append(summarize_fn(chunk))
    # Reduce: summarize the concatenated chunk summaries.
    combined = "\n".join(chunk_summaries)
    final_summary = summarize_fn(combined)
    return final_summary
```

This is useful, but it can lose details. Errors in the first stage can propagate into the final summary.
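The chunks themselves can come from a sliding window over tokens. A minimal sketch, assuming a tokenizer with `encode`/`decode` methods; the window and overlap sizes are illustrative:

```python
def chunk_text(text, tokenizer, window=900, overlap=100):
    ids = tokenizer.encode(text)
    chunks, step = [], window - overlap
    for start in range(0, len(ids), step):
        piece = ids[start : start + window]
        chunks.append(tokenizer.decode(piece, skip_special_tokens=True))
        if start + window >= len(ids):
            break
    return chunks
```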
Controlling Summary Style
A summarizer should match the user’s purpose. The same source may need different outputs.
| Purpose | Good output form |
|---|---|
| Executive reading | Short paragraph with key consequences |
| Meeting review | Decisions, action items, owners, dates |
| Research paper | Problem, method, result, limitations |
| Legal document | Issue, rule, holding, reasoning |
| Customer support | Problem, resolution, next step |
| Search result | One-sentence snippet with evidence |
With instruction-tuned models, style can be controlled through prompts:
```
Summarize the document as:
- Decision:
- Evidence:
- Risks:
- Next actions:
```

For supervised models, style is controlled mainly by training data. If the model is fine-tuned on bullet summaries, it will tend to generate bullets.
Factuality and Hallucination
A summarization model hallucinates when it adds information that is unsupported by the source.
Hallucinations include:
| Type | Example |
|---|---|
| Entity error | Wrong person, company, drug, or place |
| Number error | Wrong amount, date, percentage, or count |
| Relation error | Reverses who did what |
| Causal error | Invents a cause or consequence |
| Temporal error | Misstates when something happened |
| Unsupported inference | Adds a conclusion absent from the source |
Factuality is central in summarization. A fluent summary can still be wrong.
Methods to reduce hallucination include:
- Prefer extractive or evidence-grounded summaries for high-stakes use.
- Use retrieval or citation constraints.
- Decode conservatively.
- Ask the model to quote or cite supporting spans.
- Run a separate factual consistency checker.
- Use domain-specific evaluation data.
- Avoid asking for information that the source does not contain.
A practical rule is simple: the summary should not contain a claim that cannot be traced to the input.
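A full consistency checker is a system of its own, but cheap screens still help. The sketch below flags numbers in the summary that never appear in the source; it catches some number drift but says nothing about relation or causal errors.

```python
import re

def unsupported_numbers(source, summary):
    # Crude heuristic screen, not a factuality model.
    number = re.compile(r"\d[\d,.]*")
    source_numbers = set(number.findall(source))
    return [n for n in number.findall(summary) if n not in source_numbers]
```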
Evaluation Metrics
Summarization evaluation is difficult because many valid summaries may exist for the same source.
Common automatic metrics include:
| Metric | Measures | Limitation |
|---|---|---|
| ROUGE | N-gram overlap with reference | Rewards surface similarity |
| BLEU | Precision-oriented overlap | Designed for translation |
| METEOR | Overlap with stemming/synonyms | Still reference-dependent |
| BERTScore | Semantic similarity | May miss factual errors |
| QAEval-style metrics | Answer consistency | Depends on QA quality |
| Human evaluation | Relevance, coherence, factuality | Expensive |
ROUGE is common but incomplete. A summary can have high ROUGE and still contain factual errors. A summary can have low ROUGE and still be useful if it is phrased differently from the reference.
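For intuition, ROUGE-1 F1 reduces to unigram overlap. A minimal sketch follows; real evaluations should use a maintained implementation with stemming and standard tokenization.

```python
from collections import Counter

def rouge1_f(reference, candidate):
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```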
Human evaluation often uses dimensions such as:
| Dimension | Meaning |
|---|---|
| Coverage | Captures important source information |
| Faithfulness | Does not add unsupported claims |
| Coherence | Reads naturally |
| Concision | Avoids unnecessary detail |
| Usefulness | Serves the intended task |
Extractive Baselines
Before training a large abstractive model, build extractive baselines. They are easy to implement and help detect whether the task really requires generation.
A simple baseline ranks sentences by similarity to the document centroid or by term importance. Another baseline selects the first few sentences, which is surprisingly strong for news articles because important information often appears at the beginning.
A minimal lead baseline:
```python
def lead_summary(text, num_sentences=3):
    # split_into_sentences stands in for any sentence segmenter.
    sentences = split_into_sentences(text)
    return " ".join(sentences[:num_sentences])
```

A neural system should beat this baseline on the metrics that matter. If it does not, the dataset or evaluation protocol may be weak.
Common Failure Modes
Summarization systems fail in recurring ways.
| Failure mode | Description |
|---|---|
| Hallucination | Adds unsupported facts |
| Omission | Leaves out central information |
| Over-compression | Removes necessary context |
| Redundancy | Repeats the same idea |
| Entity drift | Confuses names or references |
| Number drift | Changes numerical values |
| Style mismatch | Wrong tone, length, or format |
| Lost chronology | Events appear in the wrong order |
| Source bias amplification | Repeats biased framing without qualification |
| Prompt overreach | Answers beyond the supplied document |
Long documents add more failure modes. The model may focus on early sections, miss tables, ignore appendices, or confuse similar entities across sections.
Practical System Design
A production summarization system should make several design choices explicitly.
| Decision | Typical choices |
|---|---|
| Input scope | Single document, many documents, retrieved passages |
| Summary type | Extractive, abstractive, hybrid |
| Output format | Paragraph, bullets, structured fields |
| Length control | Token limit, sentence count, section budget |
| Evidence policy | No citations, inline citations, quoted support |
| Update behavior | Static summary, incremental summary |
| Risk tolerance | Creative, conservative, high-faithfulness |
For low-risk consumer summaries, an abstractive model may be acceptable. For legal, medical, financial, or compliance settings, evidence-grounded summarization is safer. The system should preserve important source text, expose uncertainty, and support auditability.
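One way to keep these decisions explicit is a configuration object. The sketch below is hypothetical; the field names mirror the table above rather than any particular library.

```python
from dataclasses import dataclass

@dataclass
class SummarizationConfig:
    # Hypothetical settings mirroring the design table above.
    input_scope: str = "single_document"       # or "multi_document", "retrieved"
    summary_type: str = "abstractive"          # or "extractive", "hybrid"
    output_format: str = "bullets"             # or "paragraph", "structured"
    max_output_tokens: int = 160
    evidence_policy: str = "inline_citations"  # or "none", "quoted_support"
    risk_tolerance: str = "high_faithfulness"  # or "conservative", "creative"
```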
Summary
Summarization compresses source text into a shorter output. Extractive summarization selects source spans. Abstractive summarization generates new text. Encoder-decoder models and decoder-only language models are both widely used.
The core training objective is next-token likelihood conditioned on the source. This objective gives a useful model, but it does not guarantee factuality. Reliable summarization requires careful data, decoding, long-context handling, evaluation, and error analysis.
A good summarizer should preserve what matters, omit what does not, and avoid unsupported claims.