Summarization is the task of producing a shorter version of one or more source texts while preserving the important information. The input may be a news article, a scientific paper, a legal document, a support thread, a meeting transcript, a code review, or a set of retrieved passages. The output is a compact text that should be faithful, readable, and appropriate for the user’s purpose.
Examples:
Input: A long news article about an election result.
Output: A paragraph describing who won, by what margin, and what happens next.

Input: A meeting transcript.
Output: Decisions, open questions, and action items.

Summarization is a sequence-to-sequence problem. The model receives a source sequence $x = (x_1, \dots, x_n)$ and produces a target sequence $y = (y_1, \dots, y_m)$, where usually $m \ll n$.
The goal is not merely to shorten text. The goal is to select, compress, organize, and express the relevant content.
Extractive and Abstractive Summarization
There are two main forms of summarization.
| Type | Description |
|---|---|
| Extractive summarization | Selects sentences or spans from the source |
| Abstractive summarization | Generates new wording based on the source |
Extractive summarization is closer to ranking. The system chooses important units from the original document. For example, it may select three sentences from a news article.
Abstractive summarization is closer to conditional generation. The system writes a new summary that may paraphrase, combine, or reorder information.
Extractive systems are easier to constrain because every selected sentence comes from the source. Abstractive systems are more flexible but can introduce unsupported claims.
Encoder-Decoder Formulation
Modern abstractive summarization often uses an encoder-decoder transformer.
The encoder reads the source document and produces hidden states:

$$h_{1:n} = \mathrm{Encoder}(x_{1:n})$$
The decoder generates the summary one token at a time:

$$p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, x)$$
Training usually uses teacher forcing. At step $t$, the decoder receives the gold previous tokens $y_{<t}$ and predicts the next gold token $y_t$.
The loss is token-level negative log-likelihood:

$$\mathcal{L} = -\sum_{t=1}^{m} \log p(y_t \mid y_{<t}, x)$$
This objective teaches the model to imitate reference summaries. It does not directly optimize factuality, coverage, usefulness, or brevity. Those properties must be handled through data, decoding, evaluation, and system design.
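Concretely, this objective is ordinary cross-entropy over target tokens. A minimal sketch in PyTorch, assuming logits already aligned with the labels and the `-100` masking convention described in the batching section below:

```python
import torch.nn.functional as F

def summarization_nll(logits, labels):
    # logits: [B, T_tgt, V]; labels: [B, T_tgt], with -100 at positions
    # (such as padding) that should not contribute to the loss.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # [B * T_tgt, V]
        labels.reshape(-1),                   # [B * T_tgt]
        ignore_index=-100,
    )
```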
Decoder-Only Summarization
Decoder-only language models can also perform summarization. The source text and instruction are placed in the prompt, and the model continues with the summary.
Example prompt:
```
Summarize the following article in five bullet points.

Article:
...
```

The model defines the same autoregressive factorization:

$$p(y \mid c) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, c)$$

where $c$ is the prompt containing the instruction and source text.
Decoder-only models are convenient for instruction-following summarization. They can adapt the output format using natural language instructions. Encoder-decoder models are often more efficient when the task is fixed and the source text is long relative to the output.
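As an illustration, prompt-based summarization with a Hugging Face-style causal LM might look like the sketch below. The prompt template, the `Summary:` cue, and the token budget are illustrative assumptions, not fixed conventions.

```python
import torch

@torch.no_grad()
def prompt_summarize(lm, tokenizer, article, device):
    # Hypothetical template; chat models may require their own format.
    prompt = (
        "Summarize the following article in five bullet points.\n\n"
        f"Article:\n{article}\n\nSummary:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output_ids = lm.generate(
        **inputs,
        max_new_tokens=160,
        do_sample=False,  # deterministic decoding for stability
    )
    # Only the continuation after the prompt is the summary.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```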
Building a Summarization Dataset
A supervised summarization dataset contains source-summary pairs:
```json
{
  "source": "Long document text...",
  "summary": "Short summary..."
}
```

The quality of this dataset determines the behavior of the model. If the references are brief, the model learns brief summaries. If the references include opinions, the model learns opinions. If references contain unsupported details, the model learns to hallucinate.
Important dataset properties include:
| Property | Why it matters |
|---|---|
| Domain | News, legal, medical, code, meetings, and research papers require different summaries |
| Compression ratio | Controls how much shorter the summary should be |
| Reference style | Bullets, paragraph, headline, abstract, action items |
| Factual alignment | Reference should be supported by the source |
| Recency | For time-sensitive domains, summaries must reflect current conventions |
| Length distribution | Training length should match deployment length |
A dataset for meeting summarization may need decisions and action items. A dataset for scientific summarization may need methods, results, and limitations. A dataset for legal summarization may need parties, claims, holdings, and procedural posture.
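Several of these properties can be audited before training. A small sketch, assuming the `source`/`summary` fields from the example above and any function that maps text to a token list:

```python
import statistics

def dataset_stats(pairs, tokenize):
    src_lens = [len(tokenize(ex["source"])) for ex in pairs]
    tgt_lens = [len(tokenize(ex["summary"])) for ex in pairs]
    ratios = [t / max(s, 1) for s, t in zip(src_lens, tgt_lens)]
    return {
        "median_source_len": statistics.median(src_lens),
        "median_summary_len": statistics.median(tgt_lens),
        "median_compression_ratio": statistics.median(ratios),
    }
```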
Tokenization and Batching
Summarization uses both input tokenization and target tokenization.
For a batch of examples:
```python
input_ids        # [B, T_src]
attention_mask   # [B, T_src]
labels           # [B, T_tgt]
```

The source length `T_src` and the target length `T_tgt` may differ.
Padding is used to make examples in a batch the same length. Loss should ignore padding tokens in the labels. In PyTorch and Hugging Face-style training, ignored target positions are often set to -100.
```python
labels[labels == tokenizer.pad_token_id] = -100
```

This prevents padding tokens from contributing to the loss.
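Putting these pieces together, a collate function might look like the following sketch, assuming a Hugging Face-style tokenizer and the `source`/`summary` field names used earlier:

```python
def collate_batch(examples, tokenizer, max_src_len=1024, max_tgt_len=160):
    inputs = tokenizer(
        [ex["source"] for ex in examples],
        truncation=True, max_length=max_src_len,
        padding=True, return_tensors="pt",
    )
    labels = tokenizer(
        [ex["summary"] for ex in examples],
        truncation=True, max_length=max_tgt_len,
        padding=True, return_tensors="pt",
    )["input_ids"]
    labels[labels == tokenizer.pad_token_id] = -100  # mask padding in the loss
    inputs["labels"] = labels
    return inputs
```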
A Minimal PyTorch Wrapper
A practical summarization model often wraps a pretrained encoder-decoder model.
```python
import torch
import torch.nn as nn

class Summarizer(nn.Module):
    def __init__(self, seq2seq_model):
        super().__init__()
        self.model = seq2seq_model

    def forward(self, input_ids, attention_mask, labels=None):
        # The pretrained model computes the token-level loss internally
        # when labels are provided.
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,
        )
        return outputs.loss, outputs.logits
```

This looks small because the pretrained model contains the encoder, decoder, attention layers, embeddings, language modeling head, and generation utilities.
A training step is similar to other supervised sequence tasks:
```python
def train_step(model, batch, optimizer, device):
    model.train()
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["labels"].to(device)

    optimizer.zero_grad(set_to_none=True)
    loss, logits = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        labels=labels,
    )
    loss.backward()
    optimizer.step()
    return loss.item()
```

Decoding Summaries
At inference time, the model must decode a sequence. Common decoding methods include greedy search, beam search, top-k sampling, nucleus sampling, and constrained decoding.
For summarization, deterministic decoding is often preferred. Greedy search and beam search are common because summaries should be stable and faithful.
| Method | Behavior |
|---|---|
| Greedy search | Chooses the highest-probability token at each step |
| Beam search | Keeps several candidate sequences |
| Top-k sampling | Samples from the top $k$ tokens |
| Nucleus sampling | Samples from the smallest set whose probability mass exceeds $p$ |
| Length penalty | Adjusts preference for shorter or longer outputs |
| Repetition penalty | Reduces repeated phrases |
Beam search can improve fluency, but large beam sizes may produce generic summaries. Sampling can produce varied summaries, but it can also increase hallucination. For factual summarization, lower-temperature decoding is usually safer.
Example generation call:
```python
@torch.no_grad()
def summarize(model, tokenizer, text, device):
    model.eval()
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=1024,
    )
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)
    output_ids = model.model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=160,
        num_beams=4,
        length_penalty=1.0,
        no_repeat_ngram_size=3,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Long-Document Summarization
Many documents exceed the context length of a standard model. Long-document summarization needs special handling.
Common strategies include:
| Strategy | Description |
|---|---|
| Truncation | Keep only the beginning of the document |
| Sliding windows | Summarize overlapping chunks |
| Map-reduce summarization | Summarize chunks, then summarize the summaries |
| Hierarchical encoding | Encode chunks, then aggregate at document level |
| Retrieval-based summarization | Select relevant passages before summarizing |
| Long-context models | Use models designed for long inputs |
Naive truncation is dangerous. In many documents, important information appears near the end: conclusions, decisions, risks, action items, or exceptions.
A simple map-reduce pipeline:
```python
def map_reduce_summarize(summarize_fn, chunks):
    # Map: summarize each chunk independently.
    chunk_summaries = []
    for chunk in chunks:
        chunk_summaries.append(summarize_fn(chunk))
    # Reduce: summarize the concatenated chunk summaries.
    combined = "\n".join(chunk_summaries)
    final_summary = summarize_fn(combined)
    return final_summary
```

This is useful, but it can lose details. Errors in the first stage can propagate into the final summary.
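The chunks themselves can come from a sliding window over tokens. A minimal sketch, assuming a tokenizer with `encode`/`decode` methods; the window and overlap sizes are illustrative:

```python
def chunk_text(text, tokenizer, window=900, overlap=100):
    ids = tokenizer.encode(text)
    chunks, step = [], window - overlap
    for start in range(0, len(ids), step):
        piece = ids[start : start + window]
        chunks.append(tokenizer.decode(piece, skip_special_tokens=True))
        if start + window >= len(ids):
            break
    return chunks
```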
Controlling Summary Style
A summarizer should match the user’s purpose. The same source may need different outputs.
| Purpose | Good output form |
|---|---|
| Executive reading | Short paragraph with key consequences |
| Meeting review | Decisions, action items, owners, dates |
| Research paper | Problem, method, result, limitations |
| Legal document | Issue, rule, holding, reasoning |
| Customer support | Problem, resolution, next step |
| Search result | One-sentence snippet with evidence |
With instruction-tuned models, style can be controlled through prompts:
```
Summarize the document as:
- Decision:
- Evidence:
- Risks:
- Next actions:
```

For supervised models, style is controlled mainly by training data. If the model is fine-tuned on bullet summaries, it will tend to generate bullets.
Factuality and Hallucination
A summarization model hallucinates when it adds information that is unsupported by the source.
Hallucinations include:
| Type | Example |
|---|---|
| Entity error | Wrong person, company, drug, or place |
| Number error | Wrong amount, date, percentage, or count |
| Relation error | Reverses who did what |
| Causal error | Invents a cause or consequence |
| Temporal error | Misstates when something happened |
| Unsupported inference | Adds a conclusion absent from the source |
Factuality is central in summarization. A fluent summary can still be wrong.
Methods to reduce hallucination include:
- Prefer extractive or evidence-grounded summaries for high-stakes use.
- Use retrieval or citation constraints.
- Decode conservatively.
- Ask the model to quote or cite supporting spans.
- Run a separate factual consistency checker.
- Use domain-specific evaluation data.
- Avoid asking for information that the source does not contain.
A practical rule is simple: the summary should not contain a claim that cannot be traced to the input.
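A full consistency checker is a system of its own, but cheap screens still help. The sketch below flags numbers in the summary that never appear in the source; it catches some number drift but says nothing about relation or causal errors.

```python
import re

def unsupported_numbers(source, summary):
    # Crude heuristic screen, not a factuality model.
    number = re.compile(r"\d[\d,.]*")
    source_numbers = set(number.findall(source))
    return [n for n in number.findall(summary) if n not in source_numbers]
```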
Evaluation Metrics
Summarization evaluation is difficult because many valid summaries may exist for the same source.
Common automatic metrics include:
| Metric | Measures | Limitation |
|---|---|---|
| ROUGE | N-gram overlap with reference | Rewards surface similarity |
| BLEU | Precision-oriented overlap | Designed for translation |
| METEOR | Overlap with stemming/synonyms | Still reference-dependent |
| BERTScore | Semantic similarity | May miss factual errors |
| QAEval-style metrics | Answer consistency | Depends on QA quality |
| Human evaluation | Relevance, coherence, factuality | Expensive |
ROUGE is common but incomplete. A summary can have high ROUGE and still contain factual errors. A summary can have low ROUGE and still be useful if it is phrased differently from the reference.
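For intuition, ROUGE-1 F1 reduces to unigram overlap. A minimal sketch follows; real evaluations should use a maintained implementation with stemming and standard tokenization.

```python
from collections import Counter

def rouge1_f(reference, candidate):
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```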
Human evaluation often uses dimensions such as:
| Dimension | Meaning |
|---|---|
| Coverage | Captures important source information |
| Faithfulness | Does not add unsupported claims |
| Coherence | Reads naturally |
| Concision | Avoids unnecessary detail |
| Usefulness | Serves the intended task |
Extractive Baselines
Before training a large abstractive model, build extractive baselines. They are easy to implement and help detect whether the task really requires generation.
A simple baseline ranks sentences by similarity to the document centroid or by term importance. Another baseline selects the first few sentences, which is surprisingly strong for news articles because important information often appears at the beginning.
A minimal lead baseline:
```python
def lead_summary(text, num_sentences=3):
    # split_into_sentences stands in for any sentence segmenter.
    sentences = split_into_sentences(text)
    return " ".join(sentences[:num_sentences])
```

A neural system should beat this baseline on the metrics that matter. If it does not, the dataset or evaluation protocol may be weak.
Common Failure Modes
Summarization systems fail in recurring ways.
| Failure mode | Description |
|---|---|
| Hallucination | Adds unsupported facts |
| Omission | Leaves out central information |
| Over-compression | Removes necessary context |
| Redundancy | Repeats the same idea |
| Entity drift | Confuses names or references |
| Number drift | Changes numerical values |
| Style mismatch | Wrong tone, length, or format |
| Lost chronology | Events appear in the wrong order |
| Source bias amplification | Repeats biased framing without qualification |
| Prompt overreach | Answers beyond the supplied document |
Long documents add more failure modes. The model may focus on early sections, miss tables, ignore appendices, or confuse similar entities across sections.
Practical System Design
A production summarization system should make several design choices explicitly.
| Decision | Typical choices |
|---|---|
| Input scope | Single document, many documents, retrieved passages |
| Summary type | Extractive, abstractive, hybrid |
| Output format | Paragraph, bullets, structured fields |
| Length control | Token limit, sentence count, section budget |
| Evidence policy | No citations, inline citations, quoted support |
| Update behavior | Static summary, incremental summary |
| Risk tolerance | Creative, conservative, high-faithfulness |
For low-risk consumer summaries, an abstractive model may be acceptable. For legal, medical, financial, or compliance settings, evidence-grounded summarization is safer. The system should preserve important source text, expose uncertainty, and support auditability.
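One way to keep these decisions explicit is a configuration object. The sketch below is hypothetical; the field names mirror the table above rather than any particular library.

```python
from dataclasses import dataclass

@dataclass
class SummarizationConfig:
    # Hypothetical settings mirroring the design table above.
    input_scope: str = "single_document"       # or "multi_document", "retrieved"
    summary_type: str = "abstractive"          # or "extractive", "hybrid"
    output_format: str = "bullets"             # or "paragraph", "structured"
    max_output_tokens: int = 160
    evidence_policy: str = "inline_citations"  # or "none", "quoted_support"
    risk_tolerance: str = "high_faithfulness"  # or "conservative", "creative"
```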
Summary
Summarization compresses source text into a shorter output. Extractive summarization selects source spans. Abstractive summarization generates new text. Encoder-decoder models and decoder-only language models are both widely used.
The core training objective is next-token likelihood conditioned on the source. This objective gives a useful model, but it does not guarantee factuality. Reliable summarization requires careful data, decoding, long-context handling, evaluation, and error analysis.
A good summarizer should preserve what matters, omit what does not, and avoid unsupported claims.