Pretraining Objectives

A pretraining objective defines the prediction task used to train a model before it is adapted to a downstream use case. In language modeling, the objective is usually self-supervised: labels are created directly from raw text.

No human annotator needs to label each example. The corpus itself supplies the targets.

For example, in next-token prediction, the input is

$x_{1:t}$

and the target is

$x_{t+1}$.

In masked language modeling, the input is a corrupted sequence and the target is the missing token.

Pretraining objectives matter because they shape the representations, capabilities, and limitations of the model.

Self-Supervised Learning

Self-supervised learning creates supervision from the data itself.

For language, common self-supervised tasks include:

| Objective | Prediction task |
| --- | --- |
| Causal language modeling | Predict the next token |
| Masked language modeling | Predict hidden tokens |
| Denoising | Reconstruct corrupted text |
| Sequence-to-sequence pretraining | Generate missing or transformed spans |
| Contrastive learning | Distinguish related text pairs from unrelated pairs |

The model learns useful representations because solving these tasks requires syntax, semantics, facts, discourse patterns, and world knowledge encoded in text.

Causal Language Modeling

Causal language modeling trains the model to predict each token from previous tokens:

$$\mathcal{L} = - \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).$$

This is the standard objective for GPT-style decoder-only models.

The attention mask is causal: position $t$ can attend only to positions up to $t$.

Causal language modeling is naturally aligned with generation. At inference time, the model generates one token at a time using the same conditional structure learned during training.
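The causal structure can be made concrete with a small mask sketch; the sizes here are illustrative:

```python
import torch

T = 8  # illustrative sequence length

# Lower-triangular boolean mask: position t may attend to positions up to t.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

# In attention, disallowed positions get -inf before the softmax,
# so they receive zero weight.
scores = torch.randn(T, T)
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
```

Each row of `weights` sums to one, and no probability mass falls on future positions.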

Masked Language Modeling

Masked language modeling corrupts an input sequence by hiding selected tokens. The model predicts the original tokens using bidirectional context.

Let $M$ be the set of masked positions. The loss is

$$\mathcal{L} = - \sum_{t \in M} \log p_\theta(x_t \mid \tilde{x}),$$

where $\tilde{x}$ is the corrupted sequence.

This objective is common for encoder-only models. It is strong for representation learning and understanding tasks because each token can use both left and right context.

It is less direct for free-form generation because it does not define the same left-to-right sequence factorization as causal language modeling.

Denoising Objectives

Denoising objectives train a model to reconstruct original text from corrupted text.

Corruptions may include:

| Corruption | Example |
| --- | --- |
| Token masking | Replace tokens with mask tokens |
| Token deletion | Remove tokens |
| Span masking | Hide contiguous spans |
| Text infilling | Replace spans with sentinel markers |
| Sentence permutation | Shuffle sentence order |

A denoising objective can be written as

$$\mathcal{L} = - \log p_\theta(x \mid \tilde{x}),$$

where $x$ is the original sequence and $\tilde{x}$ is the corrupted version.

Encoder-decoder models often use this form. The encoder reads corrupted input, and the decoder generates the missing or original text.
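As a concrete illustration, a single-span text-infilling example can be sketched at the token level; the token ids and sentinel id below are made up and do not come from any real tokenizer:

```python
# Hypothetical token ids for one sequence.
tokens = [10, 11, 12, 13, 14, 15]
span = (2, 4)    # corrupt tokens[2:4]
SENTINEL = 900   # assumed sentinel id

# The encoder sees the sequence with the span replaced by a sentinel;
# the decoder learns to emit the sentinel followed by the hidden span.
encoder_input = tokens[:span[0]] + [SENTINEL] + tokens[span[1]:]
decoder_target = [SENTINEL] + tokens[span[0]:span[1]]

# encoder_input  == [10, 11, 900, 14, 15]
# decoder_target == [900, 12, 13]
```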

Prefix Language Modeling

Prefix language modeling gives the model a visible prefix and trains it to generate the continuation.

If the sequence is divided into a prefix $x_{1:k}$ and target continuation $x_{k+1:T}$, the loss is

$$\mathcal{L} = - \sum_{t=k+1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).$$

The prefix is context. The continuation is predicted.

This objective is useful for tasks where the model receives a prompt, document, or instruction and must generate an output.
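A minimal sketch of the prefix loss masking, with illustrative sizes and random logits standing in for a real model's output:

```python
import torch
import torch.nn.functional as F

B, T, V = 4, 16, 1000  # illustrative sizes
k = 6                  # prefix length

tokens = torch.randint(0, V, (B, T))  # x_1 .. x_T

x = tokens[:, :-1]         # inputs  x_1 .. x_{T-1}
y = tokens[:, 1:].clone()  # targets x_2 .. x_T

# Targets x_2 .. x_k fall inside the prefix, so only the
# continuation x_{k+1} .. x_T contributes to the loss.
y[:, : k - 1] = -100

logits = torch.randn(B, T - 1, V)  # stand-in for model(x)

loss = F.cross_entropy(
    logits.reshape(-1, V),
    y.reshape(-1),
    ignore_index=-100,
)
```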

Sequence-to-Sequence Objectives

Sequence-to-sequence pretraining maps an input sequence to an output sequence.

Examples include:

| Task | Input | Output |
| --- | --- | --- |
| Translation-style pretraining | Source text | Target text |
| Summarization-style pretraining | Long text | Short text |
| Text infilling | Corrupted text | Missing spans |
| Denoising autoencoding | Corrupted text | Original text |

The general loss is

$$\mathcal{L} = - \sum_{t=1}^{T_y} \log p_\theta(y_t \mid y_{1:t-1}, x).$$

Encoder-decoder transformers use this objective naturally. The encoder represents the input. The decoder autoregressively generates the output.
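A sketch of the decoder-side loss under teacher forcing, with random logits standing in for the encoder-decoder forward pass:

```python
import torch
import torch.nn.functional as F

B, Tx, Ty, V = 8, 20, 12, 1000  # illustrative sizes

src = torch.randint(0, V, (B, Tx))      # encoder input x
tgt = torch.randint(0, V, (B, Ty + 1))  # y_0 .. y_Ty, with y_0 a BOS token

decoder_input = tgt[:, :-1]  # teacher forcing: feed y_0 .. y_{Ty-1}
labels = tgt[:, 1:]          # predict y_1 .. y_Ty

# Stand-in for model(src, decoder_input) -> [B, Ty, V]
logits = torch.randn(B, Ty, V)

loss = F.cross_entropy(
    logits.reshape(-1, V),
    labels.reshape(-1),
)
```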

Contrastive Objectives

Contrastive learning trains a model to bring related examples closer and push unrelated examples apart.

For text, a positive pair may be:

| Pair type | Example setting |
| --- | --- |
| Query and relevant document | Search training |
| Sentence and paraphrase | Semantic similarity |
| Image and caption | Multimodal learning |
| Question and answer | Retrieval training |

A common contrastive objective is InfoNCE:

$$\mathcal{L} = - \log \frac{\exp(\mathrm{sim}(q, k^+)/\tau)}{\exp(\mathrm{sim}(q, k^+)/\tau) + \sum_{i=1}^{N} \exp(\mathrm{sim}(q, k_i^-)/\tau)}.$$

Here $q$ is a query representation, $k^+$ is a positive key, $k_i^-$ are negative keys, $\mathrm{sim}$ is a similarity function, and $\tau$ is a temperature.

Contrastive objectives are central to retrieval models, embedding models, and multimodal systems.

Next Sentence and Sentence Order Objectives

Some early encoder models used auxiliary sentence-level objectives.

Next sentence prediction asks whether two text segments occur consecutively in the corpus.

Sentence order prediction asks whether two segments are in the correct order.

These objectives try to teach relationships between sentences. In practice, their usefulness depends on the model, data, and task. Many later systems removed or replaced them with stronger pretraining signals, larger batches, better masking, or contrastive objectives.

Multi-Objective Pretraining

Some models combine several objectives.

For example, a model might train with:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{LM}} + \lambda_2 \mathcal{L}_{\text{contrastive}} + \lambda_3 \mathcal{L}_{\text{alignment}}.$$

The weights $\lambda_i$ control the contribution of each objective.

Multi-objective training is common in multimodal models. A vision-language model may use image-text contrastive loss, caption generation loss, and masked token loss in the same training recipe.

The benefit is broader capability. The cost is more difficult optimization and more sensitive data mixing.
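The weighted combination itself is a one-liner; the loss values and weights below are placeholders, not tuned numbers:

```python
import torch

# Placeholder per-objective losses; in a real recipe each comes from
# its own head or forward pass.
loss_lm = torch.tensor(2.3)
loss_contrastive = torch.tensor(1.1)
loss_alignment = torch.tensor(0.7)

# Assumed weights lambda_1 .. lambda_3.
lam = (1.0, 0.5, 0.25)

total = lam[0] * loss_lm + lam[1] * loss_contrastive + lam[2] * loss_alignment
```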

Objective and Architecture Matching

Different objectives fit different architectures.

| Architecture | Common objective |
| --- | --- |
| Encoder-only transformer | Masked language modeling, contrastive learning |
| Decoder-only transformer | Causal language modeling |
| Encoder-decoder transformer | Denoising, sequence-to-sequence pretraining |
| Dual encoder | Contrastive retrieval objective |
| Multimodal encoder-decoder | Captioning, contrastive alignment, denoising |

The objective determines which information is visible at each position and which outputs are trained.

A mismatch can produce awkward systems. For example, an encoder-only masked model can fill blanks well, but it is not naturally suited to long autoregressive generation.

Pretraining Data and Objective Coupling

The objective only works well when paired with suitable data.

Causal language modeling benefits from large, diverse text corpora because it learns broad next-token statistics.

Masked language modeling benefits from clean documents and sentence-like structure because bidirectional context is central.

Contrastive learning requires reliable positive and negative pairs. Noisy positives teach the model an incorrect notion of similarity, and false negatives can hurt embedding quality.

Denoising objectives need corruption patterns that are hard enough to teach structure but not so hard that reconstruction becomes impossible.

The objective, data, tokenizer, architecture, and optimizer form one training system.

PyTorch Sketch: Causal LM Objective

A causal language model receives tokens and predicts the next token.

import torch
import torch.nn.functional as F

B = 32      # batch size
T = 128     # sequence length
V = 50000   # vocabulary size

# Stand-in model so the sketch runs end to end: embedding plus linear head.
# Any network producing [B, T, V] logits would fit here.
model = torch.nn.Sequential(
    torch.nn.Embedding(V, 256),
    torch.nn.Linear(256, V),
)

tokens = torch.randint(0, V, (B, T + 1))

x = tokens[:, :-1]  # inputs
y = tokens[:, 1:]   # next-token targets

logits = model(x)  # [B, T, V]

loss = F.cross_entropy(
    logits.reshape(B * T, V),
    y.reshape(B * T),
)

The labels are the input shifted left by one position.

PyTorch Sketch: Masked LM Objective

A masked language model predicts only selected positions.

import torch
import torch.nn.functional as F

B = 32
T = 128
V = 50000
mask_token_id = 103  # illustrative mask-token id

# Stand-in model so the sketch runs end to end.
model = torch.nn.Sequential(
    torch.nn.Embedding(V, 256),
    torch.nn.Linear(256, V),
)

input_ids = torch.randint(0, V, (B, T))
labels = input_ids.clone()

# Select roughly 15% of positions for masking.
selected = torch.rand(B, T) < 0.15

masked_input = input_ids.clone()
masked_input[selected] = mask_token_id

# Positions that were not masked are ignored by the loss.
labels[~selected] = -100

logits = model(masked_input)  # [B, T, V]

loss = F.cross_entropy(
    logits.reshape(B * T, V),
    labels.reshape(B * T),
    ignore_index=-100,
)

Only masked positions contribute to the loss.

PyTorch Sketch: Contrastive Objective

A contrastive model maps paired examples into vectors.

import torch
import torch.nn.functional as F

B = 64
D = 768

query = torch.randn(B, D)
key = torch.randn(B, D)

query = F.normalize(query, dim=-1)
key = F.normalize(key, dim=-1)

temperature = 0.07

logits = query @ key.T / temperature
labels = torch.arange(B)

loss = F.cross_entropy(logits, labels)

Each query is trained to match the key at the same batch index. Other keys in the batch act as negatives.
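Many retrieval and multimodal systems use a symmetric variant of this loss, as in CLIP-style training, scoring both query-to-key and key-to-query directions. A sketch under the same setup:

```python
import torch
import torch.nn.functional as F

B, D = 64, 768

query = F.normalize(torch.randn(B, D), dim=-1)
key = F.normalize(torch.randn(B, D), dim=-1)

logits = query @ key.T / 0.07
labels = torch.arange(B)

# Average the loss over both matching directions.
loss = 0.5 * (F.cross_entropy(logits, labels)
              + F.cross_entropy(logits.T, labels))
```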

Choosing an Objective

The objective should match the intended use.

| Goal | Suitable objective |
| --- | --- |
| Open-ended generation | Causal language modeling |
| Text understanding | Masked language modeling |
| Translation or summarization | Sequence-to-sequence objective |
| Search and retrieval | Contrastive objective |
| Multimodal alignment | Contrastive plus generative objectives |
| Robust representation learning | Denoising objective |

No objective is universally best. The correct choice depends on the target behavior, available data, architecture, and compute budget.

Summary

Pretraining objectives define how a model learns from raw data. Causal language modeling predicts the next token. Masked language modeling predicts hidden tokens using bidirectional context. Denoising reconstructs corrupted inputs. Sequence-to-sequence objectives map inputs to generated outputs. Contrastive objectives learn representation spaces by comparing positive and negative pairs.

Modern deep learning systems often combine objectives, but the basic principle remains the same: choose a prediction task whose solution requires the representations and behaviors the final system should have.