Pretraining Objectives

A pretraining objective defines the prediction task used to train a model before it is adapted to a downstream use case. In language modeling, the objective is usually self-supervised: labels are created directly from raw text.

No human annotator needs to label each example. The corpus itself supplies the targets.

For example, in next-token prediction, the input is

$x_{1:t}$

and the target is

$x_{t+1}$.

In masked language modeling, the input is a corrupted sequence and the target is the missing token.

Pretraining objectives matter because they shape the representations, capabilities, and limitations of the model.

Self-Supervised Learning

Self-supervised learning creates supervision from the data itself.

For language, common self-supervised tasks include:

| Objective | Prediction task |
| --- | --- |
| Causal language modeling | Predict the next token |
| Masked language modeling | Predict hidden tokens |
| Denoising | Reconstruct corrupted text |
| Sequence-to-sequence pretraining | Generate missing or transformed spans |
| Contrastive learning | Distinguish related text pairs from unrelated pairs |

The model learns useful representations because solving these tasks requires syntax, semantics, facts, discourse patterns, and world knowledge encoded in text.

Causal Language Modeling

Causal language modeling trains the model to predict each token from previous tokens:

$$\mathcal{L} = - \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).$$

This is the standard objective for GPT-style decoder-only models.

The attention mask is causal: position $t$ can attend only to positions up to $t$.

Causal language modeling is naturally aligned with generation. At inference time, the model generates one token at a time using the same conditional structure learned during training.
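The causal structure can be made concrete with a small mask sketch; the sizes here are illustrative:

```python
import torch

T = 8  # illustrative sequence length

# Lower-triangular boolean mask: position t may attend to positions up to t.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

# In attention, disallowed positions get -inf before the softmax,
# so they receive zero weight.
scores = torch.randn(T, T)
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)
```

Each row of `weights` sums to one, and no probability mass falls on future positions.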

Masked Language Modeling

Masked language modeling corrupts an input sequence by hiding selected tokens. The model predicts the original tokens using bidirectional context.

Let $M$ be the set of masked positions. The loss is

$$\mathcal{L} = - \sum_{t \in M} \log p_\theta(x_t \mid \tilde{x}),$$

where $\tilde{x}$ is the corrupted sequence.

This objective is common for encoder-only models. It is strong for representation learning and understanding tasks because each token can use both left and right context.

It is less direct for free-form generation because it does not define the same left-to-right sequence factorization as causal language modeling.

Denoising Objectives

Denoising objectives train a model to reconstruct original text from corrupted text.

Corruptions may include:

| Corruption | Example |
| --- | --- |
| Token masking | Replace tokens with mask tokens |
| Token deletion | Remove tokens |
| Span masking | Hide contiguous spans |
| Text infilling | Replace spans with sentinel markers |
| Sentence permutation | Shuffle sentence order |

A denoising objective can be written as

$$\mathcal{L} = - \log p_\theta(x \mid \tilde{x}),$$

where $x$ is the original sequence and $\tilde{x}$ is the corrupted version.

Encoder-decoder models often use this form. The encoder reads corrupted input, and the decoder generates the missing or original text.
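As a concrete illustration, a single-span text-infilling example can be sketched at the token level; the token ids and sentinel id below are made up and do not come from any real tokenizer:

```python
# Hypothetical token ids for one sequence.
tokens = [10, 11, 12, 13, 14, 15]
span = (2, 4)    # corrupt tokens[2:4]
SENTINEL = 900   # assumed sentinel id

# The encoder sees the sequence with the span replaced by a sentinel;
# the decoder learns to emit the sentinel followed by the hidden span.
encoder_input = tokens[:span[0]] + [SENTINEL] + tokens[span[1]:]
decoder_target = [SENTINEL] + tokens[span[0]:span[1]]

# encoder_input  == [10, 11, 900, 14, 15]
# decoder_target == [900, 12, 13]
```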

Prefix Language Modeling

Prefix language modeling gives the model a visible prefix and trains it to generate the continuation.

If the sequence is divided into a prefix $x_{1:k}$ and target continuation $x_{k+1:T}$, the loss is

$$\mathcal{L} = - \sum_{t=k+1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).$$

The prefix is context. The continuation is predicted.

This objective is useful for tasks where the model receives a prompt, document, or instruction and must generate an output.
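A minimal sketch of the prefix loss masking, with illustrative sizes and random logits standing in for a real model's output:

```python
import torch
import torch.nn.functional as F

B, T, V = 4, 16, 1000  # illustrative sizes
k = 6                  # prefix length

tokens = torch.randint(0, V, (B, T))  # x_1 .. x_T

x = tokens[:, :-1]         # inputs  x_1 .. x_{T-1}
y = tokens[:, 1:].clone()  # targets x_2 .. x_T

# Targets x_2 .. x_k fall inside the prefix, so only the
# continuation x_{k+1} .. x_T contributes to the loss.
y[:, : k - 1] = -100

logits = torch.randn(B, T - 1, V)  # stand-in for model(x)

loss = F.cross_entropy(
    logits.reshape(-1, V),
    y.reshape(-1),
    ignore_index=-100,
)
```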

Sequence-to-Sequence Objectives

Sequence-to-sequence pretraining maps an input sequence to an output sequence.

Examples include:

| Task | Input | Output |
| --- | --- | --- |
| Translation-style pretraining | Source text | Target text |
| Summarization-style pretraining | Long text | Short text |
| Text infilling | Corrupted text | Missing spans |
| Denoising autoencoding | Corrupted text | Original text |

The general loss is

$$\mathcal{L} = - \sum_{t=1}^{T_y} \log p_\theta(y_t \mid y_{1:t-1}, x).$$

Encoder-decoder transformers use this objective naturally. The encoder represents the input. The decoder autoregressively generates the output.
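A sketch of the decoder-side loss under teacher forcing, with random logits standing in for the encoder-decoder forward pass:

```python
import torch
import torch.nn.functional as F

B, Tx, Ty, V = 8, 20, 12, 1000  # illustrative sizes

src = torch.randint(0, V, (B, Tx))      # encoder input x
tgt = torch.randint(0, V, (B, Ty + 1))  # y_0 .. y_Ty, with y_0 a BOS token

decoder_input = tgt[:, :-1]  # teacher forcing: feed y_0 .. y_{Ty-1}
labels = tgt[:, 1:]          # predict y_1 .. y_Ty

# Stand-in for model(src, decoder_input) -> [B, Ty, V]
logits = torch.randn(B, Ty, V)

loss = F.cross_entropy(
    logits.reshape(-1, V),
    labels.reshape(-1),
)
```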

Contrastive Objectives

Contrastive learning trains a model to bring related examples closer and push unrelated examples apart.

For text, a positive pair may be:

| Pair type | Example setting |
| --- | --- |
| Query and relevant document | Search training |
| Sentence and paraphrase | Semantic similarity |
| Image and caption | Multimodal learning |
| Question and answer | Retrieval training |

A common contrastive objective is InfoNCE:

$$\mathcal{L} = - \log \frac{\exp(\mathrm{sim}(q, k^+)/\tau)}{\exp(\mathrm{sim}(q, k^+)/\tau) + \sum_{i=1}^{N} \exp(\mathrm{sim}(q, k_i^-)/\tau)}.$$

Here $q$ is a query representation, $k^+$ is a positive key, $k_i^-$ are negative keys, $\mathrm{sim}$ is a similarity function, and $\tau$ is a temperature.

Contrastive objectives are central to retrieval models, embedding models, and multimodal systems.

Next Sentence and Sentence Order Objectives

Some early encoder models used auxiliary sentence-level objectives.

Next sentence prediction asks whether two text segments occur consecutively in the corpus.

Sentence order prediction asks whether two segments are in the correct order.

These objectives try to teach relationships between sentences. In practice, their usefulness depends on the model, data, and task. Many later systems removed or replaced them with stronger pretraining signals, larger batches, better masking, or contrastive objectives.

Multi-Objective Pretraining

Some models combine several objectives.

For example, a model might train with:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{LM}} + \lambda_2 \mathcal{L}_{\text{contrastive}} + \lambda_3 \mathcal{L}_{\text{alignment}}.$$

The weights $\lambda_i$ control the contribution of each objective.

Multi-objective training is common in multimodal models. A vision-language model may use image-text contrastive loss, caption generation loss, and masked token loss in the same training recipe.

The benefit is broader capability. The cost is more difficult optimization and more sensitive data mixing.
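The weighted combination itself is a one-liner; the loss values and weights below are placeholders, not tuned numbers:

```python
import torch

# Placeholder per-objective losses; in a real recipe each comes from
# its own head or forward pass.
loss_lm = torch.tensor(2.3)
loss_contrastive = torch.tensor(1.1)
loss_alignment = torch.tensor(0.7)

# Assumed weights lambda_1 .. lambda_3.
lam = (1.0, 0.5, 0.25)

total = lam[0] * loss_lm + lam[1] * loss_contrastive + lam[2] * loss_alignment
```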

Objective and Architecture Matching

Different objectives fit different architectures.

| Architecture | Common objective |
| --- | --- |
| Encoder-only transformer | Masked language modeling, contrastive learning |
| Decoder-only transformer | Causal language modeling |
| Encoder-decoder transformer | Denoising, sequence-to-sequence pretraining |
| Dual encoder | Contrastive retrieval objective |
| Multimodal encoder-decoder | Captioning, contrastive alignment, denoising |

The objective determines which information is visible at each position and which outputs are trained.

A mismatch can produce awkward systems. For example, an encoder-only masked model can fill blanks well, but it is not naturally suited to long autoregressive generation.

Pretraining Data and Objective Coupling

The objective only works well when paired with suitable data.

Causal language modeling benefits from large, diverse text corpora because it learns broad next-token statistics.

Masked language modeling benefits from clean documents and sentence-like structure because bidirectional context is central.

Contrastive learning requires reliable positive and negative pairs. Noisy positives teach the model an incorrect notion of similarity, and false negatives can hurt embedding quality.

Denoising objectives need corruption patterns that are hard enough to teach structure but not so hard that reconstruction becomes impossible.

The objective, data, tokenizer, architecture, and optimizer form one training system.

PyTorch Sketch: Causal LM Objective

A causal language model receives tokens and predicts the next token.

import torch
import torch.nn.functional as F

B = 32      # batch size
T = 128     # sequence length
V = 50000   # vocabulary size

# Stand-in model so the sketch runs end to end: embedding plus linear head.
# Any network producing [B, T, V] logits would fit here.
model = torch.nn.Sequential(
    torch.nn.Embedding(V, 256),
    torch.nn.Linear(256, V),
)

tokens = torch.randint(0, V, (B, T + 1))

x = tokens[:, :-1]  # inputs
y = tokens[:, 1:]   # next-token targets

logits = model(x)  # [B, T, V]

loss = F.cross_entropy(
    logits.reshape(B * T, V),
    y.reshape(B * T),
)

The labels are the input shifted left by one position.

PyTorch Sketch: Masked LM Objective

A masked language model predicts only selected positions.

import torch
import torch.nn.functional as F

B = 32
T = 128
V = 50000
mask_token_id = 103  # illustrative mask-token id

# Stand-in model so the sketch runs end to end.
model = torch.nn.Sequential(
    torch.nn.Embedding(V, 256),
    torch.nn.Linear(256, V),
)

input_ids = torch.randint(0, V, (B, T))
labels = input_ids.clone()

# Select roughly 15% of positions for masking.
selected = torch.rand(B, T) < 0.15

masked_input = input_ids.clone()
masked_input[selected] = mask_token_id

# Positions that were not masked are ignored by the loss.
labels[~selected] = -100

logits = model(masked_input)  # [B, T, V]

loss = F.cross_entropy(
    logits.reshape(B * T, V),
    labels.reshape(B * T),
    ignore_index=-100,
)

Only masked positions contribute to the loss.

PyTorch Sketch: Contrastive Objective

A contrastive model maps paired examples into vectors.

import torch
import torch.nn.functional as F

B = 64
D = 768

query = torch.randn(B, D)
key = torch.randn(B, D)

query = F.normalize(query, dim=-1)
key = F.normalize(key, dim=-1)

temperature = 0.07

logits = query @ key.T / temperature
labels = torch.arange(B)

loss = F.cross_entropy(logits, labels)

Each query is trained to match the key at the same batch index. Other keys in the batch act as negatives.
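Many retrieval and multimodal systems use a symmetric variant of this loss, as in CLIP-style training, scoring both query-to-key and key-to-query directions. A sketch under the same setup:

```python
import torch
import torch.nn.functional as F

B, D = 64, 768

query = F.normalize(torch.randn(B, D), dim=-1)
key = F.normalize(torch.randn(B, D), dim=-1)

logits = query @ key.T / 0.07
labels = torch.arange(B)

# Average the loss over both matching directions.
loss = 0.5 * (F.cross_entropy(logits, labels)
              + F.cross_entropy(logits.T, labels))
```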

Choosing an Objective

The objective should match the intended use.

| Goal | Suitable objective |
| --- | --- |
| Open-ended generation | Causal language modeling |
| Text understanding | Masked language modeling |
| Translation or summarization | Sequence-to-sequence objective |
| Search and retrieval | Contrastive objective |
| Multimodal alignment | Contrastive plus generative objectives |
| Robust representation learning | Denoising objective |

No objective is universally best. The correct choice depends on the target behavior, available data, architecture, and compute budget.

Summary

Pretraining objectives define how a model learns from raw data. Causal language modeling predicts the next token. Masked language modeling predicts hidden tokens using bidirectional context. Denoising reconstructs corrupted inputs. Sequence-to-sequence objectives map inputs to generated outputs. Contrastive objectives learn representation spaces by comparing positive and negative pairs.

Modern deep learning systems often combine objectives, but the basic principle remains the same: choose a prediction task whose solution requires the representations and behaviors the final system should have.