A pretraining objective defines the prediction task used to train a model before it is adapted to a downstream use case. In language modeling, the objective is usually self-supervised: labels are created directly from raw text.
No human annotator needs to label each example. The corpus itself supplies the targets.
For example, in next-token prediction, the input is the sequence $x_1, x_2, \dots, x_{t-1}$ and the target is the next token $x_t$.
In masked language modeling, the input is a corrupted sequence and the target is the missing token.
Pretraining objectives matter because they shape the representations, capabilities, and limitations of the model.
Self-Supervised Learning
Self-supervised learning creates supervision from the data itself.
For language, common self-supervised tasks include:
| Objective | Prediction task |
|---|---|
| Causal language modeling | Predict the next token |
| Masked language modeling | Predict hidden tokens |
| Denoising | Reconstruct corrupted text |
| Sequence-to-sequence pretraining | Generate missing or transformed spans |
| Contrastive learning | Distinguish related text pairs from unrelated pairs |
The model learns useful representations because solving these tasks requires syntax, semantics, facts, discourse patterns, and world knowledge encoded in text.
Causal Language Modeling
Causal language modeling trains the model to predict each token from the previous tokens:

$$\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
This is the standard objective for GPT-style decoder-only models.
The attention mask is causal: position $t$ can attend only to positions $1$ through $t$.
Causal language modeling is naturally aligned with generation. At inference time, the model generates one token at a time using the same conditional structure learned during training.
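As a minimal illustration of that loop, the sketch below performs greedy decoding. Here `model` and `prompt_ids` are assumed stand-ins (any module that maps `[B, T]` token ids to `[B, T, V]` logits, and a tensor of prompt tokens), not a specific library API.

```python
import torch

def greedy_generate(model, prompt_ids, max_new_tokens=32):
    """Greedy autoregressive decoding sketch.

    `model` maps [B, T] token ids to [B, T, V] logits; `prompt_ids` is a
    [B, T0] tensor of prompt tokens. Both are assumptions, not library APIs.
    """
    tokens = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(tokens)                    # [B, T, V]
        next_token = logits[:, -1, :].argmax(-1)  # most likely next token
        tokens = torch.cat([tokens, next_token.unsqueeze(-1)], dim=-1)
    return tokens
```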
Masked Language Modeling
Masked language modeling corrupts an input sequence by hiding selected tokens. The model predicts the original tokens using bidirectional context.
Let $M$ be the set of masked positions. The loss is

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log p_\theta(x_i \mid \tilde{x})$$

where $\tilde{x}$ is the corrupted sequence.
This objective is common for encoder-only models. It is strong for representation learning and understanding tasks because each token can use both left and right context.
It is less direct for free-form generation because it does not define the same left-to-right sequence factorization as causal language modeling.
Denoising Objectives
Denoising objectives train a model to reconstruct original text from corrupted text.
Corruptions may include:
| Corruption | Description |
|---|---|
| Token masking | Replace tokens with mask tokens |
| Token deletion | Remove tokens |
| Span masking | Hide contiguous spans |
| Text infilling | Replace spans with sentinel markers |
| Sentence permutation | Shuffle sentence order |
A denoising objective can be written as

$$\mathcal{L}_{\text{denoise}} = -\log p_\theta(x \mid \tilde{x})$$

where $x$ is the original sequence and $\tilde{x}$ is the corrupted version.
Encoder-decoder models often use this form. The encoder reads corrupted input, and the decoder generates the missing or original text.
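A rough sketch of one such corruption, span masking with sentinel markers, is shown below. The sentinel ids, span lengths, and span placement are illustrative assumptions, not the exact recipe of any particular model.

```python
import torch

def corrupt_with_spans(tokens, span_len=3, n_spans=2, sentinel_start=32000):
    """Hide contiguous spans and replace each with a sentinel id.

    Returns (encoder_input, decoder_target): the encoder reads the corrupted
    sequence; the decoder learns to emit each sentinel followed by the span
    it replaced. Sentinel ids and span placement here are illustrative.
    """
    tokens = tokens.tolist()
    T = len(tokens)
    # evenly spaced span starts, purely for simplicity
    starts = [int((i + 1) * T / (n_spans + 1)) for i in range(n_spans)]
    encoder_input, decoder_target = [], []
    pos = 0
    for i, s in enumerate(starts):
        sentinel = sentinel_start + i
        encoder_input.extend(tokens[pos:s])
        encoder_input.append(sentinel)
        decoder_target.append(sentinel)
        decoder_target.extend(tokens[s:s + span_len])
        pos = s + span_len
    encoder_input.extend(tokens[pos:])
    return torch.tensor(encoder_input), torch.tensor(decoder_target)

enc_in, dec_tgt = corrupt_with_spans(torch.arange(20))
```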
Prefix Language Modeling
Prefix language modeling gives the model a visible prefix and trains it to generate the continuation.
If the sequence is divided into a prefix $x_{1:k}$ and a target continuation $x_{k+1:T}$, the loss is

$$\mathcal{L}_{\text{prefix}} = -\sum_{t=k+1}^{T} \log p_\theta(x_t \mid x_{1:t-1})$$
The prefix is context. The continuation is predicted.
This objective is useful for tasks where the model receives a prompt, document, or instruction and must generate an output.
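A minimal sketch of the loss side of this objective follows (the attention pattern over the prefix, which is often bidirectional in practice, is not shown). As in the sketches later in this section, `model` is an assumed stand-in that maps token ids to per-position logits.

```python
import torch
import torch.nn.functional as F

B, T, V = 8, 64, 50000
k = 16  # prefix length: the first k tokens are visible context

tokens = torch.randint(0, V, (B, T + 1))
x = tokens[:, :-1]
y = tokens[:, 1:].clone()

# Targets at positions 0 .. k-2 predict tokens inside the prefix;
# mask them so only the continuation is trained.
y[:, :k - 1] = -100

logits = model(x)  # [B, T, V]; `model` is an assumed stand-in
loss = F.cross_entropy(
    logits.reshape(B * T, V),
    y.reshape(B * T),
    ignore_index=-100,
)
```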
Sequence-to-Sequence Objectives
Sequence-to-sequence pretraining maps an input sequence to an output sequence.
Examples include:
| Task | Input | Output |
|---|---|---|
| Translation-style pretraining | Source text | Target text |
| Summarization-style pretraining | Long text | Short text |
| Text infilling | Corrupted text | Missing spans |
| Denoising autoencoding | Corrupted text | Original text |
The general loss is

$$\mathcal{L}_{\text{seq2seq}} = -\sum_{t=1}^{|y|} \log p_\theta(y_t \mid y_{<t}, x)$$
Encoder-decoder transformers use this objective naturally. The encoder represents the input. The decoder autoregressively generates the output.
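A minimal sketch of the decoder-side loss under teacher forcing follows; the encoder-decoder `model`, the `bos_id`, and the tensor sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

B, S, T, V = 16, 64, 48, 32000
bos_id = 1

src = torch.randint(2, V, (B, S))  # source sequence x
tgt = torch.randint(2, V, (B, T))  # target sequence y

# Teacher forcing: the decoder reads the target shifted right by one position.
decoder_input = torch.cat([torch.full((B, 1), bos_id), tgt[:, :-1]], dim=-1)

logits = model(src, decoder_input)  # [B, T, V]; `model` is an assumed encoder-decoder
loss = F.cross_entropy(logits.reshape(B * T, V), tgt.reshape(B * T))
```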
Contrastive Objectives
Contrastive learning trains a model to bring related examples closer and push unrelated examples apart.
Positive pairs can take several forms:
| Pair type | Typical use |
|---|---|
| Query and relevant document | Search training |
| Sentence and paraphrase | Semantic similarity |
| Image and caption | Multimodal learning |
| Question and answer | Retrieval training |
A common contrastive objective is InfoNCE:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\big(s(q, k^{+}) / \tau\big)}{\exp\big(s(q, k^{+}) / \tau\big) + \sum_{i} \exp\big(s(q, k_i^{-}) / \tau\big)}$$

Here $q$ is a query representation, $k^{+}$ is a positive key, $k_i^{-}$ are negative keys, $s$ is a similarity function, and $\tau$ is a temperature.
Contrastive objectives are central to retrieval models, embedding models, and multimodal systems.
Next Sentence and Sentence Order Objectives
Some early encoder models used auxiliary sentence-level objectives.
Next sentence prediction asks whether two text segments occur consecutively in the corpus.
Sentence order prediction asks whether two segments are in the correct order.
These objectives try to teach relationships between sentences. In practice, their usefulness depends on the model, data, and task. Many later systems removed or replaced them with stronger pretraining signals, larger batches, better masking, or contrastive objectives.
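For concreteness, a toy sketch of how next-sentence-prediction pairs might be built is shown below; the sampling scheme is an illustrative assumption, not the exact recipe of any published model.

```python
import random

def make_nsp_pair(sentences, i):
    """Build one next-sentence-prediction example.

    With probability 0.5 the second segment is the true next sentence
    (label 1); otherwise it is a random other sentence (label 0).
    """
    first = sentences[i]
    if random.random() < 0.5 and i + 1 < len(sentences):
        return first, sentences[i + 1], 1
    negatives = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
    return first, random.choice(negatives), 0

sents = ["The storm passed.", "The streets dried out.", "Cats purr.", "Stocks fell."]
example = make_nsp_pair(sents, 0)
```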
Multi-Objective Pretraining
Some models combine several objectives.
For example, a model might train with a weighted sum of objectives:

$$\mathcal{L} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2 + \dots + \lambda_n \mathcal{L}_n$$

The weights $\lambda_i$ control the contribution of each objective.
Multi-objective training is common in multimodal models. A vision-language model may use image-text contrastive loss, caption generation loss, and masked token loss in the same training recipe.
The benefit is broader capability. The cost is more difficult optimization and more sensitive data mixing.
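A minimal sketch of such a weighted combination, using stand-in scalar losses and illustrative weights:

```python
import torch

# Stand-in scalar losses; in a real recipe these come from the contrastive,
# captioning, and masked-token heads computed on the same batch.
loss_contrastive = torch.tensor(2.3, requires_grad=True)
loss_caption = torch.tensor(1.7, requires_grad=True)
loss_masked = torch.tensor(0.9, requires_grad=True)

# Illustrative weights; real recipes tune these per objective and data mix.
weights = {"contrastive": 1.0, "caption": 0.5, "masked": 0.5}

total_loss = (
    weights["contrastive"] * loss_contrastive
    + weights["caption"] * loss_caption
    + weights["masked"] * loss_masked
)
total_loss.backward()
```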
Objective and Architecture Matching
Different objectives fit different architectures.
| Architecture | Common objective |
|---|---|
| Encoder-only transformer | Masked language modeling, contrastive learning |
| Decoder-only transformer | Causal language modeling |
| Encoder-decoder transformer | Denoising, sequence-to-sequence pretraining |
| Dual encoder | Contrastive retrieval objective |
| Multimodal encoder-decoder | Captioning, contrastive alignment, denoising |
The objective determines which information is visible at each position and which outputs are trained.
A mismatch can produce awkward systems. For example, an encoder-only masked model can fill blanks well, but it is not naturally suited to long autoregressive generation.
Pretraining Data and Objective Coupling
The objective only works well when paired with suitable data.
Causal language modeling benefits from large, diverse text corpora because it learns broad next-token statistics.
Masked language modeling benefits from clean documents and sentence-like structure because bidirectional context is central.
Contrastive learning requires reliable positive and negative pairs. Bad positives teach the model wrong similarity. False negatives can hurt embedding quality.
Denoising objectives need corruption patterns that are hard enough to teach structure but not so hard that reconstruction becomes impossible.
The objective, data, tokenizer, architecture, and optimizer form one training system.
PyTorch Sketch: Causal LM Objective
A causal language model receives tokens and predicts the next token.
```python
import torch
import torch.nn.functional as F

B = 32      # batch size
T = 128     # sequence length
V = 50000   # vocabulary size

tokens = torch.randint(0, V, (B, T + 1))
x = tokens[:, :-1]  # inputs: tokens 0 .. T-1
y = tokens[:, 1:]   # targets: tokens 1 .. T

logits = model(x)  # [B, T, V]; `model` is any decoder that maps token ids to logits
loss = F.cross_entropy(
    logits.reshape(B * T, V),
    y.reshape(B * T),
)
```

The labels are the input shifted left by one position.
PyTorch Sketch: Masked LM Objective
A masked language model predicts only selected positions.
```python
import torch
import torch.nn.functional as F

B = 32              # batch size
T = 128             # sequence length
V = 50000           # vocabulary size
mask_token_id = 103

input_ids = torch.randint(0, V, (B, T))
labels = input_ids.clone()

selected = torch.rand(B, T) < 0.15      # choose roughly 15% of positions
masked_input = input_ids.clone()
masked_input[selected] = mask_token_id  # hide the selected tokens
labels[~selected] = -100                # ignore unmasked positions in the loss

logits = model(masked_input)  # [B, T, V]; `model` is any bidirectional encoder with an LM head
loss = F.cross_entropy(
    logits.reshape(B * T, V),
    labels.reshape(B * T),
    ignore_index=-100,
)
```

Only masked positions contribute to the loss.
PyTorch Sketch: Contrastive Objective
A contrastive model maps paired examples into vectors.
```python
import torch
import torch.nn.functional as F

B = 64   # batch size (number of query-key pairs)
D = 768  # embedding dimension

# In practice these come from an encoder; random tensors stand in here.
query = torch.randn(B, D)
key = torch.randn(B, D)

query = F.normalize(query, dim=-1)
key = F.normalize(key, dim=-1)

temperature = 0.07
logits = query @ key.T / temperature  # [B, B] similarity matrix
labels = torch.arange(B)              # the matching key sits on the diagonal

loss = F.cross_entropy(logits, labels)
```

Each query is trained to match the key at the same batch index. Other keys in the batch act as negatives.
Choosing an Objective
The objective should match the intended use.
| Goal | Suitable objective |
|---|---|
| Open-ended generation | Causal language modeling |
| Text understanding | Masked language modeling |
| Translation or summarization | Sequence-to-sequence objective |
| Search and retrieval | Contrastive objective |
| Multimodal alignment | Contrastive plus generative objectives |
| Robust representation learning | Denoising objective |
No objective is universally best. The correct choice depends on the target behavior, available data, architecture, and compute budget.
Summary
Pretraining objectives define how a model learns from raw data. Causal language modeling predicts the next token. Masked language modeling predicts hidden tokens using bidirectional context. Denoising reconstructs corrupted inputs. Sequence-to-sequence objectives map inputs to generated outputs. Contrastive objectives learn representation spaces by comparing positive and negative pairs.
Modern deep learning systems often combine objectives, but the basic principle remains the same: choose a prediction task whose solution requires the representations and behaviors the final system should have.