Pretraining Objectives

A large language model is trained in two broad phases. The first phase is pretraining. The second phase is adaptation. During pretraining, the model learns general statistical structure from a large corpus. During adaptation, the pretrained model is specialized for a task, instruction-following behavior, dialogue, tool use, or preference alignment.

A pretraining objective defines the learning problem used before task-specific supervision. It tells the model what to predict, what loss to minimize, and which parts of the input are visible during prediction.

The most important pretraining objectives for language models are:

| Objective | Main idea | Common model type |
| --- | --- | --- |
| Autoregressive language modeling | Predict the next token from previous tokens | GPT-style decoder models |
| Masked language modeling | Predict hidden tokens from surrounding context | BERT-style encoder models |
| Denoising sequence modeling | Reconstruct clean text from corrupted text | T5, BART-style encoder-decoder models |
| Prefix language modeling | Predict continuation tokens from a visible prefix | Encoder-decoder or decoder variants |
| Contrastive pretraining | Learn representations by comparing positive and negative pairs | Retrieval and embedding models |

The choice of pretraining objective strongly affects what the model can do naturally. Autoregressive models are well suited for generation. Masked language models are well suited for representation learning. Encoder-decoder denoising models are well suited for conditional generation, such as translation, summarization, and text rewriting.

Tokens and Sequences

A language model does not operate directly on raw text. Text is first converted into a sequence of tokens. A token may be a word, subword, byte sequence, or character-like unit.

Let a text sequence be represented as

x = (x_1, x_2, \ldots, x_T),

where each x_t is a token from a vocabulary V, and T is the sequence length.

The model receives tokens as integer IDs. These IDs are mapped to embedding vectors, then processed by a neural network. The pretraining objective defines which token distributions the model must predict.

For most large language models, the output at each position is a probability distribution over the vocabulary:

p_\theta(x_t \mid \text{context}),

where θ denotes the model parameters.

The loss is usually a cross-entropy loss between the predicted distribution and the true token.
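As a toy illustration, assume a hypothetical five-token vocabulary; the cross-entropy at one position is simply the negative log-probability the model assigns to the true token:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for one position over a toy vocabulary of size V = 5.
logits = torch.tensor([2.0, 0.5, -1.0, 0.1, 1.5])
target = torch.tensor(1)  # index of the true next token

# Cross-entropy = -log softmax(logits)[target]
probs = F.softmax(logits, dim=-1)
manual_loss = -torch.log(probs[target])
library_loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

assert torch.allclose(manual_loss, library_loss)
```

A model that spreads probability mass away from the true token pays a larger loss at that position.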

Autoregressive Language Modeling

Autoregressive language modeling is the standard objective for decoder-only large language models.

The model learns to predict each token from the tokens before it:

p_\theta(x_t \mid x_1, x_2, \ldots, x_{t-1}).

The probability of the full sequence is factorized as

p_\theta(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}).

Here x_{<t} means all tokens before position t.

The training loss is the negative log-likelihood:

\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).

In practice, the model predicts all next-token targets in parallel during training. The input sequence is shifted relative to the target sequence.

For example, given the text:

Deep learning models predict tokens

The training pairs are:

| Context | Target |
| --- | --- |
| Deep | learning |
| Deep learning | models |
| Deep learning models | predict |
| Deep learning models predict | tokens |

The transformer uses a causal attention mask so that position t cannot attend to positions after t. This prevents the model from seeing the answer during training.

Autoregressive pretraining has a simple advantage: generation uses the same conditional distribution as training. At inference time, the model repeatedly samples or selects the next token:

x_{t+1} \sim p_\theta(\cdot \mid x_1, \ldots, x_t).

This makes the objective naturally aligned with text generation.

Causal Masking

A decoder-only transformer uses self-attention, but attention is restricted by a causal mask. Without a causal mask, token x_t could attend to future tokens x_{t+1}, x_{t+2}, \ldots. That would make next-token prediction trivial and invalid.

The causal mask allows each position to attend only to itself and earlier positions.

For a sequence of length T, the attention mask has a triangular structure:

M_{ij} = \begin{cases} 0, & j \leq i, \\ -\infty, & j > i. \end{cases}

Before the softmax, masked positions receive a very negative value. After the softmax, their attention weight becomes approximately zero.

This constraint creates a left-to-right model of text. The model learns to compress all relevant previous information into hidden states that predict the future.
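A minimal sketch of this mask in PyTorch (additive form, applied to raw attention scores of shape [T, T]):

```python
import torch

T = 5
# -inf strictly above the diagonal (j > i), 0 on and below it (j <= i).
mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

scores = torch.randn(T, T)                      # raw attention scores for one head
weights = torch.softmax(scores + mask, dim=-1)

# After the softmax, every row i places zero weight on positions j > i,
# so position 0 can only attend to itself.
```

In a full transformer the same [T, T] mask is broadcast across the batch and head dimensions.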

Masked Language Modeling

Masked language modeling is used by encoder-style models. Instead of predicting the next token, the model sees a corrupted sequence and predicts selected missing tokens.

Given a sequence

x = (x_1, \ldots, x_T),

we choose a subset of positions M. Tokens at those positions are replaced by a special mask token, a random token, or sometimes left unchanged. The model receives the corrupted sequence \tilde{x}, then predicts the original tokens at masked positions.

The loss is

\mathcal{L}(\theta) = -\sum_{t \in M} \log p_\theta(x_t \mid \tilde{x}).
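A sketch of one common corruption scheme, the BERT-style 80/10/10 split (mask_id, vocab_size, and the 15% rate are configurable choices, shown here with placeholder values):

```python
import torch

def corrupt_for_mlm(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Return (corrupted_ids, labels); labels are -100 except at chosen positions."""
    corrupted = token_ids.clone()
    labels = torch.full_like(token_ids, -100)

    # Choose roughly mask_prob of the positions as prediction targets.
    chosen = torch.rand(token_ids.shape) < mask_prob
    labels[chosen] = token_ids[chosen]

    # Of the chosen positions: 80% -> mask token, 10% -> random token, 10% -> unchanged.
    roll = torch.rand(token_ids.shape)
    corrupted[chosen & (roll < 0.8)] = mask_id
    random_slots = chosen & (roll >= 0.8) & (roll < 0.9)
    corrupted[random_slots] = torch.randint(vocab_size, token_ids.shape)[random_slots]
    return corrupted, labels
```

The labels tensor produced here plugs directly into a cross-entropy loss with ignore_index=-100.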

The key difference from autoregressive modeling is that masked language models can use both left and right context. For example:

The capital of France is [MASK].

The model can use the full sentence to predict Paris.

Masked language modeling is effective for representation learning because every token representation can depend on surrounding tokens in both directions. This is useful for classification, extraction, retrieval, and sentence-pair tasks.

However, masked language models are less natural for open-ended generation. They do not define a left-to-right probability over complete sequences in the direct way an autoregressive model does, so generating long text from them requires extra machinery.

Denoising Sequence Modeling

Denoising sequence modeling generalizes masked language modeling. Instead of masking individual tokens, the training process corrupts the input in a richer way. The model must reconstruct the original clean sequence.

Corruptions may include:

| Corruption | Description |
| --- | --- |
| Token masking | Replace tokens with mask symbols |
| Span masking | Replace consecutive spans with sentinel tokens |
| Token deletion | Remove some tokens |
| Sentence permutation | Reorder sentences |
| Text infilling | Fill in missing spans |
| Noise injection | Add random or misleading tokens |

Denoising objectives are common in encoder-decoder models. The encoder reads the corrupted input. The decoder generates the clean output.

Let x be the original sequence and \tilde{x} be the corrupted sequence. The model learns

p_\theta(x \mid \tilde{x}).

The loss is

\mathcal{L}(\theta) = -\log p_\theta(x \mid \tilde{x}).

This objective trains the model to map incomplete or noisy text into coherent text. It is useful for summarization, translation, editing, and other conditional generation tasks.
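A toy, word-level sketch of T5-style span corruption (real systems operate on subword tokens; the sentinel strings here are placeholders):

```python
def span_corrupt(words, spans, sentinels):
    """Replace each (start, end) span with a sentinel token.

    The target lists the removed spans, each introduced by its sentinel,
    so the decoder learns to fill in exactly what was cut out.
    """
    corrupted, target, prev = [], [], 0
    for (start, end), sentinel in zip(spans, sentinels):
        corrupted += words[prev:start] + [sentinel]
        target += [sentinel] + words[start:end]
        prev = end
    corrupted += words[prev:]
    return corrupted, target

words = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(words, [(1, 3), (6, 7)], ["<X>", "<Y>"])
# inp: the <X> fox jumps over <Y> lazy dog
# tgt: <X> quick brown <Y> the
```

The encoder reads `inp`; the decoder is trained to emit `tgt` left to right.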

Prefix Language Modeling

Prefix language modeling gives the model a visible prefix and asks it to generate the continuation.

Given a sequence split into two parts,

x = (x_{\text{prefix}}, x_{\text{target}}),

the model predicts

p_\theta(x_{\text{target}} \mid x_{\text{prefix}}).

This is similar to ordinary autoregressive modeling, but the prefix may be encoded bidirectionally in some architectures. The target portion is still generated left to right.

The objective is useful when the model must condition on an input and produce an output. For example:

Input: Translate to French: I like machine learning.
Output: J'aime l'apprentissage automatique.

The prefix contains the task instruction and source text. The target contains the desired completion.

Prefix objectives form a bridge between language modeling and instruction-style conditional generation.
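One way to see how this differs from a fully causal model is the attention mask: prefix positions may attend to the entire prefix bidirectionally, while target positions remain causal. A sketch (a boolean mask where True means position i may attend to position j, assuming a prefix of length P):

```python
import torch

def prefix_lm_mask(T, P):
    """Boolean [T, T] mask: entry (i, j) is True if i may attend to j."""
    allowed = torch.tril(torch.ones(T, T, dtype=torch.bool))  # causal base
    allowed[:, :P] = True  # every position sees the full prefix
    return allowed

mask = prefix_lm_mask(T=6, P=3)
```

Setting P = 0 recovers the ordinary causal mask, which is why prefix modeling is often described as a relaxation of autoregressive modeling.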

Contrastive Pretraining

Contrastive pretraining is common for embedding models and retrieval systems. Instead of predicting tokens, the model learns to place related items close together in representation space and unrelated items farther apart.

A training example may contain a query q, a positive document d^+, and several negative documents d^-_1, \ldots, d^-_k. The model computes embeddings and similarity scores:

s(q,d) = \mathrm{sim}(f_\theta(q), f_\theta(d)).

A common contrastive loss is

\mathcal{L} = -\log \frac{\exp(s(q,d^+)/\tau)}{\exp(s(q,d^+)/\tau) + \sum_{i=1}^{k}\exp(s(q,d_i^-)/\tau)}.

Here τ is a temperature parameter.
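A sketch of this loss in PyTorch, assuming cosine similarity and a hypothetical temperature of 0.05:

```python
import torch
import torch.nn.functional as F

def info_nce(q_emb, pos_emb, neg_embs, tau=0.05):
    """q_emb: [D], pos_emb: [D], neg_embs: [K, D]."""
    q = F.normalize(q_emb, dim=-1)
    docs = F.normalize(torch.cat([pos_emb.unsqueeze(0), neg_embs]), dim=-1)
    scores = docs @ q / tau  # [K + 1] cosine similarities, temperature-scaled
    # The positive document sits at index 0, so the loss is ordinary
    # cross-entropy with class 0 as the correct answer.
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))
```

Framing the contrastive loss as classification over {positive, negatives} is what lets standard cross-entropy machinery be reused here.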

This objective is central for dense retrieval, semantic search, reranking, and retrieval-augmented generation. A generative language model predicts tokens. An embedding model predicts useful geometry.

Objective and Architecture

The pretraining objective and architecture usually match each other.

| Architecture | Typical objective | Strength |
| --- | --- | --- |
| Encoder-only transformer | Masked language modeling | Understanding and representation |
| Decoder-only transformer | Autoregressive language modeling | Open-ended generation |
| Encoder-decoder transformer | Denoising or sequence-to-sequence modeling | Conditional generation |
| Dual encoder | Contrastive learning | Retrieval and embedding search |

An encoder-only model can look at the full input at once. This makes it strong for tasks where the entire input is known before prediction.

A decoder-only model uses causal attention. This makes it strong for generation, since each new token depends only on previous tokens.

An encoder-decoder model separates input understanding from output generation. This is useful when input and output have different structures.

A dual encoder maps two objects into a shared vector space. This is efficient for large-scale search because document embeddings can be precomputed.

Pretraining Data Distribution

The pretraining objective defines the mathematical loss. The dataset defines the distribution over which the loss is optimized.

A model trained on web pages, books, code, academic papers, dialogue, and structured data learns different behavior from a model trained only on short formal text. Data mixture matters.

Important dimensions of pretraining data include:

| Dimension | Effect |
| --- | --- |
| Domain | Determines what the model knows and how it writes |
| Language coverage | Determines multilingual ability |
| Code fraction | Affects reasoning, tool use, and programming ability |
| Quality filtering | Affects coherence and factuality |
| Deduplication | Reduces memorization and benchmark contamination |
| Time range | Determines temporal coverage |
| Safety filtering | Removes or reduces harmful content patterns |

Pretraining does not give the model direct access to truth. It gives the model statistical regularities from the training corpus. A model may learn many facts, but its objective is still token prediction or reconstruction, not truth verification.

This distinction matters. A pretrained model can produce fluent false statements because the pretraining loss rewards likely text, not guaranteed correctness.

Loss, Perplexity, and Scaling

For autoregressive language models, the average negative log-likelihood per token is commonly used:

\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).

Perplexity is the exponential of this loss:

\mathrm{PPL} = \exp(\mathcal{L}).

Lower perplexity means the model assigns higher probability to the observed text. Perplexity is useful for comparing models on the same dataset and tokenizer. It becomes less reliable when tokenizers, datasets, or evaluation conditions differ.
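A quick numeric sketch: if a model assigns each observed token probability 0.1, the average loss is ln 10 and the perplexity is 10:

```python
import torch

# Suppose each of four observed tokens received probability 0.1.
token_probs = torch.tensor([0.1, 0.1, 0.1, 0.1])
loss = -torch.log(token_probs).mean()  # average negative log-likelihood
ppl = torch.exp(loss)                  # perplexity = exp(loss) = 10
```

Perplexity can be read informally as the effective branching factor: the model is "as uncertain as" a uniform choice among that many tokens.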

In large-scale pretraining, loss tends to decrease predictably as model size, dataset size, and compute increase. This behavior is described by scaling laws. Scaling laws help estimate how much compute and data are needed to reach a target loss.
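One widely used parametric sketch of this behavior writes the expected loss in terms of parameter count N and training-token count D (the constants are fitted empirically and vary across studies):

\mathcal{L}(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},

where E is an irreducible loss floor and A, B, \alpha, \beta are fitted constants. Under a fixed compute budget, such a fit suggests how to trade model size against data size.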

However, lower pretraining loss does not guarantee better behavior in every downstream setting. Instruction following, factuality, safety, tool use, and reasoning may require additional adaptation or different evaluation methods.

PyTorch View of Autoregressive Pretraining

A decoder language model produces logits of shape

[B, T, V]

where:

| Symbol | Meaning |
| --- | --- |
| B | Batch size |
| T | Sequence length |
| V | Vocabulary size |

The target tokens have shape

[B, T]

For next-token prediction, inputs and targets are shifted:

```python
import torch
import torch.nn.functional as F

# token_ids: [B, T]
inputs = token_ids[:, :-1]   # [B, T - 1]
targets = token_ids[:, 1:]   # [B, T - 1]

# model outputs logits: [B, T - 1, V]
logits = model(inputs)

loss = F.cross_entropy(
    logits.reshape(-1, logits.size(-1)),  # [(B * (T - 1)), V]
    targets.reshape(-1)                   # [(B * (T - 1))]
)
```

The reshaping is needed because cross_entropy expects class scores of shape [N, V] and labels of shape [N].

Each position in each sequence becomes one classification problem over the vocabulary.

PyTorch View of Masked Language Modeling

For masked language modeling, only selected positions contribute to the loss.

```python
# input_ids: corrupted input, shape [B, T]
# labels: original token IDs at masked positions, -100 elsewhere

logits = model(input_ids)  # [B, T, V]

loss = F.cross_entropy(
    logits.reshape(-1, logits.size(-1)),
    labels.reshape(-1),
    ignore_index=-100
)
```

The label value -100 is commonly used to mark positions that should be ignored by the loss. Only masked positions update the model for the prediction objective.

This differs from autoregressive training, where almost every token position usually contributes to the loss.

What Pretraining Teaches

Pretraining teaches several kinds of structure at once.

At the surface level, the model learns spelling, syntax, punctuation, formatting, and local phrase patterns.

At the semantic level, it learns associations among entities, concepts, events, and relations.

At the discourse level, it learns genre, style, argument structure, dialogue patterns, code structure, and document organization.

At the computational level, it may learn procedures that are represented in text, such as arithmetic patterns, program traces, proofs, and stepwise explanations.

The objective itself may be simple, but the data distribution is rich. Predicting tokens across trillions of examples forces the model to represent many latent variables that explain text.

Limits of Pretraining Objectives

Pretraining objectives have important limits.

First, token prediction rewards plausibility. It does not directly reward truth, honesty, usefulness, or harmlessness.

Second, the model learns from a fixed corpus. It cannot know events after the data cutoff unless connected to retrieval or tools.

Third, the model may memorize rare strings, especially if they appear repeatedly in training data.

Fourth, the model may learn social biases, unsafe instructions, or low-quality patterns present in the corpus.

Fifth, the model may learn shortcuts. A low loss can hide brittle reasoning, shallow pattern matching, or benchmark contamination.

For these reasons, pretraining is only the first stage in building a useful large language model. Later stages may include supervised fine-tuning, instruction tuning, reinforcement learning from preferences, rejection sampling, constitutional training, retrieval augmentation, tool training, and safety evaluation.

Summary

A pretraining objective defines how a language model learns from large unlabeled or weakly structured text corpora.

Autoregressive language modeling predicts the next token from previous tokens. It is the dominant objective for decoder-only large language models and works naturally for generation.

Masked language modeling predicts hidden tokens from bidirectional context. It is effective for encoder models and representation learning.

Denoising sequence modeling reconstructs clean text from corrupted text. It is common for encoder-decoder models and conditional generation.

Contrastive pretraining learns useful embedding spaces for retrieval and semantic matching.

The objective, architecture, tokenizer, data mixture, and compute budget together determine what a pretrained model can learn. Pretraining gives the model broad linguistic and statistical competence. Later adaptation turns that competence into more controlled behavior.