# Pretraining Objectives

A large language model is trained in two broad phases: pretraining and adaptation. During pretraining, the model learns general statistical structure from a large corpus. During adaptation, the pretrained model is specialized for a downstream task, instruction following, dialogue, tool use, or preference alignment.

A pretraining objective defines the learning problem used before task-specific supervision. It tells the model what to predict, what loss to minimize, and which parts of the input are visible during prediction.

The most important pretraining objectives for language models are:

| Objective | Main idea | Common model type |
|---|---|---|
| Autoregressive language modeling | Predict the next token from previous tokens | GPT-style decoder models |
| Masked language modeling | Predict hidden tokens from surrounding context | BERT-style encoder models |
| Denoising sequence modeling | Reconstruct clean text from corrupted text | T5, BART-style encoder-decoder models |
| Prefix language modeling | Predict continuation tokens from a visible prefix | Encoder-decoder or decoder variants |
| Contrastive pretraining | Learn representations by comparing positive and negative pairs | Retrieval and embedding models |

The choice of pretraining objective strongly affects what the model can do naturally. Autoregressive models are well suited for generation. Masked language models are well suited for representation learning. Encoder-decoder denoising models are well suited for conditional generation, such as translation, summarization, and text rewriting.

### Tokens and Sequences

A language model does not operate directly on raw text. Text is first converted into a sequence of tokens. A token may be a word, subword, byte sequence, or character-like unit.

Let a text sequence be represented as

$$
x = (x_1, x_2, \ldots, x_T),
$$

where each $x_t$ is a token from a vocabulary $V$, and $T$ is the sequence length.

The model receives tokens as integer IDs. These IDs are mapped to embedding vectors, then processed by a neural network. The pretraining objective defines which token distributions the model must predict.

For most large language models, the output at each position is a probability distribution over the vocabulary:

$$
p_\theta(x_t \mid \text{context}),
$$

where $\theta$ denotes the model parameters.

The loss is usually a cross-entropy loss between the predicted distribution and the true token.
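The following minimal sketch shows this pipeline for a toy decoder: token IDs are embedded, projected back to vocabulary logits, and scored with cross-entropy against a target token. The layer names, dimensions, and token IDs are illustrative, not taken from any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 50_000   # |V|, illustrative
d_model = 512         # embedding width, illustrative

embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

token_ids = torch.tensor([[15, 2048, 7, 991]])   # [B=1, T=4], made-up IDs

hidden = embedding(token_ids)       # [1, 4, 512]; a real model applies transformer layers here
logits = lm_head(hidden)            # [1, 4, 50000]
probs = F.softmax(logits, dim=-1)   # p_theta(. | context) at each position

# cross-entropy at the last position against a made-up "true" next token
true_next = torch.tensor([42])
loss = F.cross_entropy(logits[:, -1, :], true_next)
```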

### Autoregressive Language Modeling

Autoregressive language modeling is the standard objective for decoder-only large language models.

The model learns to predict each token from the tokens before it:

$$
p_\theta(x_t \mid x_1, x_2, \ldots, x_{t-1}).
$$

The probability of the full sequence is factorized as

$$
p_\theta(x_1, x_2, \ldots, x_T) =
\prod_{t=1}^{T}
p_\theta(x_t \mid x_{<t}).
$$

Here $x_{<t}$ means all tokens before position $t$.

The training loss is the negative log-likelihood:

$$
\mathcal{L}(\theta) =
-\sum_{t=1}^{T}
\log p_\theta(x_t \mid x_{<t}).
$$

In practice, the model predicts all next-token targets in parallel during training. The input sequence is shifted relative to the target sequence.

For example, given the text:

```text
Deep learning models predict tokens
```

The training pairs are:

| Context | Target |
|---|---|
| `Deep` | `learning` |
| `Deep learning` | `models` |
| `Deep learning models` | `predict` |
| `Deep learning models predict` | `tokens` |

The transformer uses a causal attention mask so that position $t$ cannot attend to positions after $t$. This prevents the model from seeing the answer during training.

Autoregressive pretraining has a simple advantage: generation uses the same conditional distribution as training. At inference time, the model repeatedly samples or selects the next token:

$$
x_{t+1} \sim p_\theta(\cdot \mid x_1,\ldots,x_t).
$$

This makes the objective naturally aligned with text generation.
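As a minimal sketch of this loop, assuming only that `model(token_ids)` returns logits of shape `[B, T, V]` (the same convention used in the PyTorch section later in this chapter):

```python
import torch

@torch.no_grad()
def generate(model, token_ids, max_new_tokens=20, temperature=1.0):
    """Sample a continuation one token at a time from p_theta(. | x_1..x_t)."""
    for _ in range(max_new_tokens):
        logits = model(token_ids)                   # [B, T, V]
        next_logits = logits[:, -1, :] / temperature
        probs = torch.softmax(next_logits, dim=-1)
        next_token = torch.multinomial(probs, 1)    # [B, 1], sampled next token
        token_ids = torch.cat([token_ids, next_token], dim=1)
    return token_ids
```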

### Causal Masking

A decoder-only transformer uses self-attention, but attention is restricted by a causal mask. Without a causal mask, token $x_t$ could attend to future tokens $x_{t+1}, x_{t+2}, \ldots$. That would make next-token prediction trivial and invalid.

The causal mask allows each position to attend only to itself and earlier positions.

For a sequence of length $T$, the attention mask has a triangular structure:

$$
M_{ij} =
\begin{cases}
0, & j \leq i, \\
-\infty, & j > i.
\end{cases}
$$

Before the softmax, masked positions receive a very negative value. After the softmax, their attention weight becomes approximately zero.

This constraint creates a left-to-right model of text. The model learns to compress all relevant previous information into hidden states that predict the future.
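A minimal sketch of this mask in PyTorch, applied to a made-up matrix of attention scores for a single head:

```python
import torch

T = 5
scores = torch.randn(T, T)  # raw attention scores, illustrative

# positions with j > i are future tokens; mark them for exclusion
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
masked_scores = scores.masked_fill(future, float("-inf"))

# after softmax, each row i attends only to positions j <= i
weights = torch.softmax(masked_scores, dim=-1)
```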

### Masked Language Modeling

Masked language modeling is used by encoder-style models. Instead of predicting the next token, the model sees a corrupted sequence and predicts selected missing tokens.

Given a sequence

$$
x = (x_1,\ldots,x_T),
$$

we choose a subset of positions $M$. Tokens at those positions are replaced by a special mask token, a random token, or sometimes left unchanged. The model receives the corrupted sequence $\tilde{x}$, then predicts the original tokens at masked positions.

The loss is

$$
\mathcal{L}(\theta) =
-\sum_{t \in M}
\log p_\theta(x_t \mid \tilde{x}).
$$

The key difference from autoregressive modeling is that masked language models can use both left and right context. For example:

```text
The capital of France is [MASK].
```

The model can use the full sentence to predict `Paris`.

Masked language modeling is effective for representation learning because every token representation can depend on surrounding tokens in both directions. This is useful for classification, extraction, retrieval, and sentence-pair tasks.

However, masked language models are less natural for open-ended generation, because they do not define a left-to-right probability over complete sequences the way an autoregressive model does.

### Denoising Sequence Modeling

Denoising sequence modeling generalizes masked language modeling. Instead of masking individual tokens, the training process corrupts the input in a richer way. The model must reconstruct the original clean sequence.

Corruptions may include:

| Corruption | Description |
|---|---|
| Token masking | Replace tokens with mask symbols |
| Span masking | Replace consecutive spans with sentinel tokens |
| Token deletion | Remove some tokens |
| Sentence permutation | Reorder sentences |
| Text infilling | Fill in missing spans |
| Noise injection | Add random or misleading tokens |

Denoising objectives are common in encoder-decoder models. The encoder reads the corrupted input. The decoder generates the clean output.

Let $x$ be the original sequence and $\tilde{x}$ be the corrupted sequence. The model learns

$$
p_\theta(x \mid \tilde{x}).
$$

The loss is

$$
\mathcal{L}(\theta) =
-\log p_\theta(x \mid \tilde{x}).
$$

This objective trains the model to map incomplete or noisy text into coherent text. It is useful for summarization, translation, editing, and other conditional generation tasks.
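The sketch below illustrates span masking in the style of T5 sentinel tokens: consecutive spans are removed from the encoder input, and the decoder target lists each sentinel followed by the original span. It is deliberately simplified; real pipelines sample span lengths and corruption rates, and the function name and parameters here are illustrative.

```python
import random

def span_corrupt(tokens, span_len=3, num_spans=2):
    """Replace a few fixed-length spans with sentinels and build the reconstruction target."""
    corrupted, target = [], []
    starts = sorted(random.sample(range(len(tokens) - span_len), num_spans))
    pos, sentinel = 0, 0
    for s in starts:
        if s < pos:
            continue  # skip overlapping picks in this simplified version
        corrupted.extend(tokens[pos:s])
        corrupted.append(f"<extra_id_{sentinel}>")   # sentinel in the corrupted input
        target.append(f"<extra_id_{sentinel}>")      # same sentinel starts the target span
        target.extend(tokens[s:s + span_len])
        pos = s + span_len
        sentinel += 1
    corrupted.extend(tokens[pos:])
    return corrupted, target

words = "the quick brown fox jumps over the lazy dog".split()
enc_input, dec_target = span_corrupt(words)
# enc_input keeps the surviving words plus sentinels;
# dec_target lists each sentinel followed by the span it replaced.
```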

### Prefix Language Modeling

Prefix language modeling gives the model a visible prefix and asks it to generate the continuation.

Given a sequence split into two parts,

$$
x = (x_{\text{prefix}}, x_{\text{target}}),
$$

the model predicts

$$
p_\theta(x_{\text{target}} \mid x_{\text{prefix}}).
$$

This is similar to ordinary autoregressive modeling, but the prefix may be encoded bidirectionally in some architectures. The target portion is still generated left to right.

The objective is useful when the model must condition on an input and produce an output. For example:

```text
Input: Translate to French: I like machine learning.
Output: J'aime l'apprentissage automatique.
```

The prefix contains the task instruction and source text. The target contains the desired completion.

Prefix objectives form a bridge between language modeling and instruction-style conditional generation.
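A minimal sketch of the loss for this setup, assuming a decoder-style `model(inputs)` that returns logits of shape `[B, T - 1, V]`. Only the target portion contributes to the loss; architectures with a bidirectional prefix would additionally change the attention mask, which this sketch does not show.

```python
import torch
import torch.nn.functional as F

def prefix_lm_loss(model, token_ids, prefix_len):
    """token_ids: [B, T]; the first prefix_len tokens are the visible prefix."""
    inputs = token_ids[:, :-1]
    targets = token_ids[:, 1:].clone()
    # targets that fall inside the prefix are excluded from the loss
    targets[:, : prefix_len - 1] = -100          # ignore_index below skips these positions
    logits = model(inputs)                       # [B, T - 1, V]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```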

### Contrastive Pretraining

Contrastive pretraining is common for embedding models and retrieval systems. Instead of predicting tokens, the model learns to place related items close together in representation space and unrelated items farther apart.

A training example may contain a query $q$, a positive document $d^+$, and several negative documents $d^-_1,\ldots,d^-_k$. The model computes embeddings and similarity scores:

$$
s(q,d) = \mathrm{sim}(f_\theta(q), f_\theta(d)).
$$

A common contrastive loss is

$$
\mathcal{L} =
-\log
\frac{\exp(s(q,d^+)/\tau)}
{\exp(s(q,d^+)/\tau) + \sum_{i=1}^{k}\exp(s(q,d_i^-)/\tau)}.
$$

Here $\tau$ is a temperature parameter.
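A minimal sketch of this loss with cosine similarity as `sim`, assuming the embeddings have already been computed by $f_\theta$; the shapes and temperature value are illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, neg_emb, tau=0.05):
    """q_emb: [B, D], pos_emb: [B, D], neg_emb: [B, K, D]."""
    q = F.normalize(q_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)

    s_pos = (q * pos).sum(dim=-1, keepdim=True)      # [B, 1] similarity to the positive
    s_neg = torch.einsum("bd,bkd->bk", q, neg)       # [B, K] similarities to the negatives
    logits = torch.cat([s_pos, s_neg], dim=1) / tau  # [B, 1 + K]

    # for every query, the positive document is class 0
    labels = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```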

This objective is central for dense retrieval, semantic search, reranking, and retrieval-augmented generation. A generative language model predicts tokens; an embedding model learns a useful geometry over texts.

### Objective and Architecture

The pretraining objective and architecture usually match each other.

| Architecture | Typical objective | Strength |
|---|---|---|
| Encoder-only transformer | Masked language modeling | Understanding and representation |
| Decoder-only transformer | Autoregressive language modeling | Open-ended generation |
| Encoder-decoder transformer | Denoising or sequence-to-sequence modeling | Conditional generation |
| Dual encoder | Contrastive learning | Retrieval and embedding search |

An encoder-only model can look at the full input at once. This makes it strong for tasks where the entire input is known before prediction.

A decoder-only model uses causal attention. This makes it strong for generation, since each new token depends only on previous tokens.

An encoder-decoder model separates input understanding from output generation. This is useful when input and output have different structures.

A dual encoder maps two objects into a shared vector space. This is efficient for large-scale search because document embeddings can be precomputed.

### Pretraining Data Distribution

The pretraining objective defines the mathematical loss. The dataset defines the distribution over which the loss is optimized.

A model trained on web pages, books, code, academic papers, dialogue, and structured data learns different behavior from a model trained only on short formal text. Data mixture matters.

Important dimensions of pretraining data include:

| Dimension | Effect |
|---|---|
| Domain | Determines what the model knows and how it writes |
| Language coverage | Determines multilingual ability |
| Code fraction | Affects reasoning, tool use, and programming ability |
| Quality filtering | Affects coherence and factuality |
| Deduplication | Reduces memorization and benchmark contamination |
| Time range | Determines temporal coverage |
| Safety filtering | Removes or reduces harmful content patterns |

Pretraining does not give the model direct access to truth. It gives the model statistical regularities from the training corpus. A model may learn many facts, but its objective is still token prediction or reconstruction, not truth verification.

This distinction matters. A pretrained model can produce fluent false statements because the pretraining loss rewards likely text, not guaranteed correctness.

### Loss, Perplexity, and Scaling

For autoregressive language models, the average negative log-likelihood per token is commonly used:

$$
\mathcal{L} =
-\frac{1}{T}
\sum_{t=1}^{T}
\log p_\theta(x_t \mid x_{<t}).
$$

Perplexity is the exponential of this loss:

$$
\mathrm{PPL} = \exp(\mathcal{L}).
$$

Lower perplexity means the model assigns higher probability to the observed text. Perplexity is useful for comparing models on the same dataset and tokenizer. It becomes less reliable when tokenizers, datasets, or evaluation conditions differ.
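A minimal numeric sketch of the relationship, using made-up per-token log-probabilities:

```python
import torch

# log p_theta(x_t | x_<t) for each of T = 4 tokens in an evaluation text (made up)
log_probs = torch.tensor([-2.1, -0.4, -3.0, -1.2])

mean_nll = -log_probs.mean()       # average negative log-likelihood per token: 1.675
perplexity = torch.exp(mean_nll)   # exp(1.675) ≈ 5.34
```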

In large-scale pretraining, loss tends to decrease predictably as model size, dataset size, and compute increase. This behavior is described by scaling laws. Scaling laws help estimate how much compute and data are needed to reach a target loss.

However, lower pretraining loss does not guarantee better behavior in every downstream setting. Instruction following, factuality, safety, tool use, and reasoning may require additional adaptation or different evaluation methods.

### PyTorch View of Autoregressive Pretraining

A decoder language model produces logits of shape

```python
[B, T, V]
```

where:

| Symbol | Meaning |
|---|---|
| `B` | Batch size |
| `T` | Sequence length |
| `V` | Vocabulary size |

The target tokens have shape

```python
[B, T]
```

For next-token prediction, inputs and targets are shifted:

```python
import torch
import torch.nn.functional as F

# token_ids: [B, T]
inputs = token_ids[:, :-1]   # [B, T - 1]
targets = token_ids[:, 1:]   # [B, T - 1]

# model outputs logits: [B, T - 1, V]
logits = model(inputs)

loss = F.cross_entropy(
    logits.reshape(-1, logits.size(-1)),  # [(B * (T - 1)), V]
    targets.reshape(-1)                   # [(B * (T - 1))]
)
```

The reshaping is needed because `cross_entropy` expects class scores of shape `[N, V]` and labels of shape `[N]`.

Each position in each sequence becomes one classification problem over the vocabulary.

### PyTorch View of Masked Language Modeling

For masked language modeling, only selected positions contribute to the loss.

```python
# input_ids: corrupted input, shape [B, T]
# labels: original token IDs at masked positions, -100 elsewhere

logits = model(input_ids)  # [B, T, V]

loss = F.cross_entropy(
    logits.reshape(-1, logits.size(-1)),
    labels.reshape(-1),
    ignore_index=-100
)
```

The label value `-100` is commonly used to mark positions that should be ignored by the loss. Only masked positions update the model for the prediction objective.

This differs from autoregressive training, where almost every token position usually contributes to the loss.
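A minimal sketch of how such `input_ids` and `labels` might be produced, following the common BERT-style recipe of selecting about 15% of positions and replacing 80% of them with the mask token, 10% with a random token, and leaving 10% unchanged. The function name and proportions are illustrative, and special tokens are not handled:

```python
import torch

def mask_tokens(token_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Build (input_ids, labels) for masked language modeling."""
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape) < mask_prob     # positions to predict
    labels[~selected] = -100                                # everything else is ignored by the loss

    input_ids = token_ids.clone()
    rand = torch.rand(token_ids.shape)
    input_ids[selected & (rand < 0.8)] = mask_token_id      # 80%: replace with [MASK]
    random_pick = selected & (rand >= 0.8) & (rand < 0.9)   # 10%: replace with a random token
    random_ids = torch.randint(vocab_size, token_ids.shape)
    input_ids[random_pick] = random_ids[random_pick]
    return input_ids, labels                                # remaining 10% stay unchanged
```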

### What Pretraining Teaches

Pretraining teaches several kinds of structure at once.

At the surface level, the model learns spelling, syntax, punctuation, formatting, and local phrase patterns.

At the semantic level, it learns associations among entities, concepts, events, and relations.

At the discourse level, it learns genre, style, argument structure, dialogue patterns, code structure, and document organization.

At the computational level, it may learn procedures that are represented in text, such as arithmetic patterns, program traces, proofs, and stepwise explanations.

The objective itself may be simple, but the data distribution is rich. Predicting tokens across trillions of examples forces the model to represent many latent variables that explain text.

### Limits of Pretraining Objectives

Pretraining objectives have important limits.

First, token prediction rewards plausibility. It does not directly reward truth, honesty, usefulness, or harmlessness.

Second, the model learns from a fixed corpus. It cannot know events after the data cutoff unless connected to retrieval or tools.

Third, the model may memorize rare strings, especially if they appear repeatedly in training data.

Fourth, the model may learn social biases, unsafe instructions, or low-quality patterns present in the corpus.

Fifth, the model may learn shortcuts. A low loss can hide brittle reasoning, shallow pattern matching, or benchmark contamination.

For these reasons, pretraining is only the first stage in building a useful large language model. Later stages may include supervised fine-tuning, instruction tuning, reinforcement learning from preferences, rejection sampling, constitutional training, retrieval augmentation, tool training, and safety evaluation.

### Summary

A pretraining objective defines how a language model learns from large unlabeled or weakly structured text corpora.

Autoregressive language modeling predicts the next token from previous tokens. It is the dominant objective for decoder-only large language models and works naturally for generation.

Masked language modeling predicts hidden tokens from bidirectional context. It is effective for encoder models and representation learning.

Denoising sequence modeling reconstructs clean text from corrupted text. It is common for encoder-decoder models and conditional generation.

Contrastive pretraining learns useful embedding spaces for retrieval and semantic matching.

The objective, architecture, tokenizer, data mixture, and compute budget together determine what a pretrained model can learn. Pretraining gives the model broad linguistic and statistical competence. Later adaptation turns that competence into more controlled behavior.

