# Autoregressive Modeling

Autoregressive modeling is the dominant formulation for modern language generation. The model predicts the next token from previous tokens. Repeating this prediction step produces a sequence.

Given a token sequence

$$
x_{1:T} = (x_1, x_2, \ldots, x_T),
$$

an autoregressive language model factorizes its probability as

$$
p_\theta(x_{1:T}) =
\prod_{t=1}^{T}
p_\theta(x_t \mid x_{1:t-1}).
$$

This is the same chain-rule factorization introduced in statistical language modeling. The difference is the parameterization: classical $n$-gram models approximate each conditional distribution with count tables, while modern autoregressive models use neural networks, usually transformers.
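
In code, this factorization means the log-probability of a sequence is a sum of per-token conditional log-probabilities. A minimal sketch, assuming a `model` that maps token ids of shape `[B, T]` to logits of shape `[B, T, |V|]`, where position $t$ scores candidates for $x_{t+1}$:

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(model, x):
    """Log-probability of each sequence in the batch under the model."""
    logits = model(x)                          # [B, T, V]
    log_probs = F.log_softmax(logits, dim=-1)  # normalize over the vocabulary
    # Position t predicts token t+1, so align logits[:, :-1] with x[:, 1:].
    target_lp = log_probs[:, :-1].gather(
        -1, x[:, 1:].unsqueeze(-1)
    ).squeeze(-1)                              # [B, T-1]
    return target_lp.sum(dim=-1)               # [B]
```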

### Next-Token Prediction

The basic training problem is next-token prediction. Given a prefix, the model predicts the following token.

For the sequence

$$
\text{deep learning models generalize}
$$

the training examples are:

| Input prefix | Target token |
|---|---|
| deep | learning |
| deep learning | models |
| deep learning models | generalize |

The model learns a conditional distribution over the vocabulary at every position:

$$
p_\theta(x_{t+1} \mid x_{1:t}).
$$

In practice, all positions are trained in parallel. A transformer receives a full sequence and predicts the next token for each position, while a causal mask prevents the model from looking at future tokens.

### Causal Masking

Autoregressive models must respect temporal order. When predicting token $x_t$, the model may use only earlier tokens $x_{1:t-1}$. It must not use $x_t$ itself or any later token.

In a transformer, this constraint is enforced by a causal attention mask.

Without a mask, token $t$ could attend to all positions:

$$
1,2,\ldots,T.
$$

With a causal mask, token $t$ can attend only to

$$
1,2,\ldots,t.
$$

Position $t$ may attend to itself because its output is used to predict the next token, $x_{t+1}$, not $x_t$.

The attention score matrix has shape

$$
T \times T.
$$

The causal mask sets future positions to negative infinity before the softmax. For example, for $T=5$, the allowed attention pattern is:

$$
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 \\
1 & 1 & 1 & 1 & 1
\end{bmatrix}.
$$

A value of 1 means attention is allowed. A value of 0 means attention is blocked.

In PyTorch:

```python
import torch

T = 5

# Lower-triangular matrix of ones: row t permits attention to positions 1..t.
mask = torch.tril(torch.ones(T, T))
print(mask)
```

Applied to attention logits:

```python
scores = torch.randn(T, T)

# Set disallowed (future) positions to -inf so softmax assigns them zero.
masked_scores = scores.masked_fill(
    mask == 0,
    float("-inf"),
)
```

After softmax, blocked positions receive probability zero.
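
A quick check, continuing the snippet above:

```python
import torch.nn.functional as F

attn = F.softmax(masked_scores, dim=-1)
print(attn)  # entries above the diagonal are 0; each row sums to 1
```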

### Training Objective

Autoregressive models are usually trained by maximum likelihood. For a single training sequence, the objective is

$$
\max_\theta
\sum_{t=1}^{T}
\log p_\theta(x_t \mid x_{1:t-1}).
$$

Equivalently, we minimize the negative log-likelihood:

$$
\mathcal{L}(\theta) = -
\sum_{t=1}^{T}
\log p_\theta(x_t \mid x_{1:t-1}).
$$

For a batch of sequences

$$
X \in \mathbb{N}^{B \times T},
$$

the model produces logits

$$
Z \in \mathbb{R}^{B \times T \times |V|}.
$$

The target is the same sequence shifted left by one position.

```python
B, T, vocab_size = 8, 128, 50257  # illustrative sizes

# Random token ids stand in for real corpus data here.
tokens = torch.randint(0, vocab_size, (B, T + 1))

x = tokens[:, :-1]  # inputs:  positions 1..T
y = tokens[:, 1:]   # targets: positions 2..T+1, i.e. shifted left by one
```

The model receives `x` and predicts `y`.

```python
logits = model(x)  # [B, T, V]

# Flatten batch and time so each position is one classification example.
loss = torch.nn.functional.cross_entropy(
    logits.reshape(B * T, vocab_size),
    y.reshape(B * T),
)
```

This is the standard pretraining objective for GPT-style language models.
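
A single optimization step on this loss completes the picture (optimizer choice and learning rate are illustrative):

```python
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

optimizer.zero_grad()
loss.backward()   # gradients of the negative log-likelihood
optimizer.step()  # update the parameters
```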

### Teacher Forcing

During training, the model conditions on the true previous tokens. This is called teacher forcing.

For example, when predicting the fourth token, the model receives the correct first three tokens, even if it would have generated a different third token during inference.

Training context:

$$
p_\theta(x_4 \mid x_1,x_2,x_3).
$$

Inference context:

$$
p_\theta(\hat{x}_4 \mid \hat{x}_1,\hat{x}_2,\hat{x}_3).
$$

The hat notation indicates model-generated tokens.

Teacher forcing makes training efficient because every position in a sequence can be supervised at once. It also creates a mismatch between training and generation. At inference time, errors can compound because the model must condition on its own outputs.

Despite this mismatch, teacher forcing remains the standard method for large-scale language model pretraining.

### Generation as Repeated Sampling

Autoregressive generation proceeds one token at a time.

Given an initial prompt

$$
x_{1:k},
$$

the model computes

$$
p_\theta(x_{k+1} \mid x_{1:k}).
$$

A next token is selected from this distribution and appended to the context. The process repeats:

$$
x_{k+2} \sim p_\theta(x_{k+2} \mid x_{1:k+1}),
$$

$$
x_{k+3} \sim p_\theta(x_{k+3} \mid x_{1:k+2}).
$$

This loop continues until a stop token is generated or a maximum length is reached.

Minimal generation loop:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()  # no gradients are needed during generation
def generate(model, prompt, max_new_tokens):
    x = prompt  # [B, T] token ids

    for _ in range(max_new_tokens):
        logits = model(x)  # [B, T, V]

        next_logits = logits[:, -1, :]
        probs = F.softmax(next_logits, dim=-1)

        # Draw one token per sequence from the distribution.
        next_token = torch.multinomial(probs, num_samples=1)  # [B, 1]

        # Append it and repeat with the extended context.
        x = torch.cat([x, next_token], dim=1)

    return x
```

The key line is:

```python
next_logits = logits[:, -1, :]
```

Only the final position is used to choose the next token.

### Decoding Strategies

At each step, the model outputs a probability distribution over the vocabulary. Decoding is the procedure for choosing the next token from that distribution.

The simplest method is greedy decoding:

$$
\hat{x}_{t+1} =
\arg\max_i p_\theta(x_{t+1}=i \mid x_{1:t}).
$$

Greedy decoding always selects the most likely token. It is deterministic and efficient, but often produces repetitive or dull text.

Sampling draws from the probability distribution:

$$
x_{t+1} \sim p_\theta(\cdot \mid x_{1:t}).
$$

Sampling produces more diverse outputs but may choose low-quality tokens.

Temperature modifies the logits before softmax:

$$
p_i =
\frac{\exp(z_i / \tau)}
{\sum_j \exp(z_j / \tau)}.
$$

Here $\tau$ is the temperature.

| Temperature | Effect |
|---:|---|
| $\tau < 1$ | Sharper distribution, more deterministic |
| $\tau = 1$ | Original distribution |
| $\tau > 1$ | Flatter distribution, more random |

Top-$k$ sampling keeps only the $k$ highest-probability tokens and samples among them.

Top-$p$, or nucleus sampling, keeps the smallest set of tokens whose cumulative probability exceeds $p$.

These methods control the tradeoff between coherence and diversity.
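
A minimal sketch combining these controls (the function name and defaults are illustrative):

```python
import torch
import torch.nn.functional as F

def sample_next(next_logits, temperature=1.0, top_k=None, top_p=None):
    """Choose one next token per sequence from [B, V] logits."""
    logits = next_logits / temperature

    if top_k is not None:
        # Block everything outside the k highest-scoring tokens.
        kth_best = logits.topk(top_k, dim=-1).values[:, -1:]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))

    if top_p is not None:
        sorted_logits, sorted_ix = logits.sort(dim=-1, descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        # Drop tokens whose preceding cumulative probability already reaches p.
        cut = probs.cumsum(dim=-1) - probs >= top_p
        sorted_logits = sorted_logits.masked_fill(cut, float("-inf"))
        # Undo the sort to restore vocabulary order.
        logits = sorted_logits.gather(-1, sorted_ix.argsort(dim=-1))

    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # [B, 1]
```

In the `generate` loop above, `sample_next(next_logits, temperature=0.8, top_p=0.9)` would replace the plain `softmax`/`multinomial` pair; greedy decoding corresponds to `next_logits.argmax(dim=-1, keepdim=True)`.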

### Beam Search

Beam search keeps several partial generations at once. At each step, it expands each candidate sequence and keeps the best-scoring beams.

For a sequence $x_{1:T}$, the score is usually the log probability:

$$
\log p_\theta(x_{1:T}) =
\sum_{t=1}^{T}
\log p_\theta(x_t \mid x_{1:t-1}).
$$

Beam search is common in machine translation and speech recognition. It is less common for open-ended large language model generation because it can produce generic or repetitive text.

A length penalty is often added because raw sequence probability tends to prefer shorter outputs.
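
A minimal sketch under simplifying assumptions (batch size 1, no stop-token handling; `beam_width` is an illustrative parameter):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(model, prompt, max_new_tokens, beam_width=4):
    # Each beam is a (tokens [1, T], cumulative log-probability) pair.
    beams = [(prompt, 0.0)]

    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            logits = model(tokens)  # [1, T, V]
            log_probs = F.log_softmax(logits[:, -1, :], dim=-1)
            top_lp, top_ix = log_probs.topk(beam_width, dim=-1)
            for lp, ix in zip(top_lp[0], top_ix[0]):
                extended = torch.cat([tokens, ix.view(1, 1)], dim=1)
                candidates.append((extended, score + lp.item()))
        # Keep the beam_width best candidates by total log-probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]

    return beams[0][0]
```

A real implementation would also retire beams that emit the stop token and apply the length penalty mentioned above before selecting a winner.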

### Exposure Bias and Error Accumulation

Training uses true prefixes. Inference uses generated prefixes. This difference is called exposure bias.

Suppose a model generates an incorrect token at time $t$. The next prediction uses a context that may never appear in the training data. This can move the model further away from the distribution it learned.

The effect is especially visible in long generation. Small errors can accumulate into incoherence, repetition, contradiction, or topic drift.

Several approaches address this problem:

| Approach | Description |
|---|---|
| Scheduled sampling | Mix true and generated tokens during training (sketched below) |
| Sequence-level objectives | Optimize full generated sequences |
| Reinforcement learning | Optimize reward functions over sampled outputs |
| Better decoding | Reduce low-quality continuation paths |
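
Of these, scheduled sampling is the most direct to sketch. One hypothetical two-pass variant for transformers runs the model once to get its own predictions, then trains on inputs where some true tokens are swapped for predicted ones. The names `epsilon`, `pred_inputs`, and `x_mixed` are illustrative; `x`, `y`, and `model` are the batch variables from the training section:

```python
epsilon = 0.25  # probability of feeding the model its own prediction

# First pass: the model's guess for each next token (no gradients needed).
with torch.no_grad():
    pred = model(x).argmax(dim=-1)  # [B, T]; position t guesses token t+1

# Shift guesses right so they line up with the inputs they would replace.
pred_inputs = torch.cat([x[:, :1], pred[:, :-1]], dim=1)

# Randomly swap some true inputs for model predictions.
use_pred = torch.rand(x.shape, device=x.device) < epsilon
x_mixed = torch.where(use_pred, pred_inputs, x)

# Second pass: compute the usual loss against the unchanged targets y.
logits = model(x_mixed)
```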

Modern instruction-tuned models rely heavily on better pretraining, supervised fine-tuning, preference optimization, and decoding constraints rather than replacing teacher forcing.

### Context Length

An autoregressive model can condition only on tokens inside its context window.

If the context length is $L$, then the model computes

$$
p_\theta(x_t \mid x_{t-L:t-1})
$$

for positions far into a sequence.
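
In the `generate` loop from earlier, this appears as cropping the context to the most recent $L$ tokens before each forward pass (a sketch, with `L` standing for the model's context length):

```python
# Inside the generation loop: condition on at most the last L tokens.
x_cond = x if x.shape[1] <= L else x[:, -L:]
logits = model(x_cond)
```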

A longer context window allows the model to use more previous information. It also increases memory and computation, especially for standard self-attention, whose cost grows quadratically with sequence length:

$$
O(T^2).
$$

Long-context models use methods such as sparse attention, sliding windows, memory tokens, recurrence, compressed context, and retrieval augmentation to extend usable context.
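
As an illustration of the sliding-window idea, the causal mask from earlier can be banded so that each position sees only the most recent $W$ positions (values are illustrative):

```python
import torch

T, W = 5, 3

# Causal mask minus everything more than W-1 steps in the past:
# row t allows attention to positions t-W+1 .. t.
window_mask = torch.tril(torch.ones(T, T)) - torch.tril(torch.ones(T, T), diagonal=-W)
print(window_mask)
```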

### Autoregressive Models and Parallelism

Training and generation have different parallelism properties.

During training, all token predictions in a sequence can be computed in parallel because the true sequence is known. A causal mask prevents information leakage from future tokens.

During generation, tokens must be produced sequentially. The model cannot generate token $t+1$ before token $t$ exists.

This creates an inference bottleneck.

To improve generation speed, systems use techniques such as:

| Technique | Purpose |
|---|---|
| KV caching | Reuse previous attention keys and values |
| Speculative decoding | Draft multiple tokens with a smaller model |
| Quantization | Reduce memory bandwidth and compute cost |
| Batching | Serve multiple requests together |
| Tensor parallelism | Split computation across devices |

Autoregressive models are therefore easy to train in parallel but relatively expensive to decode one token at a time.
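
To illustrate the first of these, here is a minimal single-head KV-cache sketch (the head dimension `d` and the function name are illustrative; real implementations keep one cache per layer and per attention head):

```python
import torch
import torch.nn.functional as F

d = 64  # head dimension (illustrative)

def attend_with_cache(q_t, k_t, v_t, cache_k, cache_v):
    """Attention output for the newest token; q_t, k_t, v_t are [1, 1, d]."""
    # Grow the cache instead of recomputing keys/values for old tokens.
    cache_k = torch.cat([cache_k, k_t], dim=1)         # [1, t, d]
    cache_v = torch.cat([cache_v, v_t], dim=1)         # [1, t, d]
    scores = q_t @ cache_k.transpose(1, 2) / d ** 0.5  # [1, 1, t]
    out = F.softmax(scores, dim=-1) @ cache_v          # [1, 1, d]
    return out, cache_k, cache_v

# The cache starts empty and grows by one entry per generated token.
cache_k = torch.zeros(1, 0, d)
cache_v = torch.zeros(1, 0, d)
```

Because the newest token attends only to cached past positions, no causal mask is needed at decode time, and each step costs $O(t)$ in attention rather than recomputing the full $O(t^2)$ score matrix.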

### Autoregressive Modeling Beyond Text

Autoregressive modeling applies to any data that can be represented as a sequence.

Examples include:

| Domain | Sequence elements |
|---|---|
| Text | Tokens, subwords, bytes |
| Audio | Samples, frames, codes |
| Images | Pixels, patches, discrete visual tokens |
| Video | Frames, patches, latent codes |
| Code | Tokens |
| Actions | Control commands |
| Molecules | Atoms or string tokens |

For images, an autoregressive model might generate pixels or patches in raster order. For audio, it might generate waveform samples or compressed audio tokens. For multimodal systems, the model may generate text conditioned on image, audio, or video embeddings.

The essential structure remains the same:

$$
p(x_{1:T}) =
\prod_{t=1}^{T}
p(x_t \mid x_{1:t-1}).
$$

### Strengths and Limitations

Autoregressive modeling has several strengths.

It defines a valid probability distribution over sequences, supports open-ended generation, is compatible with maximum likelihood training, scales well with transformers, and naturally handles variable-length outputs.

It also has limitations.

Generation is sequential, which makes long outputs expensive, and errors can accumulate. The model can excel at local next-token prediction while failing at long-horizon planning, can produce fluent text that is not grounded in truth, and can be sensitive to decoding settings.

Modern language models address these limitations with larger context windows, retrieval, tool use, preference optimization, inference-time search, and external verification.

Autoregressive modeling remains the central foundation for GPT-style language models. Its power comes from a simple training signal repeated at enormous scale: predict the next token.

