
Autoregressive Modeling

Autoregressive modeling is the dominant formulation for modern language generation. The model predicts the next token from previous tokens. Repeating this prediction step produces a sequence.

Given a token sequence

$$x_{1:T} = (x_1, x_2, \ldots, x_T),$$

an autoregressive language model factorizes its probability as

$$p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}).$$

This is the same chain-rule factorization introduced in statistical language modeling. The difference is the parameterization. Classical models approximate the conditional distribution with count tables. Modern autoregressive models use neural networks, usually transformers.

Next-Token Prediction

The basic training problem is next-token prediction. Given a prefix, the model predicts the following token.

For the sequence

$$\text{deep learning models generalize}$$

the training examples are:

| Input prefix | Target token |
| --- | --- |
| deep | learning |
| deep learning | models |
| deep learning models | generalize |

The model learns a conditional distribution over the vocabulary at every position:

$$p_\theta(x_{t+1} \mid x_{1:t}).$$

In practice, all positions are trained in parallel. A transformer receives a full sequence and predicts the next token for each position, while a causal mask prevents the model from looking at future tokens.

Causal Masking

Autoregressive models must respect temporal order. When predicting token $x_t$, the model may use only earlier tokens $x_{1:t-1}$. It must not use $x_t$ itself or any later token.

In a transformer, this constraint is enforced by a causal attention mask.

Without a mask, token $t$ could attend to all positions:

$$1, 2, \ldots, T.$$

With a causal mask, token $t$ can attend only to

$$1, 2, \ldots, t.$$

The attention score matrix has shape

$$T \times T.$$

The causal mask sets future positions to negative infinity before the softmax. For example, for $T = 5$, the allowed attention pattern is:

$$\begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix}.$$

A value of 1 means attention is allowed. A value of 0 means attention is blocked.

In PyTorch:

```python
import torch

T = 5

mask = torch.tril(torch.ones(T, T))
print(mask)
```

For attention logits:

```python
scores = torch.randn(T, T)

masked_scores = scores.masked_fill(
    mask == 0,
    float("-inf")
)
```

After softmax, blocked positions receive probability zero.
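Putting the pieces together, a short standalone check (illustrative, using the same shapes as above) confirms both properties: each row is a valid distribution, and future positions get exactly zero weight.

```python
import torch
import torch.nn.functional as F

T = 5
mask = torch.tril(torch.ones(T, T))          # lower-triangular causal mask
scores = torch.randn(T, T)                   # random attention logits

masked_scores = scores.masked_fill(mask == 0, float("-inf"))
attn = F.softmax(masked_scores, dim=-1)

# Every row sums to 1 over the allowed positions, and all future
# (above-diagonal) positions receive exactly zero probability.
print(torch.allclose(attn.sum(dim=-1), torch.ones(T)))  # True
print(bool((attn.triu(diagonal=1) == 0).all()))         # True
```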

Training Objective

Autoregressive models are usually trained by maximum likelihood. Given a dataset of token sequences, the objective is

$$\max_\theta \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).$$

Equivalently, we minimize the negative log-likelihood:

$$\mathcal{L}(\theta) = - \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).$$

For a batch of sequences

$$X \in \mathbb{N}^{B \times T},$$

the model produces logits

$$Z \in \mathbb{R}^{B \times T \times |V|}.$$

The target is the same sequence shifted left by one position.

```python
B, T, vocab_size = 4, 128, 50_000  # example sizes

tokens = torch.randint(0, vocab_size, (B, T + 1))

x = tokens[:, :-1]  # model input
y = tokens[:, 1:]   # targets: the same sequence shifted left by one
```

The model receives x and predicts y.

```python
logits = model(x)  # [B, T, V]

loss = torch.nn.functional.cross_entropy(
    logits.reshape(B * T, vocab_size),
    y.reshape(B * T),
)
```

This is the standard pretraining objective for GPT-style language models.
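The snippet above leaves `model` abstract. A minimal end-to-end sketch, using a stand-in model (an embedding plus a linear head, purely illustrative and with no attention), exercises the full shape flow and the loss computation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, vocab_size, d = 2, 8, 100, 32  # illustrative sizes

# Stand-in "model": embedding plus linear head. Not a real language model;
# it exists only to show the shapes of the objective.
model = nn.Sequential(nn.Embedding(vocab_size, d), nn.Linear(d, vocab_size))

tokens = torch.randint(0, vocab_size, (B, T + 1))
x, y = tokens[:, :-1], tokens[:, 1:]

logits = model(x)  # [B, T, vocab_size]
loss = F.cross_entropy(logits.reshape(B * T, vocab_size), y.reshape(B * T))

# An untrained model is close to uniform, so the loss is near log(vocab_size).
print(float(loss))
```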

Teacher Forcing

During training, the model conditions on the true previous tokens. This is called teacher forcing.

For example, when predicting the fourth token, the model receives the correct first three tokens, even if it would have generated a different third token during inference.

Training context:

$$p_\theta(x_4 \mid x_1, x_2, x_3).$$

Inference context:

$$p_\theta(\hat{x}_4 \mid \hat{x}_1, \hat{x}_2, \hat{x}_3).$$

The hat notation indicates model-generated tokens.

Teacher forcing makes training efficient because every position in a sequence can be supervised at once. It also creates a mismatch between training and generation. At inference time, errors can compound because the model must condition on its own outputs.

Despite this mismatch, teacher forcing remains the standard method for large-scale language model pretraining.

Generation as Repeated Sampling

Autoregressive generation proceeds one token at a time.

Given an initial prompt

$$x_{1:k},$$

the model computes

$$p_\theta(x_{k+1} \mid x_{1:k}).$$

A next token is selected from this distribution. The selected token is appended to the context. The process repeats:

$$x_{k+2} \sim p_\theta(x_{k+2} \mid x_{1:k+1}), \qquad x_{k+3} \sim p_\theta(x_{k+3} \mid x_{1:k+2}).$$

This loop continues until a stop token is generated or a maximum length is reached.

Minimal generation loop:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, prompt, max_new_tokens):
    x = prompt

    for _ in range(max_new_tokens):
        logits = model(x)

        next_logits = logits[:, -1, :]
        probs = F.softmax(next_logits, dim=-1)

        next_token = torch.multinomial(probs, num_samples=1)

        x = torch.cat([x, next_token], dim=1)

    return x
```

The key line is:

```python
next_logits = logits[:, -1, :]
```

Only the final position is used to choose the next token.

Decoding Strategies

The model outputs a probability distribution. Decoding is the procedure used to choose the next token.

The simplest method is greedy decoding:

$$\hat{x}_{t+1} = \arg\max_i \, p_\theta(x_{t+1} = i \mid x_{1:t}).$$

Greedy decoding always selects the most likely token. It is deterministic and efficient, but often produces repetitive or dull text.

Sampling draws from the probability distribution:

$$x_{t+1} \sim p_\theta(\cdot \mid x_{1:t}).$$

Sampling produces more diverse outputs but may choose low-quality tokens.

Temperature modifies the logits before softmax:

$$p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}.$$

Here $\tau$ is the temperature.

| Temperature | Effect |
| --- | --- |
| $\tau < 1$ | Sharper distribution, more deterministic |
| $\tau = 1$ | Original distribution |
| $\tau > 1$ | Flatter distribution, more random |
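These effects are easy to verify numerically. A minimal sketch (the helper name `temperature_probs` is illustrative):

```python
import torch
import torch.nn.functional as F

def temperature_probs(logits: torch.Tensor, tau: float) -> torch.Tensor:
    # Divide logits by the temperature before the softmax.
    return F.softmax(logits / tau, dim=-1)

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])

cold = temperature_probs(logits, tau=0.5)  # sharper
base = temperature_probs(logits, tau=1.0)  # unchanged
hot = temperature_probs(logits, tau=2.0)   # flatter

# Lower temperature concentrates mass on the top token; higher spreads it out.
print(bool(cold.max() > base.max() > hot.max()))  # True
```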

Top-$k$ sampling keeps only the $k$ highest-probability tokens and samples among them.

Top-$p$, or nucleus sampling, keeps the smallest set of tokens whose cumulative probability exceeds $p$.

These methods control the tradeoff between coherence and diversity.
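Both filters can be written as small functions that block disallowed tokens with $-\infty$ before sampling. This is a minimal sketch for a 1-D logit vector; the helper names are illustrative:

```python
import torch
import torch.nn.functional as F

def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    # Keep the k highest logits; block everything else with -inf.
    kth = torch.topk(logits, k).values[..., -1, None]
    return logits.masked_fill(logits < kth, float("-inf"))

def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    # Keep the smallest set of top tokens whose cumulative probability exceeds p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = F.softmax(sorted_logits, dim=-1)
    cum = torch.cumsum(probs, dim=-1)
    # Drop a token if the probability mass before it already exceeds p.
    drop = cum - probs > p
    sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
    out = torch.full_like(logits, float("-inf"))
    return out.scatter(-1, sorted_idx, sorted_logits)

logits = torch.tensor([3.0, 2.0, 1.0, 0.0])
print(top_k_filter(logits, k=2))  # only the two largest logits survive
```

Sampling then proceeds as before: apply `softmax` to the filtered logits and draw with `torch.multinomial`.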

Beam Search

Beam search keeps several partial generations at once. At each step, it expands each candidate sequence and keeps the best-scoring beams.

For a sequence $x_{1:T}$, the score is usually the log probability:

$$\log p_\theta(x_{1:T}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).$$

Beam search is common in machine translation and speech recognition. It is less common for open-ended large language model generation because it can produce generic or repetitive text.

A length penalty is often added because raw sequence probability tends to prefer shorter outputs.
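The expand-and-prune loop can be sketched in a few lines. This toy version assumes a `model` with the same interface as in the generation loop above (logits of shape `[B, T, V]`) and omits stop tokens and the length penalty:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(model, prompt, beam_width, max_new_tokens):
    # Each beam is a (tokens, cumulative log-probability) pair.
    beams = [(prompt, 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            logp = F.log_softmax(model(tokens)[:, -1, :], dim=-1)  # [1, V]
            top_logp, top_idx = torch.topk(logp, beam_width, dim=-1)
            for lp, idx in zip(top_logp[0], top_idx[0]):
                extended = torch.cat([tokens, idx.view(1, 1)], dim=1)
                candidates.append((extended, score + lp.item()))
        # Keep only the beam_width best-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]  # highest-scoring sequence
```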

Exposure Bias and Error Accumulation

Training uses true prefixes. Inference uses generated prefixes. This difference is called exposure bias.

Suppose a model generates an incorrect token at time $t$. The next prediction uses a context that may never appear in the training data. This can move the model further away from the distribution it learned.

The effect is especially visible in long generation. Small errors can accumulate into incoherence, repetition, contradiction, or topic drift.

Several approaches address this problem:

| Approach | Description |
| --- | --- |
| Scheduled sampling | Mix true and generated tokens during training |
| Sequence-level objectives | Optimize full generated sequences |
| Reinforcement learning | Optimize reward functions over sampled outputs |
| Better decoding | Reduce low-quality continuation paths |

Modern instruction-tuned models rely heavily on better pretraining, supervised fine-tuning, preference optimization, and decoding constraints rather than replacing teacher forcing.

Context Length

An autoregressive model can condition only on tokens inside its context window.

If the context length is $L$, then the model computes

$$p_\theta(x_t \mid x_{t-L:t-1})$$

for positions far into a sequence.

A longer context window allows the model to use more previous information. It also increases memory and computation, especially for standard self-attention, whose cost grows quadratically with sequence length:

$$O(T^2).$$

Long-context models use methods such as sparse attention, sliding windows, memory tokens, recurrence, compressed context, and retrieval augmentation to extend usable context.

Autoregressive Models and Parallelism

Training and generation have different parallelism properties.

During training, all token predictions in a sequence can be computed in parallel because the true sequence is known. A causal mask prevents information leakage from future tokens.

During generation, tokens must be produced sequentially. The model cannot generate token $t+1$ before token $t$ exists.

This creates an inference bottleneck.

To improve generation speed, systems use techniques such as:

| Technique | Purpose |
| --- | --- |
| KV caching | Reuse previous attention keys and values |
| Speculative decoding | Draft multiple tokens with a smaller model |
| Quantization | Reduce memory bandwidth and compute cost |
| Batching | Serve multiple requests together |
| Tensor parallelism | Split computation across devices |

Autoregressive models are therefore easy to train in parallel but relatively expensive to decode one token at a time.
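The idea behind KV caching can be shown with a single attention head. This sketch uses random tensors in place of real projections; in a decoder, `q`, `k`, and `v` come from the newest token's hidden state:

```python
import torch
import torch.nn.functional as F

d = 16  # head dimension (illustrative)

# The cache grows by one row of keys and values per generated token.
K_cache = torch.empty(0, d)
V_cache = torch.empty(0, d)

for step in range(5):
    # Random stand-ins for the newest token's query, key, and value.
    q, k, v = torch.randn(1, d), torch.randn(1, d), torch.randn(1, d)

    K_cache = torch.cat([K_cache, k], dim=0)
    V_cache = torch.cat([V_cache, v], dim=0)

    # Attention for the new token over all cached positions: O(t) work per
    # step instead of recomputing full O(t^2) attention from scratch.
    weights = F.softmax(q @ K_cache.T / d ** 0.5, dim=-1)
    out = weights @ V_cache  # [1, d]
```

Only the new token's query is computed each step; all earlier keys and values are reused from the cache.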

Autoregressive Modeling Beyond Text

Autoregressive modeling applies to any data that can be represented as a sequence.

Examples include:

| Domain | Sequence elements |
| --- | --- |
| Text | Tokens, subwords, bytes |
| Audio | Samples, frames, codes |
| Images | Pixels, patches, discrete visual tokens |
| Video | Frames, patches, latent codes |
| Code | Tokens |
| Actions | Control commands |
| Molecules | Atoms or string tokens |

For images, an autoregressive model might generate pixels or patches in raster order. For audio, it might generate waveform samples or compressed audio tokens. For multimodal systems, the model may generate text conditioned on image, audio, or video embeddings.

The essential structure remains the same:

$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1}).$$

Strengths and Limitations

Autoregressive modeling has several strengths.

It gives a valid probability distribution over sequences. It supports open-ended generation. It is compatible with maximum likelihood training. It scales well with transformers. It naturally handles variable-length outputs.

It also has limitations.

Generation is sequential. Long outputs are expensive. Errors can accumulate. The model can overfit to local next-token prediction while failing at long-horizon planning. It can produce fluent text without grounded truth. It can be sensitive to decoding settings.

Modern language models address these limitations with larger context windows, retrieval, tool use, preference optimization, inference-time search, and external verification.

Autoregressive modeling remains the central foundation for GPT-style language models. Its power comes from a simple training signal repeated at enormous scale: predict the next token.