
Autoregressive Modeling

Autoregressive modeling is the dominant formulation for modern language generation. The model predicts the next token from previous tokens. Repeating this prediction step produces a sequence.

Given a token sequence

$$x_{1:T} = (x_1, x_2, \ldots, x_T),$$

an autoregressive language model factorizes its probability as

$$p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}).$$

This is the same chain-rule factorization introduced in statistical language modeling. The difference is the parameterization. Classical models approximate the conditional distribution with count tables. Modern autoregressive models use neural networks, usually transformers.

Next-Token Prediction

The basic training problem is next-token prediction. Given a prefix, the model predicts the following token.

For the sequence

$$\text{deep learning models generalize}$$

the training examples are:

| Input prefix | Target token |
| --- | --- |
| deep | learning |
| deep learning | models |
| deep learning models | generalize |

The model learns a conditional distribution over the vocabulary at every position:

$$p_\theta(x_{t+1} \mid x_{1:t}).$$

In practice, all positions are trained in parallel. A transformer receives a full sequence and predicts the next token for each position, while a causal mask prevents the model from looking at future tokens.

Causal Masking

Autoregressive models must respect temporal order. When predicting token $x_t$, the model may use only earlier tokens $x_{1:t-1}$. It must not use $x_t$ itself or any later token.

In a transformer, this constraint is enforced by a causal attention mask.

Without a mask, token $t$ could attend to all positions:

$$1, 2, \ldots, T.$$

With a causal mask, token $t$ can attend only to

$$1, 2, \ldots, t.$$

The attention score matrix has shape

$$T \times T.$$

The causal mask sets future positions to negative infinity before the softmax. For example, for $T = 5$, the allowed attention pattern is:

$$\begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix}.$$

A value of 1 means attention is allowed. A value of 0 means attention is blocked.

In PyTorch:

```python
import torch

T = 5

mask = torch.tril(torch.ones(T, T))
print(mask)
```

For attention logits:

```python
scores = torch.randn(T, T)

masked_scores = scores.masked_fill(
    mask == 0,
    float("-inf")
)
```

After softmax, blocked positions receive probability zero.
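Putting the pieces together, a short standalone check (illustrative, using the same shapes as above) confirms both properties: each row is a valid distribution, and future positions get exactly zero weight.

```python
import torch
import torch.nn.functional as F

T = 5
mask = torch.tril(torch.ones(T, T))          # lower-triangular causal mask
scores = torch.randn(T, T)                   # random attention logits

masked_scores = scores.masked_fill(mask == 0, float("-inf"))
attn = F.softmax(masked_scores, dim=-1)

# Every row sums to 1 over the allowed positions, and all future
# (above-diagonal) positions receive exactly zero probability.
print(torch.allclose(attn.sum(dim=-1), torch.ones(T)))  # True
print(bool((attn.triu(diagonal=1) == 0).all()))         # True
```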

Training Objective

Autoregressive models are usually trained by maximum likelihood. Given a dataset of token sequences, the objective is

$$\max_\theta \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).$$

Equivalently, we minimize the negative log-likelihood:

$$\mathcal{L}(\theta) = - \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).$$

For a batch of sequences

$$X \in \mathbb{N}^{B \times T},$$

the model produces logits

$$Z \in \mathbb{R}^{B \times T \times |V|}.$$

The target is the same sequence shifted left by one position.

```python
B, T, vocab_size = 4, 128, 50_000  # example sizes

tokens = torch.randint(0, vocab_size, (B, T + 1))

x = tokens[:, :-1]  # model input
y = tokens[:, 1:]   # targets: the same sequence shifted left by one
```

The model receives x and predicts y.

```python
logits = model(x)  # [B, T, V]

loss = torch.nn.functional.cross_entropy(
    logits.reshape(B * T, vocab_size),
    y.reshape(B * T),
)
```

This is the standard pretraining objective for GPT-style language models.
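The snippet above leaves `model` abstract. A minimal end-to-end sketch, using a stand-in model (an embedding plus a linear head, purely illustrative and with no attention), exercises the full shape flow and the loss computation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, vocab_size, d = 2, 8, 100, 32  # illustrative sizes

# Stand-in "model": embedding plus linear head. Not a real language model;
# it exists only to show the shapes of the objective.
model = nn.Sequential(nn.Embedding(vocab_size, d), nn.Linear(d, vocab_size))

tokens = torch.randint(0, vocab_size, (B, T + 1))
x, y = tokens[:, :-1], tokens[:, 1:]

logits = model(x)  # [B, T, vocab_size]
loss = F.cross_entropy(logits.reshape(B * T, vocab_size), y.reshape(B * T))

# An untrained model is close to uniform, so the loss is near log(vocab_size).
print(float(loss))
```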

Teacher Forcing

During training, the model conditions on the true previous tokens. This is called teacher forcing.

For example, when predicting the fourth token, the model receives the correct first three tokens, even if it would have generated a different third token during inference.

Training context:

$$p_\theta(x_4 \mid x_1, x_2, x_3).$$

Inference context:

$$p_\theta(\hat{x}_4 \mid \hat{x}_1, \hat{x}_2, \hat{x}_3).$$

The hat notation indicates model-generated tokens.

Teacher forcing makes training efficient because every position in a sequence can be supervised at once. It also creates a mismatch between training and generation. At inference time, errors can compound because the model must condition on its own outputs.

Despite this mismatch, teacher forcing remains the standard method for large-scale language model pretraining.

Generation as Repeated Sampling

Autoregressive generation proceeds one token at a time.

Given an initial prompt

$$x_{1:k},$$

the model computes

$$p_\theta(x_{k+1} \mid x_{1:k}).$$

A next token is selected from this distribution. The selected token is appended to the context. The process repeats:

$$x_{k+2} \sim p_\theta(x_{k+2} \mid x_{1:k+1}), \qquad x_{k+3} \sim p_\theta(x_{k+3} \mid x_{1:k+2}).$$

This loop continues until a stop token is generated or a maximum length is reached.

Minimal generation loop:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, prompt, max_new_tokens):
    x = prompt

    for _ in range(max_new_tokens):
        logits = model(x)

        next_logits = logits[:, -1, :]
        probs = F.softmax(next_logits, dim=-1)

        next_token = torch.multinomial(probs, num_samples=1)

        x = torch.cat([x, next_token], dim=1)

    return x
```

The key line is:

```python
next_logits = logits[:, -1, :]
```

Only the final position is used to choose the next token.

Decoding Strategies

The model outputs a probability distribution. Decoding is the procedure used to choose the next token.

The simplest method is greedy decoding:

$$\hat{x}_{t+1} = \arg\max_i \, p_\theta(x_{t+1} = i \mid x_{1:t}).$$

Greedy decoding always selects the most likely token. It is deterministic and efficient, but often produces repetitive or dull text.

Sampling draws from the probability distribution:

$$x_{t+1} \sim p_\theta(\cdot \mid x_{1:t}).$$

Sampling produces more diverse outputs but may choose low-quality tokens.

Temperature modifies the logits before softmax:

$$p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}.$$

Here $\tau$ is the temperature.

| Temperature | Effect |
| --- | --- |
| $\tau < 1$ | Sharper distribution, more deterministic |
| $\tau = 1$ | Original distribution |
| $\tau > 1$ | Flatter distribution, more random |
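These effects are easy to verify numerically. A minimal sketch (the helper name `temperature_probs` is illustrative):

```python
import torch
import torch.nn.functional as F

def temperature_probs(logits: torch.Tensor, tau: float) -> torch.Tensor:
    # Divide logits by the temperature before the softmax.
    return F.softmax(logits / tau, dim=-1)

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])

cold = temperature_probs(logits, tau=0.5)  # sharper
base = temperature_probs(logits, tau=1.0)  # unchanged
hot = temperature_probs(logits, tau=2.0)   # flatter

# Lower temperature concentrates mass on the top token; higher spreads it out.
print(bool(cold.max() > base.max() > hot.max()))  # True
```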

Top-$k$ sampling keeps only the $k$ highest-probability tokens and samples among them.

Top-$p$, or nucleus sampling, keeps the smallest set of tokens whose cumulative probability exceeds $p$.

These methods control the tradeoff between coherence and diversity.
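Both filters can be written as small functions that block disallowed tokens with $-\infty$ before sampling. This is a minimal sketch for a 1-D logit vector; the helper names are illustrative:

```python
import torch
import torch.nn.functional as F

def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    # Keep the k highest logits; block everything else with -inf.
    kth = torch.topk(logits, k).values[..., -1, None]
    return logits.masked_fill(logits < kth, float("-inf"))

def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    # Keep the smallest set of top tokens whose cumulative probability exceeds p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = F.softmax(sorted_logits, dim=-1)
    cum = torch.cumsum(probs, dim=-1)
    # Drop a token if the probability mass before it already exceeds p.
    drop = cum - probs > p
    sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
    out = torch.full_like(logits, float("-inf"))
    return out.scatter(-1, sorted_idx, sorted_logits)

logits = torch.tensor([3.0, 2.0, 1.0, 0.0])
print(top_k_filter(logits, k=2))  # only the two largest logits survive
```

Sampling then proceeds as before: apply `softmax` to the filtered logits and draw with `torch.multinomial`.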

Beam Search

Beam search keeps several partial generations at once. At each step, it expands each candidate sequence and keeps the best-scoring beams.

For a sequence $x_{1:T}$, the score is usually the log probability:

$$\log p_\theta(x_{1:T}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).$$

Beam search is common in machine translation and speech recognition. It is less common for open-ended large language model generation because it can produce generic or repetitive text.

A length penalty is often added because raw sequence probability tends to prefer shorter outputs.
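The expand-and-prune loop can be sketched in a few lines. This toy version assumes a `model` with the same interface as in the generation loop above (logits of shape `[B, T, V]`) and omits stop tokens and the length penalty:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(model, prompt, beam_width, max_new_tokens):
    # Each beam is a (tokens, cumulative log-probability) pair.
    beams = [(prompt, 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            logp = F.log_softmax(model(tokens)[:, -1, :], dim=-1)  # [1, V]
            top_logp, top_idx = torch.topk(logp, beam_width, dim=-1)
            for lp, idx in zip(top_logp[0], top_idx[0]):
                extended = torch.cat([tokens, idx.view(1, 1)], dim=1)
                candidates.append((extended, score + lp.item()))
        # Keep only the beam_width best-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]  # highest-scoring sequence
```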

Exposure Bias and Error Accumulation

Training uses true prefixes. Inference uses generated prefixes. This difference is called exposure bias.

Suppose a model generates an incorrect token at time $t$. The next prediction uses a context that may never appear in the training data. This can move the model further away from the distribution it learned.

The effect is especially visible in long generation. Small errors can accumulate into incoherence, repetition, contradiction, or topic drift.

Several approaches address this problem:

| Approach | Description |
| --- | --- |
| Scheduled sampling | Mix true and generated tokens during training |
| Sequence-level objectives | Optimize full generated sequences |
| Reinforcement learning | Optimize reward functions over sampled outputs |
| Better decoding | Reduce low-quality continuation paths |

Modern instruction-tuned models rely heavily on better pretraining, supervised fine-tuning, preference optimization, and decoding constraints rather than replacing teacher forcing.

Context Length

An autoregressive model can condition only on tokens inside its context window.

If the context length is $L$, then the model computes

$$p_\theta(x_t \mid x_{t-L:t-1})$$

for positions far into a sequence.

A longer context window allows the model to use more previous information. It also increases memory and computation, especially for standard self-attention, whose cost grows quadratically with sequence length:

$$O(T^2).$$

Long-context models use methods such as sparse attention, sliding windows, memory tokens, recurrence, compressed context, and retrieval augmentation to extend usable context.

Autoregressive Models and Parallelism

Training and generation have different parallelism properties.

During training, all token predictions in a sequence can be computed in parallel because the true sequence is known. A causal mask prevents information leakage from future tokens.

During generation, tokens must be produced sequentially. The model cannot generate token $t+1$ before token $t$ exists.

This creates an inference bottleneck.

To improve generation speed, systems use techniques such as:

| Technique | Purpose |
| --- | --- |
| KV caching | Reuse previous attention keys and values |
| Speculative decoding | Draft multiple tokens with a smaller model |
| Quantization | Reduce memory bandwidth and compute cost |
| Batching | Serve multiple requests together |
| Tensor parallelism | Split computation across devices |

Autoregressive models are therefore easy to train in parallel but relatively expensive to decode one token at a time.
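The idea behind KV caching can be shown with a single attention head. This sketch uses random tensors in place of real projections; in a decoder, `q`, `k`, and `v` come from the newest token's hidden state:

```python
import torch
import torch.nn.functional as F

d = 16  # head dimension (illustrative)

# The cache grows by one row of keys and values per generated token.
K_cache = torch.empty(0, d)
V_cache = torch.empty(0, d)

for step in range(5):
    # Random stand-ins for the newest token's query, key, and value.
    q, k, v = torch.randn(1, d), torch.randn(1, d), torch.randn(1, d)

    K_cache = torch.cat([K_cache, k], dim=0)
    V_cache = torch.cat([V_cache, v], dim=0)

    # Attention for the new token over all cached positions: O(t) work per
    # step instead of recomputing full O(t^2) attention from scratch.
    weights = F.softmax(q @ K_cache.T / d ** 0.5, dim=-1)
    out = weights @ V_cache  # [1, d]
```

Only the new token's query is computed each step; all earlier keys and values are reused from the cache.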

Autoregressive Modeling Beyond Text

Autoregressive modeling applies to any data that can be represented as a sequence.

Examples include:

| Domain | Sequence elements |
| --- | --- |
| Text | Tokens, subwords, bytes |
| Audio | Samples, frames, codes |
| Images | Pixels, patches, discrete visual tokens |
| Video | Frames, patches, latent codes |
| Code | Tokens |
| Actions | Control commands |
| Molecules | Atoms or string tokens |

For images, an autoregressive model might generate pixels or patches in raster order. For audio, it might generate waveform samples or compressed audio tokens. For multimodal systems, the model may generate text conditioned on image, audio, or video embeddings.

The essential structure remains the same:

$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1}).$$

Strengths and Limitations

Autoregressive modeling has several strengths.

It gives a valid probability distribution over sequences. It supports open-ended generation. It is compatible with maximum likelihood training. It scales well with transformers. It naturally handles variable-length outputs.

It also has limitations.

Generation is sequential. Long outputs are expensive. Errors can accumulate. The model can overfit to local next-token prediction while failing at long-horizon planning. It can produce fluent text without grounded truth. It can be sensitive to decoding settings.

Modern language models address these limitations with larger context windows, retrieval, tool use, preference optimization, inference-time search, and external verification.

Autoregressive modeling remains the central foundation for GPT-style language models. Its power comes from a simple training signal repeated at enormous scale: predict the next token.