Autoregressive modeling is the dominant formulation for modern language generation. The model predicts the next token from previous tokens. Repeating this prediction step produces a sequence.
Given a token sequence $x = (x_1, x_2, \dots, x_T)$, an autoregressive language model factorizes its probability as

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$
This is the same chain-rule factorization introduced in statistical language modeling. The difference is the parameterization. Classical models approximate the conditional distribution with count tables. Modern autoregressive models use neural networks, usually transformers.
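As a toy illustration, the chain rule turns a sequence probability into a product of next-token probabilities. The conditional probabilities below are made up purely for illustration:

```python
# Hypothetical conditionals p(x_t | x_1..x_{t-1}) for ("the", "cat", "sat").
conditionals = {
    ("the",): 0.20,               # p("the")
    ("the", "cat"): 0.05,         # p("cat" | "the")
    ("the", "cat", "sat"): 0.30,  # p("sat" | "the", "cat")
}

def sequence_probability(tokens, conditionals):
    """Multiply p(x_t | prefix) over all positions t."""
    prob = 1.0
    for t in range(1, len(tokens) + 1):
        prob *= conditionals[tuple(tokens[:t])]
    return prob

print(sequence_probability(["the", "cat", "sat"], conditionals))
# 0.2 * 0.05 * 0.3 = 0.003
```

A neural model replaces the lookup table with a learned function, but the product structure is identical.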
Next-Token Prediction
The basic training problem is next-token prediction. Given a prefix, the model predicts the following token.
For the sequence "deep learning models generalize", the training examples are:
| Input prefix | Target token |
|---|---|
| deep | learning |
| deep learning | models |
| deep learning models | generalize |
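The prefix–target pairs in the table can be enumerated mechanically; a minimal sketch:

```python
def next_token_pairs(tokens):
    """Yield (prefix, target) training examples from one sequence."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

pairs = next_token_pairs(["deep", "learning", "models", "generalize"])
for prefix, target in pairs:
    print(" ".join(prefix), "->", target)
```

Each position in the sequence yields one supervised example, which is why a single document provides many training signals.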
The model learns a conditional distribution over the vocabulary at every position:

$$p_\theta(x_t \mid x_1, \dots, x_{t-1})$$
In practice, all positions are trained in parallel. A transformer receives a full sequence and predicts the next token for each position, while a causal mask prevents the model from looking at future tokens.
Causal Masking
Autoregressive models must respect temporal order. When predicting token $x_t$, the model may use only earlier tokens $x_1, \dots, x_{t-1}$. It must not use $x_t$ itself or any later token.
In a transformer, this constraint is enforced by a causal attention mask.
Without a mask, token $x_t$ could attend to all positions $1, \dots, T$.

With a causal mask, token $x_t$ can attend only to positions $1, \dots, t$.

The attention score matrix has shape $T \times T$. The causal mask sets future positions to negative infinity before the softmax. For example, for $T = 5$, the allowed attention pattern is:

$$M = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 \end{pmatrix}$$

A value of 1 means attention is allowed. A value of 0 means attention is blocked.
In PyTorch:
import torch
T = 5
mask = torch.tril(torch.ones(T, T))
print(mask)For attention logits:
scores = torch.randn(T, T)
masked_scores = scores.masked_fill(
mask == 0,
float("-inf")
)After softmax, blocked positions receive probability zero.
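This claim can be checked directly; a self-contained sketch:

```python
import torch

T = 5
mask = torch.tril(torch.ones(T, T))
scores = torch.randn(T, T)
masked_scores = scores.masked_fill(mask == 0, float("-inf"))

# Each row is a valid distribution over the allowed positions only.
probs = torch.softmax(masked_scores, dim=-1)
print(probs)
```

Every masked (upper-triangle) entry comes out exactly zero, and each row still sums to one.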
Training Objective
Autoregressive models are usually trained by maximum likelihood. Given a dataset $\mathcal{D}$ of token sequences, the objective is

$$\max_\theta \sum_{x \in \mathcal{D}} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \dots, x_{t-1})$$

Equivalently, we minimize the negative log-likelihood:

$$\mathcal{L}(\theta) = -\sum_{x \in \mathcal{D}} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \dots, x_{t-1})$$
For a batch of sequences of shape $[B, T]$, the model produces logits of shape $[B, T, V]$, where $B$ is the batch size, $T$ the sequence length, and $V$ the vocabulary size. The target is the same sequence shifted left by one position.
```python
B, T, vocab_size = 8, 128, 50_000  # example sizes
tokens = torch.randint(0, vocab_size, (B, T + 1))
x = tokens[:, :-1]  # inputs:  positions 0 .. T-1
y = tokens[:, 1:]   # targets: positions 1 .. T (shifted left by one)
```

The model receives x and predicts y.

```python
logits = model(x)  # [B, T, V]
loss = torch.nn.functional.cross_entropy(
    logits.reshape(B * T, vocab_size),
    y.reshape(B * T),
)
```

This is the standard pretraining objective for GPT-style language models.
Teacher Forcing
During training, the model conditions on the true previous tokens. This is called teacher forcing.
For example, when predicting the fourth token, the model receives the correct first three tokens, even if it would have generated a different third token during inference.
Training context:

$$p_\theta(x_4 \mid x_1, x_2, x_3)$$

Inference context:

$$p_\theta(x_4 \mid x_1, \hat{x}_2, \hat{x}_3)$$

The hat notation indicates model-generated tokens.
Teacher forcing makes training efficient because every position in a sequence can be supervised at once. It also creates a mismatch between training and generation. At inference time, errors can compound because the model must condition on its own outputs.
Despite this mismatch, teacher forcing remains the standard method for large-scale language model pretraining.
Generation as Repeated Sampling
Autoregressive generation proceeds one token at a time.
Given an initial prompt $(x_1, \dots, x_k)$, the model computes

$$p_\theta(x_{k+1} \mid x_1, \dots, x_k)$$

A next token is selected from this distribution. The selected token is appended to the context. The process repeats:

$$p_\theta(x_{k+2} \mid x_1, \dots, x_k, x_{k+1})$$
This loop continues until a stop token is generated or a maximum length is reached.
Minimal generation loop:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, prompt, max_new_tokens):
    x = prompt
    for _ in range(max_new_tokens):
        logits = model(x)               # [B, T, V]
        next_logits = logits[:, -1, :]  # final position only
        probs = F.softmax(next_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        x = torch.cat([x, next_token], dim=1)
    return x
```

The key line is:

```python
next_logits = logits[:, -1, :]
```

Only the final position is used to choose the next token.
Decoding Strategies
The model outputs a probability distribution. Decoding is the procedure used to choose the next token.
The simplest method is greedy decoding:

$$x_t = \arg\max_{v} \, p_\theta(v \mid x_1, \dots, x_{t-1})$$
Greedy decoding always selects the most likely token. It is deterministic and efficient, but often produces repetitive or dull text.
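In code, greedy decoding replaces the multinomial draw with an argmax over the final-position logits; a sketch on toy logits:

```python
import torch

next_logits = torch.tensor([[1.0, 3.0, 0.5, 2.0]])  # toy logits, shape [1, V]

# Greedy decoding: always take the single highest-scoring token.
next_token = next_logits.argmax(dim=-1, keepdim=True)
print(next_token)  # tensor([[1]])
```

Because there is no randomness, regenerating from the same prompt always yields the same continuation.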
Sampling draws from the probability distribution:

$$x_t \sim p_\theta(\cdot \mid x_1, \dots, x_{t-1})$$
Sampling produces more diverse outputs but may choose low-quality tokens.
Temperature modifies the logits before softmax:

$$p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}$$

Here $\tau$ is the temperature and $z_i$ are the logits.
| Temperature | Effect |
|---|---|
| $\tau < 1$ | Sharper distribution, more deterministic |
| $\tau = 1$ | Original distribution |
| $\tau > 1$ | Flatter distribution, more random |
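The effect is easy to see on toy logits; the temperature values below are chosen only for illustration:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])

# Lower tau concentrates mass on the top token; higher tau flattens it.
for tau in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / tau, dim=-1)
    print(tau, probs)
```

At very low temperatures sampling approaches greedy decoding; at very high temperatures it approaches uniform sampling.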
Top-$k$ sampling keeps only the $k$ highest-probability tokens and samples among them.

Top-$p$, or nucleus sampling, keeps the smallest set of tokens whose cumulative probability exceeds $p$.
These methods control the tradeoff between coherence and diversity.
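Both truncation rules can be sketched as filters over the logits. The function names here are illustrative, not from any library, and the filtered logits would then be passed through softmax and sampled:

```python
import torch

def top_k_filter(logits, k):
    """Keep the k highest logits; block the rest with -inf."""
    kth = torch.topk(logits, k).values[..., -1, None]
    return logits.masked_fill(logits < kth, float("-inf"))

def top_p_filter(logits, p):
    """Nucleus filter: keep the smallest set of tokens whose
    cumulative probability exceeds p; block the rest with -inf."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    # Mass accumulated strictly before each token; keep while it is < p.
    before = probs.cumsum(dim=-1) - probs
    filtered = sorted_logits.masked_fill(before >= p, float("-inf"))
    out = torch.empty_like(logits)
    out.scatter_(-1, sorted_idx, filtered)  # undo the sort
    return out

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
print(top_k_filter(logits, 2))
print(top_p_filter(logits, 0.8))
```

A key property of top-$p$ is that the number of surviving tokens adapts to the shape of the distribution, whereas top-$k$ always keeps exactly $k$.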
Beam Search
Beam search keeps several partial generations at once. At each step, it expands each candidate sequence and keeps the best-scoring beams.
For a sequence $(x_1, \dots, x_T)$, the score is usually the log probability:

$$\text{score}(x) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \dots, x_{t-1})$$
Beam search is common in machine translation and speech recognition. It is less common for open-ended large language model generation because it can produce generic or repetitive text.
A length penalty is often added because raw sequence probability tends to prefer shorter outputs.
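One common form is the GNMT-style length penalty, which divides the raw log probability by a power of the length; a sketch, with the $\alpha$ value chosen only for illustration:

```python
def length_normalized_score(log_prob, length, alpha=0.6):
    """GNMT-style length penalty: divide the sequence log-probability
    by ((5 + length) / 6) ** alpha. alpha = 0 recovers the raw score;
    larger alpha favors longer outputs more strongly."""
    return log_prob / (((5 + length) / 6) ** alpha)

# Without normalization the longer beam almost always loses, because it
# accumulates more negative log terms; normalization makes comparable
# per-token quality competitive across lengths.
print(length_normalized_score(-4.0, 4))
print(length_normalized_score(-8.0, 10))
```

Beams are then ranked by the normalized score rather than the raw log probability.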
Exposure Bias and Error Accumulation
Training uses true prefixes. Inference uses generated prefixes. This difference is called exposure bias.
Suppose a model generates an incorrect token at time $t$. The next prediction uses a context that may never appear in the training data. This can move the model further away from the distribution it learned.
The effect is especially visible in long generation. Small errors can accumulate into incoherence, repetition, contradiction, or topic drift.
Several approaches address this problem:
| Approach | Description |
|---|---|
| Scheduled sampling | Mix true and generated tokens during training |
| Sequence-level objectives | Optimize full generated sequences |
| Reinforcement learning | Optimize reward functions over sampled outputs |
| Better decoding | Reduce low-quality continuation paths |
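As one concrete illustration, scheduled sampling can be sketched as a single-pass approximation: each gold input token is replaced by the model's own greedy prediction with some probability. The function and toy model below are hypothetical, and true scheduled sampling feeds replacements forward step by step rather than in one pass:

```python
import torch

def scheduled_inputs(tokens, model, sample_prob):
    """One-pass approximation of scheduled sampling: each input token
    (except the first) is replaced by the model's own prediction for
    that slot with probability sample_prob."""
    x = tokens[:, :-1].clone()
    with torch.no_grad():
        preds = model(x).argmax(dim=-1)  # [B, T], prediction for the next slot
    coin = torch.rand(x.shape) < sample_prob
    # The prediction for input position t (t >= 1) sits at logits position t - 1.
    x[:, 1:][coin[:, 1:]] = preds[:, :-1][coin[:, 1:]]
    return x

class ToyModel(torch.nn.Module):  # stand-in; any [B, T] -> [B, T, V] model works
    def __init__(self, vocab_size=10, dim=8):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, vocab_size)

    def forward(self, x):
        return self.head(self.emb(x))

tokens = torch.randint(0, 10, (2, 6))
mixed = scheduled_inputs(tokens, ToyModel(), sample_prob=0.5)
print(mixed.shape)  # same shape as the gold inputs
```

Raising sample_prob over the course of training gradually exposes the model to its own predictions.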
Modern instruction-tuned models rely heavily on better pretraining, supervised fine-tuning, preference optimization, and decoding constraints rather than replacing teacher forcing.
Context Length
An autoregressive model can condition only on tokens inside its context window.
If the context length is $L$, then the model computes

$$p_\theta(x_t \mid x_{t-L}, \dots, x_{t-1})$$

for positions far into a sequence.
A longer context window allows the model to use more previous information. It also increases memory and computation, especially for standard self-attention, whose cost grows quadratically with sequence length: $O(T^2)$.
Long-context models use methods such as sparse attention, sliding windows, memory tokens, recurrence, compressed context, and retrieval augmentation to extend usable context.
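A sliding-window mask, for instance, combines the causal constraint with a band of width $W$; a minimal sketch:

```python
import torch

def sliding_window_mask(T, W):
    """Causal mask where each position may attend only to the most
    recent W positions (itself included)."""
    i = torch.arange(T).unsqueeze(1)  # query positions
    j = torch.arange(T).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - W)

mask = sliding_window_mask(6, 3)
print(mask.int())
```

Each attention row now costs $O(W)$ instead of $O(T)$, at the price of discarding direct access to tokens older than $W$ positions.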
Autoregressive Models and Parallelism
Training and generation have different parallelism properties.
During training, all token predictions in a sequence can be computed in parallel because the true sequence is known. A causal mask prevents information leakage from future tokens.
During generation, tokens must be produced sequentially. The model cannot generate token $x_{t+1}$ before token $x_t$ exists.
This creates an inference bottleneck.
To improve generation speed, systems use techniques such as:
| Technique | Purpose |
|---|---|
| KV caching | Reuse previous attention keys and values |
| Speculative decoding | Draft multiple tokens with a smaller model |
| Quantization | Reduce memory bandwidth and compute cost |
| Batching | Serve multiple requests together |
| Tensor parallelism | Split computation across devices |
Autoregressive models are therefore easy to train in parallel but relatively expensive to decode one token at a time.
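The idea behind KV caching can be sketched with a single attention head: keys and values for past tokens are stored once and reused, so each decoding step computes one attention row instead of recomputing the full $T \times T$ matrix. A minimal sketch, not a full transformer:

```python
import torch
import torch.nn.functional as F

def attend_with_cache(q_new, k_new, v_new, cache):
    """One decoding step for a single attention head.

    q_new, k_new, v_new: [1, d] projections for the newest token.
    cache: dict holding keys/values of all previous tokens.
    """
    cache["k"] = torch.cat([cache["k"], k_new], dim=0)  # [t, d]
    cache["v"] = torch.cat([cache["v"], v_new], dim=0)  # [t, d]
    d = q_new.shape[-1]
    scores = q_new @ cache["k"].T / d**0.5              # [1, t]
    weights = F.softmax(scores, dim=-1)
    return weights @ cache["v"]                         # [1, d]

d = 4
cache = {"k": torch.empty(0, d), "v": torch.empty(0, d)}
for _ in range(3):  # three decoding steps, no recomputation of past rows
    q, k, v = (torch.randn(1, d) for _ in range(3))
    out = attend_with_cache(q, k, v, cache)
print(cache["k"].shape)
```

Causality comes for free here: the cache only ever contains the current and past tokens, so no explicit mask is needed at decode time.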
Autoregressive Modeling Beyond Text
Autoregressive modeling applies to any data that can be represented as a sequence.
Examples include:
| Domain | Sequence elements |
|---|---|
| Text | Tokens, subwords, bytes |
| Audio | Samples, frames, codes |
| Images | Pixels, patches, discrete visual tokens |
| Video | Frames, patches, latent codes |
| Code | Tokens |
| Actions | Control commands |
| Molecules | Atoms or string tokens |
For images, an autoregressive model might generate pixels or patches in raster order. For audio, it might generate waveform samples or compressed audio tokens. For multimodal systems, the model may generate text conditioned on image, audio, or video embeddings.
The essential structure remains the same:

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$
Strengths and Limitations
Autoregressive modeling has several strengths.
It gives a valid probability distribution over sequences. It supports open-ended generation. It is compatible with maximum likelihood training. It scales well with transformers. It naturally handles variable-length outputs.
It also has limitations.
Generation is sequential. Long outputs are expensive. Errors can accumulate. The model can overfit to local next-token prediction while failing at long-horizon planning. It can produce fluent text without grounded truth. It can be sensitive to decoding settings.
Modern language models address these limitations with larger context windows, retrieval, tool use, preference optimization, inference-time search, and external verification.
Autoregressive modeling remains the central foundation for GPT-style language models. Its power comes from a simple training signal repeated at enormous scale: predict the next token.