Sequence models process ordered data. The input is not one independent vector, but a series:

$$x = (x_1, x_2, \ldots, x_T)$$
Examples include text tokens, audio frames, video frames, time-series samples, event logs, trajectories, and program tokens. The model must represent both local content and position-dependent context.
A sequence model usually computes hidden states:

$$h_t = f(h_{t-1}, x_t; \theta)$$

or contextual states:

$$c_t = g_t(x_1, \ldots, x_T; \theta)$$
The output may be one value for the whole sequence, one value per position, or a generated continuation.
Common Sequence Tasks
| Task | Input | Output |
|---|---|---|
| Classification | Whole sequence | One label |
| Tagging | Sequence | Label per token |
| Forecasting | Past values | Future values |
| Translation | Source sequence | Target sequence |
| Language modeling | Previous tokens | Next token |
| Control | State history | Action sequence |
The loss depends on the task. For language modeling, the standard objective predicts the next token:

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_{t+1} \mid x_1, \ldots, x_t)$$
AD differentiates this loss through all operations that produced the token probabilities.
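As a concrete sketch, the next-token objective can be computed from per-position logits. The shapes and values below are illustrative, not taken from the text:

```python
import numpy as np

# Illustrative sketch: next-token cross-entropy over a short sequence.
# `logits` has shape (T, V): one score vector per position, predicting
# the token at the next position. `targets` holds the next-token IDs.
rng = np.random.default_rng(0)
T, V = 4, 10
logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)

# Log-softmax, shifted by the row max for numerical stability.
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Negative log-likelihood of the correct next token, averaged over positions.
loss = -log_probs[np.arange(T), targets].mean()
print(loss)
```

Backpropagating through this expression yields the usual softmax cross-entropy gradient at every position.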
Recurrent Models
A recurrent neural network updates a hidden state step by step:

$$h_t = \phi(W_h h_{t-1} + W_x x_t + b)$$
The same parameters are reused at every time step. During backpropagation, gradients from all time steps accumulate into the same parameter buffers:

$$\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{t=1}^{T} \left.\frac{\partial \mathcal{L}}{\partial \theta}\right|_{\text{use of } \theta \text{ at step } t}$$
This is weight sharing across time. AD systems handle it naturally if the computation is unrolled into a graph and parameter uses point to the same underlying tensor.
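A minimal illustration of this accumulation, using a hypothetical scalar recurrence `h_t = w * h_{t-1} + x_t` with loss `L = h_T`:

```python
import numpy as np

# Hypothetical scalar RNN: h_t = w * h_{t-1} + x_t, loss L = h_T.
# The single parameter w is used at every step, so its gradient is the
# SUM of per-step contributions -- the weight-sharing accumulation an
# AD system performs automatically.
w = 0.5
x = np.array([1.0, 2.0, 3.0])

# Forward pass, saving hidden states (h[0] is the initial state).
h = [0.0]
for x_t in x:
    h.append(w * h[-1] + x_t)

# Reverse pass: adjoint of h_t, accumulating dL/dw across time steps.
grad_w = 0.0
adj_h = 1.0                  # dL/dh_T
for t in reversed(range(len(x))):
    grad_w += adj_h * h[t]   # contribution of the use of w at step t+1
    adj_h *= w               # dh_{t+1}/dh_t = w

# Check against a finite-difference estimate.
eps = 1e-6
def loss(w_):
    h_ = 0.0
    for x_t in x:
        h_ = w_ * h_ + x_t
    return h_
fd = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(grad_w, fd)   # both approximately 3.0 for these values
```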
Backpropagation Through Time
Training a recurrent model uses backpropagation through time. The recurrence is expanded into a computation graph:
```
h0 -> h1 -> h2 -> ... -> hT
```

The backward pass walks the graph in reverse:

```
hT -> hT-1 -> ... -> h0
```

Long sequences create two problems. First, activation memory grows with sequence length because hidden states must be saved. Second, gradients involve long products of Jacobians, which can vanish or explode.
Truncated backpropagation through time limits the backward span:

```
process K steps
backward through those K steps
detach hidden state
continue
```

Detaching the hidden state cuts the graph. The next segment still receives the numeric hidden value, but gradients no longer flow into earlier segments.
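A sketch of this segmentation, using a hypothetical scalar RNN `h_t = w * h_{t-1} + x_t` with a per-step loss (segment length and inputs are illustrative):

```python
import numpy as np

# Sketch of truncated BPTT on a hypothetical scalar RNN
# h_t = w * h_{t-1} + x_t with loss sum_t(h_t).
# Each K-step segment is backpropagated in isolation; the hidden state
# carried into the next segment is treated as a constant ("detached"),
# so no gradient flows across the segment boundary.
w = 0.5
x = np.arange(1.0, 7.0)   # six steps
K = 2                     # truncation length

h_carry = 0.0
grad_w = 0.0
for seg_start in range(0, len(x), K):
    seg = x[seg_start:seg_start + K]
    # Forward through one segment, saving states.
    hs = [h_carry]
    for x_t in seg:
        hs.append(w * hs[-1] + x_t)
    # Backward through this segment only.
    adj_h = 0.0
    for t in reversed(range(len(seg))):
        adj_h += 1.0              # dL/dh from this step's own loss term
        grad_w += adj_h * hs[t]   # use of w at this step
        adj_h *= w                # propagate to h_t within the segment
    # Detach: carry the numeric value, not the graph.
    h_carry = hs[-1]

print(grad_w)   # 26.25 for these values
```

The printed gradient is smaller than full BPTT would give, because cross-segment terms are intentionally dropped.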
Gated Recurrent Networks
LSTM and GRU architectures were introduced to improve gradient flow in recurrent models. They add gates that control what is stored, forgotten, and exposed.
An LSTM has a cell state with additive updates. This additive path gives gradients a more stable route across time than a plain repeated nonlinear transformation.
From an AD perspective, gated networks are ordinary differentiable programs. Their importance is numerical and architectural: they shape the Jacobians through which backpropagation propagates adjoints.
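For concreteness, one standard formulation of the LSTM update (the notation here is assumed, not fixed by the text) is:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The direct path through the cell state contributes $\partial c_t / \partial c_{t-1} = \mathrm{diag}(f_t)$: a gated elementwise product rather than a repeated full nonlinear Jacobian, which is the additive route mentioned above.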
Attention Models
Attention computes each output position as a weighted combination of other positions. A basic scaled dot-product attention block is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
The backward pass differentiates through matrix multiplication, scaling, softmax, and another matrix multiplication.
Attention removes the strict sequential dependency of recurrence. All positions can interact in one layer, and many operations parallelize well on accelerators. The cost is quadratic in sequence length for standard full attention.
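A minimal NumPy sketch of this block (single head, no mask; shapes are illustrative):

```python
import numpy as np

# Minimal scaled dot-product attention -- a sketch, not a library
# implementation. Q, K, V all have shape (T, d).
def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (T, T) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted combination

rng = np.random.default_rng(0)
T, d = 5, 8
Q, K, V = rng.normal(size=(3, T, d))
out = attention(Q, K, V)
print(out.shape)   # (5, 8)
```

Each output row is a convex combination of the rows of `V`, which is why the backward pass reduces to derivatives of two matrix products, a scaling, and a softmax.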
Masks and Causality
Language models use causal masks so position cannot attend to future positions. The attention scores for forbidden positions are set to a very negative value before softmax.
Conceptually:

```
if j > t:
    score[t, j] = -infinity
```

The mask changes the computation graph by removing future-token influence. Gradients do not flow through masked attention probabilities because those probabilities are forced to zero after softmax.
Padding masks are also common. They prevent padded tokens from affecting the loss or attention context.
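A small NumPy sketch of the causal mask (the `-1e30` sentinel and sizes are illustrative):

```python
import numpy as np

# Sketch: applying a causal mask before softmax. Masked scores get a
# very negative value, so their softmax probabilities underflow to zero
# and no gradient flows through them.
T = 4
rng = np.random.default_rng(1)
scores = rng.normal(size=(T, T))

# Forbid attending to future positions: j > t.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(mask, -1e30, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(3))   # lower-triangular probability rows
```

Position 0 can attend only to itself, so its entire probability mass sits on the diagonal entry.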
Loss Over Sequences
Sequence losses often sum or average over positions:

$$\mathcal{L} = \frac{1}{|\mathcal{P}|} \sum_{t \in \mathcal{P}} \ell_t$$

Here, $\mathcal{P}$ is the set of positions included in the loss, and $|\mathcal{P}|$ is its size. Padding positions are usually excluded.
This normalization affects gradient scale. Averaging over tokens keeps gradient magnitude more stable across variable-length sequences. Summing over tokens makes longer examples contribute larger gradients.
Teacher Forcing
In autoregressive training, teacher forcing feeds the ground-truth previous token rather than the model’s sampled previous token.
Training computes:

$$p_\theta(x_t \mid x_1, \ldots, x_{t-1})$$

Generation computes:

$$p_\theta(x_t \mid \hat{x}_1, \ldots, \hat{x}_{t-1})$$

where the earlier tokens $\hat{x}_{<t}$ may have been sampled from the model.
This creates a train-generation mismatch. AD only sees the training computation. It differentiates the teacher-forced loss, not the discrete sampling process used during inference.
Discrete Tokens
Text models use integer token IDs. Tokenization and lookup are not differentiable with respect to the token ID. Instead, the ID selects an embedding vector:

$$e_t = E[x_t]$$

Gradients flow into the selected rows of the embedding matrix $E$. They do not flow into the integer token itself.
When the same token appears multiple times, gradient contributions to its embedding row are summed. This is another case of shared parameter accumulation.
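This scatter-add behavior can be sketched with NumPy's `np.add.at` (vocabulary size, IDs, and gradients here are illustrative):

```python
import numpy as np

# Sketch: gradients for repeated tokens accumulate into the same
# embedding row. Forward: e_t = E[ids[t]]. The backward pass of the
# lookup is a scatter-add of output gradients into E's rows.
V, D = 6, 3
E = np.zeros((V, D))
ids = np.array([2, 5, 2, 2])          # token 2 appears three times
grad_out = np.ones((len(ids), D))     # pretend dL/de_t = 1 everywhere

grad_E = np.zeros_like(E)
np.add.at(grad_E, ids, grad_out)      # repeated rows are summed

print(grad_E[2])   # [3. 3. 3.] -- three uses of token 2
print(grad_E[5])   # [1. 1. 1.]
```

`np.add.at` is used rather than `grad_E[ids] += grad_out` because fancy-indexed `+=` silently drops repeated indices, which is exactly the accumulation this case requires.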
Memory in Sequence Training
Sequence training is memory-intensive because activations scale with batch size, sequence length, hidden dimension, and number of layers.
A rough shape is:

$$\text{activation memory} \propto B \cdot T \cdot H \cdot L$$

Here, $B$ is batch size, $T$ is sequence length, $H$ is hidden width, and $L$ is layer count.
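For a feel of the scale, here is a back-of-envelope computation under assumed, purely illustrative constants (one H-sized activation per layer per position, stored in 2-byte precision):

```python
# Illustrative activation-memory estimate; the constants are
# assumptions for this sketch, not values from the text.
B, T, H, L = 8, 2048, 4096, 32   # batch, sequence, hidden width, layers
bytes_per_value = 2              # fp16/bf16
activation_bytes = B * T * H * L * bytes_per_value
print(activation_bytes / 2**30, "GiB")   # 4.0 GiB
```

Real models store several activations per layer (attention scores, MLP intermediates), so actual usage is a multiple of this baseline.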
Training systems use several techniques:
| Technique | Purpose |
|---|---|
| Gradient accumulation | Larger effective batch |
| Activation checkpointing | Lower activation memory |
| Mixed precision | Lower memory and faster compute |
| Sequence packing | Reduce padding waste |
| Flash attention | Reduce attention memory traffic |
| Truncated BPTT | Limit recurrent graph length |
These techniques change execution cost and memory layout. They should preserve the intended gradient, except when truncation intentionally changes it.
Sequence Models and AD
Sequence models stress AD systems because they combine shared parameters, long dependency chains, dynamic lengths, masks, indexing, and large tensor kernels.
The AD contract remains the same:
```
forward:
    compute sequence loss
backward:
    propagate adjoints through the executed sequence computation
optimizer:
    update shared parameters from accumulated gradients
```

For recurrent models, the challenge is long graph depth. For attention models, the challenge is tensor size and memory bandwidth. In both cases, AD supplies local derivative propagation; the architecture and runtime determine whether training is stable and efficient.