Sequence Models

Sequence models process ordered data. The input is not one independent vector, but a series:

x_1, x_2, \ldots, x_T.

Examples include text tokens, audio frames, video frames, time-series samples, event logs, trajectories, and program tokens. The model must represent both local content and position-dependent context.

A sequence model usually computes hidden states:

h_t = f_\theta(x_t, h_{t-1})

or contextual states:

h_t = f_\theta(x_1, \ldots, x_T, t).

The output may be one value for the whole sequence, one value per position, or a generated continuation.

Common Sequence Tasks

Task               Input             Output
Classification     Whole sequence    One label
Tagging            Sequence          Label per token
Forecasting        Past values       Future values
Translation        Source sequence   Target sequence
Language modeling  Previous tokens   Next token
Control            State history     Action sequence

The loss depends on the task. For language modeling, the standard objective predicts the next token:

L(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_{t+1} \mid x_1, \ldots, x_t).

AD differentiates this loss through all operations that produced the token probabilities.
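
As a concrete sketch of this objective, assuming PyTorch as an illustrative framework and a placeholder model whose forward pass maps token IDs of shape (batch, T-1) to logits of shape (batch, T-1, vocab); the mean used here differs from the sum above only by a constant factor:

import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: (batch, T) integer IDs; predict token t+1 from tokens up to t
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                    # (batch, T-1, vocab)
    # Cross-entropy over all positions gives -mean log p(x_{t+1} | x_1..x_t)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch*(T-1), vocab)
        targets.reshape(-1),                  # (batch*(T-1),)
    )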

Recurrent Models

A recurrent neural network updates a hidden state step by step:

h_t = \phi(W_x x_t + W_h h_{t-1} + b).

The same parameters are reused at every time step. During backpropagation, gradients from all time steps accumulate into the same parameter buffers:

\frac{\partial L}{\partial W_h} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W_h}.

This is weight sharing across time. AD systems handle it naturally if the computation is unrolled into a graph and parameter uses point to the same underlying tensor.
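
A minimal sketch of this accumulation, again assuming PyTorch; after the backward pass, W_h.grad holds the summed contributions from all time steps:

import torch

d_in, d_h, T = 4, 8, 5
W_x = torch.randn(d_h, d_in, requires_grad=True)
W_h = torch.randn(d_h, d_h, requires_grad=True)
b = torch.zeros(d_h, requires_grad=True)

x = torch.randn(T, d_in)
h = torch.zeros(d_h)
for t in range(T):
    # The same W_x, W_h, b tensors are reused at every step.
    h = torch.tanh(W_x @ x[t] + W_h @ h + b)

h.sum().backward()
# W_h.grad now contains the sum of gradient contributions from all T steps.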

Backpropagation Through Time

Training a recurrent model uses backpropagation through time. The recurrence is expanded into a computation graph:

h0 -> h1 -> h2 -> ... -> hT

The backward pass walks the graph in reverse:

hT -> hT-1 -> ... -> h0

Long sequences create two problems. First, activation memory grows with sequence length because hidden states must be saved. Second, gradients involve long products of Jacobians, which can vanish or explode.

Truncated backpropagation through time limits the backward span:

process K steps
backward through those K steps
detach hidden state
continue

Detaching the hidden state cuts the graph. The next segment still receives the numeric hidden value, but gradients no longer flow into earlier segments.
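
A sketch of this schedule, assuming a PyTorch-style step_fn(x_t, h) that returns a per-step loss and the next hidden state; step_fn and optimizer are placeholder names:

def truncated_bptt(step_fn, optimizer, x, h0, K):
    h = h0
    for start in range(0, x.size(0), K):
        loss = 0.0
        for x_t in x[start:start + K]:
            step_loss, h = step_fn(x_t, h)
            loss = loss + step_loss
        loss.backward()        # backward through this K-step segment only
        optimizer.step()
        optimizer.zero_grad()
        h = h.detach()         # keep the numeric value, cut the gradient path
    return h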

Gated Recurrent Networks

LSTM and GRU architectures were introduced to improve gradient flow in recurrent models. They add gates that control what is stored, forgotten, and exposed.

An LSTM has a cell state with additive updates. This additive path gives gradients a more stable route across time than a plain repeated nonlinear transformation.

From an AD perspective, gated networks are ordinary differentiable programs. Their importance is numerical and architectural: they shape the Jacobians through which backpropagation propagates adjoints.
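
For reference, a single LSTM cell update written as an ordinary differentiable program; a sketch with the four gates stacked into one projection, and dimensions chosen only for illustration:

import torch

def lstm_cell(x, h, c, W, U, b):
    # W: (4*d_h, d_in), U: (4*d_h, d_h), b: (4*d_h,)
    z = W @ x + U @ h + b
    i, f, g, o = z.chunk(4)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_next = f * c + i * g            # additive cell-state path
    h_next = o * torch.tanh(c_next)   # exposed hidden state
    return h_next, c_next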

Attention Models

Attention computes each output position as a weighted combination of other positions. A basic scaled dot-product attention block is:

Q = XW_Q,\quad K = XW_K,\quad V = XW_V,
A = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right),
Y = AV.

The backward pass differentiates through matrix multiplication, scaling, softmax, and another matrix multiplication.

Attention removes the strict sequential dependency of recurrence. All positions can interact in one layer, and many operations parallelize well on accelerators. The cost is quadratic in sequence length for standard full attention.
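
A minimal single-head sketch of this block in PyTorch, with no masking and the batch dimension omitted:

import math
import torch

def attention(X, W_Q, W_K, W_V):
    # X: (T, d_model); W_Q, W_K, W_V: (d_model, d)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # (T, T)
    A = torch.softmax(scores, dim=-1)
    return A @ V                                               # (T, d)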

Masks and Causality

Language models use causal masks so position t cannot attend to future positions. The attention scores for forbidden positions are set to a very negative value before softmax.

Conceptually:

if j > t:
    score[t, j] = -infinity

The mask changes the computation graph by removing future-token influence. Gradients do not flow through masked attention probabilities because those probabilities are forced to zero after softmax.
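
In tensor form, the mask is typically applied to the score matrix before the softmax. A sketch, where the strict upper triangle of the T-by-T score matrix is the forbidden region:

import torch

def apply_causal_mask(scores):
    # scores: (T, T) attention scores; entry [t, j] scores position j for query t
    T = scores.size(-1)
    future = torch.triu(torch.ones(T, T), diagonal=1).bool()
    return scores.masked_fill(future, float("-inf"))  # softmax maps these to 0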

Padding masks are also common. They prevent padded tokens from affecting the loss or attention context.

Loss Over Sequences

Sequence losses often sum or average over positions:

L = \frac{1}{M} \sum_{t \in \mathcal{T}} \ell_t.

Here, \mathcal{T} is the set of positions included in the loss, and M is its size. Padding positions are usually excluded.

This normalization affects gradient scale. Averaging over tokens keeps gradient magnitude more stable across variable-length sequences. Summing over tokens makes longer examples contribute larger gradients.
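
A sketch of a masked average, assuming per-token logits and a 0/1 loss_mask that marks the included positions:

import torch.nn.functional as F

def masked_token_loss(logits, targets, loss_mask):
    # logits: (B, T, V); targets: (B, T); loss_mask: (B, T), 1 = include, 0 = pad
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (per_token * loss_mask).sum() / loss_mask.sum()  # average over the M positions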

Teacher Forcing

In autoregressive training, teacher forcing feeds the ground-truth previous token rather than the model’s sampled previous token.

Training computes:

p_\theta(x_{t+1} \mid x_1, \ldots, x_t).

Generation computes:

p_\theta(\hat{x}_{t+1} \mid \hat{x}_1, \ldots, \hat{x}_t),

where earlier tokens may have been sampled from the model.

This creates a train-generation mismatch. AD only sees the training computation. It differentiates the teacher-forced loss, not the discrete sampling process used during inference.
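
The contrast in code, as a sketch in which model is a placeholder returning logits over the vocabulary:

import torch

# Training (teacher forcing): inputs are the ground-truth tokens shifted by one.
def teacher_forced_pair(tokens):
    return tokens[:, :-1], tokens[:, 1:]    # (inputs, targets)

# Generation: each new input token is the model's own previous sample.
@torch.no_grad()
def generate(model, prefix, steps):
    tokens = prefix
    for _ in range(steps):
        logits = model(tokens)[:, -1]       # logits for the next position
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens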

Discrete Tokens

Text models use integer token IDs. Tokenization and lookup are not differentiable with respect to the token ID. Instead, the ID selects an embedding vector:

e_t = E[x_t].

Gradients flow into the selected rows of the embedding matrix E. They do not flow into the integer token itself.

When the same token appears multiple times, gradient contributions to its embedding row are summed. This is another case of shared parameter accumulation.
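
A small PyTorch check of this accumulation; token ID 3 appears twice, so its embedding row receives two gradient contributions:

import torch

E = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
ids = torch.tensor([3, 7, 3])        # token 3 repeated
E(ids).sum().backward()
# E.weight.grad[3] is twice E.weight.grad[7]: repeated uses sum into one row.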

Memory in Sequence Training

Sequence training is memory-intensive because activations scale with batch size, sequence length, hidden dimension, and number of layers.

A rough shape is:

\text{activation memory} \propto B \cdot T \cdot d \cdot L.

Here, B is batch size, T is sequence length, d is hidden width, and L is layer count.
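
A back-of-the-envelope instance of this proportionality; the per-position constant below is an assumed illustrative factor, not a measurement:

B, T, d, L = 8, 2048, 4096, 32          # batch, length, width, layers
bytes_per_value = 2                      # fp16
values_per_position_per_layer = 10 * d   # assumed constant factor
total = B * T * values_per_position_per_layer * L * bytes_per_value
print(total / 2**30)                     # about 40 GiB under these assumptions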

Training systems use several techniques:

Technique                 Purpose
Gradient accumulation     Larger effective batch
Activation checkpointing  Lower activation memory
Mixed precision           Lower memory and faster compute
Sequence packing          Reduce padding waste
Flash attention           Reduce attention memory traffic
Truncated BPTT            Limit recurrent graph length

These techniques change execution cost and memory layout. They should preserve the intended gradient, except when truncation intentionally changes it.
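
For example, gradient accumulation changes only the update schedule, not the intended gradient. A sketch, where compute_loss and micro_batches are placeholder names:

def train_with_accumulation(model, optimizer, micro_batches, accum_steps):
    optimizer.zero_grad()
    for i, batch in enumerate(micro_batches):
        loss = compute_loss(model, batch)   # placeholder loss function
        (loss / accum_steps).backward()     # scale so the sum matches one large batch
        if (i + 1) % accum_steps == 0:
            optimizer.step()                # one update per accum_steps micro-batches
            optimizer.zero_grad()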

Sequence Models and AD

Sequence models stress AD systems because they combine shared parameters, long dependency chains, dynamic lengths, masks, indexing, and large tensor kernels.

The AD contract remains the same:

forward:
    compute sequence loss

backward:
    propagate adjoints through the executed sequence computation

optimizer:
    update shared parameters from accumulated gradients

For recurrent models, the challenge is long graph depth. For attention models, the challenge is tensor size and memory bandwidth. In both cases, AD supplies local derivative propagation; the architecture and runtime determine whether training is stable and efficient.
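
In a typical framework, the contract above maps onto a single training step like this sketch, where sequence_loss is a placeholder:

def training_step(model, optimizer, batch):
    loss = sequence_loss(model, batch)   # forward: compute the sequence loss
    loss.backward()                      # backward: propagate adjoints
    optimizer.step()                     # optimizer: update shared parameters
    optimizer.zero_grad()
    return loss.detach()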