Recurrent networks reuse the same parameters at every time step. This makes them compact, but it also changes how gradients are computed. A parameter such as the recurrent weight matrix $W_{hh}$ affects not one layer, but every recurrent transition in the sequence.
Backpropagation through time, usually abbreviated BPTT, is the method used to train recurrent neural networks. It applies ordinary backpropagation to the unrolled recurrent graph.
The Unrolled Graph
A recurrent network is defined recursively:

$$h_t = f(h_{t-1}, x_t; \theta)$$

For a sequence of length $T$, this recurrence expands into

$$h_T = f(f(\cdots f(h_0, x_1; \theta) \cdots, x_{T-1}; \theta), x_T; \theta)$$

The same parameter set $\theta$ appears at every step. After unrolling, the recurrent model becomes a deep feedforward graph with shared weights.
If the model produces a loss at every step, the total loss may be written as

$$L = \sum_{t=1}^{T} L_t$$

The training problem is to compute

$$\frac{\partial L}{\partial \theta}$$
Gradients with Shared Parameters
Because the same parameters are used repeatedly, the total gradient is the sum of contributions from all time steps:

$$\frac{\partial L}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial \theta}$$

However, this expression hides an important detail. The loss at time $t$ depends on earlier hidden states through the recurrence. Therefore, $L_t$ depends on $\theta$ both directly and indirectly.

For example, $L_3$ depends on:

- $h_3$, which uses $\theta$ directly,
- $h_2$, which was needed to compute $h_3$,
- $h_1$ and $h_0$, which were needed to compute $h_2$.

Since each hidden state was computed using $\theta$, the gradient must flow backward through the entire chain.
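This accumulation is exactly what autograd does. A minimal sketch (the scalar weight and inputs below are toy values, not part of the model above): one weight used at two steps receives the sum of both per-step contributions.

```python
import torch

# One scalar weight w reused at two "time steps".
w = torch.tensor(2.0, requires_grad=True)
x1, x2 = torch.tensor(3.0), torch.tensor(5.0)

h1 = w * x1          # first use of w
h2 = w * x2 + h1     # second use of w; also depends on w through h1
h2.backward()

# dL/dw = x2 + x1 = 8: contributions from both steps accumulate
print(w.grad)  # tensor(8.)
```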
Gradient Flow Through Hidden States
Suppose the loss depends only on the final hidden state:

$$L = L(h_T)$$

Then the gradient with respect to an earlier hidden state $h_t$ is computed by repeated application of the chain rule:

$$\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T} \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}$$

This product of Jacobian matrices determines how strongly information from step $t$ influences the final loss.
If the product grows, gradients explode. If the product shrinks, gradients vanish. This is the central training difficulty of basic recurrent networks.
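A small numeric illustration (the matrix here is arbitrary, not taken from any model above): raising a matrix to a high power makes its norm explode or vanish depending on whether its spectral norm sits above or below one.

```python
import torch

# Illustrative only: a symmetric matrix normalized to spectral norm 1,
# then scaled above/below 1 and multiplied with itself 30 times.
torch.manual_seed(0)
A = torch.randn(4, 4)
J = (A + A.T) / 2
J = J / torch.linalg.matrix_norm(J, ord=2)  # spectral norm exactly 1

def power_norm(scale, steps=30):
    P = torch.eye(4)
    for _ in range(steps):
        P = (scale * J) @ P
    return torch.linalg.matrix_norm(P, ord=2).item()

print(power_norm(1.5))  # huge: roughly 1.5**30, about 1.9e5
print(power_norm(0.5))  # tiny: roughly 0.5**30, about 9e-10
```

Thirty steps is a short sequence by modern standards, yet the norm already spans fifteen orders of magnitude.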
A Simple RNN Example
Consider the simple recurrent update:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b)$$

Let

$$a_t = W_{hh} h_{t-1} + W_{xh} x_t + b$$

Then

$$h_t = \tanh(a_t)$$

The derivative of $h_t$ with respect to $h_{t-1}$ is

$$\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}(1 - h_t^2)\, W_{hh}$$

The gradient over many steps contains products of terms like this:

$$\prod_{k=t+1}^{T} \mathrm{diag}(1 - h_k^2)\, W_{hh}$$
The repeated multiplication explains why recurrent networks can be unstable over long sequences.
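The single-step Jacobian $\mathrm{diag}(1 - h_t^2)\, W_{hh}$ can be checked against autograd. A sketch with small, arbitrary sizes (all names and dimensions here are illustrative):

```python
import torch

torch.manual_seed(0)
H = 3
W_hh = torch.randn(H, H)
W_xh = torch.randn(H, H)
b = torch.randn(H)
x_t = torch.randn(H)
h_prev = torch.randn(H)

step = lambda h: torch.tanh(W_hh @ h + W_xh @ x_t + b)
h_t = step(h_prev)

# Jacobian of one recurrent step, computed two ways
J_autograd = torch.autograd.functional.jacobian(step, h_prev)
J_formula = torch.diag(1 - h_t ** 2) @ W_hh

print(torch.allclose(J_autograd, J_formula, atol=1e-5))  # True
```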
Loss at Every Time Step
Many sequence tasks produce an output at every step. Examples include language modeling and sequence labeling.
The model may compute

$$y_t = g(h_t)$$

and a loss

$$L_t = \ell(y_t, \text{target}_t)$$

The total loss is usually the mean or sum over time:

$$L = \frac{1}{T} \sum_{t=1}^{T} L_t$$
In PyTorch, this often looks like:

```python
logits, h_n = rnn_model(x)
# logits: [B, T, V]
# targets: [B, T]
loss = criterion(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)
```

The reshape converts the sequence problem into a batch of classification problems.
BPTT in PyTorch
PyTorch builds the computation graph dynamically as the forward pass executes. When a recurrent layer processes a sequence, PyTorch records the operations needed for backpropagation.
A minimal training step:
```python
optimizer.zero_grad()
output, h_n = rnn(x)
logits = classifier(output)
loss = criterion(
    logits.reshape(B * T, num_classes),
    targets.reshape(B * T),
)
loss.backward()
optimizer.step()
```

The call to `loss.backward()` performs backpropagation through the unrolled recurrent computation.
The user does not manually write BPTT. PyTorch’s autograd system computes it from the graph.
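To see what autograd records, a tanh cell can be unrolled by hand and backpropagated with a single call. The sizes and losses below are toy choices for illustration, not the model from the snippet above.

```python
import torch

torch.manual_seed(0)
B, T, H, D = 2, 5, 4, 3
W_hh = torch.randn(H, H, requires_grad=True)
W_xh = torch.randn(D, H, requires_grad=True)
x = torch.randn(B, T, D)

h = torch.zeros(B, H)
loss = 0.0
for t in range(T):
    h = torch.tanh(h @ W_hh + x[:, t] @ W_xh)  # graph grows each step
    loss = loss + h.pow(2).mean()              # per-step loss

loss.backward()  # BPTT: gradients flow back through all T steps
print(W_hh.grad.shape)  # torch.Size([4, 4])
```

The loop builds one long graph; `backward()` then walks it from the last step to the first, accumulating the shared-weight gradients along the way.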
Truncated Backpropagation Through Time
For long sequences, full BPTT can be expensive. It requires storing hidden states and intermediate activations for every time step.
If $T$ is large, memory usage grows with sequence length.
Truncated backpropagation through time limits how far gradients flow backward. Instead of backpropagating through an entire sequence, we split the sequence into shorter chunks.
For example, a sequence of length 1000 may be processed in chunks of 100:

- steps 1-100
- steps 101-200
- steps 201-300
- ...

The hidden state is carried forward between chunks, but the gradient graph is detached.
In PyTorch:
```python
h = None
for x_chunk, y_chunk in chunks:
    optimizer.zero_grad()
    output, h = rnn(x_chunk, h)
    logits = classifier(output)
    loss = criterion(
        logits.reshape(-1, num_classes),
        y_chunk.reshape(-1),
    )
    loss.backward()
    optimizer.step()
    h = h.detach()
```

The line

```python
h = h.detach()
```

breaks the computation graph. The numerical hidden state is reused, but gradients do not flow into earlier chunks.
This reduces memory cost and makes long sequence training practical.
Detaching Hidden States
Detaching a tensor keeps its value but removes its connection to earlier computation.
This matters in recurrent training because hidden states connect chunks over time.
Without detaching, PyTorch would attempt to backpropagate through every previous chunk. That may cause high memory usage or an error when the graph has already been freed.
Example:
```python
h = h.detach()
```

For LSTMs, the hidden state is a pair of tensors (the hidden and cell states), so both must be detached:

```python
h = tuple(state.detach() for state in h)
```

Detaching is a mathematical choice. It says: use the past state as context, but do not update parameters based on dependencies before this boundary.
Exploding Gradients
Because BPTT multiplies many Jacobians, gradients can become very large. This is called exploding gradients.
Symptoms include:
| Symptom | Description |
|---|---|
| Loss becomes `nan` | Numerical overflow occurred |
| Training becomes unstable | Loss jumps suddenly |
| Parameters grow rapidly | Weight norms become very large |
| Gradients spike | Gradient norm increases sharply |
A standard remedy is gradient clipping.
```python
torch.nn.utils.clip_grad_norm_(
    model.parameters(),
    max_norm=1.0,
)
```

Gradient clipping rescales gradients when their norm exceeds a threshold. It does not solve every optimization problem, but it is widely used in recurrent training.
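Clipping is applied after `loss.backward()` and before `optimizer.step()`, once gradients exist. A toy sketch (the linear model and the huge scaling factor are illustrative, chosen only to force large gradients):

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)

optimizer.zero_grad()
loss = (model(x) * 1e6).pow(2).mean()  # deliberately huge gradients
loss.backward()
# Rescale gradients so their global norm is at most 1.0, then step.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
print(total_norm)  # pre-clip norm, far above 1.0
```

Note that `clip_grad_norm_` returns the norm measured *before* clipping, which is useful for logging gradient spikes.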
Vanishing Gradients
Vanishing gradients occur when the product of Jacobians becomes very small. Then early time steps receive little learning signal.
The model may learn short-range patterns but fail to capture long-range dependencies.
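One way to observe this is to retain gradients on intermediate hidden states and compare their norms. A toy probe (the sequence length, sizes, and the small weight scale are arbitrary choices that make the effect visible):

```python
import torch

torch.manual_seed(0)
T, H = 50, 8
# Small recurrent weights give contracting Jacobians, so gradients shrink.
W_hh = (0.1 * torch.randn(H, H)).requires_grad_()
x = torch.randn(T, H)

h = torch.zeros(H)
states = []
for t in range(T):
    h = torch.tanh(h @ W_hh + x[t])
    h.retain_grad()  # keep dL/dh_t for inspection
    states.append(h)

states[-1].sum().backward()
early = states[0].grad.norm().item()
late = states[-2].grad.norm().item()
print(early < late)  # early steps receive far less signal
```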
For example, in a sentence:
“The keys to the cabinet near the door are missing.”

The verb “are” depends on “keys,” which appears many words earlier. A basic RNN may struggle to preserve this dependency.
Gated recurrent architectures were designed to reduce this problem. LSTMs and GRUs introduce additive memory paths and gates that regulate information flow.
Computational Cost
For a sequence length $T$, hidden size $H$, and batch size $B$, a simple RNN requires approximately

$$O(T \cdot B \cdot H^2)$$

computation when the recurrent matrix multiplication dominates.

Memory for full BPTT grows roughly with

$$O(T \cdot B \cdot H)$$

for stored hidden states, plus additional memory for intermediate activations.

This linear dependence on $T$ is one reason truncated BPTT is important for long sequences.
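To make the scaling concrete, a rough back-of-envelope calculation (the values $T = 1000$, $B = 32$, $H = 512$ are hypothetical):

```python
# Hypothetical sizes: T=1000 steps, batch B=32, hidden size H=512, float32.
T, B, H, bytes_per_float = 1000, 32, 512, 4
hidden_bytes = T * B * H * bytes_per_float
print(hidden_bytes / 2**20, "MiB")  # 62.5 MiB for hidden states alone
```

Activations for the gates of an LSTM, stored inputs, and optimizer state all add further multiples on top of this figure.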
BPTT Versus Transformer Training
Recurrent networks process time steps sequentially. The hidden state at step $t$ depends on the hidden state at step $t-1$. This limits parallelism.
Transformers replace recurrence with attention. During training, all positions can be processed in parallel, although causal masks may restrict which positions can attend to which other positions.
This difference is one reason transformers became dominant in large-scale language modeling.
However, recurrence still has useful properties:
| Property | Recurrent networks | Transformers |
|---|---|---|
| Training parallelism over time | Limited | High |
| Memory per token during inference | Compact hidden state | Growing key-value cache |
| Natural streaming | Strong | Requires careful caching |
| Long sequence training | Hard with full BPTT | Expensive due to attention |
RNNs remain useful for streaming signals, embedded systems, online prediction, and cases where compact state matters.
Summary
Backpropagation through time trains recurrent neural networks by applying the chain rule to the unrolled recurrent graph.
Because recurrent parameters are shared across time, their gradients collect contributions from many sequence positions. Gradients must pass through repeated hidden-state transitions, which creates the possibility of exploding and vanishing gradients.
Full BPTT is exact for the unrolled sequence but can be expensive for long inputs. Truncated BPTT reduces memory and computation by limiting the length of the gradient path. In PyTorch, this is usually implemented by detaching hidden states between chunks.