Recurrent neural networks were designed to process sequential data by maintaining a hidden state over time. In principle, the hidden state can preserve information from arbitrarily distant positions. In practice, standard recurrent networks often fail to learn long-range dependencies.
The main reason is the vanishing gradient problem.
During training, gradients must propagate backward through many recurrent steps. Repeated multiplication by small derivatives causes the gradient magnitude to shrink exponentially. When this happens, early time steps receive almost no learning signal.
This section studies why vanishing gradients occur, how they affect learning, and why they motivated gated recurrent architectures such as LSTMs and GRUs.
Long-Range Dependency Problems
Consider the sentence:
“The trophy would not fit in the suitcase because it was too large.”

To determine what “it” refers to, the model must preserve semantic information across several words.
Now consider a much longer example:
“In the article published last month by the research group that recently joined the institute after several years abroad, the lead scientist explained that the original experiment failed because ...”

The phrase “lead scientist” may affect interpretation many words later.
A recurrent model should ideally preserve relevant information over long distances. However, standard RNNs often emphasize recent information and gradually forget earlier context.
The problem is not only memory capacity. It is primarily an optimization problem caused by gradient dynamics.
Gradient Propagation Through Time
Recall the recurrent update:
$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
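To make the notation concrete, here is a minimal sketch of this update in PyTorch; the dimensions, initialization scales, and the helper `rnn_step` are illustrative choices, not part of the original text:

```python
import torch

# Illustrative dimensions for a single tanh RNN cell.
input_size, hidden_size = 8, 16
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    return torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = torch.zeros(hidden_size)
for x_t in torch.randn(20, input_size):  # unroll over a 20-step sequence
    h = rnn_step(x_t, h)
```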
Suppose the loss depends on the final hidden state:

$$\mathcal{L} = \mathcal{L}(h_T)$$

The gradient with respect to an earlier hidden state is

$$\frac{\partial \mathcal{L}}{\partial h_t} = \frac{\partial \mathcal{L}}{\partial h_T} \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}$$

The key term is the repeated product of Jacobians.

For the tanh RNN:

$$\frac{\partial h_k}{\partial h_{k-1}} = \operatorname{diag}\left(1 - h_k \odot h_k\right) W_{hh}$$

Therefore:

$$\left\lVert \frac{\partial \mathcal{L}}{\partial h_t} \right\rVert \le \left\lVert \frac{\partial \mathcal{L}}{\partial h_T} \right\rVert \prod_{k=t+1}^{T} \left\lVert \operatorname{diag}\left(1 - h_k \odot h_k\right) W_{hh} \right\rVert$$

If these matrices tend to shrink vectors, the gradient norm decreases exponentially as it moves backward through time.
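This decay is easy to observe numerically. The following sketch multiplies a vector by the same mildly contractive matrix (an arbitrary stand-in for the per-step Jacobian, scaled so its largest singular value is 0.9) at every step:

```python
import torch

torch.manual_seed(0)
# An arbitrary stand-in for the recurrent Jacobian, scaled to be contractive.
J = torch.randn(16, 16)
J = 0.9 * J / torch.linalg.matrix_norm(J, ord=2)  # largest singular value = 0.9

g = torch.ones(16)  # stand-in for the gradient arriving at the final step
for t in range(1, 101):
    g = J @ g
    if t % 25 == 0:
        print(t, g.norm().item())  # the norm shrinks roughly like 0.9**t
```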
Why Gradients Shrink
The derivative of the hyperbolic tangent

$$y = \tanh(x)$$

satisfies

$$\frac{dy}{dx} = 1 - \tanh^2(x) \le 1$$

The derivative becomes very small when the activation saturates near $1$ or $-1$.
Suppose a derivative factor has average magnitude $0.5$. After 20 recurrent steps:

$$0.5^{20} \approx 9.5 \times 10^{-7}$$

The learning signal effectively disappears.
Even if the recurrent matrix has moderate norm, repeated multiplication can still drive gradients toward zero.
This is the essence of vanishing gradients.
A Simple Scalar Example
Consider the scalar recurrence:

$$h_t = a\, h_{t-1}$$

Then:

$$h_T = a^{\,T-t}\, h_t$$

The derivative of $h_T$ with respect to $h_t$ is

$$\frac{\partial h_T}{\partial h_t} = a^{\,T-t}$$

If

$$|a| < 1,$$

then

$$a^{\,T-t} \to 0$$

exponentially fast.
Example:

| $a$ | $T - t$ | $a^{\,T-t}$ |
|---|---|---|
| 0.9 | 10 | 0.35 |
| 0.9 | 50 | 0.005 |
| 0.9 | 100 | 0.000026 |
| 0.5 | 20 | 0.00000095 |
Even mild contraction becomes severe over long sequences.
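The table values can be reproduced directly (a one-line sketch):

```python
for a, k in [(0.9, 10), (0.9, 50), (0.9, 100), (0.5, 20)]:
    print(f"a={a}, steps={k}, a**steps={a ** k:.2e}")
```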
Saturating Activations
The problem becomes worse when activation functions saturate.
For the sigmoid activation

$$y = \frac{1}{1 + e^{-x}},$$

the derivative satisfies

$$\frac{dy}{dx} = y\,(1 - y) \le \frac{1}{4}$$
Repeated multiplication by numbers smaller than 1 rapidly reduces gradient magnitude.
This is one reason early recurrent networks using sigmoid activations were difficult to train.
Tanh activation improves the situation somewhat because its derivative can reach 1 near zero:

$$\frac{d}{dx}\tanh(x)\,\Big|_{x=0} = 1 - \tanh^2(0) = 1$$
However, tanh still saturates for large positive or negative activations.
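The saturation effect is easy to quantify; the following sketch evaluates both derivatives at a few activation values:

```python
import torch

x = torch.tensor([0.0, 1.0, 2.0, 5.0])
tanh_grad = 1 - torch.tanh(x) ** 2   # ~1.000, 0.420, 0.071, 0.0002
sig = torch.sigmoid(x)
sig_grad = sig * (1 - sig)           # ~0.250, 0.197, 0.105, 0.0066
print(tanh_grad)
print(sig_grad)
```

Even moderately large activations push both derivatives far below 1.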
Effects on Learning
Vanishing gradients create several practical problems.
Failure to Learn Long Dependencies
The model may only use recent context.
For example, in language modeling, it may predict based mainly on nearby words rather than distant semantic structure.
Slow Convergence
Earlier sequence positions receive weak learning signals. Parameters affecting long-term behavior update slowly.
Bias Toward Local Patterns
The network tends to focus on short-term correlations because they produce stronger gradients.
Memory Loss
Hidden states gradually overwrite earlier information.
Hidden State Dynamics
The hidden state evolves recursively:

$$h_t = f(h_{t-1}, x_t)$$
If the recurrent transformation contracts state space, hidden trajectories converge toward a narrow region.
This causes memory collapse.
Suppose:

$$\lVert W_{hh} \rVert < 1$$

Then repeated recurrent multiplication suppresses earlier components:

$$\lVert W_{hh}^{\,k}\, h_0 \rVert \le \lVert W_{hh} \rVert^{k}\, \lVert h_0 \rVert \to 0$$
The network forgets earlier information regardless of sequence content.
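A small forward-pass sketch (with an arbitrary contractive recurrent matrix) illustrates this collapse: two runs that differ only in their initial hidden state end up nearly identical.

```python
import torch

torch.manual_seed(0)
# An arbitrary contractive recurrent matrix (spectral norm 0.8).
W = torch.randn(16, 16)
W = 0.8 * W / torch.linalg.matrix_norm(W, ord=2)

h_a = torch.randn(16)  # two different initial hidden states
h_b = torch.randn(16)
for x in torch.randn(30, 16):        # identical inputs for both runs
    h_a = torch.tanh(W @ h_a + x)
    h_b = torch.tanh(W @ h_b + x)
print((h_a - h_b).norm().item())     # nearly zero: the initial state is forgotten
```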
Exploding Gradients
The opposite problem also exists.
If the recurrent transformation expands too strongly:

$$\lVert W_{hh} \rVert > 1,$$

then gradients may grow exponentially:

$$\lVert W_{hh} \rVert^{k} \to \infty$$
This creates exploding gradients.
Exploding gradients cause:
| Problem | Effect |
|---|---|
| Numerical instability | Overflow and NaN losses |
| Large parameter updates | Training divergence |
| Unstable optimization | Oscillating loss |
Gradient clipping is commonly used to reduce this problem:
```python
torch.nn.utils.clip_grad_norm_(
    model.parameters(),
    max_norm=1.0,
)
```

Vanishing gradients are usually harder to solve because clipping cannot restore missing signal.
Spectral Radius and Stability
The recurrent matrix largely determines gradient behavior.
The key quantity is the spectral radius:

$$\rho(W_{hh}) = \max_i \, \lvert \lambda_i(W_{hh}) \rvert,$$

which is the magnitude of the largest eigenvalue.

If

$$\rho(W_{hh}) < 1,$$

gradients tend to vanish.

If

$$\rho(W_{hh}) > 1,$$

gradients tend to explode.
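The spectral radius of a recurrent matrix can be checked directly; a sketch, assuming a simple random initialization:

```python
import torch

torch.manual_seed(0)
W_hh = torch.randn(64, 64) / 64 ** 0.5  # an illustrative random initialization
rho = torch.linalg.eigvals(W_hh).abs().max().item()
print(rho)  # below 1 suggests vanishing dynamics, above 1 exploding dynamics
```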
Stable recurrent training often requires careful initialization and normalization to keep dynamics near equilibrium.
Orthogonal initialization is one common strategy:
```python
for name, param in rnn.named_parameters():
    if "weight_hh" in name:
        torch.nn.init.orthogonal_(param)
```

Orthogonal matrices preserve vector norms more effectively than arbitrary random matrices.
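A quick check of this norm-preserving property (a sketch):

```python
import torch

Q = torch.nn.init.orthogonal_(torch.empty(64, 64))
v = torch.randn(64)
print(v.norm().item(), (Q @ v).norm().item())  # the two norms match
```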
Gradient Flow Visualization
Conceptually, gradient propagation behaves like repeated scaling.
Suppose the average per-step scaling factor takes one of the following values:
| Scaling factor | Behavior |
|---|---|
| 0.5 | rapid decay |
| 0.9 | slow decay |
| 1.0 | preserved magnitude |
| 1.1 | exponential growth |
Even small deviations from 1 become significant over many time steps.
This sensitivity makes recurrent optimization difficult.
Empirical Symptoms
When training a standard RNN, vanishing gradients may appear as:
| Symptom | Observation |
|---|---|
| Poor long-term memory | Model ignores distant context |
| Short repetitive outputs | Language model repeats local patterns |
| Low training progress | Loss plateaus early |
| Sensitivity to sequence length | Performance degrades rapidly for long sequences |
Monitoring gradient norms can help diagnose the problem.
Example:
```python
# Sum the squared per-parameter gradient norms, then take a square root,
# to obtain the global gradient norm across the model.
total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        total_norm += p.grad.norm().item() ** 2
total_norm = total_norm ** 0.5
print(total_norm)
```

Very small norms may indicate vanishing gradients.
Architectural Solutions
Several architectural changes were developed to address vanishing gradients.
Gated Recurrent Networks
LSTMs and GRUs introduce gates that regulate information flow.
These architectures create additive memory paths that preserve gradients more effectively.
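As a preview of the mechanism, the standard LSTM cell-state update is additive:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

Along this direct path, the Jacobian is simply

$$\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t),$$

so when the forget gate $f_t$ stays close to 1, gradients pass through the cell state almost unattenuated.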
Residual Connections
Residual pathways reduce gradient attenuation by creating shorter paths through the computation graph.
Attention Mechanisms
Attention bypasses repeated recurrence entirely.
Instead of compressing all history into one hidden state, the model directly accesses earlier positions.
Transformers became dominant partly because attention avoids long recurrent gradient chains.
Optimization Strategies
Several practical methods improve recurrent training.
Gradient Clipping
Prevents exploding gradients.
Careful Initialization
Orthogonal and identity initialization help preserve signal magnitude.
Layer Normalization
Stabilizes hidden-state distributions.
Shorter Unrolling Lengths
Truncated BPTT reduces optimization depth.
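A common implementation pattern is to detach the hidden state between chunks so that gradients flow through at most one chunk; this sketch uses an illustrative one-layer RNN and random data:

```python
import torch

# Illustrative setup: a one-layer RNN trained on a long random sequence.
rnn = torch.nn.RNN(input_size=8, hidden_size=16)
head = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

seq = torch.randn(1000, 1, 8)         # (time, batch, features)
targets = torch.randn(1000, 1, 1)

h = torch.zeros(1, 1, 16)
chunk_len = 50                        # backpropagate through at most 50 steps
for x, y in zip(seq.split(chunk_len), targets.split(chunk_len)):
    h = h.detach()                    # cut the gradient path at the chunk boundary
    out, h = rnn(x, h)
    loss = torch.nn.functional.mse_loss(head(out), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```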
Better Activations
Modern architectures often avoid strongly saturating nonlinearities.
Why LSTMs Were Important
Before LSTMs, training recurrent networks on long sequences was extremely difficult.
LSTMs introduced controlled memory updates:
- information can be preserved,
- overwritten,
- or forgotten explicitly.
The memory cell creates a path where gradients can flow with reduced attenuation.
This was a major breakthrough for sequence modeling before transformers became dominant.
Summary
Vanishing gradients occur because recurrent training requires repeated multiplication of Jacobian matrices across time. When these transformations contract vector magnitudes, gradients decay exponentially as they move backward through the sequence.
As a result, standard recurrent networks struggle to learn long-range dependencies. They tend to focus on recent inputs and gradually forget distant context.
Exploding gradients arise from the opposite effect: repeated expansion of gradient magnitude. Gradient clipping can reduce exploding gradients, but vanishing gradients require architectural changes such as gating, normalization, residual pathways, or attention.
These limitations motivated the development of LSTMs and GRUs, which are the subject of the next chapter.