Vanishing Gradients in RNNs

Recurrent neural networks were designed to process sequential data by maintaining a hidden state over time. In principle, the hidden state can preserve information from arbitrarily distant positions. In practice, standard recurrent networks often fail to learn long-range dependencies.

The main reason is the vanishing gradient problem.

During training, gradients must propagate backward through many recurrent steps. Repeated multiplication by small derivatives causes the gradient magnitude to shrink exponentially. When this happens, early time steps receive almost no learning signal.

This section studies why vanishing gradients occur, how they affect learning, and why they motivated gated recurrent architectures such as LSTMs and GRUs.

Long-Range Dependency Problems

Consider the sentence:

The trophy would not fit in the suitcase because it was too large.

To determine what “it” refers to, the model must preserve semantic information across several words.

Now consider a much longer example:

In the article published last month by the research group that recently joined the institute after several years abroad, the lead scientist explained that the original experiment failed because ...

The phrase “lead scientist” may affect interpretation many words later.

A recurrent model should ideally preserve relevant information over long distances. However, standard RNNs often emphasize recent information and gradually forget earlier context.

The problem is not only memory capacity. It is primarily an optimization problem caused by gradient dynamics.

Gradient Propagation Through Time

Recall the recurrent update:

h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h).

genui{“math_block_widget_always_prefetch_v2”:{“content”:“h_t=\tanh(W_{xh}x_t+W_{hh}h_{t-1}+b_h)”}}

Suppose the loss depends on the final hidden state:

L = \ell(h_T).

The gradient with respect to an earlier hidden state is

\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T} \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}.

The key term is the repeated product of Jacobians.

For the tanh RNN:

\frac{\partial h_k}{\partial h_{k-1}} = \operatorname{diag}(1 - h_k^2)\, W_{hh}.

Therefore:

\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T} \prod_{k=t+1}^{T} \operatorname{diag}(1 - h_k^2)\, W_{hh}.

If these matrices tend to shrink vectors, the gradient norm decreases exponentially as it moves backward through time.
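This shrinkage is easy to observe numerically. The sketch below (assuming NumPy; the state size, sequence length, and the 0.9 spectral-norm scaling are arbitrary illustrative choices) runs a tanh recurrence and accumulates the Jacobian product from the formula above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 32, 50

# Random recurrent weight rescaled so its spectral norm is 0.9 (contractive).
W = rng.normal(size=(n, n))
W *= 0.9 / np.linalg.norm(W, 2)

h = np.tanh(rng.normal(size=n))
product = np.eye(n)            # running product of Jacobians dh_k / dh_{k-1}
norms = []
for _ in range(T):
    h = np.tanh(W @ h)
    J = np.diag(1.0 - h**2) @ W
    product = product @ J
    norms.append(np.linalg.norm(product, 2))

print(norms[0], norms[-1])     # the norm collapses over 50 steps
```

Each factor has spectral norm at most 0.9, so the product's norm decays at least geometrically in the number of steps.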

Why Gradients Shrink

The derivative of the hyperbolic tangent satisfies

0 \leq 1 - \tanh^2(x) \leq 1.

The derivative becomes very small when the activation saturates near -1 or 1.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“y=\tanh(x)”}}

Suppose a derivative factor has average magnitude 0.5. After 20 recurrent steps:

0.5^{20} \approx 9.5 \times 10^{-7}.

The learning signal effectively disappears.

Even if the recurrent matrix W_{hh} has moderate norm, repeated multiplication can still drive gradients toward zero.

This is the essence of vanishing gradients.

A Simple Scalar Example

Consider the scalar recurrence:

h_t = w h_{t-1}.

Then:

h_t = w^t h_0.

The derivative of h_t with respect to h_0 is

\frac{\partial h_t}{\partial h_0} = w^t.

If

|w| < 1,

then

w^t \to 0

exponentially fast.

Example:

| w   | t   | w^t                |
|-----|-----|--------------------|
| 0.9 | 10  | 0.35               |
| 0.9 | 50  | 0.005              |
| 0.9 | 100 | 0.000026           |
| 0.5 | 20  | 9.5 \times 10^{-7} |

Even mild contraction becomes severe over long sequences.
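The table above can be reproduced directly:

```python
# Contraction factor w raised to the number of time steps t.
for w, t in [(0.9, 10), (0.9, 50), (0.9, 100), (0.5, 20)]:
    print(f"w = {w}, t = {t:3d}, w^t = {w**t:.2e}")
```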

Saturating Activations

The problem becomes worse when activation functions saturate.

For sigmoid activation:

\sigma(x) = \frac{1}{1+e^{-x}}

the derivative satisfies

0 < \sigma'(x) \leq 0.25.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“y=\frac{1}{1+e^{-x}}”}}

Repeated multiplication by numbers smaller than 1 rapidly reduces gradient magnitude.

This is one reason early recurrent networks using sigmoid activations were difficult to train.

Tanh activation improves the situation somewhat because its derivative can reach 1 near zero:

\tanh'(0) = 1.

However, tanh still saturates for large positive or negative activations.
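The contrast between the two derivatives, and their saturation away from zero, can be checked with a few values (a standalone sketch using only the standard math module):

```python
import math

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh^2(x); at most 1, attained at x = 0.
    return 1.0 - math.tanh(x) ** 2

def sigmoid_grad(x):
    # d/dx sigma(x) = sigma(x) * (1 - sigma(x)); at most 0.25.
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0):
    print(f"x = {x}: tanh' = {tanh_grad(x):.4f}, sigmoid' = {sigmoid_grad(x):.4f}")
```

At x = 0 the tanh derivative is 1 while the sigmoid derivative is only 0.25; by x = 5 both are tiny.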

Effects on Learning

Vanishing gradients create several practical problems.

Failure to Learn Long Dependencies

The model may only use recent context.

For example, in language modeling, it may predict based mainly on nearby words rather than distant semantic structure.

Slow Convergence

Earlier sequence positions receive weak learning signals. Parameters affecting long-term behavior update slowly.

Bias Toward Local Patterns

The network tends to focus on short-term correlations because they produce stronger gradients.

Memory Loss

Hidden states gradually overwrite earlier information.

Hidden State Dynamics

The hidden state evolves recursively:

h_t = f(h_{t-1}, x_t).

If the recurrent transformation contracts state space, hidden trajectories converge toward a narrow region.

This causes memory collapse.

Suppose:

\|W_{hh}\| < 1.

Then repeated recurrent multiplication suppresses earlier components:

W_{hh}^t h_0 \to 0.

The network forgets earlier information regardless of sequence content.
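This collapse can be checked directly (a NumPy sketch; the state size and the 0.9 spectral norm are arbitrary contractive choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16

# Random recurrent matrix forced to be contractive: spectral norm 0.9.
W = rng.normal(size=(n, n))
W *= 0.9 / np.linalg.norm(W, 2)

h0 = rng.normal(size=n)
h = h0.copy()
for _ in range(100):
    h = W @ h          # 100 linear recurrent steps

# The state norm shrinks by roughly 0.9^100, regardless of h0's content.
print(np.linalg.norm(h0), np.linalg.norm(h))
```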

Exploding Gradients

The opposite problem also exists.

If the recurrent transformation expands too strongly:

\|W_{hh}\| > 1,

then gradients may grow exponentially:

\|W_{hh}^t\| \to \infty.

This creates exploding gradients.

Exploding gradients cause:

| Problem                 | Effect                  |
|-------------------------|-------------------------|
| Numerical instability   | Overflow and NaN losses |
| Large parameter updates | Training divergence     |
| Unstable optimization   | Oscillating loss        |

Gradient clipping is commonly used to reduce this problem:

# Rescale all gradients so their global norm is at most max_norm.
torch.nn.utils.clip_grad_norm_(
    model.parameters(),
    max_norm=1.0,
)

Vanishing gradients are usually harder to solve because clipping cannot restore missing signal.

Spectral Radius and Stability

The recurrent matrix W_{hh} largely determines gradient behavior.

The key quantity is the spectral radius:

\rho(W_{hh}),

which is the largest absolute value among its eigenvalues.

If

\rho(W_{hh}) < 1,

gradients tend to vanish.

If

\rho(W_{hh}) > 1,

gradients tend to explode.

Stable recurrent training often requires careful initialization and normalization to keep dynamics near equilibrium.

Orthogonal initialization is one common strategy:

# Initialize every recurrent (hidden-to-hidden) weight matrix to be orthogonal.
for name, param in rnn.named_parameters():
    if "weight_hh" in name:
        torch.nn.init.orthogonal_(param)

Orthogonal matrices preserve vector norms more effectively than arbitrary random matrices.
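Norm preservation is easy to verify. The sketch below builds an orthogonal matrix via QR decomposition of a Gaussian matrix (one standard construction, used here with NumPy so the check is self-contained) and applies it repeatedly:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random orthogonal matrix from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))

x = rng.normal(size=64)
y = x.copy()
for _ in range(100):
    y = Q @ y          # 100 recurrent applications

# The norm is preserved up to floating-point rounding, even after 100 steps.
print(np.linalg.norm(x), np.linalg.norm(y))
```

A generic random matrix would instead shrink or blow up the vector exponentially over the same 100 steps.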

Gradient Flow Visualization

Conceptually, gradient propagation behaves like repeated scaling.

Suppose the average scaling factor is:

| Scaling factor | Behavior            |
|----------------|---------------------|
| 0.5            | rapid decay         |
| 0.9            | slow decay          |
| 1.0            | preserved magnitude |
| 1.1            | exponential growth  |

Even small deviations from 1 become significant over many time steps.

This sensitivity makes recurrent optimization difficult.

Empirical Symptoms

When training a standard RNN, vanishing gradients may appear as:

| Symptom                        | Observation                                     |
|--------------------------------|-------------------------------------------------|
| Poor long-term memory          | Model ignores distant context                   |
| Short repetitive outputs       | Language model repeats local patterns           |
| Low training progress          | Loss plateaus early                             |
| Sensitivity to sequence length | Performance degrades rapidly for long sequences |

Monitoring gradient norms can help diagnose the problem.

Example:

# Global gradient norm: square root of the sum of squared per-parameter norms.
total_norm = 0.0

for p in model.parameters():
    if p.grad is not None:
        total_norm += p.grad.norm().item() ** 2

total_norm = total_norm ** 0.5
print(total_norm)

Very small norms may indicate vanishing gradients.

Architectural Solutions

Several architectural changes were developed to address vanishing gradients.

Gated Recurrent Networks

LSTMs and GRUs introduce gates that regulate information flow.

These architectures create additive memory paths that preserve gradients more effectively.

Residual Connections

Residual pathways reduce gradient attenuation by creating shorter paths through the computation graph.

Attention Mechanisms

Attention bypasses repeated recurrence entirely.

Instead of compressing all history into one hidden state, the model directly accesses earlier positions.

Transformers became dominant partly because attention avoids long recurrent gradient chains.

Optimization Strategies

Several practical methods improve recurrent training.

Gradient Clipping

Prevents exploding gradients.

Careful Initialization

Orthogonal and identity initialization help preserve signal magnitude.

Layer Normalization

Stabilizes hidden-state distributions.

Shorter Unrolling Lengths

Truncated BPTT reduces optimization depth.
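A minimal PyTorch sketch of truncated BPTT (the model sizes, chunk length, and random data are arbitrary illustrations): detaching the hidden state between chunks cuts the computation graph, so gradients propagate backward through at most one chunk.

```python
import torch

# Toy model: vanilla RNN plus a linear readout (sizes are arbitrary).
rnn = torch.nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

seq = torch.randn(4, 100, 8)       # batch of length-100 sequences
target = torch.randn(4, 100, 1)
chunk = 20                         # truncation length

h = None
for start in range(0, seq.size(1), chunk):
    out, h = rnn(seq[:, start:start + chunk], h)
    loss = torch.nn.functional.mse_loss(head(out), target[:, start:start + chunk])
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()                 # cut the graph: gradients span at most `chunk` steps
```

The hidden state still carries information forward across the whole sequence; only the backward pass is shortened.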

Better Activations

Modern architectures often avoid strongly saturating nonlinearities.

Why LSTMs Were Important

Before LSTMs, training recurrent networks on long sequences was extremely difficult.

LSTMs introduced controlled memory updates:

  • information can be preserved,
  • overwritten,
  • or forgotten explicitly.

The memory cell creates a path where gradients can flow with reduced attenuation.
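The key mechanism is the additive cell-state update (standard LSTM notation, with f_t and i_t the forget and input gates and \tilde{c}_t the candidate memory):

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t.

Ignoring the gates' own dependence on the previous state, the Jacobian along the cell path is simply \operatorname{diag}(f_t), so when the forget gate stays near 1, gradients pass backward through the cell with almost no attenuation.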

This was a major breakthrough for sequence modeling before transformers became dominant.

Summary

Vanishing gradients occur because recurrent training requires repeated multiplication of Jacobian matrices across time. When these transformations contract vector magnitudes, gradients decay exponentially as they move backward through the sequence.

As a result, standard recurrent networks struggle to learn long-range dependencies. They tend to focus on recent inputs and gradually forget distant context.

Exploding gradients arise from the opposite effect: repeated expansion of gradient magnitude. Gradient clipping can reduce exploding gradients, but vanishing gradients require architectural changes such as gating, normalization, residual pathways, or attention.

These limitations motivated the development of LSTMs and GRUs, which are the subject of the next chapter.