Gradient-based optimization relies on propagating derivative information through many layers, time steps, or computational transformations. In deep systems, these gradients often become extremely small or extremely large during reverse propagation.
These two pathologies are known as:
- vanishing gradients,
- exploding gradients.
They are among the most important stability problems in automatic differentiation and deep learning.
The phenomenon is fundamentally mathematical. Reverse mode propagates products of Jacobians backward through a computational graph. Repeated multiplication causes exponential contraction or exponential expansion depending on the local derivative structure.
Gradient Propagation as Jacobian Products
Suppose a computation consists of states

$$h_k = f_k(h_{k-1}), \qquad k = 1, \ldots, L.$$

The total derivative is a product of layer Jacobians $J_k = \partial h_k / \partial h_{k-1}$:

$$\frac{\partial h_L}{\partial h_0} = J_L J_{L-1} \cdots J_1.$$

Reverse mode propagates gradients through the transposes of these Jacobians:

$$\nabla_{h_0} \mathcal{L} = J_1^\top J_2^\top \cdots J_L^\top \, \nabla_{h_L} \mathcal{L}.$$
Thus the gradient magnitude depends on repeated matrix multiplication.
If the Jacobians tend to contract vectors, gradients vanish.
If the Jacobians tend to expand vectors, gradients explode.
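A minimal NumPy sketch (illustrative, not from the original text) makes the compounding visible: propagating a gradient-like vector through random Jacobians scaled slightly below or above the stability point changes its norm exponentially with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 64, 50

def backprop_norm(scale):
    """Push a unit gradient vector backward through `depth` random Jacobians."""
    g = rng.normal(size=dim)
    g /= np.linalg.norm(g)
    for _ in range(depth):
        # Random Jacobian; `scale` controls whether it tends to contract or expand vectors.
        J = scale * rng.normal(size=(dim, dim)) / np.sqrt(dim)
        g = J.T @ g  # reverse mode multiplies by the transpose
    return np.linalg.norm(g)

print("contractive Jacobians:", backprop_norm(0.8))  # norm collapses toward 0
print("expansive  Jacobians:", backprop_norm(1.2))   # norm blows up
```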
Scalar Example
Consider a simple scalar recurrence:

$$h_t = a \, h_{t-1}.$$

Then:

$$h_T = a^T h_0.$$

The derivative is:

$$\frac{\partial h_T}{\partial h_0} = a^T.$$

If $|a| < 1$, then $a^T \to 0$ as $T$ grows: gradients vanish exponentially.

If $|a| > 1$, then $|a|^T \to \infty$: gradients explode exponentially.
This simple example already captures the core mechanism.
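Concretely, with $a = 0.9$ and $T = 100$ the factor is $0.9^{100} \approx 2.7 \times 10^{-5}$, while $a = 1.1$ gives $1.1^{100} \approx 1.4 \times 10^{4}$: a modest per-step deviation from $1$ compounds into orders of magnitude.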
Matrix Form
In higher dimensions:

$$h_k = \phi(W_k h_{k-1}),$$

with Jacobians:

$$J_k = D_k W_k, \qquad D_k = \operatorname{diag}\!\big(\phi'(W_k h_{k-1})\big).$$

The gradient depends on the repeated product:

$$J_L J_{L-1} \cdots J_1.$$
The dominant behavior is governed by singular values and eigenstructure.
If singular values are mostly below $1$:
- gradients shrink.
If singular values exceed $1$:
- gradients grow.
The effect compounds exponentially with depth.
Relation to Lyapunov Dynamics
Gradient propagation resembles dynamical systems evolution.
Small perturbations evolve according to:

$$\delta h_t \approx J_t J_{t-1} \cdots J_1 \, \delta h_0.$$
This connects gradient stability to:
- Lyapunov exponents,
- chaotic systems,
- stability theory,
- dynamical systems analysis.
The average logarithmic singular value determines asymptotic behavior.
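As a rough numerical illustration (not from the original text), this Lyapunov-style exponent can be estimated as the average log growth rate of a vector pushed through random Jacobians, renormalizing at each step to avoid overflow or underflow.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, steps = 32, 200

def lyapunov_estimate(scale):
    """Average log growth rate of a vector propagated through random Jacobians."""
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    log_growth = 0.0
    for _ in range(steps):
        J = scale * rng.normal(size=(dim, dim)) / np.sqrt(dim)
        v = J @ v
        norm = np.linalg.norm(v)
        log_growth += np.log(norm)  # accumulate the log of the per-step growth factor
        v /= norm                   # renormalize so the vector never over/underflows
    return log_growth / steps

print(lyapunov_estimate(0.9))  # negative exponent: gradients vanish
print(lyapunov_estimate(1.1))  # positive exponent: gradients explode
```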
Vanishing Gradients
Vanishing gradients occur when:

$$\left\| J_1^\top J_2^\top \cdots J_L^\top \, \nabla_{h_L} \mathcal{L} \right\| \to 0 \quad \text{as depth grows}.$$
The optimization signal disappears before reaching early layers or earlier time steps.
This makes learning extremely slow or impossible.
Sigmoid Example
The sigmoid activation:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

has derivative:

$$\sigma'(x) = \sigma(x)\,\big(1 - \sigma(x)\big).$$

Its maximum derivative is:

$$\sigma'(0) = \tfrac{1}{4}.$$

Thus each layer multiplies gradients by at most $\tfrac{1}{4}$ from the activation alone.

For $L$ layers the activation factor is at most $\left(\tfrac{1}{4}\right)^{L}$.

After only 20 layers:

$$\left(\tfrac{1}{4}\right)^{20} \approx 9.1 \times 10^{-13}.$$
The gradient effectively disappears.
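A two-line check (illustrative only) confirms both the maximum derivative and the decay rate of the activation factor:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))

print(dsigmoid(0.0))  # 0.25, the maximum of the sigmoid derivative
print(0.25 ** 20)     # ~9.1e-13: bound on the activation factor after 20 layers
```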
Saturation
Sigmoid and tanh activations saturate.
For large positive or negative inputs:

$$\sigma'(x) \to 0, \qquad \tanh'(x) \to 0.$$
The local Jacobian becomes nearly singular.
Gradients collapse.
This historically made deep sigmoid networks difficult to train.
Recurrent Neural Networks
Vanishing gradients are especially severe in recurrent systems.
Suppose a recurrent update:

$$h_t = \phi(W h_{t-1} + U x_t).$$

The gradient across time involves:

$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} D_k W, \qquad D_k = \operatorname{diag}\!\big(\phi'(W h_{k-1} + U x_k)\big).$$
Long sequences therefore create exponential decay or growth.
This prevented early RNNs from learning long-range dependencies effectively.
Exploding Gradients
Exploding gradients occur when:

$$\left\| J_1^\top J_2^\top \cdots J_L^\top \, \nabla_{h_L} \mathcal{L} \right\| \to \infty.$$
Small perturbations become enormous.
Optimization becomes unstable.
Symptoms include:
| Symptom | Effect |
|---|---|
| Huge parameter updates | Divergence |
| NaN gradients | Invalid arithmetic |
| Oscillating loss | Failure to converge |
| Overflow | Numerical instability |
Example of Exponential Growth
Suppose each Jacobian has norm:

$$\|J_k\| \approx c, \qquad c > 1.$$

Then after 100 layers the gradient can grow by a factor of roughly $c^{100}$; even $c = 1.1$ gives $1.1^{100} \approx 1.4 \times 10^{4}$.
Tiny perturbations amplify dramatically.
Exploding in RNNs
Exploding gradients were historically common in recurrent networks.
Long sequences create repeated matrix multiplication:

$$\frac{\partial h_T}{\partial h_0} \approx \prod_{t=1}^{T} D_t W.$$

If the spectral radius satisfies:

$$\rho(W) > 1,$$
the system becomes unstable.
Conditioning and Singular Values
Gradient stability depends strongly on singular values.
Suppose $\sigma_{\max}(J)$ is the largest singular value of a layer Jacobian $J$.

Then:

$$\|J v\| \le \sigma_{\max}(J)\, \|v\|.$$

Repeated norms above $1$ create explosion.

Repeated norms below $1$ create vanishing.

Stable propagation ideally keeps singular values near $1$.
Deep Linear Networks
Even purely linear networks exhibit these effects.
Consider:

$$y = W_L W_{L-1} \cdots W_1 x.$$
The gradient depends on the same matrix products.
Nonlinearity is not required for vanishing or explosion.
Depth alone creates the problem.
Residual Connections
Residual networks improve gradient flow using skip connections:

$$h_{k+1} = h_k + F(h_k).$$

The Jacobian becomes:

$$\frac{\partial h_{k+1}}{\partial h_k} = I + \frac{\partial F}{\partial h_k}.$$
The identity term helps preserve gradient magnitude.
Gradients can propagate through the skip path even when $\partial F / \partial h_k$ contracts.
This was a major breakthrough enabling very deep networks.
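A small sketch (hypothetical setup, not from the original text) compares a plain stack of contractive layers with a residual stack whose backward factor is $(I + \partial F/\partial h)^\top$:

```python
import numpy as np

rng = np.random.default_rng(2)
dim, depth = 64, 40
g_plain = rng.normal(size=dim)
g_resid = g_plain.copy()

for _ in range(depth):
    # Contractive branch Jacobian (operator norm well below 1).
    Jf = 0.3 * rng.normal(size=(dim, dim)) / np.sqrt(dim)
    g_plain = Jf.T @ g_plain            # plain stack: gradient shrinks exponentially
    g_resid = g_resid + Jf.T @ g_resid  # residual: identity path keeps the gradient alive

print("plain   :", np.linalg.norm(g_plain))
print("residual:", np.linalg.norm(g_resid))
```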
ReLU Activations
ReLU:

$$\mathrm{ReLU}(x) = \max(0, x)$$

has derivative:

$$\mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}$$

For positive pre-activations the local derivative is exactly $1$, so gradient magnitude is preserved better than with sigmoid activations.
This significantly reduced vanishing-gradient problems in feedforward networks.
However, ReLU introduces dead neurons when pre-activations remain negative, since the local derivative is zero there.
Orthogonal Initialization
Initialization strongly affects gradient stability.
If weight matrices are orthogonal:

$$W^\top W = I,$$

then all singular values equal $1$.
This preserves vector norms.
Orthogonal initialization therefore improves gradient propagation in deep systems.
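One common construction (a sketch using NumPy's QR factorization) and a check that the resulting matrix preserves norms:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 128

# QR factorization of a random Gaussian matrix yields an orthogonal Q.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))

print(np.allclose(Q.T @ Q, np.eye(n)))         # True: Q is orthogonal
print(np.linalg.svd(Q, compute_uv=False)[:5])  # singular values are all 1

v = rng.normal(size=n)
print(np.allclose(np.linalg.norm(Q @ v), np.linalg.norm(v)))  # True: norms preserved
```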
Xavier Initialization
Xavier initialization attempts to preserve activation variance:

$$\operatorname{Var}(W_{ij}) = \frac{2}{n_{\text{in}} + n_{\text{out}}}.$$
The goal is to stabilize both forward activations and backward gradients.
He Initialization
For ReLU networks:

$$\operatorname{Var}(W_{ij}) = \frac{2}{n_{\text{in}}}.$$
This compensates for the sparsity induced by ReLU activations.
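A minimal sketch of both scaling rules, assuming the standard Glorot and He variance formulas stated above (the helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def xavier_init(fan_in, fan_out):
    """Glorot/Xavier: variance 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def he_init(fan_in, fan_out):
    """He: variance 2 / fan_in, compensating for ReLU zeroing half the units."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W1 = xavier_init(512, 512)
W2 = he_init(512, 512)
print(W1.std(), W2.std())  # empirical standard deviations match the target scales
```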
Batch Normalization
Batch normalization rescales activations:

$$\hat{x} = \frac{x - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta,$$

where $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^2$ are computed over the current mini-batch.
This improves conditioning and stabilizes gradient propagation.
Normalization reduces internal covariate shift and smooths optimization dynamics.
Layer Normalization
Transformers commonly use layer normalization:

$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu(x)}{\sqrt{\sigma^2(x) + \epsilon}} + \beta,$$

where the mean and variance are computed over the feature dimension of each token.
This stabilizes hidden-state magnitude across layers.
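A minimal NumPy version of the normalization step (illustrative; $\gamma$ and $\beta$ are left at their default values):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each row (one token/example) over its feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.default_rng(5).normal(3.0, 10.0, size=(4, 16))  # badly scaled activations
y = layer_norm(x)
print(y.mean(axis=-1))  # ~0 per row
print(y.std(axis=-1))   # ~1 per row
```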
Gradient Clipping
Gradient clipping prevents catastrophic explosion.
Global norm clipping:

$$g \leftarrow g \cdot \min\!\left(1, \frac{c}{\|g\|}\right).$$
This limits update magnitude.
Clipping does not eliminate the underlying instability, but it prevents numerical divergence.
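A sketch of global-norm clipping over a list of gradient arrays (`clip_by_global_norm` is a hypothetical helper mirroring the rule above):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients jointly so their global L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

grads = [np.full((3, 3), 100.0), np.full(5, -50.0)]  # artificially exploded gradients
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)                                   # large
print(np.sqrt(sum(np.sum(g * g) for g in clipped)))  # ~1.0 after clipping
```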
Gated Recurrent Units
LSTM and GRU architectures were designed specifically to combat vanishing gradients.
LSTM memory cells maintain additive state updates:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t.$$
The additive pathway preserves gradients more effectively than repeated multiplicative recurrence.
Spectral Radius Control
For recurrent matrices $W$, stability depends on the spectral radius:

$$\rho(W) = \max_i |\lambda_i(W)|.$$

Constraining:

$$\rho(W) \le 1$$
helps maintain stable gradient flow.
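One simple sketch of such a constraint (illustrative; `rescale_spectral_radius` is a hypothetical helper): rescale the recurrent matrix so its spectral radius sits at a chosen target.

```python
import numpy as np

rng = np.random.default_rng(6)

def rescale_spectral_radius(W, target=1.0):
    """Divide W by its spectral radius, then scale to the target value."""
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (target / radius)

W = rng.normal(size=(100, 100))
W = rescale_spectral_radius(W, target=0.95)
print(np.max(np.abs(np.linalg.eigvals(W))))  # ~0.95
```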
Attention Mechanisms
Attention partially bypasses long sequential chains.
Instead of propagating information only through recurrence, attention creates shorter dependency paths.
This substantially improves gradient flow across long contexts.
Transformers therefore avoid some classic recurrent instability problems.
Hessian Perspective
Gradient explosion often corresponds to sharp curvature.
Suppose the Hessian is:

$$H = \nabla^2_\theta \mathcal{L}.$$

Large eigenvalues of $H$ imply:
- extreme local sensitivity,
- unstable optimization trajectories.
Vanishing gradients often correspond to flat regions where curvature becomes tiny.
Chaotic Dynamics
Some systems intrinsically amplify perturbations.
Examples:
- differentiable simulators,
- fluid dynamics,
- chaotic recurrent systems,
- long-horizon reinforcement learning.
In such systems, exploding gradients may reflect real physical sensitivity rather than poor implementation.
Mixed Precision Effects
Low precision worsens both problems.
Vanishing:
- tiny gradients underflow to zero.
Explosion:
- large gradients overflow to infinity.
Loss scaling and higher precision accumulators mitigate these effects.
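A quick float16 illustration of both failure modes and the usual loss-scaling workaround (the scale factor here is arbitrary):

```python
import numpy as np

tiny, huge = 1e-8, 1e5
print(np.float16(tiny))  # 0.0 -> a vanishing gradient underflows in half precision
print(np.float16(huge))  # inf -> an exploding gradient overflows (fp16 max ~65504)

# Loss scaling: multiply the loss (hence gradients) by a factor before the backward
# pass, then unscale and accumulate in float32 afterwards.
scale = 1024.0
scaled_grad = np.float16(tiny * scale)  # representable after scaling
print(np.float32(scaled_grad) / scale)  # ~1e-8 recovered in float32
```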
Gradient Noise
Stochastic optimization introduces gradient noise:

$$\hat{g} = g + \xi,$$

where $\xi$ arises from mini-batch sampling.
Vanishing gradients become dominated by noise.
Exploding gradients amplify noise dramatically.
This destabilizes training dynamics.
Diagnostics
Useful diagnostics include:
| Diagnostic | Meaning |
|---|---|
| Gradient norm | Detect explosion/vanishing |
| Layer-wise gradient scale | Localize instability |
| Activation histograms | Detect saturation |
| Singular value analysis | Study propagation |
| NaN detection | Catch overflow |
| Loss spikes | Detect instability |
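A sketch of the first two diagnostics, assuming gradients are available as a name-to-array mapping (the layer names and thresholds here are placeholders):

```python
import numpy as np

def gradient_report(named_grads):
    """Print per-layer gradient norms and flag NaN/Inf or suspicious magnitudes."""
    for name, g in named_grads.items():
        norm = float(np.linalg.norm(g))
        flag = ""
        if not np.isfinite(norm):
            flag = "  <-- NaN/Inf"
        elif norm > 1e3:
            flag = "  <-- possible explosion"
        elif norm < 1e-7:
            flag = "  <-- possible vanishing"
        print(f"{name:>12s}: {norm:.3e}{flag}")

# Hypothetical gradients for a three-layer model.
rng = np.random.default_rng(7)
gradient_report({
    "layer1": 1e-9 * rng.normal(size=(64, 64)),
    "layer2": rng.normal(size=(64, 64)),
    "layer3": 1e4 * rng.normal(size=(64, 64)),
})
```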
Dynamical Isometry
An ideal deep network approximately preserves vector norms:

$$\|J v\| \approx \|v\| \quad \text{for the end-to-end Jacobian } J.$$
This condition is called dynamical isometry.
Networks near dynamical isometry exhibit:
- stable gradients,
- faster optimization,
- improved trainability.
Automatic Differentiation Perspective
Automatic differentiation itself is mathematically exact under exact arithmetic.
Vanishing and exploding gradients arise because:
- the derivative structure of the program contains repeated contractions or expansions,
- floating-point arithmetic underflows or overflows at the resulting extreme magnitudes,
- optimization depends on propagating finite numerical signals through deep graphs.
AD faithfully computes the unstable gradients implied by the computational structure.
Core Idea
Gradient vanishing and explosion arise from repeated Jacobian multiplication during reverse propagation. Deep or recurrent computational graphs naturally amplify or suppress gradients exponentially depending on their local derivative structure. Stable differentiable systems therefore require architectural, numerical, and optimization strategies that preserve gradient magnitude and maintain well-conditioned backward dynamics.