Gradient-based optimization relies on propagating derivative information through many layers, time steps, or computational transformations. In deep systems, these gradients often become extremely small or extremely large during reverse propagation.
These two pathologies are known as:
- vanishing gradients,
- exploding gradients.
They are among the most important stability problems in automatic differentiation and deep learning.
The phenomenon is fundamentally mathematical. Reverse mode propagates products of Jacobians backward through a computational graph. Repeated multiplication causes exponential contraction or exponential expansion depending on the local derivative structure.
Gradient Propagation as Jacobian Products
Suppose a computation consists of states

$$h_k = f_k(h_{k-1}), \qquad k = 1, \ldots, L.$$

The total derivative is a product of layer Jacobians $J_k = \partial h_k / \partial h_{k-1}$:

$$\frac{\partial h_L}{\partial h_0} = J_L J_{L-1} \cdots J_1.$$

Reverse mode propagates gradients through the transposes of these Jacobians:

$$\nabla_{h_0} \mathcal{L} = J_1^\top J_2^\top \cdots J_L^\top \, \nabla_{h_L} \mathcal{L}.$$
Thus the gradient magnitude depends on repeated matrix multiplication.
If the Jacobians tend to contract vectors, gradients vanish.
If the Jacobians tend to expand vectors, gradients explode.
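A minimal NumPy sketch (illustrative, not from the original text) makes the compounding visible: propagating a gradient-like vector through random Jacobians scaled slightly below or above the stability point changes its norm exponentially with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 64, 50

def backprop_norm(scale):
    """Push a unit gradient vector backward through `depth` random Jacobians."""
    g = rng.normal(size=dim)
    g /= np.linalg.norm(g)
    for _ in range(depth):
        # Random Jacobian; `scale` controls whether it tends to contract or expand vectors.
        J = scale * rng.normal(size=(dim, dim)) / np.sqrt(dim)
        g = J.T @ g  # reverse mode multiplies by the transpose
    return np.linalg.norm(g)

print("contractive Jacobians:", backprop_norm(0.8))  # norm collapses toward 0
print("expansive  Jacobians:", backprop_norm(1.2))   # norm blows up
```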
Scalar Example
Consider a simple scalar recurrence:

$$h_t = a \, h_{t-1}.$$

Then:

$$h_T = a^T h_0.$$

The derivative is:

$$\frac{\partial h_T}{\partial h_0} = a^T.$$

If $|a| < 1$, then $a^T \to 0$ as $T$ grows: gradients vanish exponentially.

If $|a| > 1$, then $|a|^T \to \infty$: gradients explode exponentially.
This simple example already captures the core mechanism.
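Concretely, with $a = 0.9$ and $T = 100$ the factor is $0.9^{100} \approx 2.7 \times 10^{-5}$, while $a = 1.1$ gives $1.1^{100} \approx 1.4 \times 10^{4}$: a modest per-step deviation from $1$ compounds into orders of magnitude.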
Matrix Form
In higher dimensions:

$$h_k = \phi(W_k h_{k-1}),$$

with Jacobians:

$$J_k = D_k W_k, \qquad D_k = \operatorname{diag}\!\big(\phi'(W_k h_{k-1})\big).$$

The gradient depends on the repeated product:

$$J_L J_{L-1} \cdots J_1.$$
The dominant behavior is governed by singular values and eigenstructure.
If singular values are mostly below $1$:
- gradients shrink.
If singular values exceed $1$:
- gradients grow.
The effect compounds exponentially with depth.
Relation to Lyapunov Dynamics
Gradient propagation resembles dynamical systems evolution.
Small perturbations evolve according to:

$$\delta h_t \approx J_t J_{t-1} \cdots J_1 \, \delta h_0.$$
This connects gradient stability to:
- Lyapunov exponents,
- chaotic systems,
- stability theory,
- dynamical systems analysis.
The average logarithmic singular value determines asymptotic behavior.
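As a rough numerical illustration (not from the original text), this Lyapunov-style exponent can be estimated as the average log growth rate of a vector pushed through random Jacobians, renormalizing at each step to avoid overflow or underflow.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, steps = 32, 200

def lyapunov_estimate(scale):
    """Average log growth rate of a vector propagated through random Jacobians."""
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    log_growth = 0.0
    for _ in range(steps):
        J = scale * rng.normal(size=(dim, dim)) / np.sqrt(dim)
        v = J @ v
        norm = np.linalg.norm(v)
        log_growth += np.log(norm)  # accumulate the log of the per-step growth factor
        v /= norm                   # renormalize so the vector never over/underflows
    return log_growth / steps

print(lyapunov_estimate(0.9))  # negative exponent: gradients vanish
print(lyapunov_estimate(1.1))  # positive exponent: gradients explode
```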
Vanishing Gradients
Vanishing gradients occur when:

$$\left\| J_1^\top J_2^\top \cdots J_L^\top \, \nabla_{h_L} \mathcal{L} \right\| \to 0 \quad \text{as depth grows}.$$
The optimization signal disappears before reaching early layers or earlier time steps.
This makes learning extremely slow or impossible.
Sigmoid Example
The sigmoid activation:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

has derivative:

$$\sigma'(x) = \sigma(x)\,\big(1 - \sigma(x)\big).$$

Its maximum derivative is:

$$\sigma'(0) = \tfrac{1}{4}.$$

Thus each layer multiplies gradients by at most $\tfrac{1}{4}$ from the activation alone.

For $L$ layers the activation factor is at most $\left(\tfrac{1}{4}\right)^{L}$.

After only 20 layers:

$$\left(\tfrac{1}{4}\right)^{20} \approx 9.1 \times 10^{-13}.$$
The gradient effectively disappears.
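A two-line check (illustrative only) confirms both the maximum derivative and the decay rate of the activation factor:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))

print(dsigmoid(0.0))  # 0.25, the maximum of the sigmoid derivative
print(0.25 ** 20)     # ~9.1e-13: bound on the activation factor after 20 layers
```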
Saturation
Sigmoid and tanh activations saturate.
For large positive or negative inputs:

$$\sigma'(x) \to 0, \qquad \tanh'(x) \to 0.$$
The local Jacobian becomes nearly singular.
Gradients collapse.
This historically made deep sigmoid networks difficult to train.
Recurrent Neural Networks
Vanishing gradients are especially severe in recurrent systems.
Suppose a recurrent update:

$$h_t = \phi(W h_{t-1} + U x_t).$$

The gradient across time involves:

$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} D_k W, \qquad D_k = \operatorname{diag}\!\big(\phi'(W h_{k-1} + U x_k)\big).$$
Long sequences therefore create exponential decay or growth.
This prevented early RNNs from learning long-range dependencies effectively.
Exploding Gradients
Exploding gradients occur when:

$$\left\| J_1^\top J_2^\top \cdots J_L^\top \, \nabla_{h_L} \mathcal{L} \right\| \to \infty.$$
Small perturbations become enormous.
Optimization becomes unstable.
Symptoms include:
| Symptom | Effect |
|---|---|
| Huge parameter updates | Divergence |
| NaN gradients | Invalid arithmetic |
| Oscillating loss | Failure to converge |
| Overflow | Numerical instability |
Example of Exponential Growth
Suppose each Jacobian has norm:

$$\|J_k\| \approx c, \qquad c > 1.$$

Then after 100 layers the gradient can grow by a factor of roughly $c^{100}$; even $c = 1.1$ gives $1.1^{100} \approx 1.4 \times 10^{4}$.
Tiny perturbations amplify dramatically.
Exploding in RNNs
Exploding gradients were historically common in recurrent networks.
Long sequences create repeated matrix multiplication:

$$\frac{\partial h_T}{\partial h_0} \approx \prod_{t=1}^{T} D_t W.$$

If the spectral radius satisfies:

$$\rho(W) > 1,$$
the system becomes unstable.
Conditioning and Singular Values
Gradient stability depends strongly on singular values.
Suppose $\sigma_{\max}(J)$ is the largest singular value of a layer Jacobian $J$.

Then:

$$\|J v\| \le \sigma_{\max}(J)\, \|v\|.$$

Repeated norms above $1$ create explosion.

Repeated norms below $1$ create vanishing.

Stable propagation ideally keeps singular values near $1$.
Deep Linear Networks
Even purely linear networks exhibit these effects.
Consider:

$$y = W_L W_{L-1} \cdots W_1 x.$$
The gradient depends on the same matrix products.
Nonlinearity is not required for vanishing or explosion.
Depth alone creates the problem.
Residual Connections
Residual networks improve gradient flow using skip connections:

$$h_{k+1} = h_k + F(h_k).$$

The Jacobian becomes:

$$\frac{\partial h_{k+1}}{\partial h_k} = I + \frac{\partial F}{\partial h_k}.$$
The identity term helps preserve gradient magnitude.
Gradients can propagate through the skip path even when $\partial F / \partial h_k$ contracts.
This was a major breakthrough enabling very deep networks.
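A small sketch (hypothetical setup, not from the original text) compares a plain stack of contractive layers with a residual stack whose backward factor is $(I + \partial F/\partial h)^\top$:

```python
import numpy as np

rng = np.random.default_rng(2)
dim, depth = 64, 40
g_plain = rng.normal(size=dim)
g_resid = g_plain.copy()

for _ in range(depth):
    # Contractive branch Jacobian (operator norm well below 1).
    Jf = 0.3 * rng.normal(size=(dim, dim)) / np.sqrt(dim)
    g_plain = Jf.T @ g_plain            # plain stack: gradient shrinks exponentially
    g_resid = g_resid + Jf.T @ g_resid  # residual: identity path keeps the gradient alive

print("plain   :", np.linalg.norm(g_plain))
print("residual:", np.linalg.norm(g_resid))
```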
ReLU Activations
ReLU:

$$\mathrm{ReLU}(x) = \max(0, x)$$

has derivative:

$$\mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}$$

For positive pre-activations the local derivative is exactly $1$, so gradient magnitude is preserved better than with sigmoid activations.
This significantly reduced vanishing-gradient problems in feedforward networks.
However, ReLU introduces dead neurons when pre-activations remain negative, since the local derivative is zero there.
Orthogonal Initialization
Initialization strongly affects gradient stability.
If weight matrices are orthogonal:

$$W^\top W = I,$$

then all singular values equal $1$.
This preserves vector norms.
Orthogonal initialization therefore improves gradient propagation in deep systems.
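One common construction (a sketch using NumPy's QR factorization) and a check that the resulting matrix preserves norms:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 128

# QR factorization of a random Gaussian matrix yields an orthogonal Q.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))

print(np.allclose(Q.T @ Q, np.eye(n)))         # True: Q is orthogonal
print(np.linalg.svd(Q, compute_uv=False)[:5])  # singular values are all 1

v = rng.normal(size=n)
print(np.allclose(np.linalg.norm(Q @ v), np.linalg.norm(v)))  # True: norms preserved
```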
Xavier Initialization
Xavier initialization attempts to preserve activation variance:

$$\operatorname{Var}(W_{ij}) = \frac{2}{n_{\text{in}} + n_{\text{out}}}.$$
The goal is to stabilize both forward activations and backward gradients.
He Initialization
For ReLU networks:

$$\operatorname{Var}(W_{ij}) = \frac{2}{n_{\text{in}}}.$$
This compensates for the sparsity induced by ReLU activations.
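A minimal sketch of both scaling rules, assuming the standard Glorot and He variance formulas stated above (the helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def xavier_init(fan_in, fan_out):
    """Glorot/Xavier: variance 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def he_init(fan_in, fan_out):
    """He: variance 2 / fan_in, compensating for ReLU zeroing half the units."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W1 = xavier_init(512, 512)
W2 = he_init(512, 512)
print(W1.std(), W2.std())  # empirical standard deviations match the target scales
```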
Batch Normalization
Batch normalization rescales activations:

$$\hat{x} = \frac{x - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta,$$

where $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^2$ are computed over the current mini-batch.
This improves conditioning and stabilizes gradient propagation.
Normalization reduces internal covariate shift and smooths optimization dynamics.
Layer Normalization
Transformers commonly use layer normalization:

$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu(x)}{\sqrt{\sigma^2(x) + \epsilon}} + \beta,$$

where the mean and variance are computed over the feature dimension of each token.
This stabilizes hidden-state magnitude across layers.
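A minimal NumPy version of the normalization step (illustrative; $\gamma$ and $\beta$ are left at their default values):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each row (one token/example) over its feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.default_rng(5).normal(3.0, 10.0, size=(4, 16))  # badly scaled activations
y = layer_norm(x)
print(y.mean(axis=-1))  # ~0 per row
print(y.std(axis=-1))   # ~1 per row
```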
Gradient Clipping
Gradient clipping prevents catastrophic explosion.
Global norm clipping:

$$g \leftarrow g \cdot \min\!\left(1, \frac{c}{\|g\|}\right).$$
This limits update magnitude.
Clipping does not eliminate the underlying instability, but it prevents numerical divergence.
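A sketch of global-norm clipping over a list of gradient arrays (`clip_by_global_norm` is a hypothetical helper mirroring the rule above):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients jointly so their global L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

grads = [np.full((3, 3), 100.0), np.full(5, -50.0)]  # artificially exploded gradients
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)                                   # large
print(np.sqrt(sum(np.sum(g * g) for g in clipped)))  # ~1.0 after clipping
```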
Gated Recurrent Units
LSTM and GRU architectures were designed specifically to combat vanishing gradients.
LSTM memory cells maintain additive state updates:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t.$$
The additive pathway preserves gradients more effectively than repeated multiplicative recurrence.
Spectral Radius Control
For recurrent matrices $W$, stability depends on the spectral radius:

$$\rho(W) = \max_i |\lambda_i(W)|.$$

Constraining:

$$\rho(W) \le 1$$
helps maintain stable gradient flow.
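One simple sketch of such a constraint (illustrative; `rescale_spectral_radius` is a hypothetical helper): rescale the recurrent matrix so its spectral radius sits at a chosen target.

```python
import numpy as np

rng = np.random.default_rng(6)

def rescale_spectral_radius(W, target=1.0):
    """Divide W by its spectral radius, then scale to the target value."""
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (target / radius)

W = rng.normal(size=(100, 100))
W = rescale_spectral_radius(W, target=0.95)
print(np.max(np.abs(np.linalg.eigvals(W))))  # ~0.95
```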
Attention Mechanisms
Attention partially bypasses long sequential chains.
Instead of propagating information only through recurrence, attention creates shorter dependency paths.
This substantially improves gradient flow across long contexts.
Transformers therefore avoid some classic recurrent instability problems.
Hessian Perspective
Gradient explosion often corresponds to sharp curvature.
Suppose the Hessian is:

$$H = \nabla^2_\theta \mathcal{L}.$$

Large eigenvalues of $H$ imply:
- extreme local sensitivity,
- unstable optimization trajectories.
Vanishing gradients often correspond to flat regions where curvature becomes tiny.
Chaotic Dynamics
Some systems intrinsically amplify perturbations.
Examples:
- differentiable simulators,
- fluid dynamics,
- chaotic recurrent systems,
- long-horizon reinforcement learning.
In such systems, exploding gradients may reflect real physical sensitivity rather than poor implementation.
Mixed Precision Effects
Low precision worsens both problems.
Vanishing:
- tiny gradients underflow to zero.
Explosion:
- large gradients overflow to infinity.
Loss scaling and higher precision accumulators mitigate these effects.
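A quick float16 illustration of both failure modes and the usual loss-scaling workaround (the scale factor here is arbitrary):

```python
import numpy as np

tiny, huge = 1e-8, 1e5
print(np.float16(tiny))  # 0.0 -> a vanishing gradient underflows in half precision
print(np.float16(huge))  # inf -> an exploding gradient overflows (fp16 max ~65504)

# Loss scaling: multiply the loss (hence gradients) by a factor before the backward
# pass, then unscale and accumulate in float32 afterwards.
scale = 1024.0
scaled_grad = np.float16(tiny * scale)  # representable after scaling
print(np.float32(scaled_grad) / scale)  # ~1e-8 recovered in float32
```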
Gradient Noise
Stochastic optimization introduces gradient noise:

$$\hat{g} = g + \xi,$$

where $\xi$ arises from mini-batch sampling.
Vanishing gradients become dominated by noise.
Exploding gradients amplify noise dramatically.
This destabilizes training dynamics.
Diagnostics
Useful diagnostics include:
| Diagnostic | Meaning |
|---|---|
| Gradient norm | Detect explosion/vanishing |
| Layer-wise gradient scale | Localize instability |
| Activation histograms | Detect saturation |
| Singular value analysis | Study propagation |
| NaN detection | Catch overflow |
| Loss spikes | Detect instability |
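A sketch of the first two diagnostics, assuming gradients are available as a name-to-array mapping (the layer names and thresholds here are placeholders):

```python
import numpy as np

def gradient_report(named_grads):
    """Print per-layer gradient norms and flag NaN/Inf or suspicious magnitudes."""
    for name, g in named_grads.items():
        norm = float(np.linalg.norm(g))
        flag = ""
        if not np.isfinite(norm):
            flag = "  <-- NaN/Inf"
        elif norm > 1e3:
            flag = "  <-- possible explosion"
        elif norm < 1e-7:
            flag = "  <-- possible vanishing"
        print(f"{name:>12s}: {norm:.3e}{flag}")

# Hypothetical gradients for a three-layer model.
rng = np.random.default_rng(7)
gradient_report({
    "layer1": 1e-9 * rng.normal(size=(64, 64)),
    "layer2": rng.normal(size=(64, 64)),
    "layer3": 1e4 * rng.normal(size=(64, 64)),
})
```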
Dynamical Isometry
An ideal deep network approximately preserves vector norms:

$$\|J v\| \approx \|v\| \quad \text{for the end-to-end Jacobian } J.$$
This condition is called dynamical isometry.
Networks near dynamical isometry exhibit:
- stable gradients,
- faster optimization,
- improved trainability.
Automatic Differentiation Perspective
Automatic differentiation itself is mathematically exact under exact arithmetic.
Vanishing and exploding gradients arise because:
- the derivative structure of the program contains repeated contractions or expansions,
- floating-point arithmetic underflows or overflows at the resulting extreme magnitudes,
- optimization depends on propagating finite numerical signals through deep graphs.
AD faithfully computes the unstable gradients implied by the computational structure.
Core Idea
Gradient vanishing and explosion arise from repeated Jacobian multiplication during reverse propagation. Deep or recurrent computational graphs naturally amplify or suppress gradients exponentially depending on their local derivative structure. Stable differentiable systems therefore require architectural, numerical, and optimization strategies that preserve gradient magnitude and maintain well-conditioned backward dynamics.