Reverse mode automatic differentiation computes gradients by propagating adjoint values backward through a computational graph. In exact arithmetic, the reverse accumulation rules are mathematically correct applications of the chain rule. In finite precision arithmetic, however, reverse propagation can become numerically unstable.
The instability does not usually come from reverse mode itself. It emerges from the interaction between:
- floating point rounding,
- ill-conditioned computations,
- long dependency chains,
- repeated accumulation,
- unstable primal programs,
- low precision arithmetic,
- and extreme gradient magnitudes.
Understanding reverse-mode stability requires viewing gradients as dynamical quantities that evolve backward through the graph.
Reverse Propagation as Backward Dynamics
Suppose a program computes a composition

$$y = f(x) = (f_L \circ \cdots \circ f_2 \circ f_1)(x).$$

Reverse mode propagates adjoints: starting from $\bar{y}$, it computes $\bar{x} = J^T \bar{y}$, where $J = \partial y / \partial x$.

For a sequence of intermediate states

$$x_0 \xrightarrow{f_1} x_1 \xrightarrow{f_2} \cdots \xrightarrow{f_L} x_L = y,$$

the chain rule gives

$$J = J_L J_{L-1} \cdots J_1, \qquad J_k = \frac{\partial f_k}{\partial x_{k-1}}.$$

Reverse mode propagates gradients through the transposes of these local Jacobians:

$$\bar{x}_{k-1} = J_k^T \, \bar{x}_k, \qquad k = L, \dots, 1.$$
This means reverse mode repeatedly applies linear transformations backward through the graph.
If the Jacobians are poorly conditioned, the backward dynamics can amplify numerical error.
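The backward sweep can be sketched in a few lines of numpy. This is an illustrative toy, not any framework's API: it pulls the adjoint through the transposed local Jacobians, last operation first, and checks that this equals $J^T \bar{y}$ without ever forming the full Jacobian.

```python
import numpy as np

# Toy chain of local Jacobians J_1..J_4 (names and shapes illustrative).
rng = np.random.default_rng(0)
jacobians = [rng.standard_normal((3, 3)) for _ in range(4)]

def full_jacobian(jacobians):
    # Forward composition: J = J_4 @ J_3 @ J_2 @ J_1
    J = np.eye(3)
    for Jk in jacobians:
        J = Jk @ J
    return J

def reverse_vjp(jacobians, ybar):
    # Backward sweep: xbar_{k-1} = J_k^T @ xbar_k, last operation first
    xbar = ybar
    for Jk in reversed(jacobians):
        xbar = Jk.T @ xbar
    return xbar

ybar = np.array([1.0, 0.0, 0.0])
# The reverse sweep computes J^T @ ybar without materializing J.
assert np.allclose(reverse_vjp(jacobians, ybar),
                   full_jacobian(jacobians).T @ ybar)
```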
Error Amplification
Suppose each reverse step introduces a small perturbation $\varepsilon_k$:

$$\tilde{\bar{x}}_{k-1} = J_k^T \, \tilde{\bar{x}}_k + \varepsilon_k.$$

Unrolling the recursion gives

$$\tilde{\bar{x}}_0 = \bar{x}_0 + \sum_{k=1}^{L} \left( J_1^T J_2^T \cdots J_{k-1}^T \right) \varepsilon_k.$$
Small rounding errors may therefore be amplified by the matrix products.
If the Jacobian chain has large singular values, gradient errors grow exponentially.
If the Jacobian chain has very small singular values, gradients collapse toward zero.
This is the same mechanism behind exploding and vanishing gradients.
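A minimal demonstration, with deliberately simple diagonal Jacobians so the singular values are explicit: a tiny injected perturbation grows along the expanding direction and decays along the contracting one.

```python
import numpy as np

# Diagonal Jacobian: direction 0 expands by 1.2 per step,
# direction 1 contracts by 0.8 per step.
J = np.diag([1.2, 0.8])

exact = np.array([1.0, 1.0])
perturbed = exact + 1e-10       # inject a tiny "rounding" error
for _ in range(50):
    exact = J.T @ exact
    perturbed = J.T @ perturbed

err = np.abs(perturbed - exact)
# err[0] grew by 1.2**50 (~9e3); err[1] shrank by 0.8**50 (~1e-5).
```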
Stability and Condition Numbers
For a function

$$f : \mathbb{R}^n \to \mathbb{R}^m,$$

the conditioning of the derivative depends on the Jacobian $J(x) = \partial f / \partial x$.

The condition number measures sensitivity:

$$\kappa(J) = \frac{\sigma_{\max}(J)}{\sigma_{\min}(J)}.$$
Large condition numbers imply that small perturbations in inputs or intermediate values may cause large changes in outputs or gradients.
Reverse mode inherits the conditioning of the primal computation.
An unstable primal program produces unstable gradients.
This principle appears repeatedly across numerical optimization, scientific computing, and deep learning.
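As a quick sketch, the condition number of a Jacobian is just the ratio of its extreme singular values, which `np.linalg.cond` computes directly (the diagonal matrix here is an illustrative stand-in for a real Jacobian):

```python
import numpy as np

# A Jacobian with singular values 1e4, 1, 1e-4: perturbations along
# different directions are treated very differently.
J = np.diag([1e4, 1.0, 1e-4])
kappa = np.linalg.cond(J)   # sigma_max / sigma_min = 1e8
```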
Instability from Long Chains
Long computational chains are particularly dangerous.
Consider a scalar chain of repeated multiplications:

$$x_k = a_k \, x_{k-1}, \qquad k = 1, \dots, n.$$

Then

$$\frac{\partial x_n}{\partial x_0} = \prod_{k=1}^{n} a_k.$$

If

$$|a_k| > 1,$$

the gradient may explode.

If

$$|a_k| < 1,$$

the gradient may vanish.
This is fundamental to recurrent neural networks and deep residual systems.
For example, if every local derivative is approximately $0.99$, then after $1000$ steps the gradient scales as $0.99^{1000} \approx 4 \times 10^{-5}$.
Even though each individual operation appears stable, the global gradient nearly disappears.
Conversely, a factor only slightly above $1$ compounds the other way: $1.01^{1000} \approx 2 \times 10^{4}$. Tiny local amplification compounds into catastrophic growth.
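Both regimes are easy to check numerically; per-step factors barely away from $1$ compound dramatically over a long chain:

```python
# Factors barely below or above 1 compound over 1000 steps.
shrink = 0.99 ** 1000   # ~4e-5: the gradient all but vanishes
grow = 1.01 ** 1000     # ~2e4: the gradient explodes

assert shrink < 1e-4 and grow > 1e4
```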
Reverse Accumulation Instability
Reverse mode repeatedly performs adjoint accumulation:

$$\bar{v} \leftarrow \bar{v} + \frac{\partial w}{\partial v}\, \bar{w}$$

for each operation $w$ that consumes the variable $v$.
A variable may receive contributions from many paths.
This creates several numerical problems.
Accumulation Order
Floating point addition is not associative:

$$(a + b) + c \neq a + (b + c) \quad \text{in general}.$$
Parallel reductions may therefore produce slightly different gradients depending on execution order.
In large GPU systems, this can create nondeterministic training behavior.
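A deterministic three-value example makes the non-associativity concrete. All three values are exactly representable in float32, so the two reduction orders reliably disagree:

```python
import numpy as np

# (a + b) + c cancels first, then adds c; a + (b + c) absorbs c into
# -1e8 first (float32 spacing near 1e8 is 8.0), losing it entirely.
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)

left = (a + b) + c    # cancellation first: result 1.0
right = a + (b + c)   # absorption first: result 0.0
```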
Large Dynamic Ranges
Suppose one contribution is large, say

$$g_1 \approx 10^{8},$$

and another is many orders of magnitude smaller, say

$$g_2 \approx 10^{-2}.$$

Adding the smaller value may have no effect in finite precision arithmetic.
Tiny gradient contributions disappear.
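The absorption is easy to observe in float32, where the spacing between adjacent representable values near $10^8$ is $8.0$:

```python
import numpy as np

# A 0.01 contribution is smaller than half the float32 spacing at 1e8,
# so it is rounded away entirely.
big = np.float32(1e8)
tiny = np.float32(1e-2)
absorbed = big + tiny   # equals big: the small gradient is lost
```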
Cancellation
If two contributions are nearly opposite,

$$g_1 \approx -g_2,$$
the result loses significant digits.
Cancellation can destroy gradient accuracy even when the final gradient magnitude is moderate.
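A small float32 illustration: the true sum of the two contributions is $10^{-7}$, but after each input is rounded to float32, the difference retains almost no significant bits:

```python
import numpy as np

# g1 rounds to 1 + 2**-23 in float32, so the computed difference is
# 2**-23 ~ 1.19e-7 instead of the true 1e-7: ~19% relative error.
g1 = np.float32(1.0000001)
g2 = np.float32(-1.0)
total = g1 + g2
rel_err = abs(float(total) - 1e-7) / 1e-7
```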
Reverse Mode and Stored Primal Values
Reverse mode requires primal values during the backward pass.
There are three common approaches:
| Strategy | Description |
|---|---|
| Store everything | Save all intermediate values |
| Recompute | Recompute forward states during backward pass |
| Checkpoint | Store selected states and recompute segments |
Each affects stability differently.
Stored Values
If values are stored in reduced precision, backward derivatives inherit the reduced accuracy.
For example:
| Forward storage | Consequence |
|---|---|
| float64 | High accuracy |
| float32 | Moderate rounding |
| float16 | Significant quantization |
| compressed | Reconstruction error |
Recomputation
Recomputation may not exactly reproduce the original forward pass.
Sources of mismatch include:
- nondeterministic kernels,
- parallel reductions,
- stochastic operations,
- fused kernels,
- random number generation,
- hardware-specific math implementations.
Even tiny forward differences can produce noticeably different backward gradients in chaotic or ill-conditioned systems.
Stability of Elementary Reverse Rules
The local reverse rules themselves may contain unstable operations.
For division,

$$y = \frac{u}{v},$$

the reverse rules are

$$\bar{u} = \frac{\bar{y}}{v}, \qquad \bar{v} = -\frac{\bar{y}\, u}{v^2}.$$

If $v$ is small, gradients may explode.

Similarly, for logarithms,

$$y = \log u,$$

the reverse rule is

$$\bar{u} = \frac{\bar{y}}{u}.$$

Near $u = 0$, derivatives become extremely sensitive.
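The standard VJPs for these two operations can be written by hand (a sketch, not any framework's implementation), and a single small denominator shows the blowup:

```python
import numpy as np

# Hand-written reverse (VJP) rules for division and log.
def div_vjp(u, v, ybar):
    return ybar / v, -ybar * u / v ** 2

def log_vjp(u, ybar):
    return ybar / u

# Forward values are perfectly finite, yet a small denominator makes
# the adjoints enormous: ubar ~ 1e8, vbar ~ -1e16.
ubar, vbar = div_vjp(u=1.0, v=1e-8, ybar=1.0)
```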
The backward pass often has worse conditioning than the forward pass.
Stable Reformulations
Many unstable gradients arise from unstable primal formulas.
A stable reformulation usually improves both primal and gradient accuracy.
Example: Log-Sum-Exp
Direct evaluation of

$$\mathrm{LSE}(x) = \log \sum_i e^{x_i}$$

may overflow.

A stable form subtracts the maximum $m = \max_i x_i$:

$$\mathrm{LSE}(x) = m + \log \sum_i e^{x_i - m}.$$
This stabilizes both the forward pass and reverse gradients.
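In code, the max-subtraction trick means the largest exponent becomes $e^0 = 1$, so nothing overflows (a minimal sketch; real libraries ship more general versions such as `scipy.special.logsumexp`):

```python
import numpy as np

def logsumexp(x):
    """Stable log-sum-exp: subtract the max before exponentiating."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.0])
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(x)))   # exp(1000) overflows to inf
stable = logsumexp(x)                   # 1000 + log(2), finite
```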
Example: Softmax Cross Entropy
Separating softmax and logarithm is numerically unstable.
A fused stable implementation avoids catastrophic cancellation and overflow.
Modern frameworks therefore provide fused operators whose backward rules are carefully engineered.
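The idea behind such fused operators can be sketched as follows (the function name is ours, not a framework API): the log is applied to the logits via a stable log-softmax, never to a possibly underflowed probability.

```python
import numpy as np

def cross_entropy_fused(logits, target):
    """Cross entropy via stable log-softmax, with no explicit softmax."""
    m = np.max(logits)
    log_probs = logits - m - np.log(np.sum(np.exp(logits - m)))
    return -log_probs[target]

logits = np.array([1000.0, 0.0])
# Naively, softmax(logits)[1] underflows to 0.0 and log(0) = -inf;
# the fused form returns the correct finite loss of ~1000.
loss = cross_entropy_fused(logits, target=1)
```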
Reverse Mode in Deep Networks
Deep learning systems expose reverse-mode stability issues at scale.
Common pathologies include:
| Problem | Cause |
|---|---|
| Vanishing gradients | Jacobian contraction |
| Exploding gradients | Jacobian expansion |
| Dead activations | Saturating nonlinearities |
| NaN gradients | Overflow or invalid arithmetic |
| Gradient noise | Low precision accumulation |
| Training instability | Ill-conditioned optimization landscape |
Activation functions matter.
Sigmoid derivatives satisfy:

$$\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4}.$$

The derivative never exceeds $\tfrac{1}{4}$. Repeated multiplication therefore shrinks gradients rapidly.

ReLU avoids this contraction for positive activations:

$$\mathrm{ReLU}'(x) = 1 \quad \text{for } x > 0.$$
This partly explains why ReLU-based networks train more effectively.
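The contraction is easy to quantify: even at its peak ($x = 0$, where the derivative is exactly $1/4$), each sigmoid layer shrinks the backward signal fourfold.

```python
import numpy as np

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)), maximized at x = 0
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# Best case for a 50-layer sigmoid chain: 0.25**50 ~ 8e-31.
best_case_after_50_layers = 0.25 ** 50
```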
Mixed Precision Reverse Mode
Modern accelerators frequently use float16 or bfloat16.
Backward propagation in low precision introduces new issues:
| Issue | Description |
|---|---|
| Gradient underflow | Small gradients become zero |
| Quantization | Coarse gradient representation |
| Reduced accumulation accuracy | Summation error increases |
| Overflow | Large gradients exceed range |
Loss scaling mitigates underflow.
Instead of propagating the unscaled adjoint

$$\bar{L} = 1,$$

the system propagates a scaled adjoint

$$\bar{L} = S$$

for some scale factor $S \gg 1$. Gradients are later divided by $S$.
This shifts gradient magnitudes into a numerically safer range.
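A minimal sketch of the mechanism (the scale $S = 1024$ and gradient magnitude are illustrative): a gradient that underflows to zero in float16 survives when pre-multiplied by the scale and unscaled in higher precision.

```python
import numpy as np

true_grad = 1e-8                          # below float16's tiny range

unscaled = np.float16(true_grad)          # underflows to 0.0
scaled = np.float16(true_grad * 1024.0)   # survives the float16 cast
recovered = np.float32(scaled) / 1024.0   # unscale in higher precision
```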
Checkpointing and Stability
Checkpointing trades memory for recomputation.
Suppose the forward graph has depth $n$. Storing all intermediates costs $O(n)$ memory.
Checkpointing stores only selected states and recomputes missing segments during backward execution.
This affects stability because recomputed values may differ slightly from original values.
In deterministic float64 scientific computing, this difference may be tiny.
In stochastic GPU training with mixed precision, recomputation differences may noticeably perturb gradients.
Chaotic Systems
Some systems are fundamentally sensitive to perturbations.
Examples include:
- turbulent fluid simulation,
- chaotic dynamical systems,
- long-horizon recurrent models,
- differentiable physics,
- climate simulation.
In these systems, small perturbations grow as

$$\|\delta x(t)\| \approx e^{\lambda t} \, \|\delta x(0)\|,$$

where $\lambda > 0$ is a Lyapunov exponent.
Backward gradients may become exponentially unstable regardless of AD implementation quality.
This is a property of the modeled system itself.
Numerical Diagnostics
Several practical diagnostics help detect reverse-mode instability.
Gradient Norm Monitoring
Track the global gradient norm:

$$\|\nabla_\theta L\|.$$
Sudden spikes suggest exploding gradients.
Near-zero norms suggest vanishing gradients.
Finite Difference Validation
Compare AD gradients against numerical approximations.
Large disagreement may indicate instability or implementation bugs.
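A central-difference checker is a few lines (a sketch; here the "AD" gradient is hand-derived for a simple quadratic):

```python
import numpy as np

def fd_grad(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

f = lambda x: float(np.sum(x ** 2))
x = np.array([1.0, -2.0, 3.0])
analytic = 2.0 * x   # hand-derived gradient of sum(x**2)
# Large disagreement here would point at instability or a bug.
```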
NaN and Inf Detection
Monitor both forward activations and backward gradients.
A valid forward pass does not guarantee valid adjoints.
Gradient Histograms
Inspect gradient distributions layer-by-layer.
Stable systems typically show bounded, smoothly varying distributions.
Stabilization Techniques
Common stabilization methods include:
| Technique | Purpose |
|---|---|
| Gradient clipping | Limit exploding gradients |
| Residual connections | Preserve gradient flow |
| Normalization layers | Improve conditioning |
| Stable algebraic reformulations | Reduce cancellation |
| Higher precision accumulators | Reduce rounding error |
| Loss scaling | Prevent underflow |
| Orthogonal initialization | Preserve singular values |
| Careful activation choice | Improve Jacobian conditioning |
Gradient clipping is particularly common:

$$g \leftarrow g \cdot \min\!\left(1, \frac{\tau}{\|g\|}\right),$$

for a threshold $\tau$.
This prevents catastrophic updates from large gradients.
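Clip-by-norm can be sketched directly: when the gradient norm exceeds the threshold, the whole vector is rescaled, which preserves its direction.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad if its norm exceeds max_norm; otherwise leave it."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])                 # norm 5
clipped = clip_by_norm(g, max_norm=1.0)  # [0.6, 0.8], norm 1
```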
Reverse Mode as a Numerical Algorithm
It is tempting to think of reverse mode purely symbolically:

$$\bar{x} = J^T \bar{y}.$$
In practice, reverse mode is a large-scale numerical algorithm operating on finite precision hardware.
Its behavior depends on:
- arithmetic precision,
- graph structure,
- operation ordering,
- conditioning,
- reduction strategy,
- memory policy,
- hardware execution model.
Two mathematically equivalent graphs may have radically different numerical stability.
Core Idea
Reverse mode automatic differentiation is mathematically exact in exact arithmetic, but real systems execute in finite precision. Stability therefore depends on the conditioning of the primal computation, the structure of the backward graph, and the numerical properties of accumulation and recomputation. Reverse mode is best understood as a backward numerical propagation process whose stability must be engineered explicitly.