
Chapter 17. Numerical and Systems Concerns


Floating Point Error

Automatic differentiation computes derivatives by executing arithmetic. On a real machine, arithmetic uses finite precision. This means AD gives the derivative of the implemented floating point program, not always the exact derivative of the ideal real-valued mathematical function.

That distinction matters.

Suppose a program represents a function

$$f : \mathbb{R}^n \to \mathbb{R}^m.$$

Mathematically, we may want the Jacobian

$$J_f(x) = \frac{\partial f(x)}{\partial x}.$$

But the computer evaluates a floating point approximation

$$\widehat{f}(x),$$

where every addition, multiplication, division, transcendental function, and memory conversion may introduce rounding error. AD differentiates the operations actually executed. In practice, it propagates derivatives through

$$\widehat{f},$$

not through the idealized $f$.

This is usually acceptable. It is also the source of many subtle failures.

Floating Point Numbers

A floating point number represents a real number using three parts:

$$x = (-1)^s \cdot m \cdot b^e.$$

Here $s$ is the sign, $m$ is the significand, $b$ is the base, and $e$ is the exponent.

Most modern systems use IEEE 754 binary floating point. The common formats are:

| Format | Significand precision | Exponent bits | Approximate decimal digits |
| --- | --- | --- | --- |
| float16 | 11 bits | 5 | 3 to 4 |
| bfloat16 | 8 bits | 8 | 2 to 3 |
| float32 | 24 bits | 8 | 7 |
| float64 | 53 bits | 11 | 15 to 16 |

The key point is simple: only finitely many real numbers can be represented. Every operation must round its result back into the finite set.

For a basic floating point operation, we often model the computed result as

$$\operatorname{fl}(a \circ b) = (a \circ b)(1 + \delta),$$

where $\circ$ is one of $+, -, \times, /$, and $|\delta| \leq u$. The number $u$ is the unit roundoff.

This model is not exact for every edge case, but it is a useful first approximation.
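The rounding model can be checked directly in float64, where the unit roundoff is $u = 2^{-53}$. A small sketch, using Python's exact `Fraction` arithmetic as the reference:

```python
import sys
from fractions import Fraction

u = 2.0 ** -53                      # unit roundoff for float64
assert sys.float_info.epsilon == 2 * u

# 1.0 + u is a halfway case and rounds (to even) back to 1.0; 1.0 + 2u does not.
assert 1.0 + u == 1.0
assert 1.0 + 2 * u > 1.0

# Check fl(a*b) = (a*b)(1 + delta) with |delta| <= u for one multiplication.
a, b = 0.1, 0.3
exact = Fraction(a) * Fraction(b)   # exact product of the two stored floats
computed = Fraction(a * b)          # the rounded float64 product
delta = abs(computed - exact) / exact
assert delta <= Fraction(1, 2 ** 53)
```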

Error in Function Values

Forward evaluation already contains error before differentiation is considered.

For example:

$$f(x) = (x + 1) - x.$$

Mathematically, $f(x) = 1$. But in floating point, when $x$ is very large, $x + 1$ may round back to $x$. Then the computed value becomes

$$\widehat{f}(x) = x - x = 0.$$

The exact function has derivative

$$f'(x) = 0.$$

The floating point program is locally constant over ranges of large $x$, so its computed derivative is also $0$. In this example, the derivative happens to match. But the function value is wrong.
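The absorption is easy to reproduce in float64, where the spacing between adjacent representable numbers near $10^{16}$ is $2$. A minimal sketch:

```python
x = 1.0e16            # spacing between adjacent float64 values here is 2.0
assert x + 1.0 == x   # x + 1 is a halfway case and rounds back to x

f_hat = (x + 1.0) - x
assert f_hat == 0.0   # the implemented program returns 0, not 1
```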

Now consider

$$g(x) = \frac{(x + 1) - x}{1}.$$

The symbolic expression is still constant. The AD result may still be $0$. The problem is not AD. The problem is that the primal computation has already lost the information.

AD cannot recover numerical information that the forward pass destroyed.

Error in Derivatives

Derivative error comes from two places.

First, the primal values may be inaccurate. Since local derivatives are evaluated at primal values, the derivative path receives those inaccuracies.

Second, derivative accumulations also use floating point arithmetic. In reverse mode, many adjoints may be summed into the same variable. The order of summation can change the result.

Consider

$$y = x_1 + x_2 + \cdots + x_n.$$

The exact derivative is simple:

$$\frac{\partial y}{\partial x_i} = 1.$$

But in a large computational graph, a variable may receive adjoint contributions from thousands or millions of paths. The accumulated adjoint has the form

$$\bar{x} = \sum_k \bar{v}_k \frac{\partial v_k}{\partial x}.$$

In floating point, this sum depends on order. Adding small values after very large values may discard the small values. Parallel reductions make this more visible because different execution schedules may produce different summation trees.
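Order dependence shows up even for four terms. A small float64 sketch, with `math.fsum` as a correctly rounded reference:

```python
import math

values = [1.0e16, 1.0, -1.0e16, 1.0]   # exact sum is 2.0

left_to_right = 0.0
for v in values:
    left_to_right += v                 # each 1.0 may be absorbed by 1e16

ascending = 0.0
for v in sorted(values):               # a different summation order
    ascending += v

exact = math.fsum(values)              # correctly rounded sum

# Three different answers from the same values:
assert left_to_right == 1.0
assert ascending == 0.0
assert exact == 2.0
```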

Catastrophic Cancellation

Cancellation occurs when two nearly equal numbers are subtracted.

For example:

$$h(x) = \sqrt{x + 1} - \sqrt{x}.$$

For large $x$, both terms are close. Their subtraction loses significant digits.

A numerically better form is

$$h(x) = \frac{1}{\sqrt{x + 1} + \sqrt{x}}.$$

These two formulas are mathematically equivalent. They are not numerically equivalent.

AD will differentiate whichever program it receives. If the unstable form is used, the derivative calculation inherits the instability. If the stable form is used, the derivative calculation usually improves as well.
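The difference is visible at, say, $x = 10^{12}$, where both square roots are near $10^6$. A float64 sketch:

```python
import math

x = 1.0e12
unstable = math.sqrt(x + 1.0) - math.sqrt(x)        # cancellation
stable = 1.0 / (math.sqrt(x + 1.0) + math.sqrt(x))  # no cancellation

# Both approximate 5e-7, but the subtraction keeps only a few digits.
assert abs(stable - 5.0e-7) < 1.0e-15
assert unstable != stable
assert abs(unstable - stable) / stable < 1.0e-3
```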

This is one of the most important rules in numerical AD:

The derivative program is only as stable as the primal program plus the derivative accumulation strategy.

Forward Mode Error

In forward mode, each value carries a primal and a tangent:

$$(x, \dot{x}).$$

Each operation updates both.

For multiplication:

$$z = xy,$$

the tangent is

$$\dot{z} = \dot{x}y + x\dot{y}.$$

Floating point error appears in both the primal multiplication and the tangent expression. The tangent calculation has an addition, so it can also suffer cancellation.

For example, if

$$\dot{x}y \approx -x\dot{y},$$

then the tangent result may lose many significant digits.

Forward mode has a useful property: tangent values are computed in the same direction as primal values. This often makes local error easier to reason about. The drawback is that tangent propagation can still amplify instability in long chains of operations.
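The pairing of primal and tangent can be sketched with a minimal dual-number class (an illustration, not any particular library's API):

```python
class Dual:
    """A value paired with its tangent: (x, x_dot)."""

    def __init__(self, primal, tangent=0.0):
        self.primal = primal
        self.tangent = tangent

    def __add__(self, other):
        return Dual(self.primal + other.primal,
                    self.tangent + other.tangent)

    def __mul__(self, other):
        # z = x y  =>  z_dot = x_dot y + x y_dot
        return Dual(self.primal * other.primal,
                    self.tangent * other.primal + self.primal * other.tangent)

# d/dx of f(x, y) = x*y + x at (3, 4): seed x with tangent 1.
x = Dual(3.0, 1.0)
y = Dual(4.0)
z = x * y + x
assert z.primal == 15.0
assert z.tangent == 5.0    # df/dx = y + 1
```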

Reverse Mode Error

Reverse mode stores or reconstructs primal values, then runs an adjoint computation backward.

For an operation

$$z = xy,$$

reverse mode applies

$$\bar{x} \mathrel{+}= \bar{z}y, \qquad \bar{y} \mathrel{+}= \bar{z}x.$$

The symbol $\mathrel{+}=$ is important. Reverse mode accumulates. A variable may receive contributions from many later operations.

This creates three numerical issues.

First, adjoint accumulation can be order-dependent.

Second, stored primal values may be rounded, compressed, rematerialized, or recomputed differently.

Third, checkpointing can change numerical behavior if recomputation does not exactly match the original forward pass.

In pure deterministic float64 code, this may be negligible. In mixed precision GPU training, it can be significant.
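The accumulation pattern can be sketched with a minimal tape (illustrative names, not a real library's API):

```python
class Var:
    """A value on a shared tape; backward passes accumulate into .grad."""

    def __init__(self, value, tape=None):
        self.value = value
        self.grad = 0.0
        self.tape = tape if tape is not None else []

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        def backward():
            # x_bar += z_bar * y ;  y_bar += z_bar * x
            self.grad += out.grad * other.value
            other.grad += out.grad * self.value
        self.tape.append(backward)
        return out

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        self.tape.append(backward)
        return out

tape = []
x = Var(3.0, tape)
z = x * x + x              # x appears on several paths: adjoints accumulate
z.grad = 1.0
for op in reversed(tape):  # replay the tape backward
    op()
assert z.value == 12.0
assert x.grad == 7.0       # dz/dx = 2x + 1
```

Each `+=` above is a floating point addition, so the order in which the tape replays contributions is exactly where the order dependence discussed earlier enters.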

The Derivative of the Program

A useful way to think about AD is:

$$\operatorname{AD}(P)$$

computes derivatives of program $P$, not derivatives of the mathematical expression the programmer had in mind.

This matters when the program contains:

| Program feature | Numerical consequence |
| --- | --- |
| Rounding | Piecewise constant behavior at small scales |
| Branches | Discontinuous derivative structure |
| Overflow | Infinite or NaN values |
| Underflow | Silent loss of small values |
| Mixed precision | Different primal and adjoint accuracy |
| Parallel reductions | Non-deterministic accumulation order |
| Approximate kernels | Derivatives of approximations |
| Fused operations | Different rounding than unfused code |

For example, the fused multiply-add operation computes

$$\operatorname{fma}(a, b, c) = ab + c$$

with a single rounding step. The separated expression

$$a \times b + c$$

usually rounds twice. They can produce different primal values and therefore different derivative behavior in edge cases.

Non-Smoothness Introduced by Floating Point

Real-valued smooth functions may become piecewise constant or piecewise discontinuous after floating point implementation.

Consider

$$f(x) = x + \epsilon.$$

If $\epsilon$ is smaller than the spacing between adjacent floating point numbers near $x$, then

$$\operatorname{fl}(x + \epsilon) = x.$$

The implemented function is locally constant. Its real mathematical derivative is $1$, but the floating point mapping from representable inputs to representable outputs behaves like the identity or a constant map depending on scale and rounding.

Most AD systems ignore the derivative of rounding itself. They differentiate the arithmetic expression as if operations were real-valued, while using floating point values for evaluation.

This convention is practical. Treating rounding as a literal discrete operation would make most derivatives zero or undefined.

So AD occupies a middle position:

It uses floating point execution, but applies real calculus rules to the executed operations.
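The locally constant behavior is easy to observe with `math.ulp`, which gives the spacing of floats near a value:

```python
import math

x = 1.0
eps = 1e-20                # far smaller than the spacing near 1.0
assert eps < math.ulp(x) / 2
assert x + eps == x        # fl(x + eps) == x: the program is locally constant
```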

Overflow and Underflow

Overflow occurs when a result is too large to represent. Underflow occurs when a result is too small to represent normally.

The exponential function is a common example:

$$f(x) = e^x.$$

For large positive $x$, $e^x$ overflows. For large negative $x$, it underflows toward zero.

The derivative is

$$f'(x) = e^x.$$

So the derivative overflows or underflows in the same regions.

A more subtle example is softmax:

$$\operatorname{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}.$$

The stable implementation subtracts the maximum value:

$$\operatorname{softmax}(x_i) = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}, \quad m = \max_j x_j.$$

This does not change the mathematical result, but it prevents unnecessary overflow. AD through the stable implementation is also more reliable.
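Both versions can be sketched side by side; the naive one fails as soon as any input exceeds the overflow threshold of float64 (about 709 for `math.exp`):

```python
import math

def softmax_naive(xs):
    exps = [math.exp(x) for x in xs]          # exp overflows for large inputs
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]      # largest exponent is now 0
    total = sum(exps)
    return [e / total for e in exps]

xs = [1000.0, 1001.0, 1002.0]

try:
    softmax_naive(xs)                         # math.exp(1000.0) overflows
    overflowed = False
except OverflowError:
    overflowed = True

probs = softmax_stable(xs)
assert overflowed
assert abs(sum(probs) - 1.0) < 1e-12
assert probs[0] < probs[1] < probs[2]
```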

Mixed Precision

Modern ML systems often use mixed precision:

| Quantity | Common precision |
| --- | --- |
| Activations | float16, bfloat16, float32 |
| Weights | float16, bfloat16, float32 |
| Gradients | float16, bfloat16, float32 |
| Accumulators | float32 |
| Optimizer state | float32 or float64 |

Mixed precision improves throughput and memory use, but changes numerical behavior.

The most common problem is gradient underflow. Small gradients may become zero in float16. Loss scaling is a practical workaround. Instead of differentiating the loss $L$, the system differentiates

$$\alpha L,$$

where $\alpha$ is a scale factor. After backpropagation, gradients are divided by $\alpha$.

This preserves more gradient information during low-precision computation.
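The underflow-and-rescue mechanism can be simulated in pure Python by round-tripping values through IEEE 754 binary16 with `struct`; the `to_float16` helper below is a hypothetical name for illustration:

```python
import struct

def to_float16(x):
    # Round-trip through IEEE 754 binary16 ('e') to simulate float16 storage.
    return struct.unpack('<e', struct.pack('<e', x))[0]

grad = 1e-8                         # below float16's smallest subnormal (~6e-8)
assert to_float16(grad) == 0.0      # the gradient underflows to zero

alpha = 1024.0                      # loss scale: backprop through alpha * L
scaled = to_float16(grad * alpha)   # ~1.0e-5 survives in float16
recovered = scaled / alpha          # unscale in higher precision
assert recovered != 0.0
assert abs(recovered - grad) / grad < 0.01
```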

Numerical Testing of AD

Finite differences are often used to test AD implementations:

$$f'(x) \approx \frac{f(x + h) - f(x)}{h}.$$

But finite differences have their own floating point error.

If $h$ is too large, truncation error dominates. If $h$ is too small, cancellation dominates.

A centered difference is usually better:

$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}.$$

Even then, numerical derivative checks should use tolerances, not exact equality.

A practical test compares:

$$\left|g_{\text{AD}} - g_{\text{FD}}\right|$$

and

$$\frac{\left|g_{\text{AD}} - g_{\text{FD}}\right|}{\max\left(1, \left|g_{\text{AD}}\right|, \left|g_{\text{FD}}\right|\right)}.$$

Both absolute and relative error matter.
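The comparison can be packaged as a small checker; `grad_check` and its tolerance parameters are hypothetical names chosen for this sketch:

```python
import math

def grad_check(f, grad_f, x, h=1e-6, rtol=1e-5, atol=1e-8):
    # Centered finite difference against the claimed derivative,
    # accepted if either the absolute or the relative error is small.
    g_fd = (f(x + h) - f(x - h)) / (2.0 * h)
    g_ad = grad_f(x)
    err = abs(g_ad - g_fd)
    scale = max(1.0, abs(g_ad), abs(g_fd))
    return err <= atol or err / scale <= rtol

assert grad_check(math.sin, math.cos, 0.7)        # correct derivative passes
assert not grad_check(math.sin, math.sin, 0.7)    # wrong derivative is caught
```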

Practical Guidelines

Use stable primal formulas. AD does not repair unstable algebra.

Prefer float64 for debugging derivative correctness. Move to lower precision only after the derivative path is trusted.

Use stable reductions for large sums. Pairwise summation, tree reductions, or compensated summation can reduce error.
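As an example of a compensated method, Kahan summation carries the rounding error of each addition in a correction term; a sketch, with `math.fsum` as the correctly rounded reference:

```python
import math

def kahan_sum(values):
    # Compensated summation: c tracks the low-order bits lost at each step.
    s, c = 0.0, 0.0
    for v in values:
        y = v - c
        t = s + y
        c = (t - s) - y       # the part of y that did not make it into t
        s = t
    return s

values = [0.1] * 10 ** 6
naive = 0.0
for v in values:
    naive += v                # accumulates rounding drift
exact = math.fsum(values)     # correctly rounded reference

assert abs(kahan_sum(values) - exact) <= abs(naive - exact)
assert abs(kahan_sum(values) - exact) < 1e-9
```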

Be careful with branches near thresholds. A tiny rounding difference can select a different branch and produce a different derivative.

Check for NaN and Inf in both primals and adjoints. A valid forward value does not guarantee a valid gradient.

Use mixed precision deliberately. Keep accumulators and optimizer states in higher precision when possible.

Test derivatives across scales. A function may pass checks near $x = 1$ and fail near $x = 10^{20}$ or $x = 10^{-20}$.

Core Idea

Floating point error does not make automatic differentiation invalid. It means AD must be understood as differentiation of an executed numerical program. The calculus rules may be exact, while the values they operate on are approximate. For reliable systems, numerical stability is part of the AD design, not an afterthought.