
Chapter 17. Numerical and Systems Concerns


Floating Point Error

Automatic differentiation computes derivatives by executing arithmetic. On a real machine, arithmetic uses finite precision. This means AD gives the derivative of the implemented floating point program, not always the exact derivative of the ideal real-valued mathematical function.

That distinction matters.

Suppose a program represents a function

$$f : \mathbb{R}^n \to \mathbb{R}^m.$$

Mathematically, we may want the Jacobian

$$J_f(x) = \frac{\partial f(x)}{\partial x}.$$

But the computer evaluates a floating point approximation

$$\widehat{f}(x),$$

where every addition, multiplication, division, transcendental function, and memory conversion may introduce rounding error. AD differentiates the operations actually executed. In practice, it propagates derivatives through

$$\widehat{f},$$

not through the idealized $f$.

This is usually acceptable. It is also the source of many subtle failures.

Floating Point Numbers

A floating point number represents a real number using three parts:

$$x = (-1)^s \cdot m \cdot b^e.$$

Here $s$ is the sign, $m$ is the significand, $b$ is the base, and $e$ is the exponent.

Most modern systems use IEEE 754 binary floating point. The common formats are:

| Format | Significand precision | Exponent bits | Approximate decimal digits |
| --- | --- | --- | --- |
| float16 | 11 bits | 5 | 3 to 4 |
| bfloat16 | 8 bits | 8 | 2 to 3 |
| float32 | 24 bits | 8 | 7 |
| float64 | 53 bits | 11 | 15 to 16 |

The key point is simple: only finitely many real numbers can be represented. Every operation must round its result back into the finite set.

For a basic floating point operation, we often model the computed result as

$$\operatorname{fl}(a \circ b) = (a \circ b)(1 + \delta),$$

where $\circ$ is one of $+, -, \times, /$, and $|\delta| \leq u$. The number $u$ is the unit roundoff.

This model is not exact for every edge case, but it is a useful first approximation.
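The rounding model can be checked directly in float64, where the unit roundoff is $u = 2^{-53}$. A small sketch, using Python's exact `Fraction` arithmetic as the reference:

```python
import sys
from fractions import Fraction

u = 2.0 ** -53                      # unit roundoff for float64
assert sys.float_info.epsilon == 2 * u

# 1.0 + u is a halfway case and rounds (to even) back to 1.0; 1.0 + 2u does not.
assert 1.0 + u == 1.0
assert 1.0 + 2 * u > 1.0

# Check fl(a*b) = (a*b)(1 + delta) with |delta| <= u for one multiplication.
a, b = 0.1, 0.3
exact = Fraction(a) * Fraction(b)   # exact product of the two stored floats
computed = Fraction(a * b)          # the rounded float64 product
delta = abs(computed - exact) / exact
assert delta <= Fraction(1, 2 ** 53)
```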

Error in Function Values

Forward evaluation already contains error before differentiation is considered.

For example:

$$f(x) = (x + 1) - x.$$

Mathematically, $f(x) = 1$. But in floating point, when $x$ is very large, $x + 1$ may round back to $x$. Then the computed value becomes

$$\widehat{f}(x) = x - x = 0.$$

The exact function has derivative

$$f'(x) = 0.$$

The floating point program is locally constant over ranges of large $x$, so its computed derivative is also $0$. In this example, the derivative happens to match. But the function value is wrong.
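The absorption is easy to reproduce in float64, where the spacing between adjacent representable numbers near $10^{16}$ is $2$. A minimal sketch:

```python
x = 1.0e16            # spacing between adjacent float64 values here is 2.0
assert x + 1.0 == x   # x + 1 is a halfway case and rounds back to x

f_hat = (x + 1.0) - x
assert f_hat == 0.0   # the implemented program returns 0, not 1
```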

Now consider

$$g(x) = \frac{(x + 1) - x}{1}.$$

The symbolic expression is still constant. The AD result may still be $0$. The problem is not AD. The problem is that the primal computation has already lost the information.

AD cannot recover numerical information that the forward pass destroyed.

Error in Derivatives

Derivative error comes from two places.

First, the primal values may be inaccurate. Since local derivatives are evaluated at primal values, the derivative path receives those inaccuracies.

Second, derivative accumulations also use floating point arithmetic. In reverse mode, many adjoints may be summed into the same variable. The order of summation can change the result.

Consider

$$y = x_1 + x_2 + \cdots + x_n.$$

The exact derivative is simple:

$$\frac{\partial y}{\partial x_i} = 1.$$

But in a large computational graph, a variable may receive adjoint contributions from thousands or millions of paths. The accumulated adjoint has the form

$$\bar{x} = \sum_k \bar{v}_k \frac{\partial v_k}{\partial x}.$$

In floating point, this sum depends on order. Adding small values after very large values may discard the small values. Parallel reductions make this more visible because different execution schedules may produce different summation trees.
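Order dependence shows up even for four terms. A small float64 sketch, with `math.fsum` as a correctly rounded reference:

```python
import math

values = [1.0e16, 1.0, -1.0e16, 1.0]   # exact sum is 2.0

left_to_right = 0.0
for v in values:
    left_to_right += v                 # each 1.0 may be absorbed by 1e16

ascending = 0.0
for v in sorted(values):               # a different summation order
    ascending += v

exact = math.fsum(values)              # correctly rounded sum

# Three different answers from the same values:
assert left_to_right == 1.0
assert ascending == 0.0
assert exact == 2.0
```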

Catastrophic Cancellation

Cancellation occurs when two nearly equal numbers are subtracted.

For example:

$$h(x) = \sqrt{x + 1} - \sqrt{x}.$$

For large $x$, both terms are close. Their subtraction loses significant digits.

A numerically better form is

$$h(x) = \frac{1}{\sqrt{x + 1} + \sqrt{x}}.$$

These two formulas are mathematically equivalent. They are not numerically equivalent.

AD will differentiate whichever program it receives. If the unstable form is used, the derivative calculation inherits the instability. If the stable form is used, the derivative calculation usually improves as well.
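The difference is visible at, say, $x = 10^{12}$, where both square roots are near $10^6$. A float64 sketch:

```python
import math

x = 1.0e12
unstable = math.sqrt(x + 1.0) - math.sqrt(x)        # cancellation
stable = 1.0 / (math.sqrt(x + 1.0) + math.sqrt(x))  # no cancellation

# Both approximate 5e-7, but the subtraction keeps only a few digits.
assert abs(stable - 5.0e-7) < 1.0e-15
assert unstable != stable
assert abs(unstable - stable) / stable < 1.0e-3
```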

This is one of the most important rules in numerical AD:

The derivative program is only as stable as the primal program plus the derivative accumulation strategy.

Forward Mode Error

In forward mode, each value carries a primal and a tangent:

$$(x, \dot{x}).$$

Each operation updates both.

For multiplication:

$$z = xy,$$

the tangent is

$$\dot{z} = \dot{x}y + x\dot{y}.$$

Floating point error appears in both the primal multiplication and the tangent expression. The tangent calculation has an addition, so it can also suffer cancellation.

For example, if

$$\dot{x}y \approx -x\dot{y},$$

then the tangent result may lose many significant digits.

Forward mode has a useful property: tangent values are computed in the same direction as primal values. This often makes local error easier to reason about. The drawback is that tangent propagation can still amplify instability in long chains of operations.
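The pairing of primal and tangent can be sketched with a minimal dual-number class (an illustration, not any particular library's API):

```python
class Dual:
    """A value paired with its tangent: (x, x_dot)."""

    def __init__(self, primal, tangent=0.0):
        self.primal = primal
        self.tangent = tangent

    def __add__(self, other):
        return Dual(self.primal + other.primal,
                    self.tangent + other.tangent)

    def __mul__(self, other):
        # z = x y  =>  z_dot = x_dot y + x y_dot
        return Dual(self.primal * other.primal,
                    self.tangent * other.primal + self.primal * other.tangent)

# d/dx of f(x, y) = x*y + x at (3, 4): seed x with tangent 1.
x = Dual(3.0, 1.0)
y = Dual(4.0)
z = x * y + x
assert z.primal == 15.0
assert z.tangent == 5.0    # df/dx = y + 1
```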

Reverse Mode Error

Reverse mode stores or reconstructs primal values, then runs an adjoint computation backward.

For an operation

$$z = xy,$$

reverse mode applies

$$\bar{x} \mathrel{+}= \bar{z}y, \qquad \bar{y} \mathrel{+}= \bar{z}x.$$

The symbol $\mathrel{+}=$ is important. Reverse mode accumulates. A variable may receive contributions from many later operations.

This creates three numerical issues.

First, adjoint accumulation can be order-dependent.

Second, stored primal values may be rounded, compressed, rematerialized, or recomputed differently.

Third, checkpointing can change numerical behavior if recomputation does not exactly match the original forward pass.

In pure deterministic float64 code, this may be negligible. In mixed precision GPU training, it can be significant.
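The accumulation pattern can be sketched with a minimal tape (illustrative names, not a real library's API):

```python
class Var:
    """A value on a shared tape; backward passes accumulate into .grad."""

    def __init__(self, value, tape=None):
        self.value = value
        self.grad = 0.0
        self.tape = tape if tape is not None else []

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        def backward():
            # x_bar += z_bar * y ;  y_bar += z_bar * x
            self.grad += out.grad * other.value
            other.grad += out.grad * self.value
        self.tape.append(backward)
        return out

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        self.tape.append(backward)
        return out

tape = []
x = Var(3.0, tape)
z = x * x + x              # x appears on several paths: adjoints accumulate
z.grad = 1.0
for op in reversed(tape):  # replay the tape backward
    op()
assert z.value == 12.0
assert x.grad == 7.0       # dz/dx = 2x + 1
```

Each `+=` above is a floating point addition, so the order in which the tape replays contributions is exactly where the order dependence discussed earlier enters.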

The Derivative of the Program

A useful way to think about AD is:

$$\operatorname{AD}(P)$$

computes derivatives of program $P$, not derivatives of the mathematical expression the programmer had in mind.

This matters when the program contains:

| Program feature | Numerical consequence |
| --- | --- |
| Rounding | Piecewise constant behavior at small scales |
| Branches | Discontinuous derivative structure |
| Overflow | Infinite or NaN values |
| Underflow | Silent loss of small values |
| Mixed precision | Different primal and adjoint accuracy |
| Parallel reductions | Non-deterministic accumulation order |
| Approximate kernels | Derivatives of approximations |
| Fused operations | Different rounding than unfused code |

For example, the fused multiply-add operation computes

$$\operatorname{fma}(a, b, c) = ab + c$$

with a single rounding step. The separated expression

$$a \times b + c$$

usually rounds twice. They can produce different primal values and therefore different derivative behavior in edge cases.

Non-Smoothness Introduced by Floating Point

Real-valued smooth functions may become piecewise constant or piecewise discontinuous after floating point implementation.

Consider

$$f(x) = x + \epsilon.$$

If $\epsilon$ is smaller than the spacing between adjacent floating point numbers near $x$, then

$$\operatorname{fl}(x + \epsilon) = x.$$

The implemented function is locally constant. Its real mathematical derivative is $1$, but the floating point mapping from representable inputs to representable outputs behaves like the identity or a constant map depending on scale and rounding.

Most AD systems ignore the derivative of rounding itself. They differentiate the arithmetic expression as if operations were real-valued, while using floating point values for evaluation.

This convention is practical. Treating rounding as a literal discrete operation would make most derivatives zero or undefined.

So AD occupies a middle position:

It uses floating point execution, but applies real calculus rules to the executed operations.
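The locally constant behavior is easy to observe with `math.ulp`, which gives the spacing of floats near a value:

```python
import math

x = 1.0
eps = 1e-20                # far smaller than the spacing near 1.0
assert eps < math.ulp(x) / 2
assert x + eps == x        # fl(x + eps) == x: the program is locally constant
```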

Overflow and Underflow

Overflow occurs when a result is too large to represent. Underflow occurs when a result is too small to represent normally.

The exponential function is a common example:

$$f(x) = e^x.$$

For large positive $x$, $e^x$ overflows. For large negative $x$, it underflows toward zero.

The derivative is

$$f'(x) = e^x.$$

So the derivative overflows or underflows in the same regions.

A more subtle example is softmax:

$$\operatorname{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}.$$

The stable implementation subtracts the maximum value:

$$\operatorname{softmax}(x_i) = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}, \quad m = \max_j x_j.$$

This does not change the mathematical result, but it prevents unnecessary overflow. AD through the stable implementation is also more reliable.
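Both versions can be sketched side by side; the naive one fails as soon as any input exceeds the overflow threshold of float64 (about 709 for `math.exp`):

```python
import math

def softmax_naive(xs):
    exps = [math.exp(x) for x in xs]          # exp overflows for large inputs
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]      # largest exponent is now 0
    total = sum(exps)
    return [e / total for e in exps]

xs = [1000.0, 1001.0, 1002.0]

try:
    softmax_naive(xs)                         # math.exp(1000.0) overflows
    overflowed = False
except OverflowError:
    overflowed = True

probs = softmax_stable(xs)
assert overflowed
assert abs(sum(probs) - 1.0) < 1e-12
assert probs[0] < probs[1] < probs[2]
```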

Mixed Precision

Modern ML systems often use mixed precision:

| Quantity | Common precision |
| --- | --- |
| Activations | float16, bfloat16, float32 |
| Weights | float16, bfloat16, float32 |
| Gradients | float16, bfloat16, float32 |
| Accumulators | float32 |
| Optimizer state | float32 or float64 |

Mixed precision improves throughput and memory use, but changes numerical behavior.

The most common problem is gradient underflow. Small gradients may become zero in float16. Loss scaling is a practical workaround. Instead of differentiating the loss $L$, the system differentiates

$$\alpha L,$$

where $\alpha$ is a scale factor. After backpropagation, gradients are divided by $\alpha$.

This preserves more gradient information during low-precision computation.
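The underflow-and-rescue mechanism can be simulated in pure Python by round-tripping values through IEEE 754 binary16 with `struct`; the `to_float16` helper below is a hypothetical name for illustration:

```python
import struct

def to_float16(x):
    # Round-trip through IEEE 754 binary16 ('e') to simulate float16 storage.
    return struct.unpack('<e', struct.pack('<e', x))[0]

grad = 1e-8                         # below float16's smallest subnormal (~6e-8)
assert to_float16(grad) == 0.0      # the gradient underflows to zero

alpha = 1024.0                      # loss scale: backprop through alpha * L
scaled = to_float16(grad * alpha)   # ~1.0e-5 survives in float16
recovered = scaled / alpha          # unscale in higher precision
assert recovered != 0.0
assert abs(recovered - grad) / grad < 0.01
```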

Numerical Testing of AD

Finite differences are often used to test AD implementations:

$$f'(x) \approx \frac{f(x + h) - f(x)}{h}.$$

But finite differences have their own floating point error.

If $h$ is too large, truncation error dominates. If $h$ is too small, cancellation dominates.

A centered difference is usually better:

$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}.$$

Even then, numerical derivative checks should use tolerances, not exact equality.

A practical test compares:

$$\left|g_{\text{AD}} - g_{\text{FD}}\right|$$

and

$$\frac{\left|g_{\text{AD}} - g_{\text{FD}}\right|}{\max\left(1, \left|g_{\text{AD}}\right|, \left|g_{\text{FD}}\right|\right)}.$$

Both absolute and relative error matter.
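The comparison can be packaged as a small checker; `grad_check` and its tolerance parameters are hypothetical names chosen for this sketch:

```python
import math

def grad_check(f, grad_f, x, h=1e-6, rtol=1e-5, atol=1e-8):
    # Centered finite difference against the claimed derivative,
    # accepted if either the absolute or the relative error is small.
    g_fd = (f(x + h) - f(x - h)) / (2.0 * h)
    g_ad = grad_f(x)
    err = abs(g_ad - g_fd)
    scale = max(1.0, abs(g_ad), abs(g_fd))
    return err <= atol or err / scale <= rtol

assert grad_check(math.sin, math.cos, 0.7)        # correct derivative passes
assert not grad_check(math.sin, math.sin, 0.7)    # wrong derivative is caught
```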

Practical Guidelines

Use stable primal formulas. AD does not repair unstable algebra.

Prefer float64 for debugging derivative correctness. Move to lower precision only after the derivative path is trusted.

Use stable reductions for large sums. Pairwise summation, tree reductions, or compensated summation can reduce error.
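As an example of a compensated method, Kahan summation carries the rounding error of each addition in a correction term; a sketch, with `math.fsum` as the correctly rounded reference:

```python
import math

def kahan_sum(values):
    # Compensated summation: c tracks the low-order bits lost at each step.
    s, c = 0.0, 0.0
    for v in values:
        y = v - c
        t = s + y
        c = (t - s) - y       # the part of y that did not make it into t
        s = t
    return s

values = [0.1] * 10 ** 6
naive = 0.0
for v in values:
    naive += v                # accumulates rounding drift
exact = math.fsum(values)     # correctly rounded reference

assert abs(kahan_sum(values) - exact) <= abs(naive - exact)
assert abs(kahan_sum(values) - exact) < 1e-9
```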

Be careful with branches near thresholds. A tiny rounding difference can select a different branch and produce a different derivative.

Check for NaN and Inf in both primals and adjoints. A valid forward value does not guarantee a valid gradient.

Use mixed precision deliberately. Keep accumulators and optimizer states in higher precision when possible.

Test derivatives across scales. A function may pass checks near $x = 1$ and fail near $x = 10^{20}$ or $x = 10^{-20}$.

Core Idea

Floating point error does not make automatic differentiation invalid. It means AD must be understood as differentiation of an executed numerical program. The calculus rules may be exact, while the values they operate on are approximate. For reliable systems, numerical stability is part of the AD design, not an afterthought.