Floating Point Error
Automatic differentiation computes derivatives by executing arithmetic. On a real machine, arithmetic uses finite precision. This means AD gives the derivative of the implemented floating point program, not always the exact derivative of the ideal real-valued mathematical function.
That distinction matters.
Suppose a program represents a function

$$f : \mathbb{R}^n \to \mathbb{R}^m.$$

Mathematically, we may want the Jacobian

$$J_f(x) = \frac{\partial f}{\partial x}.$$

But the computer evaluates a floating point approximation

$$\tilde{f} \approx f,$$

where every addition, multiplication, division, transcendental function, and memory conversion may introduce rounding error. AD differentiates the operations actually executed. In practice, it propagates derivatives through $\tilde{f}$, not through the idealized $f$.
This is usually acceptable. It is also the source of many subtle failures.
Floating Point Numbers
A floating point number represents a real number using three parts:

$$x = (-1)^s \cdot m \cdot \beta^{e}.$$

Here $s$ is the sign, $m$ is the significand, $\beta$ is the base, and $e$ is the exponent.
Most modern systems use IEEE 754 binary floating point. The common formats are:
| Format | Significand precision | Exponent bits | Approximate decimal digits |
|---|---|---|---|
| float16 | 11 bits | 5 | 3 to 4 |
| bfloat16 | 8 bits | 8 | 2 to 3 |
| float32 | 24 bits | 8 | 7 |
| float64 | 53 bits | 11 | 15 to 16 |
The key point is simple: only finitely many real numbers can be represented. Every operation must round its result back into the finite set.
For a basic floating point operation, we often model the computed result as

$$\mathrm{fl}(a \circ b) = (a \circ b)(1 + \delta),$$

where $\circ$ is one of $+, -, \times, \div$, and $|\delta| \le u$. The number $u$ is the unit roundoff.
This model is not exact for every edge case, but it is a useful first approximation.
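The $(1 + \delta)$ model is easy to observe directly. A minimal sketch in float64, using NumPy's `finfo` to read off the machine epsilon (the variable names are illustrative):

```python
import numpy as np

# Unit roundoff for float64 under round-to-nearest: u = eps / 2 = 2^-53.
u = np.finfo(np.float64).eps / 2

# Anything strictly smaller than u added to 1.0 rounds back to 1.0.
assert 1.0 + u * 0.99 == 1.0
assert 1.0 + u * 1.01 > 1.0

# The (1 + delta) model in action: fl(0.1 + 0.2) differs from 0.3
# by a small relative error on the order of u.
s = 0.1 + 0.2
rel_err = abs(s - 0.3) / 0.3
assert rel_err < 10 * u  # small multiple: 0.1 and 0.2 carry their own rounding
```

The factor of 10 in the last bound is loose on purpose: the inputs `0.1` and `0.2` are themselves rounded, so the total error is a few multiples of $u$, not one.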
Error in Function Values
Forward evaluation already contains error before differentiation is considered.
For example:

$$f(x) = (x + 1) - x.$$

Mathematically, $f(x) = 1$. But in floating point, when $x$ is very large, $x + 1$ may round back to $x$. Then the computed value becomes

$$\tilde{f}(x) = 0.$$

The exact function has derivative

$$f'(x) = 0.$$

The floating point program is locally constant over ranges of large $x$, so its computed derivative is also $0$. In this example, the derivative happens to match. But the function value is wrong.
Now consider

$$g(x) = 2 \cdot \big((x + 1) - x\big).$$

The symbolic expression is still constant. The AD result may still be $0$. The problem is not AD. The problem is that the primal computation has already lost the information: the computed $g$ is $0$, not $2$.
AD cannot recover numerical information that the forward pass destroyed.
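The information loss is easy to reproduce in plain Python float64 (the function `f` below implements the $(x + 1) - x$ pattern):

```python
# f(x) = (x + 1) - x is mathematically the constant function 1.
def f(x: float) -> float:
    return (x + 1.0) - x

assert f(1.0) == 1.0       # small x: exact

# At x = 2^53 the spacing between adjacent float64 values is 2,
# so x + 1.0 is a tie that rounds back to x (ties-to-even).
assert f(2.0**53) == 0.0

# Same effect at any sufficiently large magnitude.
assert f(1e17) == 0.0
```

Any derivative computed through this program sees the already-corrupted primal values.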
Error in Derivatives
Derivative error comes from two places.
First, the primal values may be inaccurate. Since local derivatives are evaluated at primal values, the derivative path receives those inaccuracies.
Second, derivative accumulations also use floating point arithmetic. In reverse mode, many adjoints may be summed into the same variable. The order of summation can change the result.
Consider

$$y = f_1(x) + f_2(x) + \cdots + f_n(x).$$

The exact derivative is simple:

$$\frac{dy}{dx} = \sum_{i=1}^{n} f_i'(x).$$

But in a large computational graph, a variable may receive adjoint contributions from thousands or millions of paths. The accumulated adjoint has the form

$$\bar{x} = \sum_{i=1}^{n} \bar{v}_i \, \frac{\partial v_i}{\partial x}.$$
In floating point, this sum depends on order. Adding small values after very large values may discard the small values. Parallel reductions make this more visible because different execution schedules may produce different summation trees.
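Order dependence is reproducible with one large contribution and many small ones (the values here are illustrative, chosen so that the effect is deterministic in float64):

```python
# One large adjoint contribution and a thousand small ones.
contributions = [1e16] + [1.0] * 1000

forward = 0.0
for c in contributions:            # large value first: the 1.0s are absorbed
    forward += c

backward = 0.0
for c in reversed(contributions):  # small values first: they accumulate exactly
    backward += c

# Small-first recovers the exact sum; large-first does not.
assert backward == 1e16 + 1000.0
assert forward != backward
```

A reverse-mode engine summing the same adjoints in two different schedules can disagree in exactly this way.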
Catastrophic Cancellation
Cancellation occurs when two nearly equal numbers are subtracted.
For example:

$$f(x) = \sqrt{x + 1} - \sqrt{x}.$$

For large $x$, both terms are close. Their subtraction loses significant digits.

A numerically better form is

$$f(x) = \frac{1}{\sqrt{x + 1} + \sqrt{x}}.$$
These two formulas are mathematically equivalent. They are not numerically equivalent.
AD will differentiate whichever program it receives. If the unstable form is used, the derivative calculation inherits the instability. If the stable form is used, the derivative calculation usually improves as well.
This is one of the most important rules in numerical AD:
The derivative program is only as stable as the primal program plus the derivative accumulation strategy.
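The difference-of-square-roots pattern shows both sides of this rule. A sketch in plain Python float64 (the function names are illustrative):

```python
import math

def unstable(x: float) -> float:
    # Subtracts two nearly equal values for large x.
    return math.sqrt(x + 1.0) - math.sqrt(x)

def stable(x: float) -> float:
    # Mathematically identical, rationalized form: no subtraction.
    return 1.0 / (math.sqrt(x + 1.0) + math.sqrt(x))

x = 1e17
# At this scale x + 1.0 already rounds to x, so the unstable form
# returns exactly 0.0; the stable form still returns ~1.58e-9.
assert unstable(x) == 0.0
assert stable(x) > 0.0
```

AD through `unstable` differentiates a program whose value is already wrong; AD through `stable` works with meaningful primal values.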
Forward Mode Error
In forward mode, each value carries a primal and a tangent:

$$(v, \dot{v}).$$
Each operation updates both.
For multiplication

$$y = a \, b,$$

the tangent is

$$\dot{y} = \dot{a}\, b + a\, \dot{b}.$$
Floating point error appears in both the primal multiplication and the tangent expression. The tangent calculation has an addition, so it can also suffer cancellation.
For example, if

$$\dot{a}\, b \approx -\, a\, \dot{b},$$

then the tangent result may lose many significant digits.
Forward mode has a useful property: tangent values are computed in the same direction as primal values. This often makes local error easier to reason about. The drawback is that tangent propagation can still amplify instability in long chains of operations.
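A minimal forward-mode sketch makes the pairing concrete. The `Dual` class below is illustrative, not a fragment of any particular AD library:

```python
from dataclasses import dataclass

@dataclass
class Dual:
    """A (primal, tangent) pair: minimal forward-mode AD."""
    val: float
    dot: float

    def __add__(self, other: "Dual") -> "Dual":
        # Sum rule: both components add.
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other: "Dual") -> "Dual":
        # Product rule: d(ab) = a'b + ab'. Both the primal product and
        # this tangent sum are rounded, and the sum can cancel.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

# Differentiate f(x) = x*x + x at x = 3: f'(x) = 2x + 1 = 7.
x = Dual(3.0, 1.0)   # tangent seeded with 1.0 for d/dx
y = x * x + x
assert y.val == 12.0
assert y.dot == 7.0
```

Primal and tangent are computed side by side, in the same direction, which is why local error in forward mode is comparatively easy to trace.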
Reverse Mode Error
Reverse mode stores or reconstructs primal values, then runs an adjoint computation backward.
For an operation

$$v = \phi(a, b),$$

reverse mode applies

$$\bar{a} \mathrel{{+}{=}} \bar{v}\, \frac{\partial \phi}{\partial a}, \qquad \bar{b} \mathrel{{+}{=}} \bar{v}\, \frac{\partial \phi}{\partial b}.$$

The symbol $\mathrel{{+}{=}}$ is important. Reverse mode accumulates. A variable may receive contributions from many later operations.
This creates three numerical issues.
First, adjoint accumulation can be order-dependent.
Second, stored primal values may be rounded, compressed, rematerialized, or recomputed differently.
Third, checkpointing can change numerical behavior if recomputation does not exactly match the original forward pass.
In pure deterministic float64 code, this may be negligible. In mixed precision GPU training, it can be significant.
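The accumulation pattern can be sketched in a few lines. The `Var` class and `mul` helper below are illustrative, not from any particular framework:

```python
# Minimal reverse-mode sketch: one multiply node whose input is reused.
class Var:
    def __init__(self, val: float):
        self.val = val
        self.grad = 0.0            # adjoint, accumulated with +=
        self._backward = lambda: None

def mul(a: Var, b: Var) -> Var:
    out = Var(a.val * b.val)
    def _backward():
        # Each use of a variable CONTRIBUTES to its adjoint: note the +=.
        # With many contributions, the order of these additions matters
        # in floating point.
        a.grad += out.grad * b.val
        b.grad += out.grad * a.val
    out._backward = _backward
    return out

x = Var(3.0)
y = mul(x, x)      # y = x * x; x is used twice
y.grad = 1.0       # seed the output adjoint
y._backward()
# Both uses contribute: dy/dx = x + x = 6.
assert x.grad == 6.0
```

With two contributions the sum is exact; with millions of contributions of mixed magnitude, the `+=` order becomes a real numerical choice.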
The Derivative of the Program
A useful way to think about AD is:
AD computes derivatives of the program $P$ as executed, not derivatives of the mathematical expression the programmer had in mind.
This matters when the program contains:
| Program feature | Numerical consequence |
|---|---|
| Rounding | Piecewise constant behavior at small scales |
| Branches | Discontinuous derivative structure |
| Overflow | Infinite or NaN values |
| Underflow | Silent loss of small values |
| Mixed precision | Different primal and adjoint accuracy |
| Parallel reductions | Non-deterministic accumulation order |
| Approximate kernels | Derivatives of approximations |
| Fused operations | Different rounding than unfused code |
For example, the fused multiply-add operation computes

$$\mathrm{fma}(a, b, c) = \mathrm{fl}(a \cdot b + c)$$

with a single rounding step. The separated expression

$$\mathrm{fl}\big(\mathrm{fl}(a \cdot b) + c\big)$$

usually rounds twice. They can produce different primal values and therefore different derivative behavior in edge cases.
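The one-rounding versus two-rounding difference can be simulated in float32 by forming the product in float64, where it is exact for these inputs, and rounding once (the specific operands are constructed so the two results provably differ):

```python
import numpy as np

a = np.float32(1.0 + 2.0**-12)
b = np.float32(1.0 + 2.0**-12)
c = np.float32(-(1.0 + 2.0**-11))

# Two roundings: the float32 product discards the low-order 2^-24 term.
unfused = (a * b) + c

# One rounding: the float64 product a*b = 1 + 2^-11 + 2^-24 is exact,
# so rounding only the final sum simulates a fused multiply-add.
fused = np.float32(np.float64(a) * np.float64(b) + np.float64(c))

assert unfused == np.float32(0.0)       # the 2^-24 term was rounded away
assert fused == np.float32(2.0**-24)    # the fused path keeps it
```

A kernel compiled with FMA and one compiled without can therefore disagree on primal values, and AD faithfully differentiates whichever was run.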
Non-Smoothness Introduced by Floating Point
Real-valued smooth functions may become piecewise constant or piecewise discontinuous after floating point implementation.
Consider

$$f(x) = x + \varepsilon$$

for a small constant $\varepsilon > 0$. If $\varepsilon$ is smaller than the spacing between adjacent floating point numbers near $x$, then

$$\mathrm{fl}(x + \varepsilon) = x.$$

The implemented function is locally constant at small scales. Its real mathematical derivative is $1$, but the floating point mapping from representable inputs to representable outputs behaves like the identity or a constant map depending on scale and rounding.
Most AD systems ignore the derivative of rounding itself. They differentiate the arithmetic expression as if operations were real-valued, while using floating point values for evaluation.
This convention is practical. Treating rounding as a literal discrete operation would make most derivatives zero or undefined.
So AD occupies a middle position:
It uses floating point execution, but applies real calculus rules to the executed operations.
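The gap between the two views is directly observable: the executed map absorbs a tiny shift entirely, while AD's calculus rules still report a nonzero derivative for the addition. A small sketch using `math.ulp` from the standard library:

```python
import math

x = 1.0
eps = 1e-17                       # well below half the local spacing

# The spacing of floats just above 1.0 is 2^-52.
assert math.ulp(1.0) == 2.0**-52

# The executed map x -> fl(x + eps) is the identity here: the shift
# is invisible at this scale.
assert x + eps == x

# AD, applying real calculus to the expression x + eps, would still
# report d(x + eps)/dx = 1, which happens to describe the identity map.
```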
Overflow and Underflow
Overflow occurs when a result is too large to represent. Underflow occurs when a result is too small to represent normally.
The exponential function is a common example:

$$f(x) = e^{x}.$$

For large positive $x$, $e^{x}$ overflows. For large negative $x$, it underflows toward zero.

The derivative is

$$f'(x) = e^{x}.$$

So the derivative overflows or underflows in the same regions.
A more subtle example is softmax:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}.$$

The stable implementation subtracts the maximum value:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}, \qquad m = \max_j x_j.$$
This does not change the mathematical result, but it prevents unnecessary overflow. AD through the stable implementation is also more reliable.
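Both forms can be compared directly. A NumPy sketch (the `errstate` context only silences the expected overflow warning from the naive path):

```python
import numpy as np

def naive_softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x)                  # overflows to inf for large entries
    return e / e.sum()             # inf / inf = nan

def stable_softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())        # largest exponent becomes exp(0) = 1
    return e / e.sum()

x = np.array([1000.0, 1001.0])

with np.errstate(over="ignore", invalid="ignore"):
    bad = naive_softmax(x)
good = stable_softmax(x)

assert np.isnan(bad).any()               # naive form is destroyed
assert abs(good.sum() - 1.0) < 1e-12     # stable form is a valid distribution
assert good[1] > good[0]
```

Gradients backpropagated through `naive_softmax` at this input would be NaN; through `stable_softmax` they are well defined.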
Mixed Precision
Modern ML systems often use mixed precision:
| Quantity | Common precision |
|---|---|
| Activations | float16, bfloat16, float32 |
| Weights | float16, bfloat16, float32 |
| Gradients | float16, bfloat16, float32 |
| Accumulators | float32 |
| Optimizer state | float32 or float64 |
Mixed precision improves throughput and memory use, but changes numerical behavior.
The most common problem is gradient underflow. Small gradients may become zero in float16. Loss scaling is a practical workaround. Instead of differentiating the loss $L$, the system differentiates

$$s \cdot L,$$

where $s$ is a scale factor. After backpropagation, gradients are divided by $s$.
This preserves more gradient information during low-precision computation.
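The underflow and its workaround can be demonstrated with a single gradient value. A NumPy sketch (the gradient magnitude and the scale $s = 2^{15}$ are illustrative):

```python
import numpy as np

true_grad = 1e-8                       # a small gradient, held as a Python float

# Without scaling: below float16's smallest subnormal (~6e-8),
# the cast flushes the value to zero.
unscaled = np.float16(true_grad)
assert unscaled == 0.0

# With loss scaling: the gradient of s*L is s times larger, so it
# survives the float16 representation.
s = 2.0**15
scaled = np.float16(true_grad * s)     # ~3.3e-4, comfortably representable
recovered = np.float32(scaled) / np.float32(s)

assert scaled != 0.0
assert abs(recovered - true_grad) / true_grad < 1e-2
```

Powers of two are a common choice for $s$ because multiplying and dividing by them introduces no additional rounding error.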
Numerical Testing of AD
Finite differences are often used to test AD implementations:

$$f'(x) \approx \frac{f(x + h) - f(x)}{h}.$$
But finite differences have their own floating point error.
If $h$ is too large, truncation error dominates. If $h$ is too small, cancellation dominates.
A centered difference is usually better:

$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}.$$
Even then, numerical derivative checks should use tolerances, not exact equality.
A practical test compares the AD derivative

$$\frac{\partial f}{\partial x}\bigg|_{\mathrm{AD}}$$

and the finite difference estimate

$$\frac{f(x + h) - f(x - h)}{2h}.$$
Both absolute and relative error matter.
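A minimal derivative check along these lines (the test function, the step size $h = 10^{-6}$, and the tolerance are illustrative; `df` stands in for an AD-computed derivative):

```python
import math

def f(x: float) -> float:
    return math.sin(x) * math.exp(x)

def df(x: float) -> float:
    # Analytic derivative, standing in for an AD result: product rule.
    return math.cos(x) * math.exp(x) + math.sin(x) * math.exp(x)

def central_diff(g, x: float, h: float = 1e-6) -> float:
    return (g(x + h) - g(x - h)) / (2.0 * h)

x = 0.7
ad = df(x)
fd = central_diff(f, x)

# Compare with tolerances, never exact equality: the central difference
# carries O(h^2) truncation error plus O(u/h) rounding error.
abs_err = abs(ad - fd)
rel_err = abs_err / max(abs(ad), 1e-12)
assert rel_err < 1e-7
```

Sweeping `x` over several magnitudes, and `h` over a few decades, catches far more bugs than a single point.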
Practical Guidelines
Use stable primal formulas. AD does not repair unstable algebra.
Prefer float64 for debugging derivative correctness. Move to lower precision only after the derivative path is trusted.
Use stable reductions for large sums. Pairwise summation, tree reductions, or compensated summation can reduce error.
Be careful with branches near thresholds. A tiny rounding difference can select a different branch and produce a different derivative.
Check for NaN and Inf in both primals and adjoints. A valid forward value does not guarantee a valid gradient.
Use mixed precision deliberately. Keep accumulators and optimizer states in higher precision when possible.
Test derivatives across scales. A function may pass checks near and fail near or .
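The compensated-summation guideline can be sketched in a few lines. This is the classic Kahan algorithm, here applied to the large-plus-many-small pattern that defeats naive accumulation:

```python
def kahan_sum(values) -> float:
    """Compensated summation: carry the rounding error of each addition."""
    total = 0.0
    c = 0.0                    # running compensation for lost low-order bits
    for v in values:
        y = v - c
        t = total + y
        c = (t - total) - y    # what rounding discarded in t = total + y
        total = t
    return total

values = [1e16] + [1.0] * 1000

naive = sum(values)            # left-to-right: the 1.0s are largely lost
compensated = kahan_sum(values)

assert compensated == 1e16 + 1000.0   # exact sum, fully recovered
assert naive != compensated
```

The same idea applies to adjoint accumulation in reverse mode: a compensated or tree-shaped reduction keeps small contributions from being swallowed by large ones.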
Core Idea
Floating point error does not make automatic differentiation invalid. It means AD must be understood as differentiation of an executed numerical program. The calculus rules may be exact, while the values they operate on are approximate. For reliable systems, numerical stability is part of the AD design, not an afterthought.