Automatic differentiation computes derivatives exactly with respect to the executed floating point program. This distinguishes AD from numerical differentiation, which approximates derivatives using perturbations.
The phrase “exact up to floating point” has a precise meaning:
- AD does not introduce truncation error
- AD does not approximate derivatives by finite differences
- AD propagates analytically correct local derivatives
- all remaining error comes from ordinary floating point arithmetic
Automatic differentiation therefore inherits the numerical properties of the primal computation itself.
Exactness Relative to the Executed Program
Consider a floating point program:
y = x * x + sin(x)

Automatic differentiation differentiates this exact sequence of operations.
The derivative program becomes:
dy = 2 * x * dx + cos(x) * dx

Every local derivative rule is mathematically exact for the primitive operation being executed.
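This propagation can be made concrete with a minimal forward-mode sketch. The `Dual` class below is illustrative only (not any particular library's API): it carries a primal value and a derivative through each primitive.

```python
import math

# Minimal forward-mode AD sketch using dual numbers (illustrative only).
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot  # primal value and derivative

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule, exact for this primitive operation
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

def dsin(x):
    # local rule: d(sin u) = cos(u) du
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

x = Dual(1.5, 1.0)       # seed dx = 1
y = x * x + dsin(x)      # the program y = x*x + sin(x)
# y.dot now holds 2*x + cos(x), evaluated in floating point
```

Every step applies an analytically exact local rule; the only error source is the floating point arithmetic of the primal and tangent values themselves.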
The resulting derivative corresponds to the derivative of the implemented floating point function $\hat{f}$, not necessarily the ideal real-valued mathematical function $f$.
This distinction matters because floating point arithmetic changes the effective function being computed.
Floating Point Arithmetic
Floating point operations satisfy:

$$ \text{fl}(a \circ b) = (a \circ b)(1 + \varepsilon) $$

where:
- $\circ$ is an arithmetic operation
- $\varepsilon$ represents rounding error
- typically $|\varepsilon| \le u$, the unit roundoff (about $1.1 \times 10^{-16}$ for FP64)
Automatic differentiation propagates derivatives through these floating point operations exactly as executed.
Thus AD computes derivatives of the rounded computation $\hat{f}$ that was actually executed, rather than derivatives of ideal infinite-precision arithmetic.
Contrast with Finite Differences
Finite differences estimate derivatives by perturbation.
Forward difference:

$$ f'(x) \approx \frac{f(x+h) - f(x)}{h} $$
This introduces two competing errors.
Truncation Error
Taylor expansion gives:
$$ f(x+h) = f(x) + h f'(x) + O(h^2) $$
Thus:
$$ \frac{f(x+h) - f(x)}{h} = f'(x) + O(h) $$
The approximation improves as $h \to 0$.
Cancellation Error
When $h$ becomes very small, the numerator $f(x+h) - f(x)$ subtracts nearly equal floating point numbers.
This causes catastrophic cancellation.
Relative error grows roughly like:

$$ \frac{\varepsilon_{\text{mach}}}{h} $$
Thus:
- large $h$: truncation dominates
- small $h$: cancellation dominates

Choosing $h$ becomes numerically delicate.
Automatic differentiation avoids this entire tradeoff.
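The tradeoff is easy to observe numerically. The sketch below measures the forward-difference error for $f = \sin$ at three step sizes; the specific values of $h$ are illustrative.

```python
import math

def fd(f, x, h):
    # forward difference: truncation error O(h), cancellation error O(eps/h)
    return (f(x + h) - f(x)) / h

exact = math.cos(1.0)  # true derivative of sin at x = 1
errors = {h: abs(fd(math.sin, 1.0, h) - exact)
          for h in (1e-1, 1e-8, 1e-15)}
# h = 1e-1  : truncation dominates (error ~1e-2)
# h = 1e-8  : near the sweet spot for FP64
# h = 1e-15 : cancellation dominates and the estimate degrades badly
```

Neither extreme is safe, and the best $h$ depends on the function, the point, and the precision.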
Example of Cancellation
Suppose:

$$ f(x) = x^2 $$
Using finite differences:
$$ \frac{(x+h)^2 - x^2}{h} = 2x + h $$
Mathematically correct.
But in floating point arithmetic, for very small $h$, $(x+h)^2$ and $x^2$ are nearly equal, and the subtraction destroys significant digits.
AD instead propagates:

$$ \frac{d}{dx}\left(x^2\right) = 2x $$

directly through local derivative rules.
No subtraction of nearby values occurs.
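The failure mode is visible with a step below half an ulp of $x$, at which point the finite difference collapses entirely while the local rule is unaffected:

```python
x, h = 1.0, 1e-17   # h is below half an ulp of 1.0 in FP64

fd = ((x + h)**2 - x**2) / h   # finite difference estimate
ad = 2.0 * x                    # AD local rule: d(x^2)/dx = 2x

# 1.0 + 1e-17 rounds back to exactly 1.0, so fd == 0.0 while ad == 2.0
```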
Stability of Local Rules
AD computes derivatives using local primitive derivatives.
Example: a primitive $y = f(u)$.

Derivative propagation in forward mode:

$$ \dot{y} = f'(u)\,\dot{u} $$

or in reverse mode:

$$ \bar{u} = f'(u)\,\bar{y} $$
These rules are algebraically exact.
Thus derivative accuracy depends primarily on:
- the stability of primal evaluation
- the conditioning of the function
- floating point roundoff
AD itself introduces no additional approximation layer.
Conditioning vs Algorithmic Error
A crucial distinction exists between:
- conditioning of the mathematical problem
- stability of the differentiation algorithm
Even exact derivatives may be numerically large if the function is ill-conditioned.
Example:

$$ f(x) = \sqrt{x} $$

Derivative:

$$ f'(x) = \frac{1}{2\sqrt{x}} $$

Near zero:
- the derivative becomes extremely large
- small perturbations cause large output changes
AD computes this derivative correctly.
The instability belongs to the function itself, not the differentiation method.
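A quick illustration, using $\sqrt{x}$ as a representative function that is ill-conditioned near zero:

```python
import math

def dsqrt(x):
    # exact local rule: d(sqrt x)/dx = 1 / (2 sqrt x)
    return 1.0 / (2.0 * math.sqrt(x))

# the derivative is computed correctly at every point,
# but grows without bound as x approaches zero
vals = [dsqrt(x) for x in (1.0, 1e-8, 1e-16)]
# roughly 0.5, 5e3, 5e7: the blow-up is the function's conditioning, not AD error
```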
Stability of Reverse Mode
Reverse mode propagates adjoints backward through the graph.
Some operations amplify sensitivities.
Example:

$$ y = e^{x} $$

Backward propagation:

$$ \bar{x} = e^{x}\,\bar{y} $$

For large $x$:
- gradients may explode
- overflow becomes possible

Similarly:

$$ y = \log x $$

gives:

$$ \bar{x} = \frac{\bar{y}}{x} $$

Near zero:
- gradients diverge
- numerical instability increases
These issues are intrinsic to the derivatives themselves.
AD exposes them faithfully.
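Both failure modes are directly reproducible; the inputs below are chosen to trigger them and are otherwise arbitrary.

```python
import math

# d/dx exp(x) = exp(x): for large x both primal and gradient overflow in FP64
try:
    g_exp = math.exp(800.0)   # exceeds the FP64 maximum (~1.8e308)
except OverflowError:
    g_exp = float('inf')      # math.exp raises instead of returning inf

# d/dx log(x) = 1/x: the gradient diverges as x approaches zero
g_log = 1.0 / 1e-300          # ~1e300, close to the overflow threshold
```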
Floating Point Non-Associativity
Floating point arithmetic is not associative.
Example: in FP64,

$$ (10^{16} + 1) + 1 = 10^{16}, \qquad 10^{16} + (1 + 1) = 10^{16} + 2 $$
This means:
- compiler optimizations may alter results
- reordered reductions may change gradients
- parallel execution may produce different adjoints
AD systems therefore inherit:
- nondeterminism
- reduction sensitivity
- ordering dependence
especially on GPUs and distributed systems.
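The non-associativity above can be checked in a few lines:

```python
a, b, c = 1e16, 1.0, 1.0

left  = (a + b) + c   # each 1.0 is absorbed by rounding: result 1e16
right = a + (b + c)   # 2.0 survives: result 1e16 + 2

# different summation orders give different answers, which is why
# reordered or parallel reductions can change accumulated gradients
```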
Differentiating Approximate Kernels
Many numerical libraries use approximate implementations.
Examples:
- fast exponentials
- approximate reciprocal square roots
- polynomial approximations
- fused kernels
AD differentiates the implemented approximation.
If a library implements an approximation:

$$ \tilde{f} \approx f $$

then AD computes:

$$ \tilde{f}' $$

not necessarily:

$$ f' $$
This distinction matters in:
- hardware accelerators
- low-precision inference
- approximate math libraries
- mixed-precision training
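A hypothetical fast kernel makes the point concrete: the truncated-Taylor exponential below stands in for an approximate library implementation, and differentiating it yields the derivative of the polynomial, not of the true exponential.

```python
import math

def approx_exp(x):
    # hypothetical "fast exp": a truncated Taylor polynomial
    return 1.0 + x + x*x/2.0 + x*x*x/6.0

def approx_exp_grad(x):
    # AD differentiates the implemented approximation term by term
    return 1.0 + x + x*x/2.0

x = 1.0
true_grad = math.exp(x)         # derivative of the ideal function
impl_grad = approx_exp_grad(x)  # derivative of what was actually computed
# the gap between the two is the approximation error of the kernel
```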
Discontinuities and Branches
AD assumes differentiable local primitives.
Programs with discontinuities complicate this assumption.
Example:
if x > 0:
    y = x
else:
    y = 0

The executed branch determines the propagated derivative.
For $x > 0$: $\frac{dy}{dx} = 1$
For $x \le 0$: $\frac{dy}{dx} = 0$
At $x = 0$:
- the function is nondifferentiable
- AD usually returns the derivative of the executed branch
This is exact relative to the executed trace, but not a generalized symbolic derivative of the entire piecewise function.
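A trace-level sketch of this behavior (the pairing of value and derivative is illustrative):

```python
def relu_with_grad(x):
    # the executed branch determines the propagated derivative
    if x > 0:
        return x, 1.0    # derivative of the y = x branch
    else:
        return 0.0, 0.0  # derivative of the y = 0 branch (also taken at x == 0)

# at exactly x == 0 the function is nondifferentiable; the trace simply
# reports the derivative of whichever branch actually executed
```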
Chaotic Systems
Some systems amplify tiny perturbations exponentially.
Examples:
- chaotic differential equations
- long recurrent systems
- turbulent simulations
Even exact derivatives may become numerically meaningless after sufficient time due to sensitivity amplification.
AD does not solve this issue.
It accurately propagates the instability already present in the underlying dynamics.
Mixed Precision
Modern machine learning often uses:
- FP16
- BF16
- FP8
instead of FP64.
AD propagates derivatives in the same precision unless special handling exists.
Low precision introduces:
- rounding amplification
- gradient underflow
- reduced dynamic range
Loss scaling is often used.
Example: the loss is multiplied by a scale factor $\alpha$ before the backward pass.

Backward propagation computes:

$$ \nabla L_{\text{scaled}} = \alpha\,\nabla L $$

then gradients are rescaled afterward.
This prevents tiny gradients from flushing to zero.
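The underflow-and-rescue mechanism can be demonstrated directly, assuming NumPy is available for a float16 type; the gradient magnitude and scale factor are illustrative.

```python
import numpy as np

# an unscaled FP16 gradient below the subnormal range flushes to zero...
g = np.float16(1e-8)

# ...but scaling before the cast keeps it representable
alpha = 1024.0
g_scaled = np.float16(1e-8 * alpha)     # survives in FP16
g_recovered = float(g_scaled) / alpha   # unscale in higher precision
```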
Complex Step Differentiation
Complex-step differentiation is another exact derivative method.
Using an imaginary perturbation $f(x + ih)$, Taylor expansion gives:

$$ f(x + ih) = f(x) + i h f'(x) + O(h^2) $$

Thus:

$$ f'(x) \approx \frac{\operatorname{Im}[f(x + ih)]}{h} $$

Unlike finite differences:
- no cancellation occurs
- very small $h$ is safe
Complex-step differentiation is highly accurate but:
- requires complex arithmetic support
- works mainly for analytic functions
- is less general than AD
AD remains more flexible for arbitrary programs and control flow.
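A short sketch using the standard library's `cmath`; note that $h = 10^{-20}$ would be hopeless for a finite difference but is harmless here.

```python
import cmath, math

def complex_step(f, x, h=1e-20):
    # no subtraction of nearby values occurs, so cancellation cannot
    return f(complex(x, h)).imag / h

d = complex_step(cmath.sin, 1.0)  # derivative of sin at x = 1
# agrees with cos(1) to full double precision despite the tiny h
```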
Exactness in Higher-Order AD
Higher-order AD preserves the same principle:
- local derivative propagation remains exact
- floating point arithmetic still limits accuracy
However higher-order derivatives amplify conditioning problems rapidly.
Example:

$$ f(x) = e^{cx} $$

Higher derivatives grow exponentially:

$$ f^{(n)}(x) = c^{n} e^{cx} $$
Thus:
- overflow becomes likely
- numerical sensitivity increases
The differentiation method remains exact relative to arithmetic, but the underlying mathematics becomes ill-conditioned.
Practical Numerical Stabilization
Real AD systems include stabilized primitives.
Examples:
LogSumExp
Instead of:

$$ \log \sum_i e^{x_i} $$

use:

$$ m + \log \sum_i e^{x_i - m} $$

where:

$$ m = \max_i x_i $$

This prevents overflow.
Stable Softmax
Naive softmax:

$$ \text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}} $$

may overflow.
Subtracting the maximum stabilizes both:
- primal outputs
- gradients
AD then propagates stable derivatives automatically.
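Both stabilized primitives fit in a few lines; the inputs below are chosen so that the naive forms would overflow.

```python
import math

def logsumexp(xs):
    # shift by the maximum so every exponent is <= 0: no overflow
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # stable primal values...
    s = sum(exps)
    return [e / s for e in exps]          # ...and hence stable gradients

big = [1000.0, 1001.0, 1002.0]  # naive exp(1000) would overflow FP64
lse = logsumexp(big)
probs = softmax(big)
```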
Exactness of the Chain Rule
The key reason AD achieves high accuracy is that the chain rule is applied analytically at every primitive.
For composition:

$$ y = f(g(x)) $$

AD computes:

$$ \frac{dy}{dx} = f'(g(x))\,g'(x) $$

through exact local propagation.
No approximation replaces the chain rule itself.
This is the core mathematical advantage of automatic differentiation.
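As a worked instance, take $g(x) = x^2$ and $f(u) = \sin(u)$ (an illustrative composition): each primitive contributes its exact local factor, and the product is the chain-rule derivative up to rounding.

```python
import math

x = 0.7
# composition y = sin(x^2), differentiated primitive by primitive
u, du = x * x, 2.0 * x                  # g(x) = x^2,  g'(x) = 2x
y, dy = math.sin(u), math.cos(u) * du   # f(u) = sin u, chain rule applied exactly
# dy equals cos(x^2) * 2x, up to floating point rounding
```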
Summary
Automatic differentiation computes derivatives exactly relative to the executed floating point program.
Its errors come from:
- floating point rounding
- conditioning
- instability in the primal computation
- instability inherent in the mathematical problem
It avoids:
- truncation error
- cancellation error from finite differences
- symbolic expression swell
The central principle is:
- AD propagates analytically exact local derivatives
- floating point arithmetic remains the only numerical approximation layer
Thus automatic differentiation is best understood as exact differentiation of numerical programs under finite-precision execution.