Automatic differentiation computes derivatives exactly with respect to the executed floating point program. This distinguishes AD from numerical differentiation, which approximates derivatives using perturbations.
The phrase “exact up to floating point” has a precise meaning:
- AD does not introduce truncation error
- AD does not approximate derivatives by finite differences
- AD propagates analytically correct local derivatives
- all remaining error comes from ordinary floating point arithmetic
Automatic differentiation therefore inherits the numerical properties of the primal computation itself.
Exactness Relative to the Executed Program
Consider a floating point program:
y = x * x + sin(x)

Automatic differentiation differentiates this exact sequence of operations.
The derivative program becomes:
dy = 2 * x * dx + cos(x) * dx

Every local derivative rule is mathematically exact for the primitive operation being executed.
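This propagation can be made concrete with a minimal forward-mode sketch. The `Dual` class below is illustrative only (not any particular library's API): it carries a primal value and a derivative through each primitive.

```python
import math

# Minimal forward-mode AD sketch using dual numbers (illustrative only).
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot  # primal value and derivative

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule, exact for this primitive operation
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

def dsin(x):
    # local rule: d(sin u) = cos(u) du
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

x = Dual(1.5, 1.0)       # seed dx = 1
y = x * x + dsin(x)      # the program y = x*x + sin(x)
# y.dot now holds 2*x + cos(x), evaluated in floating point
```

Every step applies an analytically exact local rule; the only error source is the floating point arithmetic of the primal and tangent values themselves.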
The resulting derivative corresponds to the derivative of the implemented floating point function $\hat{f}$, not necessarily the ideal real-valued mathematical function $f$.
This distinction matters because floating point arithmetic changes the effective function being computed.
Floating Point Arithmetic
Floating point operations satisfy:

$$ \text{fl}(a \circ b) = (a \circ b)(1 + \varepsilon) $$

where:
- $\circ$ is an arithmetic operation
- $\varepsilon$ represents rounding error
- typically $|\varepsilon| \le u$, the unit roundoff (about $1.1 \times 10^{-16}$ for FP64)
Automatic differentiation propagates derivatives through these floating point operations exactly as executed.
Thus AD computes derivatives of the rounded computation $\hat{f}$ that was actually executed, rather than derivatives of ideal infinite-precision arithmetic.
Contrast with Finite Differences
Finite differences estimate derivatives by perturbation.
Forward difference:

$$ f'(x) \approx \frac{f(x+h) - f(x)}{h} $$
This introduces two competing errors.
Truncation Error
Taylor expansion gives:
$$ f(x+h) = f(x) + h f'(x) + O(h^2) $$
Thus:
$$ \frac{f(x+h) - f(x)}{h} = f'(x) + O(h) $$
The approximation improves as $h \to 0$.
Cancellation Error
When $h$ becomes very small, the numerator $f(x+h) - f(x)$ subtracts nearly equal floating point numbers.
This causes catastrophic cancellation.
Relative error grows roughly like:

$$ \frac{\varepsilon_{\text{mach}}}{h} $$
Thus:
- large $h$: truncation dominates
- small $h$: cancellation dominates

Choosing $h$ becomes numerically delicate.
Automatic differentiation avoids this entire tradeoff.
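The tradeoff is easy to observe numerically. The sketch below measures the forward-difference error for $f = \sin$ at three step sizes; the specific values of $h$ are illustrative.

```python
import math

def fd(f, x, h):
    # forward difference: truncation error O(h), cancellation error O(eps/h)
    return (f(x + h) - f(x)) / h

exact = math.cos(1.0)  # true derivative of sin at x = 1
errors = {h: abs(fd(math.sin, 1.0, h) - exact)
          for h in (1e-1, 1e-8, 1e-15)}
# h = 1e-1  : truncation dominates (error ~1e-2)
# h = 1e-8  : near the sweet spot for FP64
# h = 1e-15 : cancellation dominates and the estimate degrades badly
```

Neither extreme is safe, and the best $h$ depends on the function, the point, and the precision.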
Example of Cancellation
Suppose:

$$ f(x) = x^2 $$
Using finite differences:
$$ \frac{(x+h)^2 - x^2}{h} = 2x + h $$
Mathematically correct.
But in floating point arithmetic, for very small $h$, $(x+h)^2$ and $x^2$ are nearly equal, and the subtraction destroys significant digits.
AD instead propagates:

$$ \frac{d}{dx}\left(x^2\right) = 2x $$

directly through local derivative rules.
No subtraction of nearby values occurs.
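The failure mode is visible with a step below half an ulp of $x$, at which point the finite difference collapses entirely while the local rule is unaffected:

```python
x, h = 1.0, 1e-17   # h is below half an ulp of 1.0 in FP64

fd = ((x + h)**2 - x**2) / h   # finite difference estimate
ad = 2.0 * x                    # AD local rule: d(x^2)/dx = 2x

# 1.0 + 1e-17 rounds back to exactly 1.0, so fd == 0.0 while ad == 2.0
```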
Stability of Local Rules
AD computes derivatives using local primitive derivatives.
Example: a primitive $y = f(u)$.

Derivative propagation in forward mode:

$$ \dot{y} = f'(u)\,\dot{u} $$

or in reverse mode:

$$ \bar{u} = f'(u)\,\bar{y} $$
These rules are algebraically exact.
Thus derivative accuracy depends primarily on:
- the stability of primal evaluation
- the conditioning of the function
- floating point roundoff
AD itself introduces no additional approximation layer.
Conditioning vs Algorithmic Error
A crucial distinction exists between:
- conditioning of the mathematical problem
- stability of the differentiation algorithm
Even exact derivatives may be numerically large if the function is ill-conditioned.
Example:

$$ f(x) = \sqrt{x} $$

Derivative:

$$ f'(x) = \frac{1}{2\sqrt{x}} $$

Near zero:
- the derivative becomes extremely large
- small perturbations cause large output changes
AD computes this derivative correctly.
The instability belongs to the function itself, not the differentiation method.
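A quick illustration, using $\sqrt{x}$ as a representative function that is ill-conditioned near zero:

```python
import math

def dsqrt(x):
    # exact local rule: d(sqrt x)/dx = 1 / (2 sqrt x)
    return 1.0 / (2.0 * math.sqrt(x))

# the derivative is computed correctly at every point,
# but grows without bound as x approaches zero
vals = [dsqrt(x) for x in (1.0, 1e-8, 1e-16)]
# roughly 0.5, 5e3, 5e7: the blow-up is the function's conditioning, not AD error
```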
Stability of Reverse Mode
Reverse mode propagates adjoints backward through the graph.
Some operations amplify sensitivities.
Example:

$$ y = e^{x} $$

Backward propagation:

$$ \bar{x} = e^{x}\,\bar{y} $$

For large $x$:
- gradients may explode
- overflow becomes possible

Similarly:

$$ y = \log x $$

gives:

$$ \bar{x} = \frac{\bar{y}}{x} $$

Near zero:
- gradients diverge
- numerical instability increases
These issues are intrinsic to the derivatives themselves.
AD exposes them faithfully.
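Both failure modes are directly reproducible; the inputs below are chosen to trigger them and are otherwise arbitrary.

```python
import math

# d/dx exp(x) = exp(x): for large x both primal and gradient overflow in FP64
try:
    g_exp = math.exp(800.0)   # exceeds the FP64 maximum (~1.8e308)
except OverflowError:
    g_exp = float('inf')      # math.exp raises instead of returning inf

# d/dx log(x) = 1/x: the gradient diverges as x approaches zero
g_log = 1.0 / 1e-300          # ~1e300, close to the overflow threshold
```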
Floating Point Non-Associativity
Floating point arithmetic is not associative.
Example: in FP64,

$$ (10^{16} + 1) + 1 = 10^{16}, \qquad 10^{16} + (1 + 1) = 10^{16} + 2 $$
This means:
- compiler optimizations may alter results
- reordered reductions may change gradients
- parallel execution may produce different adjoints
AD systems therefore inherit:
- nondeterminism
- reduction sensitivity
- ordering dependence
especially on GPUs and distributed systems.
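The non-associativity above can be checked in a few lines:

```python
a, b, c = 1e16, 1.0, 1.0

left  = (a + b) + c   # each 1.0 is absorbed by rounding: result 1e16
right = a + (b + c)   # 2.0 survives: result 1e16 + 2

# different summation orders give different answers, which is why
# reordered or parallel reductions can change accumulated gradients
```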
Differentiating Approximate Kernels
Many numerical libraries use approximate implementations.
Examples:
- fast exponentials
- approximate reciprocal square roots
- polynomial approximations
- fused kernels
AD differentiates the implemented approximation.
If a library implements an approximation:

$$ \tilde{f} \approx f $$

then AD computes:

$$ \tilde{f}' $$

not necessarily:

$$ f' $$
This distinction matters in:
- hardware accelerators
- low-precision inference
- approximate math libraries
- mixed-precision training
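A hypothetical fast kernel makes the point concrete: the truncated-Taylor exponential below stands in for an approximate library implementation, and differentiating it yields the derivative of the polynomial, not of the true exponential.

```python
import math

def approx_exp(x):
    # hypothetical "fast exp": a truncated Taylor polynomial
    return 1.0 + x + x*x/2.0 + x*x*x/6.0

def approx_exp_grad(x):
    # AD differentiates the implemented approximation term by term
    return 1.0 + x + x*x/2.0

x = 1.0
true_grad = math.exp(x)         # derivative of the ideal function
impl_grad = approx_exp_grad(x)  # derivative of what was actually computed
# the gap between the two is the approximation error of the kernel
```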
Discontinuities and Branches
AD assumes differentiable local primitives.
Programs with discontinuities complicate this assumption.
Example:
if x > 0:
    y = x
else:
    y = 0

The executed branch determines the propagated derivative.
For $x > 0$: $\frac{dy}{dx} = 1$
For $x \le 0$: $\frac{dy}{dx} = 0$
At $x = 0$:
- the function is nondifferentiable
- AD usually returns the derivative of the executed branch
This is exact relative to the executed trace, but not a generalized symbolic derivative of the entire piecewise function.
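A trace-level sketch of this behavior (the pairing of value and derivative is illustrative):

```python
def relu_with_grad(x):
    # the executed branch determines the propagated derivative
    if x > 0:
        return x, 1.0    # derivative of the y = x branch
    else:
        return 0.0, 0.0  # derivative of the y = 0 branch (also taken at x == 0)

# at exactly x == 0 the function is nondifferentiable; the trace simply
# reports the derivative of whichever branch actually executed
```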
Chaotic Systems
Some systems amplify tiny perturbations exponentially.
Examples:
- chaotic differential equations
- long recurrent systems
- turbulent simulations
Even exact derivatives may become numerically meaningless after sufficient time due to sensitivity amplification.
AD does not solve this issue.
It accurately propagates the instability already present in the underlying dynamics.
Mixed Precision
Modern machine learning often uses:
- FP16
- BF16
- FP8
instead of FP64.
AD propagates derivatives in the same precision unless special handling exists.
Low precision introduces:
- rounding amplification
- gradient underflow
- reduced dynamic range
Loss scaling is often used.
Example: the loss is multiplied by a scale factor $\alpha$ before the backward pass.

Backward propagation computes:

$$ \nabla L_{\text{scaled}} = \alpha\,\nabla L $$

then gradients are rescaled afterward.
This prevents tiny gradients from flushing to zero.
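The underflow-and-rescue mechanism can be demonstrated directly, assuming NumPy is available for a float16 type; the gradient magnitude and scale factor are illustrative.

```python
import numpy as np

# an unscaled FP16 gradient below the subnormal range flushes to zero...
g = np.float16(1e-8)

# ...but scaling before the cast keeps it representable
alpha = 1024.0
g_scaled = np.float16(1e-8 * alpha)     # survives in FP16
g_recovered = float(g_scaled) / alpha   # unscale in higher precision
```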
Complex Step Differentiation
Complex-step differentiation is another exact derivative method.
Using an imaginary perturbation $f(x + ih)$, Taylor expansion gives:

$$ f(x + ih) = f(x) + i h f'(x) + O(h^2) $$

Thus:

$$ f'(x) \approx \frac{\operatorname{Im}[f(x + ih)]}{h} $$

Unlike finite differences:
- no cancellation occurs
- very small $h$ is safe
Complex-step differentiation is highly accurate but:
- requires complex arithmetic support
- works mainly for analytic functions
- is less general than AD
AD remains more flexible for arbitrary programs and control flow.
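A short sketch using the standard library's `cmath`; note that $h = 10^{-20}$ would be hopeless for a finite difference but is harmless here.

```python
import cmath, math

def complex_step(f, x, h=1e-20):
    # no subtraction of nearby values occurs, so cancellation cannot
    return f(complex(x, h)).imag / h

d = complex_step(cmath.sin, 1.0)  # derivative of sin at x = 1
# agrees with cos(1) to full double precision despite the tiny h
```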
Exactness in Higher-Order AD
Higher-order AD preserves the same principle:
- local derivative propagation remains exact
- floating point arithmetic still limits accuracy
However higher-order derivatives amplify conditioning problems rapidly.
Example:

$$ f(x) = e^{cx} $$

Higher derivatives grow exponentially:

$$ f^{(n)}(x) = c^{n} e^{cx} $$
Thus:
- overflow becomes likely
- numerical sensitivity increases
The differentiation method remains exact relative to arithmetic, but the underlying mathematics becomes ill-conditioned.
Practical Numerical Stabilization
Real AD systems include stabilized primitives.
Examples:
LogSumExp
Instead of:

$$ \log \sum_i e^{x_i} $$

use:

$$ m + \log \sum_i e^{x_i - m} $$

where:

$$ m = \max_i x_i $$

This prevents overflow.
Stable Softmax
Naive softmax:

$$ \text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}} $$

may overflow.
Subtracting the maximum stabilizes both:
- primal outputs
- gradients
AD then propagates stable derivatives automatically.
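Both stabilized primitives fit in a few lines; the inputs below are chosen so that the naive forms would overflow.

```python
import math

def logsumexp(xs):
    # shift by the maximum so every exponent is <= 0: no overflow
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # stable primal values...
    s = sum(exps)
    return [e / s for e in exps]          # ...and hence stable gradients

big = [1000.0, 1001.0, 1002.0]  # naive exp(1000) would overflow FP64
lse = logsumexp(big)
probs = softmax(big)
```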
Exactness of the Chain Rule
The key reason AD achieves high accuracy is that the chain rule is applied analytically at every primitive.
For composition:

$$ y = f(g(x)) $$

AD computes:

$$ \frac{dy}{dx} = f'(g(x))\,g'(x) $$

through exact local propagation.
No approximation replaces the chain rule itself.
This is the core mathematical advantage of automatic differentiation.
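As a worked instance, take $g(x) = x^2$ and $f(u) = \sin(u)$ (an illustrative composition): each primitive contributes its exact local factor, and the product is the chain-rule derivative up to rounding.

```python
import math

x = 0.7
# composition y = sin(x^2), differentiated primitive by primitive
u, du = x * x, 2.0 * x                  # g(x) = x^2,  g'(x) = 2x
y, dy = math.sin(u), math.cos(u) * du   # f(u) = sin u, chain rule applied exactly
# dy equals cos(x^2) * 2x, up to floating point rounding
```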
Summary
Automatic differentiation computes derivatives exactly relative to the executed floating point program.
Its errors come from:
- floating point rounding
- conditioning
- instability in the primal computation
- instability inherent in the mathematical problem
It avoids:
- truncation error
- cancellation error from finite differences
- symbolic expression swell
The central principle is:
- AD propagates analytically exact local derivatives
- floating point arithmetic remains the only numerical approximation layer
Thus automatic differentiation is best understood as exact differentiation of numerical programs under finite-precision execution.