# Numerical Exactness up to Floating Point

Automatic differentiation computes derivatives exactly with respect to the executed floating point program. This distinguishes AD from numerical differentiation, which approximates derivatives using perturbations.

The phrase “exact up to floating point” has a precise meaning:

- AD does not introduce truncation error
- AD does not approximate derivatives by finite differences
- AD propagates analytically correct local derivatives
- All remaining error comes from ordinary floating point arithmetic

Automatic differentiation therefore inherits the numerical properties of the primal computation itself.

## Exactness Relative to the Executed Program

Consider a floating point program:

```text
y = x * x + sin(x)
```

Automatic differentiation differentiates this exact sequence of operations.

The derivative program becomes:

```text
dy = 2 * x * dx + cos(x) * dx
```

Every local derivative rule is mathematically exact for the primitive operation being executed.
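To make this concrete, here is a minimal forward-mode sketch in Python; the `Dual` class is purely illustrative and not any particular library's API:

```python
import math
from dataclasses import dataclass

@dataclass
class Dual:
    val: float  # primal value
    dot: float  # derivative with respect to the input

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule, exact for the primitive as executed
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(d):
    # exact local rule: d/dx sin(x) = cos(x)
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

x = Dual(1.5, 1.0)  # seed dx = 1
y = x * x + sin(x)
print(y.dot)        # equals 2*1.5 + cos(1.5), up to roundoff
```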

The resulting derivative corresponds to the derivative of the implemented floating point function:

$$
\hat f(x)
$$

not necessarily the ideal real-valued mathematical function:

$$
f(x)
$$

This distinction matters because floating point arithmetic changes the effective function being computed.

## Floating Point Arithmetic

Floating point operations satisfy:

$$
\operatorname{fl}(a \circ b) =
(a \circ b)(1+\delta)
$$

where:
- $\circ$ is an arithmetic operation
- $\delta$ represents rounding error
- typically $\delta$ satisfies:

$$
|\delta|\leq \varepsilon_{\text{mach}}
$$

Automatic differentiation propagates derivatives through these floating point operations exactly as executed.

Thus AD computes derivatives of the implemented floating point function:

$$
\hat f(x)
$$

rather than derivatives under ideal infinite-precision arithmetic, and the derivative values themselves are rounded as they are computed.
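A two-line check shows the gap between executed floating point arithmetic and real arithmetic:

```python
a, b, c = 0.1, 0.2, 0.3
print(a + b == c)    # False: fl(0.1 + 0.2) is not 0.3
print((a + b) - c)   # ~5.6e-17, ordinary rounding error
```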

## Contrast with Finite Differences

Finite differences estimate derivatives by perturbation.

Forward difference:

$$
f'(x)\approx\frac{f(x+h)-f(x)}{h}
$$

This introduces two competing errors.

### Truncation Error

Taylor expansion gives:

$$
f(x+h)
=
f(x)+hf'(x)+O(h^2)
$$

Thus:

$$
\frac{f(x+h)-f(x)}{h}
=
f'(x)+O(h)
$$

In exact real arithmetic, the approximation improves as $h\to0$.

### Cancellation Error

When $h$ becomes very small:

$$
f(x+h)-f(x)
$$

subtracts nearly equal floating point numbers.

This causes catastrophic cancellation.

The rounding contribution to the error in the derivative estimate then grows roughly like:

$$
O\left(\frac{\varepsilon_{\text{mach}}}{h}\right)
$$

Thus:
- large $h$: truncation dominates
- small $h$: cancellation dominates

The two error sources balance near $h \approx \sqrt{\varepsilon_{\text{mach}}}$, capping the achievable accuracy at roughly $\sqrt{\varepsilon_{\text{mach}}}$, so choosing $h$ is numerically delicate.
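The tradeoff is easy to observe with a small sweep over $h$; here for $f(x)=\sin(x)$, whose true derivative is $\cos(x)$:

```python
import math

x = 1.0
for h in [1e-1, 1e-4, 1e-8, 1e-12, 1e-15]:
    estimate = (math.sin(x + h) - math.sin(x)) / h
    print(f"h={h:.0e}  error={abs(estimate - math.cos(x)):.2e}")
# the error first shrinks (truncation) and then grows again (cancellation)
```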

Automatic differentiation avoids this entire tradeoff.

## Example of Cancellation

Suppose:

$$
f(x)=x^2
$$

Using finite differences:

$$
\frac{(x+h)^2-x^2}{h}
=
2x+h
$$

Mathematically correct.

But in floating point arithmetic, for very small $h$:

$$
(x+h)^2
\approx
x^2
$$

and subtraction destroys significant digits.

AD instead propagates:

$$
\frac{d}{dx}(x^2)=2x
$$

directly through local derivative rules.

No subtraction of nearby values occurs.
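In Python, with an aggressively small step:

```python
x, h = 1.0, 1e-12
fd = ((x + h) ** 2 - x ** 2) / h   # cancellation-prone finite difference
ad = 2 * x                         # the exact local rule AD applies
print(fd, ad)                      # fd matches 2.0 to only a few digits; ad is exact
```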

## Stability of Local Rules

AD computes derivatives using local primitive derivatives.

Example:

$$
z=\sin(x)
$$

Derivative propagation:

$$
\dot z=\cos(x)\dot x
$$

or in reverse mode:

$$
\bar x += \cos(x)\bar z
$$

These rules are algebraically exact.
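Written out as executable Python, the reverse-mode rule for this single node is just an accumulation (a sketch, not a full tape implementation):

```python
import math

x = 1.5
z = math.sin(x)                # forward (primal) pass

z_bar = 1.0                    # seed: adjoint of the output
x_bar = 0.0
x_bar += math.cos(x) * z_bar   # the exact local rule from above
print(x_bar)                   # cos(1.5), up to roundoff
```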

Thus derivative accuracy depends primarily on:
- the stability of primal evaluation
- the conditioning of the function
- floating point roundoff

AD itself introduces no additional approximation layer.

## Conditioning vs Algorithmic Error

A crucial distinction exists between:
- conditioning of the mathematical problem
- stability of the differentiation algorithm

Even exactly computed derivatives can be extremely large, and extremely sensitive to input perturbations, when the function is ill-conditioned.

Example:

$$
f(x)=\frac{1}{x}
$$

Derivative:

$$
f'(x)=-\frac{1}{x^2}
$$

Near zero:
- the derivative becomes extremely large
- small perturbations cause large output changes

AD computes this derivative correctly.

The instability belongs to the function itself, not the differentiation method.
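A quick numerical check near zero, with an illustrative perturbation of $10^{-10}$:

```python
x, dx = 1e-8, 1e-10
f = lambda t: 1.0 / t

print(f(x + dx) - f(x))   # ~ -9.9e5: a tiny input change moves the output enormously
print(-1.0 / x**2 * dx)   # ~ -1.0e6: the exact derivative predicts this scale
```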

## Stability of Reverse Mode

Reverse mode propagates adjoints backward through the graph.

Some operations amplify sensitivities.

Example:

$$
z=e^x
$$

Backward propagation:

$$
\bar x += e^x \bar z
$$

For large $x$:
- gradients may explode
- overflow becomes possible

Similarly:

$$
z=\log(x)
$$

gives:

$$
\bar x += \frac{\bar z}{x}
$$

Near zero:
- gradients diverge
- numerical instability increases

These issues are intrinsic to the derivatives themselves.

AD exposes them faithfully.
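For instance, the adjoint of $z=e^x$ overflows double precision once $x$ exceeds roughly 709 (`math.exp`'s overflow threshold):

```python
import math

x, z_bar = 800.0, 1.0
try:
    x_bar = math.exp(x) * z_bar   # local rule for z = e^x
except OverflowError:
    x_bar = float("inf")          # the true adjoint exceeds double range
print(x_bar)
```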

## Floating Point Non-Associativity

Floating point arithmetic is not associative.

In general:

$$
(a+b)+c
\neq
a+(b+c)
$$
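A two-line demonstration in double precision, with values chosen so that the small term is absorbed:

```python
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0: adding 1.0 to -1e16 is absorbed by rounding
```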

This means:
- compiler optimizations may alter results
- reordered reductions may change gradients
- parallel execution may produce different adjoints

AD systems therefore inherit:
- nondeterminism
- reduction sensitivity
- ordering dependence

especially on GPUs and distributed systems.

## Differentiating Approximate Kernels

Many numerical libraries use approximate implementations.

Examples:
- fast exponentials
- approximate reciprocal square roots
- polynomial approximations
- fused kernels

AD differentiates the implemented approximation.

If:

$$
\hat z \approx \exp(x)
$$

then AD computes:

$$
\frac{d\hat z}{dx}
$$

not necessarily:

$$
\frac{d}{dx}\exp(x)
$$
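As a toy illustration, suppose a kernel implements the exponential as a truncated Taylor series (purely hypothetical). AD then returns the exact derivative of that truncation, not of the true exponential:

```python
import math

def exp_kernel(x):
    # hypothetical approximate kernel: truncated Taylor series
    return 1.0 + x + x * x / 2.0 + x ** 3 / 6.0

def exp_kernel_grad(x):
    # what AD produces: the derivative of the kernel as written
    return 1.0 + x + x * x / 2.0

print(exp_kernel_grad(1.0))   # 2.5
print(math.exp(1.0))          # 2.718..., the derivative of the ideal exp at 1
```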

This distinction matters in:
- hardware accelerators
- low-precision inference
- approximate math libraries
- mixed-precision training

## Discontinuities and Branches

AD assumes differentiable local primitives.

Programs with discontinuities complicate this assumption.

Example:

```text
if x > 0:
    y = x
else:
    y = 0
```

The executed branch determines the propagated derivative.

For $x>0$:

$$
\frac{dy}{dx}=1
$$

For $x<0$:

$$
\frac{dy}{dx}=0
$$

At $x=0$:
- the function is nondifferentiable
- AD usually returns the derivative of the executed branch

This is exact relative to the executed trace, but not a generalized symbolic derivative of the entire piecewise function.
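A direct transcription in Python; the derivative reported at each input is that of the branch actually taken:

```python
def f(x):
    return x if x > 0 else 0.0

def f_grad(x):
    # derivative of the executed branch, mirroring typical AD behavior
    return 1.0 if x > 0 else 0.0

print(f_grad(2.0))   # 1.0
print(f_grad(0.0))   # 0.0: at x = 0 the else-branch runs, so its derivative is reported
```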

## Chaotic Systems

Some systems amplify tiny perturbations exponentially.

Examples:
- chaotic differential equations
- long recurrent systems
- turbulent simulations

Even exact derivatives may become numerically meaningless after sufficient time due to sensitivity amplification.

AD does not solve this issue.

It accurately propagates the instability already present in the underlying dynamics.
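The chaotic logistic map makes this visible: each step's local derivative is exact, yet their product grows roughly like $2^n$ in the chaotic regime (parameters here are illustrative):

```python
r, x, dx = 4.0, 0.2, 1.0
for _ in range(60):
    dx *= r * (1.0 - 2.0 * x)   # exact chain rule through one map step
    x = r * x * (1.0 - x)
print(dx)   # astronomically large: exact, yet useless as a practical sensitivity
```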

## Mixed Precision

Modern machine learning often uses:
- FP16
- BF16
- FP8

instead of FP32 or FP64.

AD propagates derivatives in the same precision unless special handling exists.

Low precision introduces:
- rounding amplification
- gradient underflow
- reduced dynamic range

Loss scaling is often used to counteract this underflow.

Example:

$$
L_{\text{scaled}} = \alpha L
$$

Backward propagation computes:

$$
\nabla L_{\text{scaled}}
=
\alpha \nabla L
$$

then gradients are divided by $\alpha$ afterward.

This prevents tiny gradients from flushing to zero.
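A small NumPy illustration of the underflow problem and the scaling fix; the scale factor 1024 is arbitrary:

```python
import numpy as np

g = 1e-8                             # a gradient below float16's representable range
print(np.float16(g))                 # 0.0: flushed to zero

scaled = np.float16(g * 1024.0)      # scaling the loss scales every gradient
print(np.float32(scaled) / 1024.0)   # ~1e-8: unscaling in float32 recovers it
```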

## Complex Step Differentiation

Complex-step differentiation is another method that delivers derivatives to near machine precision.

Using:

$$
f(x+ih)
$$

Taylor expansion gives:

$$
f(x+ih)
=
f(x)+ihf'(x)+O(h^2)
$$

Thus:

$$
f'(x)
\approx
\frac{\operatorname{Im}(f(x+ih))}{h}
$$
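A minimal Python version using `cmath`; note how extreme $h$ can safely be:

```python
import cmath, math

def complex_step(f, x, h=1e-200):
    # no subtraction of nearby values, so tiny h causes no cancellation
    return f(complex(x, h)).imag / h

print(complex_step(cmath.sin, 1.0))   # 0.5403023058681398
print(math.cos(1.0))                  # matches to full double precision
```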

Unlike finite differences:
- no cancellation occurs
- very small $h$ is safe

Complex-step differentiation is highly accurate but:
- requires complex arithmetic support
- works mainly for analytic functions
- is less general than AD

AD remains more flexible for arbitrary programs and control flow.

## Exactness in Higher-Order AD

Higher-order AD preserves the same principle:
- local derivative propagation remains exact
- floating point arithmetic still limits accuracy

However, higher-order derivatives amplify conditioning problems rapidly.

Example:

$$
f(x)=e^{100x}
$$

Higher derivatives grow exponentially:

$$
f^{(n)}(x)=100^n e^{100x}
$$

Thus:
- overflow becomes likely
- numerical sensitivity increases

The differentiation method remains exact relative to arithmetic, but the underlying mathematics becomes ill-conditioned.
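Even evaluating the closed-form magnitude overflows quickly in Python (the order $n=160$ here is illustrative):

```python
import math

x, n = 1.0, 160
try:
    magnitude = 100.0 ** n * math.exp(100.0 * x)   # |f^(n)(x)| for f(x) = e^{100x}
except OverflowError:
    magnitude = float("inf")
print(magnitude)   # inf: the 160th derivative already exceeds double range
```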

## Practical Numerical Stabilization

Real AD systems include stabilized primitives.

Examples:

### LogSumExp

Instead of:

$$
\log\left(\sum_i e^{x_i}\right)
$$

use:

$$
m+\log\left(\sum_i e^{x_i-m}\right)
$$

where:

$$
m=\max_i x_i
$$

This prevents overflow.
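A NumPy sketch of the shifted form, with inputs chosen so that the naive form would overflow:

```python
import numpy as np

def logsumexp(x):
    m = np.max(x)                              # shift by the maximum
    return m + np.log(np.sum(np.exp(x - m)))   # every exponent is now <= 0

x = np.array([1000.0, 1001.0, 1002.0])
print(logsumexp(x))   # ~1002.41; np.log(np.sum(np.exp(x))) would return inf
```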

### Stable Softmax

Naive softmax:

$$
\frac{e^{x_i}}{\sum_j e^{x_j}}
$$

may overflow.

Subtracting the maximum stabilizes both:
- primal outputs
- gradients

AD then propagates stable derivatives automatically.
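The same shift in NumPy form; both the outputs and the gradients AD derives from them stay finite:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))   # largest exponent becomes exp(0) = 1
    return z / np.sum(z)

print(softmax(np.array([1000.0, 1001.0, 1002.0])))   # finite, sums to 1
```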

## Exactness of the Chain Rule

The key reason AD achieves high accuracy is that the chain rule is applied analytically at every primitive.

For composition:

$$
f(g(h(x)))
$$

AD computes:

$$
f'(g(h(x)))
g'(h(x))
h'(x)
$$

through exact local propagation.
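Spelled out in Python for one illustrative composition ($f=\sqrt{\cdot}$, $g=\exp$, $h=\sin$):

```python
import math

x = 0.7
h_val, h_dot = math.sin(x), math.cos(x)                  # h(x) and h'(x)
g_val, g_dot = math.exp(h_val), math.exp(h_val) * h_dot  # g(h(x)) and its chain term
f_val, f_dot = math.sqrt(g_val), 0.5 / math.sqrt(g_val) * g_dot
print(f_dot)   # exact local propagation: no finite differences anywhere
```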

No approximation replaces the chain rule itself.

This is the core mathematical advantage of automatic differentiation.

## Summary

Automatic differentiation computes derivatives exactly relative to the executed floating point program.

Its errors come from:
- floating point rounding
- conditioning
- instability in the primal computation
- instability inherent in the mathematical problem

It avoids:
- truncation error
- cancellation error from finite differences
- symbolic expression swell

The central principle is:

- AD propagates analytically exact local derivatives
- floating point arithmetic remains the only numerical approximation layer

Thus automatic differentiation is best understood as exact differentiation of numerical programs under finite-precision execution.

