Derivative computation is not only a mathematical problem. It is also a numerical and systems problem. A derivative method must answer three questions simultaneously:
- Is the derivative accurate?
- How expensive is it?
- Is the computation numerically stable?
These questions determine whether a differentiation method is practical for scientific computing, optimization, and machine learning.
Automatic differentiation became important because it achieves a useful balance across all three dimensions.
Accuracy of Derivative Computation
Different differentiation methods produce different kinds of error.
| Method | Main source of error |
|---|---|
| Numerical differentiation | Truncation and roundoff |
| Symbolic differentiation | Algebraic expansion and simplification issues |
| Automatic differentiation | Floating point evaluation only |
The distinction is fundamental.
Finite differences estimate derivatives indirectly:

$$f'(x) \approx \frac{f(x+h) - f(x)}{h}$$
This introduces approximation error because the derivative is replaced by a secant slope.
Automatic differentiation does not estimate derivatives using nearby function evaluations. Instead, it applies exact local derivative rules to each primitive operation executed by the program.
For example, if the program computes

$$y = \sin(x),$$

AD uses the exact derivative rule

$$\frac{\partial y}{\partial x} = \cos(x).$$
No finite perturbation is introduced.
The resulting derivative is therefore exact with respect to the floating point computation being differentiated. The only remaining numerical errors are those already present in ordinary floating point arithmetic.
This property is often summarized informally as:
Automatic differentiation computes derivatives accurate to machine precision.
A more careful statement is:
Automatic differentiation computes derivatives of the executed floating point program without truncation error.
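A minimal sketch of that distinction, using JAX purely for illustration (the function `f`, the evaluation point, and the step size are arbitrary choices):

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)   # use double precision

def f(x):
    # a small program built from primitive operations
    return jnp.sin(x) * jnp.exp(-x ** 2)

x = 0.7
analytic = jnp.cos(x) * jnp.exp(-x ** 2) - 2 * x * jnp.sin(x) * jnp.exp(-x ** 2)

ad = jax.grad(f)(x)               # exact rule applied to each primitive
fd = (f(x + 1e-6) - f(x)) / 1e-6  # secant slope: truncation plus roundoff

print(abs(ad - analytic))         # around machine epsilon
print(abs(fd - analytic))         # many orders of magnitude larger
```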
Truncation Error vs Floating Point Error
To understand why this matters, consider finite differences again.
Using Taylor expansion,

$$f(x+h) = f(x) + h f'(x) + \tfrac{h^2}{2} f''(x) + \cdots$$

Rearranging gives

$$\frac{f(x+h) - f(x)}{h} = f'(x) + \tfrac{h}{2} f''(x) + \cdots$$

The extra terms are truncation error. Even in exact arithmetic, the estimate differs from the true derivative.
Now consider floating point subtraction. If $f(x+h)$ and $f(x)$ are nearly equal, their difference may lose many significant digits.

The total error behaves approximately like

$$E(h) \approx C_1 h + C_2 \frac{\varepsilon}{h}$$

where:

- $h$ controls truncation error
- $\varepsilon$ is machine precision
- $C_1$ and $C_2$ depend on the function

Reducing $h$ decreases truncation error but increases roundoff amplification.
Automatic differentiation avoids this tradeoff because it does not divide differences of nearby values.
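A short sketch of the tradeoff (the test function and step sizes are arbitrary): as `h` shrinks, the error first falls with the truncation term and then rises again as roundoff is amplified.

```python
import numpy as np

x = 1.0
exact = np.cos(x)                      # derivative of sin at x

for h in [1e-1, 1e-4, 1e-7, 1e-10, 1e-13]:
    fd = (np.sin(x + h) - np.sin(x)) / h
    print(f"h = {h:.0e}   error = {abs(fd - exact):.1e}")
```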
Computational Complexity
A derivative method must also scale efficiently.
Suppose $f : \mathbb{R}^n \to \mathbb{R}^m$.
The Jacobian has size $m \times n$.
A naive method that explicitly computes every partial derivative may become prohibitively expensive.
The cost depends strongly on the derivative representation required.
Scalar Input, Scalar Output
For $f : \mathbb{R} \to \mathbb{R}$,
all methods are relatively cheap. The derivative is a single number.
Many Inputs, Scalar Output
For $f : \mathbb{R}^n \to \mathbb{R}$ with large $n$,
the gradient contains $n$ components.
Finite differences require roughly $n + 1$ function evaluations for forward differences.
Reverse mode automatic differentiation computes the entire gradient with cost typically bounded by a small multiple of one evaluation of $f$.
This asymmetry is one of the most important facts in modern numerical computing.
A neural network with one billion parameters still produces one scalar loss. Reverse mode exploits this structure.
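A sketch of that asymmetry, assuming JAX and an arbitrary many-input scalar function:

```python
import jax
import jax.numpy as jnp

n = 1_000_000
x = jnp.ones(n)

def loss(x):
    return jnp.sum(jnp.tanh(x) ** 2)   # many inputs, one scalar output

# Reverse mode: the full million-entry gradient from roughly one forward
# pass plus one backward pass, regardless of n.
g = jax.grad(loss)(x)

# Forward differences would instead need about n + 1 evaluations of `loss`
# (one perturbed call per input component) for the same gradient.
```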
Forward Mode Complexity
Forward mode propagates tangent vectors forward through the computation.
If the input dimension is $n$, then computing the full Jacobian by forward mode requires $n$ tangent propagations.
The cost is roughly:

$$\text{cost}(J) \approx n \cdot \text{cost}(f)$$

This is efficient when $n$ is small.
Examples:
| Problem | Forward mode suitability |
|---|---|
| Few parameters, many outputs | Good |
| Directional derivatives | Excellent |
| Jacobian-vector products | Excellent |
| Large neural networks | Poor |
Forward mode naturally computes the Jacobian-vector product $Jv$,
where $v$ is a chosen direction.
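A small sketch of a Jacobian-vector product, assuming JAX and an arbitrary vector-valued function:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[0]), x[1] ** 3])

x = jnp.array([1.5, -0.5])
v = jnp.array([1.0, 0.0])          # chosen direction

# One forward tangent propagation yields J(x) @ v without forming J.
y, jv = jax.jvp(f, (x,), (v,))
print(jv)
```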
Reverse Mode Complexity
Reverse mode propagates adjoints backward.
For scalar-output functions
$$f : \mathbb{R}^n \to \mathbb{R},$$
reverse mode computes the full gradient $\nabla f(x)$ in roughly one reverse sweep.
The complexity is approximately

$$\text{cost}(\nabla f) \approx c \cdot \text{cost}(f), \quad c \text{ a small constant independent of } n.$$
This is often described as the cheap gradient principle.
The principle states:
For scalar-output functions, the gradient can often be computed with only a small constant-factor overhead relative to evaluating the function itself.
This principle explains the practical success of backpropagation.
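A minimal sketch of a single reverse sweep, again using JAX only for illustration: one vector-Jacobian product with adjoint 1.0 returns the whole gradient at once.

```python
import jax
import jax.numpy as jnp

def f(x):
    return 0.5 * jnp.sum(x ** 2)       # scalar output

x = jnp.arange(4.0)

y, pullback = jax.vjp(f, x)            # forward sweep, recording intermediates
(grad,) = pullback(1.0)                # one reverse sweep with adjoint 1.0
print(grad)                            # equals x here: the full gradient
```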
Jacobian Products
Constructing a full Jacobian is often unnecessary.
Instead, many algorithms require:
| Operation | Meaning |
|---|---|
| Jacobian-vector product | $Jv$ |
| Transpose-Jacobian-vector product | $J^\top v$ |
| Hessian-vector product | $Hv$ |
These products can often be computed without materializing the full matrix.
This matters because the full Jacobian may be enormous.
For example, a scalar loss over one billion parameters,
$$L : \mathbb{R}^{10^9} \to \mathbb{R},$$
has a gradient with one billion entries.
A Hessian would contain
$$10^9 \times 10^9 = 10^{18}$$
entries, which is impossible to store explicitly.
Modern AD systems therefore emphasize structured products rather than dense matrices.
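A sketch of one such structured product, a Hessian-vector product computed forward-over-reverse without ever materializing the Hessian (JAX and the loss function are illustrative choices):

```python
import jax
import jax.numpy as jnp

def loss(x):
    return jnp.sum(jnp.log1p(x ** 2))

def hvp(f, x, v):
    # Differentiate the gradient along direction v: one JVP of grad(f),
    # so H @ v is computed without forming the dense Hessian.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.linspace(0.1, 1.0, 5)
v = jnp.ones_like(x)
print(hvp(loss, x, v))
```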
Memory Complexity
Reverse mode gains computational efficiency by storing intermediate values from the forward pass.
These values are needed during the backward pass because derivative rules depend on the original computation.
For example, if
$$y = \exp(x),$$
then the backward rule
$$\bar{x} = \bar{y} \cdot y$$
needs the forward value $y$.
If $y$ is not stored, it must be recomputed.
Large programs therefore face a memory tradeoff:
| Strategy | Cost |
|---|---|
| Store intermediates | High memory usage |
| Recompute intermediates | Higher computation time |
This is called the checkpointing tradeoff.
In deep learning systems, activation memory often dominates GPU usage during training because reverse mode must retain forward activations for the backward pass.
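A sketch of the two sides of that tradeoff, assuming JAX; `jax.checkpoint` marks a block whose intermediates are discarded during the forward pass and recomputed during the backward pass (the block itself is an arbitrary stand-in for an expensive layer):

```python
import jax
import jax.numpy as jnp

W = 0.5 * jnp.eye(256)

def block(x):
    # stand-in for an expensive chunk of the forward pass
    for _ in range(8):
        x = jnp.tanh(x @ W)
    return x

def loss_store(x):
    # default: intermediates are kept for the backward pass (more memory)
    return jnp.sum(block(x) ** 2)

def loss_recompute(x):
    # checkpointed: intermediates inside `block` are recomputed (more time)
    return jnp.sum(jax.checkpoint(block)(x) ** 2)

x = jnp.ones(256)
g1 = jax.grad(loss_store)(x)
g2 = jax.grad(loss_recompute)(x)   # same gradient, different memory profile
```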
Stability of Derivative Computation
A derivative method can be mathematically correct yet numerically unstable.
Stability concerns how errors propagate through the computation.
Consider:
$$f(x) = \exp(x^2)$$
Then
$$f'(x) = 2x \, \exp(x^2)$$
Near moderate positive $x$ (for instance $x \approx 30$, where $\exp(900)$ already overflows in double precision), both the function and derivative become extremely large.
Now consider
$$g(x) = \log\bigl(\exp(x^2)\bigr)$$
Mathematically, $g(x) = x^2$.
But naive floating point evaluation may overflow inside $\exp$ before the logarithm cancels it.
Automatic differentiation differentiates the implemented computation, not the ideal algebraic simplification.
Therefore, stable derivative computation depends on stable primal computation.
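A small sketch of that failure mode, using JAX so the derivative can be inspected too; the two functions are algebraically identical but numerically very different:

```python
import jax
import jax.numpy as jnp

naive  = lambda x: jnp.log(jnp.exp(x ** 2))   # overflows inside exp
stable = lambda x: x ** 2                     # the simplified form

x = 40.0
print(jax.grad(stable)(x))   # 80.0
print(jax.grad(naive)(x))    # non-finite: the overflow poisons the backward pass
```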
Stable Primitive Operations
Many numerical libraries use stable formulations.
For example, the softmax function
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
is usually implemented as
$$\mathrm{softmax}(z)_i = \frac{e^{z_i - m}}{\sum_j e^{z_j - m}},$$
where
$$m = \max_j z_j.$$
Subtracting the maximum prevents overflow.
The derivative computation inherits this stability because AD differentiates the stabilized implementation.
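A sketch of the two implementations (the input values are arbitrary; the stabilized form is the one libraries typically differentiate):

```python
import jax.numpy as jnp

def softmax_naive(z):
    e = jnp.exp(z)
    return e / jnp.sum(e)

def softmax_stable(z):
    e = jnp.exp(z - jnp.max(z))    # exponents are now <= 0: no overflow
    return e / jnp.sum(e)

z = jnp.array([10.0, 500.0, 1000.0])
print(softmax_naive(z))    # contains nan: exp(1000) overflows
print(softmax_stable(z))   # well-defined probabilities
```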
Similarly, stable implementations exist for:
| Operation | Stable reformulation |
|---|---|
| Log-sum-exp | Shift by maximum |
| Sigmoid | Piecewise exponential forms |
| Norms | Scaling to avoid overflow |
| Quadratic formulas | Cancellation-aware forms |
| Matrix factorization | Orthogonal transforms |
AD systems rely heavily on stable primitives.
Conditioning vs Stability
Two related ideas must be separated.
Conditioning describes the mathematical sensitivity of a problem.
Stability describes the numerical behavior of an algorithm.
A derivative may be inherently sensitive.
For example:
$$f(x) = \sqrt{x}$$
Its derivative is
$$f'(x) = \frac{1}{2\sqrt{x}}$$
Near $x = 0$, tiny perturbations in input produce huge output changes. The problem itself is ill-conditioned.
No differentiation method can remove this sensitivity.
A stable AD implementation still produces large derivatives because the underlying mathematics demands them.
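Continuing the square-root illustration above (a sketch; the input value is arbitrary):

```python
import jax
import jax.numpy as jnp

# The sensitivity belongs to the problem, not the method: the derivative of
# sqrt near zero is genuinely huge, and AD faithfully reports it.
print(jax.grad(jnp.sqrt)(1e-8))   # about 5000.0
```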
Higher-Order Derivatives and Instability
Higher-order derivatives are often more numerically fragile.
Repeated differentiation amplifies noise, cancellation, and scaling problems.
For example:
$$f(x) = \sin(\omega x)$$
The derivatives scale rapidly:
$$f^{(k)}(x) = \omega^k \sin\!\left(\omega x + \tfrac{k\pi}{2}\right)$$
Higher derivatives magnify frequency and scaling.
This is one reason higher-order AD systems require careful engineering.
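A short sketch of that growth, assuming JAX and the oscillatory example above; each additional derivative multiplies the magnitude by roughly $\omega$:

```python
import jax
import jax.numpy as jnp

omega = 50.0
f = lambda x: jnp.sin(omega * x)

d1 = jax.grad(f)
d2 = jax.grad(d1)
d3 = jax.grad(d2)

x = 0.3
print(abs(d1(x)), abs(d2(x)), abs(d3(x)))   # grows roughly like omega**k
```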
Complexity of Real Systems
Real AD systems are constrained by more than asymptotic complexity.
They must also manage:
| Concern | Effect |
|---|---|
| Memory bandwidth | Limits throughput |
| GPU synchronization | Increases latency |
| Dynamic graphs | Adds runtime overhead |
| Sparse structure | Changes storage cost |
| Parallelism | Alters scheduling |
| Kernel fusion | Reduces intermediate allocations |
| Mixed precision | Changes numerical behavior |
A theoretically optimal derivative method may still perform poorly on modern hardware if memory access patterns are inefficient.
As a result, practical AD systems combine mathematical techniques with compiler optimization, runtime scheduling, graph rewriting, and hardware-aware execution.
Why AD Became Dominant
Automatic differentiation became dominant because it occupies a useful operating point.
| Property | Numerical Diff | Symbolic Diff | Automatic Diff |
|---|---|---|---|
| Accuracy | Approximate | Exact algebraic | Exact computational |
| Scalability | Poor for large input dimension | Poor for large programs | Good |
| Handles control flow | Yes | Difficult | Yes |
| Handles large programs | Yes | Difficult | Yes |
| Expression swell | None | Severe | None |
| Floating point aware | Indirectly | Often poorly | Naturally |
| Gradient cost | High | Variable | Efficient |
AD does not eliminate numerical problems. It does not make unstable computations stable or ill-conditioned problems well-conditioned.
What it does provide is a systematic way to compute derivatives with high accuracy and practical computational cost for large numerical programs.
That balance is the foundation of differentiable computing.