
Accuracy, Complexity, and Stability

Derivative computation is not only a mathematical problem. It is also a numerical and systems problem. A derivative method must answer three questions simultaneously:

  1. Is the derivative accurate?
  2. How expensive is it?
  3. Is the computation numerically stable?

These questions determine whether a differentiation method is practical for scientific computing, optimization, and machine learning.

Automatic differentiation became important because it achieves a useful balance across all three dimensions.

Accuracy of Derivative Computation

Different differentiation methods produce different kinds of error.

| Method | Main source of error |
| --- | --- |
| Numerical differentiation | Truncation and roundoff |
| Symbolic differentiation | Algebraic expansion and simplification issues |
| Automatic differentiation | Floating point evaluation only |

The distinction is fundamental.

Finite differences estimate derivatives indirectly:

f'(x) \approx \frac{f(x+h)-f(x)}{h}.

This introduces approximation error because the derivative is replaced by a secant slope.

Automatic differentiation does not estimate derivatives using nearby function evaluations. Instead, it applies exact local derivative rules to each primitive operation executed by the program.

For example, if the program computes

z = xy,

AD uses the exact derivative rule

dz = y\,dx + x\,dy.

No finite perturbation is introduced.

The resulting derivative is therefore exact with respect to the floating point computation being differentiated. The only remaining numerical errors are those already present in ordinary floating point arithmetic.

This property is often summarized informally as:

Automatic differentiation computes derivatives accurate to machine precision.

A more careful statement is:

Automatic differentiation computes derivatives of the executed floating point program without truncation error.

Truncation Error vs Floating Point Error

To understand why this matters, consider finite differences again.

Using Taylor expansion,

f(x+h) = f(x) + h f'(x) + \frac{h^2}{2}f''(x) + O(h^3).

Rearranging gives

\frac{f(x+h)-f(x)}{h} = f'(x) + \frac{h}{2}f''(x) + O(h^2).

The extra terms are truncation error. Even in exact arithmetic, the estimate differs from the true derivative.

Now consider floating point subtraction. If

f(x+h) \approx f(x),

their difference may lose many significant digits.

The total error behaves approximately like

C_1 h + C_2 \frac{\epsilon}{h},

where:

  • h controls truncation error
  • ϵ is machine precision
  • C_1, C_2 depend on the function

Reducing h decreases truncation error but increases roundoff amplification.

Automatic differentiation avoids this tradeoff because it does not divide differences of nearby values.
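The tradeoff is easy to observe numerically. The following sketch (function and step sizes chosen for illustration) measures the forward-difference error for f(x) = exp(x) at x = 1, where the true derivative is exp(1):

```python
import math

# Forward-difference derivative of f(x) = exp(x) at x = 1.
# The total error behaves like C1*h + C2*eps/h: shrinking h first reduces
# truncation error, but eventually roundoff amplification dominates.
def fd_error(h, x=1.0):
    estimate = (math.exp(x + h) - math.exp(x)) / h
    return abs(estimate - math.exp(x))   # true derivative is exp(x)

errors = {h: fd_error(h) for h in (1e-2, 1e-5, 1e-8, 1e-12)}
# The error first decreases as h shrinks, then grows again:
# fd_error(1e-5) < fd_error(1e-2), but fd_error(1e-12) > fd_error(1e-8).
```

No single h is safe in general: the best step size depends on the unknown constants C_1 and C_2.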

Computational Complexity

A derivative method must also scale efficiently.

Suppose

f : \mathbb{R}^n \to \mathbb{R}^m.

The Jacobian has size

m \times n.

A naive method that explicitly computes every partial derivative may become prohibitively expensive.

The cost depends strongly on the derivative representation required.

Scalar Input, Scalar Output

For

f : \mathbb{R} \to \mathbb{R},

all methods are relatively cheap. The derivative is a single number.

Many Inputs, Scalar Output

For

f : \mathbb{R}^n \to \mathbb{R},

the gradient contains n components.

Finite differences require roughly:

n+1

function evaluations for forward differences.

Reverse mode automatic differentiation computes the entire gradient with cost typically bounded by a small multiple of one evaluation of f.

This asymmetry is one of the most important facts in modern numerical computing.

A neural network with one billion parameters still produces one scalar loss. Reverse mode exploits this structure.

Forward Mode Complexity

Forward mode propagates tangent vectors forward through the computation.

If the input dimension is n, then computing the full Jacobian by forward mode requires n tangent propagations.

The cost is roughly:

O(n \cdot \text{cost}(f)).

This is efficient when n is small.

Examples:

| Problem | Forward mode suitability |
| --- | --- |
| Few parameters, many outputs | Good |
| Directional derivatives | Excellent |
| Jacobian-vector products | Excellent |
| Large neural networks | Poor |

Forward mode naturally computes

J_f(x)u,

where u is a chosen direction.
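Forward mode can be sketched with dual numbers: each value carries a primal component and a tangent component, and one pass propagates a chosen direction u through the program. The class and function names below are illustrative, not any particular library's API:

```python
# Minimal forward-mode AD via dual numbers: each value is (primal, tangent).
# Propagating one tangent vector computes the directional derivative J_f(x) u
# in a single pass, without forming the Jacobian.
class Dual:
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other):
        # Exact product rule: d(xy) = y dx + x dy
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)

def f(x, y):                 # f(x, y) = x*y + x^2
    return x * y + x * x

# Directional derivative at (3, 2) in direction u = (1, 0),
# i.e. df/dx = y + 2x = 8, alongside the primal value 15:
out = f(Dual(3.0, 1.0), Dual(2.0, 0.0))
# out.val == 15.0, out.tan == 8.0
```

Computing the full gradient this way would require one such pass per input component, which is exactly the O(n · cost(f)) behavior described above.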

Reverse Mode Complexity

Reverse mode propagates adjoints backward.

For scalar-output functions,

f : \mathbb{R}^n \to \mathbb{R},

reverse mode computes

\nabla f(x)

in roughly one reverse sweep.

The complexity is approximately

O(\text{cost}(f)).

This is often described as the cheap gradient principle.

The principle states:

For scalar-output functions, the gradient can often be computed with only a small constant-factor overhead relative to evaluating the function itself.

This principle explains the practical success of backpropagation.
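The single reverse sweep can be sketched with a tape: the forward pass records every operation in evaluation order, and one backward pass over the reversed tape accumulates the gradient with respect to all inputs at once. The names here are illustrative, not any particular library's API:

```python
# Minimal tape-based reverse mode: the forward pass records operations on a
# tape; one backward sweep then yields the gradient of a scalar output with
# respect to every input, regardless of how many inputs there are.
tape = []

class Var:
    def __init__(self, val, parents=()):
        self.val, self.parents, self.grad = val, parents, 0.0
        tape.append(self)            # record in evaluation order

    def __add__(self, other):
        # local rules: d(x+y)/dx = 1, d(x+y)/dy = 1
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # local rules: d(xy)/dx = y, d(xy)/dy = x
        return Var(self.val * other.val, [(self, other.val), (other, self.val)])

def backward(out):
    out.grad = 1.0
    for node in reversed(tape):      # reversed tape is a reverse topological order
        for parent, local in node.parents:
            parent.grad += local * node.grad

x, y = Var(3.0), Var(2.0)
z = x * y + x * x                    # z = 15, dz/dx = y + 2x = 8, dz/dy = x = 3
backward(z)
```

Both partial derivatives come out of one backward sweep; adding more inputs would not add more sweeps, which is the cheap gradient principle in miniature.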

Jacobian Products

Constructing a full Jacobian is often unnecessary.

Instead, many algorithms require:

| Operation | Meaning |
| --- | --- |
| Jv | Jacobian-vector product |
| J^T v | Transpose-Jacobian-vector product |
| Hv | Hessian-vector product |

These products can often be computed without materializing the full matrix.

This matters because the full Jacobian may be enormous.

For example:

f : \mathbb{R}^{10^9} \to \mathbb{R}

has a gradient with one billion entries.

A Hessian would contain

10^{18}

entries, which is impossible to store explicitly.

Modern AD systems therefore emphasize structured products rather than dense matrices.
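A small example of a structured product, for an illustrative elementwise function: if f(x) applies sin to each component, its Jacobian is the diagonal matrix diag(cos(x_i)), and the product Jv can be formed in O(n) without ever building the n-by-n matrix:

```python
import math

# Matrix-free Jacobian-vector product for f(x) = [sin(x_1), ..., sin(x_n)].
# The Jacobian is diag(cos(x_i)); J v needs only n multiplications, not n^2
# storage. The function name is illustrative.
def jvp_sin(x, v):
    return [math.cos(xi) * vi for xi, vi in zip(x, v)]

x = [0.0, math.pi / 2]
v = [1.0, 1.0]
Jv = jvp_sin(x, v)
# J v = [cos(0)*1, cos(pi/2)*1] ≈ [1.0, 0.0]
```

Real AD systems generalize this idea: forward mode produces Jv, reverse mode produces J^T v, and their composition yields Hessian-vector products, all without materializing dense matrices.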

Memory Complexity

Reverse mode gains computational efficiency by storing intermediate values from the forward pass.

These values are needed during the backward pass because derivative rules depend on the original computation.

For example, if

y = \sin x,

then the backward rule needs

\cos x.

If x is not stored, it must be recomputed.

Large programs therefore face a memory tradeoff:

| Strategy | Cost |
| --- | --- |
| Store intermediates | High memory usage |
| Recompute intermediates | Higher computation time |

This is called the checkpointing tradeoff.

In deep learning systems, activation memory often dominates GPU usage during training because reverse mode must retain forward activations for the backward pass.
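The bookkeeping side of checkpointing can be sketched on the forward pass alone. For a chain of n steps, storing every intermediate costs n values; storing every k-th value costs about n/k, at the price of recomputing up to k-1 steps from the nearest checkpoint whenever the backward pass needs a missing value. The chain below is purely illustrative:

```python
import math

# Checkpointing sketch: keep only every k-th intermediate of a chain of
# elementwise steps. Missing intermediates can later be recomputed from the
# nearest earlier checkpoint instead of being stored.
def run_chain(x, n, keep_every):
    stored = {}                    # checkpointed intermediates, keyed by step
    for i in range(n):
        if i % keep_every == 0:
            stored[i] = x          # checkpoint before applying step i
        x = math.tanh(x)           # one forward step (illustrative choice)
    return x, stored

y, stored = run_chain(0.5, n=100, keep_every=10)
# len(stored) == 10: a 10x memory reduction versus storing all 100 values.
```

Choosing k trades memory for recomputation; practical systems pick checkpoint schedules based on the available device memory.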

Stability of Derivative Computation

A derivative method can be mathematically correct yet numerically unstable.

Stability concerns how errors propagate through the computation.

Consider:

f(x) = \exp(100x).

Then

f'(x) = 100\exp(100x).

Near moderate positive x, both the function and derivative become extremely large.

Now consider

f(x) = \log(\exp x).

Mathematically,

f(x) = x.

But naive floating point evaluation may overflow inside \exp x before the logarithm cancels it.

Automatic differentiation differentiates the implemented computation, not the ideal algebraic simplification.

Therefore, stable derivative computation depends on stable primal computation.
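The log(exp(x)) failure is directly observable in Python, where math.exp raises OverflowError once its argument exceeds roughly 709:

```python
import math

# log(exp(x)) is mathematically just x, but the naive floating point program
# overflows for large x. AD differentiates this implemented program, so it
# inherits the failure rather than the algebraic simplification.
def naive_identity(x):
    return math.log(math.exp(x))   # math.exp overflows for large x

naive_identity(10.0)               # fine: returns 10.0 up to rounding
try:
    naive_identity(1000.0)
except OverflowError:
    pass                           # exp(1000.0) overflows before log runs
```

A stable implementation would simply return x, and differentiating that stable program would be equally well behaved.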

Stable Primitive Operations

Many numerical libraries use stable formulations.

For example, the softmax function:

\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}

is usually implemented as

\frac{\exp(x_i - m)}{\sum_j \exp(x_j - m)},

where

m = \max_j x_j.

Subtracting the maximum prevents overflow.

The derivative computation inherits this stability because AD differentiates the stabilized implementation.
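The max-shifted formulation above is a few lines of code. Subtracting m keeps every exponent at most zero, so nothing overflows even for inputs that would break the naive formula:

```python
import math

# Max-shifted softmax: mathematically identical to exp(x_i) / sum_j exp(x_j),
# but every exponent is <= 0, so no term overflows.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([1000.0, 1001.0, 1002.0])  # naive exp(1000.0) would overflow
# p is a valid probability vector: nonnegative entries summing to 1
```

Because AD differentiates this stabilized code path, the backward pass sees only well-scaled intermediates.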

Similarly, stable implementations exist for:

| Operation | Stable reformulation |
| --- | --- |
| Log-sum-exp | Shift by maximum |
| Sigmoid | Piecewise exponential forms |
| Norms | Scaling to avoid overflow |
| Quadratic formulas | Cancellation-aware forms |
| Matrix factorization | Orthogonal transforms |

AD systems rely heavily on stable primitives.

Conditioning vs Stability

Two related ideas must be separated.

Conditioning describes the mathematical sensitivity of a problem.

Stability describes the numerical behavior of an algorithm.

A derivative may be inherently sensitive.

For example:

f(x) = \frac{1}{x}.

Its derivative is

f'(x) = -\frac{1}{x^2}.

Near x = 0, tiny perturbations in input produce huge output changes. The problem itself is ill-conditioned.

No differentiation method can remove this sensitivity.

A stable AD implementation still produces large derivatives because the underlying mathematics demands them.

Higher-Order Derivatives and Instability

Higher-order derivatives are often more numerically fragile.

Repeated differentiation amplifies noise, cancellation, and scaling problems.

For example:

f(x) = \sin(1000x).

The derivatives scale rapidly:

f'(x) = 1000\cos(1000x), \qquad f''(x) = -10^6\sin(1000x).

Higher derivatives magnify frequency and scaling.

This is one reason higher-order AD systems require careful engineering.

Complexity of Real Systems

Real AD systems are constrained by more than asymptotic complexity.

They must also manage:

| Concern | Effect |
| --- | --- |
| Memory bandwidth | Limits throughput |
| GPU synchronization | Increases latency |
| Dynamic graphs | Adds runtime overhead |
| Sparse structure | Changes storage cost |
| Parallelism | Alters scheduling |
| Kernel fusion | Reduces intermediate allocations |
| Mixed precision | Changes numerical behavior |

A theoretically optimal derivative method may still perform poorly on modern hardware if memory access patterns are inefficient.

As a result, practical AD systems combine mathematical techniques with compiler optimization, runtime scheduling, graph rewriting, and hardware-aware execution.

Why AD Became Dominant

Automatic differentiation became dominant because it occupies a useful operating point.

| Property | Numerical Diff | Symbolic Diff | Automatic Diff |
| --- | --- | --- | --- |
| Accuracy | Approximate | Exact algebraic | Exact computational |
| Scalability | Poor for large n | Poor for large programs | Good |
| Handles control flow | Yes | Difficult | Yes |
| Handles large programs | Yes | Difficult | Yes |
| Expression swell | None | Severe | None |
| Floating point aware | Indirectly | Often poorly | Naturally |
| Gradient cost | High | Variable | Efficient |

AD does not eliminate numerical problems. It does not make unstable computations stable or ill-conditioned problems well-conditioned.

What it does provide is a systematic way to compute derivatives with high accuracy and practical computational cost for large numerical programs.

That balance is the foundation of differentiable computing.