Chapter 19. Theory and Foundations

Complexity Bounds

Automatic differentiation is often described by a simple rule:

The derivative of a program can usually be computed with cost comparable to the cost of running the program itself.

This statement is useful, but incomplete. The cost depends on the shape of the function, the derivative object requested, the accumulation mode, memory behavior, and the representation of intermediate values. A scalar derivative, a gradient, a full Jacobian, and a Hessian have different lower bounds. No AD system can avoid the basic information-theoretic cost of producing a large derivative object.

This section gives the main complexity bounds that guide AD system design.

Cost Model

Let a program compute a function

f : \mathbb{R}^n \to \mathbb{R}^m

using C(f) elementary operations. We assume each elementary operation has constant cost, including addition, multiplication, division, and elementary functions such as \sin, \exp, and \log. This abstraction ignores cache effects, vectorization, memory allocation, and parallel execution, but it gives the first-order structure.

The Jacobian of f is

J_f(x) \in \mathbb{R}^{m \times n}

with entries

(J_f)_{ij} = \frac{\partial f_i}{\partial x_j}.

The full Jacobian has mn entries, so writing it down already costs \Omega(mn) memory traffic. This lower bound matters. AD can make derivative computation efficient, but it cannot make a dense m \times n matrix free.

Forward Mode Bound

Forward mode propagates tangent values alongside primal values. Given a seed vector

v \in \mathbb{R}^n,

forward mode computes the Jacobian-vector product

J_f(x)v.

The important bound is:

\operatorname{cost}(J_f v) = O(C(f)).

More precisely, a forward-mode evaluation usually costs a small constant multiple of the primal evaluation. Each primal operation is replaced by a rule that updates both the primal value and its tangent.

For example, if

z = xy,

then forward mode propagates

\dot z = \dot x y + x \dot y.

The cost per operation remains constant.
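
To make this concrete, here is a minimal sketch of a Jacobian-vector product using JAX (the library choice and the function f are assumptions of this example; the bound above is library-agnostic).

```python
import jax
import jax.numpy as jnp

def f(x):
    # Small illustrative map from R^2 to R^2; x[0] * x[1] exercises
    # the product rule shown above.
    return jnp.array([x[0] * x[1], jnp.sin(x[0]) + jnp.exp(x[1])])

x = jnp.array([2.0, 3.0])   # evaluation point
v = jnp.array([1.0, 0.0])   # seed (tangent) vector

# A single forward-mode pass returns the primal value and J_f(x) v together,
# at a small constant multiple of the cost of evaluating f alone.
primal_out, tangent_out = jax.jvp(f, (x,), (v,))
```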

To compute a full Jacobian by forward mode, seed the basis vectors e_1, \ldots, e_n. This gives

J_f e_1, \ldots, J_f e_n,

which are the columns of the Jacobian. Therefore,

\operatorname{cost}(J_f) = O(n C(f))

by naive forward accumulation.

Forward mode is therefore natural when n is small and m may be large. It is efficient for functions with few inputs and many outputs.
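
A sketch of naive forward accumulation of a full Jacobian, again assuming JAX; the loop over basis vectors mirrors the O(n C(f)) bound, and jax.jacfwd is shown only as a cross-check.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative map from R^2 to R^3.
    return jnp.array([x[0] * x[1], jnp.sin(x[0]) + jnp.exp(x[1]), x[0] ** 2])

x = jnp.array([2.0, 3.0])
n = x.shape[0]

# One forward pass per basis vector e_1, ..., e_n yields the Jacobian columns:
# total cost O(n C(f)).
columns = [jax.jvp(f, (x,), (e,))[1] for e in jnp.eye(n)]
J = jnp.stack(columns, axis=1)

# jax.jacfwd performs the same column-wise accumulation.
assert jnp.allclose(J, jax.jacfwd(f)(x))
```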

Reverse Mode Bound

Reverse mode propagates adjoints backward through the computation. Given a covector

w \in \mathbb{R}^m,

reverse mode computes the vector-Jacobian product

w^T J_f(x).

The key bound is:

\operatorname{cost}(w^T J_f) = O(C(f)).

Reverse mode first runs the primal program, stores enough intermediate values, then performs a backward sweep. Each elementary operation contributes local adjoint updates.

For a scalar-output function

f : \mathbb{R}^n \to \mathbb{R},

choosing w = 1 gives

\nabla f(x)^T = J_f(x).

Thus reverse mode computes the full gradient of a scalar function in time comparable to one primal evaluation:

\operatorname{cost}(\nabla f) = O(C(f)).

This is the central complexity result behind backpropagation.
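
As a sketch of this result (assuming JAX; the loss function is illustrative), the gradient of a scalar loss over a large parameter vector costs one primal pass plus one backward pass, independent of n.

```python
import jax
import jax.numpy as jnp

def loss(theta):
    # Scalar-valued function of many parameters: f : R^n -> R.
    return jnp.sum(jnp.tanh(theta) ** 2)

theta = jnp.linspace(0.0, 1.0, 1_000_000)

# Reverse mode: one primal pass plus one backward sweep gives all n gradient
# entries, i.e. cost O(C(f)) regardless of n.
g = jax.grad(loss)(theta)

# Equivalently, seed the single output with w = 1 and apply the VJP.
out, vjp_fn = jax.vjp(loss, theta)
(g_again,) = vjp_fn(jnp.ones_like(out))
```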

To compute a full Jacobian by reverse mode, run reverse mode once for each output basis covector. This gives the rows of the Jacobian:

e_1^T J_f, \ldots, e_m^T J_f.

So naive reverse-mode Jacobian computation costs

\operatorname{cost}(J_f) = O(m C(f)).

Reverse mode is therefore natural when m is small and n may be large. This is exactly the structure of most machine learning losses:

f : \mathbb{R}^n \to \mathbb{R}.

Many parameters, one scalar loss.
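
To make the row-by-row accumulation above concrete, here is a sketch (assuming JAX) that reuses one recorded primal evaluation and seeds each output basis covector; jax.jacrev is shown only as a cross-check.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative map from R^2 to R^3.
    return jnp.array([x[0] * x[1], jnp.sin(x[0]) + jnp.exp(x[1]), x[0] ** 2])

x = jnp.array([2.0, 3.0])
m = f(x).shape[0]

# One backward sweep per output basis covector e_1^T, ..., e_m^T yields the
# Jacobian rows: total cost O(m C(f)).
_, vjp_fn = jax.vjp(f, x)
rows = [vjp_fn(e)[0] for e in jnp.eye(m)]
J = jnp.stack(rows, axis=0)

# jax.jacrev performs the same row-wise accumulation.
assert jnp.allclose(J, jax.jacrev(f)(x))
```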

The Jacobian Cost Table

| Function shape | Desired derivative | Best basic mode | Cost |
| --- | --- | --- | --- |
| f : \mathbb{R}^n \to \mathbb{R} | Gradient \nabla f | Reverse | O(C(f)) |
| f : \mathbb{R} \to \mathbb{R}^m | Derivative vector | Forward | O(C(f)) |
| f : \mathbb{R}^n \to \mathbb{R}^m | J_f v | Forward | O(C(f)) |
| f : \mathbb{R}^n \to \mathbb{R}^m | w^T J_f | Reverse | O(C(f)) |
| f : \mathbb{R}^n \to \mathbb{R}^m | Full dense Jacobian | Forward or reverse | O(\min(n,m) C(f)), plus output cost |

The last line hides an important qualification. If the Jacobian is dense, the output itself has mn entries, so any method must pay at least

\Omega(mn).

Thus a more honest bound is

\operatorname{cost}(J_f) = O(\min(n,m) C(f) + mn).

The mn term is unavoidable.

Memory Bounds

Forward mode has low memory overhead. It propagates tangents alongside primal values and usually needs only the live variables required by the original computation. Its memory usage is typically

O(M(f)),

where M(f) is the memory required by the primal program, plus tangent storage.

Reverse mode has a different profile. It needs intermediate primal values during the backward sweep. If the program has L elementary operations, a simple tape records enough data for each operation:

O(L)

additional memory.

Since L is proportional to C(f) in the simple operation-count model, naive reverse mode has memory

O(M(f) + C(f)).

This bound explains why reverse mode can become memory-bound in deep learning and scientific simulation. The derivative is cheap in arithmetic, but expensive in storage.

Checkpointing trades recomputation for memory. Instead of storing every intermediate value, the system stores selected checkpoints and recomputes missing values during the backward pass. This can reduce memory at the cost of extra primal evaluations.

The tradeoff is not cosmetic. For long computations, such as time-stepping simulations or deep sequence models, checkpointing is often the difference between feasible and infeasible reverse-mode differentiation.
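
A minimal sketch of this trade, assuming JAX, where jax.checkpoint (also exposed as jax.remat) marks a step whose intermediates are recomputed rather than stored; the step function, weights, and number of steps are illustrative.

```python
import jax
import jax.numpy as jnp

W = jax.random.normal(jax.random.PRNGKey(0), (64, 64)) * 0.1
b = jnp.zeros(64)

def step(x):
    # One primal time step; naive reverse mode would tape its intermediates.
    return jnp.tanh(x @ W + b)

# jax.checkpoint discards the intermediates of `step` after the forward pass
# and recomputes them during the backward sweep: extra primal arithmetic in
# exchange for storing only the step inputs.
checkpointed_step = jax.checkpoint(step)

def rollout(x0, num_steps=200):
    x = x0
    for _ in range(num_steps):
        x = checkpointed_step(x)
    return jnp.sum(x)

g = jax.grad(rollout)(jnp.ones(64))
```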

Hessian Bounds

For a scalar function

f : \mathbb{R}^n \to \mathbb{R},

the Hessian is

H_f(x) \in \mathbb{R}^{n \times n}.

A dense Hessian has n^2 entries, so materializing it costs at least

\Omega(n^2).

However, many algorithms do not need the full Hessian. They need Hessian-vector products:

H_f(x)v.

A Hessian-vector product can be computed efficiently by combining forward and reverse mode. One common strategy is forward-over-reverse:

  1. Use reverse mode to define the gradient computation.
  2. Apply forward mode to that gradient computation in direction v.

This gives

H_f(x)v

with cost comparable to a small constant multiple of the gradient cost:

\operatorname{cost}(H_f v) = O(C(f)).

The constant is larger than for a single gradient, but the asymptotic cost remains linear in the primal computation.
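
A minimal forward-over-reverse sketch, assuming JAX; the function f is illustrative. Differentiating jax.grad(f) with jax.jvp in direction v yields H_f(x) v without forming the Hessian.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative scalar-valued function f : R^n -> R.
    return jnp.sum(jnp.sin(x) * x ** 2)

def hvp(x, v):
    # Forward mode applied to the reverse-mode gradient: cost is a small
    # constant multiple of one gradient evaluation, i.e. O(C(f)).
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.linspace(0.0, 1.0, 1_000)
v = jnp.ones_like(x)
Hv = hvp(x, v)
```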

A full dense Hessian can then be computed by applying Hessian-vector products to basis vectors:

H_f e_1, \ldots, H_f e_n.

This gives

\operatorname{cost}(H_f) = O(n C(f) + n^2).

Again, the n^2 output term cannot be avoided for dense Hessians.
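
Continuing the sketch above (still assuming JAX), the dense Hessian can be assembled from n Hessian-vector products over the basis vectors; jax.hessian is shown only as a cross-check, and the n^2 output is paid either way.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * x ** 2)

def hvp(x, v):
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.linspace(0.0, 1.0, 5)
n = x.shape[0]

# One Hessian-vector product per basis vector: O(n C(f)) arithmetic plus the
# unavoidable n^2 output entries. The Hessian is symmetric, so stacking the
# products as rows or as columns gives the same matrix.
H = jax.vmap(lambda e: hvp(x, e))(jnp.eye(n))

assert jnp.allclose(H, jax.hessian(f)(x), atol=1e-5)
```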

Sparse and Structured Bounds

Many real Jacobians and Hessians are sparse. If the derivative has only s nonzero entries, the ideal output cost is

\Omega(s)

rather than \Omega(mn).

AD can exploit sparsity in two ways.

First, it can propagate sparse tangent or adjoint objects. This is useful when local derivative structure is sparse and remains sparse through the computation.

Second, it can use graph coloring. If several Jacobian columns have disjoint sparsity patterns, they can be recovered from one seeded forward pass. Instead of requiring n passes, the system may require only k passes, where k is the number of colors.

Then the cost becomes

O(k C(f) + s),

where k \le n, often much smaller than n.
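
A sketch of the one-color extreme, assuming JAX: for an elementwise function the Jacobian is diagonal, every column has a disjoint sparsity pattern, and a single seeded forward pass recovers all nonzero entries.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Elementwise function: the Jacobian is diagonal, so all n columns have
    # disjoint sparsity patterns and share one color (k = 1).
    return jnp.sin(x) * x

x = jnp.linspace(0.0, 1.0, 1_000)

# One forward pass seeded with the sum of all basis vectors (a vector of
# ones) recovers every nonzero Jacobian entry at once.
_, compressed = jax.jvp(f, (x,), (jnp.ones_like(x),))

# Cross-check against the dense diagonal, which would cost O(n C(f)) to build.
assert jnp.allclose(compressed, jnp.diag(jax.jacfwd(f)(x)), atol=1e-5)
```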

For sparse Hessians, analogous coloring techniques apply. The practical performance depends on discovering or supplying the sparsity pattern. Without sparsity information, an AD system may fall back to dense behavior even when the mathematical derivative is sparse.

Lower Bounds

AD does not beat fundamental lower bounds. It organizes derivative computation so that the chain rule is applied with optimal sharing, but several costs remain unavoidable.

For a dense Jacobian,

\Omega(mn)

entries must be produced.

For a dense Hessian,

\Omega(n^2)

entries must be produced.

For a scalar-output gradient,

\Omega(n)

entries must be produced.

Reverse mode is remarkable because it computes n gradient entries with only O(C(f)) arithmetic overhead, not because it avoids writing n numbers.

A useful way to state the bound is:

AD can reduce derivative arithmetic, but not derivative size.

The best mode depends on the derivative query:

| Query | Object size | Efficient AD primitive |
| --- | --- | --- |
| J_f v | m | Forward mode |
| w^T J_f | n | Reverse mode |
| \nabla f, scalar f | n | Reverse mode |
| Full J_f | mn | Repeated forward or reverse |
| H_f v | n | Mixed mode |
| Full H_f | n^2 | Repeated mixed mode |

Complexity of AD Transformations

So far we have counted runtime cost. An AD compiler also has transformation cost.

Given an input program of size S, a source-transformation AD system produces a derivative program. For first-order forward mode, the transformed program usually has size

O(S).

For first-order reverse mode, the transformed program also has size approximately

O(S),

but it introduces a tape, adjoint variables, and reverse control flow.

Higher-order AD can cause code growth. Naively applying AD repeatedly may duplicate derivative logic and produce large intermediate programs. If the transformed code is aggressively inlined and specialized, the compiler may face expression swell similar to symbolic differentiation.

A well-designed AD compiler avoids this with staging, sharing, common subexpression elimination, and intermediate representations that preserve graph structure.

The theoretical distinction is important:

Symbolic differentiation may expand expressions algebraically.

Automatic differentiation differentiates the computation itself, preserving the sharing of intermediate values.

That difference is why AD usually gives linear-size derivative programs for first-order derivatives.
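
One way to observe this is to count operations in the derivative program. The sketch below (assuming JAX and its jaxpr intermediate representation) compares the size of a deep primal chain with the size of its reverse-mode gradient program; the exact ratio depends on the operations involved.

```python
import jax
import jax.numpy as jnp

def f(x):
    # A deep chain of elementary operations.
    for _ in range(50):
        x = jnp.sin(x) * x
    return jnp.sum(x)

x = jnp.ones(10)

# The differentiated program stays roughly proportional to the primal one:
# AD rewrites each elementary operation locally instead of expanding a
# closed-form symbolic expression.
primal_ops = len(jax.make_jaxpr(f)(x).jaxpr.eqns)
grad_ops = len(jax.make_jaxpr(jax.grad(f))(x).jaxpr.eqns)
print(primal_ops, grad_ops)
```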

Practical Complexity

The asymptotic bounds are only the first layer. Real systems also pay for:

| Factor | Effect |
| --- | --- |
| Tape allocation | Can dominate small operations |
| Cache locality | Reverse mode may access memory in poor order |
| Kernel launch overhead | Important on GPUs |
| Mutation | Requires versioning or alias analysis |
| Control flow | May produce dynamic tapes |
| Checkpointing | Trades memory for recomputation |
| Vectorization | Changes constant factors |
| Sparsity | Can reduce full derivative cost |
| Custom gradients | Can replace expensive differentiated programs |

A reverse-mode gradient may be O(C(f)), yet still be slow if the tape is large, memory traffic is high, or the backward pass uses unfused kernels. The complexity theorem explains why the method scales. Systems engineering determines whether it is fast.

The Core Result

The central complexity bounds are:

\operatorname{cost}(J_f v) = O(C(f))

\operatorname{cost}(w^T J_f) = O(C(f))

\operatorname{cost}(\nabla f) = O(C(f)) \quad \text{for } f : \mathbb{R}^n \to \mathbb{R}

\operatorname{cost}(J_f) = O(\min(n,m) C(f) + mn)

\operatorname{cost}(H_f v) = O(C(f))

\operatorname{cost}(H_f) = O(n C(f) + n^2)

These bounds explain the basic architecture of automatic differentiation systems. Forward mode gives cheap directional derivatives. Reverse mode gives cheap gradients for scalar outputs. Mixed mode gives efficient second-order products. Full dense derivative matrices remain expensive because their size is large.

The practical rule is simple:

Use AD to compute the derivative object you need, not the largest derivative object the function admits.