Chapter 19. Theory and Foundations

Complexity Bounds

Automatic differentiation is often described by a simple rule:

The derivative of a program can usually be computed with cost comparable to the cost of running the program itself.

This statement is useful, but incomplete. The cost depends on the shape of the function, the derivative object requested, the accumulation mode, memory behavior, and the representation of intermediate values. A scalar derivative, a gradient, a full Jacobian, and a Hessian have different lower bounds. No AD system can avoid the basic information-theoretic cost of producing a large derivative object.

This section gives the main complexity bounds that guide AD system design.

Cost Model

Let a program compute a function

f : \mathbb{R}^n \to \mathbb{R}^m

using C(f) elementary operations. We assume each elementary operation has constant cost, including addition, multiplication, division, and elementary functions such as \sin, \exp, and \log. This abstraction ignores cache effects, vectorization, memory allocation, and parallel execution, but it gives the first-order structure.

The Jacobian of f is

J_f(x) \in \mathbb{R}^{m \times n}

with entries

(J_f)_{ij} = \frac{\partial f_i}{\partial x_j}.

The full Jacobian has mn entries, so writing it down already costs \Omega(mn) memory traffic. This lower bound matters. AD can make derivative computation efficient, but it cannot make a dense m \times n matrix free.

Forward Mode Bound

Forward mode propagates tangent values alongside primal values. Given a seed vector

v \in \mathbb{R}^n,

forward mode computes the Jacobian-vector product

J_f(x)v.

The important bound is:

\operatorname{cost}(J_f v) = O(C(f)).

More precisely, a forward-mode evaluation usually costs a small constant multiple of the primal evaluation. Each primal operation is replaced by a rule that updates both the primal value and its tangent.

For example, if

z = xy,

then forward mode propagates

\dot z = \dot x y + x \dot y.

The cost per operation remains constant.
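
To make this concrete, here is a minimal sketch of a Jacobian-vector product using JAX (the library choice and the function f are assumptions of this example; the bound above is library-agnostic).

```python
import jax
import jax.numpy as jnp

def f(x):
    # Small illustrative map from R^2 to R^2; x[0] * x[1] exercises
    # the product rule shown above.
    return jnp.array([x[0] * x[1], jnp.sin(x[0]) + jnp.exp(x[1])])

x = jnp.array([2.0, 3.0])   # evaluation point
v = jnp.array([1.0, 0.0])   # seed (tangent) vector

# A single forward-mode pass returns the primal value and J_f(x) v together,
# at a small constant multiple of the cost of evaluating f alone.
primal_out, tangent_out = jax.jvp(f, (x,), (v,))
```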

To compute a full Jacobian by forward mode, seed the basis vectors e_1, \ldots, e_n. This gives

J_f e_1, \ldots, J_f e_n,

which are the columns of the Jacobian. Therefore,

\operatorname{cost}(J_f) = O(n C(f))

by naive forward accumulation.

Forward mode is therefore natural when n is small and m may be large. It is efficient for functions with few inputs and many outputs.
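
A sketch of naive forward accumulation of a full Jacobian, again assuming JAX; the loop over basis vectors mirrors the O(n C(f)) bound, and jax.jacfwd is shown only as a cross-check.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative map from R^2 to R^3.
    return jnp.array([x[0] * x[1], jnp.sin(x[0]) + jnp.exp(x[1]), x[0] ** 2])

x = jnp.array([2.0, 3.0])
n = x.shape[0]

# One forward pass per basis vector e_1, ..., e_n yields the Jacobian columns:
# total cost O(n C(f)).
columns = [jax.jvp(f, (x,), (e,))[1] for e in jnp.eye(n)]
J = jnp.stack(columns, axis=1)

# jax.jacfwd performs the same column-wise accumulation.
assert jnp.allclose(J, jax.jacfwd(f)(x))
```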

Reverse Mode Bound

Reverse mode propagates adjoints backward through the computation. Given a covector

w \in \mathbb{R}^m,

reverse mode computes the vector-Jacobian product

w^T J_f(x).

The key bound is:

\operatorname{cost}(w^T J_f) = O(C(f)).

Reverse mode first runs the primal program, stores enough intermediate values, then performs a backward sweep. Each elementary operation contributes local adjoint updates.

For a scalar-output function

f : \mathbb{R}^n \to \mathbb{R},

choosing w = 1 gives

\nabla f(x)^T = J_f(x).

Thus reverse mode computes the full gradient of a scalar function in time comparable to one primal evaluation:

\operatorname{cost}(\nabla f) = O(C(f)).

This is the central complexity result behind backpropagation.
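
As a sketch of this result (assuming JAX; the loss function is illustrative), the gradient of a scalar loss over a large parameter vector costs one primal pass plus one backward pass, independent of n.

```python
import jax
import jax.numpy as jnp

def loss(theta):
    # Scalar-valued function of many parameters: f : R^n -> R.
    return jnp.sum(jnp.tanh(theta) ** 2)

theta = jnp.linspace(0.0, 1.0, 1_000_000)

# Reverse mode: one primal pass plus one backward sweep gives all n gradient
# entries, i.e. cost O(C(f)) regardless of n.
g = jax.grad(loss)(theta)

# Equivalently, seed the single output with w = 1 and apply the VJP.
out, vjp_fn = jax.vjp(loss, theta)
(g_again,) = vjp_fn(jnp.ones_like(out))
```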

To compute a full Jacobian by reverse mode, run reverse mode once for each output basis covector. This gives the rows of the Jacobian:

e_1^T J_f, \ldots, e_m^T J_f.

So naive reverse-mode Jacobian computation costs

\operatorname{cost}(J_f) = O(m C(f)).

Reverse mode is therefore natural when m is small and n may be large. This is exactly the structure of most machine learning losses:

f : \mathbb{R}^n \to \mathbb{R}.

Many parameters, one scalar loss.
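
To make the row-by-row accumulation above concrete, here is a sketch (assuming JAX) that reuses one recorded primal evaluation and seeds each output basis covector; jax.jacrev is shown only as a cross-check.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative map from R^2 to R^3.
    return jnp.array([x[0] * x[1], jnp.sin(x[0]) + jnp.exp(x[1]), x[0] ** 2])

x = jnp.array([2.0, 3.0])
m = f(x).shape[0]

# One backward sweep per output basis covector e_1^T, ..., e_m^T yields the
# Jacobian rows: total cost O(m C(f)).
_, vjp_fn = jax.vjp(f, x)
rows = [vjp_fn(e)[0] for e in jnp.eye(m)]
J = jnp.stack(rows, axis=0)

# jax.jacrev performs the same row-wise accumulation.
assert jnp.allclose(J, jax.jacrev(f)(x))
```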

The Jacobian Cost Table

| Function shape | Desired derivative | Best basic mode | Cost |
| --- | --- | --- | --- |
| f : \mathbb{R}^n \to \mathbb{R} | Gradient \nabla f | Reverse | O(C(f)) |
| f : \mathbb{R} \to \mathbb{R}^m | Derivative vector | Forward | O(C(f)) |
| f : \mathbb{R}^n \to \mathbb{R}^m | J_f v | Forward | O(C(f)) |
| f : \mathbb{R}^n \to \mathbb{R}^m | w^T J_f | Reverse | O(C(f)) |
| f : \mathbb{R}^n \to \mathbb{R}^m | Full dense Jacobian | Forward or reverse | O(\min(n,m) C(f)), plus output cost |

The last line hides an important qualification. If the Jacobian is dense, the output itself has mn entries, so any method must pay at least

\Omega(mn).

Thus a more honest bound is

\operatorname{cost}(J_f) = O(\min(n,m) C(f) + mn).

The mn term is unavoidable.

Memory Bounds

Forward mode has low memory overhead. It propagates tangents alongside primal values and usually needs only the live variables required by the original computation. Its memory usage is typically

O(M(f)),

where M(f) is the memory required by the primal program, plus tangent storage.

Reverse mode has a different profile. It needs intermediate primal values during the backward sweep. If the program has L elementary operations, a simple tape records enough data for each operation:

O(L)

additional memory.

Since L is proportional to C(f) in the simple operation-count model, naive reverse mode has memory

O(M(f) + C(f)).

This bound explains why reverse mode can become memory-bound in deep learning and scientific simulation. The derivative is cheap in arithmetic, but expensive in storage.

Checkpointing trades recomputation for memory. Instead of storing every intermediate value, the system stores selected checkpoints and recomputes missing values during the backward pass. This can reduce memory at the cost of extra primal evaluations.

The tradeoff is not cosmetic. For long computations, such as time-stepping simulations or deep sequence models, checkpointing is often the difference between feasible and infeasible reverse-mode differentiation.
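
A minimal sketch of this trade, assuming JAX, where jax.checkpoint (also exposed as jax.remat) marks a step whose intermediates are recomputed rather than stored; the step function, weights, and number of steps are illustrative.

```python
import jax
import jax.numpy as jnp

W = jax.random.normal(jax.random.PRNGKey(0), (64, 64)) * 0.1
b = jnp.zeros(64)

def step(x):
    # One primal time step; naive reverse mode would tape its intermediates.
    return jnp.tanh(x @ W + b)

# jax.checkpoint discards the intermediates of `step` after the forward pass
# and recomputes them during the backward sweep: extra primal arithmetic in
# exchange for storing only the step inputs.
checkpointed_step = jax.checkpoint(step)

def rollout(x0, num_steps=200):
    x = x0
    for _ in range(num_steps):
        x = checkpointed_step(x)
    return jnp.sum(x)

g = jax.grad(rollout)(jnp.ones(64))
```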

Hessian Bounds

For a scalar function

f : \mathbb{R}^n \to \mathbb{R},

the Hessian is

H_f(x) \in \mathbb{R}^{n \times n}.

A dense Hessian has n^2 entries, so materializing it costs at least

\Omega(n^2).

However, many algorithms do not need the full Hessian. They need Hessian-vector products:

H_f(x)v.

A Hessian-vector product can be computed efficiently by combining forward and reverse mode. One common strategy is forward-over-reverse:

  1. Use reverse mode to define the gradient computation.
  2. Apply forward mode to that gradient computation in direction v.

This gives

H_f(x)v

with cost comparable to a small constant multiple of the gradient cost:

\operatorname{cost}(H_f v) = O(C(f)).

The constant is larger than for a single gradient, but the asymptotic cost remains linear in the primal computation.
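
A minimal forward-over-reverse sketch, assuming JAX; the function f is illustrative. Differentiating jax.grad(f) with jax.jvp in direction v yields H_f(x) v without forming the Hessian.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative scalar-valued function f : R^n -> R.
    return jnp.sum(jnp.sin(x) * x ** 2)

def hvp(x, v):
    # Forward mode applied to the reverse-mode gradient: cost is a small
    # constant multiple of one gradient evaluation, i.e. O(C(f)).
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.linspace(0.0, 1.0, 1_000)
v = jnp.ones_like(x)
Hv = hvp(x, v)
```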

A full dense Hessian can then be computed by applying Hessian-vector products to basis vectors:

H_f e_1, \ldots, H_f e_n.

This gives

\operatorname{cost}(H_f) = O(n C(f) + n^2).

Again, the n^2 output term cannot be avoided for dense Hessians.
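
Continuing the sketch above (still assuming JAX), the dense Hessian can be assembled from n Hessian-vector products over the basis vectors; jax.hessian is shown only as a cross-check, and the n^2 output is paid either way.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * x ** 2)

def hvp(x, v):
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.linspace(0.0, 1.0, 5)
n = x.shape[0]

# One Hessian-vector product per basis vector: O(n C(f)) arithmetic plus the
# unavoidable n^2 output entries. The Hessian is symmetric, so stacking the
# products as rows or as columns gives the same matrix.
H = jax.vmap(lambda e: hvp(x, e))(jnp.eye(n))

assert jnp.allclose(H, jax.hessian(f)(x), atol=1e-5)
```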

Sparse and Structured Bounds

Many real Jacobians and Hessians are sparse. If the derivative has only s nonzero entries, the ideal output cost is

\Omega(s)

rather than \Omega(mn).

AD can exploit sparsity in two ways.

First, it can propagate sparse tangent or adjoint objects. This is useful when local derivative structure is sparse and remains sparse through the computation.

Second, it can use graph coloring. If several Jacobian columns have disjoint sparsity patterns, they can be recovered from one seeded forward pass. Instead of requiring n passes, the system may require only k passes, where k is the number of colors.

Then the cost becomes

O(k C(f) + s),

where k \le n, often much smaller than n.
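
A sketch of the one-color extreme, assuming JAX: for an elementwise function the Jacobian is diagonal, every column has a disjoint sparsity pattern, and a single seeded forward pass recovers all nonzero entries.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Elementwise function: the Jacobian is diagonal, so all n columns have
    # disjoint sparsity patterns and share one color (k = 1).
    return jnp.sin(x) * x

x = jnp.linspace(0.0, 1.0, 1_000)

# One forward pass seeded with the sum of all basis vectors (a vector of
# ones) recovers every nonzero Jacobian entry at once.
_, compressed = jax.jvp(f, (x,), (jnp.ones_like(x),))

# Cross-check against the dense diagonal, which would cost O(n C(f)) to build.
assert jnp.allclose(compressed, jnp.diag(jax.jacfwd(f)(x)), atol=1e-5)
```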

For sparse Hessians, analogous coloring techniques apply. The practical performance depends on discovering or supplying the sparsity pattern. Without sparsity information, an AD system may fall back to dense behavior even when the mathematical derivative is sparse.

Lower Bounds

AD does not beat fundamental lower bounds. It organizes derivative computation so that the chain rule is applied with optimal sharing, but several costs remain unavoidable.

For a dense Jacobian,

\Omega(mn)

entries must be produced.

For a dense Hessian,

\Omega(n^2)

entries must be produced.

For a scalar-output gradient,

\Omega(n)

entries must be produced.

Reverse mode is remarkable because it computes n gradient entries with only O(C(f)) arithmetic overhead, not because it avoids writing n numbers.

A useful way to state the bound is:

AD can reduce derivative arithmetic, but not derivative size.

The best mode depends on the derivative query:

| Query | Object size | Efficient AD primitive |
| --- | --- | --- |
| J_f v | m | Forward mode |
| w^T J_f | n | Reverse mode |
| \nabla f, scalar f | n | Reverse mode |
| Full J_f | mn | Repeated forward or reverse |
| H_f v | n | Mixed mode |
| Full H_f | n^2 | Repeated mixed mode |

Complexity of AD Transformations

So far we have counted runtime cost. An AD compiler also has transformation cost.

Given an input program of size S, a source-transformation AD system produces a derivative program. For first-order forward mode, the transformed program usually has size

O(S).

For first-order reverse mode, the transformed program also has size approximately

O(S),

but it introduces a tape, adjoint variables, and reverse control flow.

Higher-order AD can cause code growth. Naively applying AD repeatedly may duplicate derivative logic and produce large intermediate programs. If the transformed code is aggressively inlined and specialized, the compiler may face expression swell similar to symbolic differentiation.

A well-designed AD compiler avoids this with staging, sharing, common subexpression elimination, and intermediate representations that preserve graph structure.

The theoretical distinction is important:

Symbolic differentiation may expand expressions algebraically.

Automatic differentiation differentiates the computation itself, preserving the sharing of intermediate values.

That difference is why AD usually gives linear-size derivative programs for first-order derivatives.
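
One way to observe this is to count operations in the derivative program. The sketch below (assuming JAX and its jaxpr intermediate representation) compares the size of a deep primal chain with the size of its reverse-mode gradient program; the exact ratio depends on the operations involved.

```python
import jax
import jax.numpy as jnp

def f(x):
    # A deep chain of elementary operations.
    for _ in range(50):
        x = jnp.sin(x) * x
    return jnp.sum(x)

x = jnp.ones(10)

# The differentiated program stays roughly proportional to the primal one:
# AD rewrites each elementary operation locally instead of expanding a
# closed-form symbolic expression.
primal_ops = len(jax.make_jaxpr(f)(x).jaxpr.eqns)
grad_ops = len(jax.make_jaxpr(jax.grad(f))(x).jaxpr.eqns)
print(primal_ops, grad_ops)
```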

Practical Complexity

The asymptotic bounds are only the first layer. Real systems also pay for:

| Factor | Effect |
| --- | --- |
| Tape allocation | Can dominate small operations |
| Cache locality | Reverse mode may access memory in poor order |
| Kernel launch overhead | Important on GPUs |
| Mutation | Requires versioning or alias analysis |
| Control flow | May produce dynamic tapes |
| Checkpointing | Trades memory for recomputation |
| Vectorization | Changes constant factors |
| Sparsity | Can reduce full derivative cost |
| Custom gradients | Can replace expensive differentiated programs |

A reverse-mode gradient may be O(C(f)), yet still be slow if the tape is large, memory traffic is high, or the backward pass uses unfused kernels. The complexity theorem explains why the method scales. Systems engineering determines whether it is fast.

The Core Result

The central complexity bounds are:

\operatorname{cost}(J_f v) = O(C(f))

\operatorname{cost}(w^T J_f) = O(C(f))

\operatorname{cost}(\nabla f) = O(C(f)) \quad \text{for } f : \mathbb{R}^n \to \mathbb{R}

\operatorname{cost}(J_f) = O(\min(n,m) C(f) + mn)

\operatorname{cost}(H_f v) = O(C(f))

\operatorname{cost}(H_f) = O(n C(f) + n^2)

These bounds explain the basic architecture of automatic differentiation systems. Forward mode gives cheap directional derivatives. Reverse mode gives cheap gradients for scalar outputs. Mixed mode gives efficient second-order products. Full dense derivative matrices remain expensive because their size is large.

The practical rule is simple:

Use AD to compute the derivative object you need, not the largest derivative object the function admits.