Complexity Bounds
Automatic differentiation is often described by a simple rule:
The derivative of a program can usually be computed with cost comparable to the cost of running the program itself.
This statement is useful, but incomplete. The cost depends on the shape of the function, the derivative object requested, the accumulation mode, memory behavior, and the representation of intermediate values. A scalar derivative, a gradient, a full Jacobian, and a Hessian have different lower bounds. No AD system can avoid the basic information-theoretic cost of producing a large derivative object.
This section gives the main complexity bounds that guide AD system design.
Cost Model
Let a program compute a function
$$f : \mathbb{R}^n \to \mathbb{R}^m$$
using elementary operations. We assume each elementary operation has constant cost, including addition, multiplication, division, and elementary functions such as $\exp$, $\sin$, and $\log$, and we write $\operatorname{cost}(f)$ for the cost of one primal evaluation. This abstraction ignores cache effects, vectorization, memory allocation, and parallel execution, but it gives the first-order structure.
The Jacobian of $f$ at $x$ is
$$J(x) \in \mathbb{R}^{m \times n}$$
with entries
$$J_{ij}(x) = \frac{\partial f_i}{\partial x_j}(x).$$
The full Jacobian has $mn$ entries, so writing it down already costs $\Omega(mn)$ memory traffic. This lower bound matters. AD can make derivative computation efficient, but it cannot make a dense matrix free.
Forward Mode Bound
Forward mode propagates tangent values alongside primal values. Given a seed vector
$$v \in \mathbb{R}^n,$$
forward mode computes the Jacobian-vector product
$$J(x)\,v.$$
The important bound is:
$$\operatorname{cost}(J(x)\,v) = O(\operatorname{cost}(f)).$$
More precisely, a forward-mode evaluation usually costs a small constant multiple of the primal evaluation. Each primal operation is replaced by a rule that updates both the primal value and its tangent.
For example, if
$$w = u \cdot v,$$
then forward mode propagates
$$\dot{w} = \dot{u}\,v + u\,\dot{v}.$$
The cost per operation remains constant.
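A minimal sketch of this propagation rule using JAX's `jax.jvp`; the function and seed values here are illustrative, not part of the cost model above.

```python
# Forward-mode tangent propagation for w = u * v via jax.jvp.
import jax

def f(u, v):
    return u * v  # primal operation

u, v = 2.0, 3.0
u_dot, v_dot = 1.0, 0.0  # seed tangents

# jvp returns the primal output and the tangent w_dot = u_dot * v + u * v_dot.
w, w_dot = jax.jvp(f, (u, v), (u_dot, v_dot))
print(w, w_dot)  # 6.0 3.0
```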
To compute a full Jacobian by forward mode, seed the basis vectors $e_1, \dots, e_n$. This gives
$$J(x)\,e_1, \dots, J(x)\,e_n,$$
which are the columns of the Jacobian. Therefore,
$$\operatorname{cost}(J) = O(n \cdot \operatorname{cost}(f))$$
by naive forward accumulation.
Forward mode is therefore natural when $n$ is small and $m$ may be large. It is efficient for functions with few inputs and many outputs.
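As a sketch, assuming an illustrative three-input function, the column-by-column accumulation looks like this in JAX; `jax.jacfwd` performs the same basis seeding internally.

```python
# Full Jacobian from n forward passes, one per basis vector e_i.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2]), x[0] + x[2]])

x = jnp.array([1.0, 2.0, 3.0])
n = x.shape[0]

columns = [jax.jvp(f, (x,), (jnp.eye(n)[i],))[1] for i in range(n)]  # J e_i
J = jnp.stack(columns, axis=1)

assert jnp.allclose(J, jax.jacfwd(f)(x))
```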
Reverse Mode Bound
Reverse mode propagates adjoints backward through the computation. Given a covector
$$w \in \mathbb{R}^m,$$
reverse mode computes the vector-Jacobian product
$$w^{\top} J(x).$$
The key bound is:
$$\operatorname{cost}(w^{\top} J(x)) = O(\operatorname{cost}(f)).$$
Reverse mode first runs the primal program, stores enough intermediate values, then performs a backward sweep. Each elementary operation contributes local adjoint updates.
For a scalar-output function
$$f : \mathbb{R}^n \to \mathbb{R},$$
choosing $w = 1$ gives
$$w^{\top} J(x) = \nabla f(x)^{\top}.$$
Thus reverse mode computes the full gradient of a scalar function in time comparable to one primal evaluation:
$$\operatorname{cost}(\nabla f) = O(\operatorname{cost}(f)).$$
This is the central complexity result behind backpropagation.
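A minimal sketch of this result with JAX's `jax.grad`; the loss and its input size are illustrative.

```python
# One reverse sweep produces every entry of the gradient of a scalar loss.
import jax
import jax.numpy as jnp

def loss(theta):
    return jnp.sum(jnp.tanh(theta) ** 2)  # scalar output, many inputs

theta = jnp.linspace(-1.0, 1.0, 1_000_000)

grad = jax.grad(loss)(theta)  # cost comparable to a small multiple of loss(theta)
print(grad.shape)             # (1000000,)
```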
To compute a full Jacobian by reverse mode, run reverse mode once for each output basis covector $e_1^{\top}, \dots, e_m^{\top}$. This gives the rows of the Jacobian:
$$e_1^{\top} J(x), \dots, e_m^{\top} J(x).$$
So naive reverse-mode Jacobian computation costs
$$O(m \cdot \operatorname{cost}(f)).$$
Reverse mode is therefore natural when $m$ is small and $n$ may be large. This is exactly the structure of most machine learning losses:
Many parameters, one scalar loss.
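A sketch of the row-by-row reverse accumulation described above, for an illustrative two-output function; `jax.jacrev` performs the same covector seeding internally.

```python
# Jacobian rows from m reverse passes, one per output basis covector.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], x[1] + x[2]])  # f : R^3 -> R^2

x = jnp.array([1.0, 2.0, 3.0])
y, pullback = jax.vjp(f, x)
m = y.shape[0]

rows = [pullback(jnp.eye(m)[i])[0] for i in range(m)]  # e_i^T J
J = jnp.stack(rows, axis=0)

assert jnp.allclose(J, jax.jacrev(f)(x))
```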
The Jacobian Cost Table
| Function shape | Desired derivative | Best basic mode | Cost |
|---|---|---|---|
| $f : \mathbb{R}^n \to \mathbb{R}$ | Gradient | Reverse | $O(\operatorname{cost}(f))$ |
| $f : \mathbb{R} \to \mathbb{R}^m$ | Derivative vector | Forward | $O(\operatorname{cost}(f))$ |
| $f : \mathbb{R}^n \to \mathbb{R}^m$, $n \ll m$ | Full Jacobian | Forward | $O(n \cdot \operatorname{cost}(f))$ |
| $f : \mathbb{R}^n \to \mathbb{R}^m$, $m \ll n$ | Full Jacobian | Reverse | $O(m \cdot \operatorname{cost}(f))$ |
| $f : \mathbb{R}^n \to \mathbb{R}^m$ | Full dense Jacobian | Forward or reverse | $O(\min(n, m) \cdot \operatorname{cost}(f))$, plus output cost |
The last line hides an important qualification. If the Jacobian is dense, the output itself has $mn$ entries, so any method must pay at least
$$\Omega(mn).$$
Thus a more honest bound is
$$O(\min(n, m) \cdot \operatorname{cost}(f) + mn).$$
The $mn$ term is unavoidable.
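In JAX terms, the table above corresponds roughly to choosing between `jax.jacrev` and `jax.jacfwd`; the two functions below are illustrative stand-ins for the wide and tall cases.

```python
# Pick the accumulation mode from the shape of the function.
import jax
import jax.numpy as jnp

def wide(x):  # f : R^1000 -> R^3, n >> m: reverse mode needs 3 passes
    return jnp.array([x.sum(), (x ** 2).sum(), x.prod()])

def tall(t):  # f : R^2 -> R^1000, m >> n: forward mode needs 2 passes
    return jnp.sin(jnp.linspace(0.0, 1.0, 1000) * t[0] + t[1])

J_wide = jax.jacrev(wide)(jnp.ones(1000))
J_tall = jax.jacfwd(tall)(jnp.ones(2))
print(J_wide.shape, J_tall.shape)  # (3, 1000) (1000, 2)
```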
Memory Bounds
Forward mode has low memory overhead. It propagates tangents alongside primal values and usually needs only the live variables required by the original computation. Its memory usage is typically
$$O(M_{\text{primal}}),$$
where $M_{\text{primal}}$ is the memory required by the primal program, plus tangent storage.
Reverse mode has a different profile. It needs intermediate primal values during the backward sweep. If the program has $T$ elementary operations, a simple tape records enough data for each operation:
$$O(T)$$
additional memory.
Since $T$ is proportional to $\operatorname{cost}(f)$ in the simple operation-count model, naive reverse mode has memory
$$O(\operatorname{cost}(f)).$$
This bound explains why reverse mode can become memory-bound in deep learning and scientific simulation. The derivative is cheap in arithmetic, but expensive in storage.
Checkpointing trades recomputation for memory. Instead of storing every intermediate value, the system stores selected checkpoints and recomputes missing values during the backward pass. This can reduce memory at the cost of extra primal evaluations.
The tradeoff is not cosmetic. For long computations, such as time-stepping simulations or deep sequence models, checkpointing is often the difference between feasible and infeasible reverse-mode differentiation.
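A minimal sketch of the idea using JAX's `jax.checkpoint` (also exposed as `jax.remat`); the segment structure and sizes are illustrative.

```python
# Checkpointing: intermediates inside each step are recomputed during the
# backward pass instead of being stored on the tape.
import jax
import jax.numpy as jnp

@jax.checkpoint
def step(x):
    for _ in range(10):      # a chain of operations whose intermediates we drop
        x = jnp.tanh(x) * 1.01
    return x

def run(x):
    for _ in range(100):     # 100 checkpointed segments
        x = step(x)
    return jnp.sum(x)

grad = jax.grad(run)(jnp.ones(4096))  # memory scales with segments, not operations
```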
Hessian Bounds
For a scalar function
$$f : \mathbb{R}^n \to \mathbb{R},$$
the Hessian is
$$H(x) = \nabla^2 f(x) \in \mathbb{R}^{n \times n}, \qquad H_{ij}(x) = \frac{\partial^2 f}{\partial x_i \,\partial x_j}(x).$$
A dense Hessian has $n^2$ entries, so materializing it costs at least
$$\Omega(n^2).$$
However, many algorithms do not need the full Hessian. They need Hessian-vector products:
$$H(x)\,v.$$
A Hessian-vector product can be computed efficiently by combining forward and reverse mode. One common strategy is forward-over-reverse:
- Use reverse mode to define the gradient computation.
- Apply forward mode to that gradient computation in direction $v$.

This gives
$$H(x)\,v = \left.\tfrac{d}{d\varepsilon}\,\nabla f(x + \varepsilon v)\right|_{\varepsilon = 0},$$
with cost comparable to a small constant multiple of the gradient cost:
$$\operatorname{cost}(H(x)\,v) = O(\operatorname{cost}(f)).$$
The constant is larger than for a single gradient, but the asymptotic cost remains linear in the primal computation.
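A minimal forward-over-reverse sketch in JAX; the scalar function is illustrative.

```python
# Hessian-vector product: forward mode applied to the reverse-mode gradient.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * x)

def hvp(x, v):
    # One reverse sweep nested inside one forward sweep.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.arange(5.0)
v = jnp.ones(5)
print(hvp(x, v))  # H(x) v without materializing H
```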
A full dense Hessian can then be computed by applying Hessian-vector products to the basis vectors $e_1, \dots, e_n$:
$$H(x)\,e_1, \dots, H(x)\,e_n.$$
This gives
$$\operatorname{cost}(H) = O(n \cdot \operatorname{cost}(f) + n^2).$$
Again, the $n^2$ output term cannot be avoided for dense Hessians.
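Continuing the sketch above (same illustrative function), the dense Hessian can be assembled from $n$ Hessian-vector products and checked against `jax.hessian`.

```python
# Dense Hessian from n HVPs, one per basis vector: n * cost(f) work, n^2 output.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * x)

def hvp(x, v):
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.arange(5.0)
H = jax.vmap(lambda e: hvp(x, e))(jnp.eye(5))

assert jnp.allclose(H, jax.hessian(f)(x))
```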
Sparse and Structured Bounds
Many real Jacobians and Hessians are sparse. If the derivative has only $k$ nonzero entries, the ideal output cost is
$$\Omega(k)$$
rather than $\Omega(mn)$.
AD can exploit sparsity in two ways.
First, it can propagate sparse tangent or adjoint objects. This is useful when local derivative structure is sparse and remains sparse through the computation.
Second, it can use graph coloring. If several Jacobian columns have disjoint sparsity patterns, they can be recovered from a single seeded forward pass. Instead of requiring $n$ passes, the system may require only $c$ passes, where $c$ is the number of colors.
Then the cost becomes
$$O(c \cdot \operatorname{cost}(f) + k),$$
where $c \le n$, often much smaller than $n$.
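A hand-coloring sketch for an illustrative banded function; real systems derive the coloring from the sparsity pattern rather than supplying it by hand.

```python
# Output i depends only on x[i] and x[i+1], so the Jacobian is banded.
# Columns {0, 2, 4} touch disjoint rows, as do columns {1, 3}: two seeded
# forward passes (one per color) contain every nonzero entry, instead of five.
import jax
import jax.numpy as jnp

def f(x):
    return x[:-1] * x[1:]  # J[i, i] = x[i+1], J[i, i+1] = x[i]

x = jnp.arange(1.0, 6.0)                    # n = 5 inputs, m = 4 outputs
seed_a = jnp.array([1., 0., 1., 0., 1.])    # color A: columns 0, 2, 4
seed_b = jnp.array([0., 1., 0., 1., 0.])    # color B: columns 1, 3

_, compressed_a = jax.jvp(f, (x,), (seed_a,))
_, compressed_b = jax.jvp(f, (x,), (seed_b,))
# Because each color's columns hit disjoint rows, every entry of these two
# vectors corresponds to exactly one nonzero of the sparse Jacobian.
print(compressed_a, compressed_b)
```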
For sparse Hessians, analogous coloring techniques apply. The practical performance depends on discovering or supplying the sparsity pattern. Without sparsity information, an AD system may fall back to dense behavior even when the mathematical derivative is sparse.
Lower Bounds
AD does not beat fundamental lower bounds. It organizes derivative computation so that the chain rule is applied with optimal sharing, but several costs remain unavoidable.
For a dense Jacobian, $mn$ entries must be produced. For a dense Hessian, $n^2$ entries must be produced. For a scalar-output gradient, $n$ entries must be produced.
Reverse mode is remarkable because it computes all $n$ gradient entries at a constant-factor arithmetic overhead over the primal evaluation, not because it avoids writing $n$ numbers.
A useful way to state the bound is: no AD mode can cost less than the size of the derivative object it produces.
The best mode depends on the derivative query:
| Query | Object size | Efficient AD primitive |
|---|---|---|
| $J(x)\,v$ | $m$ | Forward mode |
| $w^{\top} J(x)$ | $n$ | Reverse mode |
| $\nabla f(x)$, scalar $f$ | $n$ | Reverse mode |
| Full $J(x)$ | $mn$ | Repeated forward or reverse |
| $H(x)\,v$ | $n$ | Mixed mode |
| Full $H(x)$ | $n^2$ | Repeated mixed mode |
Complexity of AD Transformations
So far we have counted runtime cost. An AD compiler also has transformation cost.
Given an input program of size $|P|$, a source-transformation AD system produces a derivative program. For first-order forward mode, the transformed program usually has size
$$O(|P|).$$
For first-order reverse mode, the transformed program also has size approximately
$$O(|P|),$$
but it introduces a tape, adjoint variables, and reverse control flow.
Higher-order AD can cause code growth. Naively applying AD repeatedly may duplicate derivative logic and produce large intermediate programs. If the transformed code is aggressively inlined and specialized, the compiler may face expression swell similar to symbolic differentiation.
A well-designed AD compiler avoids this with staging, sharing, common subexpression elimination, and intermediate representations that preserve graph structure.
The theoretical distinction is important:
Symbolic differentiation may expand expressions algebraically.
Automatic differentiation differentiates the computation.
That difference is why AD usually gives linear-size derivative programs for first-order derivatives.
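One way to see this in practice is to compare the size of the traced primal and gradient programs; the function below is illustrative, and `jax.make_jaxpr` exposes JAX's intermediate representation.

```python
# The gradient program is a constant factor larger than the primal program.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * jnp.exp(x))

x = jnp.ones(8)
primal_ir = jax.make_jaxpr(f)(x)
grad_ir = jax.make_jaxpr(jax.grad(f))(x)
print(len(primal_ir.jaxpr.eqns), len(grad_ir.jaxpr.eqns))  # modest growth
```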
Practical Complexity
The asymptotic bounds are only the first layer. Real systems also pay for:
| Factor | Effect |
|---|---|
| Tape allocation | Can dominate small operations |
| Cache locality | Reverse mode may access memory in poor order |
| Kernel launch overhead | Important on GPUs |
| Mutation | Requires versioning or alias analysis |
| Control flow | May produce dynamic tapes |
| Checkpointing | Trades memory for recomputation |
| Vectorization | Changes constant factors |
| Sparsity | Can reduce full derivative cost |
| Custom gradients | Can replace expensive differentiated programs |
A reverse-mode gradient may be $O(\operatorname{cost}(f))$, yet still be slow if the tape is large, memory traffic is high, or the backward pass uses unfused kernels. The complexity theorem explains why the method scales. Systems engineering determines whether it is fast.
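As one example of the custom-gradient row in the table above, here is a sketch using JAX's `jax.custom_vjp`; the function and its hand-written adjoint are illustrative.

```python
# Replace the traced backward pass with an analytic adjoint rule.
import jax
import jax.numpy as jnp

@jax.custom_vjp
def log1pexp(x):
    return jnp.log1p(jnp.exp(x))

def log1pexp_fwd(x):
    return log1pexp(x), x               # save x for the backward pass

def log1pexp_bwd(x, g):
    return (g * jax.nn.sigmoid(x),)     # analytic, numerically stable adjoint

log1pexp.defvjp(log1pexp_fwd, log1pexp_bwd)

print(jax.grad(lambda x: jnp.sum(log1pexp(x)))(jnp.ones(3)))
```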
The Core Result
The central complexity bounds are:

- Jacobian-vector product $J(x)\,v$: $O(\operatorname{cost}(f))$
- Vector-Jacobian product $w^{\top} J(x)$, including the scalar gradient: $O(\operatorname{cost}(f))$
- Full dense Jacobian: $O(\min(n, m) \cdot \operatorname{cost}(f) + mn)$
- Hessian-vector product $H(x)\,v$: $O(\operatorname{cost}(f))$
- Full dense Hessian: $O(n \cdot \operatorname{cost}(f) + n^2)$
These bounds explain the basic architecture of automatic differentiation systems. Forward mode gives cheap directional derivatives. Reverse mode gives cheap gradients for scalar outputs. Mixed mode gives efficient second-order products. Full dense derivative matrices remain expensive because their size is large.
The practical rule is simple:
Use AD to compute the derivative object you need, not the largest derivative object the function admits.