Automatic differentiation is fundamentally a computational technique. Its practical importance comes from the fact that derivatives can often be computed with asymptotic cost close to the original function evaluation.
This section studies the computational complexity of automatic differentiation:
- operation count
- memory usage
- asymptotic scaling
- dimensional dependence
- graph structure effects
The central result is that derivative computation is usually much cheaper than symbolic differentiation and much more accurate than numerical differentiation.
Computational Model
Consider a function
$$ f : \mathbb{R}^n \to \mathbb{R}^m $$
implemented as a program whose execution unfolds into a straight-line trace of:
- arithmetic operations
- transcendental operations
- tensor primitives
- control flow (resolved into the trace at execution time)
Suppose the primal evaluation requires $C(f)$ elementary operations.
The key question is: how much does computing derivatives of $f$ cost, relative to $C(f)$?
Automatic differentiation answers this by transforming the computational procedure itself.
Complexity of Numerical Differentiation
Finite differences approximate derivatives by repeated function evaluations.
Forward finite difference:
$$ \frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + h\, e_i) - f(x)}{h} $$
For $f : \mathbb{R}^n \to \mathbb{R}$, computing the full gradient requires:
- one baseline evaluation
- one perturbed evaluation per input dimension
Total cost:
$$ (n + 1)\, C(f) = \Theta(n\, C(f)) $$
Central differences require $2n$ evaluations.
Higher-order finite differences become even more expensive.
In addition, finite differences suffer from:
- truncation error from the Taylor remainder
- cancellation error from subtracting nearly equal values
- difficult step-size tuning between those two error sources
Automatic differentiation avoids all three issues.
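As a concrete illustration, here is a minimal JAX sketch of the step-size dilemma; the function `f`, the evaluation point, and the step sizes are illustrative choices, not taken from the text above:

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # float64 makes the error curve visible

def f(x):
    return jnp.sin(x) * jnp.exp(x)

x = 1.0
exact = jax.grad(f)(x)                      # AD gradient, exact up to floating point

for h in [1e-2, 1e-5, 1e-8, 1e-11]:
    fd = (f(x + h) - f(x)) / h              # forward finite difference
    print(f"h={h:.0e}  error={float(abs(fd - exact)):.2e}")
# The error first shrinks (truncation dominates) and then grows again
# (cancellation dominates); no single step size is uniformly good.
```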
Complexity of Symbolic Differentiation
Symbolic differentiation manipulates expressions algebraically.
The derivative expression may grow exponentially larger than the original expression.
Example: differentiating a product
$$ f(x) = g_1(x)\, g_2(x) \cdots g_n(x) $$
with the product rule yields a sum of $n$ terms, each itself a product of $n$ factors.
Repeated symbolic expansion can cause:
- expression swell
- repeated subexpressions
- large intermediate trees
A symbolic derivative may therefore cost much more to evaluate than the original function.
Automatic differentiation avoids symbolic expansion because it works directly on the executed computation graph.
Forward Mode Complexity
Forward mode propagates tangents alongside primal values.
Each primitive operation computes:
- primal output
- tangent output
Example: for the product $z = x \cdot y$, the primal output is
$$ z = x y $$
and the tangent output is
$$ \dot{z} = \dot{x}\, y + x\, \dot{y}. $$
Thus each primal operation induces a small constant amount of derivative work.
If the primal computation cost is $C(f)$, then one forward-mode pass usually costs
$$ C_{\text{forward}}(f) = \Theta(C(f)) $$
with a modest constant factor.
Typical practical overhead is a small constant factor, often around $2\times$ to $3\times$ and sometimes more, depending on primitive complexity.
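A minimal forward-mode sketch using `jax.jvp`, JAX's JVP entry point; the function `f`, the input `x`, and the seed `v` are illustrative:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.tanh(x) @ x                # R^4 -> R

x = jnp.arange(1.0, 5.0)                   # primal input
v = jnp.array([1.0, 0.0, 0.0, 0.0])        # tangent seed (a direction in R^4)

y, y_dot = jax.jvp(f, (x,), (v,))
# One call evaluates the primal and propagates the tangent alongside it,
# so the cost is a small constant multiple of evaluating f alone.
print(y, y_dot)
```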
Full Jacobian via Forward Mode
For
$$ f : \mathbb{R}^n \to \mathbb{R}^m, $$
forward mode computes the Jacobian-vector product
$$ J_f(x)\, v $$
for one seed vector $v \in \mathbb{R}^n$.
To recover the full Jacobian, use the basis vectors $e_1, \dots, e_n$ as seeds.
Thus the full Jacobian requires $n$ forward passes.
Total complexity:
$$ \Theta(n\, C(f)) $$
This is efficient when:
- $n$ is small
- directional derivatives are sufficient
- only selected Jacobian columns are needed
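Tying the steps above together, here is a hedged sketch of the column-by-column construction; the function `f` is an arbitrary example, and the result is checked against `jax.jacfwd`, which performs this same basis-vector seeding internally:

```python
import jax
import jax.numpy as jnp

def f(x):                                  # R^3 -> R^2
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
n = x.shape[0]

# One forward pass per basis vector: n JVPs yield the n Jacobian columns.
cols = [jax.jvp(f, (x,), (jnp.eye(n)[i],))[1] for i in range(n)]
J = jnp.stack(cols, axis=1)                # shape (m, n) = (2, 3)

assert jnp.allclose(J, jax.jacfwd(f)(x))   # matches the built-in construction
```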
Reverse Mode Complexity
Reverse mode computes vector-Jacobian products.
For scalar-output functions $f : \mathbb{R}^n \to \mathbb{R}$, reverse mode computes the full gradient in one reverse pass.
The total cost is approximately
$$ C_{\text{reverse}}(f) = \Theta(C(f)) $$
again with a moderate constant factor.
This result is one of the most important facts in automatic differentiation.
A function with millions of parameters can often have its full gradient computed at only a few times the cost of the forward evaluation.
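A minimal JAX sketch of this; the loss function and the million-entry parameter vector are illustrative:

```python
import jax
import jax.numpy as jnp

def loss(w):                           # R^n -> R, with n deliberately large
    return jnp.sum(jnp.tanh(w) ** 2)

w = jnp.ones(1_000_000)

g = jax.grad(loss)(w)                  # full million-entry gradient, one pass
print(g.shape)                         # (1000000,)

# Equivalently, seed an explicit reverse pass with a cotangent for the output:
y, vjp_fn = jax.vjp(loss, w)
(g2,) = vjp_fn(jnp.ones_like(y))       # same gradient via a VJP
```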
The Cheap Gradient Principle
The classical result is often called the cheap gradient principle.
For scalar-output functions
$$ f : \mathbb{R}^n \to \mathbb{R}, $$
the gradient complexity satisfies
$$ C(\nabla f) \le c \cdot C(f), $$
where:
- $c$ is a small constant
- typically less than 5 in practice
Importantly:
- the cost does not scale linearly with the input dimension $n$
- reverse mode exploits shared intermediate computations
This principle explains the feasibility of modern deep learning.
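An illustrative timing sketch of the principle; the function, sizes, and the measured ratio are demonstration assumptions, and real ratios vary with hardware, compilation, and the function itself:

```python
import time
import jax
import jax.numpy as jnp

def f(w):
    x = w
    for _ in range(50):                # a long chain of elementwise primitives
        x = jnp.tanh(x * 1.01)
    return jnp.sum(x)

w = jnp.ones(1_000_000)
fwd = jax.jit(f)
bwd = jax.jit(jax.grad(f))
fwd(w).block_until_ready()             # compile both before timing
bwd(w).block_until_ready()

t0 = time.perf_counter(); fwd(w).block_until_ready()
t1 = time.perf_counter(); bwd(w).block_until_ready()
t2 = time.perf_counter()
print(f"gradient / primal time ratio ~ {(t2 - t1) / (t1 - t0):.1f}")
```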
Memory Complexity
Forward mode requires little additional memory.
Each variable stores:
- primal value
- tangent value
Memory complexity therefore remains a small constant multiple of the primal evaluation's memory.
Reverse mode is different.
The reverse pass requires:
- primal intermediate values
- dependency structure
- operation ordering
Thus reverse mode often stores:
- the computation tape
- intermediate activations
Memory complexity becomes approximately
$$ \Theta(C(f)) $$
for long computations, since the stored intermediates grow with the number of executed operations.
This memory growth is one of the main engineering constraints in reverse-mode systems.
Tape Complexity
A reverse-mode tape stores information such as:
- operation type
- input references
- saved primal values
- local backward rule
Tape size depends on:
- number of operations
- tensor sizes
- control flow
- recomputation strategy
For deep neural networks:
- activation storage dominates memory cost
- gradients themselves are often smaller than saved activations
This is why activation checkpointing is important.
Checkpointing Tradeoff
Checkpointing reduces memory usage by recomputing parts of the forward pass during the backward pass.
Instead of storing every intermediate:
- store selected checkpoints
- recompute missing values later
This trades memory for additional computation.
A simplified tradeoff:
| Strategy | Memory | Compute |
|---|---|---|
| full tape | high | low |
| recomputation | low | high |
| checkpointing | moderate | moderate |
Optimal checkpoint schedules are an active research topic.
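A minimal sketch of checkpointing with `jax.checkpoint` (also exposed as `jax.remat`); the block structure and sizes are illustrative:

```python
import jax
import jax.numpy as jnp

def block(x):
    for _ in range(10):
        x = jnp.tanh(x)                # intermediates normally saved for backward
    return x

def f(x):
    for _ in range(5):
        x = jax.checkpoint(block)(x)   # store only block boundaries; recompute
    return jnp.sum(x)                  # the interior during the backward pass

g = jax.grad(f)(jnp.ones(4096))
```

Wrapping each block this way discards its interior activations after the forward pass and recomputes them when the backward pass reaches that block, implementing the moderate-memory, moderate-compute row of the table above.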
Complexity by Input and Output Shape
Mode efficiency depends strongly on dimensions.
For $f : \mathbb{R}^n \to \mathbb{R}^m$:
| Goal | Efficient mode | Complexity |
|---|---|---|
| one JVP | forward | $\Theta(C(f))$ |
| one VJP | reverse | $\Theta(C(f))$ |
| full Jacobian via forward | forward | $\Theta(n\, C(f))$ |
| full Jacobian via reverse | reverse | $\Theta(m\, C(f))$ |
| scalar gradient ($m = 1$) | reverse | $\Theta(C(f))$ |
| many-output sensitivity ($n$ small) | forward | $\Theta(n\, C(f))$ |
Thus:
- forward mode cost scales with the input dimension $n$
- reverse mode cost scales with the output dimension $m$ (see the sketch below)
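A small sketch of this shape rule using `jax.jacfwd` and `jax.jacrev`; the two functions are illustrative extremes:

```python
import jax
import jax.numpy as jnp

wide = lambda x: jnp.sum(x) ** 2 * jnp.ones(2)   # n = 1000 inputs, m = 2 outputs
tall = lambda x: jnp.outer(x, x).ravel()          # n = 3 inputs,    m = 9 outputs

J_wide = jax.jacrev(wide)(jnp.ones(1000))  # reverse: ~2 passes, not 1000
J_tall = jax.jacfwd(tall)(jnp.ones(3))     # forward: ~3 passes, not 9
print(J_wide.shape, J_tall.shape)          # (2, 1000) (9, 3)
```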
Sparse Jacobians
Many programs have sparse derivative structure.
Example: each output component $f_i$ may depend on only a few input variables.
Naive Jacobian construction wastes work on zeros.
Sparse AD techniques exploit:
- graph coloring
- compressed seeding
- structural sparsity
- block decomposition
Complexity can then approach
$$ \Theta(k\, C(f)), $$
where $k$ is the number of seed colors needed (often far smaller than $n$), rather than dense matrix complexity; a minimal sketch follows below.
This is critical in:
- PDE solvers
- optimization
- scientific computing
- large simulations
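A minimal sketch of compressed seeding in the simplest case, a purely elementwise function whose Jacobian is diagonal (one "color" in graph-coloring terms); the function is an illustrative assumption:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) * x              # output i depends only on input i

x = jnp.arange(1.0, 6.0)
# A single JVP with the all-ones seed recovers every nonzero of the
# diagonal Jacobian, instead of n separate basis-vector passes.
_, diag = jax.jvp(f, (x,), (jnp.ones_like(x),))

assert jnp.allclose(diag, jnp.diag(jax.jacfwd(f)(x)))
```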
Higher-Order Complexity
Higher-order derivatives grow rapidly in cost.
Gradient:
- first-order tensor
Hessian:
- second-order tensor
Third derivative:
- third-order tensor
Tensor size grows combinatorially.
Example: for $f : \mathbb{R}^n \to \mathbb{R}$, the Hessian contains
$$ n^2 $$
entries, and the third-derivative tensor contains
$$ n^3 $$
entries.
Full higher-order tensors become impractical quickly.
Mixed-mode techniques therefore compute:
- Hessian-vector products
- directional higher derivatives
- structured approximations
instead of full dense tensors.
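A sketch of a Hessian-vector product via forward-over-reverse composition in JAX; the function `f` and the direction `v` are illustrative:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * x ** 2)

def hvp(f, x, v):
    # Differentiate the gradient along one direction: a JVP through grad(f)
    # costs a small constant multiple of one gradient evaluation.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.arange(1.0, 5.0)
v = jnp.ones(4)
print(hvp(f, x, v))                         # H @ v without materializing H
assert jnp.allclose(hvp(f, x, v), jax.hessian(f)(x) @ v)
```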
Dynamic Graph Complexity
Programs with dynamic control flow create variable computational graphs.
Examples:
- loops with data-dependent iteration counts
- recursion
- dynamic branching
AD complexity then depends on the executed trace.
For runtime-generated graphs:
- tape size varies
- operation count varies
- backward cost follows executed operations
This dynamic nature complicates compiler optimization.
Parallel Complexity
Derivative computations often inherit parallel structure from the primal computation.
Independent operations remain parallelizable.
However:
- reverse accumulation introduces dependency ordering
- gradient accumulation creates synchronization points
- reduction operations may serialize
GPU systems therefore optimize:
- fused kernels
- batched reductions
- asynchronous scheduling
- gradient aggregation
Efficient reverse-mode scheduling is a systems problem as much as a mathematical one.
Asymptotic Comparison
| Method | Accuracy | Complexity |
|---|---|---|
| finite differences | approximate | $\Theta(n\, C(f))$ for full gradient |
| symbolic differentiation | exact symbolic | potentially exponential |
| forward AD | exact up to FP | $\Theta(n\, C(f))$ for full gradient |
| reverse AD | exact up to FP | $\Theta(C(f))$ for scalar gradient |
This comparison explains why automatic differentiation became dominant in machine learning and scientific optimization.
Computational Graph Reuse
A major efficiency source in AD is graph reuse.
Suppose an expensive intermediate $u = g(x)$ is consumed by several later operations, for example
$$ f(x) = h_1(u) + h_2(u). $$
Symbolic differentiation may duplicate work.
AD computes:
- $u = g(x)$ once
- then propagates derivatives through the shared intermediate
Thus AD complexity tracks executed computation rather than expanded algebraic form.
This distinction is fundamental.
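A minimal sketch of the distinction, assuming an illustrative expensive intermediate `g`:

```python
import jax
import jax.numpy as jnp

def g(x):                          # stand-in for an expensive subcomputation
    return jnp.tanh(x @ x)

def f(x):
    u = g(x)                       # computed once, consumed twice below
    return jnp.sin(u) + u ** 2

grad_f = jax.grad(f)(jnp.ones(3))
# Symbolically, d/dx [sin(g(x)) + g(x)^2] repeats the derivative of g in
# every term; the executed graph differentiates through u only once and
# sums the two adjoint contributions flowing back into it.
```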
Complexity of Real Systems
Actual runtime depends on:
- tensor sizes
- memory bandwidth
- cache locality
- GPU utilization
- kernel launch overhead
- operator fusion
- graph compilation
- sparse structure
- batching
Modern AD systems are therefore compiler-runtime hybrids, not merely calculus engines.
The asymptotic theory explains scalability, but practical performance depends heavily on systems design.
Summary
Automatic differentiation achieves efficient derivative computation because it propagates local derivatives through executed computation graphs.
Key complexity facts:
- forward mode computes JVPs efficiently
- reverse mode computes gradients efficiently
- reverse mode enables cheap gradients for scalar-output functions
- memory cost is the main reverse-mode limitation
- sparse and structured methods improve scalability
- higher-order derivatives require mixed strategies
The central complexity result is the cheap gradient principle:
$$ C(\nabla f) = \Theta(C(f)) $$
for scalar-output functions under reverse accumulation.