# Computational Complexity

Automatic differentiation is fundamentally a computational technique. Its practical importance comes from the fact that derivatives can often be computed with asymptotic cost close to the original function evaluation.

This section studies the computational complexity of automatic differentiation:
- operation count
- memory usage
- asymptotic scaling
- dimensional dependence
- graph structure effects

The central result is that derivative computation is usually much cheaper than symbolic differentiation and much more accurate than numerical differentiation.

## Computational Model

Consider a function:

$$
f:\mathbb{R}^n \to \mathbb{R}^m
$$

implemented as a program built from:
- arithmetic operations
- transcendental functions
- tensor primitives
- control flow

Once executed, any control flow resolves into a straight-line trace of elementary operations.

Suppose the primal evaluation requires:

$$
C(f)
$$

elementary operations.

The key question is: what is the cost of computing the derivatives of $f$?

Automatic differentiation answers this by transforming the computational procedure itself.

## Complexity of Numerical Differentiation

Finite differences approximate derivatives by repeated function evaluations.

Forward finite difference:

$$
f'(x)\approx\frac{f(x+h)-f(x)}{h}
$$

For:

$$
f:\mathbb{R}^n \to \mathbb{R}
$$

computing the full gradient requires:
- one baseline evaluation
- one perturbed evaluation per input dimension

Total cost:

$$
(n+1)C(f)
$$

Central differences require $2n$ evaluations, for a total cost of:

$$
2nC(f)
$$

Higher-order finite differences become even more expensive.

In addition:
- truncation error from the finite step size
- cancellation error from subtracting nearly equal values
- difficult step-size tuning

Automatic differentiation avoids all three issues.
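
To make the cost concrete, here is a minimal finite-difference gradient sketch; the test function is an arbitrary example, and the $(n+1)$ evaluations are visible in the loop structure.

```python
import numpy as np

def fd_gradient(f, x, h=1e-6):
    """Forward-difference gradient: one baseline call plus one call per input dimension."""
    fx = f(x)                        # 1 baseline evaluation
    grad = np.empty(len(x))
    for i in range(len(x)):          # n perturbed evaluations -> (n + 1) C(f) total
        xp = x.copy()
        xp[i] += h
        grad[i] = (f(xp) - fx) / h   # subject to truncation and cancellation error
    return grad

f = lambda x: np.sum(np.sin(x) * x)              # arbitrary scalar-valued test function
print(fd_gradient(f, np.linspace(0.0, 1.0, 5)))
```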

## Complexity of Symbolic Differentiation

Symbolic differentiation manipulates expressions algebraically.

The derivative expression can grow far larger than the original expression, exponentially so in the worst case.

Example:

$$
f(x)=((x+1)^2+1)^2+\cdots
$$

Repeated symbolic expansion can cause:
- expression swell
- repeated subexpressions
- large intermediate trees

A symbolic derivative may therefore cost much more to evaluate than the original function.
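
A quick way to see expression swell, assuming SymPy is available (the nesting depth of five is arbitrary):

```python
import sympy as sp

x = sp.symbols('x')
expr = x
for _ in range(5):
    expr = (expr + 1) ** 2          # nests the example ((x + 1)^2 + 1)^2 ... five levels deep
deriv = sp.expand(sp.diff(expr, x))
# The expanded derivative contains far more operations than the original expression.
print(sp.count_ops(expr), sp.count_ops(deriv))
```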

Automatic differentiation avoids symbolic expansion because it works directly on the executed computation graph.

## Forward Mode Complexity

Forward mode propagates tangents alongside primal values.

Each primitive operation computes:
- primal output
- tangent output

Example:

$$
z=xy
$$

requires:

$$
z=xy
$$

and:

$$
\dot z=y\dot x+x\dot y
$$

Thus each primal operation induces a small constant amount of derivative work.

If the primal computation cost is:

$$
C(f)
$$

then one forward-mode pass usually costs:

$$
C_{\text{forward}}
=
\Theta(C(f))
$$

with a modest constant factor.

Typical practical overhead is $2\times$ to $3\times$, and sometimes up to $5\times$, depending on primitive complexity.
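
A minimal dual-number sketch makes this constant-factor overhead visible: each primitive carries one extra value and roughly one extra multiply-add (the example function is arbitrary).

```python
from dataclasses import dataclass

@dataclass
class Dual:
    """Primal value and tangent propagated together through every primitive."""
    val: float
    dot: float

    def __add__(self, other):
        # Sum rule: one extra addition of tangents.
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule: z = x*y  =>  z_dot = y*x_dot + x*y_dot.
        return Dual(self.val * other.val,
                    other.val * self.dot + self.val * other.dot)

# f(x, y) = x*y + x, seeded with x_dot = 1, y_dot = 0: one pass yields df/dx.
x, y = Dual(3.0, 1.0), Dual(4.0, 0.0)
z = x * y + x
print(z.val, z.dot)   # 15.0 and df/dx = y + 1 = 5.0
```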

## Full Jacobian via Forward Mode

For:

$$
f:\mathbb{R}^n\to\mathbb{R}^m
$$

forward mode computes:

$$
J_f(x)v
$$

for one seed vector $v$.

To recover the full Jacobian, use basis vectors:

$$
e_1,e_2,\dots,e_n
$$

Thus the full Jacobian requires $n$ forward passes.

Total complexity:

$$
\Theta(nC(f))
$$

This is efficient when:
- $n$ is small
- directional derivatives are sufficient
- only selected Jacobian columns are needed
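
A sketch of the column-by-column construction follows; a finite-difference directional derivative stands in for a true forward-mode JVP, since both share the per-seed $\Theta(C(f))$ cost profile (the example map is arbitrary).

```python
import numpy as np

def jvp(f, x, v, h=1e-7):
    # Directional derivative J_f(x) v; a finite-difference stand-in for forward mode.
    return (f(x + h * v) - f(x)) / h

def jacobian_forward(f, x):
    """Full Jacobian from n seed vectors e_1, ..., e_n: Theta(n * C(f)) work."""
    n = len(x)
    cols = [jvp(f, x, np.eye(n)[:, i]) for i in range(n)]
    return np.stack(cols, axis=1)

f = lambda x: np.array([x[0] * x[1], np.sin(x[2]), x[0] + x[2]])   # R^3 -> R^3
print(jacobian_forward(f, np.array([1.0, 2.0, 0.5])))
```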

## Reverse Mode Complexity

Reverse mode computes vector-Jacobian products.

For scalar-output functions:

$$
f:\mathbb{R}^n\to\mathbb{R}
$$

reverse mode computes the full gradient in one reverse pass.

The total cost is approximately:

$$
C_{\text{reverse}}
=
\Theta(C(f))
$$

again with a moderate constant factor.

This result is one of the most important facts in automatic differentiation.

A function with millions of parameters can often have its full gradient computed at only a few times the cost of the forward evaluation.
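
The following toy tape illustrates why the cost stays near $C(f)$: the reverse sweep visits each recorded operation exactly once, regardless of how many inputs there are (the function at the end is an arbitrary example).

```python
import math

class Var:
    """Minimal tape node: each operation records its parents and local partials."""
    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents   # pairs (parent Var, d(output)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.val * other.val, [(self, other.val), (other, self.val)])

def sin(x):
    return Var(math.sin(x.val), [(x, math.cos(x.val))])

def backward(out):
    """Single reverse sweep in topological order: one step per recorded operation,
    so the cost is proportional to the primal operation count C(f)."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, partial in node.parents:
            parent.grad += partial * node.grad

# f(x, y) = x*y + sin(x): the full gradient from one forward and one reverse pass.
x, y = Var(2.0), Var(3.0)
z = x * y + sin(x)
backward(z)
print(x.grad, y.grad)   # df/dx = y + cos(x), df/dy = x
```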

## The Cheap Gradient Principle

The classical result is often called the cheap gradient principle.

For scalar-output functions:

$$
f:\mathbb{R}^n\to\mathbb{R}
$$

the gradient complexity satisfies:

$$
C(\nabla f)
\leq
\omega C(f)
$$

where:
- $\omega$ is a small constant
- typically less than 5 in practice

Importantly:
- the gradient cost does not grow with the input dimension $n$
- reverse mode exploits shared intermediate computations

This principle explains the feasibility of modern deep learning.

## Memory Complexity

Forward mode requires little additional memory.

Each variable stores:
- primal value
- tangent value

Memory complexity remains:

$$
\Theta(\text{live variables})
$$

Reverse mode is different.

The reverse pass requires:
- primal intermediate values
- dependency structure
- operation ordering

Thus reverse mode often stores:
- the computation tape
- intermediate activations

Memory complexity becomes approximately:

$$
\Theta(C(f))
$$

for long computations.

This memory growth is one of the main engineering constraints in reverse-mode systems.

## Tape Complexity

A reverse-mode tape stores information such as:
- operation type
- input references
- saved primal values
- local backward rule

Tape size depends on:
- number of operations
- tensor sizes
- control flow
- recomputation strategy

For deep neural networks:
- activation storage dominates memory cost
- gradients themselves are often smaller than saved activations

This is why activation checkpointing is important.

## Checkpointing Tradeoff

Checkpointing reduces memory usage by recomputing parts of the forward pass during the backward pass.

Instead of storing every intermediate:
- store selected checkpoints
- recompute missing values later

This trades memory for additional computation.

A simplified tradeoff:

| Strategy | Memory | Compute |
|---|---|---|
| full tape | high | low |
| recomputation | low | high |
| checkpointing | moderate | moderate |

Optimal checkpoint schedules are an active research topic.
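
As a rough illustration of the tradeoff (a standard back-of-the-envelope argument, not tied to any particular system), consider a chain of $N$ layers with a checkpoint stored every $k$ layers:

$$
\text{memory}\;\approx\;\frac{N}{k}+k
\quad\text{stored activations},
\qquad
\text{extra compute}\;\approx\;C(f).
$$

Choosing $k\approx\sqrt{N}$ minimizes the memory term at roughly $2\sqrt{N}$ stored activations, at the price of about one additional forward pass.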

## Complexity by Input and Output Shape

Mode efficiency depends strongly on dimensions.

For:

$$
f:\mathbb{R}^n\to\mathbb{R}^m
$$

| Goal | Cost and preferred mode |
|---|---|
| one JVP | $\Theta(C(f))$ via forward mode |
| one VJP | $\Theta(C(f))$ via reverse mode |
| full Jacobian via forward | $\Theta(nC(f))$ |
| full Jacobian via reverse | $\Theta(mC(f))$ |
| scalar gradient ($m=1$) | $\Theta(C(f))$ via reverse mode |
| sensitivity of all outputs to one input direction | $\Theta(C(f))$ via forward mode |

Thus:
- forward mode scales with input dimension
- reverse mode scales with output dimension
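
As a concrete illustration, assuming a JAX installation (the map $f$ below is an arbitrary example): `jax.jacfwd` assembles the Jacobian from forward-mode JVPs, while `jax.jacrev` assembles it from reverse-mode VJPs.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Arbitrary example map from R^3 to R^3.
    return jnp.stack([x[0] * x[1], jnp.sin(x[2]), x[0] + x[2]])

x = jnp.array([1.0, 2.0, 0.5])
print(jax.jacfwd(f)(x))   # built from n JVPs: favorable when the input dimension is small
print(jax.jacrev(f)(x))   # built from m VJPs: favorable when the output dimension is small
```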

## Sparse Jacobians

Many programs have sparse derivative structure.

Example:

$$
f_i(x)
$$

may depend on only a few input variables.

Naive Jacobian construction wastes work on zeros.

Sparse AD techniques exploit:
- graph coloring
- compressed seeding
- structural sparsity
- block decomposition

Complexity can then approach:

$$
\Theta(\text{nonzero derivative count})
$$

rather than dense matrix complexity.

This is critical in:
- PDE solvers
- optimization
- scientific computing
- large simulations
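
A minimal sketch of compressed seeding for a tridiagonal Jacobian (the function is a made-up example, and a finite-difference directional derivative stands in for a forward-mode JVP): three colored seeds recover all $n$ columns, so the cost scales with the number of colors rather than with $n$.

```python
import numpy as np

def f(x):
    # Tridiagonal structure: output i depends only on x[i-1], x[i], x[i+1].
    n = len(x)
    y = np.empty(n)
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y[i] = left * x[i] + np.sin(x[i]) + right ** 2
    return y

def jvp(f, x, v, h=1e-7):
    # Directional derivative; finite-difference stand-in for a forward-mode JVP.
    return (f(x + h * v) - f(x)) / h

n = 9
x = np.linspace(0.1, 1.0, n)
J = np.zeros((n, n))
# Three colors suffice: columns j, j+3, j+6, ... never share a row, so one
# seed per color recovers all of its columns from a single JVP.
for color in range(3):
    seed = np.zeros(n)
    seed[color::3] = 1.0
    col_sum = jvp(f, x, seed)                        # sum of columns in this color
    for j in range(color, n, 3):
        for i in range(max(0, j - 1), min(n, j + 2)):  # structural nonzeros of column j
            J[i, j] = col_sum[i]
print(J)
```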

## Higher-Order Complexity

Higher-order derivatives grow rapidly in cost.

The gradient is a first-order tensor, the Hessian a second-order tensor, and the third derivative a third-order tensor.

Tensor size grows rapidly with the derivative order.

Example:

$$
H_f(x)\in\mathbb{R}^{n\times n}
$$

contains:

$$
n^2
$$

entries.

Third derivatives contain:

$$
n^3
$$

entries.

Full higher-order tensors become impractical quickly.

Mixed-mode techniques therefore compute:
- Hessian-vector products
- directional higher derivatives
- structured approximations

instead of full dense tensors.
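
A minimal Hessian-vector product sketch, using a directional difference of the gradient in place of a nested AD pass (the quadratic test function is an arbitrary example): the product costs two gradient evaluations rather than the $n^2$ entries of the full Hessian.

```python
import numpy as np

n = 4
A = np.diag(np.arange(1.0, n + 1))   # f(x) = 0.5 * x^T A x, so the Hessian is exactly A

def grad_f(x):
    return A @ x

def hvp(grad, x, v, h=1e-6):
    # H(x) v ~ (grad(x + h*v) - grad(x)) / h: two gradient calls, no n x n storage.
    return (grad(x + h * v) - grad(x)) / h

x = np.ones(n)
v = np.array([1.0, 0.0, -1.0, 2.0])
print(hvp(grad_f, x, v))   # approximately A @ v
print(A @ v)               # exact value for comparison
```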

## Dynamic Graph Complexity

Programs with dynamic control flow create variable computational graphs.

Example:
- loops with data-dependent iteration counts
- recursion
- dynamic branching

AD complexity then depends on the executed trace.

For runtime-generated graphs:
- tape size varies
- operation count varies
- backward cost follows executed operations

This dynamic nature complicates compiler optimization.

## Parallel Complexity

Derivative computations often inherit parallel structure from the primal computation.

Independent operations remain parallelizable.

However:
- reverse accumulation introduces dependency ordering
- gradient accumulation creates synchronization points
- reduction operations may serialize

GPU systems therefore optimize:
- fused kernels
- batched reductions
- asynchronous scheduling
- gradient aggregation

Efficient reverse-mode scheduling is a systems problem as much as a mathematical one.

## Asymptotic Comparison

| Method | Accuracy | Cost (gradient of $f:\mathbb{R}^n\to\mathbb{R}$) |
|---|---|---|
| finite differences | approximate (truncation and rounding error) | $O(nC(f))$ |
| symbolic differentiation | exact, symbolic | potentially exponential expression growth |
| forward AD | exact up to floating-point rounding | $O(nC(f))$ |
| reverse AD | exact up to floating-point rounding | $O(C(f))$ |

This comparison explains why automatic differentiation became dominant in machine learning and scientific optimization.

## Computational Graph Reuse

A major efficiency source in AD is graph reuse.

Suppose:

$$
f(x)=g(x)+g(x)
$$

Symbolic differentiation may duplicate work.

AD computes:
- $g(x)$ once
- propagates derivatives through shared intermediates

Thus AD complexity tracks executed computation rather than expanded algebraic form.

This distinction is fundamental.

## Complexity of Real Systems

Actual runtime depends on:
- tensor sizes
- memory bandwidth
- cache locality
- GPU utilization
- kernel launch overhead
- operator fusion
- graph compilation
- sparse structure
- batching

Modern AD systems are therefore compiler-runtime hybrids, not merely calculus engines.

The asymptotic theory explains scalability, but practical performance depends heavily on systems design.

## Summary

Automatic differentiation achieves efficient derivative computation because it propagates local derivatives through executed computation graphs.

Key complexity facts:
- forward mode computes JVPs efficiently
- reverse mode computes gradients efficiently
- reverse mode enables cheap gradients for scalar-output functions
- memory cost is the main reverse-mode limitation
- sparse and structured methods improve scalability
- higher-order derivatives require mixed strategies

The central complexity result is the cheap gradient principle:

$$
C(\nabla f)
=
\Theta(C(f))
$$

for scalar-output functions under reverse accumulation.

