Automatic differentiation is a method for computing derivatives by transforming programs into derivative-propagating computations. It does not approximate derivatives numerically and does not manipulate symbolic expressions globally. Instead, it applies the chain rule locally across primitive operations.
This chapter established the operational foundations of AD:
- local derivative propagation
- elementary derivative rules
- forward accumulation
- reverse accumulation
- mixed-mode composition
- computational complexity
- floating point exactness
- program transformation
Together, these ideas define the modern theory of automatic differentiation.
The Central Computational Idea
A program computes a function through a sequence of elementary operations.
Example: the function

f(x) = sin(x² + 3x)

becomes:

v1 = x * x
v2 = 3 * x
v3 = v1 + v2
v4 = sin(v3)

Automatic differentiation augments this computation with derivative propagation.
The chain rule is applied incrementally, producing

f'(x) = cos(x² + 3x) · (2x + 3)
Rather than constructing this symbolic expression explicitly, AD propagates local derivative information operationally.
This is the key abstraction.
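A minimal sketch of this augmentation in plain Python (written for illustration; the function name and structure are assumptions, not any framework's API). Each `*_dot` line applies the local derivative rule of the operation directly above it:

```python
import math

def f_and_tangent(x, x_dot=1.0):
    """Evaluate f(x) = sin(x*x + 3*x) together with the tangent f'(x) * x_dot."""
    v1 = x * x
    v1_dot = 2 * x * x_dot          # d(x*x)/dx = 2x

    v2 = 3 * x
    v2_dot = 3 * x_dot              # d(3x)/dx = 3

    v3 = v1 + v2
    v3_dot = v1_dot + v2_dot        # sum rule

    v4 = math.sin(v3)
    v4_dot = math.cos(v3) * v3_dot  # chain rule through sin

    return v4, v4_dot

print(f_and_tangent(2.0))  # value and derivative at x = 2
```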
Local Derivative Propagation
Every primitive operation defines:
- primal computation
- local derivative rule
Example: for a primitive v = φ(a, b), the local derivative rules are ∂φ/∂a and ∂φ/∂b.

Forward propagation:

v̇ = (∂φ/∂a) ȧ + (∂φ/∂b) ḃ

Reverse propagation:

ā += v̄ (∂φ/∂a),  b̄ += v̄ (∂φ/∂b)
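A hedged sketch of how a single primitive can bundle these pieces, with multiplication chosen purely for illustration (the function names are not from any library):

```python
def mul_primal(a, b):
    # primal computation
    return a * b

def mul_jvp(a, b, a_dot, b_dot):
    # forward rule: v_dot = b * a_dot + a * b_dot
    return b * a_dot + a * b_dot

def mul_vjp(a, b, v_bar):
    # reverse rule: a_bar = v_bar * b, b_bar = v_bar * a
    return v_bar * b, v_bar * a
```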
Global derivatives emerge from repeated local propagation through the computational graph.
The graph structure determines:
- dependency ordering
- derivative flow
- accumulation structure
- memory requirements
Forward Accumulation
Forward mode propagates tangents in the same direction as computation.
Each variable v carries a tangent v̇ alongside its primal value.

The tangent represents sensitivity with respect to seeded input directions.

Forward mode computes:

ẏ = J(x) ẋ

which is a Jacobian-vector product.
It is efficient when:
- input dimension is small
- output dimension is large
- directional derivatives are needed
Forward accumulation:
- uses little memory
- follows ordinary execution order
- works naturally with loops and recursion
Its cost scales with the number of input directions.
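An illustrative JVP in plain Python for a two-input, two-output function (the function and names are assumptions made for this sketch): one pass per seeded direction, so the full Jacobian of an n-input function takes n forward passes.

```python
import math

def jvp(x0, x1, d0, d1):
    """Forward pass for f(x0, x1) = (x0*x1, sin(x0) + x1) with direction (d0, d1)."""
    y0 = x0 * x1
    y0_dot = d0 * x1 + x0 * d1          # product rule

    y1 = math.sin(x0) + x1
    y1_dot = math.cos(x0) * d0 + d1     # sin rule + sum rule

    return (y0, y1), (y0_dot, y1_dot)   # (f(x), J(x) applied to the direction)

print(jvp(1.0, 2.0, 1.0, 0.0))  # column 0 of the Jacobian
print(jvp(1.0, 2.0, 0.0, 1.0))  # column 1 of the Jacobian
```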
Reverse Accumulation
Reverse mode propagates adjoints backward from outputs to inputs.
Each variable v carries an adjoint v̄, the sensitivity of the output with respect to v.

Reverse mode computes:

x̄ = J(x)ᵀ ȳ

which is a vector-Jacobian product.
For scalar-output functions f : ℝⁿ → ℝ, reverse mode computes the full gradient in one backward pass.
This gives the cheap gradient principle:

cost(∇f) ≈ cost(f)

up to a small constant factor.
Reverse mode is therefore dominant in:
- deep learning
- large-scale optimization
- differentiable simulation
- scientific inverse problems
Its main limitation is memory usage.
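A minimal tape-based sketch in plain Python (illustrative only, not the API of any particular framework): the forward pass records each operation's local derivatives, and a single backward sweep over the tape accumulates the full gradient. The recorded tape is also exactly where the memory cost arises.

```python
import math

def grad_f(x0, x1):
    """Gradient of f(x0, x1) = sin(x0*x1) + x0*x0 via a recorded tape."""
    tape = []              # entries: (output_id, [(input_id, local_derivative), ...])
    v = {0: x0, 1: x1}

    # forward pass: evaluate primals and record local derivatives
    v[2] = v[0] * v[1];    tape.append((2, [(0, v[1]), (1, v[0])]))
    v[3] = math.sin(v[2]); tape.append((3, [(2, math.cos(v[2]))]))
    v[4] = v[0] * v[0];    tape.append((4, [(0, 2 * v[0])]))
    v[5] = v[3] + v[4];    tape.append((5, [(3, 1.0), (4, 1.0)]))

    # backward pass: one sweep from outputs to inputs
    adj = {i: 0.0 for i in v}
    adj[5] = 1.0                          # seed the scalar output adjoint
    for out_id, inputs in reversed(tape):
        for in_id, local in inputs:
            adj[in_id] += adj[out_id] * local

    return adj[0], adj[1]                 # full gradient from one backward pass

print(grad_f(1.0, 2.0))  # (2*cos(2) + 2, cos(2))
```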
Mixed-Mode Differentiation
Forward and reverse modes can be composed.
Mixed mode enables:
- Hessian-vector products
- higher-order derivatives
- structured Jacobians
- nested sensitivities
Example: the Hessian-vector product

H(x) v = ∇²f(x) v

can be computed efficiently by:
- forward-over-reverse
- reverse-over-forward
without materializing the full Hessian.
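A self-contained forward-over-reverse sketch in plain Python, under the assumption of a tiny dual-number class and a hand-expanded adjoint gradient for f(x0, x1) = sin(x0*x1) + x0² (both written for this illustration). Running forward mode through the gradient program yields H·v without forming H:

```python
import math

class Dual:
    """Minimal forward-mode dual number: value + tangent."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.tan + o.tan)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.tan * o.val + self.val * o.tan)
    __rmul__ = __mul__

def dcos(x):
    # cos with tangent propagation, used inside the gradient program
    return Dual(math.cos(x.val), -math.sin(x.val) * x.tan)

def grad_f(x0, x1):
    """Adjoint-style gradient of f = sin(x0*x1) + x0*x0, hand-expanded so it
    also runs on Dual inputs."""
    v1 = x0 * x1
    v1_bar = dcos(v1)                  # adjoint of v1, with output adjoint = 1
    return v1_bar * x1 + 2.0 * x0, v1_bar * x0

def hvp(x0, x1, d0, d1):
    """Forward-over-reverse: forward mode applied to the gradient program."""
    g0, g1 = grad_f(Dual(x0, d0), Dual(x1, d1))
    return g0.tan, g1.tan              # H(x) applied to (d0, d1)

print(hvp(1.0, 2.0, 1.0, 0.0))         # first column of the Hessian
```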
Mixed mode generalizes differentiation from first-order gradients to derivative operators acting on derivative programs.
Complexity Structure
Automatic differentiation achieves efficient derivative computation because it reuses the computational structure of the original program.
Forward mode:
- scales with input dimension
Reverse mode:
- scales with output dimension
For f : ℝⁿ → ℝ, reverse mode computes the gradient with cost comparable to evaluating the function itself.
Compared with other approaches:
| Method | Accuracy | Cost |
|---|---|---|
| finite differences | approximate | repeated evaluations |
| symbolic differentiation | exact symbolic | possible expression explosion |
| automatic differentiation | exact up to floating point rounding | graph-local propagation |
AD combines:
- numerical efficiency
- derivative exactness
- scalability
This combination explains its widespread adoption.
Exactness and Floating Point Semantics
Automatic differentiation computes derivatives exactly relative to the executed floating point program.
It avoids:
- truncation error
- cancellation error from finite differences
But it still inherits:
- floating point rounding
- overflow
- underflow
- instability of the primal computation
- conditioning of the mathematical problem
Thus AD computes derivatives of the floating point program as it is actually executed, rather than of the ideal infinite-precision mathematics.
This distinction becomes important in:
- mixed precision training
- chaotic systems
- long recurrent computations
- unstable numerical kernels
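A small numerical illustration of the contrast with finite differences (plain Python; the hand-propagated derivative stands in for what an AD tool would emit). The forward-difference estimate first improves and then degrades as the step shrinks into cancellation, while the propagated value involves no step size at all:

```python
import math

def f(x):
    return math.sin(x * x + 3 * x)

def ad_derivative(x):
    # exact local rules propagated through the same computation as f
    return math.cos(x * x + 3 * x) * (2 * x + 3)

def fd_derivative(x, h):
    # forward finite difference: truncation error O(h), cancellation as h -> 0
    return (f(x + h) - f(x)) / h

x = 1.0
for h in (1e-2, 1e-6, 1e-10, 1e-14):
    print(h, abs(fd_derivative(x, h) - ad_derivative(x)))
# The error shrinks with h at first, then grows again as floating point
# cancellation dominates; the propagated derivative has no such trade-off.
```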
AD as Program Transformation
Automatic differentiation is fundamentally a compiler or runtime transformation.
The original program:

y = f(x)

becomes a transformed derivative program.
Forward transformation augments values with tangents.
Reverse transformation generates:
- forward recording
- backward propagation
This transformation may occur at:
- source level
- operator level
- graph level
- IR level
- compiler level
The program-transformation perspective explains:
- tape systems
- tracing systems
- SSA-based AD
- graph compilers
- differentiable programming languages
AD is therefore both:
- a calculus technique
- a systems architecture
Computational Graphs
A computational graph represents:
- primitive operations as nodes
- dependencies as edges
Derivative propagation becomes graph traversal.
Forward mode traverses:
- inputs → outputs
Reverse mode traverses:
- outputs → inputs
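As a toy sketch (plain Python, with node names matching the earlier example), a graph can be stored as a dependency map, and the forward and reverse traversal orders fall out of a topological sort:

```python
def topo_order(graph):
    """Return a dependency-respecting visit order (inputs before outputs)."""
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in graph[node]:
            visit(dep)
        order.append(node)
    for node in graph:
        visit(node)
    return order

# v4 = sin(v1 + v2), v1 = x*x, v2 = 3*x; each node lists its inputs
graph = {"x": [], "v1": ["x", "x"], "v2": ["x"], "v3": ["v1", "v2"], "v4": ["v3"]}

forward_order = topo_order(graph)              # inputs -> outputs
reverse_order = list(reversed(forward_order))  # outputs -> inputs
print(forward_order, reverse_order)
```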
The graph representation enables:
- dependency analysis
- scheduling
- memory planning
- checkpointing
- fusion
- parallel execution
Modern ML frameworks are fundamentally graph-transformation systems built around this principle.
Relationship to Linear Algebra
Derivative propagation is linear algebra over local Jacobians.
Forward mode propagates tangents through the chain of local Jacobians:

ẏ = J_k ⋯ J_2 J_1 ẋ

Reverse mode propagates adjoints through the transposed chain:

x̄ = J_1ᵀ J_2ᵀ ⋯ J_kᵀ ȳ
Thus:
- forward mode is pushforward propagation
- reverse mode is pullback propagation
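A small NumPy sketch of this algebra (the matrices are arbitrary stand-ins for local Jacobians): a JVP composes the chain in one direction, a VJP composes the transposes in the other, and neither ever forms the full Jacobian explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
J1 = rng.standard_normal((4, 3))   # local Jacobian of the first stage
J2 = rng.standard_normal((2, 4))   # local Jacobian of the second stage

x_dot = rng.standard_normal(3)     # input tangent (forward seed)
y_bar = rng.standard_normal(2)     # output adjoint (reverse seed)

# Pushforward: tangents flow inputs -> outputs, one local Jacobian at a time.
jvp = J2 @ (J1 @ x_dot)

# Pullback: adjoints flow outputs -> inputs through transposed Jacobians.
vjp = J1.T @ (J2.T @ y_bar)

# Both agree with the composed Jacobian J = J2 @ J1.
J = J2 @ J1
print(np.allclose(jvp, J @ x_dot), np.allclose(vjp, J.T @ y_bar))
```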
This geometric viewpoint generalizes naturally to:
- tangent bundles
- cotangent bundles
- differential geometry
- categorical semantics
The same algebraic structure appears throughout the field.
Derivatives as Programs
A major conceptual shift introduced by AD is that derivatives are executable computations.
The derivative of a program is itself another program.
This means derivatives can be:
- optimized
- transformed
- compiled
- differentiated again
- parallelized
- fused with other computations
Higher-order differentiation follows naturally from this perspective.
A gradient is not merely a mathematical object. It is an executable computational artifact.
Engineering Constraints
Practical AD systems must address:
- mutation
- aliasing
- dynamic control flow
- memory pressure
- sparse structure
- distributed execution
- GPU scheduling
- custom primitives
- compiler optimization
- mixed precision
Thus modern AD systems combine:
- calculus
- compiler theory
- runtime systems
- numerical analysis
- linear algebra
- programming language semantics
Automatic differentiation sits at the intersection of all these domains.
Conceptual Summary
Automatic differentiation is built from four core principles.
1. Programs Decompose into Primitive Operations
Complex computations are reduced to local operations with known derivative rules.
2. The Chain Rule is Applied Locally
Derivatives propagate incrementally through the computation graph.
3. Derivative Propagation Can Run Forward or Backward
Forward mode propagates tangents.
Reverse mode propagates adjoints.
4. Differentiation is Program Transformation
Derivative computation is generated mechanically from executable programs.
These principles are sufficient to construct:
- gradient systems
- neural network training engines
- differentiable simulators
- scientific optimization frameworks
- differentiable programming languages
Transition to the Next Chapters
The remainder of the book builds on these foundations.
The next chapters examine:
- forward mode in depth
- reverse mode internals
- dual-number algebra
- higher-order differentiation
- tensor and matrix calculus
- compiler architectures
- differentiable programming systems
Everything that follows depends on the ideas established here: local derivative propagation through transformed computational programs.