Summary and Synthesis

Automatic differentiation is a method for computing derivatives by transforming programs into derivative-propagating computations. It does not approximate derivatives numerically and does not manipulate symbolic expressions globally. Instead, it applies the chain rule locally across primitive operations.

This chapter established the operational foundations of AD:

  • local derivative propagation
  • elementary derivative rules
  • forward accumulation
  • reverse accumulation
  • mixed-mode composition
  • computational complexity
  • floating point exactness
  • program transformation

Together, these ideas define the modern theory of automatic differentiation.

The Central Computational Idea

A program computes a function through a sequence of elementary operations.

Example:

f(x) = \sin(x^2 + 3x)

becomes:

v1 = x * x
v2 = 3 * x
v3 = v1 + v2
v4 = sin(v3)

Automatic differentiation augments this computation with derivative propagation.

The chain rule is applied incrementally:

\frac{dy}{dx} = \frac{dy}{dv_3}\frac{dv_3}{dv_1}\frac{dv_1}{dx} + \frac{dy}{dv_3}\frac{dv_3}{dv_2}\frac{dv_2}{dx}

Rather than constructing this symbolic expression explicitly, AD propagates local derivative information operationally.

This is the key abstraction.
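
As a concrete illustration of this augmentation (a minimal sketch in plain Python, not taken from any framework), each primal assignment in the decomposition above can be paired with a tangent assignment:

import math

x = 3.0            # evaluation point (chosen arbitrarily)
dx = 1.0           # tangent seed: differentiate with respect to x

v1 = x * x;         dv1 = 2.0 * x * dx          # d(x*x)/dx
v2 = 3.0 * x;       dv2 = 3.0 * dx              # d(3x)/dx
v3 = v1 + v2;       dv3 = dv1 + dv2             # sum rule
v4 = math.sin(v3);  dv4 = math.cos(v3) * dv3    # chain rule through sin

print(v4)    # f(3)  = sin(18)
print(dv4)   # f'(3) = (2*3 + 3) * cos(18)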

Local Derivative Propagation

Every primitive operation defines:

  • primal computation
  • local derivative rule

Example:

z = xy

Forward propagation:

\dot z = y\,\dot x + x\,\dot y

Reverse propagation:

\bar x \mathrel{+}= y\,\bar z \qquad \bar y \mathrel{+}= x\,\bar z
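
In code, such a primitive can be packaged as a primal function plus its two local rules (a sketch; the names jvp_mul and vjp_mul are illustrative, not from any particular library):

def mul(x, y):
    return x * y                        # primal computation

def jvp_mul(x, y, x_dot, y_dot):
    return y * x_dot + x * y_dot        # forward rule for z = x * y

def vjp_mul(x, y, z_bar):
    return y * z_bar, x * z_bar         # reverse rule: contributions to x_bar, y_bar

x, y = 2.0, 5.0
print(mul(x, y))                 # 10.0
print(jvp_mul(x, y, 1.0, 0.0))   # dz/dx = y = 5.0
print(vjp_mul(x, y, 1.0))        # (5.0, 2.0) = (dz/dx, dz/dy)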

Global derivatives emerge from repeated local propagation through the computational graph.

The graph structure determines:

  • dependency ordering
  • derivative flow
  • accumulation structure
  • memory requirements

Forward Accumulation

Forward mode propagates tangents in the same direction as computation.

Each variable carries:

(v, \dot v)

The tangent represents sensitivity with respect to seeded input directions.

Forward mode computes:

J_f(x)\,v

which is a Jacobian-vector product.
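
A minimal sketch of such a Jacobian-vector product, with the propagation rules written out by hand for a small vector-valued function (the function and names are illustrative):

import math

def f_jvp(x, v):
    # f(x0, x1) = (x0 * x1, sin(x0)); v is the seeded tangent direction
    x0, x1 = x
    v0, v1 = v
    y0 = x0 * x1
    dy0 = x1 * v0 + x0 * v1          # product rule
    y1 = math.sin(x0)
    dy1 = math.cos(x0) * v0          # chain rule through sin
    return [y0, y1], [dy0, dy1]      # (f(x), J_f(x) v)

y, jv = f_jvp([2.0, 3.0], [1.0, 0.0])
print(jv)   # first column of the Jacobian: [3.0, cos(2.0)]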

It is efficient when:

  • input dimension is small
  • output dimension is large
  • directional derivatives are needed

Forward accumulation:

  • uses little memory
  • follows ordinary execution order
  • works naturally with loops and recursion

Its cost scales with the number of input directions.

Reverse Accumulation

Reverse mode propagates adjoints backward from outputs to inputs.

Each variable carries:

\bar v = \frac{\partial y}{\partial v}

Reverse mode computes:

u^T J_f(x)

which is a vector-Jacobian product.

For scalar-output functions:

f : \mathbb{R}^n \to \mathbb{R}

reverse mode computes the full gradient in one backward pass.
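
A minimal tape-style sketch of this single backward pass, in plain Python and supporting only addition and multiplication (illustrative, not a production design):

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents     # pairs (parent Var, local partial dz/dparent)
        self.grad = 0.0            # adjoint, accumulated during the backward pass

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def backward(output):
    # topological order of the recorded graph, then one reverse sweep
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, partial in node.parents:
            parent.grad += partial * node.grad

# gradient of f(x0, x1) = x0*x1 + x0*x0 in a single backward pass
x0, x1 = Var(2.0), Var(3.0)
y = x0 * x1 + x0 * x0
backward(y)
print(x0.grad, x1.grad)   # 7.0 2.0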

This gives the cheap gradient principle:

C(\nabla f) = \Theta(C(f))

up to a small constant factor.

Reverse mode is therefore dominant in:

  • deep learning
  • large-scale optimization
  • differentiable simulation
  • scientific inverse problems

Its main limitation is memory usage.

Mixed-Mode Differentiation

Forward and reverse modes can be composed.

Mixed mode enables:

  • Hessian-vector products
  • higher-order derivatives
  • structured Jacobians
  • nested sensitivities

Example:

H_f(x)\,v

can be computed efficiently by:

  • forward-over-reverse
  • reverse-over-forward

without materializing the full Hessian.
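
This works because the Hessian-vector product is the directional derivative of the gradient (a standard identity, not specific to this chapter):

H_f(x)\,v = \left.\frac{d}{d\varepsilon}\,\nabla f(x + \varepsilon v)\right|_{\varepsilon=0} = J_{\nabla f}(x)\,v

Applying forward mode (a Jacobian-vector product) to the reverse-mode gradient program therefore yields H_f(x)\,v without ever forming H_f(x).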

Mixed mode generalizes differentiation from first-order gradients to derivative operators acting on derivative programs.

Complexity Structure

Automatic differentiation achieves efficient derivative computation because it reuses the computational structure of the original program.

Forward mode:

  • scales with input dimension

Reverse mode:

  • scales with output dimension

For:

f : \mathbb{R}^n \to \mathbb{R}

reverse mode computes the gradient with cost comparable to evaluating the function itself.

Compared with other approaches:

Method                       Accuracy                     Cost
finite differences           approximate                  repeated function evaluations
symbolic differentiation     exact (symbolic)             possible expression explosion
automatic differentiation    exact up to floating point   graph-local propagation

AD combines:

  • numerical efficiency
  • derivative exactness
  • scalability

This combination explains its widespread adoption.

Exactness and Floating Point Semantics

Automatic differentiation computes derivatives exactly relative to the executed floating point program.

It avoids:

  • truncation error
  • cancellation error from finite differences
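
A small numerical illustration of these two finite-difference error sources (a sketch; the function and step sizes are chosen arbitrarily):

import math

x = 1.0
exact = math.cos(x)    # the derivative an AD tool would propagate for sin
for h in (1e-4, 1e-8, 1e-12):
    fd = (math.sin(x + h) - math.sin(x)) / h    # forward finite difference
    print(h, abs(fd - exact))
# large h: truncation error dominates; tiny h: cancellation error dominates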

But it still inherits:

  • floating point rounding
  • overflow
  • underflow
  • instability of the primal computation
  • conditioning of the mathematical problem

Thus AD computes derivatives of:

\operatorname{fl}(f(x))

rather than ideal infinite-precision mathematics.

This distinction becomes important in:

  • mixed precision training
  • chaotic systems
  • long recurrent computations
  • unstable numerical kernels

AD as Program Transformation

Automatic differentiation is fundamentally a compiler or runtime transformation.

The original program:

y = f(x)

becomes a transformed derivative program.

Forward transformation augments values with tangents.

Reverse transformation generates:

  • forward recording
  • backward propagation
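
Written out by hand for the running example, the transformed reverse program consists of exactly these two phases (a sketch of what a source-level transformation might emit):

import math

def f_and_grad(x):
    # recording phase: compute and keep the intermediates the backward pass needs
    v1 = x * x
    v2 = 3.0 * x
    v3 = v1 + v2
    v4 = math.sin(v3)

    # backward phase: propagate adjoints from the output back to the input
    v4_bar = 1.0
    v3_bar = math.cos(v3) * v4_bar
    v1_bar = v3_bar
    v2_bar = v3_bar
    x_bar = 2.0 * x * v1_bar + 3.0 * v2_bar
    return v4, x_bar

print(f_and_grad(3.0))   # (sin(18), 9 * cos(18))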

This transformation may occur at:

  • source level
  • operator level
  • graph level
  • IR level
  • compiler level

The program-transformation perspective explains:

  • tape systems
  • tracing systems
  • SSA-based AD
  • graph compilers
  • differentiable programming languages

AD is therefore both:

  • a calculus technique
  • a systems architecture

Computational Graphs

A computational graph represents:

  • primitive operations as nodes
  • dependencies as edges

Derivative propagation becomes graph traversal.

Forward mode traverses:

  • inputs → outputs

Reverse mode traverses:

  • outputs → inputs
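
For the running example, the graph and the two traversal orders can be written down explicitly (a sketch; node names follow the earlier decomposition):

# parent lists: which nodes each node depends on
parents = {
    "v1": ["x"],           # v1 = x * x
    "v2": ["x"],           # v2 = 3 * x
    "v3": ["v1", "v2"],    # v3 = v1 + v2
    "v4": ["v3"],          # v4 = sin(v3)
}

forward_order = ["x", "v1", "v2", "v3", "v4"]   # inputs -> outputs (forward mode)
reverse_order = list(reversed(forward_order))   # outputs -> inputs (reverse mode)
print(reverse_order)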

The graph representation enables:

  • dependency analysis
  • scheduling
  • memory planning
  • checkpointing
  • fusion
  • parallel execution

Modern ML frameworks are fundamentally graph-transformation systems built around this principle.

Relationship to Linear Algebra

Derivative propagation is linear algebra over local Jacobians.

Forward mode propagates:

\dot y = J_f(x)\,\dot x

Reverse mode propagates:

\bar x = J_f(x)^T\,\bar y

Thus:

  • forward mode is pushforward propagation
  • reverse mode is pullback propagation
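
For a two-stage composition y = g(f(x)), this structure can be written out explicitly (a standard identity rather than anything specific to this chapter):

J_{g \circ f}(x) = J_g(f(x))\,J_f(x), \qquad \dot y = J_g(f(x))\big(J_f(x)\,\dot x\big), \qquad \bar x = J_f(x)^T\big(J_g(f(x))^T\,\bar y\big)

Forward mode applies the Jacobian factors in execution order; reverse mode applies their transposes in the opposite order.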

This geometric viewpoint generalizes naturally to:

  • tangent bundles
  • cotangent bundles
  • differential geometry
  • categorical semantics

The same algebraic structure appears throughout the field.

Derivatives as Programs

A major conceptual shift introduced by AD is that derivatives are executable computations.

The derivative of a program is itself another program.

This means derivatives can:

  • be optimized
  • transformed
  • compiled
  • differentiated again
  • parallelized
  • fused with other computations

Higher-order differentiation follows naturally from this perspective.
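
A minimal sketch of differentiating a derivative program again, using nested dual numbers in plain Python (forward-over-forward; illustrative only):

import math

class Dual:
    # a + b*eps; the components may themselves be Duals, which enables nesting
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def _wrap(self, other):
        return other if isinstance(other, Dual) else Dual(other)
    def __add__(self, other):
        o = self._wrap(other)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, other):
        o = self._wrap(other)
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def sin(u):
    if isinstance(u, Dual):
        return Dual(sin(u.val), cos(u.val) * u.dot)
    return math.sin(u)

def cos(u):
    if isinstance(u, Dual):
        return Dual(cos(u.val), -1.0 * sin(u.val) * u.dot)
    return math.cos(u)

def f(x):
    return sin(x * x + 3 * x)     # the running example, written generically

# the first-order derivative program runs on Duals; seeding a Dual whose
# components are themselves Duals differentiates it once more
x0 = 3.0
x = Dual(Dual(x0, 1.0), Dual(1.0, 0.0))
y = f(x)
print(y.val.val)   # f(x0)
print(y.val.dot)   # f'(x0)
print(y.dot.dot)   # f''(x0)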

A gradient is not merely a mathematical object. It is an executable computational artifact.

Engineering Constraints

Practical AD systems must address:

  • mutation
  • aliasing
  • dynamic control flow
  • memory pressure
  • sparse structure
  • distributed execution
  • GPU scheduling
  • custom primitives
  • compiler optimization
  • mixed precision

Thus modern AD systems combine:

  • calculus
  • compiler theory
  • runtime systems
  • numerical analysis
  • linear algebra
  • programming language semantics

Automatic differentiation sits at the intersection of all these domains.

Conceptual Summary

Automatic differentiation is built from four core principles.

1. Programs Decompose into Primitive Operations

Complex computations are reduced to local operations with known derivative rules.

2. The Chain Rule is Applied Locally

Derivatives propagate incrementally through the computation graph.

3. Derivative Propagation Can Run Forward or Backward

Forward mode propagates tangents.

Reverse mode propagates adjoints.

4. Differentiation is Program Transformation

Derivative computation is generated mechanically from executable programs.

These principles are sufficient to construct:

  • gradient systems
  • neural network training engines
  • differentiable simulators
  • scientific optimization frameworks
  • differentiable programming languages

Transition to the Next Chapters

The remainder of the book builds on these foundations.

The next chapters examine:

  • forward mode in depth
  • reverse mode internals
  • dual-number algebra
  • higher-order differentiation
  • tensor and matrix calculus
  • compiler architectures
  • differentiable programming systems

Everything that follows depends on the ideas established here: local derivative propagation through transformed computational programs.