
Chapter 4. Core Theory of Automatic Differentiation

Automatic differentiation is built on a simple observation: a complicated derivative can be computed by composing many small local derivatives. Instead of manipulating a full symbolic expression, or approximating derivatives numerically, AD propagates derivative information through each elementary operation in a program.

This section develops the central mechanism underlying all forms of automatic differentiation: local derivative propagation.

Programs as Compositions

Consider a function:

f(x) = \sin(x^2 + 3x)

This function is not evaluated as a single indivisible object. A program computes it as a sequence of intermediate operations:

Step | Expression | Variable
-----|------------|---------
1    | x^2        | v_1
2    | 3x         | v_2
3    | v_1 + v_2  | v_3
4    | \sin(v_3)  | v_4

The final result is:

f(x) = v_4

Each operation has a local derivative rule:

Operation | Local Derivative (per input)
----------|-----------------------------
a + b     | 1, 1
a \cdot b | b, a
\sin(a)   | \cos(a)
a^2       | 2a

Automatic differentiation applies these rules incrementally across the computation graph.
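
This decomposition can be written out directly in code. The sketch below (plain Python, with names matching the table) is only an illustrative trace, not an AD implementation; it records each local derivative rule as a comment.

```python
import math

def trace_f(x):
    """Evaluate f(x) = sin(x^2 + 3x) as an explicit trace of
    intermediate variables, one elementary operation per line."""
    v1 = x * x         # local rule: dv1/dx  = 2x
    v2 = 3 * x         # local rule: dv2/dx  = 3
    v3 = v1 + v2       # local rules: dv3/dv1 = 1, dv3/dv2 = 1
    v4 = math.sin(v3)  # local rule: dv4/dv3 = cos(v3)
    return v4

print(trace_f(2.0))    # sin(4 + 6) = sin(10)
```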

The Chain Rule as Propagation

The chain rule connects local derivatives into global derivatives.

For nested functions:

y = f(g(h(x)))

the derivative becomes:

\frac{dy}{dx} = \frac{dy}{dg} \frac{dg}{dh} \frac{dh}{dx}

Automatic differentiation operationalizes this rule mechanically.

Instead of reasoning symbolically about the full expression, the system computes:

  1. local values
  2. local derivatives
  3. propagated sensitivities

This decomposition is the key idea.
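
A small worked example makes the three steps concrete. The inner functions below (h(x) = 3x, g(u) = u^2, f(v) = \sin(v)) are illustrative choices, not taken from the text:

```python
import math

x = 2.0

# 1. local values and 2. local derivatives, one layer at a time
h_val, dh_dx = 3 * x, 3.0
g_val, dg_dh = h_val ** 2, 2 * h_val
y_val, dy_dg = math.sin(g_val), math.cos(g_val)

# 3. propagated sensitivity: the chain rule as a product of local derivatives
dy_dx = dy_dg * dg_dh * dh_dx
print(dy_dx, 18 * x * math.cos(9 * x ** 2))  # both: d/dx sin((3x)^2)
```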

Computational Graph Representation

A program can be represented as a directed acyclic graph (DAG).

Nodes represent operations or variables.

Edges represent data dependencies.

For:

f(x, y) = xy + \sin(x)

the graph contains:

  • input nodes x, y
  • multiplication node xy
  • sine node \sin(x)
  • addition node

Derivative information flows along graph edges.

The graph separates:

  • primal computation
  • derivative propagation

This separation allows AD systems to transform programs systematically.
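
A minimal sketch of such a graph, using a hypothetical Node class, might look like this:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """One vertex of the DAG: an operation name plus edges to its inputs."""
    op: str
    inputs: list = field(default_factory=list)
    value: float = 0.0

# Graph for f(x, y) = x*y + sin(x); note that x feeds two nodes.
x = Node("input", value=2.0)
y = Node("input", value=5.0)
prod = Node("mul", [x, y])
sine = Node("sin", [x])
out = Node("add", [prod, sine])

OPS = {"mul": lambda a, b: a * b,
       "add": lambda a, b: a + b,
       "sin": math.sin}

def evaluate(node):
    """Primal pass: walk the edges (data dependencies) and fill in values."""
    if node.op != "input":
        node.value = OPS[node.op](*(evaluate(n) for n in node.inputs))
    return node.value

print(evaluate(out))   # 2*5 + sin(2)
```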

Local Jacobians

Each operation defines a small Jacobian relating outputs to inputs.

For scalar operations, these are ordinary derivatives.

For vector operations, they become matrices.

Example:

z = xy

Local derivatives:

\frac{\partial z}{\partial x} = y, \qquad \frac{\partial z}{\partial y} = x

The local Jacobian is:

J = \begin{bmatrix} y & x \end{bmatrix}

AD systems never need to materialize the full global Jacobian explicitly. Instead, they propagate products involving local Jacobians.

This distinction is critical for scalability.
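
In code, this means each primitive only needs to expose a Jacobian-vector product (forward) and a vector-Jacobian product (reverse). The helpers below are an illustrative sketch for the single operation z = xy:

```python
def jvp_mul(x, y, x_dot, y_dot):
    """Forward: J @ [x_dot, y_dot] with J = [y, x], never formed explicitly."""
    return y * x_dot + x * y_dot

def vjp_mul(x, y, z_bar):
    """Reverse: z_bar @ J, returning contributions to (x_bar, y_bar)."""
    return (y * z_bar, x * z_bar)

print(jvp_mul(2.0, 5.0, 1.0, 0.0))  # 5.0 = dz/dx
print(vjp_mul(2.0, 5.0, 1.0))       # (5.0, 2.0) = (dz/dx, dz/dy)
```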

Forward Propagation of Derivatives

In forward mode, derivatives move in the same direction as computation.

Suppose:

v_1 = x^2

Then:

\dot v_1 = 2x \, \dot x

The dot notation denotes a tangent value.

Every variable carries:

  • its primal value
  • its derivative value

Evaluation proceeds step-by-step.

Example:

f(x) = \sin(x^2)

with x = 3.

Primal computation:

Variable        | Value
----------------|---------
x               | 3
v_1 = x^2       | 9
v_2 = \sin(v_1) | \sin(9)

Derivative propagation:

Variable | Derivative
---------|---------------------------------
\dot x   | 1
\dot v_1 | 2x \dot x = 6
\dot v_2 | \cos(v_1) \dot v_1 = 6\cos(9)

Final derivative:

f'(3) = 6\cos(9)

No symbolic algebra is required.
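
Forward mode is commonly implemented with dual numbers: each value carries its primal and its tangent, and operator overloading applies the local rules. Below is a minimal sketch supporting only the operations this example needs:

```python
import math

class Dual:
    """A primal value and its tangent, propagated together."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __mul__(self, other):
        if not isinstance(other, Dual):
            other = Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(u):
    # local rule: d sin(u) = cos(u) du
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

x = Dual(3.0, 1.0)      # seed: dx/dx = 1
f = sin(x * x)          # f(x) = sin(x^2)
print(f.dot, 6 * math.cos(9.0))   # both: f'(3) = 6 cos(9)
```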

Reverse Propagation of Derivatives

Reverse mode propagates sensitivities backward.

Instead of computing:

\frac{\partial v_i}{\partial x}

it computes:

\frac{\partial y}{\partial v_i}

These quantities are called adjoints.

Suppose:

y = \sin(x^2)

Forward pass:

  • compute intermediate values
  • store them

Reverse pass:

  • propagate gradients backward

Intermediate variables:

v_1 = x^2, \qquad v_2 = \sin(v_1)

Backward propagation:

\bar v_2 = 1, \qquad \bar v_1 = \bar v_2 \cos(v_1), \qquad \bar x = \bar v_1 \cdot 2x

Final result:

\frac{dy}{dx} = 2x \cos(x^2)

Reverse mode is especially efficient when:

  • outputs are few
  • inputs are many

This property explains its dominance in deep learning.
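
A minimal tape-style sketch of the two-pass scheme (the Var class and its helpers are illustrative, not a production design):

```python
import math

class Var:
    """A primal value, an adjoint, and links to parents with local gradients."""
    def __init__(self, val, parents=()):
        self.val, self.bar, self.parents = val, 0.0, parents

    def backward(self):
        # Seed dy/dy = 1, then push adjoints back along recorded edges.
        # This simple stack walk is only correct for chain-shaped graphs
        # like this example; general DAGs need reverse topological order.
        self.bar = 1.0
        stack = [self]
        while stack:
            node = stack.pop()
            for parent, local_grad in node.parents:
                parent.bar += local_grad * node.bar
                stack.append(parent)

def square(u):
    return Var(u.val ** 2, parents=[(u, 2 * u.val)])

def sin(u):
    return Var(math.sin(u.val), parents=[(u, math.cos(u.val))])

x = Var(3.0)
y = sin(square(x))   # forward pass: compute and store intermediates
y.backward()         # reverse pass: propagate adjoints
print(x.bar, 2 * 3.0 * math.cos(9.0))   # both: 2x cos(x^2) at x = 3
```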

Propagation Rules for Elementary Operations

Each primitive operation defines propagation equations.

Addition:

z = x + y

Forward:

\dot z = \dot x + \dot y

Reverse:

\bar x += \bar z, \qquad \bar y += \bar z

Multiplication:

z = xy

Forward:

\dot z = y \dot x + x \dot y

Reverse:

\bar x += y \bar z, \qquad \bar y += x \bar z

Sine:

z = \sin(x)

Forward:

\dot z = \cos(x) \, \dot x

Reverse:

\bar x += \cos(x) \, \bar z

These rules form the operational core of AD engines.
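
An engine can collect these propagation equations into a rule table keyed by primitive. The registry below is a hypothetical sketch, not any particular library's API:

```python
import math

RULES = {
    # op: (forward tangent rule, reverse adjoint rule)
    "add": (lambda x, y, x_dot, y_dot: x_dot + y_dot,
            lambda x, y, z_bar: (z_bar, z_bar)),
    "mul": (lambda x, y, x_dot, y_dot: y * x_dot + x * y_dot,
            lambda x, y, z_bar: (y * z_bar, x * z_bar)),
    "sin": (lambda x, x_dot: math.cos(x) * x_dot,
            lambda x, z_bar: (math.cos(x) * z_bar,)),
}

forward, reverse = RULES["mul"]
print(forward(2.0, 5.0, 1.0, 0.0))  # tangent of z = xy along x: 5.0
print(reverse(2.0, 5.0, 1.0))       # adjoint contributions: (5.0, 2.0)
```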

Locality and Modularity

A major advantage of AD is locality.

Each operation only needs:

  • local inputs
  • local outputs
  • local derivative rules

An operation never needs knowledge of the full program.

This locality enables:

  • composability
  • compiler transformations
  • efficient runtime systems
  • extensible operator libraries

New differentiable operations can be added independently.
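
Concretely, adding a new primitive to such a system requires only its local rules. The register helper below is hypothetical, but it illustrates that exp can be added with no knowledge of any surrounding program:

```python
import math

RULES = {}   # a fresh, minimal registry for this sketch

def register(name, forward, reverse):
    RULES[name] = (forward, reverse)

# exp is defined entirely by local information: value, tangent, adjoint.
register("exp",
         forward=lambda x, x_dot: math.exp(x) * x_dot,
         reverse=lambda x, z_bar: (math.exp(x) * z_bar,))

forward, reverse = RULES["exp"]
print(forward(1.0, 1.0))   # e^1 ≈ 2.718
```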

Exactness and Floating Point Semantics

Automatic differentiation computes derivatives accurate up to machine precision.

This differs fundamentally from finite differences.

Finite difference approximation:

f'(x) \approx \frac{f(x+h) - f(x)}{h}

introduces:

  • truncation error
  • cancellation error
  • sensitivity to step size

AD avoids these issues because it differentiates the computational procedure itself.

The derivative computation is exact relative to the executed floating point operations.
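
The contrast is easy to demonstrate numerically. In this small illustrative experiment, the finite-difference error first shrinks as h decreases (truncation error falls) and then grows again (cancellation dominates):

```python
import math

def f(x):
    return math.sin(x ** 2)

def f_prime(x):
    # what AD produces: exact up to floating-point rounding
    return 2 * x * math.cos(x ** 2)

x = 3.0
for h in (1e-2, 1e-6, 1e-10, 1e-14):
    fd = (f(x + h) - f(x)) / h
    print(f"h = {h:.0e}   |error| = {abs(fd - f_prime(x)):.2e}")
```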

Complexity Perspective

Suppose a program requires C operations.

Forward mode derivative computation costs approximately:

O(C)

for one directional derivative.

Reverse mode computes the full gradient at a cost bounded by a small constant multiple of the original computation, independent of the number of inputs; the constant is commonly cited as roughly 3 to 5.

This efficiency is one of the most important results in automatic differentiation theory.

For scalar-output functions:

f: \mathbb{R}^n \to \mathbb{R}

reverse mode computes the full gradient in roughly the same asymptotic complexity as evaluating the function itself.
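
As back-of-the-envelope arithmetic (all numbers below are illustrative assumptions, not measurements):

```python
n = 1_000_000    # inputs of f: R^n -> R
C = 5 * n        # assumed primal cost, in elementary operations

forward_cost = n * C   # forward mode: one pass per input direction
reverse_cost = 4 * C   # reverse mode: one backward pass; ~4x is a
                       # commonly cited bound on the overhead constant

print(f"forward-mode full gradient: {forward_cost:.1e} ops")
print(f"reverse-mode full gradient: {reverse_cost:.1e} ops")
```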

Local Propagation as the Core Abstraction

Everything in automatic differentiation reduces to local propagation.

Regardless of:

  • programming language
  • compiler strategy
  • tensor system
  • graph representation
  • runtime architecture

the core mechanism remains unchanged:

  1. decompose computation into primitive operations
  2. associate local derivative rules
  3. propagate derivative information through the graph

Forward mode propagates tangents.

Reverse mode propagates adjoints.

Higher-order systems propagate richer algebraic structures.

But all of them inherit the same foundational principle: local derivative propagation through compositional programs.