
Chapter 4. Core Theory of Automatic Differentiation

Automatic differentiation is built on a simple observation: a complicated derivative can be computed by composing many small local derivatives. Instead of manipulating a full symbolic expression, or approximating derivatives numerically, AD propagates derivative information through each elementary operation in a program.

This section develops the central mechanism underlying all forms of automatic differentiation: local derivative propagation.

Programs as Compositions

Consider a function:

f(x) = \sin(x^2 + 3x)

This function is not evaluated as a single indivisible object. A program computes it as a sequence of intermediate operations:

Step | Expression | Variable
-----|------------|---------
1    | x^2        | v_1
2    | 3x         | v_2
3    | v_1 + v_2  | v_3
4    | \sin(v_3)  | v_4

The final result is:

f(x) = v_4

Each operation has a local derivative rule:

Operation | Local Derivative (per input)
----------|-----------------------------
a + b     | 1, 1
a \cdot b | b, a
\sin(a)   | \cos(a)
a^2       | 2a

Automatic differentiation applies these rules incrementally across the computation graph.
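
This decomposition can be written out directly in code. The sketch below (plain Python, with names matching the table) is only an illustrative trace, not an AD implementation; it records each local derivative rule as a comment.

```python
import math

def trace_f(x):
    """Evaluate f(x) = sin(x^2 + 3x) as an explicit trace of
    intermediate variables, one elementary operation per line."""
    v1 = x * x         # local rule: dv1/dx  = 2x
    v2 = 3 * x         # local rule: dv2/dx  = 3
    v3 = v1 + v2       # local rules: dv3/dv1 = 1, dv3/dv2 = 1
    v4 = math.sin(v3)  # local rule: dv4/dv3 = cos(v3)
    return v4

print(trace_f(2.0))    # sin(4 + 6) = sin(10)
```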

The Chain Rule as Propagation

The chain rule connects local derivatives into global derivatives.

For nested functions:

y = f(g(h(x)))

the derivative becomes:

\frac{dy}{dx} = \frac{dy}{dg} \frac{dg}{dh} \frac{dh}{dx}

Automatic differentiation operationalizes this rule mechanically.

Instead of reasoning symbolically about the full expression, the system computes:

  1. local values
  2. local derivatives
  3. propagated sensitivities

This decomposition is the key idea.
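
A small worked example makes the three steps concrete. The inner functions below (h(x) = 3x, g(u) = u^2, f(v) = \sin(v)) are illustrative choices, not taken from the text:

```python
import math

x = 2.0

# 1. local values and 2. local derivatives, one layer at a time
h_val, dh_dx = 3 * x, 3.0
g_val, dg_dh = h_val ** 2, 2 * h_val
y_val, dy_dg = math.sin(g_val), math.cos(g_val)

# 3. propagated sensitivity: the chain rule as a product of local derivatives
dy_dx = dy_dg * dg_dh * dh_dx
print(dy_dx, 18 * x * math.cos(9 * x ** 2))  # both: d/dx sin((3x)^2)
```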

Computational Graph Representation

A program can be represented as a directed acyclic graph (DAG).

Nodes represent operations or variables.

Edges represent data dependencies.

For:

f(x, y) = xy + \sin(x)

the graph contains:

  • input nodes x, y
  • multiplication node xy
  • sine node \sin(x)
  • addition node

Derivative information flows along graph edges.

The graph separates:

  • primal computation
  • derivative propagation

This separation allows AD systems to transform programs systematically.
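
A minimal sketch of such a graph, using a hypothetical Node class, might look like this:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """One vertex of the DAG: an operation name plus edges to its inputs."""
    op: str
    inputs: list = field(default_factory=list)
    value: float = 0.0

# Graph for f(x, y) = x*y + sin(x); note that x feeds two nodes.
x = Node("input", value=2.0)
y = Node("input", value=5.0)
prod = Node("mul", [x, y])
sine = Node("sin", [x])
out = Node("add", [prod, sine])

OPS = {"mul": lambda a, b: a * b,
       "add": lambda a, b: a + b,
       "sin": math.sin}

def evaluate(node):
    """Primal pass: walk the edges (data dependencies) and fill in values."""
    if node.op != "input":
        node.value = OPS[node.op](*(evaluate(n) for n in node.inputs))
    return node.value

print(evaluate(out))   # 2*5 + sin(2)
```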

Local Jacobians

Each operation defines a small Jacobian relating outputs to inputs.

For scalar operations, these are ordinary derivatives.

For vector operations, they become matrices.

Example:

z = xy

Local derivatives:

\frac{\partial z}{\partial x} = y, \qquad \frac{\partial z}{\partial y} = x

The local Jacobian is:

J = \begin{bmatrix} y & x \end{bmatrix}

AD systems never need to materialize the full global Jacobian explicitly. Instead, they propagate products involving local Jacobians.

This distinction is critical for scalability.
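
In code, this means each primitive only needs to expose a Jacobian-vector product (forward) and a vector-Jacobian product (reverse). The helpers below are an illustrative sketch for the single operation z = xy:

```python
def jvp_mul(x, y, x_dot, y_dot):
    """Forward: J @ [x_dot, y_dot] with J = [y, x], never formed explicitly."""
    return y * x_dot + x * y_dot

def vjp_mul(x, y, z_bar):
    """Reverse: z_bar @ J, returning contributions to (x_bar, y_bar)."""
    return (y * z_bar, x * z_bar)

print(jvp_mul(2.0, 5.0, 1.0, 0.0))  # 5.0 = dz/dx
print(vjp_mul(2.0, 5.0, 1.0))       # (5.0, 2.0) = (dz/dx, dz/dy)
```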

Forward Propagation of Derivatives

In forward mode, derivatives move in the same direction as computation.

Suppose:

v_1 = x^2

Then:

\dot v_1 = 2x \, \dot x

The dot notation denotes a tangent value.

Every variable carries:

  • its primal value
  • its derivative value

Evaluation proceeds step-by-step.

Example:

f(x) = \sin(x^2)

with x = 3.

Primal computation:

Variable        | Value
----------------|---------
x               | 3
v_1 = x^2       | 9
v_2 = \sin(v_1) | \sin(9)

Derivative propagation:

Variable | Derivative
---------|---------------------------------
\dot x   | 1
\dot v_1 | 2x \dot x = 6
\dot v_2 | \cos(v_1) \dot v_1 = 6\cos(9)

Final derivative:

f'(3) = 6\cos(9)

No symbolic algebra is required.
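
Forward mode is commonly implemented with dual numbers: each value carries its primal and its tangent, and operator overloading applies the local rules. Below is a minimal sketch supporting only the operations this example needs:

```python
import math

class Dual:
    """A primal value and its tangent, propagated together."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __mul__(self, other):
        if not isinstance(other, Dual):
            other = Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(u):
    # local rule: d sin(u) = cos(u) du
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

x = Dual(3.0, 1.0)      # seed: dx/dx = 1
f = sin(x * x)          # f(x) = sin(x^2)
print(f.dot, 6 * math.cos(9.0))   # both: f'(3) = 6 cos(9)
```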

Reverse Propagation of Derivatives

Reverse mode propagates sensitivities backward.

Instead of computing:

\frac{\partial v_i}{\partial x}

it computes:

\frac{\partial y}{\partial v_i}

These quantities are called adjoints.

Suppose:

y = \sin(x^2)

Forward pass:

  • compute intermediate values
  • store them

Reverse pass:

  • propagate gradients backward

Intermediate variables:

v_1 = x^2, \qquad v_2 = \sin(v_1)

Backward propagation:

\bar v_2 = 1, \qquad \bar v_1 = \bar v_2 \cos(v_1), \qquad \bar x = \bar v_1 \cdot 2x

Final result:

\frac{dy}{dx} = 2x \cos(x^2)

Reverse mode is especially efficient when:

  • outputs are few
  • inputs are many

This property explains its dominance in deep learning.
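
A minimal tape-style sketch of the two-pass scheme (the Var class and its helpers are illustrative, not a production design):

```python
import math

class Var:
    """A primal value, an adjoint, and links to parents with local gradients."""
    def __init__(self, val, parents=()):
        self.val, self.bar, self.parents = val, 0.0, parents

    def backward(self):
        # Seed dy/dy = 1, then push adjoints back along recorded edges.
        # This simple stack walk is only correct for chain-shaped graphs
        # like this example; general DAGs need reverse topological order.
        self.bar = 1.0
        stack = [self]
        while stack:
            node = stack.pop()
            for parent, local_grad in node.parents:
                parent.bar += local_grad * node.bar
                stack.append(parent)

def square(u):
    return Var(u.val ** 2, parents=[(u, 2 * u.val)])

def sin(u):
    return Var(math.sin(u.val), parents=[(u, math.cos(u.val))])

x = Var(3.0)
y = sin(square(x))   # forward pass: compute and store intermediates
y.backward()         # reverse pass: propagate adjoints
print(x.bar, 2 * 3.0 * math.cos(9.0))   # both: 2x cos(x^2) at x = 3
```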

Propagation Rules for Elementary Operations

Each primitive operation defines propagation equations.

Addition:

z = x + y

Forward:

\dot z = \dot x + \dot y

Reverse:

\bar x += \bar z, \qquad \bar y += \bar z

Multiplication:

z = xy

Forward:

\dot z = y \dot x + x \dot y

Reverse:

\bar x += y \bar z, \qquad \bar y += x \bar z

Sine:

z = \sin(x)

Forward:

\dot z = \cos(x) \, \dot x

Reverse:

\bar x += \cos(x) \, \bar z

These rules form the operational core of AD engines.
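
An engine can collect these propagation equations into a rule table keyed by primitive. The registry below is a hypothetical sketch, not any particular library's API:

```python
import math

RULES = {
    # op: (forward tangent rule, reverse adjoint rule)
    "add": (lambda x, y, x_dot, y_dot: x_dot + y_dot,
            lambda x, y, z_bar: (z_bar, z_bar)),
    "mul": (lambda x, y, x_dot, y_dot: y * x_dot + x * y_dot,
            lambda x, y, z_bar: (y * z_bar, x * z_bar)),
    "sin": (lambda x, x_dot: math.cos(x) * x_dot,
            lambda x, z_bar: (math.cos(x) * z_bar,)),
}

forward, reverse = RULES["mul"]
print(forward(2.0, 5.0, 1.0, 0.0))  # tangent of z = xy along x: 5.0
print(reverse(2.0, 5.0, 1.0))       # adjoint contributions: (5.0, 2.0)
```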

Locality and Modularity

A major advantage of AD is locality.

Each operation only needs:

  • local inputs
  • local outputs
  • local derivative rules

An operation never needs knowledge of the full program.

This locality enables:

  • composability
  • compiler transformations
  • efficient runtime systems
  • extensible operator libraries

New differentiable operations can be added independently.
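
Concretely, adding a new primitive to such a system requires only its local rules. The register helper below is hypothetical, but it illustrates that exp can be added with no knowledge of any surrounding program:

```python
import math

RULES = {}   # a fresh, minimal registry for this sketch

def register(name, forward, reverse):
    RULES[name] = (forward, reverse)

# exp is defined entirely by local information: value, tangent, adjoint.
register("exp",
         forward=lambda x, x_dot: math.exp(x) * x_dot,
         reverse=lambda x, z_bar: (math.exp(x) * z_bar,))

forward, reverse = RULES["exp"]
print(forward(1.0, 1.0))   # e^1 ≈ 2.718
```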

Exactness and Floating Point Semantics

Automatic differentiation computes derivatives accurate up to machine precision.

This differs fundamentally from finite differences.

Finite difference approximation:

f'(x) \approx \frac{f(x+h) - f(x)}{h}

introduces:

  • truncation error
  • cancellation error
  • sensitivity to step size

AD avoids these issues because it differentiates the computational procedure itself.

The derivative computation is exact relative to the executed floating point operations.
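
The contrast is easy to demonstrate numerically. In this small illustrative experiment, the finite-difference error first shrinks as h decreases (truncation error falls) and then grows again (cancellation dominates):

```python
import math

def f(x):
    return math.sin(x ** 2)

def f_prime(x):
    # what AD produces: exact up to floating-point rounding
    return 2 * x * math.cos(x ** 2)

x = 3.0
for h in (1e-2, 1e-6, 1e-10, 1e-14):
    fd = (f(x + h) - f(x)) / h
    print(f"h = {h:.0e}   |error| = {abs(fd - f_prime(x)):.2e}")
```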

Complexity Perspective

Suppose a program requires C operations.

Forward mode derivative computation costs approximately:

O(C)

for one directional derivative.

Reverse mode computes the full gradient at a cost bounded by a small constant multiple of the original computation, independent of the number of inputs; the constant is commonly cited as roughly 3 to 5.

This efficiency is one of the most important results in automatic differentiation theory.

For scalar-output functions:

f: \mathbb{R}^n \to \mathbb{R}

reverse mode computes the full gradient in roughly the same asymptotic complexity as evaluating the function itself.
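
As back-of-the-envelope arithmetic (all numbers below are illustrative assumptions, not measurements):

```python
n = 1_000_000    # inputs of f: R^n -> R
C = 5 * n        # assumed primal cost, in elementary operations

forward_cost = n * C   # forward mode: one pass per input direction
reverse_cost = 4 * C   # reverse mode: one backward pass; ~4x is a
                       # commonly cited bound on the overhead constant

print(f"forward-mode full gradient: {forward_cost:.1e} ops")
print(f"reverse-mode full gradient: {reverse_cost:.1e} ops")
```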

Local Propagation as the Core Abstraction

Everything in automatic differentiation reduces to local propagation.

Regardless of:

  • programming language
  • compiler strategy
  • tensor system
  • graph representation
  • runtime architecture

the core mechanism remains unchanged:

  1. decompose computation into primitive operations
  2. associate local derivative rules
  3. propagate derivative information through the graph

Forward mode propagates tangents.

Reverse mode propagates adjoints.

Higher-order systems propagate richer algebraic structures.

But all of them inherit the same foundational principle: local derivative propagation through compositional programs.