The chain rule is the central theorem behind automatic differentiation. Every useful AD algorithm is a disciplined way of applying the chain rule to a program.
A program is built from smaller operations. Each operation maps input values to output values. When one operation feeds another, the corresponding mathematical functions are composed. AD differentiates the whole computation by differentiating each small mapping and composing their local derivative information.
Composition of Scalar Functions
For scalar functions $f : \mathbb{R} \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$, their composition is $h = g \circ f$, meaning $h(x) = g(f(x))$. The derivative is

$$h'(x) = g'(f(x))\, f'(x).$$
This simple rule already contains the main structure of AD. We evaluate the inner function $f$, use its result as the input to $g$, and multiply local derivatives in the same dependency order.
For example, let $f(x) = x^2$ and $g(u) = \sin(u)$. Define $h = g \circ f$, so $h(x) = \sin(x^2)$. Then, writing $u = f(x)$,

$$h'(x) = g'(u)\, f'(x) = \cos(u) \cdot 2x.$$

Substituting $u = x^2$,

$$h'(x) = 2x \cos(x^2).$$

AD keeps the intermediate value $u$ available because the local derivative of $g$ depends on it.
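A minimal sketch of this pattern in Python (the function name `h_and_deriv` is ours, purely illustrative):

```python
import math

def h_and_deriv(x):
    """Evaluate h(x) = sin(x^2) and h'(x) via the chain rule."""
    u = x * x            # inner function f(x) = x^2
    y = math.sin(u)      # outer function g(u) = sin(u)
    # Chain rule: h'(x) = g'(f(x)) * f'(x) = cos(u) * 2x.
    # Note the intermediate value u is reused when forming the derivative.
    dy_dx = math.cos(u) * (2 * x)
    return y, dy_dx

value, deriv = h_and_deriv(1.5)
```

The code mirrors the math: the primal value `u` is computed once and consulted again when the local derivative of the outer function is evaluated.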
Composition of Vector Functions
In automatic differentiation, scalar composition is only the simplest case. Most computations involve vector-valued mappings.
Let $f : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^m \to \mathbb{R}^k$. Then $h = g \circ f$ has type $h : \mathbb{R}^n \to \mathbb{R}^k$. The Jacobian chain rule is

$$J_h(x) = J_g(f(x))\, J_f(x).$$

The matrix shapes are important: $J_g(f(x))$ is $k \times m$, $J_f(x)$ is $m \times n$, and their product is $k \times n$. The product is valid because the intermediate dimension $m$ appears once as the output dimension of $J_f$ and once as the input dimension of $J_g$.
This is composition algebra: derivative objects compose according to the same wiring as the original computation, but with linear maps replacing nonlinear maps.
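The shape bookkeeping can be checked numerically. Below is a sketch in pure Python (nested lists as matrices; `matmul`, `jacobian`, `f`, and `g` are illustrative names, not library functions) verifying that the finite-difference Jacobian of $g \circ f$ matches the product of the two local Jacobians:

```python
def matmul(A, B):
    """Multiply a p x q matrix by a q x r matrix (nested lists)."""
    q, r = len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(q)) for j in range(r)]
            for i in range(len(A))]

def jacobian(fn, x, m, eps=1e-6):
    """m x len(x) Jacobian of fn at x by central differences."""
    n = len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        fp, fm = fn(xp), fn(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * eps)
    return J

def f(x):                      # f : R^2 -> R^3
    a, b = x
    return [a * b, a + b, a * a]

def g(y):                      # g : R^3 -> R^2
    p, q, r = y
    return [p + q + r, p * r]

x = [1.0, 2.0]
Jf = jacobian(f, x, 3)         # shape 3 x 2
Jg = jacobian(g, f(x), 2)      # shape 2 x 3
Jh = matmul(Jg, Jf)            # shape 2 x 2: Jacobian of g∘f at x
```

The intermediate dimension (3 here) is exactly the one that cancels in the matrix product, matching the wiring of the composition.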
Local Linear Maps
Each differentiable function has a local linear approximation. If $y = f(x)$, then a small perturbation $\Delta x$ produces

$$\Delta y \approx J_f(x)\, \Delta x.$$

If $z = g(y)$, then

$$\Delta z \approx J_g(y)\, \Delta y.$$

Substituting the first approximation into the second gives

$$\Delta z \approx J_g(f(x))\, J_f(x)\, \Delta x.$$

Therefore,

$$J_{g \circ f}(x) = J_g(f(x))\, J_f(x).$$
This interpretation is more useful for AD than the formula alone. AD propagates local linear maps through the same structure as values.
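A quick numerical sketch of the local-linear-map view, using $y = \sin(x)$ as an assumed example: the exact change under a small perturbation is well approximated by the derivative applied to that perturbation.

```python
import math

# Local linear approximation: for y = sin(x), a small perturbation dx
# changes y by approximately cos(x) * dx.
x, dx = 0.7, 1e-4
exact_change = math.sin(x + dx) - math.sin(x)
linear_change = math.cos(x) * dx        # local linear map applied to dx
error = abs(exact_change - linear_change)
```

The error is second order in `dx`, which is what makes propagating linear maps a faithful account of how perturbations flow through the program.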
Computational Graph View
Consider a computation

x -> u -> v -> y

with $u = f(x)$, $v = g(u)$, and $y = h(v)$. The full function is

$$y = h(g(f(x))).$$

The Jacobian is

$$J = J_h(v)\, J_g(u)\, J_f(x).$$

The order matters. Values flow forward: first $u = f(x)$, then $v = g(u)$, then $y = h(v)$. The Jacobian product is written in the reverse textual order because matrix multiplication applies right to left: $J_f(x)$ acts on the input perturbation first, and $J_h(v)$ acts last.
Forward mode evaluates this from the inside outward, propagating tangent perturbations along with values.
Reverse mode evaluates the transposed action backward, propagating adjoints from outputs to inputs.
Forward Accumulation
Forward mode applies the chain rule in the same direction as value evaluation.
For a composition $y = h(g(f(x)))$ and an input tangent $\dot{x}$, forward mode computes

$$\dot{u} = J_f(x)\, \dot{x}, \qquad \dot{v} = J_g(u)\, \dot{u}, \qquad \dot{y} = J_h(v)\, \dot{v}.$$

Thus,

$$\dot{y} = J_h(v)\, J_g(u)\, J_f(x)\, \dot{x}.$$

This is a Jacobian-vector product: $\dot{y} = J\, \dot{x}$.
Forward mode never has to build the full Jacobian. It only propagates the effect of one input direction.
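A forward-mode sketch for the scalar chain $y = \exp(\sin(x^2))$ (the function name `jvp` and the choice of primitives are ours): each step carries a (value, tangent) pair in evaluation order.

```python
import math

def jvp(x, x_dot):
    """Forward mode for y = exp(sin(x*x)): propagate (value, tangent)."""
    u, u_dot = x * x, 2 * x * x_dot               # u = f(x) = x^2
    v, v_dot = math.sin(u), math.cos(u) * u_dot   # v = g(u) = sin(u)
    y, y_dot = math.exp(v), math.exp(v) * v_dot   # y = h(v) = exp(v)
    return y, y_dot

y, dy_dx = jvp(0.5, 1.0)   # seeding x_dot = 1 yields dy/dx
```

Seeding the tangent with a unit vector extracts one directional derivative; no Jacobian matrix is ever materialized.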
Reverse Accumulation
Reverse mode applies the chain rule backward.
For the same computation, reverse mode starts with an output cotangent or adjoint $\bar{y}$. It propagates sensitivities backward:

$$\bar{v} = J_h(v)^T\, \bar{y}, \qquad \bar{u} = J_g(u)^T\, \bar{v}, \qquad \bar{x} = J_f(x)^T\, \bar{u}.$$

Combining these,

$$\bar{x} = J_f(x)^T\, J_g(u)^T\, J_h(v)^T\, \bar{y}.$$

Equivalently, $\bar{x} = J^T \bar{y}$. This is a vector-Jacobian product, written in column-vector convention.
Reverse mode is efficient when the output dimension is small. For a scalar loss, one reverse pass computes the gradient with respect to all inputs.
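The same chain $y = \exp(\sin(x^2))$ differentiated in reverse, as a sketch (the function name `vjp` is ours): a forward pass stores the intermediates, then adjoints flow from output to input.

```python
import math

def vjp(x):
    """Reverse mode for y = exp(sin(x*x))."""
    u = x * x              # forward pass (the "tape")
    v = math.sin(u)
    y = math.exp(v)
    y_bar = 1.0                    # seed the output adjoint
    v_bar = math.exp(v) * y_bar    # apply J_h^T
    u_bar = math.cos(u) * v_bar    # apply J_g^T
    x_bar = 2 * x * u_bar          # apply J_f^T
    return y, x_bar

y, grad = vjp(0.5)
```

Note that the backward pass reads `u` and `v`, which is why the forward intermediates must be kept (or recomputed).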
Branching Computations
Programs often have values that feed more than one later operation.
Example:

u = x * x
v = sin(u)
w = exp(u)
y = v + w

Here, $u$ is used twice. The output is

$$y = \sin(x^2) + \exp(x^2).$$

The derivative with respect to $x$ receives contributions from both branches:

$$\frac{dy}{dx} = \frac{\partial y}{\partial v}\frac{dv}{dx} + \frac{\partial y}{\partial w}\frac{dw}{dx}.$$

Since

$$\frac{dv}{du} = \cos(u), \qquad \frac{dw}{du} = \exp(u), \qquad \frac{du}{dx} = 2x,$$

we have

$$\frac{dy}{du} = \cos(u) + \exp(u).$$

Therefore,

$$\frac{dy}{dx} = \bigl(\cos(x^2) + \exp(x^2)\bigr) \cdot 2x.$$
Reverse mode implements this by accumulation. If a variable contributes to several downstream paths, its adjoint is the sum of contributions from all paths.
This is one reason reverse mode needs careful handling of storage and mutation. Intermediate variables may receive adjoint updates from multiple consumers.
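A reverse pass for the branching program above can be sketched directly (the function name `branching_grad` is ours); the key line is the `+=` accumulation into the adjoint of `u`:

```python
import math

def branching_grad(x):
    """Reverse mode for y = sin(x*x) + exp(x*x)."""
    u = x * x
    v = math.sin(u)
    w = math.exp(u)
    y = v + w
    y_bar = 1.0
    v_bar = y_bar                   # y = v + w passes adjoints through
    w_bar = y_bar
    u_bar = 0.0
    u_bar += math.cos(u) * v_bar    # contribution via v = sin(u)
    u_bar += math.exp(u) * w_bar    # contribution via w = exp(u)
    x_bar = 2 * x * u_bar
    return y, x_bar
```

Initializing `u_bar` to zero and summing one contribution per consumer is exactly the fan-out rule of the chain rule.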
Multiple Inputs
For a primitive operation with multiple inputs, say $z = f(a, b)$, the local linear rule is

$$\dot{z} = \frac{\partial f}{\partial a}\, \dot{a} + \frac{\partial f}{\partial b}\, \dot{b}.$$

For multiplication, $z = a \cdot b$, the rule is

$$\dot{z} = b\, \dot{a} + a\, \dot{b}.$$

Reverse mode uses the transposed rule. Given $\bar{z}$, the adjoints are

$$\bar{a} \mathrel{+}= b\, \bar{z}, \qquad \bar{b} \mathrel{+}= a\, \bar{z}.$$

The $\mathrel{+}=$ symbol matters. A variable may receive sensitivity from more than one downstream use.
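A sketch of the transposed rule for multiplication (the helper name `mul_vjp` is ours), including the case where the same variable appears as both inputs, so its adjoint accumulates twice:

```python
def mul_vjp(a, b, z_bar):
    """Reverse rule for z = a * b: adjoint contributions to a and b."""
    return b * z_bar, a * z_bar

# For y = x * x the same variable is both inputs, so its adjoint
# accumulates both contributions: x_bar = x*1 + x*1 = 2x.
x = 3.0
da, db = mul_vjp(x, x, 1.0)
x_bar = da + db
```

The summed result reproduces the familiar derivative of $x^2$.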
Multiple Outputs
For a primitive operation with multiple outputs, say $(y_1, y_2) = f(x)$, forward mode propagates tangents to each output:

$$\dot{y}_1 = J_1(x)\, \dot{x}, \qquad \dot{y}_2 = J_2(x)\, \dot{x}.$$

Reverse mode accumulates input sensitivity from each output:

$$\bar{x} = J_1(x)^T\, \bar{y}_1 + J_2(x)^T\, \bar{y}_2.$$
This is the same chain rule, expressed through a local input-output interface.
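As an assumed concrete example, take a two-output primitive $f(x) = (\sin x, \cos x)$; the reverse rule sums the input sensitivity over both outputs (the name `sincos_vjp` is ours):

```python
import math

def sincos_vjp(x, y1_bar, y2_bar):
    """Reverse rule for (y1, y2) = (sin x, cos x):
    x_bar = cos(x)*y1_bar + (-sin(x))*y2_bar."""
    return math.cos(x) * y1_bar - math.sin(x) * y2_bar

# Equivalent to d/dx of the scalar combination 2*sin(x) + 3*cos(x).
x_bar = sincos_vjp(1.0, 2.0, 3.0)
```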
Chain Rule as an Interface Contract
An AD system can be organized around a simple contract.
Each primitive operation must provide:
- A primal evaluation rule.
- A forward derivative rule, or JVP rule.
- A reverse derivative rule, or VJP rule.
For example, multiplication has primal rule $z = a \cdot b$, forward rule $\dot{z} = b\, \dot{a} + a\, \dot{b}$, and reverse rule $\bar{a} \mathrel{+}= b\, \bar{z}$, $\bar{b} \mathrel{+}= a\, \bar{z}$.
Once every primitive provides these local rules, a full AD system can differentiate large programs by composition.
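The contract can be sketched as a table of local rules; the layout and the `PRIMITIVES` name are illustrative assumptions, not any particular library's API:

```python
import math

# Each primitive registers a primal rule, a JVP rule, and a VJP rule.
PRIMITIVES = {
    "mul": {
        "primal": lambda a, b: a * b,
        "jvp":    lambda a, b, a_dot, b_dot: b * a_dot + a * b_dot,
        "vjp":    lambda a, b, z_bar: (b * z_bar, a * z_bar),
    },
    "sin": {
        "primal": lambda a: math.sin(a),
        "jvp":    lambda a, a_dot: math.cos(a) * a_dot,
        "vjp":    lambda a, z_bar: (math.cos(a) * z_bar,),
    },
}

# Differentiate y = sin(x*x) in forward mode by composing the rules.
x, x_dot = 0.8, 1.0
u = PRIMITIVES["mul"]["primal"](x, x)
u_dot = PRIMITIVES["mul"]["jvp"](x, x, x_dot, x_dot)
y = PRIMITIVES["sin"]["primal"](u)
y_dot = PRIMITIVES["sin"]["jvp"](u, u_dot)
```

A real system adds a tracing or graph-building layer on top, but the differentiation logic itself lives entirely in these local rules.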
Algebraic Interpretation
A computation can be seen as a graph of typed mappings. Each node is a primitive map. Each edge carries values. Differentiation replaces each primitive map with a rule for propagating linear information.
Forward mode composes maps of the form $(x, \dot{x}) \mapsto (f(x),\, J_f(x)\, \dot{x})$. Reverse mode composes maps of the form $\bar{y} \mapsto J_f(x)^T\, \bar{y}$.
The same program can therefore support different derivative computations depending on how we traverse the composition.
This is why the chain rule is not merely a calculus identity in AD. It is the algebra that makes derivative programs compositional.
Practical Consequences
The chain rule explains several design facts about AD systems.
First, AD needs intermediate primal values. The derivative of $\sin(u)$ needs $u$. The derivative of multiplication needs the values of $a$ and $b$. Reverse mode therefore often stores a tape of intermediate values or recomputes them later.
Second, derivative propagation follows data dependencies. If a value does not affect the output, it receives zero sensitivity. If it affects the output through several paths, its sensitivities add.
Third, AD scales because the chain rule is local. A large program can be differentiated by giving derivative rules for a small set of primitive operations.
The rest of automatic differentiation is mostly about how to execute this composition efficiently: which direction to accumulate, what to store, what to recompute, how to handle control flow, and how to represent derivative information without materializing enormous matrices.