This section studies reverse mode automatic differentiation through concrete examples. Each case has the same structure:
- evaluate the primal computation forward;
- seed the output adjoint;
- propagate adjoints backward;
- collect input or parameter gradients.
The examples move from scalar expressions to neural network layers and iterative programs.
Case Study 1: Scalar Expression
Consider
Decompose the computation:
Forward pass:
| Variable | Expression |
|---|---|
Reverse initialization:
Backward propagation:
Therefore,
Reverse mode recovered the derivative by local rules without symbolic simplification of the original expression.
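A minimal sketch in plain Python makes the four steps explicit. The expression y = x**2 * sin(x) below is an assumption chosen only for illustration; any scalar expression decomposes the same way.

```python
import math

# Illustrative assumption: y = x**2 * sin(x).
x = 2.0

# Forward pass: decompose into single-assignment intermediates (a Wengert list).
v1 = x * x           # v1 = x^2
v2 = math.sin(x)     # v2 = sin(x)
y = v1 * v2          # y  = x^2 * sin(x)

# Reverse pass: seed the output adjoint, then apply local rules backward.
y_bar = 1.0
v1_bar = y_bar * v2                                # dy/dv1 = v2
v2_bar = y_bar * v1                                # dy/dv2 = v1
x_bar = v1_bar * 2.0 * x + v2_bar * math.cos(x)    # both uses of x accumulate

# Check against the analytic derivative 2x*sin(x) + x^2*cos(x).
print(x_bar, 2 * x * math.sin(x) + x * x * math.cos(x))
```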
Case Study 2: Shared Intermediate
Consider
Use a shared intermediate:
The shared value contributes to the output through two paths, so reverse mode must accumulate both contributions.
Seed:
From addition:
From sine:
Thus,
From square:
Therefore,
This example shows why adjoints are added, not assigned.
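The accumulation rule is easy to see in code. The sketch below assumes a particular shared-intermediate expression, u = x**2 with y = sin(u) + u, chosen only to illustrate the two paths.

```python
import math

# Illustrative assumption: u = x**2 is shared, y = sin(u) + u.
x = 1.5

# Forward pass.
u = x * x
y = math.sin(u) + u

# Reverse pass: the adjoint of u ACCUMULATES from both of its uses.
y_bar = 1.0
u_bar = 0.0
u_bar += y_bar * 1.0            # path through the addition (the bare u term)
u_bar += y_bar * math.cos(u)    # path through sin(u)
x_bar = u_bar * 2.0 * x         # finally back through the square

# Check against the analytic derivative 2x * (1 + cos(x^2)).
print(x_bar, 2 * x * (1 + math.cos(x * x)))
```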
Case Study 3: Two Inputs, One Output
Let
Wengert list:
Seed:
From addition:
From sine:
From multiplication:
Final gradient:
One reverse pass computes both partial derivatives.
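A hedged sketch of one such reverse pass, assuming the function is y = x1*x2 + sin(x1), the same expression driven through the small engine in Case Study 10:

```python
import math

# Illustrative assumption: y = x1*x2 + sin(x1).
x1, x2 = 2.0, 3.0

# Forward pass (Wengert list).
v1 = x1 * x2
v2 = math.sin(x1)
y = v1 + v2

# Reverse pass: one sweep yields both partial derivatives.
y_bar = 1.0
v1_bar = y_bar                 # from the addition
v2_bar = y_bar                 # from the addition
x2_bar = v1_bar * x1           # from the multiplication
x1_bar = v1_bar * x2 + v2_bar * math.cos(x1)   # multiplication and sine accumulate

print(x1_bar, x2_bar)          # x2 + cos(x1) and x1
```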
Case Study 4: Vector-Jacobian Product
Define
Let the output seed be
Reverse mode computes
Write outputs as:
Seed output adjoints:
From :
From :
Final result:
Reverse mode computes this product without constructing the full Jacobian as a stored matrix.
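The sketch below illustrates this with an assumed map f(x) = A @ tanh(x): the seed is pulled back through each factor in turn, and the explicit Jacobian is built only to check the result.

```python
import numpy as np

# Illustrative assumption: f(x) = A @ tanh(x) with a fixed matrix A.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
x = rng.normal(size=4)
v = np.array([1.0, -2.0, 0.5])      # output seed (the "vector" in the VJP)

# Forward pass.
t = np.tanh(x)
y = A @ t

# Reverse pass: v^T J, computed factor by factor, never forming J.
t_bar = A.T @ v                     # backward through the matrix multiply
x_bar = t_bar * (1.0 - t ** 2)      # backward through tanh (elementwise)

# Check against the explicit Jacobian J = A @ diag(1 - tanh(x)^2).
J = A * (1.0 - t ** 2)              # broadcasting builds A @ diag(...)
print(np.allclose(x_bar, v @ J))
```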
Case Study 5: Linear Layer
Consider a batch linear layer:
Shapes:
| Symbol | Shape |
|---|---|
Let be the adjoint arriving from later layers.
Reverse rules:
The backward pass has the same numerical character as the forward pass: dense matrix multiplication plus a reduction.
This is why deep learning systems implement reverse rules as tensor kernels, not scalar derivative loops.
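A minimal numpy sketch, assuming the layer computes Y = X @ W + b and calling the incoming adjoint dY:

```python
import numpy as np

# Assumed shapes: X (N, d_in), W (d_in, d_out), b (d_out,), Y (N, d_out).
rng = np.random.default_rng(0)
N, d_in, d_out = 8, 5, 3
X = rng.normal(size=(N, d_in))
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)

Y = X @ W + b                       # forward pass

dY = rng.normal(size=(N, d_out))    # adjoint arriving from later layers

# Reverse rules: two dense matrix multiplies and one reduction over the batch.
dX = dY @ W.T
dW = X.T @ dY
db = dY.sum(axis=0)
```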
Case Study 6: ReLU Layer
Let
where
The derivative is
At an input of exactly zero the derivative is not defined, so systems choose a convention, commonly zero.
Given output adjoint , reverse propagation gives
A ReLU backward pass therefore needs one of the following:
- the original input ;
- the output ;
- a stored Boolean mask.
The mask often costs less memory than storing the full input, depending on representation.
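A short numpy sketch of the mask-based variant, using the convention that the derivative at zero is zero:

```python
import numpy as np

x = np.array([-1.5, 0.0, 0.7, 2.0])

mask = x > 0                    # Boolean mask saved during the forward pass
y = np.where(mask, x, 0.0)      # forward: max(x, 0)

dy = np.ones_like(x)            # adjoint from later layers
dx = dy * mask                  # backward: pass gradient only where x > 0
```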
Case Study 7: Softmax Cross-Entropy
Softmax followed by cross-entropy is common in classification.
Given logits , define
For one-hot target , the loss is
A direct graph would include exponentials, sums, division, logarithms, and multiplication. Reverse mode can differentiate this graph mechanically.
However, most systems use a fused stable backward rule:
For a batch with mean reduction:
This custom backward rule is both simpler and more numerically stable than differentiating each primitive separately.
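A numpy sketch of the fused rule, assuming one-hot targets and mean reduction over a batch of size N, using the standard result that the gradient with respect to the logits is (p - t) / N:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 4, 3
z = rng.normal(size=(N, C))             # logits
t = np.eye(C)[rng.integers(0, C, N)]    # one-hot targets

# Forward: stable softmax (shift by the row max) and mean cross-entropy.
z_shift = z - z.max(axis=1, keepdims=True)
log_p = z_shift - np.log(np.exp(z_shift).sum(axis=1, keepdims=True))
p = np.exp(log_p)
loss = -(t * log_p).sum(axis=1).mean()

# Fused backward rule: gradient with respect to the logits.
dz = (p - t) / N
```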
Case Study 8: Simple Recurrent Computation
Consider a recurrence:
Forward pass stores or checkpoints hidden states:
Seed:
Backward through time:
Reverse rules:
The same parameter is used at every time step, so its adjoint accumulates over all steps:
This is backpropagation through time.
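A hedged sketch of backpropagation through time, assuming the scalar recurrence h_t = tanh(w * h_{t-1} + x_t) and a loss equal to the final hidden state; the adjoint of the shared parameter w accumulates over every step:

```python
import math

# Illustrative assumption: h_t = tanh(w * h_{t-1} + x_t), loss = h_T.
w = 0.8
xs = [0.5, -0.3, 0.2, 0.1]

# Forward pass, storing every hidden state for the backward sweep.
hs = [0.0]
for x in xs:
    hs.append(math.tanh(w * hs[-1] + x))

# Backward through time: the single parameter w accumulates over all steps.
h_bar = 1.0                    # seed: d loss / d h_T
w_bar = 0.0
for t in range(len(xs), 0, -1):
    pre_bar = h_bar * (1.0 - hs[t] ** 2)   # through tanh
    w_bar += pre_bar * hs[t - 1]           # accumulate the shared parameter
    h_bar = pre_bar * w                    # pass adjoint to the previous state

print(w_bar)
```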
Case Study 9: Checkpointed Chain
Consider a chain of eight layers:
Full reverse mode stores all activations:
A checkpointed scheme might store:
During backward propagation through the segment , the system recomputes:
from checkpoint , then runs the local backward pass.
Then it repeats for .
This reduces peak activation memory but adds extra forward computation.
The derivative remains mathematically the same if recomputation faithfully reproduces the original forward values.
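A small sketch of the scheme, with illustrative layer functions and checkpoint spacing that are assumptions rather than taken from the text:

```python
import math

def f(x):                  # one scalar layer
    return math.tanh(x)

def df(x):                 # its local derivative
    return 1.0 - math.tanh(x) ** 2

layers = 8
segment = 4                # store a checkpoint every `segment` layers
x0 = 0.3

# Forward pass: keep only the checkpoints, discard the other activations.
checkpoints = {0: x0}
x = x0
for i in range(layers):
    x = f(x)
    if (i + 1) % segment == 0:
        checkpoints[i + 1] = x

# Backward pass: for each segment, recompute its activations from the
# checkpoint, then run the local backward sweep through that segment.
g = 1.0                    # seed adjoint of the final output
for start in range(layers - segment, -1, -segment):
    acts = [checkpoints[start]]
    for _ in range(segment):
        acts.append(f(acts[-1]))
    for a in reversed(acts[:-1]):
        g *= df(a)

print(g)                   # the product of all local derivatives, as in full reverse mode
```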
Case Study 10: Tiny Reverse Mode Engine
A minimal scalar reverse mode system stores values, adjoints, and backward closures.
```python
import math

class Var:
    def __init__(self, value):
        self.value = float(value)
        self.grad = 0.0
        self.parents = []   # list of (parent Var, pullback closure)

    def backward(self):
        # Topologically order the graph so each node is processed
        # after everything that depends on it.
        topo = []
        seen = set()

        def visit(v):
            if id(v) in seen:
                return
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            topo.append(v)

        visit(self)
        self.grad = 1.0          # seed the output adjoint
        for v in reversed(topo):
            for parent, pullback in v.parents:
                parent.grad += pullback(v.grad)   # accumulate, never assign

def add(a, b):
    out = Var(a.value + b.value)
    out.parents = [
        (a, lambda g: g),
        (b, lambda g: g),
    ]
    return out

def mul(a, b):
    out = Var(a.value * b.value)
    out.parents = [
        (a, lambda g: g * b.value),
        (b, lambda g: g * a.value),
    ]
    return out

def sin(a):
    out = Var(math.sin(a.value))
    out.parents = [
        (a, lambda g: g * math.cos(a.value)),
    ]
    return out
```

Using it:
```python
x1 = Var(2.0)
x2 = Var(3.0)
y = add(mul(x1, x2), sin(x1))
y.backward()
print(y.value)     # 2*3 + sin(2) ≈ 6.909
print(x1.grad)     # 3 + cos(2)   ≈ 2.584
print(x2.grad)     # 2.0
```

The gradients are x1.grad = x2 + cos(x1) ≈ 2.584 and x2.grad = x1 = 2.0.
This small engine illustrates the whole mechanism:
- the forward pass builds a graph;
- each operation stores a pullback;
- backward traversal accumulates gradients.
Chapter Summary
Reverse mode automatic differentiation computes gradients by recording a primal computation and replaying its local derivative rules backward.
The main ideas from this chapter are:
| Concept | Role |
|---|---|
| adjoint | sensitivity of output to an intermediate value |
| reverse graph | dependency graph traversed backward |
| vector-Jacobian product | core operation of reverse mode |
| reverse accumulation | additive propagation of adjoints |
| tape | recorded execution trace |
| Wengert list | single-assignment linearized computation |
| checkpointing | memory reduction by recomputation |
| backpropagation | reverse mode applied to neural networks |
Reverse mode is efficient when many inputs influence a small number of outputs. This is the standard setting in optimization and deep learning: many parameters, one scalar objective.