Reverse mode automatic differentiation operates on a computational graph. The forward pass evaluates the graph from inputs to outputs. The reverse pass traverses the same graph in the opposite direction and propagates adjoints backward through the dependencies.
A computational graph converts a program into a directed acyclic structure of values and operations. Each node represents either:
- a variable or intermediate value;
- an elementary operation.
Edges represent data dependencies.
For example, consider

y = x1 * x2 + sin(x1)

The forward computation can be decomposed into intermediate values:

v1 = x1 * x2
v2 = sin(x1)
y  = v1 + v2

The graph structure is:
x1 ----\
        (*) ---- v1 ----\
x2 ----/                 \
                          (+) ---- y
                         /
x1 ---- sin ---- v2 ----/

The graph explicitly records how every value depends on earlier values. Reverse mode uses this structure to apply the chain rule systematically.
Forward Graph Construction
A forward evaluation produces two outputs:
- primal values;
- graph structure.
Each primitive operation appends a node to the graph.
For example:
| Node | Operation | Inputs | Output |
|---|---|---|---|
| 1 | multiply | x1, x2 | v1 |
| 2 | sine | x1 | v2 |
| 3 | add | v1, v2 | y |
This sequence is often called a Wengert list or tape. The order is important because every operation only depends on earlier values.
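A tape can be recorded directly while evaluating the running example. The sketch below is illustrative: the entry layout `(op, input_names, output_name, output_value)` and the function name `trace` are assumptions for this example, not the format of any particular AD system.

```python
import math

def trace(x1, x2):
    """Evaluate y = x1 * x2 + sin(x1) while recording a Wengert list.

    Each tape entry is (op, input_names, output_name, output_value);
    entries are appended in execution order, so the list is already
    topologically sorted.
    """
    tape = []
    v1 = x1 * x2
    tape.append(("multiply", ("x1", "x2"), "v1", v1))
    v2 = math.sin(x1)
    tape.append(("sine", ("x1",), "v2", v2))
    y = v1 + v2
    tape.append(("add", ("v1", "v2"), "y", y))
    return y, tape

y, tape = trace(2.0, 3.0)
# The tape lists operations in dependency order: multiply, sine, add.
```

Because each operation appends its node only after its inputs exist, no separate sorting step is needed before the reverse pass.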
The forward graph therefore defines a topological ordering of dependencies.
Reverse Traversal
The reverse pass traverses the graph backward.
The output node begins with seed adjoint ȳ = 1.

Then the reverse pass visits nodes in reverse topological order: node 3, then node 2, then node 1.
At each node, the local derivative rule distributes the current adjoint into parent nodes.
For the addition node y = v1 + v2, the reverse rule is

v̄1 += ȳ
v̄2 += ȳ

For the sine node v2 = sin(x1), the reverse rule is

x̄1 += v̄2 · cos(x1)

For the multiplication node v1 = x1 · x2, the reverse rule is

x̄1 += v̄1 · x2
x̄2 += v̄1 · x1
The reverse graph traversal accumulates all contributions into the final adjoints.
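The full forward-then-reverse sweep for the running example can be written out by hand. This is a minimal sketch, not a library implementation; the function name `grad` and the `_bar` naming for adjoints are assumptions.

```python
import math

def grad(x1, x2):
    """Gradient of y = x1 * x2 + sin(x1) via an explicit reverse sweep."""
    # Forward pass: compute and retain the primal intermediates.
    v1 = x1 * x2
    v2 = math.sin(x1)
    y = v1 + v2

    # Reverse pass: seed the output adjoint with 1.
    y_bar = 1.0
    v1_bar = v2_bar = x1_bar = x2_bar = 0.0

    # Node 3 (add, y = v1 + v2): pass the adjoint to both parents.
    v1_bar += y_bar
    v2_bar += y_bar
    # Node 2 (sine, v2 = sin(x1)): scale by the local derivative cos(x1).
    x1_bar += v2_bar * math.cos(x1)
    # Node 1 (multiply, v1 = x1 * x2): each parent gets the other input.
    x1_bar += v1_bar * x2
    x2_bar += v1_bar * x1
    return x1_bar, x2_bar

dx1, dx2 = grad(2.0, 3.0)
# Analytically: dy/dx1 = x2 + cos(x1), dy/dx2 = x1.
```

Note that x1_bar is incremented twice, once by the sine node and once by the multiply node, which is exactly the fan-out accumulation discussed next.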
Local Graph Semantics
Each node in the graph defines a local mapping:
The node receives an output adjoint
and propagates contributions into each parent:
The reverse graph therefore acts like a backward flow of sensitivities.
The forward pass computes values. The reverse pass computes influence.
Fan-Out and Adjoint Accumulation
Graphs frequently contain fan-out, where one value is used by several later operations.
Consider

y = x1 * x2 + sin(x1)

The value x1 influences the output through two separate paths: through the product v1 = x1 * x2 and through v2 = sin(x1).

The chain rule requires summing both effects:

x̄1 = v̄1 · x2 + v̄2 · cos(x1) = x2 + cos(x1)
This is why reverse mode uses additive accumulation.
Operationally:
adjoint[parent] += contribution

never

adjoint[parent] = contribution

unless the graph guarantees a single dependency path.
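The smallest case where this matters is y = x * x, where the multiply node sees the same value on both operands. A hypothetical sketch, using a plain dictionary as the adjoint store:

```python
def square_grad(x):
    # y = x * x: the multiply node sees x on both inputs (fan-out).
    adjoint = {"x": 0.0}
    y_bar = 1.0
    # The local multiply rule runs once per incoming edge:
    adjoint["x"] += y_bar * x   # contribution through the left operand
    adjoint["x"] += y_bar * x   # contribution through the right operand
    return adjoint["x"]

# square_grad(3.0) returns 6.0, the correct d(x^2)/dx at x = 3.
# Plain assignment (=) would drop the first path and report 3.0.
```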
Reverse Graphs as Transposed Dependency Graphs
The forward graph expresses how values are constructed.
The reverse graph expresses how sensitivities propagate.
Suppose the forward dependency matrix (the local Jacobian) for a local operation is J.

Forward mode propagates tangents by multiplication with J:

ẏ = J ẋ

Reverse mode propagates adjoints using the transpose:

x̄ = Jᵀ ȳ

The reverse graph therefore corresponds to the transpose of the local linearized graph.
This interpretation explains why reverse mode traverses edges backward.
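The transpose relationship can be checked numerically on a small concrete J. The sketch below uses plain nested lists (values chosen arbitrarily for illustration) and verifies the defining identity ⟨ȳ, J ẋ⟩ = ⟨Jᵀ ȳ, ẋ⟩:

```python
def matvec(J, v):
    # Forward mode: output tangent = J @ input tangent.
    return [sum(Jij * vj for Jij, vj in zip(row, v)) for row in J]

def transpose_matvec(J, w):
    # Reverse mode: input adjoint = J^T @ output adjoint.
    n = len(J[0])
    return [sum(J[i][j] * w[i] for i in range(len(J))) for j in range(n)]

J = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]          # local linearization: 3 outputs, 2 inputs
x_dot = [1.0, -1.0]       # input tangent
y_bar = [1.0, 0.0, 2.0]   # output adjoint

jvp = matvec(J, x_dot)            # forward propagation along edges
vjp = transpose_matvec(J, y_bar)  # reverse propagation against edges

# Consistency check: <y_bar, J x_dot> == <J^T y_bar, x_dot>
lhs = sum(a * b for a, b in zip(y_bar, jvp))
rhs = sum(a * b for a, b in zip(vjp, x_dot))
```

Both inner products agree, which is precisely why traversing each edge backward with the transposed local derivative yields correct adjoints.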
Static Graphs and Dynamic Graphs
Some systems construct the computational graph before execution. Others construct it during execution.
Static graph systems build the graph once and then execute it repeatedly.
Examples include early graph-based machine learning systems.
Advantages:
| Property | Benefit |
|---|---|
| Whole-graph optimization | Better fusion and scheduling |
| Static memory planning | Lower allocation overhead |
| Easier compilation | Better ahead-of-time optimization |
Disadvantages:
| Property | Limitation |
|---|---|
| Fixed structure | Harder support for dynamic control flow |
| Reduced flexibility | More complex debugging |
Dynamic graph systems construct the graph during execution itself.
Each operation immediately appends nodes to the active graph.
Advantages:
| Property | Benefit |
|---|---|
| Natural control flow | Standard language semantics |
| Easier debugging | Immediate execution |
| Flexible graphs | Input-dependent structure |
Disadvantages:
| Property | Limitation |
|---|---|
| Runtime overhead | Dynamic graph construction cost |
| Harder optimization | Less global visibility |
Modern systems often combine both approaches through tracing or staged compilation.
Reverse Graph Storage
The reverse pass requires information from the forward pass. Each node must retain enough information to compute local derivatives.
For example:
| Operation | Required Saved Values |
|---|---|
| y = a + b | none |
| y = sin(x) | x or cos(x) |
| y = exp(x) | x or exp(x) |
The storage structure may contain:
- operation type;
- pointers to inputs;
- output value;
- saved primal values;
- local metadata.
This storage is commonly called the tape.
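A tape node can be modeled as a small record. The field names below (`op`, `inputs`, `value`, `saved`) are illustrative, and parent pointers are represented here as indices into the tape:

```python
import math
from dataclasses import dataclass

@dataclass
class TapeNode:
    """One reverse-graph record; the layout is a sketch, not any
    particular system's format."""
    op: str          # operation type
    inputs: tuple    # pointers (here: tape indices) to parent nodes
    value: float     # output primal value
    saved: tuple = ()  # primals needed by the local derivative rule

# A tape for y = x1 * x2 + sin(x1) at x1 = 2, x2 = 3:
x1, x2 = 2.0, 3.0
tape = [
    TapeNode("input", (), x1),
    TapeNode("input", (), x2),
    TapeNode("multiply", (0, 1), x1 * x2, saved=(x1, x2)),
    TapeNode("sine", (0,), math.sin(x1), saved=(x1,)),
]
tape.append(TapeNode("add", (2, 3), tape[2].value + tape[3].value))
```

Note that the addition node carries no saved values, while the multiply node must retain both inputs; the `saved` tuples are where the memory cost of deep graphs accumulates.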
A large reverse graph may consume substantial memory. Deep neural networks can generate graphs containing billions of primitive operations. Memory management therefore becomes a central systems problem.
Destructive Updates and Aliasing
Programs often mutate memory:
x = x + 1

Mutation complicates reverse graphs because earlier values may be overwritten.
Reverse mode conceptually assumes immutable intermediate values. Systems that support mutation must preserve previous states or reconstruct them later.
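The hazard can be shown with a mutable buffer. A sketch under assumed names: the backward rule for y = x * x needs the pre-mutation primal, and a saved alias is not enough:

```python
# The backward rule for y = x * x needs the primal x (gradient is 2x).
x = [2.0]
saved_ref = x          # alias: still points at the live buffer
saved_copy = list(x)   # snapshot: preserves the pre-mutation state

x[0] = x[0] + 1.0      # in-place update, like `x = x + 1` on a buffer

stale = 2 * saved_ref[0]   # uses the overwritten value 3.0 -> wrong
right = 2 * saved_copy[0]  # uses the original 2.0 -> correct gradient
```

Preserving the snapshot is exactly the "preserve previous states" obligation described above; purely functional representations make it unnecessary by never overwriting.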
Aliasing introduces additional complications:
a = x
b = x

If one alias changes, the derivative system must maintain correct dependency relationships.
Many AD systems avoid these issues by using purely functional intermediate representations or SSA form.
Graph Granularity
The computational graph may represent operations at different levels of abstraction.
Fine-grained graphs represent primitive arithmetic operations individually:
t1 = x * y
t2 = sin(t1)
t3 = t2 + z

Coarse-grained graphs represent larger kernels:

matmul(A, B)

Fine-grained graphs provide flexibility and exact control over differentiation.
Coarse-grained graphs reduce graph size and runtime overhead.
Modern systems usually combine both approaches:
- large tensor kernels as graph primitives;
- internal differentiation rules implemented separately.
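A coarse-grained primitive pairs the kernel with a hand-written internal derivative rule instead of tracing through its arithmetic. A sketch, assuming the standard matrix-product adjoints (for C = A·B: Ā = C̄·Bᵀ and B̄ = Aᵀ·C̄); the name `matmul_vjp` is hypothetical:

```python
def matmul(A, B):
    # Coarse-grained primitive: an entire matrix product is one graph node.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matmul_vjp(A, B, C_bar):
    """Internal derivative rule, implemented once for the whole kernel:
    A_bar = C_bar @ B^T, B_bar = A^T @ C_bar."""
    Bt = [list(col) for col in zip(*B)]
    At = [list(col) for col in zip(*A)]
    return matmul(C_bar, Bt), matmul(At, C_bar)

A = [[1.0, 2.0]]
B = [[3.0], [4.0]]
C_bar = [[1.0]]                      # seed adjoint for the 1x1 output
A_bar, B_bar = matmul_vjp(A, B, C_bar)
# A_bar == [[3.0, 4.0]] (= B^T), B_bar == [[1.0], [2.0]] (= A^T)
```

The reverse graph never sees the multiply-add loop inside `matmul`; it sees one node with one backward rule, which is how tensor frameworks keep tapes small.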
Reverse Graph Execution Order
The reverse pass must respect dependency ordering.
If node u contributes to node v, then the reverse pass must process v before u.
This requires reverse topological traversal.
For a graph with nodes

v1, v2, ..., vn

generated in forward order, reverse mode typically processes:

vn, vn-1, ..., v1

This works because the forward trace already satisfies topological ordering.
Computational Interpretation
A reverse computational graph transforms a program into two coupled systems:
- a primal computation graph for values;
- a dual graph for sensitivities.
The forward graph answers:
How is this value computed?
The reverse graph answers:
Which earlier values influence this output, and by how much?
Reverse mode differentiation is therefore not symbolic algebra over expressions. It is execution over a graph of local derivative transformations.
The graph structure makes the chain rule operational. Each node performs a small local backward transformation, and the full gradient emerges from the composition of these local adjoint propagations.