Reverse accumulation is the operational core of reverse mode automatic differentiation. The forward pass evaluates a program and records dependency information. The reverse pass traverses that recorded structure backward and accumulates adjoint contributions into each variable.
The word accumulation is important. A variable may influence the final output through many downstream paths. Reverse mode computes the total derivative by summing contributions from all such paths.
For a scalar-valued function
$$ y = f(x_1, \dots, x_n), \qquad f : \mathbb{R}^{n} \to \mathbb{R}, $$
reverse accumulation computes the full gradient
$$ \nabla f = \left( \frac{\partial y}{\partial x_1}, \dots, \frac{\partial y}{\partial x_n} \right) $$
using one backward traversal.
Forward Evaluation and Recording
Consider a straight-line program that evaluates
$$ y = x_1 x_2 + \sin(x_1 x_2). $$
The forward pass computes intermediate values in order.
| Step | Expression |
|---|---|
| 1 | $v_1 = x_1 x_2$ |
| 2 | $v_2 = \sin(v_1)$ |
| 3 | $v_3 = v_1 + v_2$ |
| 4 | $y = v_3$ |
During execution, the system records:
- operation type;
- input references;
- output reference;
- primal values needed later.
The resulting trace is a topologically ordered sequence of operations.
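To make the recorded trace concrete, here is a minimal sketch of such a recorder in plain Python; the `Variable`, `Operation`, `tape`, and `record` names are illustrative choices, not the API of any particular system.

```python
import math
from dataclasses import dataclass

@dataclass
class Variable:
    value: float          # primal value from the forward pass
    adjoint: float = 0.0  # filled in later by the reverse pass

@dataclass
class Operation:
    kind: str             # operation type, e.g. "mul", "sin", "add"
    inputs: tuple         # references to the input Variables
    output: Variable      # reference to the output Variable
    saved: tuple          # primal values needed later by the reverse rule

tape = []                 # topologically ordered forward trace

def record(kind, fn, *inputs):
    """Evaluate one primitive and append its metadata to the tape."""
    out = Variable(fn(*(v.value for v in inputs)))
    tape.append(Operation(kind, inputs, out, tuple(v.value for v in inputs)))
    return out

# Forward pass for the straight-line program above: y = x1*x2 + sin(x1*x2).
x1, x2 = Variable(2.0), Variable(3.0)
v1 = record("mul", lambda a, b: a * b, x1, x2)
v2 = record("sin", math.sin, v1)
y  = record("add", lambda a, b: a + b, v1, v2)
print([op.kind for op in tape])   # ['mul', 'sin', 'add']
```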
Reverse Accumulation Principle
For every variable $v_i$, define the adjoint
$$ \bar{v}_i = \frac{\partial y}{\partial v_i} . $$
The reverse pass begins with
$$ \bar{y} = 1 . $$
Then each operation propagates adjoints backward using local derivative rules.
The chain rule gives
$$ \bar{v}_i = \sum_{j \,\in\, \mathrm{children}(i)} \bar{v}_j \, \frac{\partial v_j}{\partial v_i}, $$
where the sum ranges over all immediate children $v_j$ of $v_i$ in the computational graph.
This equation defines reverse accumulation.
Reverse Sweep Example
Start from the previous program.
Forward values:
$$ v_1 = x_1 x_2, \qquad v_2 = \sin(v_1), \qquad v_3 = v_1 + v_2, \qquad y = v_3 . $$
Initialize all adjoints to zero:
$$ \bar{x}_1 = \bar{x}_2 = \bar{v}_1 = \bar{v}_2 = \bar{v}_3 = 0 . $$
Seed the output:
$$ \bar{y} = \bar{v}_3 = 1 . $$
Now traverse backward.
Step 1: Addition Node
Since
$$ v_3 = v_1 + v_2, $$
the reverse rule is
$$ \bar{v}_1 \mathrel{+}= \bar{v}_3, \qquad \bar{v}_2 \mathrel{+}= \bar{v}_3 . $$
Therefore,
$$ \bar{v}_1 = 1, \qquad \bar{v}_2 = 1 . $$
Step 2: Sine Node
Since
$$ v_2 = \sin(v_1), $$
the reverse rule is
$$ \bar{v}_1 \mathrel{+}= \bar{v}_2 \cos(v_1) . $$
Thus,
$$ \bar{v}_1 = 1 + \cos(v_1) = 1 + \cos(x_1 x_2) . $$
Step 3: Multiplication Node
Since
$$ v_1 = x_1 x_2, $$
the reverse rules are
$$ \bar{x}_1 \mathrel{+}= \bar{v}_1\, x_2, \qquad \bar{x}_2 \mathrel{+}= \bar{v}_1\, x_1 . $$
Substituting:
$$ \bar{x}_1 = \left(1 + \cos(x_1 x_2)\right) x_2, \qquad \bar{x}_2 = \left(1 + \cos(x_1 x_2)\right) x_1 . $$
Since
$$ \bar{x}_1 = \frac{\partial y}{\partial x_1}, \qquad \bar{x}_2 = \frac{\partial y}{\partial x_2}, $$
the final derivatives are
$$ \frac{\partial y}{\partial x_1} = x_2 \left(1 + \cos(x_1 x_2)\right), \qquad \frac{\partial y}{\partial x_2} = x_1 \left(1 + \cos(x_1 x_2)\right) . $$
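The sweep can be checked numerically. The following snippet is a hand-unrolled version of the example above at the point $x_1 = 2$, $x_2 = 3$ (the values are arbitrary), comparing the accumulated adjoints against the closed-form derivatives.

```python
import math

x1, x2 = 2.0, 3.0

# Forward pass: record the primal values.
v1 = x1 * x2
v2 = math.sin(v1)
v3 = v1 + v2                       # y = v3

# Reverse sweep: adjoints start at zero, so the first accumulation is an assignment.
v3_bar = 1.0                       # seed the output adjoint
v1_bar = v3_bar                    # addition node
v2_bar = v3_bar
v1_bar += v2_bar * math.cos(v1)    # sine node
x1_bar = v1_bar * x2               # multiplication node
x2_bar = v1_bar * x1

# Compare against the closed-form gradient derived above.
assert abs(x1_bar - x2 * (1 + math.cos(x1 * x2))) < 1e-12
assert abs(x2_bar - x1 * (1 + math.cos(x1 * x2))) < 1e-12
```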
Generic Reverse Accumulation Algorithm
The reverse accumulation procedure can be expressed abstractly.
Suppose the forward pass generated operations
$$ \mathrm{op}_1, \mathrm{op}_2, \dots, \mathrm{op}_k . $$
Each operation defines
$$ v_j = \phi_j\!\left(v_{i_1}, \dots, v_{i_m}\right), $$
where the arguments are previously computed variables or program inputs.
The reverse algorithm:
- initialize all adjoints to zero;
- set output adjoint to one;
- process operations in reverse order;
- apply local pullbacks;
- accumulate contributions into parent adjoints.
Pseudocode:
```
forward:
    for op in operations:
        compute op.output
        record operation metadata

reverse:
    for each variable v:
        v.adjoint = 0
    output.adjoint = 1
    for op in reverse(operations):
        for input in op.inputs:
            input.adjoint += (
                op.output.adjoint
                * local_derivative(op, input)
            )
```

This structure is independent of the specific primitive operations.
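A runnable Python sketch of this generic procedure follows. The `Var` class, the primitive wrappers, and the `backward` function are illustrative names; here each tape entry stores its local derivatives at record time, whereas the pseudocode above evaluates `local_derivative` during the reverse sweep. Both are common designs.

```python
import math

class Var:
    def __init__(self, value):
        self.value = value
        self.adjoint = 0.0

# Each tape entry: (output Var, [(input Var, local derivative value), ...]).
tape = []

def mul(a, b):
    out = Var(a.value * b.value)
    tape.append((out, [(a, b.value), (b, a.value)]))
    return out

def sin(a):
    out = Var(math.sin(a.value))
    tape.append((out, [(a, math.cos(a.value))]))
    return out

def add(a, b):
    out = Var(a.value + b.value)
    tape.append((out, [(a, 1.0), (b, 1.0)]))
    return out

def backward(output):
    """Reverse accumulation: seed the output adjoint, walk the tape backward."""
    output.adjoint = 1.0
    for out, pullbacks in reversed(tape):
        for inp, local_deriv in pullbacks:
            inp.adjoint += out.adjoint * local_deriv

# y = x1*x2 + sin(x1*x2), as in the worked example.
x1, x2 = Var(2.0), Var(3.0)
v1 = mul(x1, x2)
backward(add(v1, sin(v1)))
print(x1.adjoint, x2.adjoint)   # x2*(1+cos(x1*x2)), x1*(1+cos(x1*x2))
```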
Local Reverse Rules
Every primitive operation defines a reverse accumulation rule.
Addition
$$ c = a + b $$
Reverse:
$$ \bar{a} \mathrel{+}= \bar{c}, \qquad \bar{b} \mathrel{+}= \bar{c} $$
Subtraction
$$ c = a - b $$
Reverse:
$$ \bar{a} \mathrel{+}= \bar{c}, \qquad \bar{b} \mathrel{-}= \bar{c} $$
Multiplication
$$ c = a\, b $$
Reverse:
$$ \bar{a} \mathrel{+}= \bar{c}\, b, \qquad \bar{b} \mathrel{+}= \bar{c}\, a $$
Division
$$ c = \frac{a}{b} $$
Reverse:
$$ \bar{a} \mathrel{+}= \frac{\bar{c}}{b}, \qquad \bar{b} \mathrel{-}= \frac{\bar{c}\, a}{b^{2}} $$
Exponential
$$ c = \exp(a) $$
Reverse:
$$ \bar{a} \mathrel{+}= \bar{c} \exp(a) $$
Often implemented as
$$ \bar{a} \mathrel{+}= \bar{c}\, c $$
using the stored forward output.
Sine
$$ c = \sin(a) $$
Reverse:
$$ \bar{a} \mathrel{+}= \bar{c} \cos(a) $$
Each primitive only requires local derivative knowledge.
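In code, each rule becomes a small pullback function that maps the output adjoint, plus any saved primal data, to the contributions accumulated into the input adjoints. A minimal sketch with illustrative names:

```python
import math

# Each pullback returns the contributions to accumulate into the input adjoints,
# given the output adjoint c_bar and the primal data its rule requires.

def add_pullback(c_bar, a, b):
    return c_bar, c_bar                      # d(a+b)/da = 1, d(a+b)/db = 1

def sub_pullback(c_bar, a, b):
    return c_bar, -c_bar                     # d(a-b)/db = -1

def mul_pullback(c_bar, a, b):
    return c_bar * b, c_bar * a              # needs both saved inputs

def div_pullback(c_bar, a, b):
    return c_bar / b, -c_bar * a / b**2      # needs the denominator

def exp_pullback(c_bar, c):
    return (c_bar * c,)                      # reuses the stored forward output

def sin_pullback(c_bar, a):
    return (c_bar * math.cos(a),)            # needs the saved input
```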
Graph Interpretation
The reverse accumulation algorithm computes gradients by traversing the computational graph backward.
At each node:
- receive output adjoint;
- compute local contributions;
- distribute contributions into inputs.
The full gradient emerges from repeated local applications of the chain rule.
This locality is one of the most important properties of reverse mode systems.
Sparse Dependency Propagation
Many programs have sparse dependency structure.
Example: with inputs $x_1, x_2, x_3$, consider
$$ y = \sin(x_1) + x_3 . $$
Variable $x_2$ does not influence $y$.
Thus,
$$ \bar{x}_2 = \frac{\partial y}{\partial x_2} = 0 . $$
Reverse accumulation naturally respects sparsity because adjoints only propagate along dependency edges.
Nodes disconnected from the output never receive nonzero adjoints.
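For the example above, a hand-unrolled reverse sweep shows the effect directly: no dependency edge reaches $x_2$, so no contribution is ever added to its adjoint (the input values are arbitrary).

```python
import math

# y = sin(x1) + x3: x2 never appears in the trace.
x1, x2, x3 = 1.5, -4.0, 2.0

y_bar = 1.0
x1_bar = y_bar * math.cos(x1)   # propagated along the sin edge
x3_bar = y_bar * 1.0            # propagated along the add edge
x2_bar = 0.0                    # no edge into x2, so its adjoint keeps its initial value

print(x1_bar, x2_bar, x3_bar)
```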
In-Place Adjoint Updates
Most implementations use mutable adjoint buffers.
Operationally:
```
adjoint[input] += contribution
```

This avoids repeated allocation and permits efficient accumulation.
However, parallel execution becomes more difficult because several operations may attempt to update the same adjoint simultaneously.
Parallel reverse mode therefore requires one or more of the following (a minimal sketch appears after this list):
- synchronization;
- atomic accumulation;
- graph partitioning;
- reduction strategies.
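A small Python sketch of one mitigation, guarding each accumulation with a lock so concurrent workers cannot lose updates, is shown below; the names are illustrative, and per-worker adjoint buffers combined with a final reduction are a common alternative.

```python
import threading

# Shared adjoint buffer: several workers accumulate into the same slot.
adjoints = {"x": 0.0}
lock = threading.Lock()

def worker(contributions):
    for c in contributions:
        with lock:                  # serialize each read-modify-write
            adjoints["x"] += c

chunks = [[0.001] * 10_000 for _ in range(4)]
threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(adjoints["x"])   # ~40.0, the sum of all contributions; without the lock, updates could be lost
```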
Reverse Topological Ordering
The reverse sweep requires valid dependency order.
Suppose
$$ v_i \longrightarrow v_j $$
in the forward graph.
Then reverse propagation requires $\bar{v}_j$ to be fully accumulated
before computing contributions into $\bar{v}_i$.
Thus the reverse pass processes nodes in reverse topological order.
Straight-line programs automatically satisfy this property because the forward execution trace already records operations in topological sequence.
Complexity of Reverse Accumulation
Suppose evaluating the original program costs $C$.
Reverse accumulation usually costs a small constant multiple of $C$.
The total cost becomes approximately
$$ (1 + \omega)\, C $$
for a small constant $\omega$.

| Phase | Cost |
|---|---|
| Forward pass | $C$ |
| Reverse pass | $\omega C$ |
| Total | $(1 + \omega)\, C$ |
Typical values of $\omega$ fall between roughly $1$ and $4$.
The precise value depends on:
- primitive operation complexity;
- memory traffic;
- tensor kernel efficiency;
- recomputation strategy.
Reverse Accumulation and Memory
Reverse accumulation requires access to forward information.
The reverse rules for many operations depend on saved primal values:
| Operation | Required Data |
|---|---|
| multiplication | inputs |
| division | denominator |
| exponential | output or input |
| matrix factorization | decomposition state |
The system must therefore preserve enough information from the forward pass.
This creates the central tradeoff in reverse mode:
| Strategy | Cost |
|---|---|
| Save everything | high memory |
| Recompute values | high compute |
Checkpointing algorithms balance this tradeoff.
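As a toy illustration of the two extremes, consider the sine rule: one variant saves the primal input needed by its reverse rule, the other saves nothing and recomputes that input during the reverse sweep. Checkpointing schemes apply this choice per program segment rather than per operation. The function names below are purely illustrative.

```python
import math

# Strategy 1: save everything -- keep the primal input the pullback will need.
def sin_saving(a):
    saved = a                                   # memory cost: one stored value
    pullback = lambda c_bar: c_bar * math.cos(saved)
    return math.sin(a), pullback

# Strategy 2: recompute -- keep only a recipe for rebuilding the input.
def sin_recomputing(make_a):
    pullback = lambda c_bar: c_bar * math.cos(make_a())   # make_a() runs again here
    return math.sin(make_a()), pullback

x = 2.0
expensive_segment = lambda: x * x * x           # stand-in for a costly subprogram
y1, pb1 = sin_saving(expensive_segment())
y2, pb2 = sin_recomputing(expensive_segment)
print(pb1(1.0) == pb2(1.0))                     # True: identical adjoint contributions
```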
Reverse Accumulation as Backward Linear Propagation
Each local reverse rule applies a transposed Jacobian.
Suppose a primitive operation maps
$$ v = \phi(u), \qquad u \in \mathbb{R}^{m}, \; v \in \mathbb{R}^{n} . $$
The local Jacobian is
$$ J = \frac{\partial \phi}{\partial u} \in \mathbb{R}^{n \times m} . $$
Forward mode propagates tangents:
$$ \dot{v} = J\, \dot{u} . $$
Reverse accumulation propagates adjoints:
$$ \bar{u} \mathrel{+}= J^{\top} \bar{v} . $$
Thus the reverse pass computes repeated transposed linear transformations.
The global reverse accumulation algorithm is therefore the composition of local transposed Jacobian actions over the computational graph.
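The contrast can be seen numerically. Assuming NumPy is available, the sketch below uses an arbitrary illustrative primitive $\phi : \mathbb{R}^2 \to \mathbb{R}^3$ and applies its Jacobian in both directions.

```python
import numpy as np

# phi : R^2 -> R^3, an arbitrary illustrative primitive.
def phi(u):
    return np.array([u[0] * u[1], np.sin(u[0]), u[1] ** 2])

def jacobian(u):
    return np.array([
        [u[1],         u[0]],
        [np.cos(u[0]), 0.0],
        [0.0,          2 * u[1]],
    ])

u = np.array([2.0, 3.0])
J = jacobian(u)

u_dot = np.array([1.0, 0.0])       # a tangent on the inputs
v_bar = np.array([1.0, 0.0, 0.0])  # an adjoint on the outputs

print(J @ u_dot)    # forward mode: Jacobian-vector product, shape (3,)
print(J.T @ v_bar)  # reverse accumulation: transposed (vector-Jacobian) product, shape (2,)
```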
Conceptual Summary
Reverse accumulation algorithms operationalize the multivariate chain rule.
The forward pass builds a dependency trace and computes primal values.
The reverse pass:
- starts from output sensitivities;
- traverses operations backward;
- applies local pullbacks;
- accumulates adjoints into dependencies.
Every gradient computed by reverse mode emerges from repeated local adjoint accumulation over the reverse computational graph.