Reverse accumulation is the reverse-mode form of automatic differentiation. It propagates derivative information backward from outputs to inputs.
Forward mode asks: if one input is perturbed, how do all of the outputs change?

Reverse mode asks: for one output, how sensitive is it to every input?

The second question is the one needed for gradients of scalar-output functions.
The Adjoint Value
Reverse mode attaches an adjoint to each intermediate variable.
For a final scalar output $y$, the adjoint of an intermediate value $v$ is:

$$\bar{v} = \frac{\partial y}{\partial v}$$

The bar notation $\bar{v}$ means the sensitivity of the output with respect to that variable.

If changing $v$ slightly causes a large change in $y$, then $\bar{v}$ is large.
Reverse accumulation computes these adjoints from the end of the program back to the beginning.
Basic Example
Consider:

$$y = \sin(x^2 + 1)$$

Decompose the computation:

| Step | Expression | Variable |
|---|---|---|
| 1 | $x \cdot x$ | $v_1$ |
| 2 | $v_1 + 1$ | $v_2$ |
| 3 | $\sin(v_2)$ | $v_3$ |
| 4 | $v_3$ | $y$ |
The forward pass computes and stores the primal values:

$$v_1 = x^2, \qquad v_2 = x^2 + 1, \qquad v_3 = \sin(x^2 + 1), \qquad y = v_3$$

The reverse pass starts with:

$$\bar{y} = 1$$

because:

$$\frac{\partial y}{\partial y} = 1$$
Now propagate backward.
For:

$$y = v_3$$

we have:

$$\bar{v}_3 = \bar{y} = 1$$

For:

$$v_3 = \sin(v_2)$$

we have:

$$\bar{v}_2 = \cos(v_2)\,\bar{v}_3$$

For:

$$v_2 = v_1 + 1$$

we have:

$$\bar{v}_1 = \bar{v}_2$$

For:

$$v_1 = x \cdot x$$

we have:

$$\bar{x} = 2x\,\bar{v}_1$$

The final adjoint is:

$$\bar{x} = 2x\cos(x^2 + 1) = \frac{dy}{dx}$$
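As a concrete check, here is a minimal Python sketch of this example, with the forward pass stored explicitly and the reverse pass written out step by step (the function name and the numeric test value are only illustrative):

```python
import math

def grad_example(x):
    # Forward pass: compute and store the primal values.
    v1 = x * x              # step 1
    v2 = v1 + 1.0           # step 2
    v3 = math.sin(v2)       # step 3
    y = v3                  # step 4

    # Reverse pass: seed the output adjoint, then walk the steps backward.
    y_bar = 1.0                       # dy/dy = 1
    v3_bar = y_bar                    # y  = v3
    v2_bar = math.cos(v2) * v3_bar    # v3 = sin(v2)
    v1_bar = v2_bar                   # v2 = v1 + 1
    x_bar = 2.0 * x * v1_bar          # v1 = x * x

    return y, x_bar

y, dydx = grad_example(1.5)
print(dydx, 2 * 1.5 * math.cos(1.5**2 + 1))  # the two numbers should agree
```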
Reverse Mode as Local Transpose Propagation
Each primitive operation has a local Jacobian.
Suppose a primitive maps an input to an output:

$$v = f(u), \qquad J = \frac{\partial v}{\partial u}$$

Forward mode propagates tangents by multiplying with the local Jacobian:

$$\dot{v} = J\,\dot{u}$$

Reverse mode propagates adjoints by multiplying with the transpose of the local Jacobian:

$$\bar{u} = J^{\top}\bar{v}$$
This is the central algebraic distinction.
Forward mode computes Jacobian-vector products.
Reverse mode computes vector-Jacobian products.
Vector-Jacobian Products
For a function:

$$f : \mathbb{R}^n \to \mathbb{R}^m$$

the Jacobian has shape:

$$J \in \mathbb{R}^{m \times n}$$

Given an output cotangent vector:

$$v \in \mathbb{R}^m$$

reverse mode computes:

$$J^{\top} v$$

Equivalently, in a row-vector convention, it computes:

$$v^{\top} J$$
This is a vector-Jacobian product.
When $m = 1$, the output seed is just $1$, and reverse mode returns the full gradient:

$$J^{\top} \cdot 1 = \nabla f(x) \in \mathbb{R}^n$$
This is why reverse mode is efficient for scalar losses with many parameters.
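The following NumPy sketch makes the shapes concrete. The function `f`, its hand-written Jacobian, and the seed vectors are illustrative choices, not taken from any particular AD library:

```python
import numpy as np

def f(x):
    # Example map from R^3 to R^2, chosen only for illustration.
    return np.array([x[0] * x[1], np.sin(x[2])])

def jacobian(x):
    # Full 2x3 Jacobian of f, written out by hand.
    return np.array([
        [x[1], x[0], 0.0],
        [0.0,  0.0,  np.cos(x[2])],
    ])

x = np.array([1.0, 2.0, 3.0])
J = jacobian(x)

v = np.array([1.0, 0.0])            # output cotangent (reverse-mode seed)
u = np.array([1.0, 0.0, 0.0])       # input tangent (forward-mode seed)

vjp = J.T @ v                        # what reverse mode computes: J^T v, shape (3,)
jvp = J @ u                          # what forward mode computes: J u, shape (2,)

print(vjp, jvp)
```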
Reverse Propagation Rules
Reverse propagation uses local rules that distribute output adjoints to input adjoints.
Addition

For $z = x + y$, the local derivatives are:

$$\frac{\partial z}{\partial x} = 1, \qquad \frac{\partial z}{\partial y} = 1$$

Reverse rule:

$$\bar{x} \leftarrow \bar{x} + \bar{z}, \qquad \bar{y} \leftarrow \bar{y} + \bar{z}$$

Multiplication

For $z = x \cdot y$, the local derivatives are:

$$\frac{\partial z}{\partial x} = y, \qquad \frac{\partial z}{\partial y} = x$$

Reverse rule:

$$\bar{x} \leftarrow \bar{x} + y\,\bar{z}, \qquad \bar{y} \leftarrow \bar{y} + x\,\bar{z}$$

Sine

For $z = \sin(x)$, the local derivative is:

$$\frac{\partial z}{\partial x} = \cos(x)$$

Reverse rule:

$$\bar{x} \leftarrow \bar{x} + \cos(x)\,\bar{z}$$

Matrix Multiplication

For $C = AB$, the reverse rules are:

$$\bar{A} \leftarrow \bar{A} + \bar{C} B^{\top}, \qquad \bar{B} \leftarrow \bar{B} + A^{\top} \bar{C}$$
The same principle holds: local derivatives are applied backward, and adjoints are accumulated.
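A small sketch of these rules as plain Python functions; each takes the saved inputs and the output adjoint and returns the contributions that would be added into the input adjoints (the function names are illustrative):

```python
import numpy as np

def add_backward(z_bar):
    # z = x + y:  dz/dx = 1, dz/dy = 1
    return z_bar, z_bar

def mul_backward(x, y, z_bar):
    # z = x * y:  dz/dx = y, dz/dy = x
    return y * z_bar, x * z_bar

def sin_backward(x, z_bar):
    # z = sin(x): dz/dx = cos(x)
    return np.cos(x) * z_bar

def matmul_backward(A, B, C_bar):
    # C = A @ B:  A_bar = C_bar @ B^T,  B_bar = A^T @ C_bar
    return C_bar @ B.T, A.T @ C_bar
```

Each rule only returns contributions; a full system still has to add them into the running adjoints of the inputs, as the next section discusses.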
Accumulation at Shared Inputs
A variable may influence the output through multiple paths.
Example:

$$y = x \cdot \sin(x)$$

The variable $x$ contributes through two paths: the left factor of the product and the argument of the sine.

Reverse mode must add both contributions.

Using intermediate variables:

$$u = \sin(x), \qquad y = x \cdot u$$

Backward propagation gives:

$$\bar{y} = 1, \qquad \bar{u} = x\,\bar{y} = x$$

Then the product contributes:

$$u\,\bar{y} = \sin(x)$$

and the sine contributes:

$$\cos(x)\,\bar{u} = x\cos(x)$$

Therefore:

$$\bar{x} = \sin(x) + x\cos(x)$$
This is why reverse-mode implementations use additive updates such as:

```
x.grad += contribution
```

rather than assignment.
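A small sketch of the shared-input example above, with the accumulation written as explicit `+=` updates (the function name and test value are illustrative):

```python
import math

def grad_shared_input(x):
    # Forward pass: x feeds both the sine and the product.
    u = math.sin(x)
    y = x * u

    # Reverse pass with explicit additive accumulation.
    y_bar = 1.0
    x_bar = 0.0

    # y = x * u  (multiplication rule)
    x_bar += u * y_bar              # contribution through the product factor
    u_bar = x * y_bar

    # u = sin(x)  (sine rule)
    x_bar += math.cos(x) * u_bar    # contribution through the sine argument

    return x_bar

print(grad_shared_input(0.7), math.sin(0.7) + 0.7 * math.cos(0.7))
```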
Forward Pass and Reverse Pass
Reverse accumulation usually requires two passes.
First pass:
- execute the primal program
- record intermediate values
- record dependency structure
Second pass:
- seed output adjoints
- traverse operations in reverse order
- apply local transpose rules
- accumulate input adjoints
The recorded structure is often called a tape, trace, or Wengert list.
A simple tape entry might contain:
```
op: mul
inputs: x, y
output: z
saved: x_value, y_value
backward:
  x.grad += y.value * z.grad
  y.grad += x.value * z.grad
```

The saved primal values are needed because reverse rules often depend on them.
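Below is a minimal operator-overloading sketch in the spirit of such a tape: each operation records its inputs and a backward closure, and the reverse pass replays the closures in reverse topological order. The class and function names (`Var`, `sin`, `backward`) are illustrative, not from any specific library:

```python
import math

class Var:
    """A scalar value that records the operation that produced it."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0
        self._backward = lambda: None   # local transpose rule for this node
        self._inputs = ()

    def __add__(self, other):
        out = Var(self.value + other.value)
        out._inputs = (self, other)
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value)
        out._inputs = (self, other)
        def backward():
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad
        out._backward = backward
        return out

def sin(x):
    out = Var(math.sin(x.value))
    out._inputs = (x,)
    def backward():
        x.grad += math.cos(x.value) * out.grad
    out._backward = backward
    return out

def backward(output):
    # Topologically order the recorded graph, then apply local rules in reverse.
    order, seen = [], set()
    def visit(v):
        if v not in seen:
            seen.add(v)
            for u in v._inputs:
                visit(u)
            order.append(v)
    visit(output)
    output.grad = 1.0              # seed the output adjoint
    for v in reversed(order):
        v._backward()

# Example: y = x * sin(x)
x = Var(0.7)
y = x * sin(x)
backward(y)
print(x.grad)   # sin(0.7) + 0.7 * cos(0.7)
```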
Cost Model
Let $C(f)$ be the cost of evaluating the primal function.

For scalar-output functions, reverse mode computes the full gradient at a cost that is usually within a small constant multiple of $C(f)$:

$$\text{cost}(\nabla f) = O(C(f))$$

The important point is that this cost does not scale linearly with the input dimension $n$ in the way forward mode does for a full gradient.
This makes reverse mode the standard method for training neural networks and optimizing large parameterized models.
Memory Cost
Reverse mode trades memory for derivative efficiency.
The reverse pass often needs intermediate primal values from the forward pass.
For a long computation, this can require substantial memory:

$$\text{memory} = O(\text{number of recorded intermediate values})$$
This cost appears in:
- deep neural networks
- long simulations
- recurrent models
- differentiable physics
- solver-based computations
Checkpointing reduces memory by storing only selected states and recomputing others during the reverse pass.
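A toy sketch of this idea for a straight-line chain of primitives: the forward pass stores only every k-th value, and the reverse pass recomputes the missing intermediates from the nearest stored checkpoint. The chain of steps and the `every` parameter are illustrative:

```python
import math

# A chain of primitive steps and their local derivatives (illustrative).
steps = [
    (lambda v: v * v,        lambda v: 2 * v),
    (lambda v: v + 1.0,      lambda v: 1.0),
    (lambda v: math.sin(v),  lambda v: math.cos(v)),
    (lambda v: 3.0 * v,      lambda v: 3.0),
]

def grad_with_checkpoints(x, every=2):
    # Forward pass: store only every `every`-th intermediate value.
    checkpoints = {0: x}
    v = x
    for i, (f, _) in enumerate(steps):
        v = f(v)
        if (i + 1) % every == 0:
            checkpoints[i + 1] = v

    def value_at(k):
        # Primal value entering step k, recomputed from the nearest checkpoint.
        j = max(i for i in checkpoints if i <= k)
        v = checkpoints[j]
        for f, _ in steps[j:k]:
            v = f(v)
        return v

    # Reverse pass over the scalar chain: multiply local derivatives backward.
    bar = 1.0
    for k in reversed(range(len(steps))):
        _, df = steps[k]
        bar *= df(value_at(k))
    return bar

print(grad_with_checkpoints(0.5))   # derivative of 3*sin(x^2 + 1) at x = 0.5
```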
Reverse Mode and Backpropagation
Backpropagation is reverse-mode automatic differentiation applied to neural networks.
A neural network is a composition of layers:

$$f(x) = f_L(f_{L-1}(\cdots f_1(x)))$$

The loss is usually scalar:

$$\mathcal{L} = \ell(f(x)) \in \mathbb{R}$$

Reverse mode propagates the adjoints of the layer outputs:

$$\bar{h}_k = \frac{\partial \mathcal{L}}{\partial h_k}$$

backward from the loss, where $h_k$ is the output of layer $k$.
Layer by layer, the system applies local transpose rules.
Backpropagation is therefore not a separate mathematical technique from AD. It is a specialized instance of reverse accumulation.
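As an illustration, here is a hand-written forward and reverse pass for a tiny two-layer network with a scalar squared-error loss, using NumPy only (shapes and initialization are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer network with a scalar squared-error loss.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
x, target = rng.normal(size=3), np.array([0.5])

# Forward pass (record the intermediates the reverse pass will need).
h_pre = W1 @ x + b1
h = np.tanh(h_pre)
y = W2 @ h + b2
loss = 0.5 * np.sum((y - target) ** 2)

# Reverse pass: local transpose rules, layer by layer.
y_bar = y - target                    # d loss / d y
W2_bar = np.outer(y_bar, h)           # y = W2 @ h + b2
b2_bar = y_bar
h_bar = W2.T @ y_bar
h_pre_bar = (1.0 - h ** 2) * h_bar    # h = tanh(h_pre)
W1_bar = np.outer(h_pre_bar, x)       # h_pre = W1 @ x + b1
b1_bar = h_pre_bar
x_bar = W1.T @ h_pre_bar

print(float(loss), W1_bar.shape, W2_bar.shape, x_bar.shape)
```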
Reverse Mode with Multiple Outputs
If:

$$f : \mathbb{R}^n \to \mathbb{R}^m$$

and $m > 1$, a reverse pass computes:

$$J^{\top} v$$

for one chosen output seed $v \in \mathbb{R}^m$.

To compute the full Jacobian with reverse mode, one can seed each output basis vector:

$$e_1, e_2, \dots, e_m$$
Each reverse pass gives one row of the Jacobian.
Thus:
- forward mode computes Jacobian columns
- reverse mode computes Jacobian rows
Reverse mode is preferred when output dimension is small.
Forward mode is preferred when input dimension is small.
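A sketch of building the full Jacobian row by row, reusing a hand-written VJP for the same illustrative function as in the earlier NumPy example:

```python
import numpy as np

def f(x):
    # R^3 -> R^2, as before.
    return np.array([x[0] * x[1], np.sin(x[2])])

def vjp(x, v):
    # Hand-written vector-Jacobian product v^T J for f.
    return np.array([
        v[0] * x[1],
        v[0] * x[0],
        v[1] * np.cos(x[2]),
    ])

def jacobian_by_rows(x, m=2):
    # One reverse pass per output: seed each basis vector e_i.
    return np.stack([vjp(x, np.eye(m)[i]) for i in range(m)])

x = np.array([1.0, 2.0, 3.0])
print(jacobian_by_rows(x))   # 2x3 Jacobian, one row per reverse pass
```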
Mutability and Reverse Accumulation
Reverse accumulation becomes more complex when programs mutate state.
Consider:
```
x = 2
x = x * x
x = x + 1
```

The name `x` refers to different values over time.
A reverse-mode system must distinguish:
- variable names
- storage locations
- value versions
Static single assignment form helps:
```
x0 = 2
x1 = x0 * x0
x2 = x1 + 1
```

Now each value has a unique definition, and reverse propagation is well defined.
This is one reason AD compilers often lower programs into SSA-like intermediate representations.
Failure Modes
Reverse accumulation assumes that the executed computation has differentiable local rules.
Problems arise with:
- discontinuities
- integer control flow
- aliased mutation
- in-place overwrites
- undefined domains
- non-deterministic operations
- external side effects
A practical AD system must define semantics for these cases.
Often the system differentiates the executed trace only. It does not differentiate all possible traces.
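For example, with value-based tracing:

```python
def f(x):
    # The executed branch depends on the value of x.
    if x > 0:
        return x * x
    return -x

# Tracing f at x = 3.0 records only the x * x branch; the derivative rule
# that is replayed (2 * x) is valid near 3.0, but says nothing about the
# x <= 0 branch or the kink at x = 0.
```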
Summary
Reverse accumulation propagates adjoints backward through a recorded computation.
Its central equation is the local transpose update:

$$\bar{u} \leftarrow \bar{u} + \left(\frac{\partial v}{\partial u}\right)^{\!\top} \bar{v}$$

applied to each recorded operation $v = f(u)$ in reverse order.
For scalar-output functions with many inputs, reverse mode is usually the most efficient way to compute gradients.
Its strength is computational efficiency.
Its cost is memory, bookkeeping, and more complex semantics for control flow, mutation, and state.