Reverse accumulation is the reverse-mode form of automatic differentiation. It propagates derivative information backward from outputs to inputs.
Forward mode asks: if one input is perturbed, how do all of the outputs change?

Reverse mode asks: for one output, how sensitive is it to every input?

The second question is the one needed for gradients of scalar-output functions.
The Adjoint Value
Reverse mode attaches an adjoint to each intermediate variable.
For a final scalar output $y$, the adjoint of an intermediate value $v$ is:

$$\bar{v} = \frac{\partial y}{\partial v}$$

The bar notation $\bar{v}$ means the sensitivity of the output with respect to that variable.

If changing $v$ slightly causes a large change in $y$, then $\bar{v}$ is large.
Reverse accumulation computes these adjoints from the end of the program back to the beginning.
Basic Example
Consider:

$$y = \sin(x^2 + 1)$$

Decompose the computation:

| Step | Expression | Variable |
|---|---|---|
| 1 | $x \cdot x$ | $v_1$ |
| 2 | $v_1 + 1$ | $v_2$ |
| 3 | $\sin(v_2)$ | $v_3$ |
| 4 | $v_3$ | $y$ |
The forward pass computes and stores the primal values:

$$v_1 = x^2, \qquad v_2 = x^2 + 1, \qquad v_3 = \sin(x^2 + 1), \qquad y = v_3$$

The reverse pass starts with:

$$\bar{y} = 1$$

because:

$$\frac{\partial y}{\partial y} = 1$$
Now propagate backward.
For:

$$y = v_3$$

we have:

$$\bar{v}_3 = \bar{y} = 1$$

For:

$$v_3 = \sin(v_2)$$

we have:

$$\bar{v}_2 = \cos(v_2)\,\bar{v}_3$$

For:

$$v_2 = v_1 + 1$$

we have:

$$\bar{v}_1 = \bar{v}_2$$

For:

$$v_1 = x \cdot x$$

we have:

$$\bar{x} = 2x\,\bar{v}_1$$

The final adjoint is:

$$\bar{x} = 2x\cos(x^2 + 1) = \frac{dy}{dx}$$
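As a concrete check, here is a minimal Python sketch of this example, with the forward pass stored explicitly and the reverse pass written out step by step (the function name and the numeric test value are only illustrative):

```python
import math

def grad_example(x):
    # Forward pass: compute and store the primal values.
    v1 = x * x              # step 1
    v2 = v1 + 1.0           # step 2
    v3 = math.sin(v2)       # step 3
    y = v3                  # step 4

    # Reverse pass: seed the output adjoint, then walk the steps backward.
    y_bar = 1.0                       # dy/dy = 1
    v3_bar = y_bar                    # y  = v3
    v2_bar = math.cos(v2) * v3_bar    # v3 = sin(v2)
    v1_bar = v2_bar                   # v2 = v1 + 1
    x_bar = 2.0 * x * v1_bar          # v1 = x * x

    return y, x_bar

y, dydx = grad_example(1.5)
print(dydx, 2 * 1.5 * math.cos(1.5**2 + 1))  # the two numbers should agree
```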
Reverse Mode as Local Transpose Propagation
Each primitive operation has a local Jacobian.
Suppose a primitive maps an input to an output:

$$v = f(u), \qquad J = \frac{\partial v}{\partial u}$$

Forward mode propagates tangents by multiplying with the local Jacobian:

$$\dot{v} = J\,\dot{u}$$

Reverse mode propagates adjoints by multiplying with the transpose of the local Jacobian:

$$\bar{u} = J^{\top}\bar{v}$$
This is the central algebraic distinction.
Forward mode computes Jacobian-vector products.
Reverse mode computes vector-Jacobian products.
Vector-Jacobian Products
For a function:

$$f : \mathbb{R}^n \to \mathbb{R}^m$$

the Jacobian has shape:

$$J \in \mathbb{R}^{m \times n}$$

Given an output cotangent vector:

$$v \in \mathbb{R}^m$$

reverse mode computes:

$$J^{\top} v$$

Equivalently, in a row-vector convention, it computes:

$$v^{\top} J$$
This is a vector-Jacobian product.
When $m = 1$, the output seed is just $1$, and reverse mode returns the full gradient:

$$J^{\top} \cdot 1 = \nabla f(x) \in \mathbb{R}^n$$
This is why reverse mode is efficient for scalar losses with many parameters.
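The following NumPy sketch makes the shapes concrete. The function `f`, its hand-written Jacobian, and the seed vectors are illustrative choices, not taken from any particular AD library:

```python
import numpy as np

def f(x):
    # Example map from R^3 to R^2, chosen only for illustration.
    return np.array([x[0] * x[1], np.sin(x[2])])

def jacobian(x):
    # Full 2x3 Jacobian of f, written out by hand.
    return np.array([
        [x[1], x[0], 0.0],
        [0.0,  0.0,  np.cos(x[2])],
    ])

x = np.array([1.0, 2.0, 3.0])
J = jacobian(x)

v = np.array([1.0, 0.0])            # output cotangent (reverse-mode seed)
u = np.array([1.0, 0.0, 0.0])       # input tangent (forward-mode seed)

vjp = J.T @ v                        # what reverse mode computes: J^T v, shape (3,)
jvp = J @ u                          # what forward mode computes: J u, shape (2,)

print(vjp, jvp)
```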
Reverse Propagation Rules
Reverse propagation uses local rules that distribute output adjoints to input adjoints.
Addition

For $z = x + y$, the local derivatives are:

$$\frac{\partial z}{\partial x} = 1, \qquad \frac{\partial z}{\partial y} = 1$$

Reverse rule:

$$\bar{x} \leftarrow \bar{x} + \bar{z}, \qquad \bar{y} \leftarrow \bar{y} + \bar{z}$$

Multiplication

For $z = x \cdot y$, the local derivatives are:

$$\frac{\partial z}{\partial x} = y, \qquad \frac{\partial z}{\partial y} = x$$

Reverse rule:

$$\bar{x} \leftarrow \bar{x} + y\,\bar{z}, \qquad \bar{y} \leftarrow \bar{y} + x\,\bar{z}$$

Sine

For $z = \sin(x)$, the local derivative is:

$$\frac{\partial z}{\partial x} = \cos(x)$$

Reverse rule:

$$\bar{x} \leftarrow \bar{x} + \cos(x)\,\bar{z}$$

Matrix Multiplication

For $C = AB$, the reverse rules are:

$$\bar{A} \leftarrow \bar{A} + \bar{C} B^{\top}, \qquad \bar{B} \leftarrow \bar{B} + A^{\top} \bar{C}$$
The same principle holds: local derivatives are applied backward, and adjoints are accumulated.
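A small sketch of these rules as plain Python functions; each takes the saved inputs and the output adjoint and returns the contributions that would be added into the input adjoints (the function names are illustrative):

```python
import numpy as np

def add_backward(z_bar):
    # z = x + y:  dz/dx = 1, dz/dy = 1
    return z_bar, z_bar

def mul_backward(x, y, z_bar):
    # z = x * y:  dz/dx = y, dz/dy = x
    return y * z_bar, x * z_bar

def sin_backward(x, z_bar):
    # z = sin(x): dz/dx = cos(x)
    return np.cos(x) * z_bar

def matmul_backward(A, B, C_bar):
    # C = A @ B:  A_bar = C_bar @ B^T,  B_bar = A^T @ C_bar
    return C_bar @ B.T, A.T @ C_bar
```

Each rule only returns contributions; a full system still has to add them into the running adjoints of the inputs, as the next section discusses.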
Accumulation at Shared Inputs
A variable may influence the output through multiple paths.
Example:

$$y = x \cdot \sin(x)$$

The variable $x$ contributes through two paths: the left factor of the product and the argument of the sine.

Reverse mode must add both contributions.

Using intermediate variables:

$$u = \sin(x), \qquad y = x \cdot u$$

Backward propagation gives:

$$\bar{y} = 1, \qquad \bar{u} = x\,\bar{y} = x$$

Then the product contributes:

$$u\,\bar{y} = \sin(x)$$

and the sine contributes:

$$\cos(x)\,\bar{u} = x\cos(x)$$

Therefore:

$$\bar{x} = \sin(x) + x\cos(x)$$
This is why reverse-mode implementations use additive updates such as:

```
x.grad += contribution
```

rather than assignment.
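A small sketch of the shared-input example above, with the accumulation written as explicit `+=` updates (the function name and test value are illustrative):

```python
import math

def grad_shared_input(x):
    # Forward pass: x feeds both the sine and the product.
    u = math.sin(x)
    y = x * u

    # Reverse pass with explicit additive accumulation.
    y_bar = 1.0
    x_bar = 0.0

    # y = x * u  (multiplication rule)
    x_bar += u * y_bar              # contribution through the product factor
    u_bar = x * y_bar

    # u = sin(x)  (sine rule)
    x_bar += math.cos(x) * u_bar    # contribution through the sine argument

    return x_bar

print(grad_shared_input(0.7), math.sin(0.7) + 0.7 * math.cos(0.7))
```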
Forward Pass and Reverse Pass
Reverse accumulation usually requires two passes.
First pass:
- execute the primal program
- record intermediate values
- record dependency structure
Second pass:
- seed output adjoints
- traverse operations in reverse order
- apply local transpose rules
- accumulate input adjoints
The recorded structure is often called a tape, trace, or Wengert list.
A simple tape entry might contain:
```
op: mul
inputs: x, y
output: z
saved: x_value, y_value
backward:
  x.grad += y.value * z.grad
  y.grad += x.value * z.grad
```

The saved primal values are needed because reverse rules often depend on them.
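Below is a minimal operator-overloading sketch in the spirit of such a tape: each operation records its inputs and a backward closure, and the reverse pass replays the closures in reverse topological order. The class and function names (`Var`, `sin`, `backward`) are illustrative, not from any specific library:

```python
import math

class Var:
    """A scalar value that records the operation that produced it."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0
        self._backward = lambda: None   # local transpose rule for this node
        self._inputs = ()

    def __add__(self, other):
        out = Var(self.value + other.value)
        out._inputs = (self, other)
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value)
        out._inputs = (self, other)
        def backward():
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad
        out._backward = backward
        return out

def sin(x):
    out = Var(math.sin(x.value))
    out._inputs = (x,)
    def backward():
        x.grad += math.cos(x.value) * out.grad
    out._backward = backward
    return out

def backward(output):
    # Topologically order the recorded graph, then apply local rules in reverse.
    order, seen = [], set()
    def visit(v):
        if v not in seen:
            seen.add(v)
            for u in v._inputs:
                visit(u)
            order.append(v)
    visit(output)
    output.grad = 1.0              # seed the output adjoint
    for v in reversed(order):
        v._backward()

# Example: y = x * sin(x)
x = Var(0.7)
y = x * sin(x)
backward(y)
print(x.grad)   # sin(0.7) + 0.7 * cos(0.7)
```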
Cost Model
Let $C(f)$ be the cost of evaluating the primal function.

For scalar-output functions, reverse mode computes the full gradient at a cost that is usually within a small constant multiple of $C(f)$:

$$\text{cost}(\nabla f) = O(C(f))$$

The important point is that this cost does not scale linearly with the input dimension $n$ in the way forward mode does for a full gradient.
This makes reverse mode the standard method for training neural networks and optimizing large parameterized models.
Memory Cost
Reverse mode trades memory for derivative efficiency.
The reverse pass often needs intermediate primal values from the forward pass.
For a long computation, this can require substantial memory:

$$\text{memory} = O(\text{number of recorded intermediate values})$$
This cost appears in:
- deep neural networks
- long simulations
- recurrent models
- differentiable physics
- solver-based computations
Checkpointing reduces memory by storing only selected states and recomputing others during the reverse pass.
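A toy sketch of this idea for a straight-line chain of primitives: the forward pass stores only every k-th value, and the reverse pass recomputes the missing intermediates from the nearest stored checkpoint. The chain of steps and the `every` parameter are illustrative:

```python
import math

# A chain of primitive steps and their local derivatives (illustrative).
steps = [
    (lambda v: v * v,        lambda v: 2 * v),
    (lambda v: v + 1.0,      lambda v: 1.0),
    (lambda v: math.sin(v),  lambda v: math.cos(v)),
    (lambda v: 3.0 * v,      lambda v: 3.0),
]

def grad_with_checkpoints(x, every=2):
    # Forward pass: store only every `every`-th intermediate value.
    checkpoints = {0: x}
    v = x
    for i, (f, _) in enumerate(steps):
        v = f(v)
        if (i + 1) % every == 0:
            checkpoints[i + 1] = v

    def value_at(k):
        # Primal value entering step k, recomputed from the nearest checkpoint.
        j = max(i for i in checkpoints if i <= k)
        v = checkpoints[j]
        for f, _ in steps[j:k]:
            v = f(v)
        return v

    # Reverse pass over the scalar chain: multiply local derivatives backward.
    bar = 1.0
    for k in reversed(range(len(steps))):
        _, df = steps[k]
        bar *= df(value_at(k))
    return bar

print(grad_with_checkpoints(0.5))   # derivative of 3*sin(x^2 + 1) at x = 0.5
```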
Reverse Mode and Backpropagation
Backpropagation is reverse-mode automatic differentiation applied to neural networks.
A neural network is a composition of layers:

$$f(x) = f_L(f_{L-1}(\cdots f_1(x)))$$

The loss is usually scalar:

$$\mathcal{L} = \ell(f(x)) \in \mathbb{R}$$

Reverse mode propagates the adjoints of the layer outputs:

$$\bar{h}_k = \frac{\partial \mathcal{L}}{\partial h_k}$$

backward from the loss, where $h_k$ is the output of layer $k$.
Layer by layer, the system applies local transpose rules.
Backpropagation is therefore not a separate mathematical technique from AD. It is a specialized instance of reverse accumulation.
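As an illustration, here is a hand-written forward and reverse pass for a tiny two-layer network with a scalar squared-error loss, using NumPy only (shapes and initialization are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer network with a scalar squared-error loss.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
x, target = rng.normal(size=3), np.array([0.5])

# Forward pass (record the intermediates the reverse pass will need).
h_pre = W1 @ x + b1
h = np.tanh(h_pre)
y = W2 @ h + b2
loss = 0.5 * np.sum((y - target) ** 2)

# Reverse pass: local transpose rules, layer by layer.
y_bar = y - target                    # d loss / d y
W2_bar = np.outer(y_bar, h)           # y = W2 @ h + b2
b2_bar = y_bar
h_bar = W2.T @ y_bar
h_pre_bar = (1.0 - h ** 2) * h_bar    # h = tanh(h_pre)
W1_bar = np.outer(h_pre_bar, x)       # h_pre = W1 @ x + b1
b1_bar = h_pre_bar
x_bar = W1.T @ h_pre_bar

print(float(loss), W1_bar.shape, W2_bar.shape, x_bar.shape)
```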
Reverse Mode with Multiple Outputs
If:

$$f : \mathbb{R}^n \to \mathbb{R}^m$$

and $m > 1$, a reverse pass computes:

$$J^{\top} v$$

for one chosen output seed $v \in \mathbb{R}^m$.

To compute the full Jacobian with reverse mode, one can seed each output basis vector:

$$e_1, e_2, \dots, e_m$$
Each reverse pass gives one row of the Jacobian.
Thus:
- forward mode computes Jacobian columns
- reverse mode computes Jacobian rows
Reverse mode is preferred when output dimension is small.
Forward mode is preferred when input dimension is small.
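A sketch of building the full Jacobian row by row, reusing a hand-written VJP for the same illustrative function as in the earlier NumPy example:

```python
import numpy as np

def f(x):
    # R^3 -> R^2, as before.
    return np.array([x[0] * x[1], np.sin(x[2])])

def vjp(x, v):
    # Hand-written vector-Jacobian product v^T J for f.
    return np.array([
        v[0] * x[1],
        v[0] * x[0],
        v[1] * np.cos(x[2]),
    ])

def jacobian_by_rows(x, m=2):
    # One reverse pass per output: seed each basis vector e_i.
    return np.stack([vjp(x, np.eye(m)[i]) for i in range(m)])

x = np.array([1.0, 2.0, 3.0])
print(jacobian_by_rows(x))   # 2x3 Jacobian, one row per reverse pass
```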
Mutability and Reverse Accumulation
Reverse accumulation becomes more complex when programs mutate state.
Consider:
```
x = 2
x = x * x
x = x + 1
```

The name `x` refers to different values over time.
A reverse-mode system must distinguish:
- variable names
- storage locations
- value versions
Static single assignment form helps:
```
x0 = 2
x1 = x0 * x0
x2 = x1 + 1
```

Now each value has a unique definition, and reverse propagation is well defined.
This is one reason AD compilers often lower programs into SSA-like intermediate representations.
Failure Modes
Reverse accumulation assumes that the executed computation has differentiable local rules.
Problems arise with:
- discontinuities
- integer control flow
- aliased mutation
- in-place overwrites
- undefined domains
- non-deterministic operations
- external side effects
A practical AD system must define semantics for these cases.
Often the system differentiates the executed trace only. It does not differentiate all possible traces.
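For example, with value-based tracing:

```python
def f(x):
    # The executed branch depends on the value of x.
    if x > 0:
        return x * x
    return -x

# Tracing f at x = 3.0 records only the x * x branch; the derivative rule
# that is replayed (2 * x) is valid near 3.0, but says nothing about the
# x <= 0 branch or the kink at x = 0.
```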
Summary
Reverse accumulation propagates adjoints backward through a recorded computation.
Its central equation is the local transpose update:

$$\bar{u} \leftarrow \bar{u} + \left(\frac{\partial v}{\partial u}\right)^{\!\top} \bar{v}$$

applied to each recorded operation $v = f(u)$ in reverse order.
For scalar-output functions with many inputs, reverse mode is usually the most efficient way to compute gradients.
Its strength is computational efficiency.
Its cost is memory, bookkeeping, and more complex semantics for control flow, mutation, and state.