Reverse Accumulation

Reverse accumulation is the reverse-mode form of automatic differentiation. It propagates derivative information backward from outputs to inputs.

Forward mode asks:

*How does each intermediate value change when the input changes?*

Reverse mode asks:

*How does the final output change when each intermediate value changes?*

The second question is the one needed for gradients of scalar-output functions.

The Adjoint Value

Reverse mode attaches an adjoint to each intermediate variable.

For a final scalar output $y$, the adjoint of an intermediate value $v$ is:

$$\bar v = \frac{\partial y}{\partial v}$$

The bar notation means sensitivity of the output with respect to that variable.

If changing $v$ slightly causes a large change in $y$, then $\bar v$ is large.

Reverse accumulation computes these adjoints from the end of the program back to the beginning.

Basic Example

Consider:

$$y = f(x) = \sin(x^2 + 3x)$$

Decompose the computation:

| Step | Expression | Variable |
| --- | --- | --- |
| 1 | $x^2$ | $v_1$ |
| 2 | $3x$ | $v_2$ |
| 3 | $v_1 + v_2$ | $v_3$ |
| 4 | $\sin(v_3)$ | $v_4 = y$ |

The forward pass computes and stores the primal values:

$$v_1 = x^2,\quad v_2 = 3x,\quad v_3 = v_1 + v_2,\quad v_4 = \sin(v_3)$$

The reverse pass starts with:

$$\bar v_4 = 1$$

because:

$$\frac{\partial y}{\partial y} = 1$$

Now propagate backward.

For:

$$v_4 = \sin(v_3)$$

we have:

$$\bar v_3 += \bar v_4 \cos(v_3)$$

For:

$$v_3 = v_1 + v_2$$

we have:

$$\bar v_1 += \bar v_3$$

$$\bar v_2 += \bar v_3$$

For:

$$v_2 = 3x$$

we have:

$$\bar x += 3\bar v_2$$

For:

$$v_1 = x^2$$

we have:

$$\bar x += 2x\bar v_1$$

The final adjoint $\bar x$ is:

$$\frac{dy}{dx} = (2x+3)\cos(x^2+3x)$$
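
The two passes can be written out by hand. A minimal Python sketch (variable names mirror the decomposition above; `grad_f` is our name for the combined routine):

```python
import math

def grad_f(x):
    """Reverse-mode derivative of y = sin(x**2 + 3*x), written out by hand."""
    # Forward pass: compute and store the primal values.
    v1 = x * x           # step 1
    v2 = 3 * x           # step 2
    v3 = v1 + v2         # step 3
    v4 = math.sin(v3)    # step 4, y = v4

    # Reverse pass: seed the output adjoint, then propagate backward.
    v4_bar = 1.0                         # dy/dy = 1
    v3_bar = v4_bar * math.cos(v3)       # through sin
    v1_bar = v3_bar                      # through the addition
    v2_bar = v3_bar
    x_bar = 2 * x * v1_bar + 3 * v2_bar  # both paths accumulate into x
    return v4, x_bar
```

The returned adjoint agrees with the closed form $(2x+3)\cos(x^2+3x)$ at any input.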

Reverse Mode as Local Transpose Propagation

Each primitive operation has a local Jacobian.

Suppose:

$$z = \phi(x)$$

Forward mode propagates tangents by multiplying with the local Jacobian:

$$\dot z = J_\phi \dot x$$

Reverse mode propagates adjoints by multiplying with the transpose of the local Jacobian:

$$\bar x += J_\phi^T \bar z$$

This is the central algebraic distinction.

Forward mode computes Jacobian-vector products.

Reverse mode computes vector-Jacobian products.

Vector-Jacobian Products

For a function:

$$f:\mathbb{R}^n \to \mathbb{R}^m$$

the Jacobian has shape:

$$J_f(x) \in \mathbb{R}^{m \times n}$$

Given an output cotangent vector:

$$\bar y \in \mathbb{R}^m$$

reverse mode computes:

$$\bar x = J_f(x)^T \bar y$$

Equivalently, in the row-vector convention, it computes:

$$\bar y^T J_f(x)$$

This is a vector-Jacobian product.

When $m=1$, the output seed is just $1$, and reverse mode returns the full gradient:

$$\nabla f(x)$$

This is why reverse mode is efficient for scalar losses with many parameters.
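
For a linear map the Jacobian is explicit, which makes the two conventions easy to check numerically. A small NumPy illustration (the names `A`, `x`, `y_bar` are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))  # for f(x) = A @ x, the Jacobian J_f is A itself
x = rng.standard_normal(5)
y_bar = rng.standard_normal(3)   # output cotangent seed

# Column convention: x_bar = J^T @ y_bar.
x_bar = A.T @ y_bar

# Row convention: y_bar^T @ J gives the same numbers.
assert np.allclose(x_bar, y_bar @ A)
```

One vector-Jacobian product touches `A` once; no Jacobian larger than the one already present is ever formed.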

Reverse Propagation Rules

Reverse propagation uses local rules that distribute output adjoints to input adjoints.

Addition

$$z = x + y$$

Local derivatives:

$$\frac{\partial z}{\partial x} = 1,\quad \frac{\partial z}{\partial y} = 1$$

Reverse rule:

$$\bar x += \bar z$$

$$\bar y += \bar z$$

Multiplication

$$z = xy$$

Local derivatives:

$$\frac{\partial z}{\partial x} = y,\quad \frac{\partial z}{\partial y} = x$$

Reverse rule:

$$\bar x += y\bar z$$

$$\bar y += x\bar z$$

Sine

$$z = \sin(x)$$

Local derivative:

$$\frac{dz}{dx} = \cos(x)$$

Reverse rule:

$$\bar x += \cos(x)\bar z$$

Matrix Multiplication

$$C = AB$$

Reverse rules:

$$\bar A += \bar C B^T$$

$$\bar B += A^T \bar C$$

The same principle holds: local derivatives are applied backward, and adjoints are accumulated.
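
These rules are easy to sanity-check numerically. A NumPy sketch that verifies the matrix-multiplication rule against a finite difference, using the scalar loss $L = \sum_{ij} C_{ij}$ so that $\bar C$ is a matrix of ones (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))

# With L = sum(A @ B), the adjoint of C = A @ B is all ones.
C_bar = np.ones((2, 4))
A_bar = C_bar @ B.T   # reverse rule for the left factor
B_bar = A.T @ C_bar   # reverse rule for the right factor

# Check one entry of A_bar against a finite difference.
eps = 1e-6
A_pert = A.copy()
A_pert[0, 1] += eps
fd = ((A_pert @ B).sum() - (A @ B).sum()) / eps
assert abs(fd - A_bar[0, 1]) < 1e-4
```

Note that the adjoints have the same shapes as the primals they correspond to, which is a useful invariant when debugging.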

Accumulation at Shared Inputs

A variable may influence the output through multiple paths.

Example:

$$y = x^2 + x$$

The variable $x$ contributes through:

  • $x^2$
  • $x$

Reverse mode must add both contributions.

Using intermediate variables:

$$v_1 = x^2$$

$$v_2 = x$$

$$y = v_1 + v_2$$

Backward propagation gives:

$$\bar v_1 = 1,\quad \bar v_2 = 1$$

Then:

$$\bar x += 2x\bar v_1$$

and:

$$\bar x += \bar v_2$$

Therefore:

$$\bar x = 2x + 1$$

This is why reverse-mode implementations use additive updates such as:

x.grad += contribution

rather than assignment.
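
The fan-out example, written with explicit accumulation (a sketch; `grad_shared` is our name):

```python
def grad_shared(x):
    """Derivative of y = x**2 + x: the input feeds two paths."""
    v1_bar = 1.0             # adjoint of v1 = x**2 through the addition
    v2_bar = 1.0             # adjoint of v2 = x through the addition
    x_bar = 0.0
    x_bar += 2 * x * v1_bar  # contribution via x**2
    x_bar += v2_bar          # contribution via x
    return x_bar             # equals 2x + 1
```

Replacing either `+=` with plain assignment would silently drop one path's contribution, which is a classic bug in hand-written backward passes.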

Forward Pass and Reverse Pass

Reverse accumulation usually requires two passes.

First pass:

  1. execute the primal program
  2. record intermediate values
  3. record dependency structure

Second pass:

  1. seed output adjoints
  2. traverse operations in reverse order
  3. apply local transpose rules
  4. accumulate input adjoints

The recorded structure is often called a tape, trace, or Wengert list.

A simple tape entry might contain:

op: mul
inputs: x, y
output: z
saved: x_value, y_value
backward:
    x.grad += y.value * z.grad
    y.grad += x.value * z.grad

The saved primal values are needed because reverse rules often depend on them.
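
A workable tape fits in a few dozen lines. This toy sketch (the `Var` class, `sin`, and `backprop` are illustrative names, not any particular library's API) records one backward closure per operation and replays them in reverse:

```python
import math

class Var:
    """Toy reverse-mode value: each op appends a backward closure to a shared tape."""
    def __init__(self, value, tape=None):
        self.value = value
        self.grad = 0.0
        self.tape = tape if tape is not None else []

    def __add__(self, other):
        z = Var(self.value + other.value, self.tape)
        def backward():              # addition rule: pass the adjoint through
            self.grad += z.grad
            other.grad += z.grad
        self.tape.append(backward)
        return z

    def __mul__(self, other):
        z = Var(self.value * other.value, self.tape)
        x_val, y_val = self.value, other.value   # saved primal values
        def backward():              # multiplication rule
            self.grad += y_val * z.grad
            other.grad += x_val * z.grad
        self.tape.append(backward)
        return z

def sin(x):
    z = Var(math.sin(x.value), x.tape)
    x_val = x.value                  # saved primal value
    def backward():                  # sine rule
        x.grad += math.cos(x_val) * z.grad
    x.tape.append(backward)
    return z

def backprop(y):
    y.grad = 1.0                     # seed the output adjoint
    for backward in reversed(y.tape):
        backward()

# The running example y = sin(x**2 + 3x) at x = 2:
x = Var(2.0)
y = sin(x * x + Var(3.0, x.tape) * x)
backprop(y)
# x.grad now holds (2*2 + 3) * cos(10)
```

Each closure captures exactly the saved primal values its rule needs, which is the role of the `saved:` field in the tape entry above.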

Cost Model

Let $C(f)$ be the cost of evaluating the primal function.

For scalar-output functions:

$$f:\mathbb{R}^n \to \mathbb{R}$$

reverse mode computes the full gradient with cost usually within a small constant multiple of $C(f)$.

The important point is that the cost does not scale linearly with $n$ in the way forward mode does for a full gradient.

This makes reverse mode the standard method for training neural networks and optimizing large parameterized models.

Memory Cost

Reverse mode trades memory for derivative efficiency.

The reverse pass often needs intermediate primal values from the forward pass.

For a long computation, this can require substantial memory:

$$\text{memory} \propto \text{number of recorded operations}$$

This cost appears in:

  • deep neural networks
  • long simulations
  • recurrent models
  • differentiable physics
  • solver-based computations

Checkpointing reduces memory by storing only selected states and recomputing others during the reverse pass.
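
A minimal checkpointing sketch for a chain of $n$ sine applications, storing one state every $k$ steps and recomputing the rest during the reverse pass (the function name and layout are ours):

```python
import math

def grad_chain(x0, n, k):
    """d/dx of sin applied n times, storing roughly n/k + k values instead of n."""
    # Forward pass: keep only every k-th state as a checkpoint.
    ckpts = [x0]
    x = x0
    for i in range(1, n + 1):
        x = math.sin(x)
        if i % k == 0:
            ckpts.append(x)

    # Reverse pass: handle segments back-to-front; within each segment,
    # recompute the discarded primal values from its checkpoint.
    x_bar = 1.0
    boundaries = list(range(0, n, k))          # start step of each segment
    for j in reversed(range(len(boundaries))):
        s = boundaries[j]
        e = min(s + k, n)
        vals = [ckpts[j]]                      # inputs to steps s+1 .. e
        for _ in range(e - s - 1):
            vals.append(math.sin(vals[-1]))
        for v in reversed(vals):               # chain rule through the segment
            x_bar *= math.cos(v)
    return x_bar
```

The result matches the direct product $\prod_i \cos(v_i)$ over all intermediate inputs, but only the checkpoints and one segment's worth of recomputed values are live at any time.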

Reverse Mode and Backpropagation

Backpropagation is reverse-mode automatic differentiation applied to neural networks.

A neural network is a composition of layers:

$$y = f_L(f_{L-1}(\cdots f_1(x)))$$

The loss is usually scalar:

$$\ell(\theta) = L(y, t)$$

Reverse mode propagates:

$$\frac{\partial \ell}{\partial y},\ \frac{\partial \ell}{\partial f_L},\ \dots,\ \frac{\partial \ell}{\partial \theta}$$

Layer by layer, the system applies local transpose rules.

Backpropagation is therefore not a separate mathematical technique from AD. It is a specialized instance of reverse accumulation.
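
A concrete instance: a tiny two-layer tanh network with squared-error loss, with one gradient entry checked against a finite difference (all names and the architecture are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((1, 4))
x = rng.standard_normal(3)
t = 0.5                               # target

def loss_fn(W1, W2):
    h = np.tanh(W1 @ x)               # layer 1
    y = (W2 @ h)[0]                   # layer 2 (scalar output)
    return 0.5 * (y - t) ** 2, h, y

loss, h, y = loss_fn(W1, W2)

# Reverse pass: apply each layer's transpose rule in output-to-input order.
y_bar = y - t                         # d loss / d y
W2_bar = y_bar * h[None, :]           # matmul rule: C_bar @ B^T
h_bar = y_bar * W2[0]                 # matmul rule: A^T @ C_bar
s_bar = h_bar * (1.0 - h ** 2)        # tanh'(s) = 1 - tanh(s)^2
W1_bar = np.outer(s_bar, x)           # matmul rule again

# Sanity-check one entry against a finite difference.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
fd = (loss_fn(W1p, W2)[0] - loss) / eps
assert abs(fd - W1_bar[0, 0]) < 1e-4
```

Every line of the reverse pass is one of the local transpose rules from earlier sections; nothing network-specific is added.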

Reverse Mode with Multiple Outputs

If:

$$f:\mathbb{R}^n \to \mathbb{R}^m$$

and $m>1$, a reverse pass computes:

$$J_f(x)^T \bar y$$

for one chosen output seed $\bar y$.

To compute the full Jacobian with reverse mode, one can seed each output basis vector:

$$\bar y = e_i$$

Each reverse pass gives one row of the Jacobian.

Thus:

  • forward mode computes Jacobian columns
  • reverse mode computes Jacobian rows

Reverse mode is preferred when output dimension is small.

Forward mode is preferred when input dimension is small.
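
Row-by-row Jacobian assembly can be shown with a hand-written vjp for a small example function (both `f` and `vjp` are our illustrative names):

```python
import numpy as np

def f(x):
    # f: R^2 -> R^2 with components (x0*x1, x0 + x1**2)
    return np.array([x[0] * x[1], x[0] + x[1] ** 2])

def vjp(x, y_bar):
    """Hand-written reverse pass for f: returns J_f(x)^T @ y_bar."""
    x_bar = np.zeros(2)
    x_bar[0] += x[1] * y_bar[0]   # through x0 * x1
    x_bar[1] += x[0] * y_bar[0]
    x_bar[0] += y_bar[1]          # through x0 + x1**2
    x_bar[1] += 2 * x[1] * y_bar[1]
    return x_bar

x = np.array([2.0, 3.0])
# One reverse pass per output basis vector e_i gives one Jacobian row each.
J = np.stack([vjp(x, e) for e in np.eye(2)])
# At x = (2, 3) the Jacobian is [[3, 2], [1, 6]].
```

Seeding columns of the identity into forward mode would instead produce `J` one column at a time, which is why the input/output dimensions decide which mode is cheaper.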

Mutability and Reverse Accumulation

Reverse accumulation becomes more complex when programs mutate state.

Consider:

x = 2
x = x * x
x = x + 1

The name x refers to different values over time.

A reverse-mode system must distinguish:

  • variable names
  • storage locations
  • value versions

Static single assignment form helps:

x0 = 2
x1 = x0 * x0
x2 = x1 + 1

Now each value has a unique definition, and reverse propagation is well defined.

This is one reason AD compilers often lower programs into SSA-like intermediate representations.

Failure Modes

Reverse accumulation assumes that the executed computation has differentiable local rules.

Problems arise with:

  • discontinuities
  • integer control flow
  • mutation aliases
  • in-place overwrites
  • undefined domains
  • non-deterministic operations
  • external side effects

A practical AD system must define semantics for these cases.

Often the system differentiates the executed trace only. It does not differentiate all possible traces.

Summary

Reverse accumulation propagates adjoints backward through a recorded computation.

Its central equation is:

$$\bar x += J_\phi^T \bar z$$

For scalar-output functions with many inputs, reverse mode is usually the most efficient way to compute gradients.

Its strength is computational efficiency.

Its cost is memory, bookkeeping, and more complex semantics for control flow, mutation, and state.