Reverse Accumulation Algorithms

Reverse accumulation is the operational core of reverse mode automatic differentiation. The forward pass evaluates a program and records dependency information. The reverse pass traverses that recorded structure backward and accumulates adjoint contributions into each variable.

The word accumulation is important. A variable may influence the final output through many downstream paths. Reverse mode computes the total derivative by summing contributions from all such paths.

For a scalar-valued function

f : \mathbb{R}^n \to \mathbb{R},

reverse accumulation computes

\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)

using one backward traversal.

Forward Evaluation and Recording

Consider a straight-line program:

v_1 = x_1 x_2, \qquad v_2 = \sin(v_1), \qquad v_3 = v_2 + x_1, \qquad y = v_3.

The forward pass computes intermediate values in order.

Step | Expression
-----|----------------
1    | v_1 = x_1 x_2
2    | v_2 = \sin(v_1)
3    | v_3 = v_2 + x_1
4    | y = v_3

During execution, the system records:

  1. operation type;
  2. input references;
  3. output reference;
  4. primal values needed later.

The resulting trace is a topologically ordered sequence of operations.
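
To make the recording step concrete, the sketch below shows one possible trace-entry layout in Python; the TraceOp class and its field names are illustrative assumptions rather than any library's actual format.

import math
from dataclasses import dataclass

@dataclass
class TraceOp:
    # One record per executed primitive operation.
    op: str            # operation type, e.g. "mul", "sin", "add"
    inputs: tuple      # indices of the input variables
    output: int        # index of the output variable
    saved: tuple = ()  # primal values needed later by the reverse rule

# Recording the example program y = sin(x1 * x2) + x1 by hand,
# with variables indexed as x1=0, x2=1, v1=2, v2=3, v3=4.
x1, x2 = 1.5, -0.7
v1 = x1 * x2
v2 = math.sin(v1)
v3 = v2 + x1

tape = [
    TraceOp("mul", inputs=(0, 1), output=2, saved=(x1, x2)),  # v1 = x1 * x2
    TraceOp("sin", inputs=(2,), output=3, saved=(v1,)),       # v2 = sin(v1)
    TraceOp("add", inputs=(3, 0), output=4),                  # v3 = v2 + x1
]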

Reverse Accumulation Principle

For every variable v_i, define the adjoint

\bar v_i = \frac{\partial y}{\partial v_i}.

The reverse pass begins with

\bar y = 1.

Then each operation propagates adjoints backward using local derivative rules.

The chain rule gives

\bar v_i = \sum_j \bar v_j \frac{\partial v_j}{\partial v_i},

where the sum ranges over all immediate children of v_i in the computational graph.

This equation defines reverse accumulation.
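
Concretely, in the example program x_1 feeds both v_1 = x_1 x_2 and v_3 = v_2 + x_1, so its adjoint collects one contribution from each child:

\bar x_1 = \bar v_1 \frac{\partial v_1}{\partial x_1} + \bar v_3 \frac{\partial v_3}{\partial x_1} = \bar v_1 x_2 + \bar v_3.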

Reverse Sweep Example

Start from the previous program.

Forward values:

v_1 = x_1 x_2, \qquad v_2 = \sin(v_1), \qquad v_3 = v_2 + x_1, \qquad y = v_3.

Initialize all adjoints to zero:

\bar x_1 = 0, \qquad \bar x_2 = 0, \qquad \bar v_1 = 0, \qquad \bar v_2 = 0, \qquad \bar v_3 = 0.

Seed output:

\bar y = \bar v_3 = 1.

Now traverse backward.

Step 1: Addition Node

Since

v_3 = v_2 + x_1,

the reverse rule is

\bar v_2 \mathrel{+}= \bar v_3, \qquad \bar x_1 \mathrel{+}= \bar v_3.

Therefore,

\bar v_2 = 1, \qquad \bar x_1 = 1.

Step 2: Sine Node

Since

v_2 = \sin(v_1),

the reverse rule is

\bar v_1 \mathrel{+}= \bar v_2 \cos(v_1).

Thus,

\bar v_1 = \cos(v_1).

Step 3: Multiplication Node

Since

v_1 = x_1 x_2,

the reverse rules are

\bar x_1 \mathrel{+}= \bar v_1 x_2, \qquad \bar x_2 \mathrel{+}= \bar v_1 x_1.

Substituting:

\bar x_1 = 1 + x_2\cos(v_1), \qquad \bar x_2 = x_1\cos(v_1).

Since

v_1 = x_1 x_2,

the final derivatives are

\frac{\partial y}{\partial x_1} = 1 + x_2\cos(x_1x_2), \qquad \frac{\partial y}{\partial x_2} = x_1\cos(x_1x_2).
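
The worked sweep above can be checked with a few lines of Python. The function below replays the same steps and compares the result against central finite differences; it is a hand-written sketch for this specific program, not a general algorithm.

import math

def grad_by_hand(x1, x2):
    # Forward pass: compute and keep the primal values the reverse rules need.
    v1 = x1 * x2
    v2 = math.sin(v1)
    v3 = v2 + x1                      # y = v3

    # Reverse sweep, mirroring Steps 1-3 above.
    x1_bar = x2_bar = v1_bar = v2_bar = 0.0
    v3_bar = 1.0                      # seed: dy/dy = 1
    v2_bar += v3_bar                  # addition node
    x1_bar += v3_bar
    v1_bar += v2_bar * math.cos(v1)   # sine node
    x1_bar += v1_bar * x2             # multiplication node
    x2_bar += v1_bar * x1
    return x1_bar, x2_bar

f = lambda a, b: math.sin(a * b) + a
x1, x2, h = 1.3, -0.6, 1e-6
g1, g2 = grad_by_hand(x1, x2)
print(abs(g1 - (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)) < 1e-6)  # True
print(abs(g2 - (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)) < 1e-6)  # True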

Generic Reverse Accumulation Algorithm

The reverse accumulation procedure can be expressed abstractly.

Suppose the forward pass generated operations

o_1, o_2, \ldots, o_k.

Each operation defines

v_j = g_j(v_{p_1}, \ldots, v_{p_r}).

The reverse algorithm:

  1. initialize all adjoints to zero;
  2. set output adjoint to one;
  3. process operations in reverse order;
  4. apply local pullbacks;
  5. accumulate contributions into parent adjoints.

Pseudocode:

forward:
    for op in operations:
        compute output
        record operation metadata

reverse:
    for v in variables:
        v.adjoint = 0

    output.adjoint = 1

    for op in reverse(operations):
        for input in op.inputs:
            input.adjoint += (
                op.output.adjoint
                * local_derivative(op, input)
            )

This structure is independent of the specific primitive operations.
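
The same structure can be written as a small runnable program. The Var class below is a minimal operator-overloading sketch with invented names (Var, backward, parents); it is not the API of any particular framework.

import math

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuples of (parent Var, local partial d(value)/d(parent))
        self.adjoint = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value, ((self, other.value), (other, self.value)))

def sin(x):
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))

def backward(output):
    # Reverse topological order via depth-first postorder over the parent links.
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            order.append(v)
    visit(output)

    output.adjoint = 1.0                         # seed the output adjoint
    for v in reversed(order):                    # reverse sweep
        for parent, local in v.parents:
            parent.adjoint += v.adjoint * local  # accumulate into parents

# y = sin(x1 * x2) + x1
x1, x2 = Var(1.3), Var(-0.6)
y = sin(x1 * x2) + x1
backward(y)
print(x1.adjoint, x2.adjoint)   # 1 + x2*cos(x1*x2) and x1*cos(x1*x2)

Here the local partial derivatives are stored eagerly when each node is created, so the reverse sweep reduces to the multiply-and-accumulate loop of the pseudocode.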

Local Reverse Rules

Every primitive operation defines a reverse accumulation rule.

Addition

z = x + y

Reverse:

\bar x \mathrel{+}= \bar z, \qquad \bar y \mathrel{+}= \bar z.

Subtraction

z = x - y

Reverse:

\bar x \mathrel{+}= \bar z, \qquad \bar y \mathrel{+}= -\bar z.

Multiplication

z = xy

Reverse:

\bar x \mathrel{+}= \bar z y, \qquad \bar y \mathrel{+}= \bar z x.

Division

z = \frac{x}{y}

Reverse:

\bar x \mathrel{+}= \frac{\bar z}{y}, \qquad \bar y \mathrel{+}= -\bar z \frac{x}{y^2}.

Exponential

z = e^x

Reverse:

\bar x \mathrel{+}= \bar z e^x.

Often implemented as

\bar x \mathrel{+}= \bar z z,

using the stored forward output.

Sine

z = \sin(x)

Reverse:

\bar x \mathrel{+}= \bar z \cos(x).

Each primitive requires only knowledge of its own local derivatives.
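
Such rules are often collected into a table keyed by operation type. The dictionary below is an illustrative sketch; the names LOCAL_DERIVATIVES and pullback are invented for this example.

import math

# Each entry returns the local partial derivative with respect to every input,
# given the saved forward values of that operation.
LOCAL_DERIVATIVES = {
    "add": lambda x, y: (1.0, 1.0),
    "sub": lambda x, y: (1.0, -1.0),
    "mul": lambda x, y: (y, x),
    "div": lambda x, y: (1.0 / y, -x / (y * y)),
    "exp": lambda x: (math.exp(x),),
    "sin": lambda x: (math.cos(x),),
}

def pullback(op, inputs, output_adjoint):
    # Adjoint contribution flowing into each input of one operation.
    partials = LOCAL_DERIVATIVES[op](*inputs)
    return tuple(output_adjoint * d for d in partials)

print(pullback("mul", (2.0, 5.0), 1.0))   # (5.0, 2.0)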

Graph Interpretation

The reverse accumulation algorithm computes gradients by traversing the computational graph backward.

At each node:

  1. receive output adjoint;
  2. compute local contributions;
  3. distribute contributions into inputs.

The full gradient emerges from repeated local applications of the chain rule.

This locality is one of the most important properties of reverse mode systems.

Sparse Dependency Propagation

Many programs have sparse dependency structure.

Example:

y = x_1x_2 + x_3.

Variable x4x_4 does not influence yy.

Thus,

\frac{\partial y}{\partial x_4} = 0.

Reverse accumulation naturally respects sparsity because adjoints only propagate along dependency edges.

Nodes disconnected from the output never receive nonzero adjoints.
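
A small sketch makes this visible: in the adjoint buffer below (a plain dictionary, purely illustrative), no update ever touches the entry for x_4, so it keeps its zero initialization.

# y = x1*x2 + x3, with x4 present but unused.
x = {"x1": 2.0, "x2": 3.0, "x3": 1.0, "x4": 5.0}
adjoint = {name: 0.0 for name in x}     # initialize all adjoints to zero
y_bar = 1.0                             # output seed

# Reverse sweep: contributions flow only along dependency edges.
adjoint["x1"] += y_bar * x["x2"]        # from the product x1*x2
adjoint["x2"] += y_bar * x["x1"]
adjoint["x3"] += y_bar * 1.0            # from the addition
# No operation depends on x4, so adjoint["x4"] is never updated.

print(adjoint["x4"])   # 0.0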

In-Place Adjoint Updates

Most implementations use mutable adjoint buffers.

Operationally:

adjoint[input] += contribution

This avoids repeated allocation and permits efficient accumulation.

However, parallel execution becomes more difficult because several operations may attempt to update the same adjoint simultaneously.

Parallel reverse mode therefore requires:

  1. synchronization;
  2. atomic accumulation;
  3. graph partitioning;
  4. reduction strategies.
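
As one concrete illustration of in-place accumulation with duplicate targets, NumPy's np.add.at performs an unbuffered in-place sum, whereas a plain fancy-indexed += would silently drop repeated updates to the same slot; the buffer layout here is an assumption made for the example.

import numpy as np

adjoint = np.zeros(5)   # preallocated adjoint buffer, one slot per variable

# Two contributions target the same parent (index 0), one targets index 3.
indices = np.array([0, 0, 3])
contributions = np.array([1.0, 2.5, 0.5])

np.add.at(adjoint, indices, contributions)   # unbuffered in-place accumulation
print(adjoint)   # [3.5 0.  0.  0.5 0. ]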

Reverse Topological Ordering

The reverse sweep requires valid dependency order.

Suppose

v_i \to v_j

in the forward graph.

Then reverse propagation requires

\bar v_j

before computing contributions into

\bar v_i.

Thus the reverse pass processes nodes in reverse topological order.

Straight-line programs automatically satisfy this property because the forward execution trace already records operations in topological sequence.
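
For general recorded structures (a DAG rather than a linear trace), a valid order can be obtained with a depth-first postorder from the output. The helper below is a sketch; its name and the dictionary-based graph encoding are assumptions of this example.

def reverse_topological(output, parents):
    # `parents` maps each node to the nodes it directly depends on (assumed acyclic).
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for p in parents.get(node, ()):
            visit(p)
        order.append(node)
    visit(output)
    return list(reversed(order))   # every node appears before the nodes it depends on

# Dependencies of the running example: v1 <- (x1, x2), v2 <- v1, v3 <- (v2, x1).
parents = {"v3": ["v2", "x1"], "v2": ["v1"], "v1": ["x1", "x2"]}
print(reverse_topological("v3", parents))   # ['v3', 'v2', 'v1', 'x2', 'x1']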

Complexity of Reverse Accumulation

Suppose evaluating the original program costs

C.

Reverse accumulation usually costs a small constant multiple of C.

The total cost becomes approximately

Phase        | Cost
-------------|------------
Forward pass | C
Reverse pass | \alpha C
Total        | (1+\alpha)C

Typical values:

2 \le \alpha \le 5.

The precise value depends on:

  1. primitive operation complexity;
  2. memory traffic;
  3. tensor kernel efficiency;
  4. recomputation strategy.

Reverse Accumulation and Memory

Reverse accumulation requires access to forward information.

The reverse rules for many operations depend on saved primal values:

Operation            | Required Data
---------------------|---------------------
multiplication       | inputs
division             | denominator
exponential          | output or input
matrix factorization | decomposition state

The system must therefore preserve enough information from the forward pass.

This creates the central tradeoff in reverse mode:

Strategy         | Cost
-----------------|-------------
Save everything  | high memory
Recompute values | high compute

Checkpointing algorithms balance this tradeoff.
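
A toy version of the recompute side of this tradeoff can be sketched for a simple chain of unary functions: only segment-boundary values are stored during the forward pass, and the values inside a segment are recomputed when the reverse sweep reaches it. The function chain, the two-segment split, and all names here are illustrative assumptions, not a production checkpointing scheme.

import math

fs  = [math.sin, math.exp, math.tanh, math.cos]   # f(x) = cos(tanh(exp(sin(x))))
dfs = [math.cos, math.exp,
       lambda v: 1.0 - math.tanh(v) ** 2,
       lambda v: -math.sin(v)]
segments = [(0, 2), (2, 4)]                        # two checkpointed segments

def forward_with_checkpoints(x):
    checkpoints = [x]
    for start, end in segments:
        for i in range(start, end):
            x = fs[i](x)
        checkpoints.append(x)                      # store only segment boundaries
    return checkpoints

def gradient(x):
    checkpoints = forward_with_checkpoints(x)
    y_bar = 1.0
    for seg in reversed(range(len(segments))):
        start, end = segments[seg]
        # Recompute this segment's intermediate values from its checkpoint.
        vals = [checkpoints[seg]]
        for i in range(start, end):
            vals.append(fs[i](vals[-1]))
        # Reverse sweep within the segment.
        for i in reversed(range(start, end)):
            y_bar *= dfs[i](vals[i - start])
    return y_bar

def f(x):
    for g in fs:
        x = g(x)
    return x

x, h = 0.3, 1e-6
print(abs(gradient(x) - (f(x + h) - f(x - h)) / (2 * h)) < 1e-6)   # True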

Reverse Accumulation as Backward Linear Propagation

Each local reverse rule applies a transposed Jacobian.

Suppose

z = g(x).

The local Jacobian is

J_g(x).

Forward mode propagates tangents:

\dot z = J_g(x)\dot x.

Reverse accumulation propagates adjoints:

\bar x = J_g(x)^\top \bar z.

Thus the reverse pass computes repeated transposed linear transformations.

The global reverse accumulation algorithm is therefore the composition of local transposed Jacobian actions over the computational graph.
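
This relationship can be checked numerically for a single local map. In the NumPy sketch below, g and its Jacobian are chosen arbitrarily for illustration; the check uses the identity \bar z^\top (J \dot x) = (J^\top \bar z)^\top \dot x.

import numpy as np

def g(x):
    # Example local map g(x) = (x0 * x1, sin(x0)).
    return np.array([x[0] * x[1], np.sin(x[0])])

def jacobian(x):
    return np.array([[x[1],         x[0]],
                     [np.cos(x[0]), 0.0]])

x = np.array([1.2, -0.4])
J = jacobian(x)

x_dot = np.array([1.0, 0.0])   # forward mode pushes a tangent:  z_dot = J @ x_dot
z_dot = J @ x_dot

z_bar = np.array([0.3, 2.0])   # reverse mode pulls an adjoint:  x_bar = J.T @ z_bar
x_bar = J.T @ z_bar

print(np.isclose(z_bar @ z_dot, x_bar @ x_dot))   # True: the two pairings agree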

Conceptual Summary

Reverse accumulation algorithms operationalize the multivariate chain rule.

The forward pass builds a dependency trace and computes primal values.

The reverse pass:

  1. starts from output sensitivities;
  2. traverses operations backward;
  3. applies local pullbacks;
  4. accumulates adjoints into dependencies.

Every gradient computed by reverse mode emerges from repeated local adjoint accumulation over the reverse computational graph.