# Reverse Accumulation Algorithms

Reverse accumulation is the operational core of reverse mode automatic differentiation. The forward pass evaluates a program and records dependency information. The reverse pass traverses that recorded structure backward and accumulates adjoint contributions into each variable.

The word *accumulation* is important. A variable may influence the final output through many downstream paths, and reverse mode computes the total derivative by summing the contributions from all of them.

For a scalar-valued function

$$
f : \mathbb{R}^n \to \mathbb{R},
$$

reverse accumulation computes

$$
\nabla f(x) =
\left(
\frac{\partial f}{\partial x_1},
\ldots,
\frac{\partial f}{\partial x_n}
\right)
$$

using one backward traversal.

### Forward Evaluation and Recording

Consider a straight-line program:

$$
v_1 = x_1 x_2,
$$

$$
v_2 = \sin(v_1),
$$

$$
v_3 = v_2 + x_1,
$$

$$
y = v_3.
$$

The forward pass computes intermediate values in order.

| Step | Expression |
|---|---|
| 1 | $v_1 = x_1 x_2$ |
| 2 | $v_2 = \sin(v_1)$ |
| 3 | $v_3 = v_2 + x_1$ |
| 4 | $y = v_3$ |

During execution, the system records:

1. operation type;
2. input references;
3. output reference;
4. primal values needed later.

The resulting trace is a topologically ordered sequence of operations.
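Such a trace can be sketched concretely. The following Python snippet (the `Op` record and `record_forward` name are illustrative, not any particular library's API) evaluates the example program while recording the four pieces of metadata listed above:

```python
import math
from dataclasses import dataclass

@dataclass
class Op:
    kind: str        # 1. operation type
    inputs: tuple    # 2. input references
    output: str      # 3. output reference
    value: float     # 4. primal value needed later

def record_forward(x1, x2):
    """Evaluate the example program, recording a trace as we go."""
    vals = {"x1": x1, "x2": x2}
    trace = []
    vals["v1"] = x1 * x2
    trace.append(Op("mul", ("x1", "x2"), "v1", vals["v1"]))
    vals["v2"] = math.sin(vals["v1"])
    trace.append(Op("sin", ("v1",), "v2", vals["v2"]))
    vals["v3"] = vals["v2"] + x1
    trace.append(Op("add", ("v2", "x1"), "v3", vals["v3"]))
    return vals, trace
```

The list `trace` comes out in execution order, which is automatically a topological order for a straight-line program.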

### Reverse Accumulation Principle

For every variable $v_i$, define the adjoint

$$
\bar v_i =
\frac{\partial y}{\partial v_i}.
$$

The reverse pass begins with

$$
\bar y = 1.
$$

Then each operation propagates adjoints backward using local derivative rules.

The chain rule gives

$$
\bar v_i =
\sum_j
\bar v_j
\frac{\partial v_j}{\partial v_i},
$$

where the sum ranges over all immediate children of $v_i$ in the computational graph.

This equation defines reverse accumulation.

### Reverse Sweep Example

Start from the previous program.

Forward values:

$$
v_1 = x_1 x_2,
$$

$$
v_2 = \sin(v_1),
$$

$$
v_3 = v_2 + x_1,
$$

$$
y = v_3.
$$

Initialize all adjoints to zero:

$$
\bar x_1 = 0,
\qquad
\bar x_2 = 0,
\qquad
\bar v_1 = 0,
\qquad
\bar v_2 = 0,
\qquad
\bar v_3 = 0.
$$

Seed output:

$$
\bar y = \bar v_3 = 1.
$$

Now traverse backward.

#### Step 1: Addition Node

Since

$$
v_3 = v_2 + x_1,
$$

the reverse rule is

$$
\bar v_2 \mathrel{+}= \bar v_3,
$$

$$
\bar x_1 \mathrel{+}= \bar v_3.
$$

Therefore,

$$
\bar v_2 = 1,
\qquad
\bar x_1 = 1.
$$

#### Step 2: Sine Node

Since

$$
v_2 = \sin(v_1),
$$

the reverse rule is

$$
\bar v_1
\mathrel{+}=
\bar v_2 \cos(v_1).
$$

Thus,

$$
\bar v_1 = \cos(v_1).
$$

#### Step 3: Multiplication Node

Since

$$
v_1 = x_1 x_2,
$$

the reverse rules are

$$
\bar x_1
\mathrel{+}=
\bar v_1 x_2,
$$

$$
\bar x_2
\mathrel{+}=
\bar v_1 x_1.
$$

Substituting:

$$
\bar x_1 =
1 + x_2\cos(v_1),
$$

$$
\bar x_2 =
x_1\cos(v_1).
$$

Since

$$
v_1 = x_1 x_2,
$$

the final derivatives are

$$
\frac{\partial y}{\partial x_1} =
1 + x_2\cos(x_1x_2),
$$

$$
\frac{\partial y}{\partial x_2} =
x_1\cos(x_1x_2).
$$
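These closed-form derivatives can be checked numerically. The sketch below compares them against central finite differences (the evaluation point in the test is chosen arbitrarily):

```python
import math

def f(x1, x2):
    return math.sin(x1 * x2) + x1

def analytic_grad(x1, x2):
    # The gradient produced by the reverse sweep above.
    return (1 + x2 * math.cos(x1 * x2), x1 * math.cos(x1 * x2))

def numeric_grad(x1, x2, h=1e-6):
    # Central differences in each coordinate.
    d1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    d2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
    return (d1, d2)
```

At any test point the two gradients should agree to several decimal places.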

### Generic Reverse Accumulation Algorithm

The reverse accumulation procedure can be expressed abstractly.

Suppose the forward pass generated operations

$$
o_1, o_2, \ldots, o_k.
$$

Each operation defines

$$
v_j = g_j(v_{p_1}, \ldots, v_{p_r}).
$$

The reverse algorithm:

1. initialize all adjoints to zero;
2. set output adjoint to one;
3. process operations in reverse order;
4. apply local pullbacks;
5. accumulate contributions into parent adjoints.

Pseudocode:

```text
forward:
    for op in operations:
        compute output
        record operation metadata

reverse:
    for variable in variables:
        adjoint = 0

    output.adjoint = 1

    for op in reverse(operations):
        for input in op.inputs:
            input.adjoint += (
                op.output.adjoint
                * local_derivative(op, input)
            )
```

This structure is independent of the specific primitive operations.

### Local Reverse Rules

Every primitive operation defines a reverse accumulation rule.

#### Addition

$$
z = x + y
$$

Reverse:

$$
\bar x \mathrel{+}= \bar z,
$$

$$
\bar y \mathrel{+}= \bar z.
$$

#### Subtraction

$$
z = x - y
$$

Reverse:

$$
\bar x \mathrel{+}= \bar z,
$$

$$
\bar y \mathrel{+}= -\bar z.
$$

#### Multiplication

$$
z = xy
$$

Reverse:

$$
\bar x \mathrel{+}= \bar z y,
$$

$$
\bar y \mathrel{+}= \bar z x.
$$

#### Division

$$
z = \frac{x}{y}
$$

Reverse:

$$
\bar x \mathrel{+}= \frac{\bar z}{y},
$$

$$
\bar y \mathrel{+}=
-\bar z \frac{x}{y^2}.
$$

#### Exponential

$$
z = e^x
$$

Reverse:

$$
\bar x \mathrel{+}= \bar z e^x.
$$

Often implemented as

$$
\bar x \mathrel{+}= \bar z z,
$$

using the stored forward output.

#### Sine

$$
z = \sin(x)
$$

Reverse:

$$
\bar x \mathrel{+}= \bar z \cos(x).
$$

Each primitive only requires local derivative knowledge.
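In an implementation these rules are often collected into a table of pullback functions, one per primitive. A sketch (the dictionary names and calling convention are illustrative):

```python
import math

# Binary pullbacks: map (zbar, x, y) to the contributions
# accumulated into xbar and ybar, mirroring the rules above.
BINARY_PULLBACKS = {
    "add": lambda zbar, x, y: (zbar, zbar),
    "sub": lambda zbar, x, y: (zbar, -zbar),
    "mul": lambda zbar, x, y: (zbar * y, zbar * x),
    "div": lambda zbar, x, y: (zbar / y, -zbar * x / y ** 2),
}

# Unary pullbacks: map (zbar, x, z) to the contribution for xbar.
# exp uses the stored forward output z instead of recomputing e^x.
UNARY_PULLBACKS = {
    "exp": lambda zbar, x, z: zbar * z,
    "sin": lambda zbar, x, z: zbar * math.cos(x),
}
```

A generic reverse loop then only needs to look up the operation kind and apply the corresponding entry.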

### Graph Interpretation

The reverse accumulation algorithm computes gradients by traversing the computational graph backward.

At each node:

1. receive output adjoint;
2. compute local contributions;
3. distribute contributions into inputs.

The full gradient emerges from repeated local applications of the chain rule.

This locality is one of the most important properties of reverse mode systems.

### Sparse Dependency Propagation

Many programs have sparse dependency structure.

Example: consider a function of four inputs $x_1, x_2, x_3, x_4$ with

$$
y = x_1x_2 + x_3.
$$

The input $x_4$ appears in no operation and therefore does not influence $y$.

Thus,

$$
\frac{\partial y}{\partial x_4} = 0.
$$

Reverse accumulation naturally respects sparsity because adjoints only propagate along dependency edges.

Nodes disconnected from the output never receive nonzero adjoints.
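A hand-unrolled reverse sweep for this example makes the point concrete (the input values are chosen arbitrarily):

```python
# Forward and reverse sweep for y = x1*x2 + x3 over four inputs.
vals = {"x1": 2.0, "x2": 3.0, "x3": 1.0, "x4": 5.0}
adj = {name: 0.0 for name in vals}
adj["v1"] = 0.0

v1 = vals["x1"] * vals["x2"]         # forward: v1 = x1 * x2
ybar = 1.0                           # seed the output adjoint

adj["v1"] += ybar                    # reverse of y = v1 + x3
adj["x3"] += ybar
adj["x1"] += adj["v1"] * vals["x2"]  # reverse of v1 = x1 * x2
adj["x2"] += adj["v1"] * vals["x1"]
# x4 participates in no operation, so no update ever touches it.
```

After the sweep, `adj["x4"]` is still exactly zero: no dependency edge ever routed a contribution into it.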

### In-Place Adjoint Updates

Most implementations use mutable adjoint buffers.

Operationally:

```text
adjoint[input] += contribution
```

This avoids repeated allocation and permits efficient accumulation.

However, parallel execution becomes more difficult because several operations may attempt to update the same adjoint simultaneously.

Parallel reverse mode therefore requires:

1. synchronization;
2. atomic accumulation;
3. graph partitioning;
4. reduction strategies.
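One common accumulation strategy is to guard each adjoint buffer with a lock. The sketch below uses Python threads purely for illustration; the class and function names are hypothetical:

```python
import threading

class LockedAdjoint:
    """An adjoint buffer guarded by a lock so that concurrent
    reverse-pass workers can accumulate into it safely."""
    def __init__(self):
        self.value = 0.0
        self._lock = threading.Lock()

    def accumulate(self, contribution):
        with self._lock:
            self.value += contribution

def parallel_accumulate(contributions):
    adj = LockedAdjoint()
    threads = [threading.Thread(target=adj.accumulate, args=(c,))
               for c in contributions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return adj.value
```

Atomic instructions or per-thread partial sums followed by a reduction serve the same purpose with less contention.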

### Reverse Topological Ordering

The reverse sweep requires valid dependency order.

Suppose

$$
v_i \to v_j
$$

in the forward graph.

Then reverse propagation requires

$$
\bar v_j
$$

before computing contributions into

$$
\bar v_i.
$$

Thus the reverse pass processes nodes in reverse topological order.

Straight-line programs automatically satisfy this property because the forward execution trace already records operations in topological sequence.
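For general graphs the ordering can be made explicit. Python's standard-library `graphlib` can produce one valid order; the dependency map below restates the running example:

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it depends on (its inputs).
deps = {
    "v1": {"x1", "x2"},
    "v2": {"v1"},
    "v3": {"v2", "x1"},
    "y":  {"v3"},
}

forward_order = list(TopologicalSorter(deps).static_order())
reverse_order = list(reversed(forward_order))
# reverse_order visits y before v3, v3 before v2, and so on.
```

Reversing any valid forward order guarantees that every adjoint $\bar v_j$ is complete before contributions flow into its parents.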

### Complexity of Reverse Accumulation

Suppose evaluating the original program costs

$$
C.
$$

Reverse accumulation usually costs a small constant multiple of $C$.

The total cost becomes approximately

| Phase | Cost |
|---|---|
| Forward pass | $C$ |
| Reverse pass | $\alpha C$ |
| Total | $(1+\alpha)C$ |

Typical values:

$$
2 \le \alpha \le 5.
$$

The precise value depends on:

1. primitive operation complexity;
2. memory traffic;
3. tensor kernel efficiency;
4. recomputation strategy.

### Reverse Accumulation and Memory

Reverse accumulation requires access to forward information.

The reverse rules for many operations depend on saved primal values:

| Operation | Required Data |
|---|---|
| multiplication | inputs |
| division | denominator |
| exponential | output or input |
| matrix factorization | decomposition state |

The system must therefore preserve enough information from the forward pass.

This creates the central tradeoff in reverse mode:

| Strategy | Cost |
|---|---|
| Save everything | high memory |
| Recompute values | high compute |

Checkpointing algorithms balance this tradeoff.
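A toy illustration of the two strategies for $y = \sin(x^2)$ (the split into a "store" and a "recompute" version is purely illustrative):

```python
import math

def grad_store(x):
    # Forward pass stores the intermediate v = x*x ...
    v = x * x
    math.sin(v)             # primal output (not needed for the gradient)
    # ... so the reverse pass can read it directly.
    return math.cos(v) * 2 * x

def grad_recompute(x):
    # Forward pass keeps only the checkpoint x, discarding v ...
    math.sin(x * x)         # primal output (not needed for the gradient)
    # ... so the reverse pass must recompute v from the checkpoint.
    v = x * x
    return math.cos(v) * 2 * x
```

Both return the same gradient; the first holds an extra value in memory, the second pays an extra multiplication. Checkpointing applies this exchange at the granularity of whole program segments.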

### Reverse Accumulation as Backward Linear Propagation

Each local reverse rule applies a transposed Jacobian.

Suppose

$$
z = g(x).
$$

The local Jacobian is

$$
J_g(x).
$$

Forward mode propagates tangents:

$$
\dot z = J_g(x)\dot x.
$$

Reverse accumulation propagates adjoints:

$$
\bar x = J_g(x)^\top \bar z.
$$

Thus the reverse pass computes repeated transposed linear transformations.

The global reverse accumulation algorithm is therefore the composition of local transposed Jacobian actions over the computational graph.
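For a vector-valued primitive this transposition is visible directly. The sketch below (plain Python; the example function $g(x_1, x_2) = (x_1x_2, \sin x_1)$ is chosen for illustration) applies $J_g^\top$ to an output adjoint:

```python
import math

def g(x1, x2):
    return (x1 * x2, math.sin(x1))

def jacobian(x1, x2):
    # Rows index outputs, columns index inputs.
    return [[x2, x1],
            [math.cos(x1), 0.0]]

def vjp(x1, x2, zbar):
    """Apply the transposed Jacobian: xbar = J_g(x)^T zbar."""
    J = jacobian(x1, x2)
    return [sum(J[i][j] * zbar[i] for i in range(2)) for j in range(2)]
```

Seeding `zbar` with a standard basis vector extracts one row of $J_g$, which is exactly how reverse mode recovers a full gradient from a scalar output.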

### Conceptual Summary

Reverse accumulation algorithms operationalize the multivariate chain rule.

The forward pass builds a dependency trace and computes primal values.

The reverse pass:

1. starts from output sensitivities;
2. traverses operations backward;
3. applies local pullbacks;
4. accumulates adjoints into dependencies.

Every gradient computed by reverse mode emerges from repeated local adjoint accumulation over the reverse computational graph.

