# Reverse Accumulation

Reverse accumulation is the reverse-mode form of automatic differentiation. It propagates derivative information backward from outputs to inputs.

Forward mode asks:

$$
\text{How does each intermediate value change when the input changes?}
$$

Reverse mode asks:

$$
\text{How does the final output change when each intermediate value changes?}
$$

The second question is the one needed for gradients of scalar-output functions.

## The Adjoint Value

Reverse mode attaches an adjoint to each intermediate variable.

For a final scalar output $y$, the adjoint of an intermediate value $v$ is:

$$
\bar v =
\frac{\partial y}{\partial v}
$$

The bar denotes the sensitivity of the final output with respect to that intermediate variable.

If changing $v$ slightly causes a large change in $y$, then $\bar v$ is large.

Reverse accumulation computes these adjoints from the end of the program back to the beginning.

## Basic Example

Consider:

$$
y=f(x)=\sin(x^2+3x)
$$

Decompose the computation:

| Step | Expression | Variable |
|---:|---|---|
| 1 | $x^2$ | $v_1$ |
| 2 | $3x$ | $v_2$ |
| 3 | $v_1+v_2$ | $v_3$ |
| 4 | $\sin(v_3)$ | $v_4=y$ |

The forward pass computes and stores the primal values:

$$
v_1=x^2,\quad v_2=3x,\quad v_3=v_1+v_2,\quad v_4=\sin(v_3)
$$

The reverse pass starts with:

$$
\bar v_4 = 1
$$

because:

$$
\frac{\partial y}{\partial y}=1
$$

Now propagate backward.

For:

$$
v_4=\sin(v_3)
$$

we have:

$$
\bar v_3 += \bar v_4 \cos(v_3)
$$

For:

$$
v_3=v_1+v_2
$$

we have:

$$
\bar v_1 += \bar v_3
$$

$$
\bar v_2 += \bar v_3
$$

For:

$$
v_2=3x
$$

we have:

$$
\bar x += 3\bar v_2
$$

For:

$$
v_1=x^2
$$

we have:

$$
\bar x += 2x\bar v_1
$$

The final adjoint $\bar x$ is:

$$
\frac{dy}{dx} =
(2x+3)\cos(x^2+3x)
$$
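The whole derivation above can be written as a short program. The following sketch hand-codes the forward and reverse passes for this specific function, following the decomposition into $v_1,\dots,v_4$; the function name `grad_f` is illustrative.

```python
import math

def grad_f(x):
    # Forward pass: compute and store the primal values.
    v1 = x * x
    v2 = 3 * x
    v3 = v1 + v2
    v4 = math.sin(v3)            # y

    # Reverse pass: seed the output adjoint, propagate backward.
    v4_bar = 1.0
    v3_bar = v4_bar * math.cos(v3)   # from v4 = sin(v3)
    v1_bar = v3_bar                  # from v3 = v1 + v2
    v2_bar = v3_bar
    x_bar = 3 * v2_bar               # from v2 = 3x
    x_bar += 2 * x * v1_bar          # from v1 = x^2
    return v4, x_bar

y, dydx = grad_f(1.5)
# Agrees with the closed form (2x + 3) cos(x^2 + 3x).
assert abs(dydx - (2 * 1.5 + 3) * math.cos(1.5**2 + 3 * 1.5)) < 1e-12
```

Note the `+=` on `x_bar`: both intermediate variables feed their contributions into the same input adjoint.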

## Reverse Mode as Local Transpose Propagation

Each primitive operation has a local Jacobian.

Suppose:

$$
z=\phi(x)
$$

Forward mode propagates tangents by multiplying with the local Jacobian:

$$
\dot z = J_\phi \dot x
$$

Reverse mode propagates adjoints by multiplying with the transpose of the local Jacobian:

$$
\bar x += J_\phi^T \bar z
$$

This is the central algebraic distinction.

Forward mode computes Jacobian-vector products.

Reverse mode computes vector-Jacobian products.
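For a linear primitive the distinction is especially concrete, because the local Jacobian is just a matrix. A small sketch, taking $\phi(x) = Ax$ so that $J_\phi = A$:

```python
import numpy as np

# For phi(x) = A @ x, the local Jacobian is A itself.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])           # J_phi, shape (3, 2)

x_dot = np.array([1.0, -1.0])        # tangent on the input
z_bar = np.array([0.5, 0.0, 2.0])    # adjoint on the output

z_dot = A @ x_dot     # forward mode: Jacobian-vector product
x_bar = A.T @ z_bar   # reverse mode: vector-Jacobian product

assert z_dot.shape == (3,)   # lives in the output space
assert x_bar.shape == (2,)   # lives in the input space
```

Tangents flow in the same direction as data; adjoints flow in the opposite direction, which is why the transpose appears.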

## Vector-Jacobian Products

For a function:

$$
f:\mathbb{R}^n\to\mathbb{R}^m
$$

the Jacobian has shape:

$$
J_f(x)\in\mathbb{R}^{m\times n}
$$

Given an output cotangent vector:

$$
\bar y\in\mathbb{R}^m
$$

reverse mode computes:

$$
\bar x =
J_f(x)^T\bar y
$$

Equivalently, in the row-vector convention, it computes:

$$
\bar y^T J_f(x)
$$

This is a vector-Jacobian product.

When $m=1$, the output seed is just $1$, and reverse mode returns the full gradient:

$$
\nabla f(x)
$$

This is why reverse mode is efficient for scalar losses with many parameters.
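A minimal illustration of the scalar case, using $f(x) = x \cdot x$, whose Jacobian is the $1 \times n$ row vector $2x^T$:

```python
import numpy as np

# For f(x) = x . x, the Jacobian is the row vector 2 x^T,
# so seeding with y_bar = 1 recovers the full gradient in one pass.
x = np.array([1.0, 2.0, 3.0])
J = (2 * x).reshape(1, -1)    # Jacobian, shape (1, n)
y_bar = np.array([1.0])       # output seed for a scalar output
grad = J.T @ y_bar            # one VJP = the gradient 2x

assert np.allclose(grad, 2 * x)
```

One reverse pass produces all $n$ partial derivatives, regardless of how large $n$ is.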

## Reverse Propagation Rules

Reverse propagation uses local rules that distribute output adjoints to input adjoints.

### Addition

$$
z=x+y
$$

Local derivatives:

$$
\frac{\partial z}{\partial x}=1,\quad
\frac{\partial z}{\partial y}=1
$$

Reverse rule:

$$
\bar x += \bar z
$$

$$
\bar y += \bar z
$$

### Multiplication

$$
z=xy
$$

Local derivatives:

$$
\frac{\partial z}{\partial x}=y,\quad
\frac{\partial z}{\partial y}=x
$$

Reverse rule:

$$
\bar x += y\bar z
$$

$$
\bar y += x\bar z
$$

### Sine

$$
z=\sin(x)
$$

Local derivative:

$$
\frac{dz}{dx}=\cos(x)
$$

Reverse rule:

$$
\bar x += \cos(x)\bar z
$$

### Matrix Multiplication

$$
C=AB
$$

Reverse rules:

$$
\bar A += \bar C B^T
$$

$$
\bar B += A^T \bar C
$$

The same principle holds: local derivatives are applied backward, and adjoints are accumulated.
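The matrix rules can be checked numerically. The sketch below builds a scalar objective $y = \sum_{ij} \bar C_{ij} C_{ij}$, for which the adjoint of $C$ is exactly $\bar C$, and compares the reverse rule for $\bar A$ against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C_bar = rng.standard_normal((3, 2))   # adjoint seed on C = A @ B

A_bar = C_bar @ B.T   # reverse rule for A
B_bar = A.T @ C_bar   # reverse rule for B

# Finite-difference check on y(A) = sum(C_bar * (A @ B)) at entry (1, 2).
eps = 1e-6
i, j = 1, 2
E = np.zeros_like(A)
E[i, j] = eps
num = (np.sum(C_bar * ((A + E) @ B))
       - np.sum(C_bar * ((A - E) @ B))) / (2 * eps)
assert abs(num - A_bar[i, j]) < 1e-5
```

The shapes are forced: $\bar A$ must match $A$, and $\bar C B^T$ is the only way to combine $\bar C$ and $B$ that produces that shape.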

## Accumulation at Shared Inputs

A variable may influence the output through multiple paths.

Example:

$$
y=x^2+x
$$

The variable $x$ contributes through:
- $x^2$
- $x$

Reverse mode must add both contributions.

Using intermediate variables:

$$
v_1=x^2
$$

$$
v_2=x
$$

$$
y=v_1+v_2
$$

Backward propagation gives:

$$
\bar v_1=1,\quad \bar v_2=1
$$

Then:

$$
\bar x += 2x\bar v_1
$$

and:

$$
\bar x += \bar v_2
$$

Therefore:

$$
\bar x=2x+1
$$

This is why reverse-mode implementations use additive updates such as:

```text
x.grad += contribution
```

rather than assignment.
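The two-path example above, written out with explicit additive updates:

```python
# y = x^2 + x at x = 3, with two paths into x.
x = 3.0
v1 = x * x
v2 = x
y = v1 + v2

v1_bar = 1.0              # from y = v1 + v2
v2_bar = 1.0
x_bar = 0.0
x_bar += 2 * x * v1_bar   # contribution through x^2
x_bar += v2_bar           # contribution through x
assert x_bar == 2 * x + 1
```

With plain assignment instead of `+=`, the second update would overwrite the first and the gradient would be wrong.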

## Forward Pass and Reverse Pass

Reverse accumulation usually requires two passes.

First pass:

1. execute the primal program
2. record intermediate values
3. record dependency structure

Second pass:

1. seed output adjoints
2. traverse operations in reverse order
3. apply local transpose rules
4. accumulate input adjoints

The recorded structure is often called a tape, trace, or Wengert list.

A simple tape entry might contain:

```text
op: mul
inputs: x, y
output: z
saved: x_value, y_value
backward:
    x.grad += y.value * z.grad
    y.grad += x.value * z.grad
```

The saved primal values are needed because reverse rules often depend on them.
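The tape entry above can be realized as a closure that captures the saved primal values. The following is a minimal sketch, assuming scalar values and only `add`/`mul` primitives; the names `Var` and `tape` are illustrative, not any particular library's API:

```python
class Var:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

tape = []  # backward closures, recorded in forward execution order

def mul(x, y):
    z = Var(x.value * y.value)
    def backward():
        # The saved primals x.value, y.value are closed over here.
        x.grad += y.value * z.grad
        y.grad += x.value * z.grad
    tape.append(backward)
    return z

def add(x, y):
    z = Var(x.value + y.value)
    def backward():
        x.grad += z.grad
        y.grad += z.grad
    tape.append(backward)
    return z

# Forward pass for y = x*x + x, then reverse pass over the tape.
x = Var(3.0)
y = add(mul(x, x), x)
y.grad = 1.0                      # seed the output adjoint
for backward in reversed(tape):
    backward()
assert x.grad == 2 * 3.0 + 1      # matches d/dx (x^2 + x)
```

Traversing the tape in reverse order guarantees that every operation's output adjoint is complete before that operation distributes it to its inputs.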

## Cost Model

Let $C(f)$ be the cost of evaluating the primal function.

For scalar-output functions:

$$
f:\mathbb{R}^n\to\mathbb{R}
$$

reverse mode computes the full gradient with cost usually within a small constant multiple of $C(f)$.

The important point is that the cost does not scale linearly with $n$ in the way forward mode does for a full gradient.

This makes reverse mode the standard method for training neural networks and optimizing large parameterized models.

## Memory Cost

Reverse mode trades memory for derivative efficiency.

The reverse pass often needs intermediate primal values from the forward pass.

For a long computation, this can require substantial memory:

$$
\text{memory} \propto \text{number of recorded operations}
$$

This cost appears in:
- deep neural networks
- long simulations
- recurrent models
- differentiable physics
- solver-based computations

Checkpointing reduces memory by storing only selected states and recomputing others during the reverse pass.
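A sketch of the checkpointing idea for a long chain of identical steps. Here `step` and its reverse rule `step_vjp` are illustrative choices (repeated application of $\sin$); only every $k$-th state is stored, and intermediate states are recomputed from the nearest checkpoint during the reverse pass:

```python
import math

def step(v):             # one primal step
    return math.sin(v)

def step_vjp(v, v_bar):  # its reverse rule: d sin(v)/dv = cos(v)
    return math.cos(v) * v_bar

n, k = 12, 4
x = 0.7
checkpoints = {0: x}
v = x
for i in range(n):
    v = step(v)
    if (i + 1) % k == 0:
        checkpoints[i + 1] = v     # store only every k-th state

# Reverse pass: recompute each needed state from its checkpoint.
v_bar = 1.0
for i in reversed(range(n)):
    base = (i // k) * k
    u = checkpoints[base]
    for _ in range(i - base):      # recompute the input to step i
        u = step(u)
    v_bar = step_vjp(u, v_bar)
```

Storage drops from $n$ states to roughly $n/k$, at the price of extra forward recomputation during the reverse pass.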

## Reverse Mode and Backpropagation

Backpropagation is reverse-mode automatic differentiation applied to neural networks.

A neural network is a composition of layers:

$$
y = f_L(f_{L-1}(\cdots f_1(x)))
$$

The loss is usually scalar:

$$
\ell(\theta)=L(y,t)
$$

Reverse mode propagates:

$$
\frac{\partial \ell}{\partial y},
\frac{\partial \ell}{\partial f_L},
\dots,
\frac{\partial \ell}{\partial \theta}
$$

Layer by layer, the system applies local transpose rules.

Backpropagation is therefore not a separate mathematical technique from AD. It is a specialized instance of reverse accumulation.
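The layer-by-layer picture can be made concrete with a two-layer network and a squared-error loss. The shapes, the `tanh` nonlinearity, and the names `W1`, `W2` are illustrative; each backward line is one of the local transpose rules from earlier sections:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
t = rng.standard_normal(2)            # target
W1 = rng.standard_normal((3, 4))
W2 = rng.standard_normal((2, 3))

# Forward pass.
h = np.tanh(W1 @ x)
y = W2 @ h
loss = 0.5 * np.sum((y - t) ** 2)

# Reverse pass, layer by layer.
y_bar = y - t                         # d loss / d y
W2_bar = np.outer(y_bar, h)           # matrix-multiplication rule
h_bar = W2.T @ y_bar
a_bar = (1 - h ** 2) * h_bar          # tanh rule, elementwise
W1_bar = np.outer(a_bar, x)
```

Each `*_bar` is an adjoint in the sense defined above, and every parameter gradient falls out of the same backward sweep that computes the input adjoints.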

## Reverse Mode with Multiple Outputs

If:

$$
f:\mathbb{R}^n\to\mathbb{R}^m
$$

and $m>1$, a reverse pass computes:

$$
J_f(x)^T\bar y
$$

for one chosen output seed $\bar y$.

To compute the full Jacobian with reverse mode, one can seed each output basis vector:

$$
\bar y=e_i
$$

Each reverse pass gives one row of the Jacobian.

Thus:
- forward mode computes Jacobian columns
- reverse mode computes Jacobian rows

Reverse mode is preferred when output dimension is small.

Forward mode is preferred when input dimension is small.
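The row-by-row recovery of a Jacobian can be sketched directly. For a small hand-differentiated function, seeding each basis vector $e_i$ into its VJP yields one Jacobian row per reverse pass:

```python
import numpy as np

def f(x):
    # f : R^2 -> R^3
    return np.array([x[0] * x[1], x[0] + x[1], x[1] ** 2])

def vjp(x, y_bar):
    # Hand-written transpose-Jacobian product for this f.
    J = np.array([[x[1], x[0]],
                  [1.0,  1.0],
                  [0.0,  2 * x[1]]])
    return J.T @ y_bar

x = np.array([2.0, 3.0])
rows = [vjp(x, np.eye(3)[i]) for i in range(3)]  # one pass per e_i
J_full = np.stack(rows)                          # shape (3, 2)

assert np.allclose(J_full, [[3.0, 2.0],
                            [1.0, 1.0],
                            [0.0, 6.0]])
```

Three reverse passes here versus two forward passes for the same Jacobian, which is exactly the $m$-versus-$n$ trade-off stated above.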

## Mutability and Reverse Accumulation

Reverse accumulation becomes more complex when programs mutate state.

Consider:

```text
x = 2
x = x * x
x = x + 1
```

The name $x$ refers to different values over time.

A reverse-mode system must distinguish:
- variable names
- storage locations
- value versions

Static single assignment form helps:

```text
x0 = 2
x1 = x0 * x0
x2 = x1 + 1
```

Now each value has a unique definition, and reverse propagation is well defined.

This is one reason AD compilers often lower programs into SSA-like intermediate representations.

## Failure Modes

Reverse accumulation assumes that the executed computation has differentiable local rules.

Problems arise with:
- discontinuities
- integer control flow
- aliased mutation
- in-place overwrites
- undefined domains
- non-deterministic operations
- external side effects

A practical AD system must define semantics for these cases.

Often the system differentiates the executed trace only. It does not differentiate all possible traces.

## Summary

Reverse accumulation propagates adjoints backward through a recorded computation.

Its central equation is:

$$
\bar x += J_\phi^T\bar z
$$

For scalar-output functions with many inputs, reverse mode is usually the most efficient way to compute gradients.

Its strength is computational efficiency.

Its cost is memory, bookkeeping, and more complex semantics for control flow, mutation, and state.

