
Chapter 6. Reverse Mode Automatic Differentiation

Adjoint Propagation

Reverse mode automatic differentiation computes derivatives by propagating sensitivities backward through a computation. In forward mode, each intermediate value carries a tangent, which says how that value changes when the input is perturbed. In reverse mode, each intermediate value carries an adjoint, which says how the final output changes when that intermediate value is perturbed.

Consider a scalar-valued function

$$ f : \mathbb{R}^n \to \mathbb{R}. $$

A program evaluates $f(x)$ by producing intermediate values

$$ v_1, v_2, \ldots, v_k. $$

Some of these values are inputs. Others are produced by elementary operations. The final output is one of the later values, usually written as

$$ y = v_k = f(x). $$

For each intermediate value $v_i$, reverse mode defines an adjoint

$$ \bar v_i = \frac{\partial y}{\partial v_i}. $$

The bar notation means “sensitivity of the output with respect to this variable.” If $\bar v_i$ is large, a small change in $v_i$ causes a large change in the final output. If $\bar v_i = 0$, changing $v_i$ has no first-order effect on the output.

From Local Derivatives to Global Sensitivities

Suppose an intermediate value $v_j$ depends on earlier values:

$$ v_j = g(v_a, v_b). $$

If the final output $y$ depends on $v_j$, then $v_a$ and $v_b$ influence $y$ through $v_j$. The chain rule gives

$$ \frac{\partial y}{\partial v_a} \mathrel{+}= \frac{\partial y}{\partial v_j} \frac{\partial v_j}{\partial v_a}, $$

and similarly,

$$ \frac{\partial y}{\partial v_b} \mathrel{+}= \frac{\partial y}{\partial v_j} \frac{\partial v_j}{\partial v_b}. $$

Using adjoint notation:

$$ \bar v_a \mathrel{+}= \bar v_j \frac{\partial v_j}{\partial v_a}, \qquad \bar v_b \mathrel{+}= \bar v_j \frac{\partial v_j}{\partial v_b}. $$

This is the basic rule of adjoint propagation.

The key detail is accumulation. A variable may influence the output through several later paths. Reverse mode must add all contributions into the same adjoint slot.
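As a sketch, the accumulation rule can be expressed with a dictionary of adjoints. The variable names and the numeric local derivatives below are purely illustrative:

```python
from collections import defaultdict

# Adjoints start at zero; every chain-rule contribution is *added*,
# never assigned, because a value can feed several later operations.
adjoint = defaultdict(float)

# Suppose v_j = g(v_a, v_b) with local partials dvj_dva and dvj_dvb,
# and the adjoint of v_j is already known.
adjoint["v_j"] = 3.0          # sensitivity of the output w.r.t. v_j
dvj_dva, dvj_dvb = 2.0, 5.0   # hypothetical local derivatives

adjoint["v_a"] += adjoint["v_j"] * dvj_dva
adjoint["v_b"] += adjoint["v_j"] * dvj_dvb

assert adjoint["v_a"] == 6.0 and adjoint["v_b"] == 15.0
```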

Example: A Small Computation

Let

$$ x_1 = a, \qquad x_2 = b, $$

and define

$$ v_1 = x_1 x_2, \qquad v_2 = \sin(x_1), \qquad v_3 = v_1 + v_2. $$

The output is

$$ y = v_3. $$

The forward evaluation computes the values in order:

| Step | Expression | Value |
|------|------------|-------|
| 1 | $v_1 = x_1 x_2$ | product |
| 2 | $v_2 = \sin(x_1)$ | sine |
| 3 | $v_3 = v_1 + v_2$ | output |

Reverse mode starts from the output:

$$ \bar v_3 = \frac{\partial y}{\partial v_3} = 1. $$

Now propagate backward.

Since

$$ v_3 = v_1 + v_2, $$

we have

$$ \bar v_1 \mathrel{+}= \bar v_3 \cdot 1, \qquad \bar v_2 \mathrel{+}= \bar v_3 \cdot 1. $$

So

$$ \bar v_1 = 1, \qquad \bar v_2 = 1. $$

Since

$$ v_2 = \sin(x_1), $$

we propagate

$$ \bar x_1 \mathrel{+}= \bar v_2 \cos(x_1). $$

Since

$$ v_1 = x_1 x_2, $$

we propagate

$$ \bar x_1 \mathrel{+}= \bar v_1 x_2, \qquad \bar x_2 \mathrel{+}= \bar v_1 x_1. $$

Therefore,

$$ \bar x_1 = x_2 + \cos(x_1), \qquad \bar x_2 = x_1. $$

These are exactly

$$ \frac{\partial y}{\partial x_1} = x_2 + \cos(x_1), \qquad \frac{\partial y}{\partial x_2} = x_1. $$

The computation has one output and two inputs. Reverse mode computes both input derivatives in one backward pass.
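The backward sweep just described can be written out by hand as straight-line Python. This is a sketch of the example above; the function names `f` and `grad_f` are our own:

```python
import math

def f(x1, x2):
    # Forward pass: y = x1*x2 + sin(x1).
    v1 = x1 * x2
    v2 = math.sin(x1)
    v3 = v1 + v2
    return v3

def grad_f(x1, x2):
    # Forward pass (primal values needed by the backward rules).
    v1 = x1 * x2
    v2 = math.sin(x1)
    # Reverse pass: seed the output adjoint with 1.
    v3_bar = 1.0
    # v3 = v1 + v2: addition passes the adjoint through unchanged.
    v1_bar = v3_bar * 1.0
    v2_bar = v3_bar * 1.0
    # v2 = sin(x1): multiply by the local derivative cos(x1).
    x1_bar = v2_bar * math.cos(x1)
    # v1 = x1 * x2: each input receives the adjoint times the other input.
    x1_bar += v1_bar * x2
    x2_bar = v1_bar * x1
    return x1_bar, x2_bar

g1, g2 = grad_f(0.5, 2.0)
# Analytic gradient: (x2 + cos(x1), x1).
assert abs(g1 - (2.0 + math.cos(0.5))) < 1e-12
assert abs(g2 - 0.5) < 1e-12
```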

Why Reverse Mode Is Efficient for Scalar Outputs

Reverse mode is most useful when the function has many inputs and few outputs:

$$ f : \mathbb{R}^n \to \mathbb{R}^m, $$

with $m$ small, especially $m = 1$.

For a scalar output function

$$ f : \mathbb{R}^n \to \mathbb{R}, $$

the full gradient is

$$ \nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right). $$

Forward mode computes directional derivatives. To recover the full gradient by forward mode, we usually seed one input direction at a time. That requires roughly $n$ forward passes.

Reverse mode computes all components of the gradient in one reverse pass after one forward pass. This explains its central role in machine learning, where a loss function usually has millions or billions of parameters but only one scalar loss.

Local Pullbacks

Each primitive operation in reverse mode can be described by a pullback.

Suppose

$$ z = g(x). $$

The forward pass computes $z$. The reverse pass receives $\bar z$, the sensitivity of the output with respect to $z$, and returns the sensitivity with respect to $x$:

$$ \bar x = \bar z \frac{\partial z}{\partial x}. $$

For a multivariate primitive

$$ z = g(x_1, x_2, \ldots, x_r), $$

the pullback maps one output adjoint into several input adjoints:

$$ \bar x_i \mathrel{+}= \bar z \frac{\partial z}{\partial x_i}. $$

The primitive therefore needs two pieces of information:

  1. the local derivative rule;
  2. the primal values needed by that rule.

For example, for multiplication

$$ z = xy, $$

the reverse rules are

$$ \bar x \mathrel{+}= \bar z \, y, \qquad \bar y \mathrel{+}= \bar z \, x. $$

The backward rule needs the saved forward values $x$ and $y$. This is why reverse mode usually stores a tape or graph during the forward pass.
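One common way to package a primitive and its reverse rule is a closure that captures the saved primal values. This is a sketch of the idea; the name `mul_with_pullback` is ours, not a particular library's API:

```python
def mul_with_pullback(x, y):
    """Forward-compute z = x * y and return a pullback closure.

    The closure captures the primal values x and y, which the
    backward rule for multiplication needs.
    """
    z = x * y

    def pullback(z_bar):
        # xbar = zbar * y,  ybar = zbar * x
        return z_bar * y, z_bar * x

    return z, pullback

z, pb = mul_with_pullback(3.0, 4.0)
assert z == 12.0
assert pb(1.0) == (4.0, 3.0)  # d(xy)/dx = y, d(xy)/dy = x
```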

Adjoint Propagation Algorithm

A minimal reverse mode algorithm has two phases.

First, run the program forward. Store each primitive operation and the values needed for its derivative.

Second, initialize the output adjoint to one:

$$ \bar y = 1. $$

Then traverse the recorded operations in reverse order. For each operation, apply its pullback and accumulate adjoints into its inputs.

In pseudocode:

forward:
    for op in program_order:
        compute output value
        record op, inputs, output, needed primal values

reverse:
    set all adjoints to 0
    output.adjoint = 1

    for op in reverse(program_order):
        apply local pullback
        add contributions to input adjoints

This algorithm works because the computation graph is acyclic when viewed as a single execution trace. Even if the source program contains loops or branches, one concrete execution produces a finite ordered trace of primitive operations.
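The pseudocode can be fleshed out into a small runnable tape. This is a minimal sketch; the class and function names (`Tape`, `Var`, `gradient`) are our own illustrative choices, not a standard API:

```python
import math

class Tape:
    """Records the execution trace: one entry per primitive operation."""
    def __init__(self):
        self.ops = []  # (inputs, output, local partial derivatives)

    def record(self, inputs, output, partials):
        self.ops.append((inputs, output, partials))

class Var:
    _tape = None  # tape shared by all operations during one evaluation

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        out = Var(self.value + other.value)
        Var._tape.record((self, other), out, (1.0, 1.0))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value)
        # d(xy)/dx = y, d(xy)/dy = x: save the primal values on the tape.
        Var._tape.record((self, other), out, (other.value, self.value))
        return out

def sin(v):
    out = Var(math.sin(v.value))
    Var._tape.record((v,), out, (math.cos(v.value),))
    return out

def gradient(f, *args):
    # Phase 1: run forward, recording every primitive on the tape.
    Var._tape = tape = Tape()
    xs = [Var(a) for a in args]
    y = f(*xs)
    # Phase 2: seed the output adjoint, sweep the tape in reverse,
    # and accumulate each pullback's contributions into the inputs.
    adjoint = {id(y): 1.0}
    for inputs, output, partials in reversed(tape.ops):
        out_bar = adjoint.get(id(output), 0.0)
        for v, dv in zip(inputs, partials):
            adjoint[id(v)] = adjoint.get(id(v), 0.0) + out_bar * dv
    return [adjoint.get(id(x), 0.0) for x in xs]

# y = x1*x2 + sin(x1), as in the worked example.
g = gradient(lambda x1, x2: x1 * x2 + sin(x1), 0.5, 2.0)
assert abs(g[0] - (2.0 + math.cos(0.5))) < 1e-12
assert abs(g[1] - 0.5) < 1e-12
```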

Accumulation at Shared Variables

Shared variables are common in real programs. Consider

$$ v_1 = x^2, \qquad v_2 = x^3, \qquad y = v_1 + v_2. $$

The input $x$ affects $y$ through both $v_1$ and $v_2$. Reverse mode must add both contributions:

$$ \bar x = \bar v_1 \frac{\partial v_1}{\partial x} + \bar v_2 \frac{\partial v_2}{\partial x}. $$

Since

$$ \bar v_1 = 1, \qquad \bar v_2 = 1, $$

we get

$$ \bar x = 2x + 3x^2. $$

A system that overwrites adjoints instead of accumulating them computes the wrong gradient. This is one of the most important implementation details in reverse mode.
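A small sketch makes the failure mode concrete. `grad_accumulate` follows the accumulation rule for $y = x^2 + x^3$; `grad_overwrite` shows the bug (both function names are ours):

```python
def grad_accumulate(x):
    # y = x**2 + x**3; both paths contribute to x's adjoint.
    v1_bar = 1.0  # adjoint of v1 = x**2
    v2_bar = 1.0  # adjoint of v2 = x**3
    x_bar = 0.0
    x_bar += v1_bar * 2 * x       # contribution through v1
    x_bar += v2_bar * 3 * x ** 2  # contribution through v2
    return x_bar

def grad_overwrite(x):
    # Buggy variant: plain assignment discards the first contribution.
    x_bar = 1.0 * 2 * x
    x_bar = 1.0 * 3 * x ** 2  # overwrites instead of accumulating
    return x_bar

x = 2.0
assert grad_accumulate(x) == 2 * x + 3 * x ** 2  # 16.0: correct
assert grad_overwrite(x) == 3 * x ** 2           # 12.0: wrong gradient
```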

Vector Outputs and Seed Adjoints

For a vector-valued function

$$ f : \mathbb{R}^n \to \mathbb{R}^m, $$

reverse mode does not directly compute the full Jacobian in one pass. Instead, it computes a vector-Jacobian product.

Given a seed vector

$$ \lambda \in \mathbb{R}^m, $$

reverse mode computes

$$ \lambda^\top J_f(x), $$

where

$$ J_f(x) = \frac{\partial f}{\partial x}. $$

The scalar-output case is the special case $m = 1$, where the seed is simply

$$ \lambda = 1. $$

This is why reverse mode is naturally suited to scalar losses. For multiple outputs, each independent seed gives one linear combination of Jacobian rows.
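As a sketch, here is a vector-Jacobian product for a small hand-chosen function $f : \mathbb{R}^2 \to \mathbb{R}^2$, computed by adjoint propagation and checked against the explicit Jacobian (the function and the name `f_vjp` are our own example):

```python
import numpy as np

def f_vjp(x, lam):
    """Vector-Jacobian product lam^T J_f(x) for
    f(x) = (x0**2 + x1, x0 * x1), computed by adjoint propagation."""
    x0, x1 = x
    x_bar = np.zeros(2)
    # Seed each output adjoint with the matching entry of lam,
    # then accumulate into the input adjoints.
    # Output 0: y0 = x0**2 + x1
    x_bar[0] += lam[0] * 2 * x0
    x_bar[1] += lam[0] * 1.0
    # Output 1: y1 = x0 * x1
    x_bar[0] += lam[1] * x1
    x_bar[1] += lam[1] * x0
    return x_bar

x = np.array([2.0, 3.0])
lam = np.array([1.0, -1.0])
J = np.array([[2 * x[0], 1.0],
              [x[1],     x[0]]])
assert np.allclose(f_vjp(x, lam), lam @ J)
```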

Reverse Mode as Transposed Linearization

Forward mode propagates tangents through the linearization of the program. Reverse mode propagates adjoints through the transpose of that linearization.

For a local operation

$$ z = g(x), $$

the differential is

$$ \dot z = J_g(x)\,\dot x. $$

Forward mode applies $J_g(x)$ to a tangent vector.

Reverse mode applies the transpose:

$$ \bar x = J_g(x)^\top \bar z. $$

This viewpoint is important because it separates the primal computation from the derivative computation. The forward pass evaluates the original program. The reverse pass evaluates transposed local linear maps in reverse order.

For a composition

$$ y = g_k \circ g_{k-1} \circ \cdots \circ g_1(x), $$

the forward linearization is

$$ J_y = J_{g_k} J_{g_{k-1}} \cdots J_{g_1}. $$

The reverse pass applies the transpose in the opposite order:

$$ J_y^\top = J_{g_1}^\top \cdots J_{g_{k-1}}^\top J_{g_k}^\top. $$

That reversal is the mathematical reason reverse mode runs backward through the program trace.
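The reversal can be checked numerically with random stand-in Jacobians, an illustrative NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Local Jacobians of three composed maps g1, g2, g3 (random stand-ins).
J1 = rng.standard_normal((4, 3))  # J_{g1}: R^3 -> R^4
J2 = rng.standard_normal((5, 4))  # J_{g2}: R^4 -> R^5
J3 = rng.standard_normal((2, 5))  # J_{g3}: R^5 -> R^2

# Forward mode: push a tangent through the Jacobians in program order.
xdot = rng.standard_normal(3)
ydot = J3 @ (J2 @ (J1 @ xdot))

# Reverse mode: pull an adjoint through the transposes in reverse order.
ybar = rng.standard_normal(2)
xbar = J1.T @ (J2.T @ (J3.T @ ybar))

# Both agree with the full Jacobian J = J3 J2 J1 and its transpose.
J = J3 @ J2 @ J1
assert np.allclose(ydot, J @ xdot)
assert np.allclose(xbar, J.T @ ybar)
```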

Operational Meaning

Adjoint propagation gives each intermediate value a local responsibility: when the reverse pass reaches an operation, it distributes the output sensitivity to that operation’s inputs.

The operation does not need to know the whole program. It only needs its local derivative rule and the adjoint of its result.

This locality is what makes reverse mode modular. A large differentiable program can be built from primitive operations, each with a small backward rule. The global gradient emerges by applying the chain rule over the recorded execution.