
Chapter 6. Reverse Mode Automatic Differentiation

Adjoint Propagation

Reverse mode automatic differentiation computes derivatives by propagating sensitivities backward through a computation. In forward mode, each intermediate value carries a tangent, which says how that value changes when the input is perturbed. In reverse mode, each intermediate value carries an adjoint, which says how the final output changes when that intermediate value is perturbed.

Consider a scalar-valued function

$$ f : \mathbb{R}^n \to \mathbb{R}. $$

A program evaluates $f(x)$ by producing intermediate values

$$ v_1, v_2, \ldots, v_k. $$

Some of these values are inputs. Others are produced by elementary operations. The final output is one of the later values, usually written as

$$ y = v_k = f(x). $$

For each intermediate value $v_i$, reverse mode defines an adjoint

$$ \bar v_i = \frac{\partial y}{\partial v_i}. $$

The bar notation means “sensitivity of the output with respect to this variable.” If $\bar v_i$ is large, a small change in $v_i$ causes a large change in the final output. If $\bar v_i = 0$, changing $v_i$ has no first-order effect on the output.

From Local Derivatives to Global Sensitivities

Suppose an intermediate value $v_j$ depends on earlier values:

$$ v_j = g(v_a, v_b). $$

If the final output $y$ depends on $v_j$, then $v_a$ and $v_b$ influence $y$ through $v_j$. The chain rule gives

$$ \frac{\partial y}{\partial v_a} \mathrel{+}= \frac{\partial y}{\partial v_j} \frac{\partial v_j}{\partial v_a}, $$

and similarly,

$$ \frac{\partial y}{\partial v_b} \mathrel{+}= \frac{\partial y}{\partial v_j} \frac{\partial v_j}{\partial v_b}. $$

Using adjoint notation:

$$ \bar v_a \mathrel{+}= \bar v_j \frac{\partial v_j}{\partial v_a}, \qquad \bar v_b \mathrel{+}= \bar v_j \frac{\partial v_j}{\partial v_b}. $$

This is the basic rule of adjoint propagation.

The key detail is accumulation. A variable may influence the output through several later paths. Reverse mode must add all contributions into the same adjoint slot.
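As a sketch, the accumulation rule can be expressed with a dictionary of adjoints. The variable names and the numeric local derivatives below are purely illustrative:

```python
from collections import defaultdict

# Adjoints start at zero; every chain-rule contribution is *added*,
# never assigned, because a value can feed several later operations.
adjoint = defaultdict(float)

# Suppose v_j = g(v_a, v_b) with local partials dvj_dva and dvj_dvb,
# and the adjoint of v_j is already known.
adjoint["v_j"] = 3.0          # sensitivity of the output w.r.t. v_j
dvj_dva, dvj_dvb = 2.0, 5.0   # hypothetical local derivatives

adjoint["v_a"] += adjoint["v_j"] * dvj_dva
adjoint["v_b"] += adjoint["v_j"] * dvj_dvb

assert adjoint["v_a"] == 6.0 and adjoint["v_b"] == 15.0
```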

Example: A Small Computation

Let

$$ x_1 = a, \qquad x_2 = b, $$

and define

$$ v_1 = x_1 x_2, \qquad v_2 = \sin(x_1), \qquad v_3 = v_1 + v_2. $$

The output is

$$ y = v_3. $$

The forward evaluation computes the values in order:

| Step | Expression | Value |
|------|------------|-------|
| 1 | $v_1 = x_1 x_2$ | product |
| 2 | $v_2 = \sin(x_1)$ | sine |
| 3 | $v_3 = v_1 + v_2$ | output |

Reverse mode starts from the output:

$$ \bar v_3 = \frac{\partial y}{\partial v_3} = 1. $$

Now propagate backward.

Since

$$ v_3 = v_1 + v_2, $$

we have

$$ \bar v_1 \mathrel{+}= \bar v_3 \cdot 1, \qquad \bar v_2 \mathrel{+}= \bar v_3 \cdot 1. $$

So

$$ \bar v_1 = 1, \qquad \bar v_2 = 1. $$

Since

$$ v_2 = \sin(x_1), $$

we propagate

$$ \bar x_1 \mathrel{+}= \bar v_2 \cos(x_1). $$

Since

$$ v_1 = x_1 x_2, $$

we propagate

$$ \bar x_1 \mathrel{+}= \bar v_1 x_2, \qquad \bar x_2 \mathrel{+}= \bar v_1 x_1. $$

Therefore,

$$ \bar x_1 = x_2 + \cos(x_1), \qquad \bar x_2 = x_1. $$

These are exactly

$$ \frac{\partial y}{\partial x_1} = x_2 + \cos(x_1), \qquad \frac{\partial y}{\partial x_2} = x_1. $$

The computation has one output and two inputs. Reverse mode computes both input derivatives in one backward pass.
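The backward sweep just described can be written out by hand as straight-line Python. This is a sketch of the example above; the function names `f` and `grad_f` are our own:

```python
import math

def f(x1, x2):
    # Forward pass: y = x1*x2 + sin(x1).
    v1 = x1 * x2
    v2 = math.sin(x1)
    v3 = v1 + v2
    return v3

def grad_f(x1, x2):
    # Forward pass (primal values needed by the backward rules).
    v1 = x1 * x2
    v2 = math.sin(x1)
    # Reverse pass: seed the output adjoint with 1.
    v3_bar = 1.0
    # v3 = v1 + v2: addition passes the adjoint through unchanged.
    v1_bar = v3_bar * 1.0
    v2_bar = v3_bar * 1.0
    # v2 = sin(x1): multiply by the local derivative cos(x1).
    x1_bar = v2_bar * math.cos(x1)
    # v1 = x1 * x2: each input receives the adjoint times the other input.
    x1_bar += v1_bar * x2
    x2_bar = v1_bar * x1
    return x1_bar, x2_bar

g1, g2 = grad_f(0.5, 2.0)
# Analytic gradient: (x2 + cos(x1), x1).
assert abs(g1 - (2.0 + math.cos(0.5))) < 1e-12
assert abs(g2 - 0.5) < 1e-12
```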

Why Reverse Mode Is Efficient for Scalar Outputs

Reverse mode is most useful when the function has many inputs and few outputs:

$$ f : \mathbb{R}^n \to \mathbb{R}^m, $$

with $m$ small, especially $m = 1$.

For a scalar output function

$$ f : \mathbb{R}^n \to \mathbb{R}, $$

the full gradient is

$$ \nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right). $$

Forward mode computes directional derivatives. To recover the full gradient by forward mode, we usually seed one input direction at a time. That requires roughly $n$ forward passes.

Reverse mode computes all components of the gradient in one reverse pass after one forward pass. This explains its central role in machine learning, where a loss function usually has millions or billions of parameters but only one scalar loss.

Local Pullbacks

Each primitive operation in reverse mode can be described by a pullback.

Suppose

$$ z = g(x). $$

The forward pass computes $z$. The reverse pass receives $\bar z$, the sensitivity of the output with respect to $z$, and returns the sensitivity with respect to $x$:

$$ \bar x = \bar z \frac{\partial z}{\partial x}. $$

For a multivariate primitive

$$ z = g(x_1, x_2, \ldots, x_r), $$

the pullback maps one output adjoint into several input adjoints:

$$ \bar x_i \mathrel{+}= \bar z \frac{\partial z}{\partial x_i}. $$

The primitive therefore needs two pieces of information:

  1. the local derivative rule;
  2. the primal values needed by that rule.

For example, for multiplication

$$ z = xy, $$

the reverse rules are

$$ \bar x \mathrel{+}= \bar z \, y, \qquad \bar y \mathrel{+}= \bar z \, x. $$

The backward rule needs the saved forward values $x$ and $y$. This is why reverse mode usually stores a tape or graph during the forward pass.
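One common way to package a primitive and its reverse rule is a closure that captures the saved primal values. This is a sketch of the idea; the name `mul_with_pullback` is ours, not a particular library's API:

```python
def mul_with_pullback(x, y):
    """Forward-compute z = x * y and return a pullback closure.

    The closure captures the primal values x and y, which the
    backward rule for multiplication needs.
    """
    z = x * y

    def pullback(z_bar):
        # xbar = zbar * y,  ybar = zbar * x
        return z_bar * y, z_bar * x

    return z, pullback

z, pb = mul_with_pullback(3.0, 4.0)
assert z == 12.0
assert pb(1.0) == (4.0, 3.0)  # d(xy)/dx = y, d(xy)/dy = x
```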

Adjoint Propagation Algorithm

A minimal reverse mode algorithm has two phases.

First, run the program forward. Store each primitive operation and the values needed for its derivative.

Second, initialize the output adjoint to one:

$$ \bar y = 1. $$

Then traverse the recorded operations in reverse order. For each operation, apply its pullback and accumulate adjoints into its inputs.

In pseudocode:

forward:
    for op in program_order:
        compute output value
        record op, inputs, output, needed primal values

reverse:
    set all adjoints to 0
    output.adjoint = 1

    for op in reverse(program_order):
        apply local pullback
        add contributions to input adjoints

This algorithm works because the computation graph is acyclic when viewed as a single execution trace. Even if the source program contains loops or branches, one concrete execution produces a finite ordered trace of primitive operations.
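The pseudocode can be fleshed out into a small runnable tape. This is a minimal sketch; the class and function names (`Tape`, `Var`, `gradient`) are our own illustrative choices, not a standard API:

```python
import math

class Tape:
    """Records the execution trace: one entry per primitive operation."""
    def __init__(self):
        self.ops = []  # (inputs, output, local partial derivatives)

    def record(self, inputs, output, partials):
        self.ops.append((inputs, output, partials))

class Var:
    _tape = None  # tape shared by all operations during one evaluation

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        out = Var(self.value + other.value)
        Var._tape.record((self, other), out, (1.0, 1.0))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value)
        # d(xy)/dx = y, d(xy)/dy = x: save the primal values on the tape.
        Var._tape.record((self, other), out, (other.value, self.value))
        return out

def sin(v):
    out = Var(math.sin(v.value))
    Var._tape.record((v,), out, (math.cos(v.value),))
    return out

def gradient(f, *args):
    # Phase 1: run forward, recording every primitive on the tape.
    Var._tape = tape = Tape()
    xs = [Var(a) for a in args]
    y = f(*xs)
    # Phase 2: seed the output adjoint, sweep the tape in reverse,
    # and accumulate each pullback's contributions into the inputs.
    adjoint = {id(y): 1.0}
    for inputs, output, partials in reversed(tape.ops):
        out_bar = adjoint.get(id(output), 0.0)
        for v, dv in zip(inputs, partials):
            adjoint[id(v)] = adjoint.get(id(v), 0.0) + out_bar * dv
    return [adjoint.get(id(x), 0.0) for x in xs]

# y = x1*x2 + sin(x1), as in the worked example.
g = gradient(lambda x1, x2: x1 * x2 + sin(x1), 0.5, 2.0)
assert abs(g[0] - (2.0 + math.cos(0.5))) < 1e-12
assert abs(g[1] - 0.5) < 1e-12
```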

Accumulation at Shared Variables

Shared variables are common in real programs. Consider

$$ v_1 = x^2, \qquad v_2 = x^3, \qquad y = v_1 + v_2. $$

The input $x$ affects $y$ through both $v_1$ and $v_2$. Reverse mode must add both contributions:

$$ \bar x = \bar v_1 \frac{\partial v_1}{\partial x} + \bar v_2 \frac{\partial v_2}{\partial x}. $$

Since

$$ \bar v_1 = 1, \qquad \bar v_2 = 1, $$

we get

$$ \bar x = 2x + 3x^2. $$

A system that overwrites adjoints instead of accumulating them computes the wrong gradient. This is one of the most important implementation details in reverse mode.
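A small sketch makes the failure mode concrete. `grad_accumulate` follows the accumulation rule for $y = x^2 + x^3$; `grad_overwrite` shows the bug (both function names are ours):

```python
def grad_accumulate(x):
    # y = x**2 + x**3; both paths contribute to x's adjoint.
    v1_bar = 1.0  # adjoint of v1 = x**2
    v2_bar = 1.0  # adjoint of v2 = x**3
    x_bar = 0.0
    x_bar += v1_bar * 2 * x       # contribution through v1
    x_bar += v2_bar * 3 * x ** 2  # contribution through v2
    return x_bar

def grad_overwrite(x):
    # Buggy variant: plain assignment discards the first contribution.
    x_bar = 1.0 * 2 * x
    x_bar = 1.0 * 3 * x ** 2  # overwrites instead of accumulating
    return x_bar

x = 2.0
assert grad_accumulate(x) == 2 * x + 3 * x ** 2  # 16.0: correct
assert grad_overwrite(x) == 3 * x ** 2           # 12.0: wrong gradient
```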

Vector Outputs and Seed Adjoints

For a vector-valued function

$$ f : \mathbb{R}^n \to \mathbb{R}^m, $$

reverse mode does not directly compute the full Jacobian in one pass. Instead, it computes a vector-Jacobian product.

Given a seed vector

$$ \lambda \in \mathbb{R}^m, $$

reverse mode computes

$$ \lambda^\top J_f(x), $$

where

$$ J_f(x) = \frac{\partial f}{\partial x}. $$

The scalar-output case is the special case $m = 1$, where the seed is simply

$$ \lambda = 1. $$

This is why reverse mode is naturally suited to scalar losses. For multiple outputs, each independent seed gives one linear combination of Jacobian rows.
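As a sketch, here is a vector-Jacobian product for a small hand-chosen function $f : \mathbb{R}^2 \to \mathbb{R}^2$, computed by adjoint propagation and checked against the explicit Jacobian (the function and the name `f_vjp` are our own example):

```python
import numpy as np

def f_vjp(x, lam):
    """Vector-Jacobian product lam^T J_f(x) for
    f(x) = (x0**2 + x1, x0 * x1), computed by adjoint propagation."""
    x0, x1 = x
    x_bar = np.zeros(2)
    # Seed each output adjoint with the matching entry of lam,
    # then accumulate into the input adjoints.
    # Output 0: y0 = x0**2 + x1
    x_bar[0] += lam[0] * 2 * x0
    x_bar[1] += lam[0] * 1.0
    # Output 1: y1 = x0 * x1
    x_bar[0] += lam[1] * x1
    x_bar[1] += lam[1] * x0
    return x_bar

x = np.array([2.0, 3.0])
lam = np.array([1.0, -1.0])
J = np.array([[2 * x[0], 1.0],
              [x[1],     x[0]]])
assert np.allclose(f_vjp(x, lam), lam @ J)
```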

Reverse Mode as Transposed Linearization

Forward mode propagates tangents through the linearization of the program. Reverse mode propagates adjoints through the transpose of that linearization.

For a local operation

$$ z = g(x), $$

the differential is

$$ \dot z = J_g(x)\,\dot x. $$

Forward mode applies $J_g(x)$ to a tangent vector.

Reverse mode applies the transpose:

$$ \bar x = J_g(x)^\top \bar z. $$

This viewpoint is important because it separates the primal computation from the derivative computation. The forward pass evaluates the original program. The reverse pass evaluates transposed local linear maps in reverse order.

For a composition

$$ y = g_k \circ g_{k-1} \circ \cdots \circ g_1(x), $$

the forward linearization is

$$ J_y = J_{g_k} J_{g_{k-1}} \cdots J_{g_1}. $$

The reverse pass applies the transpose in the opposite order:

$$ J_y^\top = J_{g_1}^\top \cdots J_{g_{k-1}}^\top J_{g_k}^\top. $$

That reversal is the mathematical reason reverse mode runs backward through the program trace.
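The reversal can be checked numerically with random stand-in Jacobians, an illustrative NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Local Jacobians of three composed maps g1, g2, g3 (random stand-ins).
J1 = rng.standard_normal((4, 3))  # J_{g1}: R^3 -> R^4
J2 = rng.standard_normal((5, 4))  # J_{g2}: R^4 -> R^5
J3 = rng.standard_normal((2, 5))  # J_{g3}: R^5 -> R^2

# Forward mode: push a tangent through the Jacobians in program order.
xdot = rng.standard_normal(3)
ydot = J3 @ (J2 @ (J1 @ xdot))

# Reverse mode: pull an adjoint through the transposes in reverse order.
ybar = rng.standard_normal(2)
xbar = J1.T @ (J2.T @ (J3.T @ ybar))

# Both agree with the full Jacobian J = J3 J2 J1 and its transpose.
J = J3 @ J2 @ J1
assert np.allclose(ydot, J @ xdot)
assert np.allclose(xbar, J.T @ ybar)
```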

Operational Meaning

Adjoint propagation gives each intermediate value a local responsibility: when the reverse pass reaches an operation, it distributes the output sensitivity to that operation’s inputs.

The operation does not need to know the whole program. It only needs its local derivative rule and the adjoint of its result.

This locality is what makes reverse mode modular. A large differentiable program can be built from primitive operations, each with a small backward rule. The global gradient emerges by applying the chain rule over the recorded execution.