Adjoint Propagation
Reverse mode automatic differentiation computes derivatives by propagating sensitivities backward through a computation. In forward mode, each intermediate value carries a tangent, which says how that value changes when the input is perturbed. In reverse mode, each intermediate value carries an adjoint, which says how the final output changes when that intermediate value is perturbed.
Consider a scalar-valued function
$$f : \mathbb{R}^n \to \mathbb{R}.$$
A program evaluates $f$ by producing intermediate values
$$v_1, v_2, \dots, v_N.$$
Some of these values are inputs. Others are produced by elementary operations. The final output is one of the later values, usually written as $y$.
For each intermediate value $v_i$, reverse mode defines an adjoint
$$\bar{v}_i = \frac{\partial y}{\partial v_i}.$$
The bar notation means “sensitivity of the output with respect to this variable.” If $\bar{v}_i$ is large, a small change in $v_i$ causes a large change in the final output. If $\bar{v}_i = 0$, changing $v_i$ has no first-order effect on the output.
From Local Derivatives to Global Sensitivities
Suppose an intermediate value $w$ depends on earlier values:
$$w = g(u, v).$$
If the final output $y$ depends on $w$, then $u$ and $v$ influence $y$ through $w$. The chain rule gives
$$\frac{\partial y}{\partial u} = \frac{\partial y}{\partial w} \, \frac{\partial w}{\partial u},$$
and similarly,
$$\frac{\partial y}{\partial v} = \frac{\partial y}{\partial w} \, \frac{\partial w}{\partial v}.$$
Using adjoint notation:
$$\bar{u} = \bar{w} \, \frac{\partial w}{\partial u}, \qquad \bar{v} = \bar{w} \, \frac{\partial w}{\partial v}.$$
This is the basic rule of adjoint propagation.
The key detail is accumulation. A variable may influence the output through several later paths. Reverse mode must add all contributions into the same adjoint slot.
Example: A Small Computation
Let $x_1$ and $x_2$ be the inputs, and define
$$v_1 = x_1 x_2, \qquad v_2 = \sin(v_1).$$
The output is
$$y = v_2 = \sin(x_1 x_2).$$
The forward evaluation computes the values in order:

| Step | Operation | Value computed |
|---|---|---|
| 1 | product | $v_1 = x_1 x_2$ |
| 2 | sine | $v_2 = \sin(v_1)$ |
| 3 | output | $y = v_2$ |
Reverse mode starts from the output:
$$\bar{y} = 1.$$
Now propagate backward.

Since
$$y = v_2,$$
we have
$$\bar{v}_2 = \bar{y} \, \frac{\partial y}{\partial v_2} = 1 \cdot 1.$$
So
$$\bar{v}_2 = 1.$$
Since
$$v_2 = \sin(v_1),$$
we propagate
$$\bar{v}_1 = \bar{v}_2 \, \cos(v_1) = \cos(v_1).$$
Since
$$v_1 = x_1 x_2,$$
we propagate
$$\bar{x}_1 = \bar{v}_1 \, x_2, \qquad \bar{x}_2 = \bar{v}_1 \, x_1.$$
Therefore,
$$\bar{x}_1 = x_2 \cos(x_1 x_2), \qquad \bar{x}_2 = x_1 \cos(x_1 x_2).$$
These are exactly
$$\frac{\partial y}{\partial x_1} \quad \text{and} \quad \frac{\partial y}{\partial x_2}.$$
The computation has one output and two inputs. Reverse mode computes both input derivatives in one backward pass.
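The same backward pass can be written out directly in Python. The following sketch is illustrative (the function name and the sample inputs are invented here, not from the text), and it checks the result against the analytic derivatives:

```python
import math

def forward_and_reverse(x1, x2):
    # Forward pass: compute and keep the intermediate values.
    v1 = x1 * x2          # step 1: product
    v2 = math.sin(v1)     # step 2: sine
    y = v2                # step 3: output

    # Reverse pass: seed the output adjoint with 1.
    y_bar = 1.0
    v2_bar = y_bar                   # y = v2
    v1_bar = v2_bar * math.cos(v1)   # v2 = sin(v1)
    x1_bar = v1_bar * x2             # v1 = x1 * x2
    x2_bar = v1_bar * x1
    return y, (x1_bar, x2_bar)

y, grad = forward_and_reverse(2.0, 3.0)
# Analytic check: dy/dx1 = x2*cos(x1*x2), dy/dx2 = x1*cos(x1*x2)
print(grad, (3.0 * math.cos(6.0), 2.0 * math.cos(6.0)))
```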
Why Reverse Mode Is Efficient for Scalar Outputs
Reverse mode is most useful when the function has many inputs and few outputs:
$$f : \mathbb{R}^n \to \mathbb{R}^m,$$
with $m$ small, especially $m = 1$.
For a scalar-output function
$$f : \mathbb{R}^n \to \mathbb{R},$$
the full gradient is
$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right).$$
Forward mode computes directional derivatives. To recover the full gradient by forward mode, we usually seed one input direction at a time. That requires roughly $n$ forward passes.
Reverse mode computes all components of the gradient in one reverse pass after one forward pass. This explains its central role in machine learning, where a loss function usually has millions or billions of parameters but only one scalar loss.
Local Pullbacks
Each primitive operation in reverse mode can be described by a pullback.
Suppose
$$v = g(u).$$
The forward pass computes $v$. The reverse pass receives $\bar{v}$, the sensitivity of the output with respect to $v$, and returns the sensitivity with respect to $u$:
$$\bar{u} = \bar{v} \, g'(u).$$
For a multivariate primitive
$$w = g(u_1, \dots, u_k),$$
the pullback maps one output adjoint into several input adjoints:
$$\bar{u}_i \leftarrow \bar{u}_i + \bar{w} \, \frac{\partial g}{\partial u_i}, \qquad i = 1, \dots, k.$$
The primitive therefore needs two pieces of information:
- the local derivative rule;
- the primal values needed by that rule.
For example, for multiplication
$$w = u \, v,$$
the reverse rules are
$$\bar{u} \leftarrow \bar{u} + \bar{w} \, v, \qquad \bar{v} \leftarrow \bar{v} + \bar{w} \, u.$$
The backward rule needs the saved forward values $u$ and $v$. This is why reverse mode usually stores a tape or graph during the forward pass.
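One way to package such a primitive in Python is a pair of functions: a forward rule that saves the primal values, and a pullback that consumes the output adjoint. The names `mul_forward` and `mul_pullback` are invented for this sketch:

```python
def mul_forward(u, v):
    """Forward rule: compute the product and save the primal values
    that the backward rule will need."""
    w = u * v
    saved = (u, v)
    return w, saved

def mul_pullback(w_bar, saved):
    """Backward rule: map the output adjoint w_bar to the two
    input adjoint contributions, using the saved primal values."""
    u, v = saved
    return w_bar * v, w_bar * u  # contributions to u_bar and v_bar

# Usage: the caller accumulates the returned contributions into the adjoints.
w, saved = mul_forward(2.0, 5.0)
du, dv = mul_pullback(1.0, saved)
print(du, dv)  # 5.0 2.0
```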
Adjoint Propagation Algorithm
A minimal reverse mode algorithm has two phases.
First, run the program forward. Store each primitive operation and the values needed for its derivative.
Second, initialize the output adjoint to one:
$$\bar{y} = 1.$$
Then traverse the recorded operations in reverse order. For each operation, apply its pullback and accumulate adjoints into its inputs.
In pseudocode:
```
forward:
    for op in program_order:
        compute output value
        record op, inputs, output, needed primal values

reverse:
    set all adjoints to 0
    output.adjoint = 1
    for op in reverse(program_order):
        apply local pullback
        add contributions to input adjoints
```

This algorithm works because the computation graph is acyclic when viewed as a single execution trace. Even if the source program contains loops or branches, one concrete execution produces a finite ordered trace of primitive operations.
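For concreteness, the following Python sketch turns the pseudocode into a small working scalar implementation. The class and function names are invented for this example, not taken from any particular library; the trace is recorded through parent links and replayed in reverse topological order, which is one common way to realize the tape:

```python
import math

class Var:
    """A scalar value that records how it was computed."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # input Vars of the op that produced this value
        self.local_grads = local_grads  # partial derivatives w.r.t. each parent
        self.adjoint = 0.0              # adjoints start at zero

    def __mul__(self, other):
        return Var(self.value * other.value,
                   parents=(self, other),
                   local_grads=(other.value, self.value))

    def __add__(self, other):
        return Var(self.value + other.value,
                   parents=(self, other),
                   local_grads=(1.0, 1.0))

def sin(x):
    return Var(math.sin(x.value), parents=(x,), local_grads=(math.cos(x.value),))

def backward(output):
    """Reverse pass: order the recorded trace, seed the output adjoint
    with 1, and accumulate adjoint contributions into every parent."""
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p in v.parents:
                visit(p)
            order.append(v)
    visit(output)

    output.adjoint = 1.0
    for v in reversed(order):
        for parent, local in zip(v.parents, v.local_grads):
            parent.adjoint += v.adjoint * local  # accumulate, never overwrite

# Example: y = sin(x1 * x2)
x1, x2 = Var(2.0), Var(3.0)
y = sin(x1 * x2)
backward(y)
print(x1.adjoint, x2.adjoint)  # x2*cos(x1*x2), x1*cos(x1*x2)
```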
Accumulation at Shared Variables
Shared variables are common in real programs. Consider
$$u = \sin(x), \qquad v = x^2, \qquad y = u \, v.$$
The input $x$ affects $y$ through both $u$ and $v$. Reverse mode must add both contributions:
$$\bar{x} = \bar{u} \, \frac{\partial u}{\partial x} + \bar{v} \, \frac{\partial v}{\partial x}.$$
Since
$$\bar{u} = \bar{y} \, v = x^2, \qquad \bar{v} = \bar{y} \, u = \sin(x),$$
we get
$$\bar{x} = x^2 \cos(x) + 2x \sin(x).$$
A system that overwrites adjoints instead of accumulating them computes the wrong gradient. This is one of the most important implementation details in reverse mode.
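A standalone numeric check of this accumulation, assuming the example above (the helper name `grad_shared`, the test point, and the finite-difference comparison are all illustrative):

```python
import math

def grad_shared(x):
    # Forward pass: x feeds into two intermediates that are multiplied.
    u = math.sin(x)
    v = x * x
    y = u * v

    # Reverse pass: y = u*v, so u_bar = v and v_bar = u (with y_bar = 1).
    u_bar = v
    v_bar = u
    # x feeds both u and v: accumulate both contributions into x_bar.
    x_bar = 0.0
    x_bar += u_bar * math.cos(x)   # contribution through u = sin(x)
    x_bar += v_bar * 2.0 * x       # contribution through v = x^2
    return y, x_bar

x = 1.3
_, g = grad_shared(x)
# Compare against a central finite difference of sin(x) * x^2.
eps = 1e-6
fd = (math.sin(x + eps) * (x + eps)**2 - math.sin(x - eps) * (x - eps)**2) / (2 * eps)
print(g, fd)
```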
Vector Outputs and Seed Adjoints
For a vector-valued function
$$f : \mathbb{R}^n \to \mathbb{R}^m,$$
reverse mode does not directly compute the full Jacobian in one pass. Instead, it computes a vector-Jacobian product.
Given a seed vector
$$\bar{y} \in \mathbb{R}^m,$$
reverse mode computes
$$\bar{x} = J^\top \bar{y},$$
where
$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$
is the Jacobian of $f$ at $x$. The scalar-output case is the special case $m = 1$, where the seed is simply
$$\bar{y} = 1.$$
This is why reverse mode is naturally suited to scalar losses. For multiple outputs, each independent seed gives one linear combination of Jacobian rows.
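The following NumPy sketch illustrates the seeding idea with a hand-written vector-Jacobian product (no AD library is involved; the function `f` and the helper `vjp` are invented for this example). Seeding with each standard basis vector recovers one row of the Jacobian per reverse pass:

```python
import numpy as np

def f(x):
    # f : R^3 -> R^2
    return np.array([x[0] * x[1], np.sin(x[2])])

def vjp(x, y_bar):
    # Hand-written vector-Jacobian product: x_bar = J(x)^T @ y_bar.
    J = np.array([
        [x[1], x[0], 0.0],            # row for f_0 = x0 * x1
        [0.0,  0.0,  np.cos(x[2])],   # row for f_1 = sin(x2)
    ])
    return J.T @ y_bar

x = np.array([1.0, 2.0, 3.0])
# Each basis seed e_i returns one row of the Jacobian.
rows = [vjp(x, e) for e in np.eye(2)]
print(np.stack(rows))
```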
Reverse Mode as Transposed Linearization
Forward mode propagates tangents through the linearization of the program. Reverse mode propagates adjoints through the transpose of that linearization.
For a local operation
$$v = g(u),$$
the differential is
$$dv = J_g(u) \, du,$$
where $J_g(u)$ is the local Jacobian of $g$ at $u$.
Forward mode applies $J_g(u)$ to a tangent vector.
Reverse mode applies the transpose:
$$\bar{u} = J_g(u)^\top \, \bar{v}.$$
This viewpoint is important because it separates the primal computation from the derivative computation. The forward pass evaluates the original program. The reverse pass evaluates transposed local linear maps in reverse order.
For a composition
$$f = g_k \circ \cdots \circ g_2 \circ g_1,$$
the forward linearization is
$$J_f = J_{g_k} \cdots J_{g_2} \, J_{g_1}.$$
The reverse pass applies the transpose in the opposite order:
$$J_f^\top = J_{g_1}^\top \, J_{g_2}^\top \cdots J_{g_k}^\top.$$
That reversal is the mathematical reason reverse mode runs backward through the program trace.
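A small NumPy check of this reversal, with three random matrices standing in for the local Jacobians of the stages (the shapes and the random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Local Jacobians of three stages g1, g2, g3 in a composition f = g3 ∘ g2 ∘ g1.
J1 = rng.normal(size=(4, 3))   # R^3 -> R^4
J2 = rng.normal(size=(5, 4))   # R^4 -> R^5
J3 = rng.normal(size=(2, 5))   # R^5 -> R^2

# Forward mode: push a tangent through the stages in program order.
dx = rng.normal(size=3)
dy = J3 @ (J2 @ (J1 @ dx))

# Reverse mode: pull an adjoint back through the transposes in reverse order.
y_bar = rng.normal(size=2)
x_bar = J1.T @ (J2.T @ (J3.T @ y_bar))

# Both agree with the full Jacobian J = J3 @ J2 @ J1.
J = J3 @ J2 @ J1
print(np.allclose(dy, J @ dx), np.allclose(x_bar, J.T @ y_bar))
```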
Operational Meaning
Adjoint propagation gives each intermediate value a local responsibility: when the reverse pass reaches an operation, it distributes the output sensitivity to that operation’s inputs.
The operation does not need to know the whole program. It only needs its local derivative rule and the adjoint of its result.
This locality is what makes reverse mode modular. A large differentiable program can be built from primitive operations, each with a small backward rule. The global gradient emerges by applying the chain rule over the recorded execution.