Reverse mode automatic differentiation fundamentally computes vector-Jacobian products. The gradient of a scalar function is a special case of this more general operation.
Consider a differentiable function $f : \mathbb{R}^n \to \mathbb{R}^m$.
Its Jacobian matrix is $J \in \mathbb{R}^{m \times n}$, with entries $J_{ij} = \partial f_i / \partial x_j$.
The Jacobian maps input perturbations into output perturbations.
Forward mode computes Jacobian-vector products $Jv$ for a tangent vector $v \in \mathbb{R}^n$.
Reverse mode computes vector-Jacobian products $u^\top J$ for an adjoint vector $u \in \mathbb{R}^m$.
These two operations are dual to each other.
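Here is a minimal sketch of the two dual operations, using JAX's `jax.jvp` and `jax.vjp` (the function `f` below is an arbitrary toy choice for illustration):

```python
import jax
import jax.numpy as jnp

def f(x):
    # An arbitrary differentiable map from R^3 to R^2.
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([0.1, 0.2, 0.3])   # tangent vector in the input space R^3
u = jnp.array([1.0, -1.0])       # adjoint (seed) vector in the output space R^2

# Forward mode: Jacobian-vector product J v.
_, jv = jax.jvp(f, (x,), (v,))

# Reverse mode: vector-Jacobian product u^T J.
_, vjp_fn = jax.vjp(f, x)
(uJ,) = vjp_fn(u)

# Duality: u . (J v) equals (u^T J) . v.
print(jnp.allclose(jnp.dot(u, jv), jnp.dot(uJ, v)))  # True
```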
## Linearization of a Program

A differentiable program locally behaves like a linear map.
Near a point $x$,
$$f(x + \delta) \approx f(x) + J(x)\,\delta.$$
The Jacobian is therefore the local linearization of the computation.
A tangent vector $v \in \mathbb{R}^n$
propagates forward through the Jacobian: $v \mapsto Jv$.
An adjoint vector $u \in \mathbb{R}^m$
propagates backward through the transpose: $u \mapsto J^\top u$.
Equivalently, $u^\top J = (J^\top u)^\top$.
Reverse mode implements this backward propagation without explicitly materializing the Jacobian matrix.
## Scalar Output as a Special Case

Suppose $f : \mathbb{R}^n \to \mathbb{R}$.
Then the Jacobian has shape $1 \times n$.
It is simply the gradient row vector: $J = \nabla f(x)^\top$.
If we choose seed vector $u = 1$,
then $u^\top J = \nabla f(x)^\top$.
This explains why reverse mode efficiently computes gradients for scalar-valued objectives.
Deep learning optimization almost always minimizes a scalar loss: $L = f(\theta) \in \mathbb{R}$.
Reverse mode therefore computes $\nabla_\theta L$
using one backward pass.
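A small sketch of this special case in JAX, with an arbitrary toy objective: seeding the pullback with $\bar{y} = 1$ reproduces `jax.grad`.

```python
import jax
import jax.numpy as jnp

def loss(theta):
    # A toy scalar objective, chosen only for illustration.
    return jnp.sum(theta ** 2) + jnp.sin(theta[0])

theta = jnp.array([0.5, -1.0, 2.0])

# Reverse mode with unit seed: the VJP of a scalar output with u = 1
# is exactly the gradient.
y, vjp_fn = jax.vjp(loss, theta)
(grad_via_vjp,) = vjp_fn(jnp.ones_like(y))

print(jnp.allclose(grad_via_vjp, jax.grad(loss)(theta)))  # True
```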
## Example of a Vector-Jacobian Product

Define $f(x_1, x_2) = (x_1 x_2,\; x_1 + x_2)$.
The Jacobian is
$$J = \begin{pmatrix} x_2 & x_1 \\ 1 & 1 \end{pmatrix}.$$
Suppose we choose output seed vector $u = (u_1, u_2)^\top$.
The vector-Jacobian product is
$$u^\top J = \begin{pmatrix} u_1 x_2 + u_2 & u_1 x_1 + u_2 \end{pmatrix}.$$
Therefore, $\bar{x} = J^\top u = (u_1 x_2 + u_2,\; u_1 x_1 + u_2)^\top$.
Reverse mode computes this result directly by backward propagation.
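The result can be checked numerically; the sketch below evaluates the same vector-Jacobian product with `jax.vjp` at arbitrary sample values:

```python
import jax
import jax.numpy as jnp

def f(x):
    x1, x2 = x
    return jnp.array([x1 * x2, x1 + x2])

x = jnp.array([3.0, 4.0])   # (x1, x2)
u = jnp.array([2.0, 5.0])   # output seed (u1, u2)

_, vjp_fn = jax.vjp(f, x)
(x_bar,) = vjp_fn(u)

# Expected: (u1*x2 + u2, u1*x1 + u2) = (2*4 + 5, 2*3 + 5) = (13, 11).
print(x_bar)  # [13. 11.]
```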
## Reverse Propagation Interpretation

Suppose the outputs are $y_1 = x_1 x_2$ and $y_2 = x_1 + x_2$.
The seed vector assigns sensitivities to outputs: $\bar{y}_1 = u_1$, $\bar{y}_2 = u_2$.
Reverse propagation distributes these sensitivities into the inputs.
From $y_1 = x_1 x_2$
we obtain contributions: $\bar{x}_1 \mathrel{+}= u_1 x_2$, $\bar{x}_2 \mathrel{+}= u_1 x_1$.
From $y_2 = x_1 + x_2$
we obtain: $\bar{x}_1 \mathrel{+}= u_2$, $\bar{x}_2 \mathrel{+}= u_2$.
Accumulating both contributions gives
$$\bar{x} = (u_1 x_2 + u_2,\; u_1 x_1 + u_2)^\top,$$
exactly the vector-Jacobian product.
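The same accumulation can be written out by hand; the following sketch mirrors the two contribution steps above in plain Python:

```python
def manual_vjp(x1, x2, u1, u2):
    """Hand-written backward pass for f(x1, x2) = (x1*x2, x1 + x2)."""
    x1_bar = 0.0
    x2_bar = 0.0
    # From y1 = x1 * x2: each input receives u1 times its partial derivative.
    x1_bar += u1 * x2
    x2_bar += u1 * x1
    # From y2 = x1 + x2: each input receives u2 times 1.
    x1_bar += u2
    x2_bar += u2
    return x1_bar, x2_bar

print(manual_vjp(3.0, 4.0, 2.0, 5.0))  # (13.0, 11.0)
```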
## Reverse Mode Never Builds the Full Jacobian

An explicit Jacobian may be extremely large.
For a function $f : \mathbb{R}^{10^6} \to \mathbb{R}^{10^3}$, for example,
the Jacobian contains one billion entries.
For tensor-valued neural networks, the Jacobian may be astronomically large.
Reverse mode avoids materializing the matrix. Instead, it computes only the action of the transpose on a seed vector.
Operationally:
| Method | Computes |
|---|---|
| Forward mode | Jacobian-vector product $Jv$ |
| Reverse mode | vector-Jacobian product $u^\top J$ |
This distinction is fundamental.
Automatic differentiation systems are really systems for efficient linear operator evaluation.
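A sketch of the contrast, at sizes small enough that the explicit Jacobian still fits in memory (the map `f` and its dimensions are arbitrary):

```python
import jax
import jax.numpy as jnp

W = jax.random.normal(jax.random.PRNGKey(0), (1000, 2000))

def f(x):
    return jnp.tanh(W @ x)

x = jnp.zeros(2000)
u = jnp.ones(1000)

J = jax.jacobian(f)(x)      # materializes a 1000 x 2000 matrix
_, vjp_fn = jax.vjp(f, x)
(uJ,) = vjp_fn(u)           # never forms J; a single backward pass

print(jnp.allclose(uJ, u @ J, atol=1e-3))  # True
```

At realistic network sizes, only the VJP path remains feasible.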
## Pullbacks

A reverse-mode primitive defines a pullback.
Suppose $y = f(x)$ for a scalar function $f : \mathbb{R} \to \mathbb{R}$.
The forward pass computes $y = f(x)$.
The reverse pass receives output sensitivity $\bar{y}$
and returns input sensitivity $\bar{x} = f'(x)\,\bar{y}$.
This mapping
$$\bar{y} \mapsto f'(x)\,\bar{y}$$
is called a pullback.
For multivariate functions $f : \mathbb{R}^n \to \mathbb{R}^m$,
the pullback maps $\bar{y} \in \mathbb{R}^m$ to $\bar{x} = J^\top \bar{y} \in \mathbb{R}^n$.
Each primitive operation therefore exposes two interfaces:
| Pass | Mapping |
|---|---|
| Forward | inputs $\to$ outputs |
| Reverse | output adjoints $\to$ input adjoints |
The reverse interface is the transposed local linear map.
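In JAX, for instance, `jax.vjp` exposes exactly these two interfaces at once: it runs the forward pass and returns the pullback as a closure. A minimal sketch with a toy function:

```python
import jax
import jax.numpy as jnp

def f(x):
    # Toy map from R^2 to R^2 with Jacobian [[2*x1, 0], [x2, x1]].
    return jnp.array([x[0] ** 2, x[0] * x[1]])

x = jnp.array([2.0, 3.0])

# Forward interface: inputs -> outputs (plus the pullback closure).
y, pullback = jax.vjp(f, x)

# Reverse interface: output adjoints -> input adjoints.
y_bar = jnp.array([1.0, 1.0])
(x_bar,) = pullback(y_bar)

print(y)      # [4. 6.]
print(x_bar)  # [7. 2.]  = y_bar^T J
```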
## Composition of Vector-Jacobian Products

Consider a composition $f = f_2 \circ f_1$.
The Jacobian satisfies $J = J_2 J_1$.
A vector-Jacobian product becomes $u^\top J = (u^\top J_2)\, J_1$.
Reverse mode evaluates this right-to-left:
- propagate $u$ backward through $f_2$;
- propagate the resulting adjoint backward through $f_1$.
This exactly matches backward graph traversal.
The reverse pass therefore computes repeated local vector-Jacobian products.
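A sketch of this right-to-left composition, recording one pullback per stage with `jax.vjp` (the stage functions `f1` and `f2` are arbitrary examples):

```python
import jax
import jax.numpy as jnp

def f1(x):
    return jnp.sin(x)

def f2(z):
    return jnp.sum(z ** 2)

x = jnp.array([0.5, 1.0, 1.5])

# Forward pass: evaluate each stage and record its pullback.
z, pullback1 = jax.vjp(f1, x)
y, pullback2 = jax.vjp(f2, z)

# Reverse pass: apply the pullbacks in reverse order.
u = jnp.ones_like(y)          # seed on the final output
(z_bar,) = pullback2(u)       # backward through f2
(x_bar,) = pullback1(z_bar)   # backward through f1

print(jnp.allclose(x_bar, jax.grad(lambda x: f2(f1(x)))(x)))  # True
```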
## Relation to the Chain Rule

The chain rule for scalar functions says $(g \circ f)'(x) = g'(f(x))\, f'(x)$.
For vector-valued functions, the chain rule becomes matrix multiplication: $J_{g \circ f}(x) = J_g(f(x))\, J_f(x)$.
Reverse mode applies transposed chain multiplication: $J_{g \circ f}^\top = J_f^\top J_g^\top$.
The reverse execution order emerges directly from this transpose reversal.
## Computational Complexity

Suppose a function evaluation costs time $T$.
For scalar outputs, reverse mode usually computes the full gradient in time proportional to $T$.
Often the backward pass costs only a small constant multiple of the forward pass.
This is known as the cheap gradient principle.
For many programs:
| Phase | Relative Cost |
|---|---|
| Forward pass | $T$ |
| Reverse pass | $\approx 2T$ to $3T$ |
| Total gradient | $\approx 3T$ to $4T$ |
The exact factor depends on:
- operation types;
- memory traffic;
- recomputation strategy;
- tensor kernel efficiency.
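A rough timing sketch of the cheap gradient principle; the network shape, depth, and measured ratio here are illustrative assumptions, and the true factor depends on the items listed above:

```python
import time
import jax
import jax.numpy as jnp

W = jax.random.normal(jax.random.PRNGKey(0), (2000, 2000))

def f(x):
    # An arbitrary deep chain of dense layers with scalar output.
    for _ in range(20):
        x = jnp.tanh(W @ x)
    return jnp.sum(x)

x = jnp.ones(2000)
f_jit = jax.jit(f)
g_jit = jax.jit(jax.grad(f))

# Warm up compilation before timing.
f_jit(x).block_until_ready()
g_jit(x).block_until_ready()

t0 = time.perf_counter()
f_jit(x).block_until_ready()
t1 = time.perf_counter()
g_jit(x).block_until_ready()
t2 = time.perf_counter()

print(f"gradient/forward cost ratio: {(t2 - t1) / (t1 - t0):.1f}")
```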
## Reverse Mode as Covector Propagation
Tangent vectors propagate in the same direction as computation.
Adjoints propagate in the opposite direction.
Geometrically:
| Object | Direction |
|---|---|
| Tangent vector | forward |
| Covector / adjoint | backward |
A Jacobian maps tangent vectors: $v \mapsto Jv$.
Its transpose maps covectors: $u \mapsto J^\top u$.
Reverse mode therefore operates naturally on covectors rather than tangent vectors.
This interpretation becomes important in differential geometry, optimal control, and variational calculus.
## Batch Reverse Mode

Neural networks frequently process batches of examples simultaneously.
Suppose the input is a batch $X \in \mathbb{R}^{B \times n}$,
where $B$ is the batch size.
The reverse pass propagates adjoints through tensor operations. The underlying mathematics remains vector-Jacobian multiplication, but now the vectors themselves may be large tensors.
For example, matrix multiplication $C = AB$
has reverse rules $\bar{A} = \bar{C} B^\top$ and $\bar{B} = A^\top \bar{C}$.
These are transposed linear operations derived from the Jacobian structure of matrix multiplication.
Modern tensor libraries implement such vector-Jacobian products directly as optimized kernels.
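These rules can be checked against a library VJP directly; a minimal sketch with `jax.vjp` and arbitrary shapes:

```python
import jax
import jax.numpy as jnp

A = jax.random.normal(jax.random.PRNGKey(0), (4, 5))
B = jax.random.normal(jax.random.PRNGKey(1), (5, 3))

def matmul(A, B):
    return A @ B

C, vjp_fn = jax.vjp(matmul, A, B)
C_bar = jnp.ones_like(C)          # adjoint arriving at the output
A_bar, B_bar = vjp_fn(C_bar)

# Compare with the closed-form reverse rules.
print(jnp.allclose(A_bar, C_bar @ B.T))  # True
print(jnp.allclose(B_bar, A.T @ C_bar))  # True
```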
## Conceptual Summary
Reverse mode automatic differentiation is fundamentally a system for evaluating transposed local linear maps.
Each primitive operation defines:
- a forward computation;
- a pullback implementing a vector-Jacobian product.
The backward pass composes these pullbacks in reverse order.
The full Jacobian is almost never constructed explicitly. Instead, reverse mode computes only the action of the Jacobian transpose on a seed vector.
For scalar outputs, this seed is simply $\bar{y} = 1$,
and the resulting vector-Jacobian product becomes the gradient itself.