Forward mode and reverse mode propagate different kinds of objects.
Forward mode propagates tangent vectors. These represent small changes in values.
Reverse mode propagates cotangent vectors. These represent sensitivities of an output with respect to values.
The distinction is more than notation. It explains why forward mode computes Jacobian-vector products, while reverse mode computes vector-Jacobian products.
Tangent Vectors
Suppose

$$f : \mathbb{R}^n \to \mathbb{R}^m,$$

and we evaluate it at a point $x \in \mathbb{R}^n$. A tangent vector $v \in \mathbb{R}^n$ represents a small input perturbation:

$$x \mapsto x + \epsilon v.$$

The derivative tells how this perturbation affects the output:

$$f(x + \epsilon v) \approx f(x) + \epsilon\, J_f(x)\, v.$$

The output tangent is

$$\dot{y} = J_f(x)\, v.$$

Forward mode computes this map:

$$v \mapsto J_f(x)\, v.$$
This map is called the pushforward.
Pushforward
The pushforward is the derivative viewed as a map from input tangents to output tangents.
In ordinary coordinates,

$$\partial f_x : \mathbb{R}^n \to \mathbb{R}^m, \qquad \partial f_x(v) = J_f(x)\, v.$$

Forward mode implements the pushforward directly. If a program computes $y = f(x)$, then its forward derivative program computes

$$(y,\ \dot{y}) = (f(x),\ J_f(x)\, \dot{x}).$$
This is why forward mode is often called tangent propagation.
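As a concrete sketch, JAX's `jax.jvp` exposes exactly this map: it evaluates the program and pushes a tangent through it in one pass. The function `f`, the point `x`, and the tangent `v` below are illustrative choices, not prescribed by the text.

```python
import jax
import jax.numpy as jnp

# An illustrative function f : R^3 -> R^2.
def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])   # point of evaluation
v = jnp.array([0.1, 0.0, -0.2])  # input tangent

# jax.jvp evaluates f(x) and pushes the tangent v forward in a single pass.
y, y_dot = jax.jvp(f, (x,), (v,))
print(y)      # primal output f(x)
print(y_dot)  # output tangent J_f(x) @ v
```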
Dual Spaces
A dual space is the space of linear functionals on a vector space.
If $V$ is a vector space, its dual space is

$$V^* = \{\, \varphi : V \to \mathbb{R} \mid \varphi \text{ is linear} \,\}.$$

An element of $V^*$ takes a vector and returns a scalar.

For $V = \mathbb{R}^m$, a covector $w$ can be represented as a row vector:

$$w = \begin{pmatrix} w_1 & w_2 & \cdots & w_m \end{pmatrix}.$$

It acts on a vector $u \in \mathbb{R}^m$ by

$$w(u) = \sum_{i=1}^{m} w_i\, u_i,$$

or, in column-vector notation, $w^\top u$, depending on convention.
In AD, these dual vectors are usually called cotangents or adjoints.
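A minimal numerical illustration of a covector acting on a vector; the arrays are arbitrary.

```python
import jax.numpy as jnp

w = jnp.array([1.0, -2.0, 0.5])   # covector on R^3, stored as a plain array
u = jnp.array([4.0, 1.0, 2.0])    # vector in R^3

# The covector acts through the scalar pairing w(u) = sum_i w_i u_i.
print(jnp.dot(w, u))  # 3.0
```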
Cotangents and Sensitivities
A cotangent represents sensitivity of a scalar quantity to a value.
If

$$g : \mathbb{R}^m \to \mathbb{R},$$

then the derivative of $g$ at $y$ is a covector:

$$dg_y \in (\mathbb{R}^m)^*.$$

It maps an output perturbation $u$ to the resulting scalar perturbation:

$$u \mapsto dg_y(u).$$

In coordinates, this is

$$dg_y(u) = \nabla g(y)^\top u = \sum_{i=1}^{m} \frac{\partial g}{\partial y_i}\, u_i.$$
Reverse mode propagates these covectors backward through a computation.
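As a small sketch, JAX's `jax.grad` produces exactly this covector for a scalar function; `g`, `y`, and `u` below are illustrative choices.

```python
import jax
import jax.numpy as jnp

# An illustrative scalar function g : R^2 -> R.
def g(y):
    return y[0] ** 2 + 3.0 * y[1]

y = jnp.array([1.0, 2.0])
u = jnp.array([0.01, -0.02])   # a small perturbation of y

# The covector dg_y is the gradient; pairing it with u gives a scalar.
dg_y = jax.grad(g)(y)          # array([2., 3.])
print(jnp.dot(dg_y, u))        # ≈ g(y + u) - g(y) to first order
```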
Pullback
If the pushforward moves tangent vectors forward, the pullback moves cotangents backward.
Given

$$f : \mathbb{R}^n \to \mathbb{R}^m,$$

the pushforward maps

$$v \mapsto J_f(x)\, v.$$

The pullback maps an output cotangent $w$ to an input cotangent $\bar{x}$:

$$w \mapsto J_f(x)^\top w.$$

Equivalently, in row-vector convention,

$$w \mapsto w\, J_f(x).$$
Reverse mode implements the pullback.
The pullback answers this question:
Given a sensitivity on the output, what sensitivity does it induce on the input?
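JAX's `jax.vjp` returns the pullback as a function that can be applied to any output cotangent; the example function and cotangent below are illustrative, not prescribed by the text.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
w = jnp.array([1.0, 0.0])       # cotangent on the output space R^2

# jax.vjp returns the primal output and a function applying the pullback.
y, pullback = jax.vjp(f, x)
(x_bar,) = pullback(w)          # input cotangent J_f(x)^T @ w
print(x_bar)
```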
Pairing
The connection between pushforward and pullback is the scalar pairing between cotangents and tangents.
Let $v$ be an input tangent and $w$ be an output cotangent. The output tangent is

$$J_f(x)\, v.$$

Pairing it with $w$ gives

$$\langle w,\ J_f(x)\, v \rangle = w^\top J_f(x)\, v.$$

The pullback gives an input cotangent

$$J_f(x)^\top w.$$

Pairing this with $v$ gives

$$\langle J_f(x)^\top w,\ v \rangle = w^\top J_f(x)\, v.$$

So pushforward and pullback are adjoint operations with respect to this pairing:

$$\langle w,\ J_f(x)\, v \rangle = \langle J_f(x)^\top w,\ v \rangle.$$
This identity is the mathematical basis of reverse mode.
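A quick numerical check of this identity, using `jax.jvp` and `jax.vjp` on an arbitrary function with arbitrary tangent and cotangent seeds.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([0.3, -0.1, 0.5])  # input tangent
w = jnp.array([0.7, 2.0])        # output cotangent

_, Jv = jax.jvp(f, (x,), (v,))   # pushforward: J v
_, pullback = jax.vjp(f, x)
(JTw,) = pullback(w)             # pullback: J^T w

# Adjoint identity: <w, J v> == <J^T w, v>
print(jnp.dot(w, Jv), jnp.dot(JTw, v))  # the two numbers agree
```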
Forward Mode as Pushforward
Forward mode propagates pairs:

$$(x,\ \dot{x}),$$

where $x$ is a primal value and $\dot{x}$ is a tangent.

For a primitive operation

$$y = f(x),$$

forward mode computes

$$(y,\ \dot{y}) = (f(x),\ J_f(x)\, \dot{x}).$$
The tangent moves in the same direction as the dataflow graph.
For a full program

$$y = F(x) = (f_k \circ \cdots \circ f_2 \circ f_1)(x),$$

forward mode computes

$$\dot{y} = J_{f_k} \cdots J_{f_2} J_{f_1}\, \dot{x} = J_F(x)\, \dot{x}.$$
This is a Jacobian-vector product.
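A minimal hand-rolled sketch of this propagation, with every value carried as a (primal, tangent) pair; the helper names `mul` and `sin` are made up for illustration, not any library's API.

```python
# A hand-rolled sketch of forward mode: every value is a (primal, tangent) pair.
import math

def mul(a, b):
    (x, xd), (y, yd) = a, b
    return (x * y, xd * y + x * yd)    # product rule on the tangent part

def sin(a):
    x, xd = a
    return (math.sin(x), math.cos(x) * xd)

# Differentiate f(x1, x2) = sin(x1 * x2) in the direction (1, 0),
# i.e. seed x1 with tangent 1 and x2 with tangent 0.
x1, x2 = (2.0, 1.0), (3.0, 0.0)
y, y_dot = sin(mul(x1, x2))
print(y, y_dot)   # sin(6) and 3*cos(6), i.e. x2 * cos(x1*x2)
```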
Reverse Mode as Pullback
Reverse mode propagates pairs of the form:

$$(x,\ \bar{x}),$$

where $\bar{x}$ is a cotangent or adjoint.

For a primitive operation

$$y = f(x),$$

reverse mode receives $\bar{y}$ and computes contributions:

$$\bar{x} \mathrel{+}= J_f(x)^\top \bar{y}.$$
The cotangent moves opposite to the dataflow graph.
For a full program

$$y = F(x) = (f_k \circ \cdots \circ f_2 \circ f_1)(x),$$

reverse mode computes

$$\bar{x} = J_{f_1}^\top J_{f_2}^\top \cdots J_{f_k}^\top\, \bar{y} = J_F(x)^\top \bar{y}.$$
This is a vector-Jacobian product in column-vector convention.
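A matching hand-rolled sketch of the backward propagation for $y = \sin(x_1 x_2)$: run the primal pass, then apply the transposed local rules in reverse order. The variable names are illustrative.

```python
# A hand-rolled sketch of reverse mode for y = sin(x1 * x2).
import math

x1, x2 = 2.0, 3.0

# Forward (primal) pass, keeping the intermediate value.
t = x1 * x2
y = math.sin(t)

# Backward pass: seed the output cotangent, then pull it back step by step.
y_bar = 1.0
t_bar = math.cos(t) * y_bar     # pullback through sin
x1_bar = x2 * t_bar             # pullback through mul, toward x1
x2_bar = x1 * t_bar             # pullback through mul, toward x2

print(x1_bar, x2_bar)           # the gradient: (x2*cos(x1*x2), x1*cos(x1*x2))
```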
Why Reverse Mode Gives Gradients
Let

$$f : \mathbb{R}^n \to \mathbb{R}$$

be a scalar-valued function. Its Jacobian has shape

$$1 \times n.$$

Reverse mode starts from the scalar output with seed

$$\bar{y} = 1.$$

The pullback gives

$$\bar{x} = J_f(x)^\top \cdot 1.$$

Thus

$$\bar{x} = \nabla f(x),$$
up to the row-vector or column-vector convention.
This is the reason reverse mode is efficient for scalar losses with many inputs. One pullback gives all input sensitivities.
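A short sketch of this in JAX: seeding the pullback of a scalar loss with 1.0 reproduces `jax.grad`. The `loss` function below is an arbitrary stand-in.

```python
import jax
import jax.numpy as jnp

# An illustrative scalar loss with many inputs.
def loss(x):
    return jnp.sum(jnp.sin(x) * x)

x = jnp.arange(1.0, 6.0)

_, pullback = jax.vjp(loss, x)
(x_bar,) = pullback(1.0)        # seed the scalar output cotangent with 1.0

print(jnp.allclose(x_bar, jax.grad(loss)(x)))  # True: one pullback gives the gradient
```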
Example
Consider

$$f(x_1, x_2) = (x_1 x_2,\ \sin x_1).$$

The Jacobian is

$$J_f(x) = \begin{pmatrix} x_2 & x_1 \\ \cos x_1 & 0 \end{pmatrix}.$$

For an input tangent

$$v = (v_1, v_2),$$

the pushforward is

$$J_f(x)\, v = (x_2 v_1 + x_1 v_2,\ \cos(x_1)\, v_1).$$

For an output cotangent

$$w = (w_1, w_2),$$

the pullback is

$$J_f(x)^\top w = (x_2 w_1 + \cos(x_1)\, w_2,\ x_1 w_1).$$
The first expression is what forward mode computes. The second is what reverse mode computes.
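The same example, checked numerically with `jax.jvp` and `jax.vjp`; the numeric values of `x`, `v`, and `w` are arbitrary.

```python
import jax
import jax.numpy as jnp

# The example function f(x1, x2) = (x1 * x2, sin(x1)).
def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[0])])

x = jnp.array([1.5, -2.0])
v = jnp.array([0.2, 0.3])    # input tangent
w = jnp.array([1.0, 2.0])    # output cotangent

_, Jv = jax.jvp(f, (x,), (v,))     # (x2*v1 + x1*v2, cos(x1)*v1)
_, pullback = jax.vjp(f, x)
(JTw,) = pullback(w)               # (x2*w1 + cos(x1)*w2, x1*w1)

print(Jv, JTw)
```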
Implementation Meaning
In an AD implementation, these ideas become local rules.
For multiplication,

$$z = x\, y,$$

the pushforward rule is

$$\dot{z} = \dot{x}\, y + x\, \dot{y}.$$

The pullback rule is

$$\bar{x} \mathrel{+}= y\, \bar{z}, \qquad \bar{y} \mathrel{+}= x\, \bar{z}.$$

For sine,

$$z = \sin x,$$

the pushforward rule is

$$\dot{z} = \cos(x)\, \dot{x}.$$

The pullback rule is

$$\bar{x} \mathrel{+}= \cos(x)\, \bar{z}.$$

For matrix multiplication,

$$Z = X Y,$$

the pushforward rule is

$$\dot{Z} = \dot{X}\, Y + X\, \dot{Y}.$$

The pullback rules are

$$\bar{X} \mathrel{+}= \bar{Z}\, Y^\top, \qquad \bar{Y} \mathrel{+}= X^\top \bar{Z}.$$
These are all instances of the same principle: forward mode applies the derivative, reverse mode applies its transpose.
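A sketch verifying the matrix-multiplication pullback rules against the generic pullback machinery in JAX; the shapes and random inputs are arbitrary choices.

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
X = jax.random.normal(k1, (2, 3))
Y = jax.random.normal(k2, (3, 4))
Z_bar = jax.random.normal(k3, (2, 4))   # an arbitrary output cotangent

# Hand-written pullback rules for Z = X @ Y.
X_bar_rule = Z_bar @ Y.T
Y_bar_rule = X.T @ Z_bar

# The same cotangents obtained from the generic pullback.
_, pullback = jax.vjp(lambda X, Y: X @ Y, X, Y)
X_bar, Y_bar = pullback(Z_bar)

print(jnp.allclose(X_bar, X_bar_rule), jnp.allclose(Y_bar, Y_bar_rule))  # True True
```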
Avoiding Full Jacobians
The pushforward and pullback views explain why AD systems rarely need full Jacobians.
Forward mode needs only the action

$$v \mapsto J_f(x)\, v.$$

Reverse mode needs only the action

$$w \mapsto J_f(x)^\top w.$$

Both can be computed from local rules without building $J_f(x)$.
This matters because the Jacobian may be enormous. A neural network can have billions of parameters. A physical simulation can have millions of state variables. The full derivative matrix is usually the wrong object to materialize.
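A sketch of this point in JAX, assuming an elementwise map on a million inputs, so a dense Jacobian would have $10^{12}$ entries: one JVP and one VJP each return just a vector, and each costs roughly one evaluation of the function.

```python
import jax
import jax.numpy as jnp

n = 1_000_000  # a million inputs: the dense Jacobian would have 10^12 entries

def f(x):
    return jnp.sin(x) * x   # elementwise map, R^n -> R^n

x = jnp.linspace(0.0, 1.0, n)
v = jnp.ones(n)

_, Jv = jax.jvp(f, (x,), (v,))   # J v, without ever forming J
_, pullback = jax.vjp(f, x)
(JTv,) = pullback(v)             # J^T v, likewise

print(Jv.shape, JTv.shape)       # (1000000,) (1000000,)
```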
Summary Principle
Forward mode propagates tangents through pushforwards.
Reverse mode propagates cotangents through pullbacks.
Both are manifestations of the same derivative. They differ only in which side of the local linear map they apply and which direction they traverse the computation.