
Dual Spaces and Pushforwards

Forward mode and reverse mode propagate different kinds of objects.

Forward mode propagates tangent vectors. These represent small changes in values.

Reverse mode propagates cotangent vectors. These represent sensitivities of an output with respect to values.

The distinction is more than notation. It explains why forward mode computes Jacobian-vector products, while reverse mode computes vector-Jacobian products.

Tangent Vectors

Suppose

$$f : \mathbb{R}^n \to \mathbb{R}^m$$

and we evaluate it at a point $x$. A tangent vector $v \in \mathbb{R}^n$ represents a small input perturbation:

$$x \mapsto x + \epsilon v$$

The derivative tells how this perturbation affects the output:

$$f(x + \epsilon v) = f(x) + \epsilon J_f(x)v + O(\epsilon^2)$$

The output tangent is

$$w = J_f(x)v$$

Forward mode computes this map:

$$v \mapsto J_f(x)v$$

This map is called the pushforward.

Pushforward

The pushforward is the derivative viewed as a map from input tangents to output tangents.

$$f_* : T_x\mathbb{R}^n \to T_{f(x)}\mathbb{R}^m$$

In ordinary coordinates,

$$f_*(v) = J_f(x)v$$

Forward mode implements the pushforward directly. If a program computes $y = f(x)$, then its forward derivative program computes

$$(y, \dot{y}) = (f(x), J_f(x)\dot{x})$$

This is why forward mode is often called tangent propagation.
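Tangent propagation can be sketched with dual numbers: each value carries its tangent, and every primitive updates both. This is a minimal illustration in plain Python, not any particular library's API; the names `Dual` and `jvp` are ours.

```python
import math

class Dual:
    """Pair (primal, tangent): a value x together with its tangent x_dot."""
    def __init__(self, primal, tangent=0.0):
        self.primal = primal
        self.tangent = tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: d(xy) = y dx + x dy
        return Dual(self.primal * other.primal,
                    other.primal * self.tangent + self.primal * other.tangent)

    __rmul__ = __mul__

def sin(a):
    # pushforward rule for sine: z_dot = cos(x) x_dot
    return Dual(math.sin(a.primal), math.cos(a.primal) * a.tangent)

def jvp(f, x, v):
    """Jacobian-vector product: push the input tangent v through f at x."""
    duals = [Dual(xi, vi) for xi, vi in zip(x, v)]
    out = f(*duals)
    return [o.primal for o in out], [o.tangent for o in out]

# f(x1, x2) = (x1*x2, sin x1); seeding v = e1 extracts the first Jacobian column
f = lambda x1, x2: (x1 * x2, sin(x1))
y, y_dot = jvp(f, [2.0, 3.0], [1.0, 0.0])
```

Seeding with a basis vector recovers one column of $J_f(x)$ per forward pass, which is exactly the `v -> J_f(x) v` action described above.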

Dual Spaces

A dual space is the space of linear functionals on a vector space.

If $V$ is a vector space, its dual space $V^*$ is

$$V^* = \{\alpha : V \to \mathbb{R} \mid \alpha \text{ is linear}\}$$

An element of $V^*$ takes a vector and returns a scalar.

For $V = \mathbb{R}^n$, a covector $\alpha \in V^*$ can be represented as a row vector:

$$\alpha = \begin{bmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_n \end{bmatrix}$$

It acts on a vector $v$ by

$$\alpha(v) = \alpha v$$

or, in column-vector notation,

$$\alpha(v) = \alpha^T v$$

depending on convention.
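Concretely, a covector is nothing more than a linear function from vectors to scalars. A tiny sketch (the names `covector` and `alpha` are ours, not from any library):

```python
# A covector on R^n, stored by its coefficients (a row vector).
# Applying it to a vector is the pairing alpha(v) = sum_i alpha_i v_i.
def covector(coeffs):
    return lambda v: sum(a * x for a, x in zip(coeffs, v))

alpha = covector([1.0, -2.0, 3.0])
val = alpha([4.0, 5.0, 6.0])   # 1*4 - 2*5 + 3*6 = 12
```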

In AD, these dual vectors are usually called cotangents or adjoints.

Cotangents and Sensitivities

A cotangent represents sensitivity of a scalar quantity to a value.

If

$$L : \mathbb{R}^m \to \mathbb{R}$$

then the derivative of $L$ at $y$ is a covector:

$$dL_y \in (\mathbb{R}^m)^*$$

It maps an output perturbation $\Delta y$ to the resulting scalar perturbation:

$$dL_y(\Delta y)$$

In coordinates, this is

$$dL_y(\Delta y) = \nabla L(y)^T \Delta y$$

Reverse mode propagates these covectors backward through a computation.

Pullback

If the pushforward moves tangent vectors forward, the pullback moves cotangents backward.

Given

$$f : \mathbb{R}^n \to \mathbb{R}^m$$

the pushforward maps

$$v \mapsto J_f(x)v$$

The pullback maps an output cotangent $\bar{y}$ to an input cotangent $\bar{x}$:

$$\bar{x} = J_f(x)^T \bar{y}$$

Equivalently, in row-vector convention,

$$\bar{x}^T = \bar{y}^T J_f(x)$$

Reverse mode implements the pullback.

The pullback answers this question:

Given a sensitivity on the output, what sensitivity does it induce on the input?

Pairing

The connection between pushforward and pullback is the scalar pairing between cotangents and tangents.

Let $v$ be an input tangent and $\bar{y}$ be an output cotangent. The output tangent is

$$J_f(x)v$$

Pairing it with $\bar{y}$ gives

$$\bar{y}^T J_f(x)v$$

The pullback gives an input cotangent

$$\bar{x} = J_f(x)^T\bar{y}$$

Pairing this with $v$ gives

$$\bar{x}^T v = (J_f(x)^T\bar{y})^T v = \bar{y}^T J_f(x)v$$

So pushforward and pullback are adjoint operations with respect to this pairing.

This identity is the mathematical basis of reverse mode.
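The adjoint identity is easy to check numerically for a concrete matrix. A sketch in plain Python, with an arbitrary 2x3 Jacobian chosen for illustration:

```python
# Check the adjoint identity  y_bar^T (J v) == (J^T y_bar)^T v
# for a concrete 2x3 matrix J, using plain Python lists.
J = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
v = [1.0, -1.0, 2.0]        # input tangent in R^3
y_bar = [0.5, -2.0]         # output cotangent in R^2

Jv = [sum(J[i][k] * v[k] for k in range(3)) for i in range(2)]          # pushforward
JT_ybar = [sum(J[i][k] * y_bar[i] for i in range(2)) for k in range(3)] # pullback

lhs = sum(y_bar[i] * Jv[i] for i in range(2))    # pair output cotangent with output tangent
rhs = sum(JT_ybar[k] * v[k] for k in range(3))   # pair pulled-back cotangent with input tangent
```

Both pairings evaluate to the same scalar, as the identity requires.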

Forward Mode as Pushforward

Forward mode propagates pairs:

$$(x, \dot{x})$$

where $x$ is a primal value and $\dot{x}$ is a tangent.

For a primitive operation

$$z = \phi(x, y)$$

forward mode computes

$$\dot{z} = \frac{\partial \phi}{\partial x}\dot{x} + \frac{\partial \phi}{\partial y}\dot{y}$$

The tangent moves in the same direction as the dataflow graph.

For a full program

$$y = f(x)$$

forward mode computes

$$\dot{y} = J_f(x)\dot{x}$$

This is a Jacobian-vector product.

Reverse Mode as Pullback

Reverse mode propagates pairs of the form:

$$(y, \bar{y})$$

where $\bar{y}$ is a cotangent or adjoint.

For a primitive operation

$$z = \phi(x, y)$$

reverse mode receives $\bar{z}$ and computes contributions:

$$\bar{x} \mathrel{+}= \left(\frac{\partial \phi}{\partial x}\right)^T \bar{z}$$

$$\bar{y} \mathrel{+}= \left(\frac{\partial \phi}{\partial y}\right)^T \bar{z}$$

The cotangent moves opposite to the dataflow graph.

For a full program

$$y = f(x)$$

reverse mode computes

$$\bar{x} = J_f(x)^T\bar{y}$$

This is a vector-Jacobian product in column-vector convention.
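A minimal tape-based sketch of this backward sweep, again in plain Python with our own names (`Var`, `backward`); real AD systems differ in many details:

```python
import math

class Var:
    """Node in a computation graph; .grad accumulates the cotangent (adjoint)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent, local partial derivative)
        self.grad = 0.0

def mul(a, b):
    return Var(a.value * b.value, [(a, b.value), (b, a.value)])

def sin(a):
    return Var(math.sin(a.value), [(a, math.cos(a.value))])

def backward(outputs, cotangents):
    """Pull output cotangents back to every input: x_bar = J^T y_bar."""
    order, seen = [], set()
    def visit(node):                      # topological order via DFS
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)
    for out, ct in zip(outputs, cotangents):
        visit(out)
        out.grad += ct                    # seed the output adjoints
    for node in reversed(order):          # sweep opposite to the dataflow
        for parent, partial in node.parents:
            parent.grad += partial * node.grad   # transpose of the local derivative

x1, x2 = Var(2.0), Var(3.0)
y = [mul(x1, x2), sin(x1)]    # f(x1, x2) = (x1*x2, sin x1)
backward(y, [1.0, 1.0])       # seed y_bar = (1, 1)
# x1.grad accumulates x2 + cos(x1); x2.grad accumulates x1
```

One backward sweep computes $J_f(x)^T \bar{y}$ for all inputs at once, which is the efficiency claim made below for scalar losses.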

Why Reverse Mode Gives Gradients

Let

$$L : \mathbb{R}^n \to \mathbb{R}$$

be a scalar-valued function. Its Jacobian has shape

$$J_L(x) \in \mathbb{R}^{1 \times n}$$

Reverse mode starts from the scalar output with seed

$$\bar{L} = 1$$

The pullback gives

$$\bar{x} = J_L(x)^T \cdot 1$$

Thus

$$\bar{x} = \nabla L(x)$$

up to the row-vector or column-vector convention.

This is the reason reverse mode is efficient for scalar losses with many inputs. One pullback gives all input sensitivities.

Example

Consider

$$f(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ \sin x_1 \end{bmatrix}$$

The Jacobian is

$$J_f(x) = \begin{bmatrix} x_2 & x_1 \\ \cos x_1 & 0 \end{bmatrix}$$

For an input tangent

$$v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$$

the pushforward is

$$J_f(x)v = \begin{bmatrix} x_2 v_1 + x_1 v_2 \\ \cos(x_1) v_1 \end{bmatrix}$$

For an output cotangent

$$\bar{y} = \begin{bmatrix} \bar{y}_1 \\ \bar{y}_2 \end{bmatrix}$$

the pullback is

$$J_f(x)^T\bar{y} = \begin{bmatrix} x_2 \bar{y}_1 + \cos(x_1)\bar{y}_2 \\ x_1 \bar{y}_1 \end{bmatrix}$$

The first expression is what forward mode computes. The second is what reverse mode computes.
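The closed-form pushforward and pullback for this example can be checked against a central finite difference. A sketch with arbitrary test values of our choosing:

```python
import math

def f(x1, x2):
    return [x1 * x2, math.sin(x1)]

# Closed-form pushforward and pullback from the Jacobian above.
def pushforward(x1, x2, v1, v2):
    return [x2 * v1 + x1 * v2, math.cos(x1) * v1]

def pullback(x1, x2, y1_bar, y2_bar):
    return [x2 * y1_bar + math.cos(x1) * y2_bar, x1 * y1_bar]

x1, x2 = 0.7, 1.3
v1, v2 = 0.2, -0.4

# Central finite difference: (f(x + eps v) - f(x - eps v)) / (2 eps) ~ J_f(x) v
eps = 1e-6
fd = [(a - b) / (2 * eps)
      for a, b in zip(f(x1 + eps * v1, x2 + eps * v2),
                      f(x1 - eps * v1, x2 - eps * v2))]
jv = pushforward(x1, x2, v1, v2)

xbar = pullback(x1, x2, 1.0, 2.0)   # second component should be x1 * 1.0
```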

Implementation Meaning

In an AD implementation, these ideas become local rules.

For multiplication,

$$z = xy$$

the pushforward rule is

$$\dot{z} = y\dot{x} + x\dot{y}$$

The pullback rule is

$$\bar{x} \mathrel{+}= y\bar{z}$$

$$\bar{y} \mathrel{+}= x\bar{z}$$

For sine,

$$z = \sin x$$

the pushforward rule is

$$\dot{z} = \cos(x)\dot{x}$$

The pullback rule is

$$\bar{x} \mathrel{+}= \cos(x)\bar{z}$$

For matrix multiplication,

$$C = AB$$

the pushforward rule is

$$\dot{C} = \dot{A}B + A\dot{B}$$

The pullback rules are

$$\bar{A} \mathrel{+}= \bar{C}B^T$$

$$\bar{B} \mathrel{+}= A^T\bar{C}$$

These are all instances of the same principle: forward mode applies the derivative, reverse mode applies its transpose.
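For the matrix case, the transpose principle means the pushforward and pullback rules agree under the elementwise (trace) pairing on matrices. A pure-Python check on small matrices of our choosing:

```python
# Verify that the matmul pullback rules are adjoint to the pushforward rule
# under the pairing <X, Y> = sum_ij X_ij Y_ij.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def pair(X, Y):
    return sum(a * b for rx, ry in zip(X, Y) for a, b in zip(rx, ry))

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
A_dot = [[0.1, -0.2], [0.3, 0.4]]      # input tangents
B_dot = [[-0.5, 0.6], [0.7, -0.8]]
C_bar = [[1.0, -1.0], [2.0, 0.5]]      # output cotangent

C_dot = add(matmul(A_dot, B), matmul(A, B_dot))   # pushforward: A_dot B + A B_dot
A_bar = matmul(C_bar, transpose(B))               # pullback: C_bar B^T
B_bar = matmul(transpose(A), C_bar)               # pullback: A^T C_bar

lhs = pair(C_bar, C_dot)                       # <C_bar, pushforward(A_dot, B_dot)>
rhs = pair(A_bar, A_dot) + pair(B_bar, B_dot)  # <pullback(C_bar), (A_dot, B_dot)>
```

The two pairings agree, which is the matrix form of the adjoint identity from the Pairing section.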

Avoiding Full Jacobians

The pushforward and pullback views explain why AD systems rarely need full Jacobians.

Forward mode needs only the action

$$v \mapsto J_f(x)v$$

Reverse mode needs only the action

$$\bar{y} \mapsto J_f(x)^T\bar{y}$$

Both can be computed from local rules without building Jf(x)J_f(x).

This matters because the Jacobian may be enormous. A neural network can have billions of parameters. A physical simulation can have millions of state variables. The full derivative matrix is usually the wrong object to materialize.

Summary Principle

Forward mode propagates tangents through pushforwards.

Reverse mode propagates cotangents through pullbacks.

Both are manifestations of the same derivative. They differ only in which side of the local linear map they apply and which direction they traverse the computation.