
Dual Spaces and Pushforwards

Forward mode and reverse mode propagate different kinds of objects.

Forward mode propagates tangent vectors. These represent small changes in values.

Reverse mode propagates cotangent vectors. These represent sensitivities of an output with respect to values.

The distinction is more than notation. It explains why forward mode computes Jacobian-vector products, while reverse mode computes vector-Jacobian products.

Tangent Vectors

Suppose

$$f : \mathbb{R}^n \to \mathbb{R}^m$$

and we evaluate it at a point $x$. A tangent vector $v \in \mathbb{R}^n$ represents a small input perturbation:

$$x \mapsto x + \epsilon v$$

The derivative tells how this perturbation affects the output:

$$f(x + \epsilon v) = f(x) + \epsilon J_f(x)v + O(\epsilon^2)$$

The output tangent is

$$w = J_f(x)v$$

Forward mode computes this map:

$$v \mapsto J_f(x)v$$

This map is called the pushforward.

Pushforward

The pushforward is the derivative viewed as a map from input tangents to output tangents.

$$f_* : T_x\mathbb{R}^n \to T_{f(x)}\mathbb{R}^m$$

In ordinary coordinates,

$$f_*(v) = J_f(x)v$$

Forward mode implements the pushforward directly. If a program computes $y = f(x)$, then its forward derivative program computes

$$(y, \dot{y}) = (f(x), J_f(x)\dot{x})$$

This is why forward mode is often called tangent propagation.
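Tangent propagation can be sketched with dual numbers: each value carries its tangent, and every primitive updates both. This is a minimal illustration in plain Python, not any particular library's API; the names `Dual` and `jvp` are ours.

```python
import math

class Dual:
    """Pair (primal, tangent): a value x together with its tangent x_dot."""
    def __init__(self, primal, tangent=0.0):
        self.primal = primal
        self.tangent = tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: d(xy) = y dx + x dy
        return Dual(self.primal * other.primal,
                    other.primal * self.tangent + self.primal * other.tangent)

    __rmul__ = __mul__

def sin(a):
    # pushforward rule for sine: z_dot = cos(x) x_dot
    return Dual(math.sin(a.primal), math.cos(a.primal) * a.tangent)

def jvp(f, x, v):
    """Jacobian-vector product: push the input tangent v through f at x."""
    duals = [Dual(xi, vi) for xi, vi in zip(x, v)]
    out = f(*duals)
    return [o.primal for o in out], [o.tangent for o in out]

# f(x1, x2) = (x1*x2, sin x1); seeding v = e1 extracts the first Jacobian column
f = lambda x1, x2: (x1 * x2, sin(x1))
y, y_dot = jvp(f, [2.0, 3.0], [1.0, 0.0])
```

Seeding with a basis vector recovers one column of $J_f(x)$ per forward pass, which is exactly the `v -> J_f(x) v` action described above.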

Dual Spaces

A dual space is the space of linear functionals on a vector space.

If $V$ is a vector space, its dual space $V^*$ is

$$V^* = \{\alpha : V \to \mathbb{R} \mid \alpha \text{ is linear}\}$$

An element of $V^*$ takes a vector and returns a scalar.

For $V = \mathbb{R}^n$, a covector $\alpha \in V^*$ can be represented as a row vector:

$$\alpha = \begin{bmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_n \end{bmatrix}$$

It acts on a vector $v$ by

$$\alpha(v) = \alpha v$$

or, in column-vector notation,

$$\alpha(v) = \alpha^T v$$

depending on convention.
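Concretely, a covector is nothing more than a linear function from vectors to scalars. A tiny sketch (the names `covector` and `alpha` are ours, not from any library):

```python
# A covector on R^n, stored by its coefficients (a row vector).
# Applying it to a vector is the pairing alpha(v) = sum_i alpha_i v_i.
def covector(coeffs):
    return lambda v: sum(a * x for a, x in zip(coeffs, v))

alpha = covector([1.0, -2.0, 3.0])
val = alpha([4.0, 5.0, 6.0])   # 1*4 - 2*5 + 3*6 = 12
```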

In AD, these dual vectors are usually called cotangents or adjoints.

Cotangents and Sensitivities

A cotangent represents sensitivity of a scalar quantity to a value.

If

$$L : \mathbb{R}^m \to \mathbb{R}$$

then the derivative of $L$ at $y$ is a covector:

$$dL_y \in (\mathbb{R}^m)^*$$

It maps an output perturbation $\Delta y$ to the resulting scalar perturbation:

$$dL_y(\Delta y)$$

In coordinates, this is

$$dL_y(\Delta y) = \nabla L(y)^T \Delta y$$

Reverse mode propagates these covectors backward through a computation.

Pullback

If the pushforward moves tangent vectors forward, the pullback moves cotangents backward.

Given

$$f : \mathbb{R}^n \to \mathbb{R}^m$$

the pushforward maps

$$v \mapsto J_f(x)v$$

The pullback maps an output cotangent $\bar{y}$ to an input cotangent $\bar{x}$:

$$\bar{x} = J_f(x)^T \bar{y}$$

Equivalently, in row-vector convention,

$$\bar{x}^T = \bar{y}^T J_f(x)$$

Reverse mode implements the pullback.

The pullback answers this question:

Given a sensitivity on the output, what sensitivity does it induce on the input?

Pairing

The connection between pushforward and pullback is the scalar pairing between cotangents and tangents.

Let $v$ be an input tangent and $\bar{y}$ be an output cotangent. The output tangent is

$$J_f(x)v$$

Pairing it with $\bar{y}$ gives

$$\bar{y}^T J_f(x)v$$

The pullback gives an input cotangent

$$\bar{x} = J_f(x)^T\bar{y}$$

Pairing this with $v$ gives

$$\bar{x}^T v = (J_f(x)^T\bar{y})^T v = \bar{y}^T J_f(x)v$$

So pushforward and pullback are adjoint operations with respect to this pairing.

This identity is the mathematical basis of reverse mode.
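The adjoint identity is easy to check numerically for a concrete matrix. A sketch in plain Python, with an arbitrary 2x3 Jacobian chosen for illustration:

```python
# Check the adjoint identity  y_bar^T (J v) == (J^T y_bar)^T v
# for a concrete 2x3 matrix J, using plain Python lists.
J = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
v = [1.0, -1.0, 2.0]        # input tangent in R^3
y_bar = [0.5, -2.0]         # output cotangent in R^2

Jv = [sum(J[i][k] * v[k] for k in range(3)) for i in range(2)]          # pushforward
JT_ybar = [sum(J[i][k] * y_bar[i] for i in range(2)) for k in range(3)] # pullback

lhs = sum(y_bar[i] * Jv[i] for i in range(2))    # pair output cotangent with output tangent
rhs = sum(JT_ybar[k] * v[k] for k in range(3))   # pair pulled-back cotangent with input tangent
```

Both pairings evaluate to the same scalar, as the identity requires.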

Forward Mode as Pushforward

Forward mode propagates pairs:

$$(x, \dot{x})$$

where $x$ is a primal value and $\dot{x}$ is a tangent.

For a primitive operation

$$z = \phi(x, y)$$

forward mode computes

$$\dot{z} = \frac{\partial \phi}{\partial x}\dot{x} + \frac{\partial \phi}{\partial y}\dot{y}$$

The tangent moves in the same direction as the dataflow graph.

For a full program

$$y = f(x)$$

forward mode computes

$$\dot{y} = J_f(x)\dot{x}$$

This is a Jacobian-vector product.

Reverse Mode as Pullback

Reverse mode propagates pairs of the form:

$$(y, \bar{y})$$

where $\bar{y}$ is a cotangent or adjoint.

For a primitive operation

$$z = \phi(x, y)$$

reverse mode receives $\bar{z}$ and computes contributions:

$$\bar{x} \mathrel{+}= \left(\frac{\partial \phi}{\partial x}\right)^T \bar{z}$$

$$\bar{y} \mathrel{+}= \left(\frac{\partial \phi}{\partial y}\right)^T \bar{z}$$

The cotangent moves opposite to the dataflow graph.

For a full program

$$y = f(x)$$

reverse mode computes

$$\bar{x} = J_f(x)^T\bar{y}$$

This is a vector-Jacobian product in column-vector convention.
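A minimal tape-based sketch of this backward sweep, again in plain Python with our own names (`Var`, `backward`); real AD systems differ in many details:

```python
import math

class Var:
    """Node in a computation graph; .grad accumulates the cotangent (adjoint)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent, local partial derivative)
        self.grad = 0.0

def mul(a, b):
    return Var(a.value * b.value, [(a, b.value), (b, a.value)])

def sin(a):
    return Var(math.sin(a.value), [(a, math.cos(a.value))])

def backward(outputs, cotangents):
    """Pull output cotangents back to every input: x_bar = J^T y_bar."""
    order, seen = [], set()
    def visit(node):                      # topological order via DFS
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)
    for out, ct in zip(outputs, cotangents):
        visit(out)
        out.grad += ct                    # seed the output adjoints
    for node in reversed(order):          # sweep opposite to the dataflow
        for parent, partial in node.parents:
            parent.grad += partial * node.grad   # transpose of the local derivative

x1, x2 = Var(2.0), Var(3.0)
y = [mul(x1, x2), sin(x1)]    # f(x1, x2) = (x1*x2, sin x1)
backward(y, [1.0, 1.0])       # seed y_bar = (1, 1)
# x1.grad accumulates x2 + cos(x1); x2.grad accumulates x1
```

One backward sweep computes $J_f(x)^T \bar{y}$ for all inputs at once, which is the efficiency claim made below for scalar losses.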

Why Reverse Mode Gives Gradients

Let

$$L : \mathbb{R}^n \to \mathbb{R}$$

be a scalar-valued function. Its Jacobian has shape

$$J_L(x) \in \mathbb{R}^{1 \times n}$$

Reverse mode starts from the scalar output with seed

$$\bar{L} = 1$$

The pullback gives

$$\bar{x} = J_L(x)^T \cdot 1$$

Thus

$$\bar{x} = \nabla L(x)$$

up to the row-vector or column-vector convention.

This is the reason reverse mode is efficient for scalar losses with many inputs. One pullback gives all input sensitivities.

Example

Consider

$$f(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ \sin x_1 \end{bmatrix}$$

The Jacobian is

$$J_f(x) = \begin{bmatrix} x_2 & x_1 \\ \cos x_1 & 0 \end{bmatrix}$$

For an input tangent

$$v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$$

the pushforward is

$$J_f(x)v = \begin{bmatrix} x_2 v_1 + x_1 v_2 \\ \cos(x_1) v_1 \end{bmatrix}$$

For an output cotangent

$$\bar{y} = \begin{bmatrix} \bar{y}_1 \\ \bar{y}_2 \end{bmatrix}$$

the pullback is

$$J_f(x)^T\bar{y} = \begin{bmatrix} x_2 \bar{y}_1 + \cos(x_1)\bar{y}_2 \\ x_1 \bar{y}_1 \end{bmatrix}$$

The first expression is what forward mode computes. The second is what reverse mode computes.
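The closed-form pushforward and pullback for this example can be checked against a central finite difference. A sketch with arbitrary test values of our choosing:

```python
import math

def f(x1, x2):
    return [x1 * x2, math.sin(x1)]

# Closed-form pushforward and pullback from the Jacobian above.
def pushforward(x1, x2, v1, v2):
    return [x2 * v1 + x1 * v2, math.cos(x1) * v1]

def pullback(x1, x2, y1_bar, y2_bar):
    return [x2 * y1_bar + math.cos(x1) * y2_bar, x1 * y1_bar]

x1, x2 = 0.7, 1.3
v1, v2 = 0.2, -0.4

# Central finite difference: (f(x + eps v) - f(x - eps v)) / (2 eps) ~ J_f(x) v
eps = 1e-6
fd = [(a - b) / (2 * eps)
      for a, b in zip(f(x1 + eps * v1, x2 + eps * v2),
                      f(x1 - eps * v1, x2 - eps * v2))]
jv = pushforward(x1, x2, v1, v2)

xbar = pullback(x1, x2, 1.0, 2.0)   # second component should be x1 * 1.0
```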

Implementation Meaning

In an AD implementation, these ideas become local rules.

For multiplication,

$$z = xy$$

the pushforward rule is

$$\dot{z} = y\dot{x} + x\dot{y}$$

The pullback rule is

$$\bar{x} \mathrel{+}= y\bar{z}$$

$$\bar{y} \mathrel{+}= x\bar{z}$$

For sine,

$$z = \sin x$$

the pushforward rule is

$$\dot{z} = \cos(x)\dot{x}$$

The pullback rule is

$$\bar{x} \mathrel{+}= \cos(x)\bar{z}$$

For matrix multiplication,

$$C = AB$$

the pushforward rule is

$$\dot{C} = \dot{A}B + A\dot{B}$$

The pullback rules are

$$\bar{A} \mathrel{+}= \bar{C}B^T$$

$$\bar{B} \mathrel{+}= A^T\bar{C}$$

These are all instances of the same principle: forward mode applies the derivative, reverse mode applies its transpose.
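For the matrix case, the transpose principle means the pushforward and pullback rules agree under the elementwise (trace) pairing on matrices. A pure-Python check on small matrices of our choosing:

```python
# Verify that the matmul pullback rules are adjoint to the pushforward rule
# under the pairing <X, Y> = sum_ij X_ij Y_ij.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def pair(X, Y):
    return sum(a * b for rx, ry in zip(X, Y) for a, b in zip(rx, ry))

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
A_dot = [[0.1, -0.2], [0.3, 0.4]]      # input tangents
B_dot = [[-0.5, 0.6], [0.7, -0.8]]
C_bar = [[1.0, -1.0], [2.0, 0.5]]      # output cotangent

C_dot = add(matmul(A_dot, B), matmul(A, B_dot))   # pushforward: A_dot B + A B_dot
A_bar = matmul(C_bar, transpose(B))               # pullback: C_bar B^T
B_bar = matmul(transpose(A), C_bar)               # pullback: A^T C_bar

lhs = pair(C_bar, C_dot)                       # <C_bar, pushforward(A_dot, B_dot)>
rhs = pair(A_bar, A_dot) + pair(B_bar, B_dot)  # <pullback(C_bar), (A_dot, B_dot)>
```

The two pairings agree, which is the matrix form of the adjoint identity from the Pairing section.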

Avoiding Full Jacobians

The pushforward and pullback views explain why AD systems rarely need full Jacobians.

Forward mode needs only the action

$$v \mapsto J_f(x)v$$

Reverse mode needs only the action

$$\bar{y} \mapsto J_f(x)^T\bar{y}$$

Both can be computed from local rules without building Jf(x)J_f(x).

This matters because the Jacobian may be enormous. A neural network can have billions of parameters. A physical simulation can have millions of state variables. The full derivative matrix is usually the wrong object to materialize.

Summary Principle

Forward mode propagates tangents through pushforwards.

Reverse mode propagates cotangents through pullbacks.

Both are manifestations of the same derivative. They differ only in which side of the local linear map they apply and which direction they traverse the computation.