
Vector-Jacobian Products

Reverse mode automatic differentiation fundamentally computes vector-Jacobian products. The gradient of a scalar function is a special case of this more general operation.


Consider a differentiable function

$$ f : \mathbb{R}^n \to \mathbb{R}^m. $$

Its Jacobian matrix is

$$ J_f(x) = \frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}. $$

The Jacobian maps input perturbations into output perturbations.

Forward mode computes Jacobian-vector products:

$$ J_f(x) v. $$

Reverse mode computes vector-Jacobian products:

$$ u^\top J_f(x). $$

These two operations are dual to each other.
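The duality is the identity $u^\top (Jv) = (u^\top J)\,v$ for every $u$ and $v$. A minimal NumPy sketch, using a random matrix as a stand-in Jacobian (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
J = rng.standard_normal((m, n))   # stand-in Jacobian of some f: R^n -> R^m
v = rng.standard_normal(n)        # tangent vector (forward-mode seed)
u = rng.standard_normal(m)        # adjoint vector (reverse-mode seed)

jvp = J @ v    # forward mode: Jacobian-vector product, shape (m,)
vjp = u @ J    # reverse mode: vector-Jacobian product, shape (n,)

# Duality: pairing u with Jv equals pairing u^T J with v.
assert np.isclose(u @ jvp, vjp @ v)
```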

Linearization of a Program

A differentiable program locally behaves like a linear map.

Near a point $x$,

$$ f(x + \delta x) \approx f(x) + J_f(x)\delta x. $$

The Jacobian is therefore the local linearization of the computation.

A tangent vector $\delta x$ propagates forward through the Jacobian:

$$ \delta y = J_f(x)\delta x. $$

An adjoint vector $u$ propagates backward through the transpose:

$$ \delta x^\ast = J_f(x)^\top u. $$

Equivalently,

$$ u^\top J_f(x). $$

Reverse mode implements this backward propagation without explicitly materializing the Jacobian matrix.

Scalar Output as a Special Case

Suppose

$$ f : \mathbb{R}^n \to \mathbb{R}. $$

Then the Jacobian has shape $1 \times n$.

It is simply the gradient row vector:

$$ J_f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}. $$

If we choose the seed vector $u = 1$, then

$$ u^\top J_f(x) = J_f(x) = \nabla f(x)^\top. $$

This explains why reverse mode efficiently computes gradients for scalar-valued objectives.

Deep learning optimization almost always minimizes a scalar loss $L(\theta)$.

Reverse mode therefore computes $\nabla_\theta L$ using one backward pass.
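For a concrete scalar loss, a hand-written vector-Jacobian product with seed $u = 1$ recovers the gradient directly. A minimal sketch for $L(\theta) = \sum_i \theta_i^2$ (the function and names here are illustrative):

```python
import numpy as np

def loss(theta):
    """Scalar loss L(theta) = sum(theta**2)."""
    return float(np.sum(theta ** 2))

def loss_vjp(theta, u):
    """Vector-Jacobian product u * J_L(theta); J_L is the 1 x n gradient row."""
    return u * 2.0 * theta

theta = np.array([1.0, -2.0, 3.0])
grad = loss_vjp(theta, 1.0)   # seed u = 1 recovers the gradient
assert np.allclose(grad, 2.0 * theta)   # gradient of sum(theta^2) is 2*theta
```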

Example of a Vector-Jacobian Product

Define

$$ f(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ x_1 + \sin(x_2) \end{bmatrix}. $$

The Jacobian is

$$ J_f(x) = \begin{bmatrix} x_2 & x_1 \\ 1 & \cos(x_2) \end{bmatrix}. $$

Suppose we choose output seed vector

$$ u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}. $$

The vector-Jacobian product is

$$ u^\top J_f(x) = \begin{bmatrix} u_1 & u_2 \end{bmatrix} \begin{bmatrix} x_2 & x_1 \\ 1 & \cos(x_2) \end{bmatrix}. $$

Therefore,

$$ u^\top J_f(x) = \begin{bmatrix} u_1 x_2 + u_2 & u_1 x_1 + u_2 \cos(x_2) \end{bmatrix}. $$

Reverse mode computes this result directly by backward propagation.
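The closed form can be checked numerically against the explicit $2 \times 2$ Jacobian, which is affordable here precisely because the example is tiny. A NumPy sketch (function names are illustrative):

```python
import numpy as np

def f(x1, x2):
    """The example function f: R^2 -> R^2."""
    return np.array([x1 * x2, x1 + np.sin(x2)])

def f_vjp(x1, x2, u):
    """Row vector u^T J_f(x) for the example above, written out by hand."""
    u1, u2 = u
    return np.array([u1 * x2 + u2, u1 * x1 + u2 * np.cos(x2)])

x1, x2 = 0.7, 1.3
u = np.array([2.0, -1.0])

# Compare against the explicit Jacobian (fine here: it is only 2x2).
J = np.array([[x2, x1],
              [1.0, np.cos(x2)]])
assert np.allclose(f_vjp(x1, x2, u), u @ J)
```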

Reverse Propagation Interpretation

Suppose the outputs are

$$ y_1 = x_1 x_2, \qquad y_2 = x_1 + \sin(x_2). $$

The seed vector assigns sensitivities to outputs:

$$ \bar y_1 = u_1, \qquad \bar y_2 = u_2. $$

Reverse propagation distributes these sensitivities into the inputs.

From

$$ y_1 = x_1 x_2, $$

we obtain contributions:

$$ \bar x_1 \mathrel{+}= u_1 x_2, \qquad \bar x_2 \mathrel{+}= u_1 x_1. $$

From

$$ y_2 = x_1 + \sin(x_2), $$

we obtain:

$$ \bar x_1 \mathrel{+}= u_2, \qquad \bar x_2 \mathrel{+}= u_2 \cos(x_2). $$

Accumulating both contributions gives

$$ \bar x_1 = u_1 x_2 + u_2, \qquad \bar x_2 = u_1 x_1 + u_2 \cos(x_2). $$

This is exactly the vector-Jacobian product.
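The accumulation rules above translate directly into code, with `+=` mirroring the adjoint updates. A minimal sketch (names illustrative):

```python
import numpy as np

def reverse_pass(x1, x2, u1, u2):
    """Distribute output sensitivities (u1, u2) back to the inputs."""
    x1_bar = 0.0
    x2_bar = 0.0
    # From y1 = x1 * x2:
    x1_bar += u1 * x2
    x2_bar += u1 * x1
    # From y2 = x1 + sin(x2):
    x1_bar += u2
    x2_bar += u2 * np.cos(x2)
    return x1_bar, x2_bar

x1, x2, u1, u2 = 0.5, 2.0, 1.0, 3.0
x1_bar, x2_bar = reverse_pass(x1, x2, u1, u2)
# Matches the closed-form row vector [u1*x2 + u2, u1*x1 + u2*cos(x2)].
assert np.isclose(x1_bar, u1 * x2 + u2)
assert np.isclose(x2_bar, u1 * x1 + u2 * np.cos(x2))
```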

Reverse Mode Never Builds the Full Jacobian

An explicit Jacobian may be extremely large.

For

$$ f : \mathbb{R}^{10^9} \to \mathbb{R}, $$

the Jacobian contains one billion entries.

For tensor-valued neural networks, the Jacobian may be astronomically large.

Reverse mode avoids materializing the matrix. Instead, it computes only the action of the transpose on a seed vector.

Operationally:

| Method | Computes |
| --- | --- |
| Forward mode | $Jv$ |
| Reverse mode | $u^\top J$ |

This distinction is fundamental.

Automatic differentiation systems are really systems for efficient linear operator evaluation.
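As a related illustration of matrix-free evaluation, consider elementwise squaring on a million inputs: its Jacobian is a $10^6 \times 10^6$ diagonal matrix, far too large to store densely, yet its transposed action costs only one elementwise multiply. A NumPy sketch (names illustrative):

```python
import numpy as np

n = 1_000_000
x = np.linspace(0.0, 1.0, n)

# f(x) = x**2 elementwise: the Jacobian is an n x n diagonal matrix,
# which would have 10**12 dense entries for n = 10**6.
def f_vjp(x, u):
    """Action of J_f(x)^T on u, computed without building the matrix."""
    return 2.0 * x * u   # diagonal Jacobian: the transpose acts elementwise

u = np.ones(n)
xbar = f_vjp(x, u)   # O(n) time and memory, no n x n matrix anywhere
assert xbar.shape == (n,)
```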

Pullbacks

A reverse-mode primitive defines a pullback.

Suppose

$$ y = g(x). $$

The forward pass computes $y$.

The reverse pass receives the output sensitivity $\bar y$ and returns the input sensitivity

$$ \bar x = J_g(x)^\top \bar y. $$

This mapping

$$ \mathcal{B}_g : \bar y \mapsto \bar x $$

is called a pullback.

For multivariate functions,

$$ g : \mathbb{R}^n \to \mathbb{R}^m, $$

the pullback maps

$$ \mathbb{R}^m \to \mathbb{R}^n. $$

Each primitive operation therefore exposes two interfaces:

| Pass | Mapping |
| --- | --- |
| Forward | inputs $\to$ outputs |
| Reverse | output adjoints $\to$ input adjoints |

The reverse interface is the transposed local linear map.
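A pullback interface can be sketched as a forward function that also returns a closure over the saved inputs. A minimal illustration for the primitive $y = \sin(x)$ (the API shape here is an assumption for exposition, not any particular library's):

```python
import numpy as np

def sin_primitive(x):
    """Forward pass plus pullback for y = sin(x), applied elementwise."""
    y = np.sin(x)
    def pullback(y_bar):
        # J_sin(x)^T y_bar = cos(x) * y_bar (diagonal Jacobian)
        return np.cos(x) * y_bar
    return y, pullback

x = np.array([0.0, np.pi / 2])
y, pb = sin_primitive(x)
x_bar = pb(np.array([1.0, 1.0]))   # adjoint pulled back through sin
assert np.allclose(y, [0.0, 1.0])
assert np.allclose(x_bar, [1.0, 0.0])
```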

Composition of Vector-Jacobian Products

Consider a composition

$$ f(x) = g(h(x)). $$

The Jacobian satisfies

$$ J_f(x) = J_g(h(x)) J_h(x). $$

A vector-Jacobian product becomes

$$ u^\top J_f(x) = u^\top J_g(h(x)) J_h(x). $$

Reverse mode evaluates this right-to-left:

  1. propagate $u$ backward through $g$;
  2. propagate the resulting adjoint backward through $h$.

This exactly matches backward graph traversal.

The reverse pass therefore computes repeated local vector-Jacobian products.
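Composing pullbacks right-to-left can be sketched directly. Here $h(x) = x^2$ and $g(z) = \sin(z)$ elementwise, so $f(x) = \sin(x^2)$; the names and API shape are illustrative:

```python
import numpy as np

def h_primitive(x):
    z = x ** 2
    return z, lambda z_bar: 2.0 * x * z_bar        # action of J_h(x)^T

def g_primitive(z):
    y = np.sin(z)
    return y, lambda y_bar: np.cos(z) * y_bar      # action of J_g(z)^T

def f_vjp(x, u):
    """u^T J_f(x) for f = g . h, evaluated right-to-left."""
    z, h_pullback = h_primitive(x)     # forward through h
    y, g_pullback = g_primitive(z)     # forward through g
    return h_pullback(g_pullback(u))   # backward: first g, then h

x = np.array([0.5, 1.5])
u = np.array([1.0, 1.0])
# Analytically: d/dx sin(x^2) = cos(x^2) * 2x.
assert np.allclose(f_vjp(x, u), np.cos(x ** 2) * 2.0 * x)
```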

Relation to the Chain Rule

The chain rule for scalar functions says

$$ \frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}. $$

For vector-valued functions, the chain rule becomes matrix multiplication:

$$ J_{g \circ h}(x) = J_g(h(x)) J_h(x). $$

Reverse mode applies transposed chain multiplication:

$$ J_{g \circ h}(x)^\top = J_h(x)^\top J_g(h(x))^\top. $$

The reverse execution order emerges directly from this transpose reversal.

Computational Complexity

Suppose a function evaluation costs $C_f$.

For scalar outputs, reverse mode usually computes the full gradient in time $O(C_f)$.

Often the backward pass costs only a small constant multiple of the forward pass.

This is known as the cheap gradient principle.

For many programs:

| Phase | Relative Cost |
| --- | --- |
| Forward pass | $1\times$ |
| Reverse pass | $2\times$ to $5\times$ |
| Total gradient | $3\times$ to $6\times$ |

The exact factor depends on:

  1. operation types;
  2. memory traffic;
  3. recomputation strategy;
  4. tensor kernel efficiency.

Reverse Mode as Covector Propagation

Tangent vectors propagate in the same direction as computation.

Adjoints propagate in the opposite direction.

Geometrically:

| Object | Direction |
| --- | --- |
| Tangent vector | forward |
| Covector / adjoint | backward |

A Jacobian maps tangent vectors:

$$ v \mapsto Jv. $$

Its transpose maps covectors:

$$ u \mapsto J^\top u. $$

Reverse mode therefore operates naturally on covectors rather than tangent vectors.

This interpretation becomes important in differential geometry, optimal control, and variational calculus.

Batch Reverse Mode

Neural networks frequently process batches of examples simultaneously.

Suppose

$$ X \in \mathbb{R}^{B \times n}, $$

where $B$ is the batch size.

The reverse pass propagates adjoints through tensor operations. The underlying mathematics remains vector-Jacobian multiplication, but now the vectors themselves may be large tensors.

For example, matrix multiplication

$$ Y = XW $$

has reverse rules

$$ \bar X = \bar Y W^\top, \qquad \bar W = X^\top \bar Y. $$

These are transposed linear operations derived from the Jacobian structure of matrix multiplication.

Modern tensor libraries implement such vector-Jacobian products directly as optimized kernels.
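The two matrix-multiplication rules can be verified numerically: for the scalar pairing $s = \langle \bar Y, XW \rangle$, the derivative with respect to an entry of $X$ matches the corresponding entry of $\bar Y W^\top$. A NumPy sketch (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))
Y_bar = rng.standard_normal((4, 2))   # adjoint arriving at Y = X @ W

# Reverse rules for matrix multiplication.
X_bar = Y_bar @ W.T     # shape (4, 3), same as X
W_bar = X.T @ Y_bar     # shape (3, 2), same as W

# Check one entry against a finite difference of s = <Y_bar, X @ W>.
eps = 1e-6
dX = np.zeros_like(X)
dX[1, 2] = eps
num = (np.sum(Y_bar * ((X + dX) @ W)) - np.sum(Y_bar * (X @ W))) / eps
assert np.isclose(num, X_bar[1, 2], atol=1e-4)
```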

Conceptual Summary

Reverse mode automatic differentiation is fundamentally a system for evaluating transposed local linear maps.

Each primitive operation defines:

  1. a forward computation;
  2. a pullback implementing a vector-Jacobian product.

The backward pass composes these pullbacks in reverse order.

The full Jacobian is almost never constructed explicitly. Instead, reverse mode computes only the action of the Jacobian transpose on a seed vector.

For scalar outputs, this seed is simply $1$, and the resulting vector-Jacobian product becomes the gradient itself.