
Vector-Jacobian Products

Reverse mode automatic differentiation fundamentally computes vector-Jacobian products. The gradient of a scalar function is a special case of this more general operation.


Consider a differentiable function

$$ f : \mathbb{R}^n \to \mathbb{R}^m. $$

Its Jacobian matrix is

$$ J_f(x) = \frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}. $$

The Jacobian maps input perturbations into output perturbations.

Forward mode computes Jacobian-vector products:

$$ J_f(x) v. $$

Reverse mode computes vector-Jacobian products:

$$ u^\top J_f(x). $$

These two operations are dual to each other.
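The duality is the identity $u^\top (Jv) = (u^\top J)\,v$ for every $u$ and $v$. A minimal NumPy sketch, using a random matrix as a stand-in Jacobian (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
J = rng.standard_normal((m, n))   # stand-in Jacobian of some f: R^n -> R^m
v = rng.standard_normal(n)        # tangent vector (forward-mode seed)
u = rng.standard_normal(m)        # adjoint vector (reverse-mode seed)

jvp = J @ v    # forward mode: Jacobian-vector product, shape (m,)
vjp = u @ J    # reverse mode: vector-Jacobian product, shape (n,)

# Duality: pairing u with Jv equals pairing u^T J with v.
assert np.isclose(u @ jvp, vjp @ v)
```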

Linearization of a Program

A differentiable program locally behaves like a linear map.

Near a point $x$,

$$ f(x + \delta x) \approx f(x) + J_f(x)\delta x. $$

The Jacobian is therefore the local linearization of the computation.

A tangent vector $\delta x$ propagates forward through the Jacobian:

$$ \delta y = J_f(x)\delta x. $$

An adjoint vector $u$ propagates backward through the transpose:

$$ \delta x^\ast = J_f(x)^\top u. $$

Equivalently,

$$ u^\top J_f(x). $$

Reverse mode implements this backward propagation without explicitly materializing the Jacobian matrix.

Scalar Output as a Special Case

Suppose

$$ f : \mathbb{R}^n \to \mathbb{R}. $$

Then the Jacobian has shape $1 \times n$.

It is simply the gradient row vector:

$$ J_f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}. $$

If we choose the seed vector $u = 1$, then

$$ u^\top J_f(x) = J_f(x) = \nabla f(x)^\top. $$

This explains why reverse mode efficiently computes gradients for scalar-valued objectives.

Deep learning optimization almost always minimizes a scalar loss $L(\theta)$.

Reverse mode therefore computes $\nabla_\theta L$ using one backward pass.
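For a concrete scalar loss, a hand-written vector-Jacobian product with seed $u = 1$ recovers the gradient directly. A minimal sketch for $L(\theta) = \sum_i \theta_i^2$ (the function and names here are illustrative):

```python
import numpy as np

def loss(theta):
    """Scalar loss L(theta) = sum(theta**2)."""
    return float(np.sum(theta ** 2))

def loss_vjp(theta, u):
    """Vector-Jacobian product u * J_L(theta); J_L is the 1 x n gradient row."""
    return u * 2.0 * theta

theta = np.array([1.0, -2.0, 3.0])
grad = loss_vjp(theta, 1.0)   # seed u = 1 recovers the gradient
assert np.allclose(grad, 2.0 * theta)   # gradient of sum(theta^2) is 2*theta
```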

Example of a Vector-Jacobian Product

Define

$$ f(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ x_1 + \sin(x_2) \end{bmatrix}. $$

The Jacobian is

$$ J_f(x) = \begin{bmatrix} x_2 & x_1 \\ 1 & \cos(x_2) \end{bmatrix}. $$

Suppose we choose output seed vector

$$ u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}. $$

The vector-Jacobian product is

$$ u^\top J_f(x) = \begin{bmatrix} u_1 & u_2 \end{bmatrix} \begin{bmatrix} x_2 & x_1 \\ 1 & \cos(x_2) \end{bmatrix}. $$

Therefore,

$$ u^\top J_f(x) = \begin{bmatrix} u_1 x_2 + u_2 & u_1 x_1 + u_2 \cos(x_2) \end{bmatrix}. $$

Reverse mode computes this result directly by backward propagation.
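The closed form can be checked numerically against the explicit $2 \times 2$ Jacobian, which is affordable here precisely because the example is tiny. A NumPy sketch (function names are illustrative):

```python
import numpy as np

def f(x1, x2):
    """The example function f: R^2 -> R^2."""
    return np.array([x1 * x2, x1 + np.sin(x2)])

def f_vjp(x1, x2, u):
    """Row vector u^T J_f(x) for the example above, written out by hand."""
    u1, u2 = u
    return np.array([u1 * x2 + u2, u1 * x1 + u2 * np.cos(x2)])

x1, x2 = 0.7, 1.3
u = np.array([2.0, -1.0])

# Compare against the explicit Jacobian (fine here: it is only 2x2).
J = np.array([[x2, x1],
              [1.0, np.cos(x2)]])
assert np.allclose(f_vjp(x1, x2, u), u @ J)
```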

Reverse Propagation Interpretation

Suppose the outputs are

$$ y_1 = x_1 x_2, \qquad y_2 = x_1 + \sin(x_2). $$

The seed vector assigns sensitivities to outputs:

$$ \bar y_1 = u_1, \qquad \bar y_2 = u_2. $$

Reverse propagation distributes these sensitivities into the inputs.

From

$$ y_1 = x_1 x_2, $$

we obtain contributions:

$$ \bar x_1 \mathrel{+}= u_1 x_2, \qquad \bar x_2 \mathrel{+}= u_1 x_1. $$

From

$$ y_2 = x_1 + \sin(x_2), $$

we obtain:

$$ \bar x_1 \mathrel{+}= u_2, \qquad \bar x_2 \mathrel{+}= u_2 \cos(x_2). $$

Accumulating both contributions gives

$$ \bar x_1 = u_1 x_2 + u_2, \qquad \bar x_2 = u_1 x_1 + u_2 \cos(x_2). $$

This is exactly the vector-Jacobian product.
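The accumulation rules above translate directly into code, with `+=` mirroring the adjoint updates. A minimal sketch (names illustrative):

```python
import numpy as np

def reverse_pass(x1, x2, u1, u2):
    """Distribute output sensitivities (u1, u2) back to the inputs."""
    x1_bar = 0.0
    x2_bar = 0.0
    # From y1 = x1 * x2:
    x1_bar += u1 * x2
    x2_bar += u1 * x1
    # From y2 = x1 + sin(x2):
    x1_bar += u2
    x2_bar += u2 * np.cos(x2)
    return x1_bar, x2_bar

x1, x2, u1, u2 = 0.5, 2.0, 1.0, 3.0
x1_bar, x2_bar = reverse_pass(x1, x2, u1, u2)
# Matches the closed-form row vector [u1*x2 + u2, u1*x1 + u2*cos(x2)].
assert np.isclose(x1_bar, u1 * x2 + u2)
assert np.isclose(x2_bar, u1 * x1 + u2 * np.cos(x2))
```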

Reverse Mode Never Builds the Full Jacobian

An explicit Jacobian may be extremely large.

For

$$ f : \mathbb{R}^{10^9} \to \mathbb{R}, $$

the Jacobian contains one billion entries.

For tensor-valued neural networks, the Jacobian may be astronomically large.

Reverse mode avoids materializing the matrix. Instead, it computes only the action of the transpose on a seed vector.

Operationally:

| Method | Computes |
| --- | --- |
| Forward mode | $Jv$ |
| Reverse mode | $u^\top J$ |

This distinction is fundamental.

Automatic differentiation systems are really systems for efficient linear operator evaluation.
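As a related illustration of matrix-free evaluation, consider elementwise squaring on a million inputs: its Jacobian is a $10^6 \times 10^6$ diagonal matrix, far too large to store densely, yet its transposed action costs only one elementwise multiply. A NumPy sketch (names illustrative):

```python
import numpy as np

n = 1_000_000
x = np.linspace(0.0, 1.0, n)

# f(x) = x**2 elementwise: the Jacobian is an n x n diagonal matrix,
# which would have 10**12 dense entries for n = 10**6.
def f_vjp(x, u):
    """Action of J_f(x)^T on u, computed without building the matrix."""
    return 2.0 * x * u   # diagonal Jacobian: the transpose acts elementwise

u = np.ones(n)
xbar = f_vjp(x, u)   # O(n) time and memory, no n x n matrix anywhere
assert xbar.shape == (n,)
```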

Pullbacks

A reverse-mode primitive defines a pullback.

Suppose

$$ y = g(x). $$

The forward pass computes $y$.

The reverse pass receives the output sensitivity $\bar y$ and returns the input sensitivity

$$ \bar x = J_g(x)^\top \bar y. $$

This mapping

$$ \mathcal{B}_g : \bar y \mapsto \bar x $$

is called a pullback.

For multivariate functions,

$$ g : \mathbb{R}^n \to \mathbb{R}^m, $$

the pullback maps

$$ \mathbb{R}^m \to \mathbb{R}^n. $$

Each primitive operation therefore exposes two interfaces:

| Pass | Mapping |
| --- | --- |
| Forward | inputs $\to$ outputs |
| Reverse | output adjoints $\to$ input adjoints |

The reverse interface is the transposed local linear map.
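A pullback interface can be sketched as a forward function that also returns a closure over the saved inputs. A minimal illustration for the primitive $y = \sin(x)$ (the API shape here is an assumption for exposition, not any particular library's):

```python
import numpy as np

def sin_primitive(x):
    """Forward pass plus pullback for y = sin(x), applied elementwise."""
    y = np.sin(x)
    def pullback(y_bar):
        # J_sin(x)^T y_bar = cos(x) * y_bar (diagonal Jacobian)
        return np.cos(x) * y_bar
    return y, pullback

x = np.array([0.0, np.pi / 2])
y, pb = sin_primitive(x)
x_bar = pb(np.array([1.0, 1.0]))   # adjoint pulled back through sin
assert np.allclose(y, [0.0, 1.0])
assert np.allclose(x_bar, [1.0, 0.0])
```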

Composition of Vector-Jacobian Products

Consider a composition

$$ f(x) = g(h(x)). $$

The Jacobian satisfies

$$ J_f(x) = J_g(h(x)) J_h(x). $$

A vector-Jacobian product becomes

$$ u^\top J_f(x) = u^\top J_g(h(x)) J_h(x). $$

Reverse mode evaluates this right-to-left:

  1. propagate $u$ backward through $g$;
  2. propagate the resulting adjoint backward through $h$.

This exactly matches backward graph traversal.

The reverse pass therefore computes repeated local vector-Jacobian products.
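Composing pullbacks right-to-left can be sketched directly. Here $h(x) = x^2$ and $g(z) = \sin(z)$ elementwise, so $f(x) = \sin(x^2)$; the names and API shape are illustrative:

```python
import numpy as np

def h_primitive(x):
    z = x ** 2
    return z, lambda z_bar: 2.0 * x * z_bar        # action of J_h(x)^T

def g_primitive(z):
    y = np.sin(z)
    return y, lambda y_bar: np.cos(z) * y_bar      # action of J_g(z)^T

def f_vjp(x, u):
    """u^T J_f(x) for f = g . h, evaluated right-to-left."""
    z, h_pullback = h_primitive(x)     # forward through h
    y, g_pullback = g_primitive(z)     # forward through g
    return h_pullback(g_pullback(u))   # backward: first g, then h

x = np.array([0.5, 1.5])
u = np.array([1.0, 1.0])
# Analytically: d/dx sin(x^2) = cos(x^2) * 2x.
assert np.allclose(f_vjp(x, u), np.cos(x ** 2) * 2.0 * x)
```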

Relation to the Chain Rule

The chain rule for scalar functions says

$$ \frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}. $$

For vector-valued functions, the chain rule becomes matrix multiplication:

$$ J_{g \circ h}(x) = J_g(h(x)) J_h(x). $$

Reverse mode applies transposed chain multiplication:

$$ J_{g \circ h}(x)^\top = J_h(x)^\top J_g(h(x))^\top. $$

The reverse execution order emerges directly from this transpose reversal.

Computational Complexity

Suppose a function evaluation costs $C_f$.

For scalar outputs, reverse mode usually computes the full gradient in time $O(C_f)$.

Often the backward pass costs only a small constant multiple of the forward pass.

This is known as the cheap gradient principle.

For many programs:

| Phase | Relative Cost |
| --- | --- |
| Forward pass | $1\times$ |
| Reverse pass | $2\times$ to $5\times$ |
| Total gradient | $3\times$ to $6\times$ |

The exact factor depends on:

  1. operation types;
  2. memory traffic;
  3. recomputation strategy;
  4. tensor kernel efficiency.

Reverse Mode as Covector Propagation

Tangent vectors propagate in the same direction as computation.

Adjoints propagate in the opposite direction.

Geometrically:

| Object | Direction |
| --- | --- |
| Tangent vector | forward |
| Covector / adjoint | backward |

A Jacobian maps tangent vectors:

$$ v \mapsto Jv. $$

Its transpose maps covectors:

$$ u \mapsto J^\top u. $$

Reverse mode therefore operates naturally on covectors rather than tangent vectors.

This interpretation becomes important in differential geometry, optimal control, and variational calculus.

Batch Reverse Mode

Neural networks frequently process batches of examples simultaneously.

Suppose

$$ X \in \mathbb{R}^{B \times n}, $$

where $B$ is the batch size.

The reverse pass propagates adjoints through tensor operations. The underlying mathematics remains vector-Jacobian multiplication, but now the vectors themselves may be large tensors.

For example, matrix multiplication

$$ Y = XW $$

has reverse rules

$$ \bar X = \bar Y W^\top, \qquad \bar W = X^\top \bar Y. $$

These are transposed linear operations derived from the Jacobian structure of matrix multiplication.

Modern tensor libraries implement such vector-Jacobian products directly as optimized kernels.
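The two matrix-multiplication rules can be verified numerically: for the scalar pairing $s = \langle \bar Y, XW \rangle$, the derivative with respect to an entry of $X$ matches the corresponding entry of $\bar Y W^\top$. A NumPy sketch (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))
Y_bar = rng.standard_normal((4, 2))   # adjoint arriving at Y = X @ W

# Reverse rules for matrix multiplication.
X_bar = Y_bar @ W.T     # shape (4, 3), same as X
W_bar = X.T @ Y_bar     # shape (3, 2), same as W

# Check one entry against a finite difference of s = <Y_bar, X @ W>.
eps = 1e-6
dX = np.zeros_like(X)
dX[1, 2] = eps
num = (np.sum(Y_bar * ((X + dX) @ W)) - np.sum(Y_bar * (X @ W))) / eps
assert np.isclose(num, X_bar[1, 2], atol=1e-4)
```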

Conceptual Summary

Reverse mode automatic differentiation is fundamentally a system for evaluating transposed local linear maps.

Each primitive operation defines:

  1. a forward computation;
  2. a pullback implementing a vector-Jacobian product.

The backward pass composes these pullbacks in reverse order.

The full Jacobian is almost never constructed explicitly. Instead, reverse mode computes only the action of the Jacobian transpose on a seed vector.

For scalar outputs, this seed is simply $1$, and the resulting vector-Jacobian product becomes the gradient itself.