Jacobian-Vector Products

The natural output of forward mode automatic differentiation is a Jacobian-vector product. Instead of constructing the full Jacobian matrix explicitly, forward mode computes how a perturbation vector propagates through a function.

For a function

f : \mathbb{R}^n \to \mathbb{R}^m,

the Jacobian at x is

J_f(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}.

Given a direction vector

v \in \mathbb{R}^n,

forward mode computes

J_f(x)v.

This product is called a Jacobian-vector product, usually abbreviated JVP.
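
To make this concrete, here is a minimal sketch in JAX, whose `jax.jvp` computes exactly this product in one forward pass. The function `f` below is an illustrative choice, not one taken from this article:

```python
import jax
import jax.numpy as jnp

# An illustrative function f : R^2 -> R^2.
def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[1])])

x = jnp.array([2.0, 3.0])   # evaluation point
v = jnp.array([1.0, 0.5])   # perturbation direction

# jax.jvp evaluates f(x) and propagates v forward simultaneously,
# returning the primal output and the tangent J_f(x) v.
y, jv = jax.jvp(f, (x,), (v,))
```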

Geometric interpretation

A differentiable function locally behaves like a linear map. Around a point x,

f(x + \Delta x) \approx f(x) + J_f(x)\,\Delta x.

If we perturb the input in direction v,

x \mapsto x + \epsilon v,

then the first-order output perturbation is

f(x + \epsilon v) = f(x) + \epsilon J_f(x)v.

So the JVP tells us how infinitesimal motion in input space transforms into infinitesimal motion in output space.

Forward mode computes exactly this transformed direction.

Tangent propagation produces JVPs

Suppose the inputs are seeded with tangents:

x_i \mapsto (x_i, v_i).

Forward propagation computes tangent values for all intermediate variables. The final output tangent is

\dot{y} = J_f(x)v.

The tangent vector is therefore the directional derivative of the function in direction v.

This is why forward mode is sometimes described as directional differentiation.

Example: scalar output

Consider

f(x,y) = x^2 y + \sin y.

The Jacobian is

J_f(x,y) = \begin{bmatrix} 2xy & x^2 + \cos y \end{bmatrix}.

Choose direction

v = \begin{bmatrix} v_x \\ v_y \end{bmatrix}.

Then

J_f(x,y)v = 2xy\,v_x + (x^2 + \cos y)v_y.

Now compute the same result using forward mode.

Seed:

\dot{x} = v_x, \qquad \dot{y} = v_y.

Evaluate:

a = x^2, \qquad \dot{a} = 2x\,v_x.

b = ay, \qquad \dot{b} = \dot{a}\,y + a\,\dot{y}.

Substitute:

\dot{b} = 2xy\,v_x + x^2 v_y.

Next:

c = \sin y, \qquad \dot{c} = \cos y \, v_y.

Finally:

f = b + c, \qquad \dot{f} = \dot{b} + \dot{c}.

So

\dot{f} = 2xy\,v_x + (x^2 + \cos y)v_y.

This equals the Jacobian-vector product.
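
As a check, a short JAX sketch reproduces the same result; the numerical values of x, y, v_x, v_y below are arbitrary choices:

```python
import jax
import jax.numpy as jnp

def f(x, y):
    return x ** 2 * y + jnp.sin(y)

x, y = 1.5, 0.7      # arbitrary evaluation point
vx, vy = 1.0, 2.0    # arbitrary direction

# Seed the inputs with tangents (vx, vy) and propagate forward.
_, f_dot = jax.jvp(f, (x, y), (vx, vy))

# Compare against the hand-derived JVP: 2xy vx + (x^2 + cos y) vy.
expected = 2 * x * y * vx + (x ** 2 + jnp.cos(y)) * vy
print(jnp.allclose(f_dot, expected))  # True
```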

Example: vector output

Now consider

f(x,y) = \begin{bmatrix} xy \\ x+y \\ \sin x \end{bmatrix}.

The Jacobian is

J_f(x,y) = \begin{bmatrix} y & x \\ 1 & 1 \\ \cos x & 0 \end{bmatrix}.

For direction

v = \begin{bmatrix} v_x \\ v_y \end{bmatrix},

the JVP is

J_f(x,y)v = \begin{bmatrix} yv_x + xv_y \\ v_x + v_y \\ \cos x \, v_x \end{bmatrix}.

Forward mode computes this directly.

Seed:

\dot{x} = v_x, \qquad \dot{y} = v_y.

Then:

\dot{f}_1 = yv_x + xv_y, \qquad \dot{f}_2 = v_x + v_y, \qquad \dot{f}_3 = \cos x \, v_x.

The output tangent vector is exactly the JVP.
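
The same kind of check works for the vector-valued case; again a JAX sketch with arbitrary numbers:

```python
import jax
import jax.numpy as jnp

def f(x, y):
    return jnp.array([x * y, x + y, jnp.sin(x)])

x, y = 0.8, -1.2
vx, vy = 1.0, 0.5

# One forward pass produces the full output tangent vector.
_, tangent = jax.jvp(f, (x, y), (vx, vy))

expected = jnp.array([y * vx + x * vy, vx + vy, jnp.cos(x) * vx])
print(jnp.allclose(tangent, expected))  # True
```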

JVPs without explicit Jacobians

The important point is that forward mode never forms the Jacobian matrix explicitly.

For a large system, the Jacobian may be enormous. Suppose

f : \mathbb{R}^{10^6} \to \mathbb{R}^{10^6}.

The full Jacobian contains 10^{12} entries. Explicit storage is often impossible.

Forward mode avoids this cost. It computes

J_f(x)v

directly by propagating one tangent vector through the computation graph.

This is especially valuable when:

  1. Only directional derivatives are needed.
  2. The Jacobian is sparse or implicit.
  3. Forming the full matrix would be too expensive.
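
A sketch of this matrix-free pattern, using a hypothetical elementwise-coupled map on 10^6 variables (the specific f is invented for illustration):

```python
import jax
import jax.numpy as jnp

# Hypothetical large system with 10^6 inputs and outputs. Its full
# Jacobian would hold 10^12 entries, but the JVP below never forms it.
def f(x):
    return jnp.tanh(x) + 0.1 * jnp.roll(x, 1)

n = 1_000_000
x = jnp.ones(n)
v = jnp.zeros(n).at[0].set(1.0)   # perturb a single coordinate

# One forward pass, at roughly the cost of evaluating f itself.
_, jv = jax.jvp(f, (x,), (v,))
```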

Computational complexity

Suppose the primal function evaluation costs C.

A forward-mode JVP typically costs approximately

O(C)

up to a small constant factor.

The tangent computation follows the same graph as the primal computation. Each primitive performs some extra local derivative work, but the asymptotic complexity is usually unchanged.

Computing the full Jacobian is more expensive.

For

f : \mathbb{R}^n \to \mathbb{R}^m,

one forward pass computes one JVP. To recover the full Jacobian, we usually evaluate:

J_f(x)e_1, \quad J_f(x)e_2, \quad \ldots, \quad J_f(x)e_n,

where e_i are the standard basis vectors.

Thus full Jacobian construction requires approximately n forward passes.
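
In code, this is a loop over basis seeds. The sketch below uses an illustrative f and checks the result against `jax.jacfwd`, which is built on the same idea:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], x[0] + x[1], jnp.sin(x[0])])

x = jnp.array([0.8, -1.2])
n = x.shape[0]

# One JVP per basis vector e_i recovers column i of the Jacobian.
columns = [jax.jvp(f, (x,), (jnp.eye(n)[i],))[1] for i in range(n)]
J = jnp.stack(columns, axis=1)

print(jnp.allclose(J, jax.jacfwd(f)(x)))  # True
```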

Forward mode is therefore efficient when:

n \ll m

or when only a few directional derivatives are required.

Matrix view of tangent propagation

Each intermediate variable has a tangent:

\dot{v}.

If the primitive operation is

z = \phi(x_1, \ldots, x_k),

then

\dot{z} = \sum_i \frac{\partial \phi}{\partial x_i} \dot{x}_i.

This is exactly multiplication by the local Jacobian of the primitive.

The entire computation graph therefore performs repeated local matrix-vector multiplications:

v \mapsto J_1 v \mapsto J_2 J_1 v \mapsto \cdots \mapsto J_f v.

Forward mode composes these local linear maps incrementally during execution.

Relation to the chain rule

Suppose

f(x) = h(g(x)).

Then

J_f(x) = J_h(g(x))\,J_g(x).

Apply this Jacobian to a vector v:

J_f(x)v = J_h(g(x))\,(J_g(x)v).

Forward mode computes exactly this sequence:

  1. Push v through g.
  2. Push the resulting tangent through h.

The tangent vector flows forward through the composed computation.

This is the operational form of the chain rule.
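
The two-step structure can be observed directly; in the JAX sketch below, g and h are arbitrary illustrative functions:

```python
import jax
import jax.numpy as jnp

g = jnp.sin                  # inner function
h = lambda u: u ** 2 + u     # outer function

x, v = 0.5, 1.0

# Composed JVP in one forward pass.
_, jv_composed = jax.jvp(lambda t: h(g(t)), (x,), (v,))

# The same result, staged: push v through g, then the tangent through h.
u, u_dot = jax.jvp(g, (x,), (v,))
_, jv_staged = jax.jvp(h, (u,), (u_dot,))

print(jnp.allclose(jv_composed, jv_staged))  # True
```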

Basis seeding

To compute a specific partial derivative, choose a basis direction.

For

f : \mathbb{R}^3 \to \mathbb{R},

suppose we want

\frac{\partial f}{\partial x_2}.

Use seed:

v = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}.

Then

J_f(x)v = \frac{\partial f}{\partial x_2}.

More generally:

| Seed vector | Result                 |
| ----------- | ---------------------- |
| e_1         | first Jacobian column  |
| e_2         | second Jacobian column |
| e_i         | i-th Jacobian column   |
| arbitrary v | directional derivative |

Thus the seed determines the derivative query.
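
For example, seeding with e_2 in a small JAX sketch (the function is an arbitrary stand-in for f : R^3 -> R):

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0] * x[1] + jnp.exp(x[2])   # illustrative f : R^3 -> R

x = jnp.array([1.0, 2.0, 3.0])
e2 = jnp.array([0.0, 1.0, 0.0])          # basis seed selecting x_2

# The JVP with seed e2 is the partial derivative df/dx_2 = x_1 = 1.0.
_, df_dx2 = jax.jvp(f, (x,), (e2,))
print(df_dx2)  # 1.0
```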

Multiple directions simultaneously

Forward mode can propagate several tangent directions at once.

Instead of scalar tangents,

\dot{x}_i \in \mathbb{R},

use vector-valued tangents:

\dot{x}_i \in \mathbb{R}^k.

Each variable now carries k tangent components.

The output becomes

J_f(x)V,

where

V \in \mathbb{R}^{n \times k}.

This computes k JVPs simultaneously.

If

V = I_n,

the identity matrix, then

J_f(x)V = J_f(x),

so the full Jacobian is recovered in one vectorized pass. However, this may require large tangent storage and substantial arithmetic overhead.
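
One way to realize this in JAX is to vectorize a JVP over the columns of V with `jax.vmap`; the sketch below seeds with the identity and recovers the full Jacobian:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[0]) + x[1]])

x = jnp.array([0.3, 0.9])
n = x.shape[0]

# Push k = n directions through at once by vectorizing over the seed.
jvp_in_direction = lambda v: jax.jvp(f, (x,), (v,))[1]
JV = jax.vmap(jvp_in_direction)(jnp.eye(n))   # row i is J_f(x) e_i

print(jnp.allclose(JV.T, jax.jacfwd(f)(x)))  # True
```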

JVPs in machine learning

Modern machine learning systems frequently use JVPs.

Applications include:

| Application               | Use of JVP                        |
| ------------------------- | --------------------------------- |
| Sensitivity analysis      | perturbation propagation          |
| Meta-learning             | differentiating parameter updates |
| Implicit layers           | linearized solver differentiation |
| Neural ODEs               | tangent dynamics                  |
| Hessian-vector products   | nested differentiation            |
| Second-order optimization | curvature approximations          |
| Physics simulation        | variational equations             |

Many algorithms only require products with derivatives, not explicit derivative matrices.

This distinction is fundamental in large-scale systems.
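
As one concrete instance from the table, a Hessian-vector product can be formed by nesting a forward-mode JVP over a reverse-mode gradient; the loss below is an arbitrary stand-in:

```python
import jax
import jax.numpy as jnp

def loss(w):
    return jnp.sum(jnp.tanh(w) ** 2)   # illustrative scalar objective

w = jnp.array([0.1, -0.4, 0.7])
v = jnp.array([1.0, 0.0, 0.5])

# Forward-over-reverse: the JVP of the gradient map is H(w) v,
# computed without materializing the Hessian matrix.
_, hvp = jax.jvp(jax.grad(loss), (w,), (v,))
print(jnp.allclose(hvp, jax.hessian(loss)(w) @ v))  # True
```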

JVP versus VJP

Forward mode computes

Jv.

Reverse mode computes

J^\top v.

The reverse-mode product is called a vector-Jacobian product (VJP) or adjoint product.

The two have complementary complexity profiles:

| Mode         | Natural product | Efficient when |
| ------------ | --------------- | -------------- |
| Forward mode | Jv              | few inputs     |
| Reverse mode | J^\top v        | few outputs    |

For scalar-output functions,

f : \mathbb{R}^n \to \mathbb{R},

reverse mode computes the full gradient in one pass, while forward mode needs n passes.

For scalar-input functions,

f : \mathbb{R} \to \mathbb{R}^m,

forward mode computes the full derivative vector in one pass.
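
The asymmetry is visible in code: `jax.jvp` takes a direction in input space, while `jax.vjp` returns an operator that consumes a covector in output space. The function below is illustrative:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], x[0] + x[1], jnp.sin(x[0])])

x = jnp.array([0.8, -1.2])

# Forward mode: J v for a direction v in the input space R^2.
_, jv = jax.jvp(f, (x,), (jnp.array([1.0, 0.5]),))

# Reverse mode: J^T u for a covector u in the output space R^3.
_, vjp_fn = jax.vjp(f, x)
(jtu,) = vjp_fn(jnp.array([1.0, 0.0, 2.0]))
```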

Linearization viewpoint

A JVP can also be viewed as evaluation of the linearized function.

Define the linearization of f at x:

L_x(v) = J_f(x)v.

Forward mode computes

L_x(v)

without materializing L_x as a matrix.

In many systems, the linearized operator is more important than the Jacobian itself. Optimization methods, Krylov solvers, Newton methods, and sensitivity analysis often only require repeated applications of the linearized operator.

Forward mode naturally exposes this operator form.
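
JAX exposes this operator form directly: `jax.linearize` returns the primal output together with L_x as a callable. The function below is an arbitrary example:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] ** 2, x[0] * x[1]])

x = jnp.array([1.0, 2.0])

# One evaluation of f yields L_x as a reusable linear operator.
y, L_x = jax.linearize(f, x)

# Repeated applications, with no Jacobian matrix ever materialized.
print(L_x(jnp.array([1.0, 0.0])))  # first Jacobian column:  [2., 2.]
print(L_x(jnp.array([0.0, 1.0])))  # second Jacobian column: [0., 1.]
```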

Sparse directional propagation

If the seed vector v is sparse, tangent propagation only activates dependent computations.

For example, if

v_i = 0

for most components, many tangent computations remain zero.

This property is useful for:

  • sparse Jacobian estimation,
  • localized sensitivity analysis,
  • block-structured systems,
  • PDE discretizations,
  • graph-based models.

Efficient sparse forward-mode systems exploit this structure to reduce arithmetic and memory cost.
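
A small sketch of this zero structure (the block-separable f is invented for illustration; whether an implementation actually skips the zero arithmetic depends on the system):

```python
import jax
import jax.numpy as jnp

def f(x):
    # The first two outputs depend only on x[0] and x[1].
    return jnp.array([x[0] * x[1], x[0] + x[1], x[2] ** 2])

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([0.0, 0.0, 1.0])   # sparse seed: perturb x[2] only

# Output tangents of components independent of x[2] stay exactly zero.
_, jv = jax.jvp(f, (x,), (v,))
print(jv)  # [0. 0. 6.]
```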

Summary

Forward mode automatic differentiation naturally computes Jacobian-vector products:

J_f(x)v.

A tangent seed vector defines an infinitesimal perturbation direction. Tangent propagation pushes this perturbation through the computation graph using local derivative rules. The resulting output tangent is the directional derivative of the function.

The key property is that forward mode computes JVPs directly, without explicitly forming Jacobian matrices. This makes it effective for directional sensitivity analysis, sparse systems, higher-order methods, and problems where the number of input directions is small.