Jacobian-Vector Products

The natural output of forward mode automatic differentiation is a Jacobian-vector product. Instead of constructing the full Jacobian matrix explicitly, forward mode computes how a perturbation vector propagates through a function.

For a function

f : \mathbb{R}^n \to \mathbb{R}^m,

the Jacobian at x is

J_f(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}.

Given a direction vector

v \in \mathbb{R}^n,

forward mode computes

J_f(x)v.

This product is called a Jacobian-vector product, usually abbreviated JVP.
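
To make this concrete, here is a minimal sketch in JAX, whose `jax.jvp` computes exactly this product in one forward pass. The function `f` below is an illustrative choice, not one taken from this article:

```python
import jax
import jax.numpy as jnp

# An illustrative function f : R^2 -> R^2.
def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[1])])

x = jnp.array([2.0, 3.0])   # evaluation point
v = jnp.array([1.0, 0.5])   # perturbation direction

# jax.jvp evaluates f(x) and propagates v forward simultaneously,
# returning the primal output and the tangent J_f(x) v.
y, jv = jax.jvp(f, (x,), (v,))
```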

Geometric interpretation

A differentiable function locally behaves like a linear map. Around a point x,

f(x + \Delta x) \approx f(x) + J_f(x)\,\Delta x.

If we perturb the input in direction v,

x \mapsto x + \epsilon v,

then the first-order output perturbation is

f(x + \epsilon v) = f(x) + \epsilon J_f(x)v.

So the JVP tells us how infinitesimal motion in input space transforms into infinitesimal motion in output space.

Forward mode computes exactly this transformed direction.

Tangent propagation produces JVPs

Suppose the inputs are seeded with tangents:

x_i \mapsto (x_i, v_i).

Forward propagation computes tangent values for all intermediate variables. The final output tangent is

\dot{y} = J_f(x)v.

The tangent vector is therefore the directional derivative of the function in direction v.

This is why forward mode is sometimes described as directional differentiation.

Example: scalar output

Consider

f(x,y) = x^2 y + \sin y.

The Jacobian is

J_f(x,y) = \begin{bmatrix} 2xy & x^2 + \cos y \end{bmatrix}.

Choose direction

v = \begin{bmatrix} v_x \\ v_y \end{bmatrix}.

Then

J_f(x,y)v = 2xy\,v_x + (x^2 + \cos y)v_y.

Now compute the same result using forward mode.

Seed:

\dot{x} = v_x, \qquad \dot{y} = v_y.

Evaluate:

a = x^2, \qquad \dot{a} = 2x\,v_x.

b = ay, \qquad \dot{b} = \dot{a}\,y + a\,\dot{y}.

Substitute:

\dot{b} = 2xy\,v_x + x^2 v_y.

Next:

c = \sin y, \qquad \dot{c} = \cos y \, v_y.

Finally:

f = b + c, \qquad \dot{f} = \dot{b} + \dot{c}.

So

\dot{f} = 2xy\,v_x + (x^2 + \cos y)v_y.

This equals the Jacobian-vector product.
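
As a check, a short JAX sketch reproduces the same result; the numerical values of x, y, v_x, v_y below are arbitrary choices:

```python
import jax
import jax.numpy as jnp

def f(x, y):
    return x ** 2 * y + jnp.sin(y)

x, y = 1.5, 0.7      # arbitrary evaluation point
vx, vy = 1.0, 2.0    # arbitrary direction

# Seed the inputs with tangents (vx, vy) and propagate forward.
_, f_dot = jax.jvp(f, (x, y), (vx, vy))

# Compare against the hand-derived JVP: 2xy vx + (x^2 + cos y) vy.
expected = 2 * x * y * vx + (x ** 2 + jnp.cos(y)) * vy
print(jnp.allclose(f_dot, expected))  # True
```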

Example: vector output

Now consider

f(x,y) = \begin{bmatrix} xy \\ x+y \\ \sin x \end{bmatrix}.

The Jacobian is

J_f(x,y) = \begin{bmatrix} y & x \\ 1 & 1 \\ \cos x & 0 \end{bmatrix}.

For direction

v = \begin{bmatrix} v_x \\ v_y \end{bmatrix},

the JVP is

J_f(x,y)v = \begin{bmatrix} yv_x + xv_y \\ v_x + v_y \\ \cos x \, v_x \end{bmatrix}.

Forward mode computes this directly.

Seed:

\dot{x} = v_x, \qquad \dot{y} = v_y.

Then:

\dot{f}_1 = yv_x + xv_y, \qquad \dot{f}_2 = v_x + v_y, \qquad \dot{f}_3 = \cos x \, v_x.

The output tangent vector is exactly the JVP.
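
The same kind of check works for the vector-valued case; again a JAX sketch with arbitrary numbers:

```python
import jax
import jax.numpy as jnp

def f(x, y):
    return jnp.array([x * y, x + y, jnp.sin(x)])

x, y = 0.8, -1.2
vx, vy = 1.0, 0.5

# One forward pass produces the full output tangent vector.
_, tangent = jax.jvp(f, (x, y), (vx, vy))

expected = jnp.array([y * vx + x * vy, vx + vy, jnp.cos(x) * vx])
print(jnp.allclose(tangent, expected))  # True
```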

JVPs without explicit Jacobians

The important point is that forward mode never forms the Jacobian matrix explicitly.

For a large system, the Jacobian may be enormous. Suppose

f : \mathbb{R}^{10^6} \to \mathbb{R}^{10^6}.

The full Jacobian contains 10^{12} entries. Explicit storage is often impossible.

Forward mode avoids this cost. It computes

J_f(x)v

directly by propagating one tangent vector through the computation graph.

This is especially valuable when:

  1. Only directional derivatives are needed.
  2. The Jacobian is sparse or implicit.
  3. Forming the full matrix would be too expensive.
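
A sketch of this matrix-free pattern, using a hypothetical elementwise-coupled map on 10^6 variables (the specific f is invented for illustration):

```python
import jax
import jax.numpy as jnp

# Hypothetical large system with 10^6 inputs and outputs. Its full
# Jacobian would hold 10^12 entries, but the JVP below never forms it.
def f(x):
    return jnp.tanh(x) + 0.1 * jnp.roll(x, 1)

n = 1_000_000
x = jnp.ones(n)
v = jnp.zeros(n).at[0].set(1.0)   # perturb a single coordinate

# One forward pass, at roughly the cost of evaluating f itself.
_, jv = jax.jvp(f, (x,), (v,))
```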

Computational complexity

Suppose the primal function evaluation costs C.

A forward-mode JVP typically costs approximately

O(C)

up to a small constant factor.

The tangent computation follows the same graph as the primal computation. Each primitive performs some extra local derivative work, but the asymptotic complexity is usually unchanged.

Computing the full Jacobian is more expensive.

For

f : \mathbb{R}^n \to \mathbb{R}^m,

one forward pass computes one JVP. To recover the full Jacobian, we usually evaluate:

J_f(x)e_1, \quad J_f(x)e_2, \quad \ldots, \quad J_f(x)e_n,

where e_i are the standard basis vectors.

Thus full Jacobian construction requires approximately n forward passes.
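
In code, this is a loop over basis seeds. The sketch below uses an illustrative f and checks the result against `jax.jacfwd`, which is built on the same idea:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], x[0] + x[1], jnp.sin(x[0])])

x = jnp.array([0.8, -1.2])
n = x.shape[0]

# One JVP per basis vector e_i recovers column i of the Jacobian.
columns = [jax.jvp(f, (x,), (jnp.eye(n)[i],))[1] for i in range(n)]
J = jnp.stack(columns, axis=1)

print(jnp.allclose(J, jax.jacfwd(f)(x)))  # True
```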

Forward mode is therefore efficient when:

n \ll m

or when only a few directional derivatives are required.

Matrix view of tangent propagation

Each intermediate variable has a tangent:

\dot{v}.

If the primitive operation is

z = \phi(x_1, \ldots, x_k),

then

\dot{z} = \sum_i \frac{\partial \phi}{\partial x_i} \dot{x}_i.

This is exactly multiplication by the local Jacobian of the primitive.

The entire computation graph therefore performs repeated local matrix-vector multiplications:

v \mapsto J_1 v \mapsto J_2 J_1 v \mapsto \cdots \mapsto J_f v.

Forward mode composes these local linear maps incrementally during execution.

Relation to the chain rule

Suppose

f(x) = h(g(x)).

Then

J_f(x) = J_h(g(x))\,J_g(x).

Apply this Jacobian to a vector v:

J_f(x)v = J_h(g(x))\,(J_g(x)v).

Forward mode computes exactly this sequence:

  1. Push v through g.
  2. Push the resulting tangent through h.

The tangent vector flows forward through the composed computation.

This is the operational form of the chain rule.
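
The two-step structure can be observed directly; in the JAX sketch below, g and h are arbitrary illustrative functions:

```python
import jax
import jax.numpy as jnp

g = jnp.sin                  # inner function
h = lambda u: u ** 2 + u     # outer function

x, v = 0.5, 1.0

# Composed JVP in one forward pass.
_, jv_composed = jax.jvp(lambda t: h(g(t)), (x,), (v,))

# The same result, staged: push v through g, then the tangent through h.
u, u_dot = jax.jvp(g, (x,), (v,))
_, jv_staged = jax.jvp(h, (u,), (u_dot,))

print(jnp.allclose(jv_composed, jv_staged))  # True
```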

Basis seeding

To compute a specific partial derivative, choose a basis direction.

For

f : \mathbb{R}^3 \to \mathbb{R},

suppose we want

\frac{\partial f}{\partial x_2}.

Use seed:

v = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}.

Then

J_f(x)v = \frac{\partial f}{\partial x_2}.

More generally:

| Seed vector | Result                 |
| ----------- | ---------------------- |
| e_1         | first Jacobian column  |
| e_2         | second Jacobian column |
| e_i         | i-th Jacobian column   |
| arbitrary v | directional derivative |

Thus the seed determines the derivative query.
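
For example, seeding with e_2 in a small JAX sketch (the function is an arbitrary stand-in for f : R^3 -> R):

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0] * x[1] + jnp.exp(x[2])   # illustrative f : R^3 -> R

x = jnp.array([1.0, 2.0, 3.0])
e2 = jnp.array([0.0, 1.0, 0.0])          # basis seed selecting x_2

# The JVP with seed e2 is the partial derivative df/dx_2 = x_1 = 1.0.
_, df_dx2 = jax.jvp(f, (x,), (e2,))
print(df_dx2)  # 1.0
```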

Multiple directions simultaneously

Forward mode can propagate several tangent directions at once.

Instead of scalar tangents,

\dot{x}_i \in \mathbb{R},

use vector-valued tangents:

\dot{x}_i \in \mathbb{R}^k.

Each variable now carries k tangent components.

The output becomes

J_f(x)V,

where

V \in \mathbb{R}^{n \times k}.

This computes k JVPs simultaneously.

If

V = I_n,

the identity matrix, then

J_f(x)V = J_f(x),

so the full Jacobian is recovered in one vectorized pass. However, this may require large tangent storage and substantial arithmetic overhead.
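
One way to realize this in JAX is to vectorize a JVP over the columns of V with `jax.vmap`; the sketch below seeds with the identity and recovers the full Jacobian:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[0]) + x[1]])

x = jnp.array([0.3, 0.9])
n = x.shape[0]

# Push k = n directions through at once by vectorizing over the seed.
jvp_in_direction = lambda v: jax.jvp(f, (x,), (v,))[1]
JV = jax.vmap(jvp_in_direction)(jnp.eye(n))   # row i is J_f(x) e_i

print(jnp.allclose(JV.T, jax.jacfwd(f)(x)))  # True
```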

JVPs in machine learning

Modern machine learning systems frequently use JVPs.

Applications include:

| Application               | Use of JVP                        |
| ------------------------- | --------------------------------- |
| Sensitivity analysis      | perturbation propagation          |
| Meta-learning             | differentiating parameter updates |
| Implicit layers           | linearized solver differentiation |
| Neural ODEs               | tangent dynamics                  |
| Hessian-vector products   | nested differentiation            |
| Second-order optimization | curvature approximations          |
| Physics simulation        | variational equations             |

Many algorithms only require products with derivatives, not explicit derivative matrices.

This distinction is fundamental in large-scale systems.
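
As one concrete instance from the table, a Hessian-vector product can be formed by nesting a forward-mode JVP over a reverse-mode gradient; the loss below is an arbitrary stand-in:

```python
import jax
import jax.numpy as jnp

def loss(w):
    return jnp.sum(jnp.tanh(w) ** 2)   # illustrative scalar objective

w = jnp.array([0.1, -0.4, 0.7])
v = jnp.array([1.0, 0.0, 0.5])

# Forward-over-reverse: the JVP of the gradient map is H(w) v,
# computed without materializing the Hessian matrix.
_, hvp = jax.jvp(jax.grad(loss), (w,), (v,))
print(jnp.allclose(hvp, jax.hessian(loss)(w) @ v))  # True
```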

JVP versus VJP

Forward mode computes

Jv.

Reverse mode computes

J^\top v.

The reverse-mode product is called a vector-Jacobian product (VJP) or adjoint product.

The two have complementary complexity profiles:

| Mode         | Natural product | Efficient when |
| ------------ | --------------- | -------------- |
| Forward mode | Jv              | few inputs     |
| Reverse mode | J^\top v        | few outputs    |

For scalar-output functions,

f : \mathbb{R}^n \to \mathbb{R},

reverse mode computes the full gradient in one pass, while forward mode needs n passes.

For scalar-input functions,

f : \mathbb{R} \to \mathbb{R}^m,

forward mode computes the full derivative vector in one pass.
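
The asymmetry is visible in code: `jax.jvp` takes a direction in input space, while `jax.vjp` returns an operator that consumes a covector in output space. The function below is illustrative:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], x[0] + x[1], jnp.sin(x[0])])

x = jnp.array([0.8, -1.2])

# Forward mode: J v for a direction v in the input space R^2.
_, jv = jax.jvp(f, (x,), (jnp.array([1.0, 0.5]),))

# Reverse mode: J^T u for a covector u in the output space R^3.
_, vjp_fn = jax.vjp(f, x)
(jtu,) = vjp_fn(jnp.array([1.0, 0.0, 2.0]))
```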

Linearization viewpoint

A JVP can also be viewed as evaluation of the linearized function.

Define the linearization of f at x:

L_x(v) = J_f(x)v.

Forward mode computes

L_x(v)

without materializing L_x as a matrix.

In many systems, the linearized operator is more important than the Jacobian itself. Optimization methods, Krylov solvers, Newton methods, and sensitivity analysis often only require repeated applications of the linearized operator.

Forward mode naturally exposes this operator form.
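
JAX exposes this operator form directly: `jax.linearize` returns the primal output together with L_x as a callable. The function below is an arbitrary example:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] ** 2, x[0] * x[1]])

x = jnp.array([1.0, 2.0])

# One evaluation of f yields L_x as a reusable linear operator.
y, L_x = jax.linearize(f, x)

# Repeated applications, with no Jacobian matrix ever materialized.
print(L_x(jnp.array([1.0, 0.0])))  # first Jacobian column:  [2., 2.]
print(L_x(jnp.array([0.0, 1.0])))  # second Jacobian column: [0., 1.]
```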

Sparse directional propagation

If the seed vector v is sparse, tangent propagation only activates dependent computations.

For example, if

v_i = 0

for most components, many tangent computations remain zero.

This property is useful for:

  • sparse Jacobian estimation,
  • localized sensitivity analysis,
  • block-structured systems,
  • PDE discretizations,
  • graph-based models.

Efficient sparse forward-mode systems exploit this structure to reduce arithmetic and memory cost.
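
A small sketch of this zero structure (the block-separable f is invented for illustration; whether an implementation actually skips the zero arithmetic depends on the system):

```python
import jax
import jax.numpy as jnp

def f(x):
    # The first two outputs depend only on x[0] and x[1].
    return jnp.array([x[0] * x[1], x[0] + x[1], x[2] ** 2])

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([0.0, 0.0, 1.0])   # sparse seed: perturb x[2] only

# Output tangents of components independent of x[2] stay exactly zero.
_, jv = jax.jvp(f, (x,), (v,))
print(jv)  # [0. 0. 6.]
```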

Summary

Forward mode automatic differentiation naturally computes Jacobian-vector products:

J_f(x)v.

A tangent seed vector defines an infinitesimal perturbation direction. Tangent propagation pushes this perturbation through the computation graph using local derivative rules. The resulting output tangent is the directional derivative of the function.

The key property is that forward mode computes JVPs directly, without explicitly forming Jacobian matrices. This makes it effective for directional sensitivity analysis, sparse systems, higher-order methods, and problems where the number of input directions is small.