Jacobians and Hessians

The gradient is enough when a function has many inputs and one scalar output. More general programs need more general derivative objects. Two of the most important are the Jacobian and the Hessian.

The Jacobian stores first-order derivatives of a vector-valued function. The Hessian stores second-order derivatives of a scalar-valued function. Both can be large. In automatic differentiation, we often avoid constructing them explicitly. Instead, we compute products with them.

Jacobian

Let

f : \mathbb{R}^n \to \mathbb{R}^m

with

f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_m(x) \end{bmatrix}

The Jacobian of f at x is the matrix

J_f(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

It has m rows and n columns. Rows correspond to output components. Columns correspond to input components.

The entry in row i, column j measures how output f_i changes when input x_j changes.

Jacobian as a Linear Map

The Jacobian represents the best local linear approximation to f.

For a small perturbation Δx,

f(x + \Delta x) \approx f(x) + J_f(x) \Delta x

The Jacobian maps input perturbations to output perturbations:

\Delta y \approx J_f(x) \Delta x

This is the central meaning of the Jacobian. It is not just a table of partial derivatives. It is the linear operator that approximates the function near one point.

Example

Consider

f(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ x_1 + \sin x_2 \end{bmatrix}

Here, n = 2 and m = 2.

The component functions are

f_1(x_1, x_2) = x_1 x_2
f_2(x_1, x_2) = x_1 + \sin x_2

The Jacobian is

J_f(x) = \begin{bmatrix} x_2 & x_1 \\ 1 & \cos x_2 \end{bmatrix}

At x = (2, 0),

J_f(2, 0) = \begin{bmatrix} 0 & 2 \\ 1 & 1 \end{bmatrix}

If the input changes by

\Delta x = \begin{bmatrix} 0.01 \\ 0.02 \end{bmatrix}

then the predicted output change is

\Delta y \approx \begin{bmatrix} 0 & 2 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 0.01 \\ 0.02 \end{bmatrix} = \begin{bmatrix} 0.04 \\ 0.03 \end{bmatrix}

This approximation becomes more accurate as Δx becomes smaller.
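As a quick numerical check of this example, here is a minimal sketch using JAX (chosen here only as a convenient AD library; the section itself is library-agnostic):

```python
# Check the worked Jacobian example and the linear prediction of the output change.
import jax
import jax.numpy as jnp

def f(x):
    # f(x1, x2) = [x1 * x2, x1 + sin(x2)]
    return jnp.array([x[0] * x[1], x[0] + jnp.sin(x[1])])

x = jnp.array([2.0, 0.0])
J = jax.jacobian(f)(x)          # [[0., 2.], [1., 1.]]

dx = jnp.array([0.01, 0.02])
dy_linear = J @ dx              # predicted change: [0.04, 0.03]
dy_exact = f(x + dx) - f(x)     # actual change, close to the prediction
print(J, dy_linear, dy_exact)
```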

Computing a Full Jacobian

A full Jacobian has mn entries. For small m and n, constructing it directly is reasonable. For large models, it can be infeasible.

There are two common ways to build a full Jacobian with AD.

Forward mode can compute one column at a time. Seed the input tangent with a basis vector e_j. The resulting output tangent is

J_f(x) e_j

which is column j of the Jacobian. Repeating this for all n input coordinates gives the full matrix.

Reverse mode can compute one row at a time. Seed the output adjoint with a basis vector e_i. The resulting input adjoint is

e_i^T J_f(x)

which is row i of the Jacobian. Repeating this for all m output coordinates gives the full matrix.

| Method | One pass gives | Number of passes for full Jacobian |
| --- | --- | --- |
| Forward mode | one column | n |
| Reverse mode | one row | m |

So the shape of the mapping matters.

For

f : \mathbb{R}^{1000000} \to \mathbb{R}

reverse mode is usually preferred because one reverse pass gives the full gradient.

For

f : \mathbb{R} \to \mathbb{R}^{1000000}

forward mode is usually preferred because one forward pass gives all output sensitivities with respect to the single input.
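The column-by-column and row-by-row constructions described above can be written directly in terms of JVP and VJP primitives. A sketch in JAX, reusing the small 2-input, 2-output example from earlier (jax.jacfwd and jax.jacrev wrap exactly these loops):

```python
# Build a full Jacobian one column at a time (forward mode) and one row at a
# time (reverse mode), then compare with the built-in transforms.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], x[0] + jnp.sin(x[1])])

x = jnp.array([2.0, 0.0])
n, m = 2, 2

# Forward mode: seed the tangent with each basis vector e_j -> column j.
cols = [jax.jvp(f, (x,), (jnp.eye(n)[j],))[1] for j in range(n)]
J_from_columns = jnp.stack(cols, axis=1)

# Reverse mode: seed the adjoint with each basis vector e_i -> row i.
_, vjp_fn = jax.vjp(f, x)
rows = [vjp_fn(jnp.eye(m)[i])[0] for i in range(m)]
J_from_rows = jnp.stack(rows, axis=0)

assert jnp.allclose(J_from_columns, J_from_rows)
assert jnp.allclose(J_from_columns, jax.jacfwd(f)(x))
assert jnp.allclose(J_from_rows, jax.jacrev(f)(x))
```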

Jacobian-Vector Products

Many algorithms need the action of a Jacobian on a vector, not the full Jacobian.

A Jacobian-vector product has the form

J_f(x) v

where

v \in \mathbb{R}^n

This gives the output perturbation caused by input perturbation v.

Forward mode computes this directly. It propagates a tangent vector alongside the ordinary value.

If

y = f(x)

and the input perturbation is v, then forward mode computes

\dot{y} = J_f(x) v

This is often much cheaper than building J_f(x).
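A minimal JVP sketch in JAX, reusing the earlier example:

```python
# Forward mode propagates a tangent v alongside the value: one pass gives J_f(x) @ v.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], x[0] + jnp.sin(x[1])])

x = jnp.array([2.0, 0.0])
v = jnp.array([0.01, 0.02])

y, y_dot = jax.jvp(f, (x,), (v,))   # y = f(x), y_dot = J_f(x) @ v
print(y, y_dot)                     # y_dot ≈ [0.04, 0.03]
```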

Vector-Jacobian Products

Reverse mode naturally computes vector-Jacobian products:

u^T J_f(x)

where

u \in \mathbb{R}^m

This gives the sensitivity of the weighted output

u^T f(x)

with respect to the input.

If f is scalar-valued, then m = 1. With seed u = 1, reverse mode computes

J_f(x)

as a row vector, which corresponds to the gradient.

This is the basis of backpropagation. A scalar loss is differentiated with respect to many parameters using one reverse pass.
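A minimal VJP sketch in JAX, including the scalar case where the seed u = 1 recovers the gradient:

```python
# Reverse mode pulls an output adjoint u back to the input: one pass gives u^T J_f(x).
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], x[0] + jnp.sin(x[1])])

x = jnp.array([2.0, 0.0])
u = jnp.array([1.0, 0.0])

y, vjp_fn = jax.vjp(f, x)
(u_J,) = vjp_fn(u)          # u^T J_f(x): sensitivity of u . f(x) w.r.t. x
print(u_J)                  # here, row 0 of the Jacobian: [0., 2.]

# Scalar case: with seed u = 1, the VJP of a scalar function is its gradient.
def loss(x):
    return jnp.sum(x ** 2)

_, vjp_loss = jax.vjp(loss, x)
(g,) = vjp_loss(1.0)
assert jnp.allclose(g, jax.grad(loss)(x))
```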

Hessian

The Hessian is the matrix of second derivatives for a scalar-valued function.

Let

f : \mathbb{R}^n \to \mathbb{R}

The Hessian is

H_f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}

If the second partial derivatives are continuous, the Hessian is symmetric:

\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}

The Hessian describes curvature. The gradient gives the best local linear approximation. The Hessian gives the second-order correction.

Second-Order Approximation

Near a point x, a scalar function can be approximated by

f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T H_f(x) \Delta x

The three terms have distinct meanings.

| Term | Meaning |
| --- | --- |
| f(x) | value at the base point |
| \nabla f(x)^T \Delta x | first-order linear change |
| \frac{1}{2} \Delta x^T H_f(x) \Delta x | second-order curvature correction |

Second-order methods use this model to choose better update steps than first-order gradient descent. Newton’s method, quasi-Newton methods, trust-region methods, and many sensitivity methods depend on curvature information.
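To see the quadratic model at work, here is a small sketch assuming JAX; the objective function is an arbitrary illustrative choice, not anything from the text above:

```python
# Compare the first-order and second-order Taylor models against the true value.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 4) + x[0] * x[1]

x = jnp.array([1.0, -2.0])
dx = jnp.array([0.05, 0.03])

g = jax.grad(f)(x)
H = jax.hessian(f)(x)

first_order = f(x) + g @ dx
second_order = first_order + 0.5 * dx @ H @ dx
print(f(x + dx), first_order, second_order)  # the second-order model is closer
```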

Example Hessian

Consider

f(x_1, x_2) = x_1^2 x_2 + \sin x_2

The gradient is

\nabla f(x) = \begin{bmatrix} 2 x_1 x_2 \\ x_1^2 + \cos x_2 \end{bmatrix}

The Hessian is

H_f(x) = \begin{bmatrix} 2 x_2 & 2 x_1 \\ 2 x_1 & -\sin x_2 \end{bmatrix}

The off-diagonal entries match because the function is smooth.
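A sketch that checks the hand-derived gradient and Hessian against AD, assuming JAX:

```python
# Verify the symbolic gradient and Hessian of f(x1, x2) = x1^2 * x2 + sin(x2).
import jax
import jax.numpy as jnp

def f(x):
    return x[0] ** 2 * x[1] + jnp.sin(x[1])

x = jnp.array([1.5, 0.3])

g = jax.grad(f)(x)
H = jax.hessian(f)(x)

g_by_hand = jnp.array([2 * x[0] * x[1], x[0] ** 2 + jnp.cos(x[1])])
H_by_hand = jnp.array([[2 * x[1], 2 * x[0]],
                       [2 * x[0], -jnp.sin(x[1])]])

assert jnp.allclose(g, g_by_hand)
assert jnp.allclose(H, H_by_hand)
```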

Hessian-Vector Products

A full Hessian has n^2 entries. For large n, this is often impossible to store.

Many algorithms only need a Hessian-vector product:

H_f(x) v

This measures how the gradient changes in direction v.

There is a useful identity:

H_f(x) v = \frac{d}{d\epsilon} \nabla f(x + \epsilon v) \bigg|_{\epsilon = 0}

So a Hessian-vector product is the directional derivative of the gradient.

AD systems can compute this efficiently by combining modes. One common strategy is forward-over-reverse:

  1. Use reverse mode to define the gradient computation.
  2. Use forward mode through that gradient computation to obtain the directional derivative.

This avoids explicitly forming the Hessian.
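A minimal forward-over-reverse sketch, assuming JAX; the helper name hvp is illustrative, not a library function:

```python
# Hessian-vector product via forward mode over the reverse-mode gradient.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 3) + x[0] * x[1]

def hvp(f, x, v):
    # jvp of grad(f): the directional derivative of the gradient in direction v,
    # which by the identity above equals H_f(x) @ v.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([0.1, 0.0, -0.2])

# The Hessian is never formed; compare against the explicit product anyway.
assert jnp.allclose(hvp(f, x, v), jax.hessian(f)(x) @ v)
```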

Hessian as Jacobian of the Gradient

The Hessian is the Jacobian of the gradient.

For

f : \mathbb{R}^n \to \mathbb{R}

the gradient is a vector-valued function:

\nabla f : \mathbb{R}^n \to \mathbb{R}^n

The Jacobian of this gradient function is the Hessian:

J_{\nabla f}(x) = H_f(x)

This viewpoint is useful in AD because higher-order differentiation can be built by differentiating derivative programs.

A first AD transform gives a gradient program. A second AD transform can differentiate that gradient program.
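A sketch of that nesting in JAX: the gradient program is itself differentiated to obtain the Hessian.

```python
# The Hessian as the Jacobian of the gradient, built by composing two AD transforms.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * x ** 2)

x = jnp.array([0.5, -1.0, 2.0])

grad_f = jax.grad(f)             # first transform: a gradient program
H = jax.jacfwd(grad_f)(x)        # second transform: Jacobian of the gradient

assert jnp.allclose(H, jax.hessian(f)(x))
```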

Explicit Matrices vs Products

AD users often ask for gradients, Jacobians, or Hessians. AD systems internally prefer products.

| Desired object | Size | Often computed as |
| --- | --- | --- |
| Gradient | n | reverse mode with scalar seed |
| Full Jacobian | mn | repeated JVPs or repeated VJPs |
| Full Hessian | n^2 | repeated Hessian-vector products |
| JVP | m | forward mode |
| VJP | n | reverse mode |
| HVP | n | mixed-mode AD |

This distinction is important. A full matrix may be conceptually simple but computationally unsuitable. Matrix-free products often give the same information needed by optimization and simulation algorithms at far lower cost.
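As one illustration of the matrix-free style, here is a sketch of a Newton step computed with conjugate gradients and Hessian-vector products only; it assumes JAX, and the convex objective is an arbitrary example:

```python
# Matrix-free Newton step: solve H p = -g with CG, using only Hessian-vector products.
import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

def f(x):
    # Convex objective, so the Hessian is positive definite and CG applies.
    return jnp.sum((x - 1.0) ** 2) + jnp.sum(x ** 4)

x = jnp.ones(1000) * 0.5

g = jax.grad(f)(x)
hvp = lambda v: jax.jvp(jax.grad(f), (x,), (v,))[1]   # H @ v without storing H

p, _ = cg(hvp, -g)          # Newton step from matrix-vector products alone
x_new = x + p
print(jnp.linalg.norm(g), jnp.linalg.norm(jax.grad(f)(x_new)))  # gradient norm drops
```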

Practical Meaning for AD

The Jacobian and Hessian are not special symbolic objects inside most AD systems. They are derivative computations obtained from repeated or nested applications of local rules.

A Jacobian arises from first-order sensitivity of vector outputs with respect to vector inputs. A Hessian arises from second-order sensitivity of a scalar output with respect to vector inputs.

The implementation question is usually:

Which derivative product does the user need?

That question leads directly to the right AD mode. Forward mode gives JVPs. Reverse mode gives VJPs. Mixed-mode AD gives second-order products such as Hessian-vector products. Full Jacobians and Hessians are then built only when their explicit entries are truly needed.