Jacobians and Hessians

The gradient is enough when a function has many inputs and one scalar output. More general programs need more general derivative objects. Two of the most important are the Jacobian and the Hessian.

The Jacobian stores first-order derivatives of a vector-valued function. The Hessian stores second-order derivatives of a scalar-valued function. Both can be large. In automatic differentiation, we often avoid constructing them explicitly. Instead, we compute products with them.

Jacobian

Let

f : \mathbb{R}^n \to \mathbb{R}^m

with

f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_m(x) \end{bmatrix}

The Jacobian of f at x is the matrix

J_f(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}

It has m rows and n columns. Rows correspond to output components. Columns correspond to input components.

The entry in row i, column j measures how output f_i changes when input x_j changes.

Jacobian as a Linear Map

The Jacobian represents the best local linear approximation to f.

For a small perturbation Δx,

f(x + \Delta x) \approx f(x) + J_f(x) \Delta x

The Jacobian maps input perturbations to output perturbations:

\Delta y \approx J_f(x) \Delta x

This is the central meaning of the Jacobian. It is not just a table of partial derivatives. It is the linear operator that approximates the function near one point.

Example

Consider

f(x_1, x_2) = \begin{bmatrix} x_1 x_2 \\ x_1 + \sin x_2 \end{bmatrix}

Here, n = 2 and m = 2.

The component functions are

f_1(x_1, x_2) = x_1 x_2
f_2(x_1, x_2) = x_1 + \sin x_2

The Jacobian is

J_f(x) = \begin{bmatrix} x_2 & x_1 \\ 1 & \cos x_2 \end{bmatrix}

At x = (2, 0),

J_f(2, 0) = \begin{bmatrix} 0 & 2 \\ 1 & 1 \end{bmatrix}

If the input changes by

\Delta x = \begin{bmatrix} 0.01 \\ 0.02 \end{bmatrix}

then the predicted output change is

\Delta y \approx \begin{bmatrix} 0 & 2 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 0.01 \\ 0.02 \end{bmatrix} = \begin{bmatrix} 0.04 \\ 0.03 \end{bmatrix}

This approximation becomes more accurate as Δx becomes smaller.
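As a quick numerical check of this example, here is a minimal sketch using JAX (chosen here only as a convenient AD library; the section itself is library-agnostic):

```python
# Check the worked Jacobian example and the linear prediction of the output change.
import jax
import jax.numpy as jnp

def f(x):
    # f(x1, x2) = [x1 * x2, x1 + sin(x2)]
    return jnp.array([x[0] * x[1], x[0] + jnp.sin(x[1])])

x = jnp.array([2.0, 0.0])
J = jax.jacobian(f)(x)          # [[0., 2.], [1., 1.]]

dx = jnp.array([0.01, 0.02])
dy_linear = J @ dx              # predicted change: [0.04, 0.03]
dy_exact = f(x + dx) - f(x)     # actual change, close to the prediction
print(J, dy_linear, dy_exact)
```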

Computing a Full Jacobian

A full Jacobian has mn entries. For small m and n, constructing it directly is reasonable. For large models, it can be infeasible.

There are two common ways to build a full Jacobian with AD.

Forward mode can compute one column at a time. Seed the input tangent with a basis vector e_j. The resulting output tangent is

J_f(x) e_j

which is column j of the Jacobian. Repeating this for all n input coordinates gives the full matrix.

Reverse mode can compute one row at a time. Seed the output adjoint with a basis vector e_i. The resulting input adjoint is

e_i^T J_f(x)

which is row i of the Jacobian. Repeating this for all m output coordinates gives the full matrix.

| Method | One pass gives | Number of passes for full Jacobian |
| --- | --- | --- |
| Forward mode | one column | n |
| Reverse mode | one row | m |

So the shape of the mapping matters.

For

f : \mathbb{R}^{1000000} \to \mathbb{R}

reverse mode is usually preferred because one reverse pass gives the full gradient.

For

f : \mathbb{R} \to \mathbb{R}^{1000000}

forward mode is usually preferred because one forward pass gives all output sensitivities with respect to the single input.
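The column-by-column and row-by-row constructions described above can be written directly in terms of JVP and VJP primitives. A sketch in JAX, reusing the small 2-input, 2-output example from earlier (jax.jacfwd and jax.jacrev wrap exactly these loops):

```python
# Build a full Jacobian one column at a time (forward mode) and one row at a
# time (reverse mode), then compare with the built-in transforms.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], x[0] + jnp.sin(x[1])])

x = jnp.array([2.0, 0.0])
n, m = 2, 2

# Forward mode: seed the tangent with each basis vector e_j -> column j.
cols = [jax.jvp(f, (x,), (jnp.eye(n)[j],))[1] for j in range(n)]
J_from_columns = jnp.stack(cols, axis=1)

# Reverse mode: seed the adjoint with each basis vector e_i -> row i.
_, vjp_fn = jax.vjp(f, x)
rows = [vjp_fn(jnp.eye(m)[i])[0] for i in range(m)]
J_from_rows = jnp.stack(rows, axis=0)

assert jnp.allclose(J_from_columns, J_from_rows)
assert jnp.allclose(J_from_columns, jax.jacfwd(f)(x))
assert jnp.allclose(J_from_rows, jax.jacrev(f)(x))
```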

Jacobian-Vector Products

Many algorithms need the action of a Jacobian on a vector, not the full Jacobian.

A Jacobian-vector product has the form

J_f(x) v

where

v \in \mathbb{R}^n

This gives the output perturbation caused by input perturbation v.

Forward mode computes this directly. It propagates a tangent vector alongside the ordinary value.

If

y = f(x)

and the input perturbation is v, then forward mode computes

\dot{y} = J_f(x) v

This is often much cheaper than building J_f(x).
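A minimal JVP sketch in JAX, reusing the earlier example:

```python
# Forward mode propagates a tangent v alongside the value: one pass gives J_f(x) @ v.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], x[0] + jnp.sin(x[1])])

x = jnp.array([2.0, 0.0])
v = jnp.array([0.01, 0.02])

y, y_dot = jax.jvp(f, (x,), (v,))   # y = f(x), y_dot = J_f(x) @ v
print(y, y_dot)                     # y_dot ≈ [0.04, 0.03]
```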

Vector-Jacobian Products

Reverse mode naturally computes vector-Jacobian products:

u^T J_f(x)

where

u \in \mathbb{R}^m

This gives the sensitivity of the weighted output

u^T f(x)

with respect to the input.

If f is scalar-valued, then m = 1. With seed u = 1, reverse mode computes

J_f(x)

as a row vector, which corresponds to the gradient.

This is the basis of backpropagation. A scalar loss is differentiated with respect to many parameters using one reverse pass.
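A minimal VJP sketch in JAX, including the scalar case where the seed u = 1 recovers the gradient:

```python
# Reverse mode pulls an output adjoint u back to the input: one pass gives u^T J_f(x).
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], x[0] + jnp.sin(x[1])])

x = jnp.array([2.0, 0.0])
u = jnp.array([1.0, 0.0])

y, vjp_fn = jax.vjp(f, x)
(u_J,) = vjp_fn(u)          # u^T J_f(x): sensitivity of u . f(x) w.r.t. x
print(u_J)                  # here, row 0 of the Jacobian: [0., 2.]

# Scalar case: with seed u = 1, the VJP of a scalar function is its gradient.
def loss(x):
    return jnp.sum(x ** 2)

_, vjp_loss = jax.vjp(loss, x)
(g,) = vjp_loss(1.0)
assert jnp.allclose(g, jax.grad(loss)(x))
```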

Hessian

The Hessian is the matrix of second derivatives for a scalar-valued function.

Let

f : \mathbb{R}^n \to \mathbb{R}

The Hessian is

H_f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}

If the second partial derivatives are continuous, the Hessian is symmetric:

\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}

The Hessian describes curvature. The gradient gives the best local linear approximation. The Hessian gives the second-order correction.

Second-Order Approximation

Near a point x, a scalar function can be approximated by

f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T H_f(x) \Delta x

The three terms have distinct meanings.

| Term | Meaning |
| --- | --- |
| f(x) | value at the base point |
| \nabla f(x)^T \Delta x | first-order linear change |
| \frac{1}{2} \Delta x^T H_f(x) \Delta x | second-order curvature correction |

Second-order methods use this model to choose better update steps than first-order gradient descent. Newton’s method, quasi-Newton methods, trust-region methods, and many sensitivity methods depend on curvature information.
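To see the quadratic model at work, here is a small sketch assuming JAX; the objective function is an arbitrary illustrative choice, not anything from the text above:

```python
# Compare the first-order and second-order Taylor models against the true value.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 4) + x[0] * x[1]

x = jnp.array([1.0, -2.0])
dx = jnp.array([0.05, 0.03])

g = jax.grad(f)(x)
H = jax.hessian(f)(x)

first_order = f(x) + g @ dx
second_order = first_order + 0.5 * dx @ H @ dx
print(f(x + dx), first_order, second_order)  # the second-order model is closer
```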

Example Hessian

Consider

f(x_1, x_2) = x_1^2 x_2 + \sin x_2

The gradient is

\nabla f(x) = \begin{bmatrix} 2 x_1 x_2 \\ x_1^2 + \cos x_2 \end{bmatrix}

The Hessian is

H_f(x) = \begin{bmatrix} 2 x_2 & 2 x_1 \\ 2 x_1 & -\sin x_2 \end{bmatrix}

The off-diagonal entries match because the function is smooth.
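A sketch that checks the hand-derived gradient and Hessian against AD, assuming JAX:

```python
# Verify the symbolic gradient and Hessian of f(x1, x2) = x1^2 * x2 + sin(x2).
import jax
import jax.numpy as jnp

def f(x):
    return x[0] ** 2 * x[1] + jnp.sin(x[1])

x = jnp.array([1.5, 0.3])

g = jax.grad(f)(x)
H = jax.hessian(f)(x)

g_by_hand = jnp.array([2 * x[0] * x[1], x[0] ** 2 + jnp.cos(x[1])])
H_by_hand = jnp.array([[2 * x[1], 2 * x[0]],
                       [2 * x[0], -jnp.sin(x[1])]])

assert jnp.allclose(g, g_by_hand)
assert jnp.allclose(H, H_by_hand)
```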

Hessian-Vector Products

A full Hessian has n^2 entries. For large n, this is often impossible to store.

Many algorithms only need a Hessian-vector product:

H_f(x) v

This measures how the gradient changes in direction v.

There is a useful identity:

H_f(x) v = \frac{d}{d\epsilon} \nabla f(x + \epsilon v) \bigg|_{\epsilon = 0}

So a Hessian-vector product is the directional derivative of the gradient.

AD systems can compute this efficiently by combining modes. One common strategy is forward-over-reverse:

  1. Use reverse mode to define the gradient computation.
  2. Use forward mode through that gradient computation to obtain the directional derivative.

This avoids explicitly forming the Hessian.
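A minimal forward-over-reverse sketch, assuming JAX; the helper name hvp is illustrative, not a library function:

```python
# Hessian-vector product via forward mode over the reverse-mode gradient.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 3) + x[0] * x[1]

def hvp(f, x, v):
    # jvp of grad(f): the directional derivative of the gradient in direction v,
    # which by the identity above equals H_f(x) @ v.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([0.1, 0.0, -0.2])

# The Hessian is never formed; compare against the explicit product anyway.
assert jnp.allclose(hvp(f, x, v), jax.hessian(f)(x) @ v)
```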

Hessian as Jacobian of the Gradient

The Hessian is the Jacobian of the gradient.

For

f : \mathbb{R}^n \to \mathbb{R}

the gradient is a vector-valued function:

\nabla f : \mathbb{R}^n \to \mathbb{R}^n

The Jacobian of this gradient function is the Hessian:

J_{\nabla f}(x) = H_f(x)

This viewpoint is useful in AD because higher-order differentiation can be built by differentiating derivative programs.

A first AD transform gives a gradient program. A second AD transform can differentiate that gradient program.
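A sketch of that nesting in JAX: the gradient program is itself differentiated to obtain the Hessian.

```python
# The Hessian as the Jacobian of the gradient, built by composing two AD transforms.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * x ** 2)

x = jnp.array([0.5, -1.0, 2.0])

grad_f = jax.grad(f)             # first transform: a gradient program
H = jax.jacfwd(grad_f)(x)        # second transform: Jacobian of the gradient

assert jnp.allclose(H, jax.hessian(f)(x))
```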

Explicit Matrices vs Products

AD users often ask for gradients, Jacobians, or Hessians. AD systems internally prefer products.

| Desired object | Size | Often computed as |
| --- | --- | --- |
| Gradient | n | reverse mode with scalar seed |
| Full Jacobian | mn | repeated JVPs or repeated VJPs |
| Full Hessian | n^2 | repeated Hessian-vector products |
| JVP | m | forward mode |
| VJP | n | reverse mode |
| HVP | n | mixed-mode AD |

This distinction is important. A full matrix may be conceptually simple but computationally unsuitable. Matrix-free products often give the same information needed by optimization and simulation algorithms at far lower cost.
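As one illustration of the matrix-free style, here is a sketch of a Newton step computed with conjugate gradients and Hessian-vector products only; it assumes JAX, and the convex objective is an arbitrary example:

```python
# Matrix-free Newton step: solve H p = -g with CG, using only Hessian-vector products.
import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

def f(x):
    # Convex objective, so the Hessian is positive definite and CG applies.
    return jnp.sum((x - 1.0) ** 2) + jnp.sum(x ** 4)

x = jnp.ones(1000) * 0.5

g = jax.grad(f)(x)
hvp = lambda v: jax.jvp(jax.grad(f), (x,), (v,))[1]   # H @ v without storing H

p, _ = cg(hvp, -g)          # Newton step from matrix-vector products alone
x_new = x + p
print(jnp.linalg.norm(g), jnp.linalg.norm(jax.grad(f)(x_new)))  # gradient norm drops
```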

Practical Meaning for AD

The Jacobian and Hessian are not special symbolic objects inside most AD systems. They are derivative computations obtained from repeated or nested applications of local rules.

A Jacobian arises from first-order sensitivity of vector outputs with respect to vector inputs. A Hessian arises from second-order sensitivity of a scalar output with respect to vector inputs.

The implementation question is usually:

Which derivative product does the user need?

That question leads directly to the right AD mode. Forward mode gives JVPs. Reverse mode gives VJPs. Mixed-mode AD gives second-order products such as Hessian-vector products. Full Jacobians and Hessians are then built only when their explicit entries are truly needed.