Jacobians and Hessians

Gradients are enough for most neural network training. A gradient tells us how a scalar loss changes with respect to parameters. Some problems require a more detailed view of derivatives. Jacobians describe first derivatives of vector-valued functions. Hessians describe second derivatives of scalar-valued functions.

These objects are central to optimization theory, sensitivity analysis, uncertainty estimation, curvature-aware training, and some advanced methods in meta-learning and scientific machine learning.

From Derivatives to Jacobians

For a scalar function

y = f(x),

the derivative is a single number:

\frac{dy}{dx}.

For a vector-valued function

y = f(x),

where

x\in\mathbb{R}^{n}, \quad y\in\mathbb{R}^{m},

each output component may depend on each input component.

Write

y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.

The Jacobian is the matrix of all first partial derivatives:

J_f(x) = \frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}.

Thus

J_f(x)\in\mathbb{R}^{m\times n}.

The row index corresponds to an output component. The column index corresponds to an input component.

A Simple Jacobian Example

Consider the function

f(x_1,x_2) = \begin{bmatrix} x_1 + x_2 \\ x_1 x_2 \\ x_1^2 \end{bmatrix}.

Here the input has dimension 2 and the output has dimension 3. Therefore the Jacobian has shape 3×2.

The partial derivatives are:

\frac{\partial y_1}{\partial x_1}=1, \quad \frac{\partial y_1}{\partial x_2}=1,
\frac{\partial y_2}{\partial x_1}=x_2, \quad \frac{\partial y_2}{\partial x_2}=x_1,
\frac{\partial y_3}{\partial x_1}=2x_1, \quad \frac{\partial y_3}{\partial x_2}=0.

So

J_f(x) = \begin{bmatrix} 1 & 1 \\ x_2 & x_1 \\ 2x_1 & 0 \end{bmatrix}.

At x_1 = 2, x_2 = 3,

J_f(x) = \begin{bmatrix} 1 & 1 \\ 3 & 2 \\ 4 & 0 \end{bmatrix}.

In PyTorch:

import torch

def f(x):
    x1, x2 = x[0], x[1]
    return torch.stack([
        x1 + x2,
        x1 * x2,
        x1 ** 2,
    ])

x = torch.tensor([2.0, 3.0])
J = torch.autograd.functional.jacobian(f, x)

print(J)
print(J.shape)  # torch.Size([3, 2])

The result matches the analytic Jacobian.

Jacobians in Neural Networks

A neural network maps inputs to outputs:

\hat{y} = f_\theta(x).

If

x\in\mathbb{R}^{d}, \quad \hat{y}\in\mathbb{R}^{K},

then the input-output Jacobian is

J_x = \frac{\partial \hat{y}}{\partial x} \in\mathbb{R}^{K\times d}.

This matrix describes how each output component responds to changes in each input component.

For an image classifier, d may be very large. A 224×224 RGB image has

d = 3\cdot 224\cdot 224 = 150{,}528

input values. If the model has 1000 output classes, then the input-output Jacobian has more than 150 million entries for one image.

This is one reason full Jacobians are rarely materialized in standard training. Instead, deep learning systems compute products involving Jacobians.
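On something small, however, the shapes can be seen directly. Here is a minimal sketch with a tiny stand-in network (the layer sizes are invented for illustration):

import torch
from torch import nn

# A deliberately tiny stand-in for a classifier; sizes are illustrative.
model = nn.Sequential(nn.Linear(6, 16), nn.Tanh(), nn.Linear(16, 4))
x = torch.randn(6)

# Full input-output Jacobian: shape (K, d) = (4, 6).
J = torch.autograd.functional.jacobian(model, x)
print(J.shape)  # torch.Size([4, 6])

Even at this scale the Jacobian has K · d entries, and the count grows multiplicatively with the input and output sizes.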

Vector-Jacobian Products

Reverse-mode differentiation computes vector-Jacobian products.

Suppose

y = f(x), \quad y\in\mathbb{R}^{m}, \quad x\in\mathbb{R}^{n}.

Let a scalar loss L depend on y. The upstream gradient is

\bar{y} = \frac{\partial L}{\partial y} \in\mathbb{R}^{m}.

The gradient with respect to xx is

\bar{x} = J_f(x)^\top \bar{y}.

This product gives the sensitivity of the loss with respect to the input. It avoids constructing the full Jacobian.

In PyTorch, this is what backward() computes:

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)

y = torch.stack([
    x[0] + x[1],
    x[0] * x[1],
    x[0] ** 2,
])

v = torch.tensor([1.0, 10.0, 100.0])

y.backward(v)

print(x.grad)

Here v is the upstream gradient \bar{y}. PyTorch computes

J^\top v.

For the Jacobian

J = \begin{bmatrix} 1 & 1 \\ 3 & 2 \\ 4 & 0 \end{bmatrix},

and

v = \begin{bmatrix} 1 \\ 10 \\ 100 \end{bmatrix},

we get

J^\top v = \begin{bmatrix} 1 & 3 & 4 \\ 1 & 2 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 10 \\ 100 \end{bmatrix} = \begin{bmatrix} 431 \\ 21 \end{bmatrix}.

The printed gradient is therefore [431., 21.].
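Because this example is tiny, we can also materialize the Jacobian and verify the product directly; a quick sketch:

import torch
from torch.autograd.functional import jacobian

def f(x):
    return torch.stack([x[0] + x[1], x[0] * x[1], x[0] ** 2])

x = torch.tensor([2.0, 3.0])
v = torch.tensor([1.0, 10.0, 100.0])

# Explicit J^T v agrees with the x.grad computed by backward() above.
J = jacobian(f, x)
print(J.T @ v)  # tensor([431.,  21.])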

Jacobian-Vector Products

Forward-mode differentiation computes Jacobian-vector products.

Given

J_f(x)\in\mathbb{R}^{m\times n},

and a vector

v\in\mathbb{R}^{n},

the Jacobian-vector product is

J_f(x)v.

It measures how the output changes when the input moves in direction v.

PyTorch exposes forward-mode differentiation through torch.func and related APIs. A simple example:

import torch
from torch.func import jvp

def f(x):
    return torch.stack([
        x[0] + x[1],
        x[0] * x[1],
        x[0] ** 2,
    ])

x = torch.tensor([2.0, 3.0])
v = torch.tensor([1.0, 0.0])

y, jvp_value = jvp(f, (x,), (v,))

print(y)
print(jvp_value)

Here v = [1, 0] asks how the output changes when only x_1 changes. The result is the first column of the Jacobian:

\begin{bmatrix} 1 \\ 3 \\ 4 \end{bmatrix}.
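Sweeping v over the standard basis vectors recovers the Jacobian one column at a time, which is how forward mode would reconstruct the full matrix; a small sketch:

import torch
from torch.func import jvp

def f(x):
    return torch.stack([x[0] + x[1], x[0] * x[1], x[0] ** 2])

x = torch.tensor([2.0, 3.0])

# One jvp per basis vector yields one Jacobian column.
columns = [jvp(f, (x,), (e,))[1] for e in torch.eye(2)]
print(torch.stack(columns, dim=1))  # matches the 3x2 Jacobian above

This also shows the cost trade-off: forward mode needs n passes for the full Jacobian, while reverse mode needs m.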

Why Products Matter More Than Matrices

The Jacobian can be too large to store. For modern models, even a single layer can have enormous derivative matrices.

Backpropagation avoids this problem. Instead of storing full Jacobians, it applies local vector-Jacobian products. Each operation receives an upstream gradient and returns gradients for its inputs.

For example, a linear layer

y=Wx y = Wx

has Jacobian with respect to x:

J_x = W.

The backward pass does not need to store a separate Jacobian object. It computes

\bar{x} = W^\top \bar{y}.

For batched matrix multiplication, convolution, attention, and normalization, PyTorch uses efficient backward kernels that compute the required products directly.
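A quick sketch checking this identity for a random linear map:

import torch

W = torch.randn(3, 2)
x = torch.randn(2, requires_grad=True)
y_bar = torch.randn(3)  # upstream gradient

y = W @ x
y.backward(y_bar)

# The backward pass computes exactly W^T y_bar.
print(torch.allclose(x.grad, W.T @ y_bar))  # True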

Hessians

A Hessian is a matrix of second derivatives. For a scalar function

L = f(x), \quad x\in\mathbb{R}^{n},

the Hessian is

H_f(x) = \nabla_x^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n} \\ \frac{\partial^2 f}{\partial x_2\partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n\partial x_1} & \frac{\partial^2 f}{\partial x_n\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}.

The Hessian describes curvature. While the gradient gives the direction of steepest local increase, the Hessian describes how the gradient itself changes.

A Simple Hessian Example

Consider

f(x_1,x_2) = x_1^2 + x_1 x_2 + 3x_2^2.

The first derivatives are

\frac{\partial f}{\partial x_1}=2x_1+x_2,
\frac{\partial f}{\partial x_2}=x_1+6x_2.

The second derivatives are

\frac{\partial^2 f}{\partial x_1^2}=2, \quad \frac{\partial^2 f}{\partial x_1\partial x_2}=1, \quad \frac{\partial^2 f}{\partial x_2\partial x_1}=1, \quad \frac{\partial^2 f}{\partial x_2^2}=6.

So

H_f(x) = \begin{bmatrix} 2 & 1 \\ 1 & 6 \end{bmatrix}.

In PyTorch:

import torch

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

x = torch.tensor([1.0, 2.0])
H = torch.autograd.functional.hessian(f, x)

print(H)

The Hessian is constant for this quadratic function.

Hessians and Optimization

The Hessian appears in second-order optimization. Around a point xx, a smooth scalar function can be approximated by a second-order Taylor expansion:

f(x+\Delta) \approx f(x) + \nabla f(x)^\top \Delta + \frac{1}{2}\Delta^\top H_f(x)\Delta.

The gradient term describes slope. The Hessian term describes curvature.

Newton’s method uses the Hessian to choose an update direction:

\Delta = -H^{-1}\nabla f(x).

This can converge rapidly for some problems. For deep learning, full Newton methods are usually impractical because the Hessian is too large.
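For the small quadratic from the earlier Hessian example, a single Newton step can be checked directly, since the second-order Taylor expansion of a quadratic is exact; a minimal sketch:

import torch
from torch.autograd.functional import hessian

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

x = torch.tensor([1.0, 2.0], requires_grad=True)
g = torch.autograd.grad(f(x), x)[0]
H = hessian(f, x.detach())

# One Newton step; for this quadratic it lands exactly on the
# minimizer at the origin.
step = -torch.linalg.solve(H, g)
print(x.detach() + step)  # tensor([0., 0.]) up to float error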

If a model has p parameters, the Hessian of the loss with respect to parameters has shape

p\times p.

For p = 10^8, this matrix would have 10^16 entries. Storing it directly is impossible on ordinary hardware.

Hessian-Vector Products

As with Jacobians, products are often more useful than full matrices.

A Hessian-vector product has the form

Hv.

It can be computed without explicitly forming H. This is useful in curvature estimation, conjugate gradient methods, influence functions, and some meta-learning algorithms.

In PyTorch, one way to compute a Hessian-vector product is to differentiate a gradient-vector product:

import torch

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

x = torch.tensor([1.0, 2.0], requires_grad=True)
v = torch.tensor([1.0, 0.5])

loss = f(x)
grad = torch.autograd.grad(loss, x, create_graph=True)[0]

grad_dot_v = (grad * v).sum()
hvp = torch.autograd.grad(grad_dot_v, x)[0]

print(hvp)

For

H = \begin{bmatrix} 2 & 1 \\ 1 & 6 \end{bmatrix}

and

v = \begin{bmatrix} 1 \\ 0.5 \end{bmatrix},

we get

Hv = \begin{bmatrix} 2.5 \\ 4 \end{bmatrix}.
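The same product can also be computed with torch.func by composing forward mode over reverse mode, which avoids writing the double-backward by hand; a sketch:

import torch
from torch.func import grad, jvp

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

x = torch.tensor([1.0, 2.0])
v = torch.tensor([1.0, 0.5])

# Forward-over-reverse: push the direction v through the gradient map.
_, hvp = jvp(grad(f), (x,), (v,))
print(hvp)  # tensor([2.5000, 4.0000])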

Higher-Order Derivatives in PyTorch

PyTorch can compute higher-order derivatives when the backward computation itself is recorded as a differentiable graph.

The key argument is create_graph=True.

import torch

x = torch.tensor(3.0, requires_grad=True)

y = x ** 3
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]

d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]

print(dy_dx)     # tensor(27., grad_fn=<MulBackward0>)
print(d2y_dx2)   # tensor(18.)

Since

y = x^3,

we have

\frac{dy}{dx} = 3x^2,

and

\frac{d^2y}{dx^2} = 6x.

At x = 3, these are 27 and 18.

Higher-order derivatives use more memory and computation. They should be used when the algorithm explicitly needs them.
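One common pattern that needs create_graph during training is a gradient penalty, where the loss itself contains a gradient norm. A minimal sketch of the idea (the toy loss and penalty weight here are invented for illustration):

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
loss = (x ** 2).sum()  # toy scalar loss

# Keep the graph so the penalty itself can be differentiated.
g = torch.autograd.grad(loss, x, create_graph=True)[0]
penalty = (g ** 2).sum()

total = loss + 0.1 * penalty
total.backward()
print(x.grad)  # gradient flows through both terms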

Jacobians, Hessians, and Batches

Batches add another layer of shape complexity.

Suppose a model maps a batch

X\in\mathbb{R}^{B\times d}

to outputs

Y\in\mathbb{R}^{B\times K}.

The full Jacobian of YY with respect to XX would have shape

B\times K\times B\times d.

This includes derivatives between every output example and every input example. In ordinary feedforward models, examples in the same batch are independent, so most cross-example derivatives are zero. However, some operations, such as batch normalization, can couple examples in a batch.

For this reason, per-sample gradients and per-example Jacobians require care. PyTorch’s torch.func.vmap can help compute such quantities efficiently.

Example pattern:

import torch
from torch import nn
from torch.func import jacrev, vmap

model = nn.Linear(4, 3)  # small illustrative model (sizes invented)
X = torch.randn(8, 4)    # batch of 8 examples

def model_single(x):
    # Handle one example at a time; vmap supplies the batching.
    return model(x.unsqueeze(0)).squeeze(0)

per_example_jacobian = vmap(jacrev(model_single))(X)
print(per_example_jacobian.shape)  # torch.Size([8, 3, 4])

This computes a Jacobian for each example rather than one large batch-level Jacobian.

Curvature in Deep Learning

The Hessian helps describe the geometry of the loss surface.

At a local minimum, the gradient is near zero. The Hessian indicates whether the loss curves upward in nearby directions. If the Hessian has large positive eigenvalues, the loss rises sharply in some directions. If it has small eigenvalues, the loss is flat in some directions. If it has negative eigenvalues, the point is not a strict local minimum.

Deep neural networks often have high-dimensional loss surfaces with many flat directions. This makes curvature analysis difficult but useful. Hessian eigenvalues, trace estimates, and sharpness measures are often used to study optimization and generalization.

However, these quantities are expensive and sensitive to parameterization. They should be interpreted carefully.
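As an illustration of the product-based approach, the Hessian trace can be estimated from Hessian-vector products alone. A sketch of the Hutchinson estimator on the small quadratic from earlier, using random ±1 vectors (for which the expectation of vᵀHv equals the trace):

import torch

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

def hvp(x, v):
    g = torch.autograd.grad(f(x), x, create_graph=True)[0]
    return torch.autograd.grad((g * v).sum(), x)[0]

x = torch.tensor([1.0, 2.0], requires_grad=True)

estimates = []
for _ in range(1000):
    v = torch.randint(0, 2, (2,), dtype=torch.float32) * 2 - 1
    estimates.append((v * hvp(x, v)).sum())

print(torch.stack(estimates).mean())  # close to trace(H) = 8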

Practical Use Cases

Full Jacobians and Hessians are uncommon in ordinary training, but derivative products are common.

Jacobians are used in sensitivity analysis, adversarial examples, normalizing flows, implicit layers, neural differential equations, and some regularization methods.

Hessians and Hessian-vector products are used in second-order optimization, meta-learning, influence functions, uncertainty estimation, pruning, and loss-surface analysis.

Most day-to-day model training uses only first-order gradients. Advanced methods use Jacobian-vector products, vector-Jacobian products, and Hessian-vector products to get more information without materializing huge derivative matrices.

Summary

A Jacobian is the matrix of first derivatives for a vector-valued function. A Hessian is the matrix of second derivatives for a scalar-valued function.

Full Jacobians and Hessians are usually too large for modern deep learning models. PyTorch and other autograd systems therefore focus on efficient derivative products: vector-Jacobian products, Jacobian-vector products, and Hessian-vector products.

These tools form the bridge between ordinary backpropagation and more advanced methods in optimization, uncertainty, scientific computing, and interpretability.