# Jacobians and Hessians

Gradients are enough for most neural network training. A gradient tells us how a scalar loss changes with respect to parameters. Some problems require a more detailed view of derivatives. Jacobians describe first derivatives of vector-valued functions. Hessians describe second derivatives of scalar-valued functions.

These objects are central to optimization theory, sensitivity analysis, uncertainty estimation, curvature-aware training, and some advanced methods in meta-learning and scientific machine learning.

### From Derivatives to Jacobians

For a scalar function

$$
y = f(x),
$$

the derivative is a single number:

$$
\frac{dy}{dx}.
$$

For a vector-valued function

$$
y = f(x),
$$

where

$$
x\in\mathbb{R}^{n},
\quad
y\in\mathbb{R}^{m},
$$

each output component may depend on each input component.

Write

$$
y =
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_m
\end{bmatrix},
\quad
x =
\begin{bmatrix}
x_1 \\
x_2 \\
\vdots \\
x_n
\end{bmatrix}.
$$

The Jacobian is the matrix of all first partial derivatives:

$$
J_f(x) =
\frac{\partial y}{\partial x} =
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} &
\frac{\partial y_1}{\partial x_2} &
\cdots &
\frac{\partial y_1}{\partial x_n}
\\
\frac{\partial y_2}{\partial x_1} &
\frac{\partial y_2}{\partial x_2} &
\cdots &
\frac{\partial y_2}{\partial x_n}
\\
\vdots & \vdots & \ddots & \vdots
\\
\frac{\partial y_m}{\partial x_1} &
\frac{\partial y_m}{\partial x_2} &
\cdots &
\frac{\partial y_m}{\partial x_n}
\end{bmatrix}.
$$

Thus

$$
J_f(x)\in\mathbb{R}^{m\times n}.
$$

The row index corresponds to an output component. The column index corresponds to an input component.

### A Simple Jacobian Example

Consider the function

$$
f(x_1,x_2) =
\begin{bmatrix}
x_1 + x_2 \\
x_1x_2 \\
x_1^2
\end{bmatrix}.
$$

Here the input has dimension 2 and the output has dimension 3. Therefore the Jacobian has shape \(3\times 2\).

The partial derivatives are:

$$
\frac{\partial y_1}{\partial x_1}=1,
\quad
\frac{\partial y_1}{\partial x_2}=1,
$$

$$
\frac{\partial y_2}{\partial x_1}=x_2,
\quad
\frac{\partial y_2}{\partial x_2}=x_1,
$$

$$
\frac{\partial y_3}{\partial x_1}=2x_1,
\quad
\frac{\partial y_3}{\partial x_2}=0.
$$

So

$$
J_f(x) =
\begin{bmatrix}
1 & 1 \\
x_2 & x_1 \\
2x_1 & 0
\end{bmatrix}.
$$

At \(x_1=2\), \(x_2=3\),

$$
J_f(x) =
\begin{bmatrix}
1 & 1 \\
3 & 2 \\
4 & 0
\end{bmatrix}.
$$

In PyTorch:

```python
import torch

def f(x):
    x1, x2 = x[0], x[1]
    return torch.stack([
        x1 + x2,
        x1 * x2,
        x1 ** 2,
    ])

x = torch.tensor([2.0, 3.0])
J = torch.autograd.functional.jacobian(f, x)

print(J)
print(J.shape)  # torch.Size([3, 2])
```

The result matches the analytic Jacobian.

### Jacobians in Neural Networks

A neural network maps inputs to outputs:

$$
\hat{y}=f_\theta(x).
$$

If

$$
x\in\mathbb{R}^{d},
\quad
\hat{y}\in\mathbb{R}^{K},
$$

then the input-output Jacobian is

$$
J_x =
\frac{\partial \hat{y}}{\partial x}
\in\mathbb{R}^{K\times d}.
$$

This matrix describes how each output component changes when each input component changes.

For an image classifier, \(d\) may be very large. A \(224\times224\) RGB image has

$$
d = 3\cdot224\cdot224 = 150{,}528
$$

input values. If the model has 1000 output classes, then the input-output Jacobian has more than 150 million entries for one image.

This is one reason full Jacobians are rarely materialized in standard training. Instead, deep learning systems compute products involving Jacobians.

### Vector-Jacobian Products

Reverse-mode differentiation computes vector-Jacobian products.

Suppose

$$
y=f(x),
\quad
y\in\mathbb{R}^{m},
\quad
x\in\mathbb{R}^{n}.
$$

Let a scalar loss \(L\) depend on \(y\). The upstream gradient is

$$
\bar{y} =
\frac{\partial L}{\partial y}
\in\mathbb{R}^{m}.
$$

The gradient with respect to \(x\) is

$$
\bar{x} =
J_f(x)^\top \bar{y}.
$$

This product is the gradient of the loss with respect to the input, computed without ever constructing the full Jacobian.

In PyTorch, this is what `backward()` computes:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)

y = torch.stack([
    x[0] + x[1],
    x[0] * x[1],
    x[0] ** 2,
])

v = torch.tensor([1.0, 10.0, 100.0])

y.backward(v)

print(x.grad)
```

Here `v` is the upstream gradient \(\bar{y}\). PyTorch computes

$$
J^\top v.
$$

For the Jacobian

$$
J =
\begin{bmatrix}
1 & 1 \\
3 & 2 \\
4 & 0
\end{bmatrix},
$$

and

$$
v =
\begin{bmatrix}
1 \\
10 \\
100
\end{bmatrix},
$$

we get

$$
J^\top v =
\begin{bmatrix}
1 & 3 & 4 \\
1 & 2 & 0
\end{bmatrix}
\begin{bmatrix}
1 \\
10 \\
100
\end{bmatrix} =
\begin{bmatrix}
431 \\
21
\end{bmatrix}.
$$

The printed gradient is therefore `tensor([431., 21.])`.
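
The same product is available in functional form through `torch.func.vjp`, which returns the output of `f` together with a function that applies \(J^\top\) to any vector:

```python
import torch
from torch.func import vjp

def f(x):
    return torch.stack([
        x[0] + x[1],
        x[0] * x[1],
        x[0] ** 2,
    ])

x = torch.tensor([2.0, 3.0])
v = torch.tensor([1.0, 10.0, 100.0])

# vjp returns the primal output and a closure that computes J^T v.
y, vjp_fn = vjp(f, x)
(x_bar,) = vjp_fn(v)

print(x_bar)  # tensor([431., 21.])
```

Unlike `backward()`, this does not mutate `.grad`, which makes it convenient for composing with other functional transforms.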

### Jacobian-Vector Products

Forward-mode differentiation computes Jacobian-vector products.

Given

$$
J_f(x)\in\mathbb{R}^{m\times n},
$$

and a vector

$$
v\in\mathbb{R}^{n},
$$

the Jacobian-vector product is

$$
J_f(x)v.
$$

It measures how the output changes when the input moves in direction \(v\).

PyTorch supports forward-mode tools through `torch.func` and related APIs. A simple example:

```python
import torch
from torch.func import jvp

def f(x):
    return torch.stack([
        x[0] + x[1],
        x[0] * x[1],
        x[0] ** 2,
    ])

x = torch.tensor([2.0, 3.0])
v = torch.tensor([1.0, 0.0])

y, jvp_value = jvp(f, (x,), (v,))

print(y)
print(jvp_value)
```

Here `v = [1, 0]` asks how the output changes when only \(x_1\) changes. The result is the first column of the Jacobian:

$$
\begin{bmatrix}
1 \\
3 \\
4
\end{bmatrix}.
$$
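
One Jacobian-vector product per standard basis vector recovers the full Jacobian column by column. A small sketch using the same `f`:

```python
import torch
from torch.func import jvp

def f(x):
    return torch.stack([
        x[0] + x[1],
        x[0] * x[1],
        x[0] ** 2,
    ])

x = torch.tensor([2.0, 3.0])

# Each basis direction e_i yields the i-th column of the Jacobian.
columns = [jvp(f, (x,), (e,))[1] for e in torch.eye(2)]
J = torch.stack(columns, dim=1)

print(J)  # matches the analytic Jacobian [[1, 1], [3, 2], [4, 0]]
```

This is essentially what `torch.func.jacfwd` does internally, so for an \(m\times n\) Jacobian with small \(n\), forward mode needs only \(n\) passes.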

### Why Products Matter More Than Matrices

The Jacobian can be too large to store. For modern models, even a single layer can have enormous derivative matrices.

Backpropagation avoids this problem. Instead of storing full Jacobians, it applies local vector-Jacobian products. Each operation receives an upstream gradient and returns gradients for its inputs.

For example, a linear layer

$$
y = Wx
$$

has Jacobian with respect to \(x\):

$$
J_x = W.
$$

The backward pass does not need to store a separate Jacobian object. It computes

$$
\bar{x}=W^\top \bar{y}.
$$

For batched matrix multiplication, convolution, attention, and normalization, PyTorch uses efficient backward kernels that compute the required products directly.
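
As a quick check, autograd's backward pass for a randomly generated linear map agrees with the explicit product \(W^\top\bar{y}\):

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 3)
x = torch.randn(3, requires_grad=True)
y_bar = torch.randn(4)  # upstream gradient

y = W @ x
y.backward(y_bar)

# The accumulated gradient equals W^T y_bar.
print(torch.allclose(x.grad, W.T @ y_bar))  # True
```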

### Hessians

A Hessian is a matrix of second derivatives. For a scalar function

$$
L=f(x),
\quad
x\in\mathbb{R}^{n},
$$

the Hessian is

$$
H_f(x) =
\nabla_x^2 f(x) =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} &
\frac{\partial^2 f}{\partial x_1\partial x_2} &
\cdots &
\frac{\partial^2 f}{\partial x_1\partial x_n}
\\
\frac{\partial^2 f}{\partial x_2\partial x_1} &
\frac{\partial^2 f}{\partial x_2^2} &
\cdots &
\frac{\partial^2 f}{\partial x_2\partial x_n}
\\
\vdots & \vdots & \ddots & \vdots
\\
\frac{\partial^2 f}{\partial x_n\partial x_1} &
\frac{\partial^2 f}{\partial x_n\partial x_2} &
\cdots &
\frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}.
$$

The Hessian describes curvature. While the gradient tells the direction of steepest local increase, the Hessian tells how the gradient itself changes.

### A Simple Hessian Example

Consider

$$
f(x_1,x_2)=x_1^2 + x_1x_2 + 3x_2^2.
$$

The first derivatives are

$$
\frac{\partial f}{\partial x_1}=2x_1+x_2,
$$

$$
\frac{\partial f}{\partial x_2}=x_1+6x_2.
$$

The second derivatives are

$$
\frac{\partial^2 f}{\partial x_1^2}=2,
$$

$$
\frac{\partial^2 f}{\partial x_1\partial x_2}=1,
$$

$$
\frac{\partial^2 f}{\partial x_2\partial x_1}=1,
$$

$$
\frac{\partial^2 f}{\partial x_2^2}=6.
$$

So

$$
H_f(x) =
\begin{bmatrix}
2 & 1 \\
1 & 6
\end{bmatrix}.
$$

In PyTorch:

```python
import torch

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

x = torch.tensor([1.0, 2.0])
H = torch.autograd.functional.hessian(f, x)

print(H)
```

The Hessian is constant for this quadratic function.

### Hessians and Optimization

The Hessian appears in second-order optimization. Around a point \(x\), a smooth scalar function can be approximated by a second-order Taylor expansion:

$$
f(x+\Delta)
\approx
f(x)
+
\nabla f(x)^\top \Delta
+
\frac{1}{2}\Delta^\top H_f(x)\Delta.
$$
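
For a quadratic function, such as the example in the next section, the second-order expansion is exact, which makes it easy to verify numerically:

```python
import torch

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

x = torch.tensor([1.0, 2.0])
delta = torch.tensor([0.3, -0.1])

grad = torch.autograd.functional.jacobian(f, x)  # gradient of the scalar f
H = torch.autograd.functional.hessian(f, x)

taylor = f(x) + grad @ delta + 0.5 * delta @ H @ delta
print(torch.allclose(taylor, f(x + delta)))  # True: exact for a quadratic
```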

The gradient term describes slope. The Hessian term describes curvature.

Newton’s method uses the Hessian to choose an update direction:

$$
\Delta = -H^{-1}\nabla f(x).
$$

Near a minimum with a positive definite Hessian, Newton's method can converge very rapidly. For deep learning, full Newton methods are usually impractical because the Hessian is too large to form or invert.
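
On a small problem the behavior is easy to see. For the quadratic \(f(x_1,x_2)=x_1^2+x_1x_2+3x_2^2\) used elsewhere in this chapter, a single Newton step lands on the minimizer, up to floating-point rounding:

```python
import torch

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

x = torch.tensor([5.0, -4.0])
grad = torch.autograd.functional.jacobian(f, x)
H = torch.autograd.functional.hessian(f, x)

# One Newton step: x - H^{-1} grad, solved without explicit inversion.
x_new = x - torch.linalg.solve(H, grad)
print(x_new)  # approximately [0., 0.], the unique minimum
```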

If a model has \(p\) parameters, the Hessian of the loss with respect to parameters has shape

$$
p\times p.
$$

For \(p=10^8\), this matrix would have \(10^{16}\) entries, roughly 40 petabytes in single precision. Storing it directly is far beyond ordinary hardware.

### Hessian-Vector Products

As with Jacobians, products are often more useful than full matrices.

A Hessian-vector product has the form

$$
Hv.
$$

It can be computed without explicitly forming \(H\). This is useful in curvature estimation, conjugate gradient methods, influence functions, and some meta-learning algorithms.

In PyTorch, one way to compute a Hessian-vector product is to differentiate a gradient-vector product:

```python
import torch

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

x = torch.tensor([1.0, 2.0], requires_grad=True)
v = torch.tensor([1.0, 0.5])

loss = f(x)
grad = torch.autograd.grad(loss, x, create_graph=True)[0]

grad_dot_v = (grad * v).sum()
hvp = torch.autograd.grad(grad_dot_v, x)[0]

print(hvp)
```

For

$$
H =
\begin{bmatrix}
2 & 1 \\
1 & 6
\end{bmatrix}
$$

and

$$
v =
\begin{bmatrix}
1 \\
0.5
\end{bmatrix},
$$

we get

$$
Hv =
\begin{bmatrix}
2.5 \\
4
\end{bmatrix}.
$$
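
PyTorch also packages this double-backward pattern as `torch.autograd.functional.hvp`, which returns the function value together with the product:

```python
import torch

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

x = torch.tensor([1.0, 2.0])
v = torch.tensor([1.0, 0.5])

# Returns (f(x), Hv) in one call.
value, hvp_value = torch.autograd.functional.hvp(f, x, v)

print(hvp_value)  # tensor([2.5000, 4.0000])
```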

### Higher-Order Derivatives in PyTorch

PyTorch can compute higher-order derivatives when the backward computation itself is recorded as a differentiable graph.

The key argument is `create_graph=True`.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

y = x ** 3
dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]

d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]

print(dy_dx)     # tensor(27., grad_fn=<MulBackward0>)
print(d2y_dx2)   # tensor(18.)
```

Since

$$
y=x^3,
$$

we have

$$
\frac{dy}{dx}=3x^2,
$$

and

$$
\frac{d^2y}{dx^2}=6x.
$$

At \(x=3\), these are \(27\) and \(18\).
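
The pattern extends to higher orders: each intermediate gradient must itself be created with `create_graph=True`. A third derivative, for example:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3

d1 = torch.autograd.grad(y, x, create_graph=True)[0]   # 3x^2 = 27
d2 = torch.autograd.grad(d1, x, create_graph=True)[0]  # 6x = 18
d3 = torch.autograd.grad(d2, x)[0]                     # constant 6

print(d3)  # tensor(6.)
```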

Higher-order derivatives use more memory and computation, so they should be reserved for algorithms that explicitly need them.

### Jacobians, Hessians, and Batches

Batches add another layer of shape complexity.

Suppose a model maps a batch

$$
X\in\mathbb{R}^{B\times d}
$$

to outputs

$$
Y\in\mathbb{R}^{B\times K}.
$$

The full Jacobian of \(Y\) with respect to \(X\) would have shape

$$
B\times K\times B\times d.
$$

This includes derivatives between every output example and every input example. In ordinary feedforward models, examples in the same batch are independent, so most cross-example derivatives are zero. However, some operations, such as batch normalization, can couple examples in a batch.

For this reason, per-sample gradients and per-example Jacobians require care. PyTorch’s `torch.func.vmap` can help compute such quantities efficiently.

A runnable sketch of the pattern, using a small `nn.Linear` stand-in for the model:

```python
import torch
from torch.func import jacrev, vmap

# A small stand-in model and batch, for illustration only.
model = torch.nn.Linear(4, 3)
X = torch.randn(8, 4)

def model_single(x):
    # Treat one example as a batch of size 1, then drop the batch dim.
    return model(x.unsqueeze(0)).squeeze(0)

per_example_jacobian = vmap(jacrev(model_single))(X)
print(per_example_jacobian.shape)  # torch.Size([8, 3, 4])
```

This computes a Jacobian for each example rather than one large batch-level Jacobian.

### Curvature in Deep Learning

The Hessian helps describe the geometry of the loss surface.

At a local minimum, the gradient is near zero. The Hessian indicates whether the loss curves upward in nearby directions. If the Hessian has large positive eigenvalues, the loss rises sharply in some directions. If it has small eigenvalues, the loss is flat in some directions. If it has negative eigenvalues, the point is not a strict local minimum.
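
These cases can be checked directly on a small example. For the Hessian of the quadratic from earlier in this chapter, both eigenvalues are positive, so the function curves upward in every direction:

```python
import torch

H = torch.tensor([[2.0, 1.0],
                  [1.0, 6.0]])

# eigvalsh is appropriate because a Hessian is symmetric.
eigvals = torch.linalg.eigvalsh(H)
print(eigvals)  # approximately [1.7639, 6.2361], both positive
```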

Deep neural networks often have high-dimensional loss surfaces with many flat directions. This makes curvature analysis difficult but useful. Hessian eigenvalues, trace estimates, and sharpness measures are often used to study optimization and generalization.

However, these quantities are expensive and sensitive to parameterization. They should be interpreted carefully.

### Practical Use Cases

Full Jacobians and Hessians are uncommon in ordinary training, but derivative products are common.

Jacobians are used in sensitivity analysis, adversarial examples, normalizing flows, implicit layers, neural differential equations, and some regularization methods.

Hessians and Hessian-vector products are used in second-order optimization, meta-learning, influence functions, uncertainty estimation, pruning, and loss-surface analysis.

Most day-to-day model training uses only first-order gradients. Advanced methods use Jacobian-vector products, vector-Jacobian products, and Hessian-vector products to get more information without materializing huge derivative matrices.

### Summary

A Jacobian is the matrix of first derivatives for a vector-valued function. A Hessian is the matrix of second derivatives for a scalar-valued function.

Full Jacobians and Hessians are usually too large for modern deep learning models. PyTorch and other autograd systems therefore focus on efficient derivative products: vector-Jacobian products, Jacobian-vector products, and Hessian-vector products.

These tools form the bridge between ordinary backpropagation and more advanced methods in optimization, uncertainty, scientific computing, and interpretability.

