Gradients are enough for most neural network training. A gradient tells us how a scalar loss changes with respect to parameters. Some problems require a more detailed view of derivatives. Jacobians describe first derivatives of vector-valued functions. Hessians describe second derivatives of scalar-valued functions.
These objects are central to optimization theory, sensitivity analysis, uncertainty estimation, curvature-aware training, and some advanced methods in meta-learning and scientific machine learning.
From Derivatives to Jacobians
For a scalar function

$$f : \mathbb{R} \to \mathbb{R},$$

the derivative is a single number:

$$f'(x) = \frac{df}{dx}.$$

For a vector-valued function

$$f : \mathbb{R}^n \to \mathbb{R}^m,$$

where

$$y = f(x),$$

each output component may depend on each input component.

Write

$$y_i = f_i(x_1, \dots, x_n), \qquad i = 1, \dots, m.$$

The Jacobian is the matrix of all first partial derivatives:

$$J_{ij} = \frac{\partial f_i}{\partial x_j}.$$

Thus

$$J \in \mathbb{R}^{m \times n}.$$
The row index corresponds to an output component. The column index corresponds to an input component.
A Simple Jacobian Example
Consider the function

$$f(x_1, x_2) = \begin{pmatrix} x_1 + x_2 \\ x_1 x_2 \\ x_1^2 \end{pmatrix}.$$

Here the input has dimension 2 and the output has dimension 3. Therefore the Jacobian has shape $3 \times 2$.

The partial derivatives are:

$$\frac{\partial f_1}{\partial x_1} = 1, \quad \frac{\partial f_1}{\partial x_2} = 1, \qquad \frac{\partial f_2}{\partial x_1} = x_2, \quad \frac{\partial f_2}{\partial x_2} = x_1, \qquad \frac{\partial f_3}{\partial x_1} = 2x_1, \quad \frac{\partial f_3}{\partial x_2} = 0.$$

So

$$J = \begin{pmatrix} 1 & 1 \\ x_2 & x_1 \\ 2x_1 & 0 \end{pmatrix}.$$

At $x_1 = 2$, $x_2 = 3$,

$$J = \begin{pmatrix} 1 & 1 \\ 3 & 2 \\ 4 & 0 \end{pmatrix}.$$
In PyTorch:
```python
import torch

def f(x):
    x1, x2 = x[0], x[1]
    return torch.stack([
        x1 + x2,
        x1 * x2,
        x1 ** 2,
    ])

x = torch.tensor([2.0, 3.0])
J = torch.autograd.functional.jacobian(f, x)
print(J)
print(J.shape)  # torch.Size([3, 2])
```

The result matches the analytic Jacobian.
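Recent PyTorch versions also expose this computation through `torch.func.jacrev`, which composes with other function transforms. A minimal sketch of the same Jacobian:

```python
import torch
from torch.func import jacrev

def f(x):
    x1, x2 = x[0], x[1]
    return torch.stack([x1 + x2, x1 * x2, x1 ** 2])

x = torch.tensor([2.0, 3.0])
J = jacrev(f)(x)  # same 3x2 Jacobian, built from reverse-mode passes
print(J)
```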
Jacobians in Neural Networks
A neural network maps inputs to outputs:

$$f : \mathbb{R}^n \to \mathbb{R}^m.$$

If

$$y = f(x),$$

then the input-output Jacobian is

$$J = \frac{\partial y}{\partial x} \in \mathbb{R}^{m \times n}.$$

This matrix tells how each output changes when each input component changes.

For an image classifier, $n$ may be very large. A $224 \times 224$ RGB image has

$$224 \times 224 \times 3 = 150{,}528$$

input values. If the model has 1000 output classes, then the input-output Jacobian has more than 150 million entries for one image.
This is one reason full Jacobians are rarely materialized in standard training. Instead, deep learning systems compute products involving Jacobians.
Vector-Jacobian Products
Reverse-mode differentiation computes vector-Jacobian products.
Suppose

$$y = f(x), \qquad f : \mathbb{R}^n \to \mathbb{R}^m.$$

Let a scalar loss $L$ depend on $y$. The upstream gradient is

$$v = \frac{\partial L}{\partial y} \in \mathbb{R}^m.$$

The gradient with respect to $x$ is

$$\frac{\partial L}{\partial x} = J^\top v.$$
This product gives the effect of the downstream loss on the input. It avoids constructing the full Jacobian.
In PyTorch, this is what backward() computes:
```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.stack([
    x[0] + x[1],
    x[0] * x[1],
    x[0] ** 2,
])
v = torch.tensor([1.0, 10.0, 100.0])
y.backward(v)
print(x.grad)
```

Here v is the upstream gradient $v = \partial L / \partial y$. PyTorch computes

$$\frac{\partial L}{\partial x} = J^\top v.$$

For the Jacobian

$$J = \begin{pmatrix} 1 & 1 \\ 3 & 2 \\ 4 & 0 \end{pmatrix}$$

and

$$v = \begin{pmatrix} 1 \\ 10 \\ 100 \end{pmatrix},$$

we get

$$J^\top v = \begin{pmatrix} 1 \cdot 1 + 3 \cdot 10 + 4 \cdot 100 \\ 1 \cdot 1 + 2 \cdot 10 + 0 \cdot 100 \end{pmatrix} = \begin{pmatrix} 431 \\ 21 \end{pmatrix}.$$

The printed gradient is therefore [431., 21.].
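The same product is also available functionally through `torch.func.vjp`, which returns the output together with a closure that applies $v \mapsto J^\top v$. A sketch using the function above:

```python
import torch
from torch.func import vjp

def f(x):
    return torch.stack([x[0] + x[1], x[0] * x[1], x[0] ** 2])

x = torch.tensor([2.0, 3.0])
v = torch.tensor([1.0, 10.0, 100.0])

y, vjp_fn = vjp(f, x)   # forward pass, plus a closure computing J^T v
(grad_x,) = vjp_fn(v)
print(grad_x)           # tensor([431., 21.])
```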
Jacobian-Vector Products
Forward-mode differentiation computes Jacobian-vector products.
Given

$$f : \mathbb{R}^n \to \mathbb{R}^m$$

and a vector

$$v \in \mathbb{R}^n,$$

the Jacobian-vector product is

$$Jv \in \mathbb{R}^m.$$

It measures how the output changes when the input moves in direction $v$.
PyTorch supports forward-mode tools through torch.func and related APIs. A simple example:
```python
import torch
from torch.func import jvp

def f(x):
    return torch.stack([
        x[0] + x[1],
        x[0] * x[1],
        x[0] ** 2,
    ])

x = torch.tensor([2.0, 3.0])
v = torch.tensor([1.0, 0.0])
y, jvp_value = jvp(f, (x,), (v,))
print(y)
print(jvp_value)
```

Here v = [1, 0] asks how the output changes when only $x_1$ changes. The result is the first column of the Jacobian:

$$Jv = \begin{pmatrix} 1 \\ 3 \\ 4 \end{pmatrix}.$$
Why Products Matter More Than Matrices
The Jacobian can be too large to store. For modern models, even a single layer can have enormous derivative matrices.
Backpropagation avoids this problem. Instead of storing full Jacobians, it applies local vector-Jacobian products. Each operation receives an upstream gradient and returns gradients for its inputs.
For example, a linear layer

$$y = Wx + b$$

has Jacobian with respect to $x$:

$$\frac{\partial y}{\partial x} = W.$$

The backward pass does not need to store a separate Jacobian object. It computes

$$\frac{\partial L}{\partial x} = W^\top v,$$

where $v$ is the upstream gradient.
For batched matrix multiplication, convolution, attention, and normalization, PyTorch uses efficient backward kernels that compute the required products directly.
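These local products can be checked directly against autograd. The sketch below compares `backward()` on a small linear map with the hand-written vector-Jacobian products (the shapes are arbitrary illustrative choices):

```python
import torch

torch.manual_seed(0)
W = torch.randn(3, 2, requires_grad=True)
b = torch.randn(3, requires_grad=True)
x = torch.randn(2, requires_grad=True)
v = torch.randn(3)  # upstream gradient dL/dy

# Autograd applies the local vector-Jacobian products for y = W x + b.
y = W @ x + b
y.backward(v)

# The same products, written out by hand:
print(torch.allclose(x.grad, W.detach().t() @ v))          # dL/dx = W^T v
print(torch.allclose(W.grad, torch.outer(v, x.detach())))  # dL/dW = v x^T
print(torch.allclose(b.grad, v))                           # dL/db = v
```

All three checks print True: no Jacobian matrix for the layer is ever materialized, only its products with the upstream gradient.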
Hessians
A Hessian is a matrix of second derivatives. For a scalar function

$$f : \mathbb{R}^n \to \mathbb{R},$$

the Hessian is

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}.$$
The Hessian describes curvature. While the gradient tells the direction of steepest local increase, the Hessian tells how the gradient itself changes.
A Simple Hessian Example
Consider

$$f(x_1, x_2) = x_1^2 + x_1 x_2 + 3 x_2^2.$$

The first derivatives are

$$\frac{\partial f}{\partial x_1} = 2x_1 + x_2, \qquad \frac{\partial f}{\partial x_2} = x_1 + 6x_2.$$

The second derivatives are

$$\frac{\partial^2 f}{\partial x_1^2} = 2, \qquad \frac{\partial^2 f}{\partial x_1 \, \partial x_2} = 1, \qquad \frac{\partial^2 f}{\partial x_2^2} = 6.$$

So

$$H = \begin{pmatrix} 2 & 1 \\ 1 & 6 \end{pmatrix}.$$
In PyTorch:
```python
import torch

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

x = torch.tensor([1.0, 2.0])
H = torch.autograd.functional.hessian(f, x)
print(H)
```

The Hessian is constant for this quadratic function.
Hessians and Optimization
The Hessian appears in second-order optimization. Around a point $x_0$, a smooth scalar function can be approximated by a second-order Taylor expansion:

$$f(x) \approx f(x_0) + \nabla f(x_0)^\top (x - x_0) + \tfrac{1}{2} (x - x_0)^\top H(x_0) \, (x - x_0).$$

The gradient term describes slope. The Hessian term describes curvature.

Newton’s method uses the Hessian to choose an update direction:

$$x_{k+1} = x_k - H(x_k)^{-1} \nabla f(x_k).$$
This can converge rapidly for some problems. For deep learning, full Newton methods are usually impractical because the Hessian is too large.
If a model has $p$ parameters, the Hessian of the loss with respect to parameters has shape

$$p \times p.$$

For a model with, say, $p = 10^6$ parameters, this matrix would have $10^{12}$ entries. Storing it directly is impossible on ordinary hardware.
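For small problems, though, a Newton step is easy to write. A sketch using the quadratic from the Hessian example above, for which a single step reaches the exact minimum:

```python
import torch

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

x = torch.tensor([1.0, 2.0])
g = torch.autograd.functional.jacobian(f, x)   # gradient of the scalar f
H = torch.autograd.functional.hessian(f, x)

# Newton step: x_new = x - H^{-1} grad f(x)
x_new = x - torch.linalg.solve(H, g)
print(x_new)  # approximately [0., 0.], the minimum, reached in one step
```

For a quadratic, the second-order Taylor expansion is exact, so Newton's method jumps straight to the stationary point.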
Hessian-Vector Products
As with Jacobians, products are often more useful than full matrices.
A Hessian-vector product has the form

$$Hv.$$
It can be computed without explicitly forming . This is useful in curvature estimation, conjugate gradient methods, influence functions, and some meta-learning algorithms.
In PyTorch, one way to compute a Hessian-vector product is to differentiate a gradient-vector product:
```python
import torch

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

x = torch.tensor([1.0, 2.0], requires_grad=True)
v = torch.tensor([1.0, 0.5])

loss = f(x)
grad = torch.autograd.grad(loss, x, create_graph=True)[0]
grad_dot_v = (grad * v).sum()
hvp = torch.autograd.grad(grad_dot_v, x)[0]
print(hvp)
```

For

$$H = \begin{pmatrix} 2 & 1 \\ 1 & 6 \end{pmatrix}$$

and

$$v = \begin{pmatrix} 1 \\ 0.5 \end{pmatrix},$$

we get

$$Hv = \begin{pmatrix} 2 \cdot 1 + 1 \cdot 0.5 \\ 1 \cdot 1 + 6 \cdot 0.5 \end{pmatrix} = \begin{pmatrix} 2.5 \\ 4 \end{pmatrix}.$$
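`torch.func` offers another route: composing `grad` with `jvp` gives a "forward-over-reverse" Hessian-vector product, which differentiates the gradient along the direction $v$. A sketch for the same function:

```python
import torch
from torch.func import grad, jvp

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

x = torch.tensor([1.0, 2.0])
v = torch.tensor([1.0, 0.5])

# Forward-over-reverse: push the direction v through the gradient function.
_, hvp_value = jvp(grad(f), (x,), (v,))
print(hvp_value)  # tensor([2.5000, 4.0000])
```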
Higher-Order Derivatives in PyTorch
PyTorch can compute higher-order derivatives when the backward computation itself is recorded as a differentiable graph.
The key argument is create_graph=True.
```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3

dy_dx = torch.autograd.grad(y, x, create_graph=True)[0]
d2y_dx2 = torch.autograd.grad(dy_dx, x)[0]
print(dy_dx)    # tensor(27., grad_fn=<MulBackward0>)
print(d2y_dx2)  # tensor(18.)
```

Since

$$y = x^3,$$

we have

$$\frac{dy}{dx} = 3x^2$$

and

$$\frac{d^2y}{dx^2} = 6x.$$

At $x = 3$, these are $27$ and $18$.
Higher-order derivatives use more memory and computation. They should be used when the algorithm explicitly needs them.
Jacobians, Hessians, and Batches
Batches add another layer of shape complexity.
Suppose a model maps a batch

$$X \in \mathbb{R}^{B \times n}$$

to outputs

$$Y \in \mathbb{R}^{B \times m}.$$

The full Jacobian of $Y$ with respect to $X$ would have shape

$$(B, m, B, n).$$
This includes derivatives between every output example and every input example. In ordinary feedforward models, examples in the same batch are independent, so most cross-example derivatives are zero. However, some operations, such as batch normalization, can couple examples in a batch.
For this reason, per-sample gradients and per-example Jacobians require care. PyTorch’s torch.func.vmap can help compute such quantities efficiently.
Example pattern:
```python
import torch
from torch.func import jacrev, vmap

model = torch.nn.Linear(4, 3)   # illustrative model
X = torch.randn(8, 4)           # batch of 8 examples

def model_single(x):
    return model(x.unsqueeze(0)).squeeze(0)

per_example_jacobian = vmap(jacrev(model_single))(X)
print(per_example_jacobian.shape)  # torch.Size([8, 3, 4])
```

This computes a Jacobian for each example rather than one large batch-level Jacobian.
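The same `vmap` idea gives per-sample gradients. The sketch below uses `torch.func.functional_call` to treat the parameters as explicit inputs; the model, shapes, and squared-error loss are illustrative choices:

```python
import torch
from torch.func import functional_call, grad, vmap

model = torch.nn.Linear(4, 1)
params = dict(model.named_parameters())
X = torch.randn(8, 4)
y = torch.randn(8)

def loss_single(params, x, target):
    # Run the model on one example with the given parameter dict.
    pred = functional_call(model, params, (x.unsqueeze(0),)).squeeze()
    return (pred - target) ** 2

# One gradient per example, without a Python loop over the batch.
per_sample_grads = vmap(grad(loss_single), in_dims=(None, 0, 0))(params, X, y)
print(per_sample_grads["weight"].shape)  # torch.Size([8, 1, 4])
```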
Curvature in Deep Learning
The Hessian helps describe the geometry of the loss surface.
At a local minimum, the gradient is near zero. The Hessian indicates whether the loss curves upward in nearby directions. If the Hessian has large positive eigenvalues, the loss rises sharply in some directions. If it has small eigenvalues, the loss is flat in some directions. If it has negative eigenvalues, the point is not a strict local minimum.
Deep neural networks often have high-dimensional loss surfaces with many flat directions. This makes curvature analysis difficult but useful. Hessian eigenvalues, trace estimates, and sharpness measures are often used to study optimization and generalization.
However, these quantities are expensive and sensitive to parameterization. They should be interpreted carefully.
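One common compromise is to estimate only extreme eigenvalues from Hessian-vector products, never forming the matrix. A sketch of power iteration on the small quadratic from earlier; real analyses apply the same loop to a network's loss:

```python
import torch
from torch.func import grad, jvp

def f(x):
    return x[0] ** 2 + x[0] * x[1] + 3 * x[1] ** 2

def hvp(x, v):
    # Hessian-vector product via forward-over-reverse differentiation.
    return jvp(grad(f), (x,), (v,))[1]

x = torch.tensor([1.0, 2.0])

torch.manual_seed(0)
v = torch.randn(2)

# Power iteration: repeatedly applying H to a vector converges toward the
# eigenvector with the largest-magnitude eigenvalue.
for _ in range(50):
    v = hvp(x, v)
    v = v / v.norm()

top_eigenvalue = torch.dot(v, hvp(x, v))
print(top_eigenvalue)  # close to 6.236, the largest eigenvalue of [[2, 1], [1, 6]]
```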
Practical Use Cases
Full Jacobians and Hessians are uncommon in ordinary training, but derivative products are common.
Jacobians are used in sensitivity analysis, adversarial examples, normalizing flows, implicit layers, neural differential equations, and some regularization methods.
Hessians and Hessian-vector products are used in second-order optimization, meta-learning, influence functions, uncertainty estimation, pruning, and loss-surface analysis.
Most day-to-day model training uses only first-order gradients. Advanced methods use Jacobian-vector products, vector-Jacobian products, and Hessian-vector products to get more information without materializing huge derivative matrices.
Summary
A Jacobian is the matrix of first derivatives for a vector-valued function. A Hessian is the matrix of second derivatives for a scalar-valued function.
Full Jacobians and Hessians are usually too large for modern deep learning models. PyTorch and other autograd systems therefore focus on efficient derivative products: vector-Jacobian products, Jacobian-vector products, and Hessian-vector products.
These tools form the bridge between ordinary backpropagation and more advanced methods in optimization, uncertainty, scientific computing, and interpretability.