A Hessian-vector product computes

$$
\mathrm{HVP}(x, v) = H(x)\, v,
$$

where

$$
H(x) = \nabla^2 f(x)
$$

and

$$
f : \mathbb{R}^n \to \mathbb{R}, \qquad v \in \mathbb{R}^n.
$$
This operation applies the Hessian as a linear operator to a vector without explicitly constructing the full Hessian matrix.
For large systems, this distinction is fundamental. A dense Hessian requires $O(n^2)$ storage. A Hessian-vector product requires only $O(n)$ storage for the input and output vectors.
Most scalable second-order optimization methods are built around Hessian-vector products rather than explicit Hessians.
## Directional Interpretation

The Hessian-vector product measures how the gradient changes in direction $v$.

Define

$$
\phi(t) = \nabla f(x + t v).
$$

Then

$$
\phi'(0) = H(x)\, v.
$$

Equivalently,

$$
H(x)\, v = \lim_{\epsilon \to 0} \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}.
$$
So a Hessian-vector product is simply the directional derivative of the gradient.
This identity is the basis of most AD implementations of Hessian-vector products.
## Geometric Meaning
The gradient defines the local slope field of a function. The Hessian describes how that slope field bends.
Given a direction $v$:

| Quantity | Meaning |
|---|---|
| $\nabla f(x)$ | local slope |
| $H(x)\, v$ | change of slope along $v$ |
| $v^\top H(x)\, v$ | curvature along $v$ |
The vector

$$
H(x)\, v
$$

points in the direction the gradient moves when the input moves infinitesimally along $v$.
## Example

Let

$$
f(x_1, x_2) = x_1^2 x_2 + x_2^2.
$$

The gradient is

$$
\nabla f(x) = \begin{pmatrix} 2 x_1 x_2 \\ x_1^2 + 2 x_2 \end{pmatrix}.
$$

The Hessian is

$$
H(x) = \begin{pmatrix} 2 x_2 & 2 x_1 \\ 2 x_1 & 2 \end{pmatrix}.
$$

For

$$
v = \begin{pmatrix} 1 \\ 1 \end{pmatrix},
$$

the Hessian-vector product is

$$
H(x)\, v = \begin{pmatrix} 2 x_2 + 2 x_1 \\ 2 x_1 + 2 \end{pmatrix}.
$$
Notice that we never need to materialize the full Hessian matrix to compute this product. We only need the action of the matrix on the vector.
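As a sanity check, the same product can be computed without ever forming $H$, for example via forward-over-reverse automatic differentiation in JAX (a minimal sketch; the `hvp` helper is our name, not a library function):

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0] ** 2 * x[1] + x[1] ** 2

def hvp(f, x, v):
    # Directional derivative of the gradient: d/dt grad f(x + t v) at t = 0.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([1.0, 2.0])
v = jnp.array([1.0, 1.0])
print(hvp(f, x, v))  # [6. 4.], matching (2*x2 + 2*x1, 2*x1 + 2)
```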
## Why Hessian-Vector Products Matter

Suppose

$$
n = 10^6.
$$

A dense Hessian contains

$$
n^2 = 10^{12}
$$

entries. Even with 8-byte floating point values, storage would require about 8 terabytes.

But a Hessian-vector product maps one vector to another:

$$
v \mapsto H(x)\, v, \qquad \mathbb{R}^n \to \mathbb{R}^n.
$$

The storage cost is linear in $n$.
Many optimization algorithms only need this operator action.
| Algorithm | Needs full Hessian? | Needs HVP? |
|---|---|---|
| Newton-CG | no | yes |
| Conjugate gradient | no | yes |
| Lanczos | no | yes |
| Trust-region methods | sometimes | yes |
| Curvature diagnostics | no | yes |
| Krylov methods | no | yes |
This makes Hessian-vector products the central second-order primitive in large-scale optimization.
## Forward-over-Reverse AD
The standard AD implementation uses forward-over-reverse differentiation.
First compute the gradient using reverse mode:

$$
g(x) = \nabla f(x).
$$

Then differentiate the gradient computation using forward mode:

$$
H(x)\, v = \left.\frac{\partial}{\partial t}\, g(x + t v)\right|_{t=0}.
$$

Algorithmically:

```python
g = grad(f)                  # reverse mode: g(x) = grad f(x)
hvp = jvp(g, (x,), (v,))[1]  # forward mode through g: H(x) v
```

This is efficient because reverse mode computes gradients well for scalar-output functions, and forward mode efficiently propagates a directional derivative through that gradient computation.
The total cost is usually comparable to a small multiple of gradient evaluation.
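When many products are needed at the same point $x$, as in the iterative methods below, the forward-over-reverse nesting can be set up once and reused. For instance, JAX's `linearize` transform does exactly this (a sketch; `make_hvp` is our wrapper, not a library API):

```python
import jax

def make_hvp(f, x):
    # Linearize grad(f) at x once; the returned closure applies H(x)
    # to arbitrary vectors without re-tracing the gradient computation.
    _, hvp_fn = jax.linearize(jax.grad(f), x)
    return hvp_fn
```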
## Reverse-over-Forward AD
An alternative uses reverse-over-forward.
First compute the directional derivative:

$$
\phi(x) = \nabla f(x)^\top v.
$$

Then differentiate this scalar function with reverse mode:

$$
\nabla_x\, \phi(x) = H(x)\, v,
$$

using the symmetry of the Hessian.

Algorithmically:

```python
phi = lambda x: jvp(f, (x,), (v,))[1]
hvp = grad(phi)(x)
```

This is mathematically equivalent to forward-over-reverse.
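The same nesting expressed in JAX, as a sketch (the function name is illustrative):

```python
import jax

def hvp_reverse_over_forward(f, x, v):
    # phi(x) = <grad f(x), v>, computed with a single forward-mode JVP;
    # reverse mode applied to the scalar phi then yields H(x) v.
    phi = lambda x: jax.jvp(f, (x,), (v,))[1]
    return jax.grad(phi)(x)
```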
The preferred method depends on:
| Factor | Influence |
|---|---|
| compiler structure | nesting overhead |
| tape implementation | reverse-mode complexity |
| memory reuse | checkpointing strategy |
| primitive support | availability of JVP/VJP rules |
| hardware backend | GPU fusion behavior |
Modern systems often support both.
## Pearlmutter’s Method
A classical result due to Pearlmutter gives an efficient way to compute Hessian-vector products directly.
The core identity is:

$$
H(x)\, v = \mathcal{R}_v\{\nabla f(x)\},
$$

where

$$
\mathcal{R}_v\{g(x)\} = \left.\frac{\partial}{\partial r}\, g(x + r v)\right|_{r=0}
$$

is a forward-mode directional derivative operator.
This means:
- Run reverse mode to define the gradient computation.
- Push a tangent vector through that reverse computation.
The resulting complexity is approximately:

| Quantity | Complexity |
|---|---|
| function evaluation | $T$ |
| gradient evaluation | $O(T)$ |
| Hessian-vector product | $O(T)$ |

where $T$ is the cost of a single evaluation of $f$,
up to constant factors and implementation details.
This result is important because it avoids the quadratic scaling in $n$ that explicit Hessian construction requires.
## Hessian-Free Methods
Many large-scale optimization methods are called Hessian-free because they never explicitly construct the Hessian.
Instead, they repeatedly query:

$$
v \mapsto H(x)\, v.
$$
The Hessian becomes an implicit linear operator.
Newton-CG is a standard example.
Newton’s equation is

$$
H(x)\, p = -\nabla f(x).
$$

Rather than forming $H(x)$, conjugate gradient solves the linear system using only repeated Hessian-vector products.
Algorithmically:
```
while not converged:
    w = hvp(v)
    update Krylov basis
    update search direction
```

This scales to extremely large systems because the algorithm never materializes the dense matrix.
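Below is a minimal eager-mode sketch of this loop in JAX. It assumes the Hessian is positive definite so that plain CG applies; the function name, iteration cap, and tolerance are illustrative:

```python
import jax
import jax.numpy as jnp

def newton_cg_step(f, x, max_iters=50, tol=1e-6):
    # Solve H(x) p = -grad f(x) by conjugate gradient,
    # touching H only through Hessian-vector products.
    g = jax.grad(f)(x)
    hvp = lambda v: jax.jvp(jax.grad(f), (x,), (v,))[1]
    p = jnp.zeros_like(x)
    r = -g                      # residual of H p = -g at p = 0
    d = r
    rs = r @ r
    for _ in range(max_iters):
        Hd = hvp(d)
        alpha = rs / (d @ Hd)
        p = p + alpha * d
        r = r - alpha * Hd
        rs_new = r @ r
        if jnp.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x + p                # one (undamped) Newton step
```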
## Curvature Along Directions
A Hessian-vector product also gives directional curvature information.
The scalar

$$
v^\top H(x)\, v = v^\top\, \mathrm{HVP}(x, v)
$$

measures curvature along direction $v$.
This quantity appears in:
| Area | Usage |
|---|---|
| optimization | trust-region methods |
| deep learning | curvature diagnostics |
| physics | local energy curvature |
| numerical analysis | stability estimation |
| statistics | Fisher-like approximations |
If

$$
v^\top H(x)\, v > 0,
$$

the function bends upward along $v$. If

$$
v^\top H(x)\, v < 0,
$$

the function bends downward.
If the value is near zero, the direction is locally flat.
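In code, one HVP plus a dot product suffices (a sketch; the function name is ours):

```python
import jax

def directional_curvature(f, x, v):
    # v^T H(x) v via a single Hessian-vector product.
    hv = jax.jvp(jax.grad(f), (x,), (v,))[1]
    return v @ hv
```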
## Finite Difference Approximation

A Hessian-vector product can be approximated using finite differences of gradients:

$$
H(x)\, v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}.
$$

A centered approximation is more accurate:

$$
H(x)\, v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x - \epsilon v)}{2 \epsilon}.
$$

This requires only two gradient evaluations.
However, finite differences introduce truncation error and floating point cancellation. AD-based Hessian-vector products are usually more accurate and often faster.
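For comparison, a sketch of the centered-difference variant; the step size `eps` is an illustrative default that trades truncation error against cancellation:

```python
import jax

def hvp_finite_difference(f, x, v, eps=1e-4):
    # Centered difference of gradients: O(eps^2) truncation error,
    # but increasingly prone to cancellation as eps shrinks.
    g = jax.grad(f)
    return (g(x + eps * v) - g(x - eps * v)) / (2.0 * eps)
```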
## Linear Operator View

A Hessian-vector product defines a linear map:

$$
v \mapsto H(x)\, v, \qquad \mathbb{R}^n \to \mathbb{R}^n.
$$
The Hessian therefore behaves as a linear operator.
This suggests an important design principle.
Instead of exposing Hessians as dense arrays:

```python
H = hessian(f)(x)
```

large systems should expose operator interfaces:

```python
H = hessian_operator(f, x)
y = H.matvec(v)
```

This enables:
| Capability | Benefit |
|---|---|
| iterative solvers | no dense matrix |
| lazy evaluation | compute only needed products |
| sparse backends | exploit structure |
| distributed execution | shard operator application |
| GPU kernels | fuse operator evaluation |
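A minimal sketch of such an operator, here backed by JAX's AD (the class and method names mirror the pseudocode above and are illustrative, not a real library API):

```python
import jax

class HessianOperator:
    """Lazy Hessian of f at x: stores no matrix, answers matvec queries."""

    def __init__(self, f, x):
        # Linearize grad(f) at x once so repeated matvecs stay cheap.
        self.grad, self._hvp = jax.linearize(jax.grad(f), x)

    def matvec(self, v):
        # Apply H(x) to v through the linearized gradient.
        return self._hvp(v)
```

An iterative solver can consume `matvec` without knowing whether the operator is dense, sparse, or AD-backed.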
This operator-centric design appears throughout modern scientific computing.
## HVPs in Neural Networks
Deep neural networks rarely use explicit Hessians because parameter counts are enormous.
Still, Hessian-vector products are heavily used in:
| Application | Role |
|---|---|
| second-order optimization | curvature-aware updates |
| meta-learning | implicit gradients |
| influence functions | parameter sensitivity |
| pruning | curvature estimation |
| uncertainty estimation | local geometry |
| neural tangent analysis | linearization studies |
In practice, most large ML systems treat Hessians as implicit operators accessed through HVP queries.
## Memory Considerations
A Hessian-vector product can often be computed with memory usage close to gradient computation.
However, nesting reverse mode inside reverse mode can dramatically increase memory consumption.
Forward-over-reverse is usually preferred because:
| Property | Forward-over-reverse |
|---|---|
| memory usage | moderate |
| implementation complexity | manageable |
| nesting stability | relatively good |
| operator efficiency | high |
Reverse-over-reverse may require differentiating tape management and adjoint accumulation logic, which is harder to optimize.
## Structured Hessian Operators
Some systems expose specialized Hessian operators.
Examples include:
| Structure | Optimization |
|---|---|
| diagonal Hessian | store only diagonal |
| block Hessian | blockwise matvec |
| sparse Hessian | sparse kernels |
| low-rank Hessian | factored representation |
| Kronecker approximations | efficient inversion |
| Fisher approximations | probabilistic geometry |
The operator abstraction makes these interchangeable from the optimizer’s perspective.
## Practical Principle
The full Hessian is usually a debugging or analysis object.
The Hessian-vector product is the scalable computational primitive.
A modern AD system should therefore optimize

$$
(x, v) \mapsto H(x)\, v
$$

as a first-class operation rather than treating it as a derived convenience built from dense Hessian construction.