Hessian-Vector Products

A Hessian-vector product computes

H_f(x)v = \nabla^2 f(x)v,

where

f : \mathbb{R}^n \to \mathbb{R},

and

v \in \mathbb{R}^n.

This operation applies the Hessian as a linear operator to a vector without explicitly constructing the full Hessian matrix.

For large systems, this distinction is fundamental. A dense Hessian requires O(n^2) storage. A Hessian-vector product requires only O(n) storage for the input and output vectors.

Most scalable second-order optimization methods are built around Hessian-vector products rather than explicit Hessians.

Directional Interpretation

The Hessian-vector product measures how the gradient changes in direction v.

Define

g(x) = \nabla f(x).

Then

H_f(x)v = Dg(x)[v].

Equivalently,

H_f(x)v = \frac{d}{d\epsilon} \nabla f(x + \epsilon v) \bigg|_{\epsilon = 0}.

So a Hessian-vector product is simply the directional derivative of the gradient.
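
This follows from expanding the gradient to first order in \epsilon:

\nabla f(x + \epsilon v) = \nabla f(x) + \epsilon\, H_f(x)v + O(\epsilon^2),

so differentiating with respect to \epsilon at \epsilon = 0 leaves exactly H_f(x)v.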

This identity is the basis of most automatic differentiation (AD) implementations of Hessian-vector products.

Geometric Meaning

The gradient defines the local slope field of a function. The Hessian describes how that slope field bends.

Given a direction v:

| Quantity | Meaning |
| --- | --- |
| \nabla f(x) | local slope |
| H_f(x)v | change of slope along v |
| v^\top H_f(x)v | curvature along v |

The vector

H_f(x)v

points in the direction the gradient moves when the input moves infinitesimally along v.

Example

Let

f(x, y) = x^2 y + \sin y.

The gradient is

\nabla f(x, y) = \begin{bmatrix} 2xy \\ x^2 + \cos y \end{bmatrix}.

The Hessian is

H_f(x, y) = \begin{bmatrix} 2y & 2x \\ 2x & -\sin y \end{bmatrix}.

For

v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix},

the Hessian-vector product is

H_f(x, y)v = \begin{bmatrix} 2yv_1 + 2xv_2 \\ 2xv_1 - \sin(y)v_2 \end{bmatrix}.

Notice that we never need to materialize the full Hessian matrix to compute this product. We only need the action of the matrix on the vector.
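
As a quick numerical check of this formula, one can compare it against an AD-computed product. A minimal sketch, assuming JAX is available; the point p and direction v are arbitrary illustrative values:

```python
import jax
import jax.numpy as jnp

def f(p):
    x, y = p
    return x**2 * y + jnp.sin(y)

p = jnp.array([1.5, -0.7])   # evaluation point (x, y)
v = jnp.array([0.3, 2.0])    # direction

# Hessian-vector product without forming the 2x2 Hessian:
# push the tangent v through the gradient computation.
_, hvp = jax.jvp(jax.grad(f), (p,), (v,))

# Closed-form product from the text.
x, y = p
expected = jnp.array([2*y*v[0] + 2*x*v[1],
                      2*x*v[0] - jnp.sin(y)*v[1]])

print(hvp, expected)  # the two vectors should agree up to floating point error
```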

Why Hessian-Vector Products Matter

Suppose

n = 10^6.

A dense Hessian contains

10^{12}

entries.

Even with 8-byte floating-point values, storage would require about 8 terabytes (10^{12} entries at 8 bytes each).

But a Hessian-vector product maps one vector to another:

\mathbb{R}^n \to \mathbb{R}^n.

The storage cost is linear in n.

Many optimization algorithms only need this operator action.

| Algorithm | Needs full Hessian? | Needs HVP? |
| --- | --- | --- |
| Newton-CG | no | yes |
| Conjugate gradient | no | yes |
| Lanczos | no | yes |
| Trust-region methods | sometimes | yes |
| Curvature diagnostics | no | yes |
| Krylov methods | no | yes |

This makes Hessian-vector products the central second-order primitive in large-scale optimization.

Forward-over-Reverse AD

The standard AD implementation uses forward-over-reverse differentiation.

First compute the gradient using reverse mode:

g(x) = \nabla f(x).

Then differentiate the gradient computation using forward mode:

Dg(x)[v] = H_f(x)v.

Algorithmically:

g = grad(f)          # reverse mode: build the gradient function
hvp = jvp(g, x, v)   # forward mode through g: H_f(x) v

This is efficient because reverse mode computes the full gradient of a scalar-output function at a small constant multiple of the cost of evaluating f, and forward mode then propagates a single directional derivative through that gradient computation at comparable cost.

The total cost is usually a small constant multiple of a single gradient evaluation.
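
In JAX-style notation, this composition can be written as a small helper. A sketch, assuming JAX; note that the tuple-based jvp signature differs slightly from the pseudocode above:

```python
import jax

def hvp_forward_over_reverse(f, x, v):
    """Compute H_f(x) v by pushing the tangent v through grad(f)."""
    # jax.jvp returns (primal output, tangent output); the tangent is H v.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]
```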

Reverse-over-Forward AD

An alternative uses reverse-over-forward.

First compute the directional derivative:

\phi_v(x) = Df(x)[v] = \nabla f(x)^\top v.

Then differentiate this scalar function with reverse mode:

\nabla \phi_v(x) = H_f(x)v.

Algorithmically:

phi(x) = jvp(f, x, v)   # forward mode: the scalar grad f(x)^T v
hvp = grad(phi)(x)      # reverse mode on the scalar phi: H_f(x) v

This is mathematically equivalent to forward-over-reverse.
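
The same two steps in JAX-style code (again a sketch; phi is the scalar directional derivative defined above):

```python
import jax

def hvp_reverse_over_forward(f, x, v):
    """Compute H_f(x) v by differentiating the scalar x -> grad f(x) . v."""
    phi = lambda x: jax.jvp(f, (x,), (v,))[1]  # forward mode: grad f(x)^T v
    return jax.grad(phi)(x)                    # reverse mode on the scalar phi
```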

The preferred method depends on:

| Factor | Influence |
| --- | --- |
| compiler structure | nesting overhead |
| tape implementation | reverse-mode complexity |
| memory reuse | checkpointing strategy |
| primitive support | availability of JVP/VJP rules |
| hardware backend | GPU fusion behavior |

Modern systems often support both.

Pearlmutter’s Method

A classical result due to Pearlmutter (1994) gives an efficient way to compute Hessian-vector products directly.

The core identity is:

H_f(x)v = R_v\{\nabla f(x)\},

where

R_v

is a forward-mode directional derivative operator.
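
Concretely, R_v acts on any intermediate quantity g(x) as

R_v\{g(x)\} = \frac{\partial}{\partial \epsilon}\, g(x + \epsilon v) \bigg|_{\epsilon = 0},

which is exactly the directional-derivative identity used above.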

This means:

  1. Run reverse mode to define the gradient computation.
  2. Push a tangent vector through that reverse computation.

The resulting complexity is approximately:

| Quantity | Complexity |
| --- | --- |
| function evaluation | C |
| gradient evaluation | \approx 2C |
| Hessian-vector product | O(C) |

up to constant factors and implementation details.

This result is important because it avoids quadratic scaling.

Hessian-Free Methods

Many large-scale optimization methods are called Hessian-free because they never explicitly construct the Hessian.

Instead, they repeatedly query:

v \mapsto H_f(x)v.

The Hessian becomes an implicit linear operator.

Newton-CG is a standard example.

Newton’s equation is

H_f(x)p = -\nabla f(x).

Rather than forming H_f(x), conjugate gradient solves the linear system using only repeated Hessian-vector products.

Algorithmically:

while not converged:
    w = hvp(v)              # query the operator: w = H_f(x) v
    update Krylov basis
    update search direction

This scales to extremely large systems because the algorithm never materializes the dense matrix.
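
A minimal sketch of this inner solve, assuming JAX, the forward-over-reverse composition from earlier, a positive-definite Hessian (plain conjugate gradient), and a fixed iteration count for simplicity:

```python
import jax
import jax.numpy as jnp

def newton_cg_direction(f, x, num_iters=20):
    """Approximately solve H_f(x) p = -grad f(x) using only HVP queries."""
    g = jax.grad(f)(x)
    hvp = lambda u: jax.jvp(jax.grad(f), (x,), (u,))[1]

    p = jnp.zeros_like(x)   # current solution estimate
    r = -g                  # residual of the linear system
    d = r                   # conjugate search direction
    for _ in range(num_iters):
        Hd = hvp(d)         # the only Hessian access: one HVP per iteration
        alpha = (r @ r) / (d @ Hd)
        p = p + alpha * d
        r_new = r - alpha * Hd
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return p
```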

Curvature Along Directions

A Hessian-vector product also gives directional curvature information.

The scalar

v^\top H_f(x)v

measures curvature along direction v.

This quantity appears in:

| Area | Usage |
| --- | --- |
| optimization | trust-region methods |
| deep learning | curvature diagnostics |
| physics | local energy curvature |
| numerical analysis | stability estimation |
| statistics | Fisher-like approximations |

If

v^\top H_f(x)v > 0,

the function bends upward along v.

If

v^\top H_f(x)v < 0,

the function bends downward.

If the value is near zero, the direction is locally flat.
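
With an HVP in hand, the directional curvature is a single additional dot product. A sketch, reusing the forward-over-reverse composition; f, x, and v are as in the earlier examples:

```python
import jax

# v^T H_f(x) v: one HVP followed by a dot product.
curvature = v @ jax.jvp(jax.grad(f), (x,), (v,))[1]
```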

Finite Difference Approximation

A Hessian-vector product can be approximated using finite differences of gradients:

H_f(x)v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}.

A centered approximation is more accurate:

H_f(x)v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x - \epsilon v)}{2\epsilon}.

This requires only gradient evaluations.

However, finite differences introduce truncation error and floating point cancellation. AD-based Hessian-vector products are usually more accurate and often faster.
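
A centered-difference version is easy to write when only a gradient routine is available. A sketch; the step size eps is problem-dependent, and a poor choice amplifies cancellation error:

```python
import jax

def hvp_finite_difference(f, x, v, eps=1e-4):
    """Approximate H_f(x) v with a centered difference of gradients."""
    g = jax.grad(f)
    return (g(x + eps * v) - g(x - eps * v)) / (2.0 * eps)
```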

Linear Operator View

A Hessian-vector product defines a linear map:

v \mapsto H_f(x)v.

The Hessian therefore behaves as a linear operator.

This suggests an important design principle.

Instead of exposing Hessians as dense arrays:

H = hessian(f)(x)

large systems should expose operator interfaces:

H = hessian_operator(f, x)

y = H.matvec(v)

This enables:

| Capability | Benefit |
| --- | --- |
| iterative solvers | no dense matrix |
| lazy evaluation | compute only needed products |
| sparse backends | exploit structure |
| distributed execution | shard operator application |
| GPU kernels | fuse operator evaluation |

This operator-centric design appears throughout modern scientific computing.
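
A minimal sketch of such an operator interface, assuming JAX; the class name and matvec method are illustrative rather than a real library API:

```python
import jax

class HessianOperator:
    """Lazily apply H_f(x) to vectors without ever forming the matrix."""

    def __init__(self, f, x):
        self.f = f
        self.x = x

    def matvec(self, v):
        # forward-over-reverse: differentiate grad(f) along v
        return jax.jvp(jax.grad(self.f), (self.x,), (v,))[1]

# Usage mirroring the pseudocode above:
#   H = HessianOperator(f, x)
#   y = H.matvec(v)
```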

HVPs in Neural Networks

Deep neural networks rarely use explicit Hessians because parameter counts are enormous.

Still, Hessian-vector products are heavily used in:

| Application | Role |
| --- | --- |
| second-order optimization | curvature-aware updates |
| meta-learning | implicit gradients |
| influence functions | parameter sensitivity |
| pruning | curvature estimation |
| uncertainty estimation | local geometry |
| neural tangent analysis | linearization studies |

In practice, most large ML systems treat Hessians as implicit operators accessed through HVP queries.

Memory Considerations

A Hessian-vector product can often be computed with memory usage close to that of a gradient computation.

However, nesting reverse mode inside reverse mode can dramatically increase memory consumption.

Forward-over-reverse is usually preferred because:

| Property | Forward-over-reverse |
| --- | --- |
| memory usage | moderate |
| implementation complexity | manageable |
| nesting stability | relatively good |
| operator efficiency | high |

Reverse-over-reverse may require differentiating tape management and adjoint accumulation logic, which is harder to optimize.

Structured Hessian Operators

Some systems expose specialized Hessian operators.

Examples include:

| Structure | Optimization |
| --- | --- |
| diagonal Hessian | store only diagonal |
| block Hessian | blockwise matvec |
| sparse Hessian | sparse kernels |
| low-rank Hessian | factored representation |
| Kronecker approximations | efficient inversion |
| Fisher approximations | probabilistic geometry |

The operator abstraction makes these interchangeable from the optimizer’s perspective.
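
As one concrete illustration, a diagonal Hessian operator can be built on top of the generic HVP by probing with standard basis vectors. A sketch; this costs n HVP queries and is only practical for moderate n:

```python
import jax
import jax.numpy as jnp

def hessian_diagonal(f, x):
    """Extract diag(H_f(x)) using one HVP per coordinate direction."""
    hvp = lambda u: jax.jvp(jax.grad(f), (x,), (u,))[1]
    basis = jnp.eye(x.shape[0])
    # e_i^T H e_i is the i-th diagonal entry.
    return jnp.array([basis[i] @ hvp(basis[i]) for i in range(x.shape[0])])
```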

Practical Principle

The full Hessian is usually a debugging or analysis object.

The Hessian-vector product is the scalable computational primitive.

A modern AD system should therefore optimize:

v \mapsto H_f(x)v

as a first-class operation rather than treating it as a derived convenience built from dense Hessian construction.