# Hessian-Vector Products

## Definition

A Hessian-vector product computes

$$
H_f(x)v =
\nabla^2 f(x)v,
$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable and $v \in \mathbb{R}^n$.

This operation applies the Hessian as a linear operator to a vector without explicitly constructing the full Hessian matrix.

For large systems, this distinction is fundamental. A dense Hessian requires $O(n^2)$ storage. A Hessian-vector product requires only $O(n)$ storage for the input and output vectors.

Most scalable second-order optimization methods are built around Hessian-vector products rather than explicit Hessians.

## Directional Interpretation

The Hessian-vector product measures how the gradient changes in direction $v$.

Define

$$
g(x) = \nabla f(x).
$$

Then

$$
H_f(x)v =
Dg(x)[v].
$$

Equivalently,

$$
H_f(x)v =
\frac{d}{d\epsilon}
\nabla f(x + \epsilon v)
\bigg|_{\epsilon = 0}.
$$

So a Hessian-vector product is simply the directional derivative of the gradient.
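As a quick check, take the quadratic $f(x) = \tfrac{1}{2} x^\top A x$ with symmetric $A$, so that $\nabla f(x) = Ax$. Then

$$
\frac{d}{d\epsilon}
\nabla f(x + \epsilon v)
\bigg|_{\epsilon = 0}
=
\frac{d}{d\epsilon}
(Ax + \epsilon Av)
\bigg|_{\epsilon = 0}
= Av,
$$

which is exactly the Hessian $A$ applied to $v$.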

This identity is the basis of most AD implementations of Hessian-vector products.

## Geometric Meaning

The gradient defines the local slope field of a function. The Hessian describes how that slope field bends.

Given a direction $v$:

| Quantity | Meaning |
|---|---|
| $\nabla f(x)$ | local slope |
| $H_f(x)v$ | change of slope along $v$ |
| $v^\top H_f(x)v$ | curvature along $v$ |

The vector $H_f(x)v$ points in the direction the gradient moves when the input moves infinitesimally along $v$.

## Example

Let

$$
f(x, y) = x^2 y + \sin y.
$$

The gradient is

$$
\nabla f(x, y) =
\begin{bmatrix}
2xy \\
x^2 + \cos y
\end{bmatrix}.
$$

The Hessian is

$$
H_f(x, y) =
\begin{bmatrix}
2y & 2x \\
2x & -\sin y
\end{bmatrix}.
$$

For

$$
v =
\begin{bmatrix}
v_1 \\
v_2
\end{bmatrix},
$$

the Hessian-vector product is

$$
H_f(x, y)v =
\begin{bmatrix}
2yv_1 + 2xv_2 \\
2xv_1 - \sin(y)v_2
\end{bmatrix}.
$$

Notice that we never need to materialize the full Hessian matrix to compute this product. We only need the action of the matrix on the vector.
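As a concrete check, here is a short sketch in JAX (one AD framework among several; the `jax.jvp`-over-`jax.grad` construction used here is explained in the forward-over-reverse section below):

```python
import jax
import jax.numpy as jnp

def f(p):
    x, y = p
    return x**2 * y + jnp.sin(y)

p = jnp.array([1.5, 0.3])
v = jnp.array([1.0, 2.0])

# HVP as the forward-mode derivative of the gradient.
_, hv = jax.jvp(jax.grad(f), (p,), (v,))

# Analytic HVP from the worked example above.
x, y = p
analytic = jnp.array([2 * y * v[0] + 2 * x * v[1],
                      2 * x * v[0] - jnp.sin(y) * v[1]])

print(jnp.allclose(hv, analytic))  # True
```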

## Why Hessian-Vector Products Matter

Suppose $n = 10^6$. A dense Hessian then contains $10^{12}$ entries.

Even with 8-byte floating-point values, storage would require 8 terabytes.

But a Hessian-vector product maps one vector to another:

$$
\mathbb{R}^n \to \mathbb{R}^n.
$$

The storage cost is linear in $n$.

Many optimization algorithms only need this operator action.

| Algorithm | Needs full Hessian? | Needs HVP? |
|---|---|---|
| Newton-CG | no | yes |
| Conjugate gradient | no | yes |
| Lanczos | no | yes |
| Trust-region methods | sometimes | yes |
| Curvature diagnostics | no | yes |
| Krylov methods | no | yes |

This makes Hessian-vector products the central second-order primitive in large-scale optimization.

## Forward-over-Reverse AD

The standard AD implementation of Hessian-vector products uses forward-over-reverse differentiation.

First compute the gradient using reverse mode:

$$
g(x) = \nabla f(x).
$$

Then differentiate the gradient computation using forward mode:

$$
Dg(x)[v] =
H_f(x)v.
$$

Algorithmically:

```text
g = grad(f)          # reverse mode: builds x -> grad f(x)
hvp = jvp(g, x, v)   # forward mode through g: H_f(x) v
```
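In JAX, for instance, this two-line recipe becomes (a minimal sketch; `f` is any scalar-valued function, and `x`, `v` are arrays of matching shape):

```python
import jax

def hvp_forward_over_reverse(f, x, v):
    # jax.jvp pushes the tangent v through the gradient function,
    # returning (grad f(x), H_f(x) v); keep only the tangent output.
    _, hv = jax.jvp(jax.grad(f), (x,), (v,))
    return hv
```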

This is efficient because reverse mode computes the full gradient of a scalar-output function at a cost comparable to a few function evaluations, and forward mode then propagates a single directional derivative through that gradient computation.

The total cost is typically a small constant multiple of the cost of a single gradient evaluation.

## Reverse-over-Forward AD

An alternative uses reverse-over-forward.

First compute the directional derivative:

$$
\phi_v(x) =
Df(x)[v] =
\nabla f(x)^\top v.
$$

Then differentiate this scalar function with reverse mode:

$$
\nabla \phi_v(x) =
H_f(x)v.
$$

Algorithmically:

```text
phi(x) = jvp(f, x, v)   # scalar directional derivative: grad f(x)^T v
hvp = grad(phi)(x)      # reverse mode on the scalar gives H_f(x) v
```
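The corresponding JAX sketch (same assumptions as above):

```python
import jax

def hvp_reverse_over_forward(f, x, v):
    # phi(x) = grad f(x) . v, computed in forward mode without forming grad f.
    phi = lambda x: jax.jvp(f, (x,), (v,))[1]
    # Reverse mode on the scalar phi yields H_f(x) v.
    return jax.grad(phi)(x)
```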

This is mathematically equivalent to forward-over-reverse.

The preferred method depends on:

| Factor | Influence |
|---|---|
| compiler structure | nesting overhead |
| tape implementation | reverse-mode complexity |
| memory reuse | checkpointing strategy |
| primitive support | availability of JVP/VJP rules |
| hardware backend | GPU fusion behavior |

Modern systems often support both.

## Pearlmutter's Method

A classical result due to Pearlmutter (1994) gives an efficient way to compute Hessian-vector products directly.

The core identity is:

$$
H_f(x)v =
R_v\{\nabla f(x)\},
$$

where $R_v$ is the forward-mode directional-derivative operator in direction $v$.

This means:

1. Run reverse mode to define the gradient computation.
2. Push a tangent vector through that reverse computation.

The resulting complexity is approximately:

| Quantity | Complexity |
|---|---|
| function evaluation | $C$ |
| gradient evaluation | $\approx 2C$ |
| Hessian-vector product | $O(C)$ |

up to constant factors and implementation details.

This result is important because it avoids the quadratic cost of forming the Hessian explicitly.

## Hessian-Free Methods

Many large-scale optimization methods are called Hessian-free because they never explicitly construct the Hessian.

Instead, they repeatedly query:

$$
v \mapsto H_f(x)v.
$$

The Hessian becomes an implicit linear operator.

Newton-CG is a standard example.

Newton's equation is

$$
H_f(x)p = -\nabla f(x).
$$

Rather than forming $H_f(x)$, conjugate gradient solves the linear system using only repeated Hessian-vector products.

Algorithmically:

```text
while not converged:
    w = hvp(v)
    update Krylov basis
    update search direction
```

This scales to extremely large systems because the algorithm never materializes the dense matrix.
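A hedged sketch of a single Newton-CG step in JAX, using the matrix-free conjugate gradient solver `jax.scipy.sparse.linalg.cg` (the helper name `newton_cg_step` is illustrative, and CG assumes the Hessian is positive definite here):

```python
import jax
from jax.scipy.sparse.linalg import cg

def newton_cg_step(f, x):
    g = jax.grad(f)(x)
    # Matrix-free linear operator v -> H_f(x) v.
    hvp = lambda v: jax.jvp(jax.grad(f), (x,), (v,))[1]
    # Solve H p = -g using only repeated HVP queries.
    p, _ = cg(hvp, -g)
    return x + p
```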

## Curvature Along Directions

A Hessian-vector product also gives directional curvature information.

The scalar

$$
v^\top H_f(x)v
$$

measures curvature along direction $v$.

This quantity appears in:

| Area | Usage |
|---|---|
| optimization | trust-region methods |
| deep learning | curvature diagnostics |
| physics | local energy curvature |
| numerical analysis | stability estimation |
| statistics | Fisher-like approximations |

If

$$
v^\top H_f(x)v > 0,
$$

the function bends upward along $v$.

If

$$
v^\top H_f(x)v < 0,
$$

the function bends downward.

If the value is near zero, the direction is locally flat.
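In code, this requires one HVP and a dot product (a JAX sketch):

```python
import jax
import jax.numpy as jnp

def directional_curvature(f, x, v):
    # v^T H_f(x) v from a single HVP, without forming the Hessian.
    hv = jax.jvp(jax.grad(f), (x,), (v,))[1]
    return jnp.vdot(v, hv)
```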

## Finite Difference Approximation

A Hessian-vector product can be approximated using finite differences of gradients:

$$
H_f(x)v
\approx
\frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}.
$$

A centered approximation is more accurate:

$$
H_f(x)v
\approx
\frac{\nabla f(x + \epsilon v) -
\nabla f(x - \epsilon v)}
{2\epsilon}.
$$

This requires only gradient evaluations.

However, finite differences introduce truncation error from the step size and floating-point cancellation from subtracting nearly equal gradients. AD-based Hessian-vector products are usually more accurate and often faster.
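A sketch comparing the two approaches in JAX (the step size `eps` is a tunable assumption, and a reasonable value depends on the floating-point precision in use):

```python
import jax
import jax.numpy as jnp

def hvp_fd(f, x, v, eps=1e-4):
    # Centered difference of gradients: O(eps^2) truncation error,
    # plus cancellation error as eps shrinks.
    g = jax.grad(f)
    return (g(x + eps * v) - g(x - eps * v)) / (2 * eps)

def hvp_ad(f, x, v):
    # Exact to floating-point precision.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]
```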

## Linear Operator View

A Hessian-vector product defines a linear map:

$$
v \mapsto H_f(x)v.
$$

The Hessian therefore behaves as a linear operator.

This suggests an important design principle.

Instead of exposing Hessians as dense arrays:

```text
H = hessian(f)(x)
```

large systems should expose operator interfaces:

```text
H = hessian_operator(f, x)

y = H.matvec(v)
```
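One hedged realization of this interface wraps a JAX-based HVP in SciPy's `LinearOperator` (the name `hessian_operator` and the NumPy/JAX bridging are illustrative choices):

```python
import numpy as np
import jax
import jax.numpy as jnp
from scipy.sparse.linalg import LinearOperator

def hessian_operator(f, x):
    # Expose v -> H_f(x) v as a SciPy-compatible matrix-free operator.
    def matvec(v):
        v = jnp.asarray(np.asarray(v).reshape(-1), x.dtype)
        hv = jax.jvp(jax.grad(f), (x,), (v,))[1]
        return np.asarray(hv)
    n = x.size
    return LinearOperator((n, n), matvec=matvec)
```

Iterative routines such as `scipy.sparse.linalg.cg` or `eigsh` can then consume this operator directly.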

This enables:

| Capability | Benefit |
|---|---|
| iterative solvers | no dense matrix |
| lazy evaluation | compute only needed products |
| sparse backends | exploit structure |
| distributed execution | shard operator application |
| GPU kernels | fuse operator evaluation |

This operator-centric design appears throughout modern scientific computing.

## HVPs in Neural Networks

Deep neural networks rarely use explicit Hessians because parameter counts are enormous.

Still, Hessian-vector products are heavily used in:

| Application | Role |
|---|---|
| second-order optimization | curvature-aware updates |
| meta-learning | implicit gradients |
| influence functions | parameter sensitivity |
| pruning | curvature estimation |
| uncertainty estimation | local geometry |
| neural tangent analysis | linearization studies |

In practice, most large ML systems treat Hessians as implicit operators accessed through HVP queries.
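The same forward-over-reverse recipe extends unchanged to parameter pytrees. A minimal sketch (the toy linear model, dictionary layout, and names are illustrative):

```python
import jax
import jax.numpy as jnp

def loss(params, X, y):
    pred = X @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

def hvp(params, v, X, y):
    # v is a pytree with the same structure as params.
    g = lambda p: jax.grad(loss)(p, X, y)
    return jax.jvp(g, (params,), (v,))[1]
```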

## Memory Considerations

A Hessian-vector product can often be computed with memory usage close to gradient computation.

However, nesting reverse mode inside reverse mode can dramatically increase memory consumption.

Forward-over-reverse is usually preferred because:

| Property | Forward-over-reverse |
|---|---|
| memory usage | moderate |
| implementation complexity | manageable |
| nesting stability | relatively good |
| operator efficiency | high |

Reverse-over-reverse may require differentiating through the tape management and adjoint-accumulation logic of the inner pass, which is harder to optimize.

## Structured Hessian Operators

Some systems expose specialized Hessian operators.

Examples include:

| Structure | Optimization |
|---|---|
| diagonal Hessian | store only diagonal |
| block Hessian | blockwise matvec |
| sparse Hessian | sparse kernels |
| low-rank Hessian | factored representation |
| Kronecker approximations | efficient inversion |
| Fisher approximations | probabilistic geometry |

The operator abstraction makes these interchangeable from the optimizer's perspective.
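For example, the diagonal case can be served entirely by HVP queries using the Hutchinson-style estimator $\operatorname{diag}(H) \approx \mathbb{E}[v \odot Hv]$ with Rademacher-distributed probes $v$ (a sketch; the sample count is a tunable assumption):

```python
import jax
import jax.numpy as jnp

def hessian_diag_estimate(f, x, key, samples=64):
    hvp = lambda v: jax.jvp(jax.grad(f), (x,), (v,))[1]
    def one_sample(k):
        # Rademacher probe: entries are +1 or -1.
        v = jax.random.rademacher(k, x.shape).astype(x.dtype)
        return v * hvp(v)  # unbiased estimate of diag(H)
    keys = jax.random.split(key, samples)
    return jnp.mean(jax.vmap(one_sample)(keys), axis=0)
```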

## Practical Principle

The full Hessian is usually a debugging or analysis object.

The Hessian-vector product is the scalable computational primitive.

A modern AD system should therefore optimize:

$$
v \mapsto H_f(x)v
$$

as a first-class operation rather than treating it as a derived convenience built from dense Hessian construction.

