
Hessian Computation

For a scalar function

$$ f : \mathbb{R}^n \to \mathbb{R}, $$

the Hessian is the matrix of all second partial derivatives:

$$ H_f(x) = \nabla^2 f(x). $$

Its entries are

$$ (H_f(x))_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x). $$

Computing the Hessian means computing the derivative of the gradient:

$$ \nabla^2 f(x) = D(\nabla f)(x). $$

This definition is simple. The difficulty is cost. A gradient has $n$ entries. A dense Hessian has $n^2$ entries. Any method that explicitly materializes the full Hessian must at least pay $O(n^2)$ space and output cost.

The Hessian as a Jacobian

The cleanest way to view Hessian computation is:

$$ H_f(x) = J_{\nabla f}(x). $$

That is, the Hessian is the Jacobian of the gradient map.

If

$$ g(x) = \nabla f(x), $$

then

$$ H_f(x) = J_g(x). $$

So every Jacobian computation method also gives a Hessian method once we can compute $g$.

This gives two direct strategies.

Forward mode over the gradient computes columns of the Hessian:

$$ H_f(x) e_i. $$

Reverse mode over the gradient computes rows of the Hessian:

$$ e_i^\top H_f(x). $$

For scalar functions with smooth second derivatives, these rows and columns contain the same information because the Hessian is symmetric.

Column-by-Column Hessian Computation

The standard forward-mode construction uses basis vectors.

Let

$$ e_1, e_2, \ldots, e_n $$

be the coordinate basis of $\mathbb{R}^n$. For each $i$, compute the directional derivative of the gradient in direction $e_i$:

$$ D(\nabla f)(x)[e_i] = H_f(x) e_i. $$

This gives the $i$-th column of the Hessian.

Algorithmically:

for i in 1..n:
    v = basis_vector(i)
    column_i = jvp(grad(f), x, v)
    write column_i into H[:, i]

This is simple and reliable. It is also expensive for large $n$. It requires $n$ Hessian-vector products and stores $n^2$ entries.

For small dense problems, this method is often acceptable. For high-dimensional machine learning models, it is usually infeasible.
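
As a concrete sketch of this loop, here is a JAX version that builds each column with a JVP of the reverse-mode gradient; the function name dense_hessian_by_columns is illustrative, and f is assumed to map a flat length-$n$ vector to a scalar.

```python
import jax
import jax.numpy as jnp

def dense_hessian_by_columns(f, x):
    g = jax.grad(f)                        # gradient map g(x) = grad f(x), reverse mode
    n = x.shape[0]
    cols = []
    for i in range(n):
        v = jnp.zeros(n).at[i].set(1.0)    # basis vector e_i
        _, hv = jax.jvp(g, (x,), (v,))     # forward mode over g gives H e_i
        cols.append(hv)
    return jnp.stack(cols, axis=1)         # assemble columns into the n-by-n Hessian

# f(x) = sum(x_i^2) has Hessian 2I, a convenient sanity check.
f = lambda x: jnp.sum(x ** 2)
print(dense_hessian_by_columns(f, jnp.arange(3.0)))
```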

Row-by-Row Hessian Computation

Reverse mode can compute rows of the Hessian by applying vector-Jacobian products to the gradient map.

For each basis vector $e_i$,

$$ e_i^\top H_f(x) = \operatorname{vjp}(\nabla f)(x, e_i). $$

Algorithmically:

for i in 1..n:
    w = basis_vector(i)
    row_i = vjp(grad(f), x, w)
    write row_i into H[i, :]

This method is also direct. It has the same $O(n^2)$ output cost.

The choice between row-wise and column-wise construction depends on the AD system, the cost of forward mode, the cost of reverse mode, and the shape of the intermediate computations.
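
The same construction with reverse mode over the gradient looks like this in JAX: jax.vjp linearizes the gradient map once at x, and the returned pullback is then applied to each basis covector. The helper name is illustrative.

```python
import jax
import jax.numpy as jnp

def dense_hessian_by_rows(f, x):
    g = jax.grad(f)                        # gradient map
    _, pullback = jax.vjp(g, x)            # reverse mode over g; pullback(w) = w^T H
    n = x.shape[0]
    rows = []
    for i in range(n):
        w = jnp.zeros(n).at[i].set(1.0)    # basis covector e_i
        (row,) = pullback(w)               # i-th row of the Hessian
        rows.append(row)
    return jnp.stack(rows, axis=0)

f = lambda x: jnp.sum(x ** 2) + x[0] * x[1]
print(dense_hessian_by_rows(f, jnp.ones(3)))
```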

Forward-over-Reverse

A common method for Hessian computation is forward-over-reverse.

First define the gradient computation using reverse mode:

$$ g(x) = \nabla f(x). $$

Then apply forward mode to $g$.

For a direction $v$,

$$ \operatorname{jvp}(g, x, v) = H_f(x) v. $$

To build the full Hessian, use basis vectors:

$$ H_f(x) = \begin{bmatrix} H_f(x) e_1 & H_f(x) e_2 & \cdots & H_f(x) e_n \end{bmatrix}. $$

This is often the preferred dense-Hessian method when reverse-mode gradients are efficient and forward-mode JVPs are available through the reverse-mode computation.

The cost is roughly $n$ Hessian-vector products. The memory cost includes the memory needed to evaluate and differentiate the gradient computation.
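
In JAX this composition can be written directly as jacfwd applied to grad; the built-in jax.hessian convenience is documented as the same forward-over-reverse pairing (jacfwd of jacrev), so the two should agree on smooth functions.

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(jnp.sin(x) * x)

# Forward mode over the reverse-mode gradient: one JVP per basis direction.
hess_fwd_over_rev = jax.jacfwd(jax.grad(f))

x = jnp.array([0.1, 0.2, 0.3])
print(jnp.allclose(hess_fwd_over_rev(x), jax.hessian(f)(x)))   # expect True
```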

Reverse-over-Forward

Reverse-over-forward starts from a forward-mode directional derivative.

For a direction $v$, define

$$ \phi_v(x) = Df(x)[v] = \nabla f(x)^\top v. $$

Then reverse mode gives

$$ \nabla \phi_v(x) = \nabla^2 f(x) v. $$

So reverse-over-forward also computes Hessian-vector products.

To construct the full Hessian, repeat for basis vectors $v = e_i$.

for i in 1..n:
    v = basis_vector(i)
    phi = lambda y: jvp(f, y, v)
    column_i = grad(phi)(x)
    write column_i into H[:, i]

This is mathematically equivalent to computing Hessian columns. In implementation, it may have different memory and compilation behavior.
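
A minimal JAX sketch of this nesting builds the scalar directional derivative with jax.jvp and then differentiates it with jax.grad; the helper name hvp_rev_over_fwd is illustrative.

```python
import jax
import jax.numpy as jnp

def hvp_rev_over_fwd(f, x, v):
    # phi_v(y) = Df(y)[v], a scalar built with forward mode ...
    phi = lambda y: jax.jvp(f, (y,), (v,))[1]
    # ... then grad(phi)(x) = H_f(x) v by reverse mode.
    return jax.grad(phi)(x)

# f(x) = sum(x_i^3) has Hessian diag(6 x), so Hv = 6 x * v.
f = lambda x: jnp.sum(x ** 3)
print(hvp_rev_over_fwd(f, jnp.array([1.0, 2.0, 3.0]), jnp.ones(3)))   # [6. 12. 18.]
```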

Reverse-over-Reverse

Reverse-over-reverse differentiates reverse-mode derivative code using reverse mode again.

Conceptually, this is valid:

$$ \nabla^2 f(x) = D(\nabla f)(x). $$

But implementation is delicate. The first reverse pass usually builds or traverses a tape. The second reverse pass must differentiate the operations that created and consumed adjoints.

This can create high memory pressure. It can also expose implementation details that are invisible in first-order AD: mutation, aliasing, tape lifetime, saved intermediates, and custom gradient rules.

Reverse-over-reverse is powerful, but it is rarely the simplest way to compute a dense Hessian. It is more often used indirectly in systems where nested differentiation is a first-class feature.
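
For completeness, the nesting can be expressed in JAX as jacrev applied to grad; this sketch is mainly useful for comparing the result against the other constructions on small problems.

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(jnp.exp(x) * x)

# Reverse mode over the reverse-mode gradient: one VJP per Hessian row.
hess_rev_over_rev = jax.jacrev(jax.grad(f))

x = jnp.array([0.5, 1.0, 1.5])
print(jnp.allclose(hess_rev_over_rev(x), jax.hessian(f)(x)))   # expect True
```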

Hessian Symmetry

For smooth scalar functions,

$$ H_f(x) = H_f(x)^\top. $$

A dense Hessian implementation can exploit this symmetry.

Instead of computing all $n^2$ entries independently, it can compute only entries with

$$ i \le j. $$

Then it fills the opposite triangle by symmetry:

$$ H_{ji} = H_{ij}. $$

This reduces storage by nearly half. It does not automatically halve the derivative work unless the computation method is designed to emit triangular entries directly.

Symmetry also provides a useful correctness check. If a computed Hessian is strongly asymmetric for a smooth function, one of the following is likely true:

| Cause | Meaning |
| --- | --- |
| Numerical error | Floating point roundoff or unstable computation |
| Non-smooth function | Mixed partials may fail to agree |
| Incorrect custom derivative | Primitive derivative rule is wrong |
| AD nesting bug | Perturbation or adjoint levels were mixed |
| Stateful computation | Differentiated program does not represent a pure function |
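
A small diagnostic along these lines, assuming a JAX-style hessian transform; the helper name and the use of a relative norm are illustrative choices.

```python
import jax
import jax.numpy as jnp

def symmetry_defect(f, x):
    # Relative asymmetry of the computed Hessian. A value far above
    # floating point noise points at one of the causes in the table above.
    H = jax.hessian(f)(x)
    return jnp.linalg.norm(H - H.T) / jnp.linalg.norm(H)

f = lambda x: jnp.sum(jnp.sin(x[0]) * x ** 2)
print(symmetry_defect(f, jnp.array([0.4, 1.3, -0.2])))
```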

Sparse Hessians

Many Hessians are sparse. A Hessian entry is zero when two input variables do not interact through the computation.

For example,

$$ f(x) = \sum_{i=1}^n x_i^2 $$

has

$$ H_f(x) = 2I. $$

The dense Hessian has $n^2$ entries, but only $n$ are nonzero.

Sparse Hessian computation tries to recover only the nonzero structure and values. The structure may be known analytically, inferred from the computational graph, or estimated by probing.

A sparse AD system may use:

| Technique | Purpose |
| --- | --- |
| Dependency analysis | Find which variables can interact |
| Graph coloring | Combine independent seed directions |
| Compressed Hessian recovery | Reconstruct sparse entries from fewer products |
| Block structure | Group variables by program structure |
| Custom primitive rules | Exploit known sparsity in linear algebra kernels |

Sparse Hessian methods are essential in scientific computing, optimal control, PDE-constrained optimization, and large structured models.
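
The diagonal case makes the payoff concrete: one probe vector of ones recovers every nonzero entry, so a single Hessian-vector product replaces $n$ of them. A minimal JAX sketch, with an illustrative hvp helper:

```python
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    # One Hessian-vector product via forward mode over the gradient.
    _, hv = jax.jvp(jax.grad(f), (x,), (v,))
    return hv

# f(x) = sum(x_i^2) has Hessian 2I; probing with the all-ones vector
# returns the whole diagonal in one HVP instead of n separate products.
f = lambda x: jnp.sum(x ** 2)
print(hvp(f, jnp.arange(4.0), jnp.ones(4)))   # [2. 2. 2. 2.]
```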

Cost Model

The unavoidable lower bound for dense Hessian materialization is

$$ O(n^2) $$

space, because the output itself has $n^2$ entries.

The derivative computation may be summarized as follows.

| Method | Output | Typical derivative passes | Storage |
| --- | --- | --- | --- |
| Forward-over-reverse with basis seeds | full Hessian | $n$ HVPs | $O(n^2)$ |
| Reverse-over-forward with basis seeds | full Hessian | $n$ HVPs | $O(n^2)$ |
| Row-wise VJP of gradient | full Hessian | $n$ VJPs | $O(n^2)$ |
| Column-wise JVP of gradient | full Hessian | $n$ JVPs | $O(n^2)$ |
| Sparse compressed Hessian | sparse Hessian | depends on coloring | $O(\operatorname{nnz}(H))$ |
| Hessian-vector product only | $Hv$ | 1 HVP | $O(n)$ |

The most important distinction is between computing the Hessian as a matrix and computing the action of the Hessian as an operator.

Full Hessian API

A user-facing AD API might expose:

hessian(f)(x)

For small problems, this is convenient.

Internally, the system may implement it as:

hessian(f)(x):
    g = grad(f)
    n = length(x)
    H = zeros(n, n)

    for i in 1..n:
        v = basis_vector(i)
        H[:, i] = jvp(g, x, v)

    return H

This API should document the cost clearly. Users often assume that because gradients are cheap, Hessians are also cheap. That assumption fails at large dimension.
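
In a system with a batching transform, the explicit loop is often replaced by mapping the Hessian-vector product over all basis vectors at once. A sketch of that variant in JAX, vmapping over the columns of the identity matrix (the function name is illustrative):

```python
import jax
import jax.numpy as jnp

def hessian_dense(f, x):
    g = jax.grad(f)
    hvp = lambda v: jax.jvp(g, (x,), (v,))[1]      # one column H v
    basis = jnp.eye(x.shape[0], dtype=x.dtype)     # all basis vectors as columns
    return jax.vmap(hvp, in_axes=1, out_axes=1)(basis)

f = lambda x: jnp.sum(x ** 2) + x[0] * x[2]
print(hessian_dense(f, jnp.ones(3)))
```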

Hessian as an Operator

For large problems, a better abstraction is:

hessian_operator(f, x)

which returns an object that supports:

matvec(v) = H v

The Hessian is then represented by its action on vectors.

This representation works well with iterative algorithms:

| Algorithm | Uses |
| --- | --- |
| Conjugate gradient | Hessian-vector products |
| Lanczos | Hessian-vector products |
| Newton-CG | Hessian-vector products |
| Trust-region methods | Hessian-vector products |
| Curvature diagnostics | Directional second derivatives |

This design avoids materializing a dense matrix while still exposing useful second-order information.
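
A minimal sketch of such an operator, assuming JAX for the Hessian-vector product and SciPy's LinearOperator plus cg as the consumer; the hessian_operator name mirrors the hypothetical API above.

```python
import jax
import jax.numpy as jnp
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def hessian_operator(f, x):
    # Matrix-free Hessian: only the action v -> H v is exposed.
    g = jax.grad(f)

    def matvec(v):
        v = jnp.asarray(np.asarray(v).reshape(-1), dtype=x.dtype)
        _, hv = jax.jvp(g, (x,), (v,))
        return np.asarray(hv)

    n = x.shape[0]
    return LinearOperator((n, n), matvec=matvec)

# Newton-CG style step: solve H p = -grad f(x) without materializing H.
f = lambda x: jnp.sum((x - 1.0) ** 2) + jnp.sum(x[:-1] * x[1:])
x0 = jnp.zeros(5)
p, info = cg(hessian_operator(f, x0), -np.asarray(jax.grad(f)(x0)))
print(p, info)
```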

Numerical Checks

Second derivatives are more sensitive than first derivatives. A common diagnostic is to compare Hessian-vector products against finite differences of gradients:

$$ H_f(x) v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}. $$

A centered version is usually more accurate:

$$ H_f(x) v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x - \epsilon v)}{2\epsilon}. $$

This check should be used carefully. If $\epsilon$ is too large, truncation error dominates. If $\epsilon$ is too small, floating point cancellation dominates.

The check is useful for debugging custom derivative rules and verifying nested AD behavior.
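
A small checker along these lines in JAX, comparing a forward-over-reverse Hessian-vector product against the centered difference of the gradient; the helper name is illustrative and the default step is only a compromise between the two error sources above.

```python
import jax
import jax.numpy as jnp

def check_hvp(f, x, v, eps=1e-4):
    g = jax.grad(f)
    _, hv_ad = jax.jvp(g, (x,), (v,))                         # AD Hessian-vector product
    hv_fd = (g(x + eps * v) - g(x - eps * v)) / (2.0 * eps)   # centered finite difference
    return jnp.max(jnp.abs(hv_ad - hv_fd))

f = lambda x: jnp.sum(jnp.cos(x) * x ** 2)
print(check_hvp(f, jnp.array([0.3, -0.7, 1.2]), jnp.ones(3)))
```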

Practical Rule

Use a full Hessian only when the input dimension is small or when the Hessian is sparse and the sparsity is exploited.

For large dense problems, expose Hessian-vector products. They preserve the computational content needed by second-order methods without paying the full quadratic storage cost.