
Hessian Computation

For a scalar function

$$ f : \mathbb{R}^n \to \mathbb{R}, $$

the Hessian is the matrix of all second partial derivatives:

$$ H_f(x) = \nabla^2 f(x). $$

Its entries are

$$ (H_f(x))_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x). $$

Computing the Hessian means computing the derivative of the gradient:

$$ \nabla^2 f(x) = D(\nabla f)(x). $$

This definition is simple. The difficulty is cost. A gradient has $n$ entries. A dense Hessian has $n^2$ entries. Any method that explicitly materializes the full Hessian must at least pay $O(n^2)$ space and output cost.

The Hessian as a Jacobian

The cleanest way to view Hessian computation is:

$$ H_f(x) = J_{\nabla f}(x). $$

That is, the Hessian is the Jacobian of the gradient map.

If

$$ g(x) = \nabla f(x), $$

then

$$ H_f(x) = J_g(x). $$

So every Jacobian computation method also gives a Hessian method once we can compute $g$.

This gives two direct strategies.

Forward mode over the gradient computes columns of the Hessian:

$$ H_f(x) e_i. $$

Reverse mode over the gradient computes rows of the Hessian:

$$ e_i^\top H_f(x). $$

For scalar functions with smooth second derivatives, these rows and columns contain the same information because the Hessian is symmetric.

Column-by-Column Hessian Computation

The standard forward-mode construction uses basis vectors.

Let

$$ e_1, e_2, \ldots, e_n $$

be the coordinate basis of $\mathbb{R}^n$. For each $i$, compute the directional derivative of the gradient in direction $e_i$:

$$ D(\nabla f)(x)[e_i] = H_f(x) e_i. $$

This gives the $i$-th column of the Hessian.

Algorithmically:

for i in 1..n:
    v = basis_vector(i)
    column_i = jvp(grad(f), x, v)
    write column_i into H[:, i]

This is simple and reliable. It is also expensive for large $n$. It requires $n$ Hessian-vector products and stores $n^2$ entries.

For small dense problems, this method is often acceptable. For high-dimensional machine learning models, it is usually infeasible.
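
As a concrete sketch of this loop, here is a JAX version that builds each column with a JVP of the reverse-mode gradient; the function name dense_hessian_by_columns is illustrative, and f is assumed to map a flat length-$n$ vector to a scalar.

```python
import jax
import jax.numpy as jnp

def dense_hessian_by_columns(f, x):
    g = jax.grad(f)                        # gradient map g(x) = grad f(x), reverse mode
    n = x.shape[0]
    cols = []
    for i in range(n):
        v = jnp.zeros(n).at[i].set(1.0)    # basis vector e_i
        _, hv = jax.jvp(g, (x,), (v,))     # forward mode over g gives H e_i
        cols.append(hv)
    return jnp.stack(cols, axis=1)         # assemble columns into the n-by-n Hessian

# f(x) = sum(x_i^2) has Hessian 2I, a convenient sanity check.
f = lambda x: jnp.sum(x ** 2)
print(dense_hessian_by_columns(f, jnp.arange(3.0)))
```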

Row-by-Row Hessian Computation

Reverse mode can compute rows of the Hessian by applying vector-Jacobian products to the gradient map.

For each basis vector $e_i$,

$$ e_i^\top H_f(x) = \operatorname{vjp}(\nabla f)(x, e_i). $$

Algorithmically:

for i in 1..n:
    w = basis_vector(i)
    row_i = vjp(grad(f), x, w)
    write row_i into H[i, :]

This method is also direct. It has the same $O(n^2)$ output cost.

The choice between row-wise and column-wise construction depends on the AD system, the cost of forward mode, the cost of reverse mode, and the shape of the intermediate computations.
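
The same construction with reverse mode over the gradient looks like this in JAX: jax.vjp linearizes the gradient map once at x, and the returned pullback is then applied to each basis covector. The helper name is illustrative.

```python
import jax
import jax.numpy as jnp

def dense_hessian_by_rows(f, x):
    g = jax.grad(f)                        # gradient map
    _, pullback = jax.vjp(g, x)            # reverse mode over g; pullback(w) = w^T H
    n = x.shape[0]
    rows = []
    for i in range(n):
        w = jnp.zeros(n).at[i].set(1.0)    # basis covector e_i
        (row,) = pullback(w)               # i-th row of the Hessian
        rows.append(row)
    return jnp.stack(rows, axis=0)

f = lambda x: jnp.sum(x ** 2) + x[0] * x[1]
print(dense_hessian_by_rows(f, jnp.ones(3)))
```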

Forward-over-Reverse

A common method for Hessian computation is forward-over-reverse.

First define the gradient computation using reverse mode:

$$ g(x) = \nabla f(x). $$

Then apply forward mode to $g$.

For a direction $v$,

$$ \operatorname{jvp}(g, x, v) = H_f(x) v. $$

To build the full Hessian, use basis vectors:

$$ H_f(x) = \begin{bmatrix} H_f(x) e_1 & H_f(x) e_2 & \cdots & H_f(x) e_n \end{bmatrix}. $$

This is often the preferred dense-Hessian method when reverse-mode gradients are efficient and forward-mode JVPs are available through the reverse-mode computation.

The cost is roughly $n$ Hessian-vector products. The memory cost includes the memory needed to evaluate and differentiate the gradient computation.
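
In JAX this composition can be written directly as jacfwd applied to grad; the built-in jax.hessian convenience is documented as the same forward-over-reverse pairing (jacfwd of jacrev), so the two should agree on smooth functions.

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(jnp.sin(x) * x)

# Forward mode over the reverse-mode gradient: one JVP per basis direction.
hess_fwd_over_rev = jax.jacfwd(jax.grad(f))

x = jnp.array([0.1, 0.2, 0.3])
print(jnp.allclose(hess_fwd_over_rev(x), jax.hessian(f)(x)))   # expect True
```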

Reverse-over-Forward

Reverse-over-forward starts from a forward-mode directional derivative.

For a direction $v$, define

$$ \phi_v(x) = Df(x)[v] = \nabla f(x)^\top v. $$

Then reverse mode gives

$$ \nabla \phi_v(x) = \nabla^2 f(x) v. $$

So reverse-over-forward also computes Hessian-vector products.

To construct the full Hessian, repeat for basis vectors $v = e_i$.

for i in 1..n:
    v = basis_vector(i)
    phi = lambda y: jvp(f, y, v)
    column_i = grad(phi)(x)
    write column_i into H[:, i]

This is mathematically equivalent to computing Hessian columns. In implementation, it may have different memory and compilation behavior.
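
A minimal JAX sketch of this nesting builds the scalar directional derivative with jax.jvp and then differentiates it with jax.grad; the helper name hvp_rev_over_fwd is illustrative.

```python
import jax
import jax.numpy as jnp

def hvp_rev_over_fwd(f, x, v):
    # phi_v(y) = Df(y)[v], a scalar built with forward mode ...
    phi = lambda y: jax.jvp(f, (y,), (v,))[1]
    # ... then grad(phi)(x) = H_f(x) v by reverse mode.
    return jax.grad(phi)(x)

# f(x) = sum(x_i^3) has Hessian diag(6 x), so Hv = 6 x * v.
f = lambda x: jnp.sum(x ** 3)
print(hvp_rev_over_fwd(f, jnp.array([1.0, 2.0, 3.0]), jnp.ones(3)))   # [6. 12. 18.]
```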

Reverse-over-Reverse

Reverse-over-reverse differentiates reverse-mode derivative code using reverse mode again.

Conceptually, this is valid:

$$ \nabla^2 f(x) = D(\nabla f)(x). $$

But implementation is delicate. The first reverse pass usually builds or traverses a tape. The second reverse pass must differentiate the operations that created and consumed adjoints.

This can create high memory pressure. It can also expose implementation details that are invisible in first-order AD: mutation, aliasing, tape lifetime, saved intermediates, and custom gradient rules.

Reverse-over-reverse is powerful, but it is rarely the simplest way to compute a dense Hessian. It is more often used indirectly in systems where nested differentiation is a first-class feature.
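
For completeness, the nesting can be expressed in JAX as jacrev applied to grad; this sketch is mainly useful for comparing the result against the other constructions on small problems.

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(jnp.exp(x) * x)

# Reverse mode over the reverse-mode gradient: one VJP per Hessian row.
hess_rev_over_rev = jax.jacrev(jax.grad(f))

x = jnp.array([0.5, 1.0, 1.5])
print(jnp.allclose(hess_rev_over_rev(x), jax.hessian(f)(x)))   # expect True
```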

Hessian Symmetry

For smooth scalar functions,

$$ H_f(x) = H_f(x)^\top. $$

A dense Hessian implementation can exploit this symmetry.

Instead of computing all $n^2$ entries independently, it can compute only entries with

$$ i \le j. $$

Then it fills the opposite triangle by symmetry:

$$ H_{ji} = H_{ij}. $$

This reduces storage by nearly half. It does not automatically halve the derivative work unless the computation method is designed to emit triangular entries directly.

Symmetry also provides a useful correctness check. If a computed Hessian is strongly asymmetric for a smooth function, one of the following is likely true:

| Cause | Meaning |
| --- | --- |
| Numerical error | Floating point roundoff or unstable computation |
| Non-smooth function | Mixed partials may fail to agree |
| Incorrect custom derivative | Primitive derivative rule is wrong |
| AD nesting bug | Perturbation or adjoint levels were mixed |
| Stateful computation | Differentiated program does not represent a pure function |
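
A small diagnostic along these lines, assuming a JAX-style hessian transform; the helper name and the use of a relative norm are illustrative choices.

```python
import jax
import jax.numpy as jnp

def symmetry_defect(f, x):
    # Relative asymmetry of the computed Hessian. A value far above
    # floating point noise points at one of the causes in the table above.
    H = jax.hessian(f)(x)
    return jnp.linalg.norm(H - H.T) / jnp.linalg.norm(H)

f = lambda x: jnp.sum(jnp.sin(x[0]) * x ** 2)
print(symmetry_defect(f, jnp.array([0.4, 1.3, -0.2])))
```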

Sparse Hessians

Many Hessians are sparse. A Hessian entry is zero when two input variables do not interact through the computation.

For example,

$$ f(x) = \sum_{i=1}^n x_i^2 $$

has

$$ H_f(x) = 2I. $$

The dense Hessian has $n^2$ entries, but only $n$ are nonzero.

Sparse Hessian computation tries to recover only the nonzero structure and values. The structure may be known analytically, inferred from the computational graph, or estimated by probing.

A sparse AD system may use:

| Technique | Purpose |
| --- | --- |
| Dependency analysis | Find which variables can interact |
| Graph coloring | Combine independent seed directions |
| Compressed Hessian recovery | Reconstruct sparse entries from fewer products |
| Block structure | Group variables by program structure |
| Custom primitive rules | Exploit known sparsity in linear algebra kernels |

Sparse Hessian methods are essential in scientific computing, optimal control, PDE-constrained optimization, and large structured models.
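
The diagonal case makes the payoff concrete: one probe vector of ones recovers every nonzero entry, so a single Hessian-vector product replaces $n$ of them. A minimal JAX sketch, with an illustrative hvp helper:

```python
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    # One Hessian-vector product via forward mode over the gradient.
    _, hv = jax.jvp(jax.grad(f), (x,), (v,))
    return hv

# f(x) = sum(x_i^2) has Hessian 2I; probing with the all-ones vector
# returns the whole diagonal in one HVP instead of n separate products.
f = lambda x: jnp.sum(x ** 2)
print(hvp(f, jnp.arange(4.0), jnp.ones(4)))   # [2. 2. 2. 2.]
```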

Cost Model

The unavoidable lower bound for dense Hessian materialization is

$$ O(n^2) $$

space, because the output itself has $n^2$ entries.

The derivative computation may be summarized as follows.

| Method | Output | Typical derivative passes | Storage |
| --- | --- | --- | --- |
| Forward-over-reverse with basis seeds | full Hessian | $n$ HVPs | $O(n^2)$ |
| Reverse-over-forward with basis seeds | full Hessian | $n$ HVPs | $O(n^2)$ |
| Row-wise VJP of gradient | full Hessian | $n$ VJPs | $O(n^2)$ |
| Column-wise JVP of gradient | full Hessian | $n$ JVPs | $O(n^2)$ |
| Sparse compressed Hessian | sparse Hessian | depends on coloring | $O(\operatorname{nnz}(H))$ |
| Hessian-vector product only | $Hv$ | 1 HVP | $O(n)$ |

The most important distinction is between computing the Hessian as a matrix and computing the action of the Hessian as an operator.

Full Hessian API

A user-facing AD API might expose:

hessian(f)(x)

For small problems, this is convenient.

Internally, the system may implement it as:

hessian(f)(x):
    g = grad(f)
    n = length(x)
    H = zeros(n, n)

    for i in 1..n:
        v = basis_vector(i)
        H[:, i] = jvp(g, x, v)

    return H

This API should document the cost clearly. Users often assume that because gradients are cheap, Hessians are also cheap. That assumption fails at large dimension.
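
In a system with a batching transform, the explicit loop is often replaced by mapping the Hessian-vector product over all basis vectors at once. A sketch of that variant in JAX, vmapping over the columns of the identity matrix (the function name is illustrative):

```python
import jax
import jax.numpy as jnp

def hessian_dense(f, x):
    g = jax.grad(f)
    hvp = lambda v: jax.jvp(g, (x,), (v,))[1]      # one column H v
    basis = jnp.eye(x.shape[0], dtype=x.dtype)     # all basis vectors as columns
    return jax.vmap(hvp, in_axes=1, out_axes=1)(basis)

f = lambda x: jnp.sum(x ** 2) + x[0] * x[2]
print(hessian_dense(f, jnp.ones(3)))
```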

Hessian as an Operator

For large problems, a better abstraction is:

hessian_operator(f, x)

which returns an object that supports:

matvec(v) = H v

The Hessian is then represented by its action on vectors.

This representation works well with iterative algorithms:

| Algorithm | Uses |
| --- | --- |
| Conjugate gradient | Hessian-vector products |
| Lanczos | Hessian-vector products |
| Newton-CG | Hessian-vector products |
| Trust-region methods | Hessian-vector products |
| Curvature diagnostics | Directional second derivatives |

This design avoids materializing a dense matrix while still exposing useful second-order information.
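
A minimal sketch of such an operator, assuming JAX for the Hessian-vector product and SciPy's LinearOperator plus cg as the consumer; the hessian_operator name mirrors the hypothetical API above.

```python
import jax
import jax.numpy as jnp
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def hessian_operator(f, x):
    # Matrix-free Hessian: only the action v -> H v is exposed.
    g = jax.grad(f)

    def matvec(v):
        v = jnp.asarray(np.asarray(v).reshape(-1), dtype=x.dtype)
        _, hv = jax.jvp(g, (x,), (v,))
        return np.asarray(hv)

    n = x.shape[0]
    return LinearOperator((n, n), matvec=matvec)

# Newton-CG style step: solve H p = -grad f(x) without materializing H.
f = lambda x: jnp.sum((x - 1.0) ** 2) + jnp.sum(x[:-1] * x[1:])
x0 = jnp.zeros(5)
p, info = cg(hessian_operator(f, x0), -np.asarray(jax.grad(f)(x0)))
print(p, info)
```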

Numerical Checks

Second derivatives are more sensitive than first derivatives. A common diagnostic is to compare Hessian-vector products against finite differences of gradients:

$$ H_f(x) v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}. $$

A centered version is usually more accurate:

$$ H_f(x) v \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x - \epsilon v)}{2\epsilon}. $$

This check should be used carefully. If $\epsilon$ is too large, truncation error dominates. If $\epsilon$ is too small, floating point cancellation dominates.

The check is useful for debugging custom derivative rules and verifying nested AD behavior.
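
A small checker along these lines in JAX, comparing a forward-over-reverse Hessian-vector product against the centered difference of the gradient; the helper name is illustrative and the default step is only a compromise between the two error sources above.

```python
import jax
import jax.numpy as jnp

def check_hvp(f, x, v, eps=1e-4):
    g = jax.grad(f)
    _, hv_ad = jax.jvp(g, (x,), (v,))                         # AD Hessian-vector product
    hv_fd = (g(x + eps * v) - g(x - eps * v)) / (2.0 * eps)   # centered finite difference
    return jnp.max(jnp.abs(hv_ad - hv_fd))

f = lambda x: jnp.sum(jnp.cos(x) * x ** 2)
print(check_hvp(f, jnp.array([0.3, -0.7, 1.2]), jnp.ones(3)))
```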

Practical Rule

Use a full Hessian only when the input dimension is small or when the Hessian is sparse and the sparsity is exploited.

For large dense problems, expose Hessian-vector products. They preserve the computational content needed by second-order methods without paying the full quadratic storage cost.