For a scalar function

$$f : \mathbb{R}^n \to \mathbb{R},$$

the Hessian is the $n \times n$ matrix of all second partial derivatives:

$$H(x) = \nabla^2 f(x).$$

Its entries are

$$H_{ij}(x) = \frac{\partial^2 f}{\partial x_i \, \partial x_j}(x).$$

Computing the Hessian means computing the derivative of the gradient:

$$H(x) = \frac{\partial}{\partial x} \nabla f(x).$$
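As a small worked example (the function is illustrative, not from the surrounding text): for $f(x_1, x_2) = x_1^2 x_2$,

$$\nabla f(x) = \begin{pmatrix} 2 x_1 x_2 \\ x_1^2 \end{pmatrix}, \qquad H(x) = \begin{pmatrix} 2 x_2 & 2 x_1 \\ 2 x_1 & 0 \end{pmatrix}.$$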
This definition is simple. The difficulty is cost. A gradient has $n$ entries. A dense Hessian has $n^2$ entries. Any method that explicitly materializes the full Hessian must pay at least $O(n^2)$ space and output cost.
The Hessian as a Jacobian
The cleanest way to view Hessian computation is:

$$H(x) = J_{\nabla f}(x).$$

That is, the Hessian is the Jacobian of the gradient map.

If

$$g(x) = \nabla f(x), \qquad g : \mathbb{R}^n \to \mathbb{R}^n,$$

then

$$H(x) = J_g(x).$$

So every Jacobian computation method also gives a Hessian method once we can compute $g = \nabla f$.
This gives two direct strategies.
Forward mode over the gradient computes columns of the Hessian:

$$H(x)\, v = \mathrm{jvp}(\nabla f, x, v).$$

Reverse mode over the gradient computes rows of the Hessian:

$$w^\top H(x) = \mathrm{vjp}(\nabla f, x, w).$$
For scalar functions with smooth second derivatives, these rows and columns contain the same information because the Hessian is symmetric.
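A minimal sketch of both directions in JAX (the test function and point are illustrative assumptions):

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(x ** 3) + x[0] * x[1]   # illustrative function
x = jnp.array([1.0, 2.0, 3.0])
e1 = jnp.array([1.0, 0.0, 0.0])               # first basis vector

# Column: forward mode over the gradient, H @ e1.
_, col = jax.jvp(jax.grad(f), (x,), (e1,))

# Row: reverse mode over the gradient, e1^T @ H.
row = jax.vjp(jax.grad(f), x)[1](e1)[0]

print(jnp.allclose(col, row))                 # True: the Hessian is symmetric
```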
Column-by-Column Hessian Computation
The standard forward-mode construction uses basis vectors.
Let

$$e_1, e_2, \dots, e_n$$

be the coordinate basis of $\mathbb{R}^n$. For each $i$, compute the directional derivative of the gradient in direction $e_i$:

$$H(x)\, e_i = \mathrm{jvp}(\nabla f, x, e_i).$$

This gives the $i$-th column of the Hessian.
Algorithmically:
```
for i in 1..n:
    v = basis_vector(i)
    column_i = jvp(grad(f), x, v)
    write column_i into H[:, i]
```

This is simple and reliable. It is also expensive for large $n$. It requires $n$ Hessian-vector products and stores $n^2$ entries.
For small dense problems, this method is often acceptable. For high-dimensional machine learning models, it is usually infeasible.
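A runnable version of this loop, sketched in JAX (`hessian_columns` is an illustrative name, not a library API):

```python
import jax
import jax.numpy as jnp

def hessian_columns(f, x):
    """Dense Hessian, one column per JVP of the gradient map."""
    n = x.shape[0]
    g = jax.grad(f)
    cols = []
    for i in range(n):
        v = jnp.zeros(n).at[i].set(1.0)     # basis vector e_i
        _, col = jax.jvp(g, (x,), (v,))     # H @ e_i
        cols.append(col)
    return jnp.stack(cols, axis=1)          # place column i into H[:, i]
```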
Row-by-Row Hessian Computation
Reverse mode can compute rows of the Hessian by applying vector-Jacobian products to the gradient map.
For each basis vector $e_i$,

$$e_i^\top H(x) = \mathrm{vjp}(\nabla f, x, e_i).$$
Algorithmically:
```
for i in 1..n:
    w = basis_vector(i)
    row_i = vjp(grad(f), x, w)
    write row_i into H[i, :]
```

This method is also direct. It has the same $O(n^2)$ output cost.
The choice between row-wise and column-wise construction depends on the AD system, the cost of forward mode, the cost of reverse mode, and the shape of the intermediate computations.
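A corresponding JAX sketch (again with an illustrative helper name):

```python
import jax
import jax.numpy as jnp

def hessian_rows(f, x):
    """Dense Hessian, one row per VJP of the gradient map."""
    n = x.shape[0]
    _, vjp_grad = jax.vjp(jax.grad(f), x)   # reverse mode over the gradient
    rows = []
    for i in range(n):
        w = jnp.zeros(n).at[i].set(1.0)     # basis vector e_i
        rows.append(vjp_grad(w)[0])         # e_i^T @ H
    return jnp.stack(rows, axis=0)          # place row i into H[i, :]
```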
Forward-over-Reverse
A common method for Hessian computation is forward-over-reverse.
First define the gradient computation using reverse mode:

$$g(x) = \nabla f(x).$$

Then apply forward mode to $g$.

For a direction $v$,

$$\mathrm{jvp}(g, x, v) = H(x)\, v.$$

To build the full Hessian, use basis vectors:

$$H(x) = \big[\, H(x)\, e_1 \;\; H(x)\, e_2 \;\cdots\; H(x)\, e_n \,\big].$$
This is often the preferred dense-Hessian method when reverse-mode gradients are efficient and forward-mode JVPs are available through the reverse-mode computation.
The cost is roughly $n$ Hessian-vector products. The memory cost includes the memory needed to evaluate and differentiate the gradient computation.
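A minimal forward-over-reverse sketch in JAX; note that JAX's own `jax.hessian` is documented as exactly this composition (`jacfwd` over `jacrev`):

```python
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    """Forward-over-reverse Hessian-vector product: H(x) @ v."""
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

f = lambda x: jnp.sum(jnp.sin(x) * x)        # illustrative function
x = jnp.arange(1.0, 4.0)
H = jax.vmap(lambda v: hvp(f, x, v))(jnp.eye(3)).T   # n HVPs with basis seeds
print(jnp.allclose(H, jax.hessian(f)(x)))    # True
```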
Reverse-over-Forward
Reverse-over-forward starts from a forward-mode directional derivative.
For a direction $v$, define

$$\phi(x) = \mathrm{jvp}(f, x, v) = \nabla f(x)^\top v.$$

Then reverse mode gives

$$\nabla \phi(x) = H(x)\, v.$$

So reverse-over-forward also computes Hessian-vector products.
To construct the full Hessian, repeat for basis vectors $e_1, \dots, e_n$.
```
for i in 1..n:
    phi = lambda x: jvp(f, x, e_i)
    column_i = grad(phi)(x)
    write column_i into H[:, i]
```

This is mathematically equivalent to computing Hessian columns. In implementation, it may have different memory and compilation behavior.
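A runnable sketch, assuming JAX (`hvp_reverse_over_forward` is an illustrative name):

```python
import jax
import jax.numpy as jnp

def hvp_reverse_over_forward(f, x, v):
    """Hessian-vector product via grad of the directional derivative."""
    phi = lambda x: jax.jvp(f, (x,), (v,))[1]   # phi(x) = ∇f(x) · v
    return jax.grad(phi)(x)

f = lambda x: jnp.prod(x)                       # illustrative function
x = jnp.array([1.0, 2.0, 3.0])
v = jnp.ones(3)
fwd_rev = jax.jvp(jax.grad(f), (x,), (v,))[1]   # forward-over-reverse HVP
print(jnp.allclose(hvp_reverse_over_forward(f, x, v), fwd_rev))  # True
```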
Reverse-over-Reverse
Reverse-over-reverse differentiates reverse-mode derivative code using reverse mode again.
Conceptually, this is valid:

$$\nabla_x \big( w^\top \nabla f(x) \big) = H(x)\, w.$$
But implementation is delicate. The first reverse pass usually builds or traverses a tape. The second reverse pass must differentiate the operations that created and consumed adjoints.
This can create high memory pressure. It can also expose implementation details that are invisible in first-order AD: mutation, aliasing, tape lifetime, saved intermediates, and custom gradient rules.
Reverse-over-reverse is powerful, but it is rarely the simplest way to compute a dense Hessian. It is more often used indirectly in systems where nested differentiation is a first-class feature.
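In a system with first-class nested differentiation, such as JAX, the idea fits in one line (the helper name is illustrative):

```python
import jax
import jax.numpy as jnp

def hvp_reverse_over_reverse(f, x, w):
    """Hessian-vector product by reverse-differentiating w · ∇f(x)."""
    return jax.grad(lambda x: jnp.dot(jax.grad(f)(x), w))(x)
```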
Hessian Symmetry
For smooth scalar functions,

$$\frac{\partial^2 f}{\partial x_i \, \partial x_j} = \frac{\partial^2 f}{\partial x_j \, \partial x_i}, \quad \text{that is,} \quad H_{ij} = H_{ji}.$$
A dense Hessian implementation can exploit this symmetry.
Instead of computing all entries independently, it can compute only entries with

$$i \le j.$$

Then it fills the opposite triangle by symmetry:

$$H_{ji} = H_{ij}.$$
This reduces storage by nearly half. It does not automatically halve the derivative work unless the computation method is designed to emit triangular entries directly.
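A sketch of the mirroring step, assuming the entries with $i \le j$ are already stored in `H_upper` (name and layout are illustrative):

```python
import jax.numpy as jnp

def fill_lower_from_upper(H_upper):
    """Mirror entries with i <= j into the strict lower triangle."""
    return jnp.triu(H_upper) + jnp.triu(H_upper, k=1).T
```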
Symmetry also provides a useful correctness check. If a computed Hessian is strongly asymmetric for a smooth function, one of the following is likely true:
| Cause | Meaning |
|---|---|
| Numerical error | Floating point roundoff or unstable computation |
| Non-smooth function | Mixed partials may fail to agree |
| Incorrect custom derivative | Primitive derivative rule is wrong |
| AD nesting bug | Perturbation or adjoint levels were mixed |
| Stateful computation | Differentiated program does not represent a pure function |
Sparse Hessians
Many Hessians are sparse. A Hessian entry is zero when two input variables do not interact through the computation.
For example, a separable function

$$f(x) = \sum_{i=1}^{n} h_i(x_i)$$

has

$$H(x) = \mathrm{diag}\big(h_1''(x_1), \dots, h_n''(x_n)\big).$$

The dense Hessian has $n^2$ entries, but only $n$ are nonzero.
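A quick check of this structure, assuming JAX (the cubic is an illustrative separable function):

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(x ** 3)    # separable: variables never interact
x = jnp.array([1.0, 2.0, 3.0])
print(jax.hessian(f)(x))         # diagonal matrix diag(6 * x)
```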
Sparse Hessian computation tries to recover only the nonzero structure and values. The structure may be known analytically, inferred from the computational graph, or estimated by probing.
A sparse AD system may use:
| Technique | Purpose |
|---|---|
| dependency analysis | find which variables can interact |
| graph coloring | combine independent seed directions |
| compressed Hessian recovery | reconstruct sparse entries from fewer products |
| block structure | group variables by program structure |
| custom primitive rules | exploit known sparsity in linear algebra kernels |
Sparse Hessian methods are essential in scientific computing, optimal control, PDE-constrained optimization, and large structured models.
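As a tiny illustration of coloring and compressed recovery: when the Hessian is known to be diagonal, every column can share one seed direction, so a single product recovers all $n$ nonzeros (sketch, assuming JAX):

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(x ** 3)                     # diagonal Hessian
x = jnp.array([1.0, 2.0, 3.0])
seed = jnp.ones_like(x)                           # all columns share one color
diag = jax.jvp(jax.grad(f), (x,), (seed,))[1]     # H @ 1 recovers the diagonal
print(diag)                                       # [ 6. 12. 18.]
```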
Cost Model
The unavoidable lower bound for dense Hessian materialization is

$$\Omega(n^2)$$

space, because the output itself has $n^2$ entries.
The derivative computation may be summarized as follows.
| Method | Output | Typical derivative passes | Storage |
|---|---|---|---|
| Forward-over-reverse with basis seeds | full Hessian | $n$ HVPs | $O(n^2)$ |
| Reverse-over-forward with basis seeds | full Hessian | $n$ HVPs | $O(n^2)$ |
| Row-wise VJP of gradient | full Hessian | $n$ VJPs | $O(n^2)$ |
| Column-wise JVP of gradient | full Hessian | $n$ JVPs | $O(n^2)$ |
| Sparse compressed Hessian | sparse Hessian | depends on coloring | $O(\text{nnz})$ |
| Hessian-vector product only | $H v$ | 1 HVP | $O(n)$ |
The most important distinction is between computing the Hessian as a matrix and computing the action of the Hessian as an operator.
Full Hessian API
A user-facing AD API might expose:
```
hessian(f)(x)
```

For small problems, this is convenient.
Internally, the system may implement it as:
```
hessian(f)(x):
    g = grad(f)
    H = zeros(n, n)
    for i in 1..n:
        v = basis_vector(i)
        H[:, i] = jvp(g, x, v)
    return H
```

This API should document the cost clearly. Users often assume that because gradients are cheap, Hessians are also cheap. That assumption fails at large dimension.
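JAX, for instance, ships this convenience as `jax.hessian`; the cost caveat applies unchanged:

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.dot(x, x) ** 2     # illustrative function
x = jnp.array([1.0, 2.0])
print(jax.hessian(f)(x))             # dense (n, n) array: O(n^2) memory
```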
Hessian as an Operator
For large problems, a better abstraction is:
```
hessian_operator(f, x)
```

which returns an object that supports:

```
matvec(v) = H v
```

The Hessian is then represented by its action on vectors.
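A minimal sketch of such an operator (the class name and interface are illustrative assumptions, not a standard API):

```python
import jax

class HessianOperator:
    """Represents H(x) through its action v -> H @ v, never materializing H."""
    def __init__(self, f, x):
        self.f = f
        self.x = x

    def matvec(self, v):
        # Forward-over-reverse Hessian-vector product.
        return jax.jvp(jax.grad(self.f), (self.x,), (v,))[1]
```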
This representation works well with iterative algorithms:
| Algorithm | Uses |
|---|---|
| conjugate gradient | Hessian-vector products |
| Lanczos | Hessian-vector products |
| Newton-CG | Hessian-vector products |
| trust-region methods | Hessian-vector products |
| curvature diagnostics | directional second derivatives |
This design avoids materializing a dense matrix while still exposing useful second-order information.
Numerical Checks
Second derivatives are more sensitive than first derivatives. A common diagnostic is to compare Hessian-vector products against finite differences of gradients:

$$H(x)\, v \approx \frac{\nabla f(x + \varepsilon v) - \nabla f(x)}{\varepsilon}.$$

A centered version is usually more accurate:

$$H(x)\, v \approx \frac{\nabla f(x + \varepsilon v) - \nabla f(x - \varepsilon v)}{2 \varepsilon}.$$

This check should be used carefully. If $\varepsilon$ is too large, truncation error dominates. If $\varepsilon$ is too small, floating point cancellation dominates.
The check is useful for debugging custom derivative rules and verifying nested AD behavior.
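A sketch of the centered check in JAX (step size, test function, and the float64 setting are illustrative choices):

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)   # second derivatives need precision

def check_hvp(f, x, v, eps=1e-5):
    """Max deviation between an AD HVP and a centered finite difference."""
    g = jax.grad(f)
    ad = jax.jvp(g, (x,), (v,))[1]                       # H @ v via AD
    fd = (g(x + eps * v) - g(x - eps * v)) / (2 * eps)   # centered difference
    return jnp.max(jnp.abs(ad - fd))

f = lambda x: jnp.sum(jnp.exp(x) * x)       # illustrative function
x = jnp.array([0.5, -1.0, 2.0])
print(check_hvp(f, x, jnp.ones(3)))         # small residual, e.g. ~1e-10
```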
Practical Rule
Use a full Hessian only when the input dimension is small or when the Hessian is sparse and the sparsity is exploited.
For large dense problems, expose Hessian-vector products. They preserve the computational content needed by second-order methods without paying the full quadratic storage cost.