# Hessian Computation

## Definition

For a scalar function

$$
f : \mathbb{R}^n \to \mathbb{R},
$$

the Hessian is the matrix of all second partial derivatives:

$$
H_f(x) = \nabla^2 f(x).
$$

Its entries are

$$
(H_f(x))_{ij} =
\frac{\partial^2 f}{\partial x_i \partial x_j}(x).
$$

Computing the Hessian means computing the derivative of the gradient:

$$
\nabla^2 f(x) =
D(\nabla f)(x).
$$

This definition is simple. The difficulty is cost. A gradient has $n$ entries. A dense Hessian has $n^2$ entries. Any method that explicitly materializes the full Hessian must pay at least $O(n^2)$ in space and output cost.

## The Hessian as a Jacobian

The cleanest way to view Hessian computation is:

$$
H_f(x) = J_{\nabla f}(x).
$$

That is, the Hessian is the Jacobian of the gradient map.

If

$$
g(x) = \nabla f(x),
$$

then

$$
H_f(x) = J_g(x).
$$

So every Jacobian computation method also gives a Hessian method once we can compute $g$.

This gives two direct strategies.

Forward mode over the gradient computes columns of the Hessian:

$$
H_f(x)e_i.
$$

Reverse mode over the gradient computes rows of the Hessian:

$$
e_i^\top H_f(x).
$$

For scalar functions with smooth second derivatives, these rows and columns contain the same information because the Hessian is symmetric.
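
As a concrete illustration, here is a minimal sketch in JAX, whose `grad`, `jvp`, and `vjp` match the notation used here; the test function, point, and direction are arbitrary placeholders, not part of any particular application.

```python
import jax
import jax.numpy as jnp

def f(x):
    # arbitrary smooth scalar function used only for illustration
    return jnp.sum(x ** 4) + x[0] * x[1]

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([0.0, 1.0, 0.0])  # basis direction e_2

# forward mode over the gradient: a column, H_f(x) e_2
_, column = jax.jvp(jax.grad(f), (x,), (v,))

# reverse mode over the gradient: a row, e_2^T H_f(x)
_, pullback = jax.vjp(jax.grad(f), x)
(row,) = pullback(v)

print(jnp.allclose(column, row))  # True, because the Hessian is symmetric
```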

## Column-by-Column Hessian Computation

The standard forward-mode construction uses basis vectors.

Let

$$
e_1, e_2, \ldots, e_n
$$

be the coordinate basis of $\mathbb{R}^n$. For each $i$, compute the directional derivative of the gradient in direction $e_i$:

$$
D(\nabla f)(x)[e_i] =
H_f(x)e_i.
$$

This gives the $i$-th column of the Hessian.

Algorithmically:

```text
for i in 1..n:
    v = basis_vector(i)
    column_i = jvp(grad(f), x, v)
    write column_i into H[:, i]
```

This is simple and reliable. It is also expensive for large $n$. It requires $n$ Hessian-vector products and stores $n^2$ entries.

For small dense problems, this method is often acceptable. For high-dimensional machine learning models, it is usually infeasible.
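
A direct translation of this loop, sketched in JAX; the function `f` and point `x` are whatever the caller supplies.

```python
import jax
import jax.numpy as jnp

def hessian_by_columns(f, x):
    """Dense Hessian built one column at a time from jvp(grad(f), x, e_i)."""
    g = jax.grad(f)
    n = x.shape[0]
    columns = []
    for i in range(n):
        e_i = jnp.zeros(n).at[i].set(1.0)     # basis vector
        _, column = jax.jvp(g, (x,), (e_i,))  # H_f(x) e_i
        columns.append(column)
    return jnp.stack(columns, axis=1)         # place each product in column i
```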

## Row-by-Row Hessian Computation

Reverse mode can compute rows of the Hessian by applying vector-Jacobian products to the gradient map.

For each basis vector $e_i$,

$$
e_i^\top H_f(x) =
\operatorname{vjp}(\nabla f)(x, e_i).
$$

Algorithmically:

```text
for i in 1..n:
    w = basis_vector(i)
    row_i = vjp(grad(f), x, w)
    write row_i into H[i, :]
```

This method is also direct. It has the same $O(n^2)$ output cost.

The choice between row-wise and column-wise construction depends on the AD system, the cost of forward mode, the cost of reverse mode, and the shape of the intermediate computations.
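
The row-wise loop looks almost the same in a JAX-style sketch; one practical difference is that the gradient map only needs to be linearized once before the loop.

```python
import jax
import jax.numpy as jnp

def hessian_by_rows(f, x):
    """Dense Hessian built one row at a time from vjp over the gradient map."""
    _, pullback = jax.vjp(jax.grad(f), x)   # linearize grad(f) at x once
    n = x.shape[0]
    rows = []
    for i in range(n):
        e_i = jnp.zeros(n).at[i].set(1.0)
        rows.append(pullback(e_i)[0])        # e_i^T H_f(x)
    return jnp.stack(rows, axis=0)
```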

## Forward-over-Reverse

A common method for Hessian computation is forward-over-reverse.

First define the gradient computation using reverse mode:

$$
g(x) = \nabla f(x).
$$

Then apply forward mode to $g$.

For a direction $v$,

$$
\operatorname{jvp}(g, x, v) =
H_f(x)v.
$$

To build the full Hessian, use basis vectors:

$$
H_f(x) =
\begin{bmatrix}
H_f(x)e_1 &
H_f(x)e_2 &
\cdots &
H_f(x)e_n
\end{bmatrix}.
$$

This is often the preferred dense-Hessian method when reverse-mode gradients are efficient and the AD system can apply forward mode through the reverse-mode gradient computation.

The cost is roughly $n$ Hessian-vector products. The memory cost includes the memory needed to evaluate and differentiate the gradient computation.
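
In a JAX-style sketch, forward-over-reverse is literally the composition of the two modes; `f` here is a placeholder for any scalar function.

```python
import jax

def hessian_forward_over_reverse(f):
    """Dense Hessian as forward mode (jacfwd) applied to the reverse-mode gradient."""
    return jax.jacfwd(jax.grad(f))

# The same nesting gives a Hessian-vector product without materializing the matrix:
def hvp(f, x, v):
    _, hv = jax.jvp(jax.grad(f), (x,), (v,))
    return hv
```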

## Reverse-over-Forward

Reverse-over-forward starts from a forward-mode directional derivative.

For a direction $v$, define

$$
\phi_v(x) = Df(x)[v] = \nabla f(x)^\top v.
$$

Then reverse mode gives

$$
\nabla \phi_v(x) =
\nabla^2 f(x)v.
$$

So reverse-over-forward also computes Hessian-vector products.

To construct the full Hessian, repeat for basis vectors $v=e_i$.

```text
for i in 1..n:
    v = basis_vector(i)
    phi = lambda y: jvp(f, y, v)
    column_i = grad(phi)(x)
    write column_i into H[:, i]
```

This is mathematically equivalent to computing Hessian columns. In implementation, it may have different memory and compilation behavior.
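
The same construction in a JAX-style sketch: one `jvp` provides the directional derivative, and one reverse-mode gradient of it yields the Hessian-vector product.

```python
import jax

def hvp_reverse_over_forward(f, x, v):
    """Hessian-vector product via reverse mode over a forward-mode directional derivative."""
    # phi_v(y) = Df(y)[v]; jvp returns (f(y), tangent), and we keep only the tangent
    phi = lambda y: jax.jvp(f, (y,), (v,))[1]
    return jax.grad(phi)(x)                   # grad of phi_v is H_f(x) v
```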

## Reverse-over-Reverse

Reverse-over-reverse differentiates reverse-mode derivative code using reverse mode again.

Conceptually, this is valid:

$$
\nabla^2 f(x) =
D(\nabla f)(x).
$$

But implementation is delicate. The first reverse pass usually builds or traverses a tape. The second reverse pass must differentiate the operations that created and consumed adjoints.

This can create high memory pressure. It can also expose implementation details that are invisible in first-order AD: mutation, aliasing, tape lifetime, saved intermediates, and custom gradient rules.

Reverse-over-reverse is powerful, but it is rarely the simplest way to compute a dense Hessian. It is more often used indirectly in systems where nested differentiation is a first-class feature.
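
For completeness, a one-line JAX-style sketch of the nesting; whether it is practical depends on the tape and memory behavior described above.

```python
import jax

def hessian_reverse_over_reverse(f):
    """Dense Hessian by differentiating the reverse-mode gradient with reverse mode again."""
    return jax.jacrev(jax.grad(f))
```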

## Hessian Symmetry

For smooth scalar functions,

$$
H_f(x) = H_f(x)^\top.
$$

A dense Hessian implementation can exploit this symmetry.

Instead of computing all $n^2$ entries independently, it can compute only entries with

$$
i \le j.
$$

Then it fills the opposite triangle by symmetry:

$$
H_{ji} = H_{ij}.
$$

This reduces storage by nearly half. It does not automatically halve the derivative work unless the computation method is designed to emit triangular entries directly.

Symmetry also provides a useful correctness check. If a computed Hessian is strongly asymmetric for a smooth function, one of the following is likely true:

| Cause | Meaning |
|---|---|
| Numerical error | Floating point roundoff or unstable computation |
| Non-smooth function | Mixed partials may fail to agree |
| Incorrect custom derivative | Primitive derivative rule is wrong |
| AD nesting bug | Perturbation or adjoint levels were mixed |
| Stateful computation | Differentiated program does not represent a pure function |
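
A small sketch of the diagnostic, assuming a dense Hessian `H` has already been computed; what counts as "strongly asymmetric" depends on the precision in use.

```python
import jax.numpy as jnp

def symmetry_defect(H):
    """Relative asymmetry ||H - H^T|| / ||H||; values far above roundoff point to the causes above."""
    return jnp.linalg.norm(H - H.T) / (jnp.linalg.norm(H) + 1e-30)
```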

## Sparse Hessians

Many Hessians are sparse. A Hessian entry is zero when two input variables do not interact through the computation.

For example,

$$
f(x) = \sum_{i=1}^n x_i^2
$$

has

$$
H_f(x) = 2I.
$$

The dense Hessian has $n^2$ entries, but only $n$ are nonzero.

Sparse Hessian computation tries to recover only the nonzero structure and values. The structure may be known analytically, inferred from the computational graph, or estimated by probing.
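
For the separable example above, probing is especially simple: if the Hessian is known to be diagonal, every column can share one seed direction, so a single Hessian-vector product recovers all nonzero entries. A minimal JAX-style sketch of that idea:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 2)                    # Hessian is 2I, diagonal

x = jnp.arange(4.0)

# All columns are structurally independent, so one seed of ones suffices.
seed = jnp.ones_like(x)
_, hv = jax.jvp(jax.grad(f), (x,), (seed,))   # H_f(x) @ seed = the diagonal entries
print(hv)                                      # [2. 2. 2. 2.]
```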

A sparse AD system may use:

| Technique | Purpose |
|---|---|
| dependency analysis | find which variables can interact |
| graph coloring | combine independent seed directions |
| compressed Hessian recovery | reconstruct sparse entries from fewer products |
| block structure | group variables by program structure |
| custom primitive rules | exploit known sparsity in linear algebra kernels |

Sparse Hessian methods are essential in scientific computing, optimal control, PDE-constrained optimization, and large structured models.

## Cost Model

The unavoidable lower bound for dense Hessian materialization is

$$
O(n^2)
$$

space, because the output itself has $n^2$ entries.

The derivative computation may be summarized as follows.

| Method | Output | Typical derivative passes | Storage |
|---|---:|---:|---:|
| Forward-over-reverse with basis seeds | full Hessian | $n$ HVPs | $O(n^2)$ |
| Reverse-over-forward with basis seeds | full Hessian | $n$ HVPs | $O(n^2)$ |
| Row-wise VJP of gradient | full Hessian | $n$ VJPs | $O(n^2)$ |
| Column-wise JVP of gradient | full Hessian | $n$ JVPs | $O(n^2)$ |
| Sparse compressed Hessian | sparse Hessian | depends on coloring | $O(\operatorname{nnz}(H))$ |
| Hessian-vector product only | $Hv$ | 1 HVP | $O(n)$ |

The most important distinction is between computing the Hessian as a matrix and computing the action of the Hessian as an operator.

## Full Hessian API

A user-facing AD API might expose:

```text
hessian(f)(x)
```

For small problems, this is convenient.

Internally, the system may implement it as:

```text
hessian(f)(x):
    g = grad(f)
    H = zeros(n, n)

    for i in 1..n:
        v = basis_vector(i)
        H[:, i] = jvp(g, x, v)

    return H
```

This API should document the cost clearly. Users often assume that because gradients are cheap, Hessians are also cheap. That assumption fails at large dimension.
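
Libraries that ship such an API typically build it from the nested modes described above; for instance, JAX's `jax.hessian` is documented as forward-over-reverse. A usage sketch with a throwaway test function:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 3) + x[0] * x[2]      # small illustrative function

x = jnp.array([1.0, 2.0, 3.0])
H = jax.hessian(f)(x)                          # dense (3, 3) matrix, O(n^2) storage
print(H)
```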

## Hessian as an Operator

For large problems, a better abstraction is:

```text
hessian_operator(f, x)
```

which returns an object that supports:

```text
matvec(v) = H v
```

The Hessian is then represented by its action on vectors.

This representation works well with iterative algorithms:

| Algorithm | Uses |
|---|---|
| conjugate gradient | Hessian-vector products |
| Lanczos | Hessian-vector products |
| Newton-CG | Hessian-vector products |
| trust-region methods | Hessian-vector products |
| curvature diagnostics | directional second derivatives |

This design avoids materializing a dense matrix while still exposing useful second-order information.
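
A minimal operator-style wrapper, sketched in JAX; the class and method names are illustrative, and the commented solver call assumes `jax.scipy.sparse.linalg.cg`, which accepts a matvec function in place of a matrix.

```python
import jax

class HessianOperator:
    """Matrix-free Hessian of f at x: stores O(n) state and exposes v -> H_f(x) v."""

    def __init__(self, f, x):
        self._grad_f = jax.grad(f)
        self._x = x

    def matvec(self, v):
        # one forward-over-reverse Hessian-vector product per call
        _, hv = jax.jvp(self._grad_f, (self._x,), (v,))
        return hv

# usage sketch with an iterative solver:
# op = HessianOperator(f, x)
# step, _ = jax.scipy.sparse.linalg.cg(op.matvec, -jax.grad(f)(x))   # Newton-CG step
```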

## Numerical Checks

Second derivatives are more sensitive to numerical error than first derivatives. A common diagnostic is to compare Hessian-vector products against finite differences of gradients:

$$
H_f(x)v
\approx
\frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}.
$$

A centered version is usually more accurate:

$$
H_f(x)v
\approx
\frac{\nabla f(x + \epsilon v) - \nabla f(x - \epsilon v)}{2\epsilon}.
$$

This check should be used carefully. If $\epsilon$ is too large, truncation error dominates. If $\epsilon$ is too small, floating point cancellation dominates.

The check is useful for debugging custom derivative rules and verifying nested AD behavior.
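
A sketch of the check in JAX; the step size is only a typical default and should be tuned to the floating point precision in use.

```python
import jax
import jax.numpy as jnp

def check_hvp(f, x, v, eps=1e-4):
    """Max absolute gap between an AD Hessian-vector product and a centered difference of gradients."""
    g = jax.grad(f)
    _, hv_ad = jax.jvp(g, (x,), (v,))                        # AD Hessian-vector product
    hv_fd = (g(x + eps * v) - g(x - eps * v)) / (2.0 * eps)  # centered finite difference
    return jnp.max(jnp.abs(hv_ad - hv_fd))
```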

## Practical Rule

Use a full Hessian only when the input dimension is small or when the Hessian is sparse and the sparsity is exploited.

For large dense problems, expose Hessian-vector products. They preserve the computational content needed by second-order methods without paying the full quadratic storage cost.

