Chapter 8. Higher-Order Differentiation

Second Derivatives

First derivatives describe local rate of change. Second derivatives describe how that rate of change itself changes. In optimization, this is curvature. In dynamics, it is acceleration. In sensitivity analysis, it tells us whether a response is stable, amplifying, or changing direction.

For a scalar function

f : \mathbb{R} \to \mathbb{R},

the first derivative is

f'(x),

and the second derivative is

f''(x) = \frac{d}{dx} f'(x).

For example, if

f(x) = x^3,

then

f'(x) = 3x^2,

and

f''(x) = 6x.

The first derivative says the slope grows like 3x^2. The second derivative says the slope itself grows at rate 6x.
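
As a quick check, nesting a gradient operator reproduces these formulas. This is a minimal sketch using JAX's grad transform; any AD system with a similar operator behaves the same way.

```python
# Second derivative of f(x) = x^3 by differentiating the derivative code.
import jax

def f(x):
    return x ** 3

df = jax.grad(f)             # f'(x) = 3x^2
d2f = jax.grad(jax.grad(f))  # f''(x) = 6x

print(df(3.0))   # 27.0
print(d2f(3.0))  # 18.0
```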

Second Derivatives in Several Variables

For a scalar function of several variables,

f : \mathbb{R}^n \to \mathbb{R},

the first derivative is the gradient:

\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}.

The second derivative is the derivative of the gradient. This gives the Hessian matrix:

\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1 \partial x_1} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n \partial x_n} \end{bmatrix}.

Each entry measures how one component of the gradient changes with respect to one input variable.

If

f(x, y) = x^2 y + \sin y,

then

\nabla f(x, y) = \begin{bmatrix} 2xy \\ x^2 + \cos y \end{bmatrix},

and

\nabla^2 f(x, y) = \begin{bmatrix} 2y & 2x \\ 2x & -\sin y \end{bmatrix}.

The diagonal entries measure curvature along coordinate directions. The off-diagonal entries measure interaction between variables.
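
The same worked example can be verified numerically. This is a sketch using JAX's grad and hessian transforms, with the function written to take a single 2-vector.

```python
# Gradient and Hessian of f(x, y) = x^2 y + sin y, checked against the
# hand-derived formulas above.
import jax
import jax.numpy as jnp

def f(p):
    x, y = p
    return x ** 2 * y + jnp.sin(y)

p = jnp.array([1.0, 2.0])
print(jax.grad(f)(p))     # [2xy, x^2 + cos y] = [4.0, 1 + cos 2]
print(jax.hessian(f)(p))  # [[2y, 2x], [2x, -sin y]] = [[4, 2], [2, -sin 2]]
```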

Symmetry of the Hessian

When the mixed partial derivatives are continuous, the Hessian is symmetric:

\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}.

Therefore,

\nabla^2 f(x) = \nabla^2 f(x)^\top.

This symmetry matters in automatic differentiation because a naive implementation may compute both mixed partials separately. A well-designed system can exploit symmetry to reduce work and storage.

For dense Hessians, storing all entries costs O(n^2). Storing only the upper triangular part costs roughly half as much. For sparse Hessians, the savings can be much larger.

Second Derivatives as Quadratic Approximation

Second derivatives appear naturally in the second-order Taylor approximation:

f(x + h) \approx f(x) + \nabla f(x)^\top h + \frac{1}{2} h^\top \nabla^2 f(x) h.

The gradient gives the best local linear approximation. The Hessian gives the correction that accounts for curvature.

The term

h^\top \nabla^2 f(x) h

measures curvature in the direction h. This quantity is often more useful than the full Hessian itself, especially when n is large.
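
The quadratic model is easy to check numerically. This sketch reuses the f(x, y) = x^2 y + sin y example from above and compares the exact value at a nearby point with the second-order Taylor approximation.

```python
# Second-order Taylor model: f(x+h) vs f(x) + grad.h + 0.5 h^T H h.
import jax
import jax.numpy as jnp

def f(p):
    x, y = p
    return x ** 2 * y + jnp.sin(y)

x0 = jnp.array([1.0, 2.0])
h = jnp.array([0.01, -0.02])

quadratic_model = (f(x0)
                   + jax.grad(f)(x0) @ h
                   + 0.5 * h @ jax.hessian(f)(x0) @ h)

print(f(x0 + h))        # exact value
print(quadratic_model)  # agrees closely for small h
```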

Directional Second Derivatives

Given a direction v \in \mathbb{R}^n, the first directional derivative is

D_v f(x) = \nabla f(x)^\top v.

The second directional derivative is

D_v^2 f(x) = v^\top \nabla^2 f(x) v.

This measures curvature along the line

x(t) = x + tv.

Define

g(t) = f(x + tv).

Then

g''(0) = v^\top \nabla^2 f(x) v.

Automatic differentiation can compute this without materializing the full Hessian. This is important in large-scale optimization, where n may be millions or billions.
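
One way to see this is to nest forward mode twice. The sketch below uses JAX; `second_directional` is an illustrative helper name, and the function f is a made-up example, but the pattern works for any scalar-valued program.

```python
# v^T H(x) v via nested forward mode, without forming the Hessian.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 3) + x[0] * x[1]

def second_directional(f, x, v):
    # First jvp: directional derivative df(x; v) = grad f(x) . v.
    df_v = lambda x: jax.jvp(f, (x,), (v,))[1]
    # Differentiating that scalar again in direction v gives v^T H(x) v.
    return jax.jvp(df_v, (x,), (v,))[1]

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, -1.0])
print(second_directional(f, x, v))
print(v @ jax.hessian(f)(x) @ v)  # same value, via the dense Hessian
```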

Computing Second Derivatives with AD

Automatic differentiation computes second derivatives by differentiating derivative computations.

There are several common strategies.

| Method | Computes | Typical use |
| --- | --- | --- |
| Forward over forward | second directional derivatives | small input dimension |
| Reverse over forward | gradients of directional derivatives | Hessian-vector products |
| Forward over reverse | directional derivatives of gradients | Hessian-vector products |
| Reverse over reverse | full second-order reverse AD | delicate, memory-heavy |
| Taylor mode | higher-order univariate expansions | high-order derivatives |
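
As a rough illustration, in a system like JAX these nesting orders correspond to compositions of forward- and reverse-mode Jacobian transforms. The compositions below all build the full Hessian of a scalar function; the directional variants in the table replace the outer transform with a single jvp or vjp. Taylor mode is separate machinery and is not shown.

```python
# Nesting orders as compositions of jacfwd and jacrev.
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(x ** 3)

hess_fwd_fwd = jax.jacfwd(jax.jacfwd(f))   # forward over forward
hess_fwd_rev = jax.jacfwd(jax.jacrev(f))   # forward over reverse
hess_rev_fwd = jax.jacrev(jax.jacfwd(f))   # reverse over forward
hess_rev_rev = jax.jacrev(jax.jacrev(f))   # reverse over reverse

x = jnp.array([1.0, 2.0, 3.0])
print(hess_fwd_rev(x))  # diag(6x) for this f
```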

The simplest idea is nesting.

Suppose forward mode computes

f(x), \quad J_f(x) v.

If we apply forward mode again, we can compute second-order directional information.

For scalar ff, forward-over-forward can compute

v^\top \nabla^2 f(x) v.

For full Hessians, one can seed basis directions:

e_1, e_2, \ldots, e_n.

This recovers Hessian columns, but costs O(n) derivative passes. That is acceptable for small n, but expensive for high-dimensional models.
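
In JAX, for instance, nesting jax.jacfwd does this basis-direction seeding automatically. A sketch for a small, made-up f:

```python
# Full Hessian via forward-over-forward; reasonable only for small n,
# since each nesting level costs O(n) forward passes.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 3) + x[0] * x[1]

x = jnp.array([1.0, 2.0, 3.0])
print(jax.jacfwd(jax.jacfwd(f))(x))
```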

Hessian-Vector Products

A Hessian-vector product is

\nabla^2 f(x) v.

It gives the action of the Hessian on a vector without constructing the Hessian matrix.

One useful identity is

\nabla^2 f(x) v = \frac{d}{d\epsilon} \nabla f(x + \epsilon v) \bigg|_{\epsilon = 0}.

So a Hessian-vector product can be computed by taking the directional derivative of the gradient.

In AD terms:

  1. Use reverse mode to compute \nabla f(x).
  2. Use forward mode through that gradient computation in direction v.

This is forward-over-reverse AD.

The result has roughly the cost of computing a gradient, up to a small constant factor, for many programs. This is why Hessian-vector products are central in second-order optimization.
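
In JAX, for example, this recipe is a two-line composition. The `hvp` helper below is illustrative, not a built-in, and the objective f is a made-up example.

```python
# Forward-over-reverse Hessian-vector product: forward mode (jvp) pushed
# through a reverse-mode gradient (grad).
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

def f(x):
    return jnp.sum(x ** 3) + x[0] * x[1]

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, -1.0])
print(hvp(f, x, v))           # H(x) v, no dense Hessian formed
print(jax.hessian(f)(x) @ v)  # same result, for comparison
```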

Why Full Hessians Are Often Avoided

For

f : \mathbb{R}^n \to \mathbb{R},

the gradient has size n, but the Hessian has size n \times n.

If n = 10^6, the Hessian has 10^{12} entries. Storing it densely is infeasible.

Even when the Hessian is theoretically useful, most practical systems avoid constructing it directly. Instead, they compute:

\nabla f(x), \qquad \nabla^2 f(x) v,

or

v^\top \nabla^2 f(x) v.

These objects preserve the parts of second-order information needed by optimization algorithms while avoiding quadratic storage.

Role in Optimization

Second derivatives describe the local shape of an objective.

At a point x:

| Hessian behavior | Meaning |
| --- | --- |
| positive definite | locally convex bowl |
| negative definite | locally concave cap |
| indefinite | saddle-like region |
| singular | flat or degenerate curvature |

Newton’s method uses the Hessian to choose a step:

\nabla^2 f(x)\, p = -\nabla f(x).

Then

x_{\text{new}} = x + p.

This step accounts for curvature. Gradient descent only uses the slope. Newton’s method uses both slope and curvature.

However, exact Newton steps require solving a linear system involving the Hessian. Large-scale variants often use Hessian-vector products with iterative solvers such as conjugate gradient.
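
As a sketch of that large-scale pattern, the Newton linear system can be handed to conjugate gradient, with each iteration using one Hessian-vector product. The objective f below is a made-up, locally convex example so that plain CG applies.

```python
# One Newton step solved matrix-free with conjugate gradient.
import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

def f(x):
    # Hessian here is diag(12(x-1)^2 + 2), so it is positive definite.
    return jnp.sum((x - 1.0) ** 4) + jnp.sum(x ** 2)

def hvp(x, v):
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([0.0, 2.0, -1.0])
g = jax.grad(f)(x)

# Solve H p = -g; each CG iteration needs only one Hessian-vector product.
p, _ = cg(lambda v: hvp(x, v), -g)

x_new = x + p
print(x_new)
```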

Second Derivatives in AD Systems

An AD system that supports second derivatives must handle several issues.

First, derivative code must itself be differentiable. Reverse-mode implementations often use tapes, mutation, and saved intermediates. Differentiating such machinery can be difficult unless the system has a clean internal representation.

Second, nesting must keep perturbations distinct. If two forward-mode passes accidentally share the same infinitesimal tag, the system may mix derivative levels. This is the perturbation confusion problem.

Third, memory use grows quickly. Reverse mode already stores intermediate values for the backward pass. Higher-order reverse mode may need to store values needed to differentiate the backward pass itself.

Fourth, primitives need second derivative rules. For example, multiplication has first-order rules, but second-order AD must correctly propagate second-order interactions.

For

z = xy,

the first-order differential is

dz = x\,dy + y\,dx.

The second-order differential includes the mixed interaction:

d^2 z = x\,d^2 y + y\,d^2 x + 2\,dx\,dy.

That cross term is exactly the kind of information a second-order AD implementation must preserve.
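
A toy second-order forward-mode value makes the bookkeeping concrete. The `Dual2` class below is a hypothetical minimal implementation that carries (value, first-order, second-order) perturbations along a single direction; a real system would cover many more primitives.

```python
# Second-order forward mode along one direction, for multiplication only.
class Dual2:
    def __init__(self, val, d=0.0, dd=0.0):
        self.val = val   # primal value
        self.d = d       # first-order perturbation
        self.dd = dd     # second-order perturbation

    def __mul__(self, other):
        # d(xy)   = x dy + y dx
        # d^2(xy) = x d^2y + y d^2x + 2 dx dy   (the cross term)
        return Dual2(
            self.val * other.val,
            self.val * other.d + other.val * self.d,
            self.val * other.dd + other.val * self.dd + 2.0 * self.d * other.d,
        )

# g(t) = (3 + t)(5 + 2t): g'(0) = 11, g''(0) = 4, carried by the cross term.
x = Dual2(3.0, d=1.0)
y = Dual2(5.0, d=2.0)
z = x * y
print(z.val, z.d, z.dd)  # 15.0 11.0 4.0
```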

Practical Design Principle

A production AD system should treat second derivatives as structured linear algebra, rather than as a request to build a dense matrix.

The common API should expose operations such as:

grad(f)(x)
jvp(f)(x, v)
vjp(f)(x, w)
hvp(f)(x, v)

A full Hessian API can exist, but it should be understood as a convenience for small problems.

For large systems, the central abstraction is the action of the derivative operator, not its dense coordinate representation.