Chapter 8. Higher-Order Differentiation

Second Derivatives

First derivatives describe local rate of change. Second derivatives describe how that rate of change itself changes. In optimization, this is curvature. In dynamics, it is acceleration. In sensitivity analysis, it tells us whether a response is stable, amplifying, or changing direction.

For a scalar function

f : \mathbb{R} \to \mathbb{R},

the first derivative is

f'(x),

and the second derivative is

f''(x) = \frac{d}{dx} f'(x).

For example, if

f(x) = x^3,

then

f'(x) = 3x^2,

and

f''(x) = 6x.

The first derivative says the slope grows like 3x^2. The second derivative says the slope itself grows at rate 6x.
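
As a quick check, nesting a gradient operator reproduces these formulas. This is a minimal sketch using JAX's grad transform; any AD system with a similar operator behaves the same way.

```python
# Second derivative of f(x) = x^3 by differentiating the derivative code.
import jax

def f(x):
    return x ** 3

df = jax.grad(f)             # f'(x) = 3x^2
d2f = jax.grad(jax.grad(f))  # f''(x) = 6x

print(df(3.0))   # 27.0
print(d2f(3.0))  # 18.0
```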

Second Derivatives in Several Variables

For a scalar function of several variables,

f : \mathbb{R}^n \to \mathbb{R},

the first derivative is the gradient:

\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}.

The second derivative is the derivative of the gradient. This gives the Hessian matrix:

\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1 \partial x_1} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n \partial x_n} \end{bmatrix}.

Each entry measures how one component of the gradient changes with respect to one input variable.

If

f(x, y) = x^2 y + \sin y,

then

\nabla f(x, y) = \begin{bmatrix} 2xy \\ x^2 + \cos y \end{bmatrix},

and

\nabla^2 f(x, y) = \begin{bmatrix} 2y & 2x \\ 2x & -\sin y \end{bmatrix}.

The diagonal entries measure curvature along coordinate directions. The off-diagonal entries measure interaction between variables.
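
The same worked example can be verified numerically. This is a sketch using JAX's grad and hessian transforms, with the function written to take a single 2-vector.

```python
# Gradient and Hessian of f(x, y) = x^2 y + sin y, checked against the
# hand-derived formulas above.
import jax
import jax.numpy as jnp

def f(p):
    x, y = p
    return x ** 2 * y + jnp.sin(y)

p = jnp.array([1.0, 2.0])
print(jax.grad(f)(p))     # [2xy, x^2 + cos y] = [4.0, 1 + cos 2]
print(jax.hessian(f)(p))  # [[2y, 2x], [2x, -sin y]] = [[4, 2], [2, -sin 2]]
```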

Symmetry of the Hessian

When the mixed partial derivatives are continuous, the Hessian is symmetric:

\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}.

Therefore,

\nabla^2 f(x) = \nabla^2 f(x)^\top.

This symmetry matters in automatic differentiation because a naive implementation may compute both mixed partials separately. A well-designed system can exploit symmetry to reduce work and storage.

For dense Hessians, storing all entries costs O(n^2). Storing only the upper triangular part costs roughly half as much. For sparse Hessians, the savings can be much larger.

Second Derivatives as Quadratic Approximation

Second derivatives appear naturally in the second-order Taylor approximation:

f(x + h) \approx f(x) + \nabla f(x)^\top h + \frac{1}{2} h^\top \nabla^2 f(x) h.

The gradient gives the best local linear approximation. The Hessian gives the correction that accounts for curvature.

The term

h^\top \nabla^2 f(x) h

measures curvature in the direction h. This quantity is often more useful than the full Hessian itself, especially when n is large.
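
The quadratic model is easy to check numerically. This sketch reuses the f(x, y) = x^2 y + sin y example from above and compares the exact value at a nearby point with the second-order Taylor approximation.

```python
# Second-order Taylor model: f(x+h) vs f(x) + grad.h + 0.5 h^T H h.
import jax
import jax.numpy as jnp

def f(p):
    x, y = p
    return x ** 2 * y + jnp.sin(y)

x0 = jnp.array([1.0, 2.0])
h = jnp.array([0.01, -0.02])

quadratic_model = (f(x0)
                   + jax.grad(f)(x0) @ h
                   + 0.5 * h @ jax.hessian(f)(x0) @ h)

print(f(x0 + h))        # exact value
print(quadratic_model)  # agrees closely for small h
```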

Directional Second Derivatives

Given a direction v \in \mathbb{R}^n, the first directional derivative is

D_v f(x) = \nabla f(x)^\top v.

The second directional derivative is

D_v^2 f(x) = v^\top \nabla^2 f(x) v.

This measures curvature along the line

x(t) = x + tv.

Define

g(t) = f(x + tv).

Then

g''(0) = v^\top \nabla^2 f(x) v.

Automatic differentiation can compute this without materializing the full Hessian. This is important in large-scale optimization, where n may be millions or billions.
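
One way to see this is to nest forward mode twice. The sketch below uses JAX; `second_directional` is an illustrative helper name, and the function f is a made-up example, but the pattern works for any scalar-valued program.

```python
# v^T H(x) v via nested forward mode, without forming the Hessian.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 3) + x[0] * x[1]

def second_directional(f, x, v):
    # First jvp: directional derivative df(x; v) = grad f(x) . v.
    df_v = lambda x: jax.jvp(f, (x,), (v,))[1]
    # Differentiating that scalar again in direction v gives v^T H(x) v.
    return jax.jvp(df_v, (x,), (v,))[1]

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, -1.0])
print(second_directional(f, x, v))
print(v @ jax.hessian(f)(x) @ v)  # same value, via the dense Hessian
```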

Computing Second Derivatives with AD

Automatic differentiation computes second derivatives by differentiating derivative computations.

There are several common strategies.

| Method | Computes | Typical use |
| --- | --- | --- |
| Forward over forward | second directional derivatives | small input dimension |
| Reverse over forward | gradients of directional derivatives | Hessian-vector products |
| Forward over reverse | directional derivatives of gradients | Hessian-vector products |
| Reverse over reverse | full second-order reverse AD | delicate, memory-heavy |
| Taylor mode | higher-order univariate expansions | high-order derivatives |
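
As a rough illustration, in a system like JAX these nesting orders correspond to compositions of forward- and reverse-mode Jacobian transforms. The compositions below all build the full Hessian of a scalar function; the directional variants in the table replace the outer transform with a single jvp or vjp. Taylor mode is separate machinery and is not shown.

```python
# Nesting orders as compositions of jacfwd and jacrev.
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(x ** 3)

hess_fwd_fwd = jax.jacfwd(jax.jacfwd(f))   # forward over forward
hess_fwd_rev = jax.jacfwd(jax.jacrev(f))   # forward over reverse
hess_rev_fwd = jax.jacrev(jax.jacfwd(f))   # reverse over forward
hess_rev_rev = jax.jacrev(jax.jacrev(f))   # reverse over reverse

x = jnp.array([1.0, 2.0, 3.0])
print(hess_fwd_rev(x))  # diag(6x) for this f
```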

The simplest idea is nesting.

Suppose forward mode computes

f(x), \quad J_f(x) v.

If we apply forward mode again, we can compute second-order directional information.

For scalar ff, forward-over-forward can compute

v^\top \nabla^2 f(x) v.

For full Hessians, one can seed basis directions:

e_1, e_2, \ldots, e_n.

This recovers Hessian columns, but costs O(n) derivative passes. That is acceptable for small n, but expensive for high-dimensional models.
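
In JAX, for instance, nesting jax.jacfwd does this basis-direction seeding automatically. A sketch for a small, made-up f:

```python
# Full Hessian via forward-over-forward; reasonable only for small n,
# since each nesting level costs O(n) forward passes.
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 3) + x[0] * x[1]

x = jnp.array([1.0, 2.0, 3.0])
print(jax.jacfwd(jax.jacfwd(f))(x))
```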

Hessian-Vector Products

A Hessian-vector product is

\nabla^2 f(x) v.

It gives the action of the Hessian on a vector without constructing the Hessian matrix.

One useful identity is

\nabla^2 f(x) v = \frac{d}{d\epsilon} \nabla f(x + \epsilon v) \bigg|_{\epsilon = 0}.

So a Hessian-vector product can be computed by taking the directional derivative of the gradient.

In AD terms:

  1. Use reverse mode to compute \nabla f(x).
  2. Use forward mode through that gradient computation in direction v.

This is forward-over-reverse AD.

The result has roughly the cost of computing a gradient, up to a small constant factor, for many programs. This is why Hessian-vector products are central in second-order optimization.
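
In JAX, for example, this recipe is a two-line composition. The `hvp` helper below is illustrative, not a built-in, and the objective f is a made-up example.

```python
# Forward-over-reverse Hessian-vector product: forward mode (jvp) pushed
# through a reverse-mode gradient (grad).
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

def f(x):
    return jnp.sum(x ** 3) + x[0] * x[1]

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, -1.0])
print(hvp(f, x, v))           # H(x) v, no dense Hessian formed
print(jax.hessian(f)(x) @ v)  # same result, for comparison
```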

Why Full Hessians Are Often Avoided

For

f : \mathbb{R}^n \to \mathbb{R},

the gradient has size n, but the Hessian has size n \times n.

If n = 10^6, the Hessian has 10^{12} entries. Storing it densely is infeasible.

Even when the Hessian is theoretically useful, most practical systems avoid constructing it directly. Instead, they compute:

\nabla f(x), \qquad \nabla^2 f(x) v,

or

v^\top \nabla^2 f(x) v.

These objects preserve the parts of second-order information needed by optimization algorithms while avoiding quadratic storage.

Role in Optimization

Second derivatives describe the local shape of an objective.

At a point x:

| Hessian behavior | Meaning |
| --- | --- |
| positive definite | locally convex bowl |
| negative definite | locally concave cap |
| indefinite | saddle-like region |
| singular | flat or degenerate curvature |

Newton’s method uses the Hessian to choose a step:

\nabla^2 f(x)\, p = -\nabla f(x).

Then

x_{\text{new}} = x + p.

This step accounts for curvature. Gradient descent only uses the slope. Newton’s method uses both slope and curvature.

However, exact Newton steps require solving a linear system involving the Hessian. Large-scale variants often use Hessian-vector products with iterative solvers such as conjugate gradient.
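
As a sketch of that large-scale pattern, the Newton linear system can be handed to conjugate gradient, with each iteration using one Hessian-vector product. The objective f below is a made-up, locally convex example so that plain CG applies.

```python
# One Newton step solved matrix-free with conjugate gradient.
import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

def f(x):
    # Hessian here is diag(12(x-1)^2 + 2), so it is positive definite.
    return jnp.sum((x - 1.0) ** 4) + jnp.sum(x ** 2)

def hvp(x, v):
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([0.0, 2.0, -1.0])
g = jax.grad(f)(x)

# Solve H p = -g; each CG iteration needs only one Hessian-vector product.
p, _ = cg(lambda v: hvp(x, v), -g)

x_new = x + p
print(x_new)
```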

Second Derivatives in AD Systems

An AD system that supports second derivatives must handle several issues.

First, derivative code must itself be differentiable. Reverse-mode implementations often use tapes, mutation, and saved intermediates. Differentiating such machinery can be difficult unless the system has a clean internal representation.

Second, nesting must keep perturbations distinct. If two forward-mode passes accidentally share the same infinitesimal tag, the system may mix derivative levels. This is the perturbation confusion problem.

Third, memory use grows quickly. Reverse mode already stores intermediate values for the backward pass. Higher-order reverse mode may need to store values needed to differentiate the backward pass itself.

Fourth, primitives need second derivative rules. For example, multiplication has first-order rules, but second-order AD must correctly propagate second-order interactions.

For

z = xy,

the first-order differential is

dz = x\,dy + y\,dx.

The second-order differential includes the mixed interaction:

d^2 z = x\,d^2 y + y\,d^2 x + 2\,dx\,dy.

That cross term is exactly the kind of information a second-order AD implementation must preserve.
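
A toy second-order forward-mode value makes the bookkeeping concrete. The `Dual2` class below is a hypothetical minimal implementation that carries (value, first-order, second-order) perturbations along a single direction; a real system would cover many more primitives.

```python
# Second-order forward mode along one direction, for multiplication only.
class Dual2:
    def __init__(self, val, d=0.0, dd=0.0):
        self.val = val   # primal value
        self.d = d       # first-order perturbation
        self.dd = dd     # second-order perturbation

    def __mul__(self, other):
        # d(xy)   = x dy + y dx
        # d^2(xy) = x d^2y + y d^2x + 2 dx dy   (the cross term)
        return Dual2(
            self.val * other.val,
            self.val * other.d + other.val * self.d,
            self.val * other.dd + other.val * self.dd + 2.0 * self.d * other.d,
        )

# g(t) = (3 + t)(5 + 2t): g'(0) = 11, g''(0) = 4, carried by the cross term.
x = Dual2(3.0, d=1.0)
y = Dual2(5.0, d=2.0)
z = x * y
print(z.val, z.d, z.dd)  # 15.0 11.0 4.0
```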

Practical Design Principle

A production AD system should treat second derivatives as structured linear algebra, rather than as a request to build a dense matrix.

The common API should expose operations such as:

grad(f)(x)
jvp(f)(x, v)
vjp(f)(x, w)
hvp(f)(x, v)

A full Hessian API can exist, but it should be understood as a convenience for small problems.

For large systems, the central abstraction is the action of the derivative operator, not its dense coordinate representation.