Second Derivatives
First derivatives describe local rate of change. Second derivatives describe how that rate of change itself changes. In optimization, this is curvature. In dynamics, it is acceleration. In sensitivity analysis, it tells us whether a response is stable, amplifying, or changing direction.
For a scalar function
$$f : \mathbb{R} \to \mathbb{R},$$
the first derivative is
$$f'(x) = \frac{df}{dx},$$
and the second derivative is
$$f''(x) = \frac{d^2 f}{dx^2}.$$
For example, if
$$f(x) = x^3,$$
then
$$f'(x) = 3x^2$$
and
$$f''(x) = 6x.$$
The first derivative says the slope grows like $3x^2$. The second derivative says the slope itself grows at rate $6x$.
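As a quick numerical sanity check (not AD), central finite differences recover both derivatives of the cubic example above:

```python
# Check f(x) = x**3 has f'(x) = 3x**2 and f''(x) = 6x,
# using central finite differences as a numerical stand-in for AD.

def f(x):
    return x ** 3

def d(fun, x, h=1e-5):
    """Central-difference estimate of the first derivative."""
    return (fun(x + h) - fun(x - h)) / (2 * h)

def d2(fun, x, h=1e-4):
    """Central-difference estimate of the second derivative."""
    return (fun(x + h) - 2 * fun(x) + fun(x - h)) / (h * h)

x = 2.0
print(d(f, x))   # close to 3 * x**2 = 12
print(d2(f, x))  # close to 6 * x = 12
```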
Second Derivatives in Several Variables
For a scalar function of several variables,
$$f : \mathbb{R}^n \to \mathbb{R},$$
the first derivative is the gradient:
$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right).$$
The second derivative is the derivative of the gradient. This gives the Hessian matrix:
$$H(x)_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}.$$
Each entry measures how one component of the gradient changes with respect to one input variable.
If
$$f(x, y) = x^2 + xy + y^2,$$
then
$$\nabla f(x, y) = \left( 2x + y, \; x + 2y \right)$$
and
$$H = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}.$$
The diagonal entries measure curvature along coordinate directions. The off-diagonal entries measure interaction between variables.
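A short numerical sketch makes the diagonal/off-diagonal distinction concrete, using the quadratic $f(x, y) = x^2 + xy + y^2$ as a hypothetical test function with hand-derived Hessian $[[2, 1], [1, 2]]$:

```python
# Finite-difference Hessian of f(x, y) = x**2 + x*y + y**2.
# Diagonal entries: curvature along the x and y axes.
# Off-diagonal entries: interaction between x and y.

def f(x, y):
    return x * x + x * y + y * y

def hessian_fd(fun, x, y, h=1e-4):
    fxx = (fun(x + h, y) - 2 * fun(x, y) + fun(x - h, y)) / (h * h)
    fyy = (fun(x, y + h) - 2 * fun(x, y) + fun(x, y - h)) / (h * h)
    fxy = (fun(x + h, y + h) - fun(x + h, y - h)
           - fun(x - h, y + h) + fun(x - h, y - h)) / (4 * h * h)
    return [[fxx, fxy], [fxy, fyy]]

print(hessian_fd(f, 1.0, -0.5))  # approximately [[2, 1], [1, 2]]
```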
Symmetry of the Hessian
When the mixed partial derivatives are continuous, the Hessian is symmetric:
$$\frac{\partial^2 f}{\partial x_i \, \partial x_j} = \frac{\partial^2 f}{\partial x_j \, \partial x_i}.$$
Therefore,
$$H(x) = H(x)^\top.$$
This symmetry matters in automatic differentiation because a naive implementation may compute both mixed partials separately. A well-designed system can exploit symmetry to reduce work and storage.
For dense Hessians, storing all entries costs $O(n^2)$ memory. Storing only the upper triangular part costs $n(n+1)/2$ entries, roughly half as much. For sparse Hessians, the savings can be much larger.
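The counts are easy to verify directly (the function names here are illustrative):

```python
# Dense vs. upper-triangular Hessian storage, as a quick entry count.

def dense_entries(n):
    return n * n

def triangular_entries(n):
    return n * (n + 1) // 2  # diagonal plus strict upper triangle

for n in (10, 1000):
    print(n, dense_entries(n), triangular_entries(n))
```

As $n$ grows, the triangular count approaches half the dense count.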
Second Derivatives as Quadratic Approximation
Second derivatives appear naturally in the second-order Taylor approximation:
$$f(x + \Delta x) \approx f(x) + \nabla f(x)^\top \Delta x + \tfrac{1}{2} \, \Delta x^\top H(x) \, \Delta x.$$
The gradient gives the best local linear approximation. The Hessian gives the correction that accounts for curvature.
The term
$$\tfrac{1}{2} \, \Delta x^\top H(x) \, \Delta x$$
measures curvature in the direction $\Delta x$. This quantity is often more useful than the full Hessian itself, especially when $n$ is large.
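A small worked comparison shows how much the curvature term improves on the linear model, using the hypothetical test function $f(x, y) = x^2 y$ with hand-derived gradient $(2xy, x^2)$ and Hessian $[[2y, 2x], [2x, 0]]$:

```python
# First- vs. second-order Taylor approximation of f(x, y) = x**2 * y at (1, 1).

def f(x, y):
    return x * x * y

x0, y0 = 1.0, 1.0
g = [2 * x0 * y0, x0 * x0]             # gradient at (x0, y0), hand-derived
H = [[2 * y0, 2 * x0], [2 * x0, 0.0]]  # Hessian at (x0, y0), hand-derived

dx, dy = 0.1, 0.1
linear = f(x0, y0) + g[0] * dx + g[1] * dy
quad_term = 0.5 * (dx * (H[0][0] * dx + H[0][1] * dy)
                   + dy * (H[1][0] * dx + H[1][1] * dy))
quadratic = linear + quad_term

exact = f(x0 + dx, y0 + dy)
print(exact - linear)     # ~0.031: the linear model misses the curvature
print(exact - quadratic)  # ~0.001: the quadratic correction captures most of it
```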
Directional Second Derivatives
Given a direction $v$, the first directional derivative is
$$D_v f(x) = \nabla f(x)^\top v.$$
The second directional derivative is
$$D_v^2 f(x) = v^\top H(x) \, v.$$
This measures curvature along the line
$$x + t v, \quad t \in \mathbb{R}.$$
Define
$$g(t) = f(x + t v).$$
Then
$$g''(0) = v^\top H(x) \, v.$$
Automatic differentiation can compute this without materializing the full Hessian. This is important in large-scale optimization, where $n$ may be millions or billions.
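The one-dimensional restriction $g(t) = f(x + tv)$ can be checked numerically; here the curvature along $v$ is estimated by a central difference in $t$, again using $f(x, y) = x^2 + xy + y^2$ as a hypothetical test function:

```python
# Directional curvature via g(t) = f(x + t v): g''(0) = v^T H(x) v.

def f(x, y):
    return x * x + x * y + y * y  # test function with Hessian [[2, 1], [1, 2]]

def g(t, x, y, vx, vy):
    return f(x + t * vx, y + t * vy)

def second_along(x, y, vx, vy, h=1e-4):
    """Central-difference estimate of g''(0) along direction (vx, vy)."""
    return (g(h, x, y, vx, vy) - 2 * g(0, x, y, vx, vy)
            + g(-h, x, y, vx, vy)) / (h * h)

# v^T H v for v = (1, 1) and H = [[2, 1], [1, 2]] is 2 + 1 + 1 + 2 = 6.
print(second_along(0.3, -0.7, 1.0, 1.0))  # close to 6.0
```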
Computing Second Derivatives with AD
Automatic differentiation computes second derivatives by differentiating derivative computations.
There are several common strategies.
| Method | Computes | Typical use |
|---|---|---|
| Forward over forward | second directional derivatives | small input dimension |
| Reverse over forward | gradients of directional derivatives | Hessian-vector products |
| Forward over reverse | directional derivatives of gradients | Hessian-vector products |
| Reverse over reverse | full second-order reverse AD | delicate, memory-heavy |
| Taylor mode | higher-order univariate expansions | high-order derivatives |
The simplest idea is nesting.
Suppose forward mode computes, for a seed direction $v$,
$$v \;\mapsto\; \big( f(x), \; \nabla f(x)^\top v \big).$$
If we apply forward mode again, we can compute second-order directional information.
For scalar $f$, forward-over-forward can compute
$$w^\top H(x) \, v$$
for chosen directions $v$ and $w$.
For full Hessians, one can seed basis directions:
$$H_{ij} = e_i^\top H(x) \, e_j.$$
This recovers Hessian columns (column $j$ comes from sweeping $i$ with $j$ fixed), but costs $O(n^2)$ derivative passes in total. That is acceptable for small $n$, but expensive for high-dimensional models.
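The nesting idea can be sketched with a minimal dual-number class whose value and derivative slots may themselves be duals. Only `+` and `*` are implemented, enough for polynomial test functions; the function names and the test function $f(x, y) = x^2 + xy + y^2$ are illustrative choices, not a library API:

```python
# A minimal forward-mode dual number that supports nesting.

class Dual:
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def _wrap(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = self._wrap(other)
        return Dual(self.value + o.value, self.deriv + o.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        o = self._wrap(other)
        return Dual(self.value * o.value,
                    self.value * o.deriv + self.deriv * o.value)
    __rmul__ = __mul__

def second_directional(f, xs, vs):
    """Forward-over-forward: v^T H(x) v for scalar-valued f."""
    seeds = [Dual(Dual(x, v), Dual(v, 0.0)) for x, v in zip(xs, vs)]
    return f(*seeds).deriv.deriv

def hessian(f, xs):
    """Recover the full Hessian by seeding pairs of basis directions:
    n**2 nested passes, acceptable only for small n."""
    n = len(xs)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            seeds = [Dual(Dual(x, 1.0 if k == i else 0.0),
                          Dual(1.0 if k == j else 0.0, 0.0))
                     for k, x in enumerate(xs)]
            H[i][j] = f(*seeds).deriv.deriv
    return H

def f(x, y):
    return x * x + x * y + y * y  # test function, Hessian [[2, 1], [1, 2]]

print(second_directional(f, [1.0, 2.0], [1.0, 1.0]))  # 6.0
print(hessian(f, [1.0, 2.0]))                          # [[2.0, 1.0], [1.0, 2.0]]
```

Note the inner and outer duals here are kept distinct by construction; a real system must enforce that separation explicitly to avoid perturbation confusion.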
Hessian-Vector Products
A Hessian-vector product is
$$H(x) \, v.$$
It gives the action of the Hessian on a vector without constructing the Hessian matrix.
One useful identity is
$$H(x) \, v = \left. \frac{d}{dt} \, \nabla f(x + t v) \right|_{t = 0}.$$
So a Hessian-vector product can be computed by taking the directional derivative of the gradient.
In AD terms:
- Use reverse mode to compute $\nabla f(x)$.
- Use forward mode through that gradient computation in direction $v$.
This is forward-over-reverse AD.
The result has roughly the cost of computing a gradient, up to a small constant factor, for many programs. This is why Hessian-vector products are central in second-order optimization.
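The identity above can be illustrated without any AD machinery: below, a hand-coded gradient stands in for the reverse pass, and a central difference in $t$ stands in for the forward pass. The test function $f(x, y) = x^2 + xy + y^2$ is again an illustrative choice:

```python
# Hessian-vector product via  H(x) v = d/dt grad f(x + t v) |_{t=0}.

def grad_f(x, y):
    # Hand-coded gradient of f(x, y) = x**2 + x*y + y**2,
    # standing in for a reverse-mode pass.
    return [2 * x + y, x + 2 * y]

def hvp(x, y, vx, vy, h=1e-5):
    # Central difference of the gradient along (vx, vy),
    # standing in for a forward-mode pass through the gradient.
    gp = grad_f(x + h * vx, y + h * vy)
    gm = grad_f(x - h * vx, y - h * vy)
    return [(gp[0] - gm[0]) / (2 * h), (gp[1] - gm[1]) / (2 * h)]

# H = [[2, 1], [1, 2]] and v = (1, 0), so H v = (2, 1).
print(hvp(0.5, -1.0, 1.0, 0.0))  # close to [2.0, 1.0]
```

The full Hessian never appears: only gradient evaluations and one directional derivative.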
Why Full Hessians Are Often Avoided
For
$$f : \mathbb{R}^n \to \mathbb{R},$$
the gradient has size $n$, but the Hessian has size $n \times n$.
If $n = 10^6$, the Hessian has $10^{12}$ entries. Storing it densely is infeasible.
Even when the Hessian is theoretically useful, most practical systems avoid constructing it directly. Instead, they compute:
$$H(x) \, v$$
or
$$v^\top H(x) \, v.$$
These objects preserve the parts of second-order information needed by optimization algorithms while avoiding quadratic storage.
Role in Optimization
Second derivatives describe the local shape of an objective.
At a point $x$:
| Hessian behavior | Meaning |
|---|---|
| positive definite | locally convex bowl |
| negative definite | locally concave cap |
| indefinite | saddle-like region |
| singular | flat or degenerate curvature |
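For a symmetric $2 \times 2$ Hessian, the classification in the table follows from the trace and determinant, which determine both eigenvalues (the function name and tolerance below are illustrative choices):

```python
import math

def classify(H, tol=1e-12):
    """Classify a symmetric 2x2 Hessian by its eigenvalues,
    computed from the trace and determinant."""
    a, b, c = H[0][0], H[0][1], H[1][1]
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(max((tr * tr) / 4 - det, 0.0))
    lo, hi = tr / 2 - disc, tr / 2 + disc  # the two eigenvalues
    if abs(lo) < tol or abs(hi) < tol:
        return "singular"
    if lo > 0:
        return "positive definite"
    if hi < 0:
        return "negative definite"
    return "indefinite"

print(classify([[2, 1], [1, 2]]))   # positive definite (eigenvalues 1, 3)
print(classify([[0, 1], [1, 0]]))   # indefinite (eigenvalues -1, 1)
print(classify([[1, 1], [1, 1]]))   # singular (eigenvalues 0, 2)
```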
Newton’s method uses the Hessian to choose a step:
$$H(x_k) \, \Delta x = -\nabla f(x_k).$$
Then
$$x_{k+1} = x_k + \Delta x.$$
This step accounts for curvature. Gradient descent only uses the slope. Newton’s method uses both slope and curvature.
However, exact Newton steps require solving a linear system involving the Hessian. Large-scale variants often use Hessian-vector products with iterative solvers such as conjugate gradient.
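As a minimal sketch of the Newton update, consider the convex quadratic $f(x, y) = x^2 + xy + y^2$ (an illustrative choice): its Hessian is constant, so a single exact Newton step lands on the minimizer at the origin. The $2 \times 2$ system is solved directly by Cramer's rule; at scale one would use an iterative solver instead:

```python
# One exact Newton step on f(x, y) = x**2 + x*y + y**2.

def newton_step(x, y):
    g = [2 * x + y, x + 2 * y]      # gradient at (x, y)
    H = [[2.0, 1.0], [1.0, 2.0]]    # Hessian (constant for a quadratic)
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    # Solve H * d = -g by Cramer's rule (fine for 2x2; use CG at scale).
    dx = (-g[0] * H[1][1] + g[1] * H[0][1]) / det
    dy = (-g[1] * H[0][0] + g[0] * H[1][0]) / det
    return x + dx, y + dy

print(newton_step(3.0, -7.0))  # (0.0, 0.0): one step reaches the minimizer
```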
Second Derivatives in AD Systems
An AD system that supports second derivatives must handle several issues.
First, derivative code must itself be differentiable. Reverse-mode implementations often use tapes, mutation, and saved intermediates. Differentiating such machinery can be difficult unless the system has a clean internal representation.
Second, nesting must keep perturbations distinct. If two forward-mode passes accidentally share the same infinitesimal tag, the system may mix derivative levels. This is the perturbation confusion problem.
Third, memory use grows quickly. Reverse mode already stores intermediate values for the backward pass. Higher-order reverse mode may need to store values needed to differentiate the backward pass itself.
Fourth, primitives need second derivative rules. For example, multiplication has first-order rules, but second-order AD must correctly propagate second-order interactions.
For
$$w = u \, v,$$
the first-order differential is
$$dw = u \, dv + v \, du.$$
The second-order differential includes the mixed interaction:
$$d^2 w = u \, d^2 v + 2 \, du \, dv + v \, d^2 u.$$
That cross term is exactly the kind of information a second-order AD implementation must preserve.
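A sketch of such a rule: propagate truncated coefficient triples (value, first derivative, second derivative) along one direction, with the multiplication rule carrying the $2 \, du \, dv$ cross term explicitly (the function name `mul2` is an illustrative choice):

```python
# Second-order coefficient triples (value, d, d2) along one direction,
# with the multiplication rule for the cross term.

def mul2(u, v):
    u0, u1, u2 = u
    v0, v1, v2 = v
    return (u0 * v0,
            u0 * v1 + u1 * v0,                 # first-order product rule
            u0 * v2 + 2 * u1 * v1 + u2 * v0)   # note the mixed 2*u1*v1 term

# f(x) = x * x along t -> x0 + t: seed x = (x0, 1, 0).
x0 = 3.0
x = (x0, 1.0, 0.0)
val, d1, d2 = mul2(x, x)
print(val, d1, d2)  # 9.0 6.0 2.0 -- d2 comes entirely from the cross term
```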
Practical Design Principle
A production AD system should treat second derivatives as structured linear algebra, rather than as a request to build a dense matrix.
The common API should expose operations such as:

```
grad(f)(x)
jvp(f)(x, v)
vjp(f)(x, w)
hvp(f)(x, v)
```

A full Hessian API can exist, but it should be understood as a convenience for small problems.
For large systems, the central abstraction is the action of the derivative operator, not its dense coordinate representation.