The gradient is enough when a function has many inputs and one scalar output. More general programs need more general derivative objects. Two of the most important are the Jacobian and the Hessian.
The Jacobian stores first-order derivatives of a vector-valued function. The Hessian stores second-order derivatives of a scalar-valued function. Both can be large. In automatic differentiation, we often avoid constructing them explicitly. Instead, we compute products with them.
Jacobian
Let

$$f : \mathbb{R}^n \to \mathbb{R}^m$$

with component functions $f_1, \dots, f_m$, so that $f(x) = (f_1(x), \dots, f_m(x))$.

The Jacobian of $f$ at $x$ is the matrix

$$
J_f(x) =
\begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{pmatrix}.
$$

It has $m$ rows and $n$ columns. Rows correspond to output components. Columns correspond to input components.

The entry in row $i$, column $j$ is $\partial f_i / \partial x_j$. It measures how output $f_i$ changes when input $x_j$ changes.
Jacobian as a Linear Map
The Jacobian represents the best local linear approximation to $f$ near $x$.

For a small perturbation $\Delta x$,

$$f(x + \Delta x) \approx f(x) + J_f(x)\, \Delta x.$$

The Jacobian maps input perturbations to output perturbations:

$$\Delta y \approx J_f(x)\, \Delta x.$$
This is the central meaning of the Jacobian. It is not just a table of partial derivatives. It is the linear operator that approximates the function near one point.
Example
Consider the map

$$f(x_1, x_2) = \big(x_1^2 x_2,\; x_1 + x_2^2\big).$$

Here, $n = 2$ and $m = 2$.

The component functions are

$$f_1(x_1, x_2) = x_1^2 x_2, \qquad f_2(x_1, x_2) = x_1 + x_2^2.$$

The Jacobian is

$$
J_f(x) =
\begin{pmatrix}
2 x_1 x_2 & x_1^2 \\
1 & 2 x_2
\end{pmatrix}.
$$

At $x = (1, 2)$,

$$
J_f(1, 2) =
\begin{pmatrix}
4 & 1 \\
1 & 4
\end{pmatrix}.
$$

If the input changes by

$$\Delta x = (0.1,\; 0),$$

then the predicted output change is

$$\Delta y \approx J_f(1, 2)\, \Delta x = (0.4,\; 0.1).$$

This approximation becomes more accurate as $\|\Delta x\|$ becomes smaller.
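As a concrete check, here is a minimal sketch using JAX (one possible AD system, not assumed by the text) that evaluates this Jacobian and the predicted output change. The function name `f` is just the example map above.

```python
import jax
import jax.numpy as jnp

def f(x):
    # The example map: f(x1, x2) = (x1^2 * x2, x1 + x2^2)
    return jnp.array([x[0] ** 2 * x[1], x[0] + x[1] ** 2])

x = jnp.array([1.0, 2.0])

J = jax.jacfwd(f)(x)                 # full Jacobian at x: [[4., 1.], [1., 4.]]

dx = jnp.array([0.1, 0.0])
_, dy = jax.jvp(f, (x,), (dx,))      # predicted output change J @ dx: [0.4, 0.1]
```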
Computing a Full Jacobian
A full Jacobian has $m \times n$ entries. For small $m$ and $n$, constructing it directly is reasonable. For large models, it can be infeasible.
There are two common ways to build a full Jacobian with AD.
Forward mode can compute one column at a time. Seed the input tangent with a basis vector $e_j$. The resulting output tangent is

$$J_f(x)\, e_j,$$

which is column $j$ of the Jacobian. Repeating this for all $n$ input coordinates gives the full matrix.
Reverse mode can compute one row at a time. Seed the output adjoint with a basis vector $e_i$. The resulting input adjoint is

$$e_i^\top J_f(x),$$

which is row $i$ of the Jacobian. Repeating this for all $m$ output coordinates gives the full matrix.
| Method | One pass gives | Number of passes for full Jacobian |
|---|---|---|
| Forward mode | one column | $n$ (one per input) |
| Reverse mode | one row | $m$ (one per output) |
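The column-by-column and row-by-row constructions can be written directly in terms of JVPs and VJPs. A sketch in JAX; the helper names `jacobian_by_columns` and `jacobian_by_rows` are illustrative, not part of any library API.

```python
import jax
import jax.numpy as jnp

def jacobian_by_columns(f, x):
    # Forward mode: seed each input basis vector e_j; each JVP returns one column.
    basis = jnp.eye(x.shape[0])
    columns = [jax.jvp(f, (x,), (e,))[1] for e in basis]
    return jnp.stack(columns, axis=1)

def jacobian_by_rows(f, x):
    # Reverse mode: seed each output basis vector e_i; each VJP returns one row.
    y, vjp_fn = jax.vjp(f, x)
    basis = jnp.eye(y.shape[0])
    rows = [vjp_fn(e)[0] for e in basis]
    return jnp.stack(rows, axis=0)
```

Both return the same matrix; the forward version takes $n$ passes and the reverse version takes $m$ passes, exactly the trade-off in the table above.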
So the shape of the mapping matters.
For

$$f : \mathbb{R}^n \to \mathbb{R} \quad (\text{many inputs, one scalar output}),$$

reverse mode is usually preferred because one reverse pass gives the full gradient.

For

$$f : \mathbb{R} \to \mathbb{R}^m \quad (\text{one input, many outputs}),$$

forward mode is usually preferred because one forward pass gives all output sensitivities with respect to the single input.
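A short illustration of this rule of thumb, again in JAX and with made-up function names:

```python
import jax
import jax.numpy as jnp

# Many inputs, one scalar output: reverse mode gets all n sensitivities in one pass.
def scalar_out(x):
    return jnp.sum(jnp.sin(x) ** 2)

grad_x = jax.jacrev(scalar_out)(jnp.linspace(0.0, 1.0, 1000))   # shape (1000,)

# One input, many outputs: forward mode gets all m sensitivities in one pass.
def vector_out(t):
    return jnp.array([t, t ** 2, jnp.sin(t)])

sens_t = jax.jacfwd(vector_out)(2.0)                            # shape (3,)
```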
Jacobian-Vector Products
Many algorithms need the action of a Jacobian on a vector, not the full Jacobian.
A Jacobian-vector product has the form

$$J_f(x)\, v,$$

where

$$v \in \mathbb{R}^n.$$

This gives the output perturbation caused by input perturbation $v$.
Forward mode computes this directly. It propagates a tangent vector alongside the ordinary value.
If

$$f : \mathbb{R}^n \to \mathbb{R}^m$$

and the input perturbation is $v \in \mathbb{R}^n$, then forward mode computes

$$J_f(x)\, v \in \mathbb{R}^m.$$

This is often much cheaper than building $J_f(x)$ in full.
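A minimal JVP sketch in JAX; the function and the perturbation direction are illustrative only:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([jnp.sin(x[0]) * x[1], x[0] + jnp.exp(x[1]), x[0] * x[1]])

x = jnp.array([0.5, 1.5])
v = jnp.array([1.0, -2.0])               # input perturbation direction

y, jvp_out = jax.jvp(f, (x,), (v,))      # jvp_out = J_f(x) @ v, no Jacobian materialized
```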
Vector-Jacobian Products
Reverse mode naturally computes vector-Jacobian products:

$$w^\top J_f(x),$$

where

$$w \in \mathbb{R}^m.$$

This gives the sensitivity of the weighted output

$$w^\top f(x)$$

with respect to the input.

If $f$ is scalar-valued, then $m = 1$. With seed $w = 1$, reverse mode computes

$$1 \cdot J_f(x) = \nabla f(x)^\top$$

as a row vector, which corresponds to the gradient.
This is the basis of backpropagation. A scalar loss is differentiated with respect to many parameters using one reverse pass.
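A sketch of this pattern in JAX, with a hypothetical scalar loss; seeding the scalar output with $1$ returns the full gradient in one reverse pass:

```python
import jax
import jax.numpy as jnp

def loss(w):
    # Scalar-valued function of many parameters.
    return jnp.sum(jnp.tanh(w) ** 2)

w = jnp.linspace(-1.0, 1.0, 1000)
value, vjp_fn = jax.vjp(loss, w)
(grad_w,) = vjp_fn(jnp.ones_like(value))   # seed the scalar output with 1.0

# grad_w matches jax.grad(loss)(w): one reverse pass, 1000 sensitivities.
```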
Hessian
The Hessian is the matrix of second derivatives for a scalar-valued function.
Let

$$f : \mathbb{R}^n \to \mathbb{R}.$$

The Hessian is

$$
H_f(x) =
\begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}.
$$

If the second partial derivatives are continuous, the Hessian is symmetric:

$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}.$$
The Hessian describes curvature. The gradient gives the best local linear approximation. The Hessian gives the second-order correction.
Second-Order Approximation
Near a point $x$, a scalar function can be approximated by

$$f(x + \Delta x) \approx f(x) + \nabla f(x)^\top \Delta x + \tfrac{1}{2}\, \Delta x^\top H_f(x)\, \Delta x.$$
The three terms have distinct meanings.
| Term | Meaning |
|---|---|
| $f(x)$ | value at the base point |
| $\nabla f(x)^\top \Delta x$ | first-order linear change |
| $\tfrac{1}{2}\, \Delta x^\top H_f(x)\, \Delta x$ | second-order curvature correction |
Second-order methods use this model to choose better update steps than first-order gradient descent. Newton’s method, quasi-Newton methods, trust-region methods, and many sensitivity methods depend on curvature information.
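A small sketch, assuming JAX and an arbitrary test function, that evaluates this quadratic model and compares it with the true function value:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x[0]) * jnp.exp(x[1])

x = jnp.array([0.3, -0.2])
dx = jnp.array([0.02, 0.01])

g = jax.grad(f)(x)
H = jax.hessian(f)(x)

quadratic_model = f(x) + g @ dx + 0.5 * dx @ H @ dx
exact = f(x + dx)
# quadratic_model agrees with exact up to third-order terms in dx
```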
Example Hessian
Consider

$$f(x_1, x_2) = x_1^2 x_2 + x_2^3.$$

The gradient is

$$\nabla f(x) = \big(2 x_1 x_2,\; x_1^2 + 3 x_2^2\big).$$

The Hessian is

$$
H_f(x) =
\begin{pmatrix}
2 x_2 & 2 x_1 \\
2 x_1 & 6 x_2
\end{pmatrix}.
$$

The off-diagonal entries match because the function is smooth.
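The same values can be checked numerically; a sketch in JAX:

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0] ** 2 * x[1] + x[1] ** 3

x = jnp.array([1.0, 2.0])

g = jax.grad(f)(x)       # [4., 13.]  == (2*x1*x2, x1^2 + 3*x2^2) at (1, 2)
H = jax.hessian(f)(x)    # [[4., 2.], [2., 12.]]  -- symmetric, off-diagonals equal
```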
Hessian-Vector Products
A full Hessian has $n \times n$ entries. For large $n$, this is often impossible to store.

Many algorithms only need a Hessian-vector product:

$$H_f(x)\, v.$$

This measures how the gradient changes in direction $v$.

There is a useful identity:

$$H_f(x)\, v = \left.\frac{d}{d\epsilon}\, \nabla f(x + \epsilon v)\right|_{\epsilon = 0}.$$
So a Hessian-vector product is the directional derivative of the gradient.
AD systems can compute this efficiently by combining modes. One common strategy is forward-over-reverse:
- Use reverse mode to define the gradient computation.
- Use forward mode through that gradient computation to obtain the directional derivative.
This avoids explicitly forming the Hessian.
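A minimal forward-over-reverse sketch in JAX; the helper name `hvp` and the test function are assumptions for illustration, not part of the text:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * x ** 2)

def hvp(f, x, v):
    # Forward mode pushed through the reverse-mode gradient program:
    # the directional derivative of grad f in direction v is H_f(x) @ v.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.linspace(0.0, 1.0, 5)
v = jnp.ones(5)

print(hvp(f, x, v))              # Hessian-vector product, no n x n matrix formed
print(jax.hessian(f)(x) @ v)     # same numbers, but materializes the full Hessian
```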
Hessian as Jacobian of the Gradient
The Hessian is the Jacobian of the gradient.
For

$$f : \mathbb{R}^n \to \mathbb{R},$$

the gradient is a vector-valued function:

$$\nabla f : \mathbb{R}^n \to \mathbb{R}^n.$$

The Jacobian of this gradient function is the Hessian:

$$J_{\nabla f}(x) = H_f(x).$$
This viewpoint is useful in AD because higher-order differentiation can be built by differentiating derivative programs.
A first AD transform gives a gradient program. A second AD transform can differentiate that gradient program.
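In JAX terms, this is why a second-derivative transform can be composed from first-order ones; a sketch using the example function from above:

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0] ** 2 * x[1] + x[1] ** 3

x = jnp.array([1.0, 2.0])

H1 = jax.jacfwd(jax.grad(f))(x)   # Jacobian of the gradient program
H2 = jax.hessian(f)(x)            # equivalent; JAX composes jacfwd(jacrev(f)) internally
# H1 and H2 agree
```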
Explicit Matrices vs Products
AD users often ask for gradients, Jacobians, or Hessians. AD systems internally prefer products.
| Desired object | Size | Often computed as |
|---|---|---|
| Gradient | $n$ | reverse mode with scalar seed |
| Full Jacobian | $m \times n$ | repeated JVPs or repeated VJPs |
| Full Hessian | $n \times n$ | repeated Hessian-vector products |
| JVP | $m$ | forward mode |
| VJP | $n$ | reverse mode |
| HVP | $n$ | mixed-mode AD |
This distinction is important. A full matrix may be conceptually simple but computationally unsuitable. Matrix-free products often give the same information needed by optimization and simulation algorithms at far lower cost.
Practical Meaning for AD
The Jacobian and Hessian are not special symbolic objects inside most AD systems. They are derivative computations obtained from repeated or nested applications of local rules.
A Jacobian arises from first-order sensitivity of vector outputs with respect to vector inputs. A Hessian arises from second-order sensitivity of a scalar output with respect to vector inputs.
The implementation question is usually:
Which derivative product does the user need?
That question leads directly to the right AD mode. Forward mode gives JVPs. Reverse mode gives VJPs. Mixed-mode AD gives second-order products such as Hessian-vector products. Full Jacobians and Hessians are then built only when their explicit entries are truly needed.