Automatic differentiation is usually applied to functions with many inputs and many outputs. The calculus needed for this setting is multivariate calculus: the study of how a quantity changes when several variables change at once.
A scalar function of several variables has the form

$$f: \mathbb{R}^n \to \mathbb{R}$$

For example,

$$f(x, y) = x^2 y + \sin(y)$$

depends on two independent input directions. We can ask how $f$ changes when only $x$ changes, when only $y$ changes, or when both change together.
## Partial Derivatives

A partial derivative measures change along one input coordinate while holding the others fixed.

For

$$f(x, y) = x^2 y + \sin(y),$$

the partial derivative with respect to $x$ is

$$\frac{\partial f}{\partial x} = 2xy$$

The partial derivative with respect to $y$ is

$$\frac{\partial f}{\partial y} = x^2 + \cos(y)$$

Each partial derivative is itself a function. At a concrete input, such as $x = 2$, $y = 3$, each partial derivative becomes a number.
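To make this concrete, here is a minimal sketch in JAX, using the running example above; `jax.grad` with `argnums` is one standard way to select which argument the partial derivative is taken with respect to.

```python
import jax
import jax.numpy as jnp

def f(x, y):
    return x**2 * y + jnp.sin(y)

# argnums selects which input the partial derivative is taken with respect to.
df_dx = jax.grad(f, argnums=0)   # df/dx = 2xy
df_dy = jax.grad(f, argnums=1)   # df/dy = x^2 + cos(y)

# At the concrete input x = 2, y = 3, each partial derivative is a number.
print(df_dx(2.0, 3.0))  # 2*2*3 = 12.0
print(df_dy(2.0, 3.0))  # 4 + cos(3) ~ 3.01
```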
Partial derivatives are the entries from which gradients, Jacobians, and Hessians are built.
## Gradient

For a scalar-valued function

$$f: \mathbb{R}^n \to \mathbb{R},$$

the gradient collects all first partial derivatives into a vector:

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right)$$
The gradient points in the direction of steepest local increase under the usual Euclidean geometry. Its negative points in the direction of steepest local decrease.
This is why gradients drive optimization. If a loss function $L(\theta)$ measures the error of a model with parameters $\theta$, then gradient descent updates parameters by moving against the gradient:

$$\theta \leftarrow \theta - \eta \, \nabla L(\theta)$$

Here, $\eta$ is the learning rate.

AD gives a way to compute $\nabla L(\theta)$ efficiently, even when $\theta$ has millions of components.
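As a sketch of this loop in JAX: the quadratic loss below is hypothetical, chosen only so the minimizer is known in advance; `jax.grad` supplies $\nabla L(\theta)$ and the update follows the rule above.

```python
import jax
import jax.numpy as jnp

# Hypothetical quadratic loss, minimized at theta = [1, 1, 1].
def loss(theta):
    return jnp.sum((theta - 1.0)**2)

eta = 0.1                  # learning rate
theta = jnp.zeros(3)

grad_loss = jax.grad(loss)  # computes the gradient of L by AD
for _ in range(100):
    theta = theta - eta * grad_loss(theta)   # theta <- theta - eta * grad L(theta)

print(theta)   # ~ [1. 1. 1.]
```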
## Directional Derivatives
A partial derivative changes one coordinate at a time. A directional derivative changes several coordinates together.
Let

$$v \in \mathbb{R}^n$$

be a direction vector. The directional derivative of $f$ at $x$ in direction $v$ is

$$D_v f(x) = \lim_{h \to 0} \frac{f(x + hv) - f(x)}{h}$$

For a differentiable scalar function, this equals

$$D_v f(x) = \nabla f(x) \cdot v$$

This formula says that the directional derivative is the inner product of the gradient with the direction $v$: the projection of the gradient onto that direction.

Forward mode AD computes this kind of quantity naturally. Given a seed direction $v$, forward mode propagates how every intermediate value changes when the input moves along $v$.
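A short JAX sketch of this behavior: `jax.jvp` evaluates $f$ and the directional derivative in a single forward pass, and the result matches the gradient formula. The function here is a simple stand-in.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x**2)            # scalar function of a vector input

x = jnp.array([1.0, 2.0])
v = jnp.array([3.0, -1.0])          # seed direction

# One forward pass gives both f(x) and the directional derivative D_v f(x).
y, dv = jax.jvp(f, (x,), (v,))

# Agrees with grad f(x) . v = (2x) . v = 2*3 + 4*(-1) = 2.
print(dv)                            # 2.0
print(jnp.dot(jax.grad(f)(x), v))    # 2.0
```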
## Differential

The differential is the linear map that best describes local change.

For

$$f: \mathbb{R}^n \to \mathbb{R},$$

the differential at $x$ is a linear map

$$df(x): \mathbb{R}^n \to \mathbb{R}$$

such that

$$f(x + \Delta x) = f(x) + df(x)[\Delta x] + o(\|\Delta x\|)$$

In coordinates,

$$df(x)[\Delta x] = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\, \Delta x_i$$
The distinction between gradient and differential matters. The differential is naturally a linear map from input perturbations to output perturbations. The gradient is a vector representation of that linear map after choosing an inner product.
AD is fundamentally about propagating differentials. In code and machine learning libraries, this often appears as vectors and tensors. Mathematically, the clean object is the local linear map.
## Vector-Valued Functions

For a vector-valued function

$$f: \mathbb{R}^n \to \mathbb{R}^m,$$

the output has $m$ components:

$$f(x) = (f_1(x), f_2(x), \dots, f_m(x))$$

Each component has its own gradient. Stacking these gradients gives the Jacobian matrix:

$$J(x) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}$$

Equivalently,

$$J_{ij}(x) = \frac{\partial f_i}{\partial x_j}(x)$$

The Jacobian represents the differential of a vector-valued function:

$$df(x)[\Delta x] = J(x)\, \Delta x$$

So a small input perturbation $\Delta x$ gives the first-order approximation

$$f(x + \Delta x) \approx f(x) + J(x)\, \Delta x$$
This is the local linear model used throughout AD.
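The following JAX sketch builds the full Jacobian of a small stand-in function with `jax.jacobian` and checks the first-order approximation against an exact evaluation; for large problems one would compute JVPs or VJPs instead of materializing $J$.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[0])])   # f: R^2 -> R^2

x = jnp.array([1.0, 2.0])
dx = jnp.array([1e-3, -2e-3])                        # small perturbation

J = jax.jacobian(f)(x)          # 2x2 Jacobian matrix at x
approx = f(x) + J @ dx          # local linear model
exact = f(x + dx)

# The residual is O(||dx||^2), as the first-order model predicts.
print(jnp.max(jnp.abs(exact - approx)))
```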
## Jacobian-Vector Products

A Jacobian-vector product, usually abbreviated JVP, has the form

$$J(x)\, v, \qquad v \in \mathbb{R}^n$$
It answers the question:
If the input moves in direction $v$, how does the output move?
Forward mode AD computes JVPs efficiently. It does not need to materialize the full Jacobian. It can propagate the pair

$$(u, \dot{u})$$

of each intermediate value and its tangent through each primitive operation.

For example, suppose

$$y = a \cdot b$$

If $a$ has tangent $\dot{a}$, and $b$ has tangent $\dot{b}$, then

$$\dot{y} = \dot{a}\, b + a\, \dot{b}$$
This is simply the product rule written in tangent form.
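In JAX this tangent-form product rule can be observed directly with `jax.jvp`; the values and tangents below are arbitrary illustrative numbers.

```python
import jax

def prod(a, b):
    return a * b

# Primal values and their tangents.
a, b = 2.0, 5.0
a_dot, b_dot = 1.0, -3.0

# Forward mode propagates (value, tangent) pairs through the product.
y, y_dot = jax.jvp(prod, (a, b), (a_dot, b_dot))

print(y_dot)    # a_dot*b + a*b_dot = 1*5 + 2*(-3) = -1.0
```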
## Vector-Jacobian Products

A vector-Jacobian product, usually abbreviated VJP, has the form

$$w^{\top} J(x),$$

where

$$w \in \mathbb{R}^m$$

It answers the reverse question:

If the output is weighted by $w$, how does that weighted output depend on the inputs?
Reverse mode AD computes VJPs efficiently. It propagates adjoints backward from outputs to inputs.
For a scalar loss

$$L: \mathbb{R}^n \to \mathbb{R},$$

the Jacobian has one row:

$$J(x) = \nabla L(x)^{\top}$$

A reverse pass with seed $w = 1$ computes this row, which is the transpose of the usual gradient vector.
This is why reverse mode is the standard method for training neural networks. One reverse pass gives derivatives with respect to many parameters for one scalar objective.
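A minimal JAX sketch: `jax.vjp` returns the output together with a function that maps an output weighting $w$ to $w^{\top} J(x)$, without forming the Jacobian. The function and the seed vector are illustrative.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], x[0] + x[1]])   # f: R^2 -> R^2

x = jnp.array([3.0, 4.0])
w = jnp.array([1.0, 0.0])        # weighting over the two outputs

# Reverse mode: vjp returns f(x) and a function computing w^T J(x).
y, vjp_fn = jax.vjp(f, x)
(wJ,) = vjp_fn(w)

# w selects the first row of J: [d(x0*x1)/dx0, d(x0*x1)/dx1] = [4., 3.]
print(wJ)
```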
## Hessian

For a scalar function

$$f: \mathbb{R}^n \to \mathbb{R},$$

the Hessian collects all second partial derivatives:

$$H_{ij}(x) = \frac{\partial^2 f}{\partial x_i\, \partial x_j}(x)$$

The Hessian describes local curvature. The first-order approximation uses the gradient:

$$f(x + \Delta x) \approx f(x) + \nabla f(x)^{\top} \Delta x$$

The second-order approximation adds curvature:

$$f(x + \Delta x) \approx f(x) + \nabla f(x)^{\top} \Delta x + \tfrac{1}{2}\, \Delta x^{\top} H(x)\, \Delta x$$

Second-order optimization methods use this curvature information, but full Hessians are often too expensive to store. For large systems, AD is frequently used to compute Hessian-vector products instead:

$$H(x)\, v$$
This is enough for many iterative optimization methods.
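One common recipe is forward-over-reverse: apply `jax.jvp` to `jax.grad(f)` to differentiate the gradient along $v$. The sketch below uses a small stand-in function, and the `hvp` helper is illustrative, not a library API.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x**3)            # scalar function with nontrivial curvature

# Forward-over-reverse: differentiate the gradient along direction v.
# This computes H(x) v without ever materializing the full Hessian.
def hvp(f, x, v):
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([1.0, 2.0])
v = jnp.array([1.0, 0.0])

print(hvp(f, x, v))                 # H = diag(6x), so H v = [6., 0.]
```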
## Local Linearity
The main idea of multivariate calculus for AD is local linearity.
A differentiable function may be nonlinear globally, but near a point it behaves approximately like a linear map. AD computes that local linear map, or useful products involving it.
For a program

```
x -> intermediate values -> y
```

AD computes how perturbations in $x$ affect perturbations in $y$. Forward mode pushes perturbations forward. Reverse mode pulls sensitivities backward.
The key objects are:
| Object | Function shape | Meaning |
|---|---|---|
| Gradient $\nabla f$ | $f: \mathbb{R}^n \to \mathbb{R}$ | sensitivity of one scalar output to many inputs |
| Jacobian $J$ | $f: \mathbb{R}^n \to \mathbb{R}^m$ | local linear map from input changes to output changes |
| JVP $J v$ | $f: \mathbb{R}^n \to \mathbb{R}^m$ | output change caused by one input direction |
| VJP $w^{\top} J$ | $f: \mathbb{R}^n \to \mathbb{R}^m$ | input sensitivity induced by one output weighting |
| Hessian $H$ | $f: \mathbb{R}^n \to \mathbb{R}$ | second-order curvature of a scalar function |
| Hessian-vector product $H v$ | $f: \mathbb{R}^n \to \mathbb{R}$ | curvature along one direction |
Automatic differentiation is the algorithmic realization of these objects for programs. It avoids symbolic expansion and avoids finite-difference approximation. It evaluates the program and propagates exact local derivative rules through its executed computation.