Multivariate Calculus

Automatic differentiation is usually applied to functions with many inputs and many outputs. The calculus needed for this setting is multivariate calculus: the study of how a quantity changes when several variables change at once.

A scalar function of several variables has the form

f : \mathbb{R}^n \to \mathbb{R}

For example,

f(x_1, x_2) = x_1^2 + 3x_1x_2 + \sin x_2

depends on two independent input directions. We can ask how f changes when only x_1 changes, when only x_2 changes, or when both change together.

Partial Derivatives

A partial derivative measures change along one input coordinate while holding the others fixed.

For

f(x_1, x_2) = x_1^2 + 3x_1x_2 + \sin x_2

the partial derivative with respect to x_1 is

\frac{\partial f}{\partial x_1} = 2x_1 + 3x_2

The partial derivative with respect to x_2 is

\frac{\partial f}{\partial x_2} = 3x_1 + \cos x_2

Each partial derivative is itself a function. At a concrete input, such as x_1 = 2, x_2 = 1, each partial derivative becomes a number.

Partial derivatives are the entries from which gradients, Jacobians, and Hessians are built.
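
As a quick check, the partials above can be evaluated with an automatic differentiation library. The following is a minimal sketch assuming JAX (the text does not name a library); jax.grad with argnums picks which argument to differentiate.

```python
import jax
import jax.numpy as jnp

def f(x1, x2):
    return x1**2 + 3 * x1 * x2 + jnp.sin(x2)

# argnums selects the input coordinate to differentiate with respect to.
df_dx1 = jax.grad(f, argnums=0)   # 2*x1 + 3*x2
df_dx2 = jax.grad(f, argnums=1)   # 3*x1 + cos(x2)

print(df_dx1(2.0, 1.0))   # 7.0
print(df_dx2(2.0, 1.0))   # 6 + cos(1) ≈ 6.5403
```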

Gradient

For a scalar-valued function

f : \mathbb{R}^n \to \mathbb{R}

the gradient collects all first partial derivatives into a vector:

\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \frac{\partial f}{\partial x_2}(x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(x) \end{bmatrix}

The gradient points in the direction of steepest local increase under the usual Euclidean geometry. Its negative points in the direction of steepest local decrease.

This is why gradients drive optimization. If a loss function L(\theta) measures the error of a model with parameters \theta, then gradient descent updates parameters by moving against the gradient:

\theta_{k+1} = \theta_k - \eta \nabla L(\theta_k)

Here, \eta is the learning rate.

AD gives a way to compute \nabla L(\theta) efficiently, even when \theta has millions of components.
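
A minimal sketch of that update rule, assuming JAX and a made-up quadratic loss (both are illustrative choices, not part of the text above):

```python
import jax
import jax.numpy as jnp

def L(theta):
    # Hypothetical loss: squared distance from the point (3, 3, ..., 3).
    return jnp.sum((theta - 3.0) ** 2)

grad_L = jax.grad(L)        # reverse-mode gradient of the scalar loss
eta = 0.1                   # learning rate
theta = jnp.zeros(5)

for _ in range(100):
    theta = theta - eta * grad_L(theta)   # theta_{k+1} = theta_k - eta * grad L(theta_k)

print(theta)   # approaches [3. 3. 3. 3. 3.]
```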

Directional Derivatives

A partial derivative measures change along one coordinate at a time. A directional derivative measures change when several coordinates move together.

Let

v \in \mathbb{R}^n

be a direction vector. The directional derivative of f at x in direction v is

D f(x)[v] = \lim_{\epsilon \to 0} \frac{f(x + \epsilon v) - f(x)}{\epsilon}

For a differentiable scalar function, this equals

D f(x)[v] = \nabla f(x)^T v

This formula says that the directional derivative is the inner product of the gradient with the direction v; when v is a unit vector, it is the projection of the gradient onto that direction.

Forward mode AD computes this kind of quantity naturally. Given a seed direction v, forward mode propagates how every intermediate value changes when the input moves along v.
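
For instance, in a hedged JAX sketch (an illustrative choice of library), jax.jvp evaluates the directional derivative directly from the program, and the result matches the gradient formula above.

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0]**2 + 3 * x[0] * x[1] + jnp.sin(x[1])

x = jnp.array([2.0, 1.0])
v = jnp.array([1.0, -1.0])   # seed direction

# Forward mode: propagate (value, tangent) through the program.
value, dir_deriv = jax.jvp(f, (x,), (v,))

print(dir_deriv)                     # directional derivative D f(x)[v]
print(jnp.dot(jax.grad(f)(x), v))    # same number via grad f(x)^T v
```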

Differential

The differential is the linear map that best describes local change.

For

f : \mathbb{R}^n \to \mathbb{R}

the differential at x is a linear map

df_x : \mathbb{R}^n \to \mathbb{R}

such that

df_x(v) = Df(x)[v]

In coordinates,

df_x(v) = \nabla f(x)^T v

The distinction between gradient and differential matters. The differential is naturally a linear map from input perturbations to output perturbations. The gradient is a vector representation of that linear map after choosing an inner product.

AD is fundamentally about propagating differentials. In code and machine learning libraries, this often appears as vectors and tensors. Mathematically, the clean object is the local linear map.
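
One way to see the differential as an honest linear map in code is jax.linearize, which returns df_x as a callable. This is a sketch assuming JAX; the function f is the earlier two-variable example.

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0]**2 + 3 * x[0] * x[1] + jnp.sin(x[1])

x = jnp.array([2.0, 1.0])
value, df_x = jax.linearize(f, x)   # df_x is the linear map v -> grad f(x)^T v

v = jnp.array([0.5, 2.0])
print(df_x(v))                       # differential applied to a perturbation
print(jnp.dot(jax.grad(f)(x), v))    # the same number via the gradient vector
```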

Vector-Valued Functions

For a vector-valued function

f : \mathbb{R}^n \to \mathbb{R}^m

the output has m components:

f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_m(x) \end{bmatrix}

Each component f_i has its own gradient. Stacking these gradients gives the Jacobian matrix:

J_f(x) = \begin{bmatrix} \nabla f_1(x)^T \\ \nabla f_2(x)^T \\ \vdots \\ \nabla f_m(x)^T \end{bmatrix}

Equivalently,

[J_f(x)]_{ij} = \frac{\partial f_i}{\partial x_j}(x)

The Jacobian represents the differential of a vector-valued function:

df_x(v) = J_f(x)v

So a small input perturbation \Delta x gives the first-order approximation

f(x + \Delta x) \approx f(x) + J_f(x)\Delta x

This is the local linear model used throughout AD.
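
A short sketch of this local linear model, assuming JAX and a small made-up function from R^3 to R^2:

```python
import jax
import jax.numpy as jnp

def f(x):
    # Made-up map R^3 -> R^2 for illustration.
    return jnp.array([x[0] * x[1], jnp.sin(x[2]) + x[0]])

x = jnp.array([1.0, 2.0, 0.5])
J = jax.jacobian(f)(x)              # 2 x 3 Jacobian matrix

dx = jnp.array([1e-3, -2e-3, 5e-4])
print(f(x + dx))                     # exact value after the perturbation
print(f(x) + J @ dx)                 # first-order model: close for small dx
```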

Jacobian-Vector Products

A Jacobian-vector product, usually abbreviated JVP, has the form

J_f(x)v

It answers the question:

If the input moves in direction v, how does the output move?

Forward mode AD computes JVPs efficiently. It does not need to materialize the full Jacobian. It can propagate the pair

(\text{value}, \text{tangent})

through each primitive operation.

For example, suppose

z = xy

If x has tangent \dot{x} and y has tangent \dot{y}, then

\dot{z} = \dot{x}y + x\dot{y}

This is simply the product rule written in tangent form.
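
The same rule, written as a JVP in a hedged JAX sketch:

```python
import jax

def prod(x, y):
    return x * y

x, y = 3.0, 4.0
x_dot, y_dot = 1.0, 2.0   # input tangents

# jax.jvp pushes the tangents through the product rule.
_, z_dot = jax.jvp(prod, (x, y), (x_dot, y_dot))
print(z_dot)   # x_dot*y + x*y_dot = 1*4 + 3*2 = 10.0
```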

Vector-Jacobian Products

A vector-Jacobian product, usually abbreviated VJP, has the form

u^T J_f(x)

where

u \in \mathbb{R}^m

It answers the reverse question:

If the output is weighted by u, how does that weighted output depend on the inputs?

Reverse mode AD computes VJPs efficiently. It propagates adjoints backward from outputs to inputs.

For a scalar loss

L : \mathbb{R}^n \to \mathbb{R}

the Jacobian has one row:

J_L(x) = \begin{bmatrix} \frac{\partial L}{\partial x_1} & \cdots & \frac{\partial L}{\partial x_n} \end{bmatrix}

A reverse pass with seed 1 computes this row, which is the transpose of the usual gradient vector.

This is why reverse mode is the standard method for training neural networks. One reverse pass gives derivatives with respect to many parameters for one scalar objective.
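
A minimal sketch, assuming JAX and a made-up scalar loss: seeding the single output with 1.0 makes the VJP return the gradient.

```python
import jax
import jax.numpy as jnp

def L(x):
    # Hypothetical scalar loss on R^3.
    return jnp.sum(x**2) + jnp.sin(x[0])

x = jnp.array([1.0, 2.0, 3.0])
value, vjp_fn = jax.vjp(L, x)   # vjp_fn computes u -> u^T J_L(x)

(row,) = vjp_fn(1.0)            # reverse pass with seed 1
print(row)                      # equals the gradient
print(jax.grad(L)(x))
```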

Hessian

For a scalar function

f : \mathbb{R}^n \to \mathbb{R}

the Hessian collects all second partial derivatives:

H_f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n \partial x_n} \end{bmatrix}

The Hessian describes local curvature. The first-order approximation uses the gradient:

f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x

The second-order approximation adds curvature:

f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2}\Delta x^T H_f(x)\Delta x

Second-order optimization methods use this curvature information, but full Hessians are often too expensive to store. For large systems, AD is frequently used to compute Hessian-vector products instead:

H_f(x)v

This is enough for many iterative optimization methods.
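
One common way to get Hessian-vector products from AD is forward-over-reverse: differentiate the gradient along a direction. A sketch assuming JAX and a made-up scalar function:

```python
import jax
import jax.numpy as jnp

def f(x):
    # Made-up scalar function for illustration.
    return jnp.sum(x**3) + x[0] * x[1]

def hvp(f, x, v):
    # H_f(x) v = d/d(eps) grad f(x + eps v) at eps = 0, i.e. a JVP of grad f.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, -1.0])

print(hvp(f, x, v))                 # Hessian-vector product, no full Hessian
print(jax.hessian(f)(x) @ v)        # same result, but materializes the Hessian
```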

Local Linearity

The main idea of multivariate calculus for AD is local linearity.

A differentiable function may be nonlinear globally, but near a point it behaves approximately like a linear map. AD computes that local linear map, or useful products involving it.

For a program

x -> intermediate values -> y

AD computes how perturbations in x affect perturbations in y. Forward mode pushes perturbations forward. Reverse mode pulls sensitivities backward.

The key objects are:

| Object | Form | Meaning |
| --- | --- | --- |
| Gradient | \mathbb{R}^n \to \mathbb{R} | sensitivity of one scalar output to many inputs |
| Jacobian | \mathbb{R}^n \to \mathbb{R}^m | local linear map from input changes to output changes |
| JVP | J_f(x)v | output change caused by one input direction |
| VJP | u^T J_f(x) | input sensitivity induced by one output weighting |
| Hessian | \mathbb{R}^n \to \mathbb{R} | second-order curvature of a scalar function |
| Hessian-vector product | H_f(x)v | curvature along one direction |

Automatic differentiation is the algorithmic realization of these objects for programs. It avoids symbolic expansion and avoids finite-difference approximation. It evaluates the program and propagates exact local derivative rules through its executed computation.