Multivariate Calculus

Automatic differentiation is usually applied to functions with many inputs and many outputs. The calculus needed for this setting is multivariate calculus: the study of how a quantity changes when several variables change at once.

A scalar function of several variables has the form

f : \mathbb{R}^n \to \mathbb{R}

For example,

f(x_1, x_2) = x_1^2 + 3x_1x_2 + \sin x_2

depends on two independent input directions. We can ask how f changes when only x_1 changes, when only x_2 changes, or when both change together.

Partial Derivatives

A partial derivative measures change along one input coordinate while holding the others fixed.

For

f(x_1, x_2) = x_1^2 + 3x_1x_2 + \sin x_2

the partial derivative with respect to x_1 is

\frac{\partial f}{\partial x_1} = 2x_1 + 3x_2

The partial derivative with respect to x_2 is

\frac{\partial f}{\partial x_2} = 3x_1 + \cos x_2

Each partial derivative is itself a function. At a concrete input, such as x_1 = 2, x_2 = 1, each partial derivative becomes a number.

Partial derivatives are the entries from which gradients, Jacobians, and Hessians are built.
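
As a quick check, the partials above can be evaluated with an automatic differentiation library. The following is a minimal sketch assuming JAX (the text does not name a library); jax.grad with argnums picks which argument to differentiate.

```python
import jax
import jax.numpy as jnp

def f(x1, x2):
    return x1**2 + 3 * x1 * x2 + jnp.sin(x2)

# argnums selects the input coordinate to differentiate with respect to.
df_dx1 = jax.grad(f, argnums=0)   # 2*x1 + 3*x2
df_dx2 = jax.grad(f, argnums=1)   # 3*x1 + cos(x2)

print(df_dx1(2.0, 1.0))   # 7.0
print(df_dx2(2.0, 1.0))   # 6 + cos(1) ≈ 6.5403
```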

Gradient

For a scalar-valued function

f : \mathbb{R}^n \to \mathbb{R}

the gradient collects all first partial derivatives into a vector:

\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \frac{\partial f}{\partial x_2}(x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(x) \end{bmatrix}

The gradient points in the direction of steepest local increase under the usual Euclidean geometry. Its negative points in the direction of steepest local decrease.

This is why gradients drive optimization. If a loss function L(\theta) measures the error of a model with parameters \theta, then gradient descent updates parameters by moving against the gradient:

\theta_{k+1} = \theta_k - \eta \nabla L(\theta_k)

Here, \eta is the learning rate.

AD gives a way to compute \nabla L(\theta) efficiently, even when \theta has millions of components.
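
A minimal sketch of that update rule, assuming JAX and a made-up quadratic loss (both are illustrative choices, not part of the text above):

```python
import jax
import jax.numpy as jnp

def L(theta):
    # Hypothetical loss: squared distance from the point (3, 3, ..., 3).
    return jnp.sum((theta - 3.0) ** 2)

grad_L = jax.grad(L)        # reverse-mode gradient of the scalar loss
eta = 0.1                   # learning rate
theta = jnp.zeros(5)

for _ in range(100):
    theta = theta - eta * grad_L(theta)   # theta_{k+1} = theta_k - eta * grad L(theta_k)

print(theta)   # approaches [3. 3. 3. 3. 3.]
```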

Directional Derivatives

A partial derivative measures change along one coordinate at a time. A directional derivative measures change when several coordinates move together.

Let

v \in \mathbb{R}^n

be a direction vector. The directional derivative of f at x in direction v is

D f(x)[v] = \lim_{\epsilon \to 0} \frac{f(x + \epsilon v) - f(x)}{\epsilon}

For a differentiable scalar function, this equals

D f(x)[v] = \nabla f(x)^T v

This formula says that the directional derivative is the inner product of the gradient with the direction v; when v is a unit vector, it is the projection of the gradient onto that direction.

Forward mode AD computes this kind of quantity naturally. Given a seed direction v, forward mode propagates how every intermediate value changes when the input moves along v.
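
For instance, in a hedged JAX sketch (an illustrative choice of library), jax.jvp evaluates the directional derivative directly from the program, and the result matches the gradient formula above.

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0]**2 + 3 * x[0] * x[1] + jnp.sin(x[1])

x = jnp.array([2.0, 1.0])
v = jnp.array([1.0, -1.0])   # seed direction

# Forward mode: propagate (value, tangent) through the program.
value, dir_deriv = jax.jvp(f, (x,), (v,))

print(dir_deriv)                     # directional derivative D f(x)[v]
print(jnp.dot(jax.grad(f)(x), v))    # same number via grad f(x)^T v
```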

Differential

The differential is the linear map that best describes local change.

For

f : \mathbb{R}^n \to \mathbb{R}

the differential at x is a linear map

df_x : \mathbb{R}^n \to \mathbb{R}

such that

df_x(v) = Df(x)[v]

In coordinates,

df_x(v) = \nabla f(x)^T v

The distinction between gradient and differential matters. The differential is naturally a linear map from input perturbations to output perturbations. The gradient is a vector representation of that linear map after choosing an inner product.

AD is fundamentally about propagating differentials. In code and machine learning libraries, this often appears as vectors and tensors. Mathematically, the clean object is the local linear map.
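
One way to see the differential as an honest linear map in code is jax.linearize, which returns df_x as a callable. This is a sketch assuming JAX; the function f is the earlier two-variable example.

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0]**2 + 3 * x[0] * x[1] + jnp.sin(x[1])

x = jnp.array([2.0, 1.0])
value, df_x = jax.linearize(f, x)   # df_x is the linear map v -> grad f(x)^T v

v = jnp.array([0.5, 2.0])
print(df_x(v))                       # differential applied to a perturbation
print(jnp.dot(jax.grad(f)(x), v))    # the same number via the gradient vector
```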

Vector-Valued Functions

For a vector-valued function

f : \mathbb{R}^n \to \mathbb{R}^m

the output has m components:

f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_m(x) \end{bmatrix}

Each component f_i has its own gradient. Stacking these gradients gives the Jacobian matrix:

J_f(x) = \begin{bmatrix} \nabla f_1(x)^T \\ \nabla f_2(x)^T \\ \vdots \\ \nabla f_m(x)^T \end{bmatrix}

Equivalently,

[J_f(x)]_{ij} = \frac{\partial f_i}{\partial x_j}(x)

The Jacobian represents the differential of a vector-valued function:

df_x(v) = J_f(x)v

So a small input perturbation \Delta x gives the first-order approximation

f(x + \Delta x) \approx f(x) + J_f(x)\Delta x

This is the local linear model used throughout AD.
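
A short sketch of this local linear model, assuming JAX and a small made-up function from R^3 to R^2:

```python
import jax
import jax.numpy as jnp

def f(x):
    # Made-up map R^3 -> R^2 for illustration.
    return jnp.array([x[0] * x[1], jnp.sin(x[2]) + x[0]])

x = jnp.array([1.0, 2.0, 0.5])
J = jax.jacobian(f)(x)              # 2 x 3 Jacobian matrix

dx = jnp.array([1e-3, -2e-3, 5e-4])
print(f(x + dx))                     # exact value after the perturbation
print(f(x) + J @ dx)                 # first-order model: close for small dx
```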

Jacobian-Vector Products

A Jacobian-vector product, usually abbreviated JVP, has the form

J_f(x)v

It answers the question:

If the input moves in direction v, how does the output move?

Forward mode AD computes JVPs efficiently. It does not need to materialize the full Jacobian. It can propagate the pair

(\text{value}, \text{tangent})

through each primitive operation.

For example, suppose

z = xy

If x has tangent \dot{x} and y has tangent \dot{y}, then

\dot{z} = \dot{x}y + x\dot{y}

This is simply the product rule written in tangent form.
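
The same rule, written as a JVP in a hedged JAX sketch:

```python
import jax

def prod(x, y):
    return x * y

x, y = 3.0, 4.0
x_dot, y_dot = 1.0, 2.0   # input tangents

# jax.jvp pushes the tangents through the product rule.
_, z_dot = jax.jvp(prod, (x, y), (x_dot, y_dot))
print(z_dot)   # x_dot*y + x*y_dot = 1*4 + 3*2 = 10.0
```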

Vector-Jacobian Products

A vector-Jacobian product, usually abbreviated VJP, has the form

u^T J_f(x)

where

u \in \mathbb{R}^m

It answers the reverse question:

If the output is weighted by u, how does that weighted output depend on the inputs?

Reverse mode AD computes VJPs efficiently. It propagates adjoints backward from outputs to inputs.

For a scalar loss

L : \mathbb{R}^n \to \mathbb{R}

the Jacobian has one row:

J_L(x) = \begin{bmatrix} \frac{\partial L}{\partial x_1} & \cdots & \frac{\partial L}{\partial x_n} \end{bmatrix}

A reverse pass with seed 1 computes this row, which is the transpose of the usual gradient vector.

This is why reverse mode is the standard method for training neural networks. One reverse pass gives derivatives with respect to many parameters for one scalar objective.
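
A minimal sketch, assuming JAX and a made-up scalar loss: seeding the single output with 1.0 makes the VJP return the gradient.

```python
import jax
import jax.numpy as jnp

def L(x):
    # Hypothetical scalar loss on R^3.
    return jnp.sum(x**2) + jnp.sin(x[0])

x = jnp.array([1.0, 2.0, 3.0])
value, vjp_fn = jax.vjp(L, x)   # vjp_fn computes u -> u^T J_L(x)

(row,) = vjp_fn(1.0)            # reverse pass with seed 1
print(row)                      # equals the gradient
print(jax.grad(L)(x))
```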

Hessian

For a scalar function

f : \mathbb{R}^n \to \mathbb{R}

the Hessian collects all second partial derivatives:

H_f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n \partial x_n} \end{bmatrix}

The Hessian describes local curvature. The first-order approximation uses the gradient:

f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x

The second-order approximation adds curvature:

f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2}\Delta x^T H_f(x)\Delta x

Second-order optimization methods use this curvature information, but full Hessians are often too expensive to store. For large systems, AD is frequently used to compute Hessian-vector products instead:

H_f(x)v

This is enough for many iterative optimization methods.
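
One common way to get Hessian-vector products from AD is forward-over-reverse: differentiate the gradient along a direction. A sketch assuming JAX and a made-up scalar function:

```python
import jax
import jax.numpy as jnp

def f(x):
    # Made-up scalar function for illustration.
    return jnp.sum(x**3) + x[0] * x[1]

def hvp(f, x, v):
    # H_f(x) v = d/d(eps) grad f(x + eps v) at eps = 0, i.e. a JVP of grad f.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, -1.0])

print(hvp(f, x, v))                 # Hessian-vector product, no full Hessian
print(jax.hessian(f)(x) @ v)        # same result, but materializes the Hessian
```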

Local Linearity

The main idea of multivariate calculus for AD is local linearity.

A differentiable function may be nonlinear globally, but near a point it behaves approximately like a linear map. AD computes that local linear map, or useful products involving it.

For a program

x -> intermediate values -> y

AD computes how perturbations in x affect perturbations in y. Forward mode pushes perturbations forward. Reverse mode pulls sensitivities backward.

The key objects are:

| Object | Form | Meaning |
| --- | --- | --- |
| Gradient | \mathbb{R}^n \to \mathbb{R} | sensitivity of one scalar output to many inputs |
| Jacobian | \mathbb{R}^n \to \mathbb{R}^m | local linear map from input changes to output changes |
| JVP | J_f(x)v | output change caused by one input direction |
| VJP | u^T J_f(x) | input sensitivity induced by one output weighting |
| Hessian | \mathbb{R}^n \to \mathbb{R} | second-order curvature of a scalar function |
| Hessian-vector product | H_f(x)v | curvature along one direction |

Automatic differentiation is the algorithmic realization of these objects for programs. It avoids symbolic expansion and avoids finite-difference approximation. It evaluates the program and propagates exact local derivative rules through its executed computation.