Functions and Mappings
Automatic differentiation begins with a simple object: a function.
A function maps an input value to an output value. In elementary calculus, this is often written as

$$f : \mathbb{R} \to \mathbb{R},$$

meaning that $f$ takes one real number and returns one real number. For example,

$$f(x) = x^2$$

takes a scalar input and returns a scalar output.
In automatic differentiation, this scalar case is useful but too small. Most real programs do not compute one number from one number. They compute many outputs from many inputs. A neural network maps parameters and data to predictions. A physics simulator maps an initial state to a later state. An optimization routine maps design variables to an objective value. A renderer maps scene parameters to pixels.
The more general form is

$$f : \mathbb{R}^n \to \mathbb{R}^m.$$

Here, the input is a vector with $n$ real components, and the output is a vector with $m$ real components.

Each output component $f_i$ is itself a scalar-valued function of the full input vector.
Scalar Functions
A scalar function has one scalar output:

$$f : \mathbb{R}^n \to \mathbb{R}.$$
This form appears often in optimization. A loss function, cost function, energy function, or objective function usually maps many input variables to one number.
For example,

$$f(x_1, x_2) = x_1^2 + \sin(x_2)$$

has two inputs and one output.
The derivative of such a function is the gradient:

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right).$$
The gradient tells how the scalar output changes as each input component changes. In machine learning, reverse mode AD is powerful precisely because many training problems have this shape:

$$f : \mathbb{R}^n \to \mathbb{R},$$

where $n$ may be millions or billions.
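As a small sketch, here is the two-input example above written in Python, with its gradient filled in by hand; the names f and grad_f and the evaluation point are illustrative choices, not part of the text.

```python
import math

def f(x1, x2):
    # Scalar function with two inputs and one output: f(x1, x2) = x1^2 + sin(x2)
    return x1 ** 2 + math.sin(x2)

def grad_f(x1, x2):
    # Gradient written out by hand:
    #   df/dx1 = 2*x1,  df/dx2 = cos(x2)
    return (2 * x1, math.cos(x2))

print(f(1.0, 0.5))       # one scalar output
print(grad_f(1.0, 0.5))  # one partial derivative per input component
```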
Vector Functions
A vector function has multiple outputs:

$$f : \mathbb{R}^n \to \mathbb{R}^m, \qquad m > 1.$$
For example,

$$f(x_1, x_2) = \bigl( x_1 x_2,\ \sin(x_1),\ x_1 + x_2 \bigr)$$

maps two inputs to three outputs.
The derivative of a vector function is the Jacobian matrix:

$$J(x) = \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{pmatrix}.$$
The Jacobian is a local linear approximation of the function. Near a point $x$, a small change $\Delta x$ produces an approximate output change

$$\Delta y \approx J(x)\, \Delta x.$$
This local linear view is central to AD. Automatic differentiation does not manipulate full symbolic expressions. It propagates local linear information through a computation.
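As a sketch of this local linear view, the following NumPy snippet evaluates the three-output example above, builds its Jacobian by hand, and checks that a small input change is mapped approximately by the Jacobian; the point x and perturbation dx are arbitrary illustrative values.

```python
import numpy as np

def f(x):
    # Vector function from R^2 to R^3 (the illustrative example above).
    x1, x2 = x
    return np.array([x1 * x2, np.sin(x1), x1 + x2])

def jacobian(x):
    # Jacobian written out by hand: row i holds the partials of output i.
    x1, x2 = x
    return np.array([
        [x2,         x1 ],   # d(x1*x2)/dx1,   d(x1*x2)/dx2
        [np.cos(x1), 0.0],   # d(sin(x1))/dx1, d(sin(x1))/dx2
        [1.0,        1.0],   # d(x1+x2)/dx1,   d(x1+x2)/dx2
    ])

x = np.array([0.7, -1.2])
dx = np.array([1e-4, -2e-4])      # small input change
dy_linear = jacobian(x) @ dx      # local linear prediction: J(x) dx
dy_exact = f(x + dx) - f(x)       # actual output change
print(dy_linear)
print(dy_exact)                   # the two agree closely for small dx
```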
Programs as Functions
A program can be viewed as a function when its outputs are determined by its inputs.
For example:
```
input: x1, x2
a = x1 * x2
b = sin(a)
c = b + x1
output: c
```

This program computes the mathematical function

$$f(x_1, x_2) = \sin(x_1 x_2) + x_1.$$
AD works on the program structure, not only on the final expression. It differentiates each elementary operation and composes the results.
The program above can be decomposed into smaller mappings:

$$(x_1, x_2) \mapsto a = x_1 x_2, \qquad a \mapsto b = \sin(a), \qquad (b, x_1) \mapsto c = b + x_1.$$
The full function is a composition of these smaller functions. AD exploits exactly this structure.
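A minimal Python sketch of the same decomposition, with each primitive mapping as its own small function (the helper names are illustrative):

```python
import math

def mul(x1, x2):
    return x1 * x2        # (x1, x2) -> a

def sine(a):
    return math.sin(a)    # a -> b

def add(b, x1):
    return b + x1         # (b, x1) -> c

def program(x1, x2):
    # The full function is a composition of the smaller mappings.
    a = mul(x1, x2)
    b = sine(a)
    c = add(b, x1)
    return c

print(program(2.0, 3.0))  # same value as sin(x1 * x2) + x1
```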
Composition
Function composition is the mathematical core of automatic differentiation.
If

$$g : \mathbb{R}^n \to \mathbb{R}^k$$

and

$$f : \mathbb{R}^k \to \mathbb{R}^m,$$

then their composition is

$$f \circ g : \mathbb{R}^n \to \mathbb{R}^m,$$

meaning

$$(f \circ g)(x) = f(g(x)).$$

The derivative of the composition is given by the chain rule:

$$J_{f \circ g}(x) = J_f(g(x))\, J_g(x).$$
AD is a systematic way to apply this rule to programs with many intermediate values. The program may have thousands or millions of primitive operations, but each operation has a simple local derivative rule.
For example:
```
u = x * y
v = sin(u)
z = v + x
```

This is a composition of multiplication, sine, and addition. AD records or transforms this computation so that derivative information flows through the same structure.
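As a sketch, the chain rule can be applied by hand to this composition at a concrete point; the values chosen for x and y are arbitrary.

```python
import math

x, y = 2.0, 3.0

# Forward evaluation of the composition.
u = x * y
v = math.sin(u)
z = v + x

# Local derivative of each primitive operation.
du_dx, du_dy = y, x              # u = x * y
dv_du = math.cos(u)              # v = sin(u)
dz_dv, dz_dx_direct = 1.0, 1.0   # z = v + x

# Chain rule: derivative information flows through the same structure.
dz_dx = dz_dv * dv_du * du_dx + dz_dx_direct
dz_dy = dz_dv * dv_du * du_dy
print(z, dz_dx, dz_dy)   # dz/dx = y*cos(x*y) + 1, dz/dy = x*cos(x*y)
```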
Inputs, Outputs, and Intermediate Values
A useful way to model an AD computation is to distinguish three kinds of variables:
| Kind | Meaning | Example |
|---|---|---|
| Input variables | Values supplied from outside the computation | model parameters, data, initial conditions |
| Intermediate variables | Values produced inside the computation | hidden activations, temporary arithmetic values |
| Output variables | Values returned by the computation | loss, prediction, state update |
Suppose we compute

$$y = \sin(x^2).$$

We can rewrite it as a sequence of primitive operations:

$$u = x^2, \qquad y = \sin(u).$$

The intermediate variable $u$ matters because the derivative propagates through it. The derivative of $y$ with respect to $x$ depends on the derivative of $y$ with respect to $u$, and the derivative of $u$ with respect to $x$.

This decomposition is what makes AD different from symbolic differentiation. AD does not need to expand

$$\frac{d}{dx} \sin(x^2)$$

into

$$2x \cos(x^2).$$

It can differentiate the computation as written.
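A small sketch of differentiating the computation as written, propagating the derivative through the intermediate value instead of expanding a formula (the point x is arbitrary):

```python
import math

x = 1.3

# The computation as written, with one intermediate value.
u = x * x          # u = x^2
y = math.sin(u)    # y = sin(u)

# Differentiate each step and chain the results through u.
du_dx = 2 * x          # local rule for u = x * x
dy_du = math.cos(u)    # local rule for y = sin(u)
dy_dx = dy_du * du_dx

# The expanded symbolic form gives the same number, but AD never builds it.
print(dy_dx, 2 * x * math.cos(x ** 2))
```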
Local Behavior
Derivatives describe local behavior.
For a scalar function

$$f : \mathbb{R} \to \mathbb{R},$$

the derivative $f'(x)$ gives the slope of the function near $x$. For a multivariate function

$$f : \mathbb{R}^n \to \mathbb{R}^m,$$

the Jacobian gives the best local linear approximation.
This means AD is usually evaluated at concrete input values. Given a program and an input $x$, AD computes both the output

$$y = f(x)$$

and derivative information at that same point.
For forward mode, this derivative information moves in the same direction as the computation. For reverse mode, derivative information moves backward from outputs to inputs.
Both modes rely on the same fact: each primitive mapping has a local derivative, and the whole program is built by composing these primitive mappings.
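A minimal sketch of "value and derivative at the same concrete point" for a tiny scalar program; the helper name f_and_dfdx is illustrative.

```python
import math

def f_and_dfdx(x):
    # Evaluate the program and its derivative at the same concrete input x.
    u = x * x
    y = math.sin(u)
    dy_dx = math.cos(u) * 2 * x   # local derivative information, tied to this x
    return y, dy_dx

value, slope = f_and_dfdx(0.8)
print(value, slope)   # both are numbers at x = 0.8, not symbolic formulas
```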
Primitive Operations
An AD system must know how to differentiate primitive operations.
Typical primitive operations include:
| Operation | Example | Local derivative information |
|---|---|---|
| Addition | $w = u + v$ | $\partial w / \partial u = 1$, $\partial w / \partial v = 1$ |
| Multiplication | $w = u v$ | $\partial w / \partial u = v$, $\partial w / \partial v = u$ |
| Division | $w = u / v$ | depends on $u$ and $v$: $\partial w / \partial u = 1/v$, $\partial w / \partial v = -u/v^2$ |
| Sine | $w = \sin(u)$ | $\partial w / \partial u = \cos(u)$ |
| Exponential | $w = \exp(u)$ | $\partial w / \partial u = \exp(u)$ |
| Logarithm | $w = \log(u)$ | $\partial w / \partial u = 1/u$ |
The full derivative of a program is assembled from these local rules.
This is why AD is neither finite differencing nor symbolic algebra. It evaluates the original computation and propagates exact derivative rules for each primitive operation, subject only to floating point arithmetic.
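To see how local rules assemble into a whole-program derivative, here is a deliberately minimal forward-mode-style sketch: a simplified "dual number" pairs each value with a derivative and applies the local rule of every primitive it passes through. It is a sketch under these simplifying assumptions, not a complete AD system, and the class name Dual is illustrative.

```python
import math

class Dual:
    """A value paired with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # Addition rule: d(u + v) = du + dv
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule: d(u * v) = u*dv + v*du
        return Dual(self.value * other.value,
                    self.value * other.deriv + other.value * self.deriv)

def dsin(u):
    # Sine rule: d(sin u) = cos(u) * du
    return Dual(math.sin(u.value), math.cos(u.value) * u.deriv)

# Differentiate z = sin(x * y) + x with respect to x at (x, y) = (2, 3).
x = Dual(2.0, 1.0)   # dx/dx = 1
y = Dual(3.0, 0.0)   # y is held fixed, so dy/dx = 0
z = dsin(x * y) + x
print(z.value, z.deriv)   # z.deriv equals 3*cos(6) + 1
```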
Shape of a Mapping
The shape of a function strongly affects the best AD mode.
| Function shape | Common derivative object | Usually efficient mode |
|---|---|---|
| $f : \mathbb{R} \to \mathbb{R}$ | scalar derivative | either |
| $f : \mathbb{R}^n \to \mathbb{R}$ | gradient | reverse mode |
| $f : \mathbb{R} \to \mathbb{R}^m$ | tangent vector | forward mode |
| $f : \mathbb{R}^n \to \mathbb{R}^m$ | Jacobian | depends on $n$, $m$, and structure |
Reverse mode is efficient when there are many inputs and few outputs. Forward mode is efficient when there are few inputs and many outputs.
This distinction follows directly from the mapping shape. A loss function in deep learning usually has many parameters and one scalar loss, so reverse mode is natural. A simulation depending on a small number of design parameters but producing a large output field may be better suited to forward mode.
Total Functions and Partial Functions
Mathematical functions are often treated as total functions: every input in the domain has a valid output. Programs are more complicated.
A program may fail, diverge, overflow, divide by zero, access invalid memory, or branch into undefined regions. Therefore, many program mappings are better viewed as partial functions.
For example,

$$f(x) = \sqrt{x}$$

is valid only for

$$x \ge 0.$$

Similarly,

$$g(x) = \frac{1}{x}$$

is undefined at

$$x = 0.$$
AD systems must respect domains. A derivative can only be computed where the original computation is valid and sufficiently smooth. Near discontinuities, branch boundaries, or undefined points, derivative values may be misleading or undefined.
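A small illustration of why domains matter, assuming NumPy's convention of returning nan or inf outside the valid region rather than raising an error:

```python
import numpy as np

def f(x):
    return np.sqrt(x)            # valid only for x >= 0

def dfdx(x):
    return 0.5 / np.sqrt(x)      # derivative rule, meaningful only for x > 0

with np.errstate(invalid="ignore", divide="ignore"):
    print(f(4.0), dfdx(4.0))     # inside the domain: 2.0 and 0.25
    print(f(-1.0))               # outside the domain: nan
    print(dfdx(0.0))             # domain boundary: inf, not a useful slope
```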
Smoothness
Automatic differentiation works best for smooth computations.
A smooth function has derivatives that behave regularly. Many elementary functions used in scientific computing are smooth over their valid domains: polynomials, $\sin$, $\cos$, $\exp$, and $\log$, for example.

However, programs often contain non-smooth operations: absolute value, $\max$ and $\min$, ReLU, floor and rounding, and comparison-driven branches.
Some of these functions are differentiable almost everywhere. Others have discrete behavior that does not fit ordinary calculus well.
For example, ReLU is defined as

$$\mathrm{ReLU}(x) = \max(0, x).$$

It has derivative $1$ for $x > 0$ and derivative $0$ for $x < 0$. At $x = 0$, the classical derivative is undefined. Many AD systems assign a conventional value there, often $0$, because practical optimization requires some rule.
This distinction matters. AD computes derivatives of the executed program under the derivative rules supplied by the system. It does not automatically solve every mathematical ambiguity.
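A minimal sketch of a ReLU derivative rule with the conventional choice at zero made explicit:

```python
def relu(x):
    # ReLU(x) = max(0, x)
    return x if x > 0 else 0.0

def relu_deriv(x):
    # 1 for x > 0, 0 for x < 0.
    # The value at exactly x = 0 is a convention chosen by the AD system,
    # not a classical derivative; here it is set to 0.
    return 1.0 if x > 0 else 0.0

for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_deriv(x))
```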
The Central View
For automatic differentiation, a function is not just an input-output relation. It is a computation built from primitive mappings.
The same mathematical function may be represented by different programs. These programs can have different numerical behavior, different memory behavior, and different derivative evaluation costs.
For example,

$$f_1(x) = 1 - \cos(x)$$

and

$$f_2(x) = 2 \sin^2(x / 2)$$

represent the same mathematical function over real numbers. In floating point arithmetic, they may behave differently near certain values, such as $x$ close to $0$. An AD system differentiates the computation it is given, not an abstract idealized formula detached from implementation.
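As a sketch of this effect, the following compares the two illustrative forms above in double precision near zero, where the first loses accuracy to cancellation:

```python
import math

def f1(x):
    return 1.0 - math.cos(x)           # cancels catastrophically for small x

def f2(x):
    return 2.0 * math.sin(x / 2) ** 2  # same mathematical function, rewritten

x = 1e-8
print(f1(x))   # 0.0: the true value is lost to rounding
print(f2(x))   # about 5e-17: close to the true value x**2 / 2
```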
This gives the basic principle for the rest of the book:
Automatic differentiation treats a program as a composed mapping, evaluates that mapping at concrete inputs, and propagates local derivative information through the same structure.