Functions and Mappings
Automatic differentiation begins with a simple object: a function.
A function maps an input value to an output value. In elementary calculus, this is often written as

$$f : \mathbb{R} \to \mathbb{R},$$

meaning that $f$ takes one real number and returns one real number. For example,

$$f(x) = x^2$$

takes a scalar input and returns a scalar output.
In automatic differentiation, this scalar case is useful but too small. Most real programs do not compute one number from one number. They compute many outputs from many inputs. A neural network maps parameters and data to predictions. A physics simulator maps an initial state to a later state. An optimization routine maps design variables to an objective value. A renderer maps scene parameters to pixels.
The more general form is

$$f : \mathbb{R}^n \to \mathbb{R}^m.$$

Here, the input is a vector with $n$ real components, and the output is a vector with $m$ real components.

Each output component $f_i$ is itself a scalar-valued function of the full input vector.
Scalar Functions
A scalar function has one scalar output:

$$f : \mathbb{R}^n \to \mathbb{R}.$$
This form appears often in optimization. A loss function, cost function, energy function, or objective function usually maps many input variables to one number.
For example,

$$f(x_1, x_2) = x_1^2 + \sin(x_2)$$

has two inputs and one output.
The derivative of such a function is the gradient:

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right).$$
The gradient tells how the scalar output changes as each input component changes. In machine learning, reverse mode AD is powerful precisely because many training problems have this shape:

$$f : \mathbb{R}^n \to \mathbb{R},$$

where $n$ may be millions or billions.
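As a small sketch, here is the two-input example above written in Python, with its gradient filled in by hand; the names f and grad_f and the evaluation point are illustrative choices, not part of the text.

```python
import math

def f(x1, x2):
    # Scalar function with two inputs and one output: f(x1, x2) = x1^2 + sin(x2)
    return x1 ** 2 + math.sin(x2)

def grad_f(x1, x2):
    # Gradient written out by hand:
    #   df/dx1 = 2*x1,  df/dx2 = cos(x2)
    return (2 * x1, math.cos(x2))

print(f(1.0, 0.5))       # one scalar output
print(grad_f(1.0, 0.5))  # one partial derivative per input component
```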
Vector Functions
A vector function has multiple outputs:

$$f : \mathbb{R}^n \to \mathbb{R}^m, \qquad m > 1.$$
For example,

$$f(x_1, x_2) = \bigl( x_1 x_2,\ \sin(x_1),\ x_1 + x_2 \bigr)$$

maps two inputs to three outputs.
The derivative of a vector function is the Jacobian matrix:

$$J(x) = \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{pmatrix}.$$
The Jacobian is a local linear approximation of the function. Near a point $x$, a small change $\Delta x$ produces an approximate output change

$$\Delta y \approx J(x)\, \Delta x.$$
This local linear view is central to AD. Automatic differentiation does not manipulate full symbolic expressions. It propagates local linear information through a computation.
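As a sketch of this local linear view, the following NumPy snippet evaluates the three-output example above, builds its Jacobian by hand, and checks that a small input change is mapped approximately by the Jacobian; the point x and perturbation dx are arbitrary illustrative values.

```python
import numpy as np

def f(x):
    # Vector function from R^2 to R^3 (the illustrative example above).
    x1, x2 = x
    return np.array([x1 * x2, np.sin(x1), x1 + x2])

def jacobian(x):
    # Jacobian written out by hand: row i holds the partials of output i.
    x1, x2 = x
    return np.array([
        [x2,         x1 ],   # d(x1*x2)/dx1,   d(x1*x2)/dx2
        [np.cos(x1), 0.0],   # d(sin(x1))/dx1, d(sin(x1))/dx2
        [1.0,        1.0],   # d(x1+x2)/dx1,   d(x1+x2)/dx2
    ])

x = np.array([0.7, -1.2])
dx = np.array([1e-4, -2e-4])      # small input change
dy_linear = jacobian(x) @ dx      # local linear prediction: J(x) dx
dy_exact = f(x + dx) - f(x)       # actual output change
print(dy_linear)
print(dy_exact)                   # the two agree closely for small dx
```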
Programs as Functions
A program can be viewed as a function when its outputs are determined by its inputs.
For example:
```
input: x1, x2
a = x1 * x2
b = sin(a)
c = b + x1
output: c
```

This program computes the mathematical function

$$f(x_1, x_2) = \sin(x_1 x_2) + x_1.$$
AD works on the program structure, not only on the final expression. It differentiates each elementary operation and composes the results.
The program above can be decomposed into smaller mappings:

$$(x_1, x_2) \mapsto a = x_1 x_2, \qquad a \mapsto b = \sin(a), \qquad (b, x_1) \mapsto c = b + x_1.$$
The full function is a composition of these smaller functions. AD exploits exactly this structure.
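A minimal Python sketch of the same decomposition, with each primitive mapping as its own small function (the helper names are illustrative):

```python
import math

def mul(x1, x2):
    return x1 * x2        # (x1, x2) -> a

def sine(a):
    return math.sin(a)    # a -> b

def add(b, x1):
    return b + x1         # (b, x1) -> c

def program(x1, x2):
    # The full function is a composition of the smaller mappings.
    a = mul(x1, x2)
    b = sine(a)
    c = add(b, x1)
    return c

print(program(2.0, 3.0))  # same value as sin(x1 * x2) + x1
```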
Composition
Function composition is the mathematical core of automatic differentiation.
If

$$g : \mathbb{R}^n \to \mathbb{R}^k$$

and

$$f : \mathbb{R}^k \to \mathbb{R}^m,$$

then their composition is

$$f \circ g : \mathbb{R}^n \to \mathbb{R}^m,$$

meaning

$$(f \circ g)(x) = f(g(x)).$$

The derivative of the composition is given by the chain rule:

$$J_{f \circ g}(x) = J_f(g(x))\, J_g(x).$$
AD is a systematic way to apply this rule to programs with many intermediate values. The program may have thousands or millions of primitive operations, but each operation has a simple local derivative rule.
For example:
```
u = x * y
v = sin(u)
z = v + x
```

This is a composition of multiplication, sine, and addition. AD records or transforms this computation so that derivative information flows through the same structure.
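As a sketch, the chain rule can be applied by hand to this composition at a concrete point; the values chosen for x and y are arbitrary.

```python
import math

x, y = 2.0, 3.0

# Forward evaluation of the composition.
u = x * y
v = math.sin(u)
z = v + x

# Local derivative of each primitive operation.
du_dx, du_dy = y, x              # u = x * y
dv_du = math.cos(u)              # v = sin(u)
dz_dv, dz_dx_direct = 1.0, 1.0   # z = v + x

# Chain rule: derivative information flows through the same structure.
dz_dx = dz_dv * dv_du * du_dx + dz_dx_direct
dz_dy = dz_dv * dv_du * du_dy
print(z, dz_dx, dz_dy)   # dz/dx = y*cos(x*y) + 1, dz/dy = x*cos(x*y)
```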
Inputs, Outputs, and Intermediate Values
A useful way to model an AD computation is to distinguish three kinds of variables:
| Kind | Meaning | Example |
|---|---|---|
| Input variables | Values supplied from outside the computation | model parameters, data, initial conditions |
| Intermediate variables | Values produced inside the computation | hidden activations, temporary arithmetic values |
| Output variables | Values returned by the computation | loss, prediction, state update |
Suppose we compute

$$y = \sin(x^2).$$

We can rewrite it as a sequence of primitive operations:

$$u = x^2, \qquad y = \sin(u).$$

The intermediate variable $u$ matters because the derivative propagates through it. The derivative of $y$ with respect to $x$ depends on the derivative of $y$ with respect to $u$, and the derivative of $u$ with respect to $x$.

This decomposition is what makes AD different from symbolic differentiation. AD does not need to expand

$$\frac{d}{dx} \sin(x^2)$$

into

$$2x \cos(x^2).$$

It can differentiate the computation as written.
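A small sketch of differentiating the computation as written, propagating the derivative through the intermediate value instead of expanding a formula (the point x is arbitrary):

```python
import math

x = 1.3

# The computation as written, with one intermediate value.
u = x * x          # u = x^2
y = math.sin(u)    # y = sin(u)

# Differentiate each step and chain the results through u.
du_dx = 2 * x          # local rule for u = x * x
dy_du = math.cos(u)    # local rule for y = sin(u)
dy_dx = dy_du * du_dx

# The expanded symbolic form gives the same number, but AD never builds it.
print(dy_dx, 2 * x * math.cos(x ** 2))
```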
Local Behavior
Derivatives describe local behavior.
For a scalar function

$$f : \mathbb{R} \to \mathbb{R},$$

the derivative $f'(x)$ gives the slope of the function near $x$. For a multivariate function

$$f : \mathbb{R}^n \to \mathbb{R}^m,$$

the Jacobian gives the best local linear approximation.
This means AD is usually evaluated at concrete input values. Given a program and an input $x$, AD computes both the output

$$y = f(x)$$

and derivative information at that same point.
For forward mode, this derivative information moves in the same direction as the computation. For reverse mode, derivative information moves backward from outputs to inputs.
Both modes rely on the same fact: each primitive mapping has a local derivative, and the whole program is built by composing these primitive mappings.
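A minimal sketch of "value and derivative at the same concrete point" for a tiny scalar program; the helper name f_and_dfdx is illustrative.

```python
import math

def f_and_dfdx(x):
    # Evaluate the program and its derivative at the same concrete input x.
    u = x * x
    y = math.sin(u)
    dy_dx = math.cos(u) * 2 * x   # local derivative information, tied to this x
    return y, dy_dx

value, slope = f_and_dfdx(0.8)
print(value, slope)   # both are numbers at x = 0.8, not symbolic formulas
```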
Primitive Operations
An AD system must know how to differentiate primitive operations.
Typical primitive operations include:
| Operation | Example | Local derivative information |
|---|---|---|
| Addition | $w = u + v$ | $\partial w / \partial u = 1$, $\partial w / \partial v = 1$ |
| Multiplication | $w = u v$ | $\partial w / \partial u = v$, $\partial w / \partial v = u$ |
| Division | $w = u / v$ | depends on $u$ and $v$: $\partial w / \partial u = 1/v$, $\partial w / \partial v = -u/v^2$ |
| Sine | $w = \sin(u)$ | $\partial w / \partial u = \cos(u)$ |
| Exponential | $w = \exp(u)$ | $\partial w / \partial u = \exp(u)$ |
| Logarithm | $w = \log(u)$ | $\partial w / \partial u = 1/u$ |
The full derivative of a program is assembled from these local rules.
This is why AD is neither finite differencing nor symbolic algebra. It evaluates the original computation and propagates exact derivative rules for each primitive operation, subject only to floating point arithmetic.
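To see how local rules assemble into a whole-program derivative, here is a deliberately minimal forward-mode-style sketch: a simplified "dual number" pairs each value with a derivative and applies the local rule of every primitive it passes through. It is a sketch under these simplifying assumptions, not a complete AD system, and the class name Dual is illustrative.

```python
import math

class Dual:
    """A value paired with its derivative with respect to one chosen input."""
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # Addition rule: d(u + v) = du + dv
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule: d(u * v) = u*dv + v*du
        return Dual(self.value * other.value,
                    self.value * other.deriv + other.value * self.deriv)

def dsin(u):
    # Sine rule: d(sin u) = cos(u) * du
    return Dual(math.sin(u.value), math.cos(u.value) * u.deriv)

# Differentiate z = sin(x * y) + x with respect to x at (x, y) = (2, 3).
x = Dual(2.0, 1.0)   # dx/dx = 1
y = Dual(3.0, 0.0)   # y is held fixed, so dy/dx = 0
z = dsin(x * y) + x
print(z.value, z.deriv)   # z.deriv equals 3*cos(6) + 1
```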
Shape of a Mapping
The shape of a function strongly affects the best AD mode.
| Function shape | Common derivative object | Usually efficient mode |
|---|---|---|
| $f : \mathbb{R} \to \mathbb{R}$ | scalar derivative | either |
| $f : \mathbb{R}^n \to \mathbb{R}$ | gradient | reverse mode |
| $f : \mathbb{R} \to \mathbb{R}^m$ | tangent vector | forward mode |
| $f : \mathbb{R}^n \to \mathbb{R}^m$ | Jacobian | depends on $n$, $m$, and structure |
Reverse mode is efficient when there are many inputs and few outputs. Forward mode is efficient when there are few inputs and many outputs.
This distinction follows directly from the mapping shape. A loss function in deep learning usually has many parameters and one scalar loss, so reverse mode is natural. A simulation depending on a small number of design parameters but producing a large output field may be better suited to forward mode.
Total Functions and Partial Functions
Mathematical functions are often treated as total functions: every input in the domain has a valid output. Programs are more complicated.
A program may fail, diverge, overflow, divide by zero, access invalid memory, or branch into undefined regions. Therefore, many program mappings are better viewed as partial functions.
For example,

$$f(x) = \sqrt{x}$$

is valid only for

$$x \ge 0.$$

Similarly,

$$g(x) = \frac{1}{x}$$

is undefined at

$$x = 0.$$
AD systems must respect domains. A derivative can only be computed where the original computation is valid and sufficiently smooth. Near discontinuities, branch boundaries, or undefined points, derivative values may be misleading or undefined.
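A small illustration of why domains matter, assuming NumPy's convention of returning nan or inf outside the valid region rather than raising an error:

```python
import numpy as np

def f(x):
    return np.sqrt(x)            # valid only for x >= 0

def dfdx(x):
    return 0.5 / np.sqrt(x)      # derivative rule, meaningful only for x > 0

with np.errstate(invalid="ignore", divide="ignore"):
    print(f(4.0), dfdx(4.0))     # inside the domain: 2.0 and 0.25
    print(f(-1.0))               # outside the domain: nan
    print(dfdx(0.0))             # domain boundary: inf, not a useful slope
```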
Smoothness
Automatic differentiation works best for smooth computations.
A smooth function has derivatives that behave regularly. Many elementary functions used in scientific computing are smooth over their valid domains: polynomials, $\sin$, $\cos$, $\exp$, and $\log$, for example.

However, programs often contain non-smooth operations: absolute value, $\max$ and $\min$, ReLU, floor and rounding, and comparison-driven branches.
Some of these functions are differentiable almost everywhere. Others have discrete behavior that does not fit ordinary calculus well.
For example, ReLU is defined as

$$\mathrm{ReLU}(x) = \max(0, x).$$

It has derivative $1$ for $x > 0$ and derivative $0$ for $x < 0$. At $x = 0$, the classical derivative is undefined. Many AD systems assign a conventional value there, often $0$, because practical optimization requires some rule.
This distinction matters. AD computes derivatives of the executed program under the derivative rules supplied by the system. It does not automatically solve every mathematical ambiguity.
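A minimal sketch of a ReLU derivative rule with the conventional choice at zero made explicit:

```python
def relu(x):
    # ReLU(x) = max(0, x)
    return x if x > 0 else 0.0

def relu_deriv(x):
    # 1 for x > 0, 0 for x < 0.
    # The value at exactly x = 0 is a convention chosen by the AD system,
    # not a classical derivative; here it is set to 0.
    return 1.0 if x > 0 else 0.0

for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_deriv(x))
```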
The Central View
For automatic differentiation, a function is not just an input-output relation. It is a computation built from primitive mappings.
The same mathematical function may be represented by different programs. These programs can have different numerical behavior, different memory behavior, and different derivative evaluation costs.
For example,

$$f_1(x) = 1 - \cos(x)$$

and

$$f_2(x) = 2 \sin^2(x / 2)$$

represent the same mathematical function over real numbers. In floating point arithmetic, they may behave differently near certain values, such as $x$ close to $0$. An AD system differentiates the computation it is given, not an abstract idealized formula detached from implementation.
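As a sketch of this effect, the following compares the two illustrative forms above in double precision near zero, where the first loses accuracy to cancellation:

```python
import math

def f1(x):
    return 1.0 - math.cos(x)           # cancels catastrophically for small x

def f2(x):
    return 2.0 * math.sin(x / 2) ** 2  # same mathematical function, rewritten

x = 1e-8
print(f1(x))   # 0.0: the true value is lost to rounding
print(f2(x))   # about 5e-17: close to the true value x**2 / 2
```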
This gives the basic principle for the rest of the book:
Automatic differentiation treats a program as a composed mapping, evaluates that mapping at concrete inputs, and propagates local derivative information through the same structure.