Chapter 1. Introduction

Why Derivatives Matter

A derivative measures how an output changes when an input changes. That sentence is simple, but it is one of the main ideas behind numerical computing, optimization, machine learning, simulation, control, statistics, and scientific modeling.

When we write a function

y = f(x),

we usually care about more than the value of y. We also care about how sensitive y is to x. If x changes a little, does y barely move, grow quickly, collapse, oscillate, or become unstable? The derivative answers that local question.

For a scalar function, the derivative is written as

\frac{dy}{dx}

or

f'(x).

It describes the local rate of change of f at the point x. If f'(x) is positive, increasing x locally increases f(x). If f'(x) is negative, increasing x locally decreases f(x). If f'(x) is close to zero, the function is locally flat.

For a function with many inputs,

f : \mathbb{R}^n \to \mathbb{R},

the derivative becomes a gradient:

\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}.

The gradient tells us which direction increases the function fastest. Its negative tells us which direction decreases the function fastest. This single fact explains why gradients are central to optimization.
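To make this concrete, here is a small numerical sketch (the quadratic function, step size, and comparison directions below are illustrative choices):

import numpy as np

# Toy function f(x) = x0^2 + 2*x1^2 with its gradient written out by hand.
def f(x):
    return x[0]**2 + 2.0 * x[1]**2

def grad_f(x):
    return np.array([2.0 * x[0], 4.0 * x[1]])

def unit(v):
    return v / np.linalg.norm(v)

x = np.array([1.0, 1.0])
g = grad_f(x)
step = 1e-3

# Equal-length steps along the gradient, against it, and along one axis.
for name, d in [("along gradient", unit(g)),
                ("against gradient", -unit(g)),
                ("along x0 axis", np.array([1.0, 0.0]))]:
    print(name, f(x + step * d) - f(x))
# The step along the gradient gives the largest increase in f,
# and the step against it gives the largest decrease.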

For a function with many inputs and many outputs,

f : \mathbb{R}^n \to \mathbb{R}^m,

the derivative is a Jacobian matrix:

J_f(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}.

The Jacobian is the local linear approximation to the function. Near a point x, a small input perturbation Δx produces an output perturbation approximately equal to

\Delta y \approx J_f(x)\,\Delta x.

This is the computational meaning of the derivative: it lets us treat a nonlinear function, locally, as a linear map.
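As a quick check of this approximation, here is a small sketch for a two-input, two-output map (the particular function and its hand-written Jacobian below are illustrative):

import numpy as np

# f : R^2 -> R^2, with its Jacobian written out analytically.
def f(x):
    return np.array([x[0] * x[1], np.sin(x[0])])

def jacobian_f(x):
    return np.array([[x[1],         x[0]],
                     [np.cos(x[0]), 0.0]])

x = np.array([1.0, 2.0])
dx = np.array([1e-3, -2e-3])

exact_change = f(x + dx) - f(x)
linear_change = jacobian_f(x) @ dx

print(exact_change)
print(linear_change)   # nearly identical to exact_change for a small dx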

Derivatives Turn Questions Into Algorithms

Many important computational questions are local change questions.

In optimization, we ask: which direction should we move to reduce a loss function?

In sensitivity analysis, we ask: which input variables affect the output most?

In numerical solvers, we ask: how should the current estimate be corrected?

In machine learning, we ask: how should model parameters change so that predictions improve?

In control, we ask: how does a future state respond to a current action?

In uncertainty propagation, we ask: how does input noise affect output noise?

All of these questions require derivative information.

Consider a model with parameters θ, input data x, prediction ŷ, and loss function L:

\hat{y} = f_\theta(x), \qquad \ell = L(\hat{y}, y).

Training the model means adjusting θ so that the loss ℓ decreases. The derivative

\nabla_\theta \ell

tells us how the loss changes with respect to every parameter. Without this derivative, we can still try random search, grid search, or finite differences, but those methods become impractical when the number of parameters is large.

Modern neural networks may have millions or billions of parameters. The training procedure depends on computing derivatives efficiently. Backpropagation is reverse mode automatic differentiation applied to neural network programs.
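As a rough sketch of how this looks in code, the following uses JAX's jax.grad to compute ∇_θ ℓ for a toy linear model and take gradient descent steps; the model, data, and learning rate here are placeholder choices, not anything prescribed by the text:

import jax
import jax.numpy as jnp

# Toy linear model: prediction = x @ params; loss is mean squared error.
def loss(params, x, y):
    pred = x @ params
    return jnp.mean((pred - y) ** 2)

grad_loss = jax.grad(loss)          # reverse-mode derivative w.r.t. the first argument

params = jnp.zeros(3)
x = jnp.array([[1.0, 2.0, 3.0]])
y = jnp.array([1.0])

for _ in range(100):
    g = grad_loss(params, x, y)     # gradient of the loss with respect to params
    params = params - 0.05 * g      # plain gradient descent step

print(loss(params, x, y))           # the loss shrinks as params are adjusted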

Derivatives Compress Local Behavior

A function may be complicated globally. It may contain thousands of operations, branches, loops, matrix multiplications, nonlinearities, and calls into numerical kernels. But near one input, its first-order behavior can often be summarized by a derivative.

For scalar input and scalar output, one number summarizes the local slope.

For vector input and scalar output, one vector summarizes the local direction of steepest increase.

For vector input and vector output, one matrix summarizes the local linear behavior.

This compression is powerful because many algorithms only need local information. Newton’s method, gradient descent, Gauss-Newton methods, Kalman filters, adjoint methods, and many inverse problem solvers all rely on local approximations.

The first-order approximation is

f(x + \Delta x) \approx f(x) + J_f(x)\,\Delta x.

For scalar-valued functions, the second-order approximation adds curvature:

f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2}\Delta x^T H_f(x)\,\Delta x,

where H_f(x) is the Hessian matrix.

The gradient tells us the local slope. The Hessian tells us the local curvature. Together they describe how the function bends near a point.
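As a small worked check (the function, expansion point, and perturbation below are arbitrary illustrative choices), the second-order model tracks the function much more closely than the first-order one:

import numpy as np

# f(x) = x0^2 + 3*sin(x1), with gradient and Hessian written out by hand.
def f(x):
    return x[0]**2 + 3.0 * np.sin(x[1])

def grad_f(x):
    return np.array([2.0 * x[0], 3.0 * np.cos(x[1])])

def hess_f(x):
    return np.array([[2.0, 0.0],
                     [0.0, -3.0 * np.sin(x[1])]])

x = np.array([1.0, 0.5])
dx = np.array([1e-2, -2e-2])

exact = f(x + dx)
first_order = f(x) + grad_f(x) @ dx
second_order = first_order + 0.5 * dx @ hess_f(x) @ dx

print(abs(exact - first_order))    # error of the linear model, shrinks like |dx|^2
print(abs(exact - second_order))   # error of the quadratic model, shrinks like |dx|^3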

Derivatives Are Expensive to Compute by Hand

For small textbook functions, derivatives are easy. For example,

f(x) = x^2 + \sin x

has derivative

f'(x) = 2x + \cos x.

But real programs are rarely this simple. A real function may be written as code, not as a compact formula. It may allocate arrays, call subroutines, perform matrix operations, branch on conditions, and iterate until convergence.

For example, a loss function in code may look conceptually like this:

def loss(params, batch):
    # Run the forward pass, layer by layer. `initialize`, `model`, `decode`,
    # and `mean` stand in for whatever the real program actually does.
    state = initialize(params)

    for layer in model:
        state = layer.forward(state)

    # Mean squared error between the prediction and the target.
    prediction = decode(state)
    error = prediction - batch.target

    return mean(error * error)

The derivative of this function with respect to params exists because the operations inside the program are differentiable. But writing the full derivative by hand would be tedious and error-prone. If the model changes, the derivative code must change too.

This is the central motivation for automatic differentiation. We want derivative computation to follow the program automatically, with accuracy close to analytic differentiation and efficiency close to hand-written derivative code.

Numerical Differentiation Is Too Fragile

A simple way to estimate a derivative is finite differences:

f'(x) \approx \frac{f(x+h) - f(x)}{h}.

This looks attractive because it treats the function as a black box. It only requires evaluating ff. But it has two major problems.

First, it is approximate. If h is too large, the estimate has truncation error because the secant line differs from the tangent line. If h is too small, floating point roundoff dominates because f(x+h) and f(x) may be nearly equal.

Second, it scales poorly. For a function with n inputs, estimating the full gradient by finite differences usually requires n + 1 function evaluations. If n is large, this becomes too expensive.
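The step-size trade-off behind the first problem is easy to see numerically. In this rough sketch (the function and step sizes are chosen only for illustration), the error first shrinks as h decreases and then grows again once roundoff dominates:

import numpy as np

# Forward-difference estimate of the derivative of sin at x = 1.0,
# compared with the exact value cos(1.0) for several step sizes h.
x = 1.0
exact = np.cos(x)

for h in [1e-1, 1e-4, 1e-8, 1e-12]:
    estimate = (np.sin(x + h) - np.sin(x)) / h
    print(f"h = {h:.0e}   error = {abs(estimate - exact):.1e}")

# Large h: truncation error dominates.  Tiny h: floating point roundoff dominates.
# And the full gradient of an n-input function needs about n + 1 such evaluations.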

Automatic differentiation avoids both problems. It computes derivatives by applying exact local derivative rules to the actual operations executed by the program. The result is exact up to ordinary floating point rounding: not exact in the symbolic-algebra sense, but far more reliable than finite differences.

Symbolic Differentiation Is Too Rigid

Symbolic differentiation manipulates formulas. It can transform

\frac{d}{dx}\left(x^2 + \sin x\right)

into

2x + \cos x.

This is useful when the function is available as a symbolic expression. But many functions are programs rather than formulas. They may involve control flow, numerical libraries, data-dependent execution, and intermediate arrays.

Symbolic differentiation can also produce expressions that grow rapidly. A compact original expression may yield a huge derivative expression if simplification is not carefully managed. This expression swell makes symbolic methods awkward for large numerical programs.
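As a rough illustration of expression swell (using SymPy; the nested expression below is an arbitrary example), repeatedly composing a simple operation keeps the original formula small while its symbolic derivative grows much faster:

import sympy as sp

x = sp.symbols('x')

expr = x
for _ in range(6):
    expr = sp.sin(expr) * expr      # build a nested expression

deriv = sp.diff(expr, x)            # symbolic derivative of the whole formula

print(sp.count_ops(expr))           # operation count of the original expression
print(sp.count_ops(deriv))          # operation count of its derivative: far larger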

Automatic differentiation occupies a different position. It does not estimate derivatives from nearby function values, as finite differences do. It also does not primarily manipulate symbolic formulas. Instead, it applies the chain rule mechanically to the executed computation.

The Chain Rule Is the Core Mechanism

Automatic differentiation works because every program can be viewed as a composition of small operations.

Suppose

y = f(g(x)).

The derivative is

\frac{dy}{dx} = \frac{df}{dg}\,\frac{dg}{dx}.

This is the chain rule. In larger programs, the same idea applies repeatedly. Each primitive operation contributes a local derivative. Automatic differentiation combines these local derivatives according to the dependency structure of the computation.

For example, let

a = x^2, \qquad b = \sin x, \qquad y = a + b.

The derivative of y with respect to x is obtained by propagating derivative information through each intermediate variable:

\frac{da}{dx} = 2x, \qquad \frac{db}{dx} = \cos x, \qquad \frac{dy}{dx} = \frac{da}{dx} + \frac{db}{dx}.

Automatic differentiation generalizes this procedure to large programs.
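A minimal sketch of this mechanism, assuming plain Python and only the handful of operations this example needs, attaches a derivative to every value as a "dual number":

import math

class Dual:
    """A value paired with its derivative with respect to the input x."""
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule: (u * v)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def sin(u):
    # Chain rule through sin: d/dx sin(u) = cos(u) * du/dx
    return Dual(math.sin(u.value), math.cos(u.value) * u.deriv)

x = Dual(1.5, 1.0)      # seed the input with dx/dx = 1
a = x * x               # a = x^2,     da/dx = 2x
b = sin(x)              # b = sin(x),  db/dx = cos(x)
y = a + b               # y = a + b,   dy/dx = da/dx + db/dx

print(y.value, y.deriv)
print(1.5**2 + math.sin(1.5), 2 * 1.5 + math.cos(1.5))   # analytic check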

Derivatives Are Program Data

A useful way to understand automatic differentiation is to treat derivatives as data computed alongside ordinary values.

In forward mode, each value carries its tangent. If the program computes v, AD also computes how v changes with respect to a chosen input direction.

In reverse mode, each value later receives an adjoint. The adjoint tells how the final output changes with respect to that intermediate value.

Forward mode is efficient when the number of input directions is small.

Reverse mode is efficient when there are many inputs but few outputs, especially one scalar loss. This is why reverse mode dominates machine learning training.
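For the same small example, a hand-written reverse sweep (a sketch of what reverse mode automates) runs the program forward, then pushes an adjoint of 1 from the output back toward the input:

import math

# Forward pass: evaluate and remember the intermediate values.
x = 1.5
a = x * x
b = math.sin(x)
y = a + b

# Reverse pass: start from dy/dy = 1 and accumulate adjoints backwards.
y_bar = 1.0
a_bar = y_bar * 1.0             # y = a + b   =>  dy/da = 1
b_bar = y_bar * 1.0             # y = a + b   =>  dy/db = 1
x_bar = a_bar * (2 * x)         # a = x^2     =>  da/dx = 2x
x_bar += b_bar * math.cos(x)    # b = sin(x)  =>  db/dx = cos(x)

print(x_bar)                         # dy/dx at x = 1.5
print(2 * 1.5 + math.cos(1.5))       # matches the analytic 2x + cos(x)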

The Main Promise of Automatic Differentiation

Automatic differentiation gives us a systematic way to compute derivatives of programs.

It has three important properties.

First, it is accurate. It applies analytic derivative rules to primitive operations, so it avoids the step-size problem of finite differences.

Second, it is mechanical. Once the primitive operations have derivative rules, large derivative computations can be generated automatically.

Third, it is efficient. With the right mode, derivative computation can often be done within a small constant factor of the original computation.

This makes automatic differentiation a bridge between calculus and software systems. It lets a program be executed not only for its value, but also for its local behavior.

That bridge is the subject of this book.