Chapter 1. Introduction

Why Derivatives Matter

A derivative measures how an output changes when an input changes. That sentence is simple, but it is one of the main ideas behind numerical computing, optimization, machine learning, simulation, control, statistics, and scientific modeling.

When we write a function

y = f(x),

we usually care about more than the value of y. We also care about how sensitive y is to x. If x changes a little, does y barely move, grow quickly, collapse, oscillate, or become unstable? The derivative answers that local question.

For a scalar function, the derivative is written as

\frac{dy}{dx}

or

f'(x).

It describes the local rate of change of f at the point x. If f'(x) is positive, increasing x locally increases f(x). If f'(x) is negative, increasing x locally decreases f(x). If f'(x) is close to zero, the function is locally flat.

For a function with many inputs,

f : \mathbb{R}^n \to \mathbb{R},

the derivative becomes a gradient:

\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}.

The gradient tells us which direction increases the function fastest. Its negative tells us which direction decreases the function fastest. This single fact explains why gradients are central to optimization.
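To make this concrete, here is a small numerical sketch (the quadratic function, step size, and comparison directions below are illustrative choices):

import numpy as np

# Toy function f(x) = x0^2 + 2*x1^2 with its gradient written out by hand.
def f(x):
    return x[0]**2 + 2.0 * x[1]**2

def grad_f(x):
    return np.array([2.0 * x[0], 4.0 * x[1]])

def unit(v):
    return v / np.linalg.norm(v)

x = np.array([1.0, 1.0])
g = grad_f(x)
step = 1e-3

# Equal-length steps along the gradient, against it, and along one axis.
for name, d in [("along gradient", unit(g)),
                ("against gradient", -unit(g)),
                ("along x0 axis", np.array([1.0, 0.0]))]:
    print(name, f(x + step * d) - f(x))
# The step along the gradient gives the largest increase in f,
# and the step against it gives the largest decrease.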

For a function with many inputs and many outputs,

f : \mathbb{R}^n \to \mathbb{R}^m,

the derivative is a Jacobian matrix:

J_f(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}.

The Jacobian is the local linear approximation to the function. Near a point x, a small input perturbation Δx produces an output perturbation approximately equal to

\Delta y \approx J_f(x)\,\Delta x.

This is the computational meaning of the derivative: it lets us treat a nonlinear function, locally, as a linear map.
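As a quick check of this approximation, here is a small sketch for a two-input, two-output map (the particular function and its hand-written Jacobian below are illustrative):

import numpy as np

# f : R^2 -> R^2, with its Jacobian written out analytically.
def f(x):
    return np.array([x[0] * x[1], np.sin(x[0])])

def jacobian_f(x):
    return np.array([[x[1],         x[0]],
                     [np.cos(x[0]), 0.0]])

x = np.array([1.0, 2.0])
dx = np.array([1e-3, -2e-3])

exact_change = f(x + dx) - f(x)
linear_change = jacobian_f(x) @ dx

print(exact_change)
print(linear_change)   # nearly identical to exact_change for a small dx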

Derivatives Turn Questions Into Algorithms

Many important computational questions are local change questions.

In optimization, we ask: which direction should we move to reduce a loss function?

In sensitivity analysis, we ask: which input variables affect the output most?

In numerical solvers, we ask: how should the current estimate be corrected?

In machine learning, we ask: how should model parameters change so that predictions improve?

In control, we ask: how does a future state respond to a current action?

In uncertainty propagation, we ask: how does input noise affect output noise?

All of these questions require derivative information.

Consider a model with parameters θ, input data x, prediction ŷ, and loss function L:

\hat{y} = f_\theta(x), \qquad \ell = L(\hat{y}, y).

Training the model means adjusting θ so that the loss ℓ decreases. The derivative

\nabla_\theta \ell

tells us how the loss changes with respect to every parameter. Without this derivative, we can still try random search, grid search, or finite differences, but those methods become impractical when the number of parameters is large.

Modern neural networks may have millions or billions of parameters. The training procedure depends on computing derivatives efficiently. Backpropagation is reverse mode automatic differentiation applied to neural network programs.
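As a rough sketch of how this looks in code, the following uses JAX's jax.grad to compute ∇_θ ℓ for a toy linear model and take gradient descent steps; the model, data, and learning rate here are placeholder choices, not anything prescribed by the text:

import jax
import jax.numpy as jnp

# Toy linear model: prediction = x @ params; loss is mean squared error.
def loss(params, x, y):
    pred = x @ params
    return jnp.mean((pred - y) ** 2)

grad_loss = jax.grad(loss)          # reverse-mode derivative w.r.t. the first argument

params = jnp.zeros(3)
x = jnp.array([[1.0, 2.0, 3.0]])
y = jnp.array([1.0])

for _ in range(100):
    g = grad_loss(params, x, y)     # gradient of the loss with respect to params
    params = params - 0.05 * g      # plain gradient descent step

print(loss(params, x, y))           # the loss shrinks as params are adjusted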

Derivatives Compress Local Behavior

A function may be complicated globally. It may contain thousands of operations, branches, loops, matrix multiplications, nonlinearities, and calls into numerical kernels. But near one input, its first-order behavior can often be summarized by a derivative.

For scalar input and scalar output, one number summarizes the local slope.

For vector input and scalar output, one vector summarizes the local direction of steepest increase.

For vector input and vector output, one matrix summarizes the local linear behavior.

This compression is powerful because many algorithms only need local information. Newton’s method, gradient descent, Gauss-Newton methods, Kalman filters, adjoint methods, and many inverse problem solvers all rely on local approximations.

The first-order approximation is

f(x + \Delta x) \approx f(x) + J_f(x)\,\Delta x.

For scalar-valued functions, the second-order approximation adds curvature:

f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2}\Delta x^T H_f(x)\,\Delta x,

where H_f(x) is the Hessian matrix.

The gradient tells us the local slope. The Hessian tells us the local curvature. Together they describe how the function bends near a point.
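As a small worked check (the function, expansion point, and perturbation below are arbitrary illustrative choices), the second-order model tracks the function much more closely than the first-order one:

import numpy as np

# f(x) = x0^2 + 3*sin(x1), with gradient and Hessian written out by hand.
def f(x):
    return x[0]**2 + 3.0 * np.sin(x[1])

def grad_f(x):
    return np.array([2.0 * x[0], 3.0 * np.cos(x[1])])

def hess_f(x):
    return np.array([[2.0, 0.0],
                     [0.0, -3.0 * np.sin(x[1])]])

x = np.array([1.0, 0.5])
dx = np.array([1e-2, -2e-2])

exact = f(x + dx)
first_order = f(x) + grad_f(x) @ dx
second_order = first_order + 0.5 * dx @ hess_f(x) @ dx

print(abs(exact - first_order))    # error of the linear model, shrinks like |dx|^2
print(abs(exact - second_order))   # error of the quadratic model, shrinks like |dx|^3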

Derivatives Are Expensive to Compute by Hand

For small textbook functions, derivatives are easy. For example,

f(x) = x^2 + \sin x

has derivative

f'(x) = 2x + \cos x.

But real programs are rarely this simple. A real function may be written as code, not as a compact formula. It may allocate arrays, call subroutines, perform matrix operations, branch on conditions, and iterate until convergence.

For example, a loss function in code may look conceptually like this:

def loss(params, batch):
    # Run the forward pass, layer by layer. `initialize`, `model`, `decode`,
    # and `mean` stand in for whatever the real program actually does.
    state = initialize(params)

    for layer in model:
        state = layer.forward(state)

    # Mean squared error between the prediction and the target.
    prediction = decode(state)
    error = prediction - batch.target

    return mean(error * error)

The derivative of this function with respect to params exists because the operations inside the program are differentiable. But writing the full derivative by hand would be tedious and error-prone. If the model changes, the derivative code must change too.

This is the central motivation for automatic differentiation. We want derivative computation to follow the program automatically, with accuracy close to analytic differentiation and efficiency close to hand-written derivative code.

Numerical Differentiation Is Too Fragile

A simple way to estimate a derivative is finite differences:

f'(x) \approx \frac{f(x+h) - f(x)}{h}.

This looks attractive because it treats the function as a black box. It only requires evaluating ff. But it has two major problems.

First, it is approximate. If h is too large, the estimate has truncation error because the secant line differs from the tangent line. If h is too small, floating point roundoff dominates because f(x+h) and f(x) may be nearly equal.

Second, it scales poorly. For a function with n inputs, estimating the full gradient by finite differences usually requires n + 1 function evaluations. If n is large, this becomes too expensive.
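The step-size trade-off behind the first problem is easy to see numerically. In this rough sketch (the function and step sizes are chosen only for illustration), the error first shrinks as h decreases and then grows again once roundoff dominates:

import numpy as np

# Forward-difference estimate of the derivative of sin at x = 1.0,
# compared with the exact value cos(1.0) for several step sizes h.
x = 1.0
exact = np.cos(x)

for h in [1e-1, 1e-4, 1e-8, 1e-12]:
    estimate = (np.sin(x + h) - np.sin(x)) / h
    print(f"h = {h:.0e}   error = {abs(estimate - exact):.1e}")

# Large h: truncation error dominates.  Tiny h: floating point roundoff dominates.
# And the full gradient of an n-input function needs about n + 1 such evaluations.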

Automatic differentiation avoids both problems. It computes derivatives by applying exact local derivative rules to the actual operations executed by the program. The result is exact up to ordinary floating point rounding: not exact in the symbolic-algebra sense, but far more reliable than finite differences.

Symbolic Differentiation Is Too Rigid

Symbolic differentiation manipulates formulas. It can transform

\frac{d}{dx}\left(x^2 + \sin x\right)

into

2x + \cos x.

This is useful when the function is available as a symbolic expression. But many functions are programs rather than formulas. They may involve control flow, numerical libraries, data-dependent execution, and intermediate arrays.

Symbolic differentiation can also produce expressions that grow rapidly. A compact original expression may yield a huge derivative expression if simplification is not carefully managed. This expression swell makes symbolic methods awkward for large numerical programs.
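As a rough illustration of expression swell (using SymPy; the nested expression below is an arbitrary example), repeatedly composing a simple operation keeps the original formula small while its symbolic derivative grows much faster:

import sympy as sp

x = sp.symbols('x')

expr = x
for _ in range(6):
    expr = sp.sin(expr) * expr      # build a nested expression

deriv = sp.diff(expr, x)            # symbolic derivative of the whole formula

print(sp.count_ops(expr))           # operation count of the original expression
print(sp.count_ops(deriv))          # operation count of its derivative: far larger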

Automatic differentiation occupies a different position. It does not estimate derivatives from nearby function values, as finite differences do. It also does not primarily manipulate symbolic formulas. Instead, it applies the chain rule mechanically to the executed computation.

The Chain Rule Is the Core Mechanism

Automatic differentiation works because every program can be viewed as a composition of small operations.

Suppose

y = f(g(x)).

The derivative is

\frac{dy}{dx} = \frac{df}{dg}\,\frac{dg}{dx}.

This is the chain rule. In larger programs, the same idea applies repeatedly. Each primitive operation contributes a local derivative. Automatic differentiation combines these local derivatives according to the dependency structure of the computation.

For example, let

a = x^2, \qquad b = \sin x, \qquad y = a + b.

The derivative of y with respect to x is obtained by propagating derivative information through each intermediate variable:

\frac{da}{dx} = 2x, \qquad \frac{db}{dx} = \cos x, \qquad \frac{dy}{dx} = \frac{da}{dx} + \frac{db}{dx}.

Automatic differentiation generalizes this procedure to large programs.
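A minimal sketch of this mechanism, assuming plain Python and only the handful of operations this example needs, attaches a derivative to every value as a "dual number":

import math

class Dual:
    """A value paired with its derivative with respect to the input x."""
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule: (u * v)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def sin(u):
    # Chain rule through sin: d/dx sin(u) = cos(u) * du/dx
    return Dual(math.sin(u.value), math.cos(u.value) * u.deriv)

x = Dual(1.5, 1.0)      # seed the input with dx/dx = 1
a = x * x               # a = x^2,     da/dx = 2x
b = sin(x)              # b = sin(x),  db/dx = cos(x)
y = a + b               # y = a + b,   dy/dx = da/dx + db/dx

print(y.value, y.deriv)
print(1.5**2 + math.sin(1.5), 2 * 1.5 + math.cos(1.5))   # analytic check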

Derivatives Are Program Data

A useful way to understand automatic differentiation is to treat derivatives as data computed alongside ordinary values.

In forward mode, each value carries its tangent. If the program computes v, AD also computes how v changes with respect to a chosen input direction.

In reverse mode, each value later receives an adjoint. The adjoint tells how the final output changes with respect to that intermediate value.

Forward mode is efficient when the number of input directions is small.

Reverse mode is efficient when there are many inputs but few outputs, especially one scalar loss. This is why reverse mode dominates machine learning training.
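For the same small example, a hand-written reverse sweep (a sketch of what reverse mode automates) runs the program forward, then pushes an adjoint of 1 from the output back toward the input:

import math

# Forward pass: evaluate and remember the intermediate values.
x = 1.5
a = x * x
b = math.sin(x)
y = a + b

# Reverse pass: start from dy/dy = 1 and accumulate adjoints backwards.
y_bar = 1.0
a_bar = y_bar * 1.0             # y = a + b   =>  dy/da = 1
b_bar = y_bar * 1.0             # y = a + b   =>  dy/db = 1
x_bar = a_bar * (2 * x)         # a = x^2     =>  da/dx = 2x
x_bar += b_bar * math.cos(x)    # b = sin(x)  =>  db/dx = cos(x)

print(x_bar)                         # dy/dx at x = 1.5
print(2 * 1.5 + math.cos(1.5))       # matches the analytic 2x + cos(x)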

The Main Promise of Automatic Differentiation

Automatic differentiation gives us a systematic way to compute derivatives of programs.

It has three important properties.

First, it is accurate. It applies analytic derivative rules to primitive operations, so it avoids the step-size problem of finite differences.

Second, it is mechanical. Once the primitive operations have derivative rules, large derivative computations can be generated automatically.

Third, it is efficient. With the right mode, derivative computation can often be done within a small constant factor of the original computation.

This makes automatic differentiation a bridge between calculus and software systems. It lets a program be executed not only for its value, but also for its local behavior.

That bridge is the subject of this book.