# Automatic Differentiation

Automatic differentiation computes derivatives by applying the chain rule to the operations of a program. The input is ordinary code that computes a value. The output is code, data, or runtime behavior that also computes derivative information.

AD sits between numerical differentiation and symbolic differentiation. Numerical differentiation estimates derivatives from repeated function evaluations, and symbolic differentiation rewrites formulas into derivative formulas. Automatic differentiation instead follows the actual computation and propagates derivatives through each primitive operation.

Suppose a program computes

$$
y = f(x).
$$

Automatic differentiation evaluates $f(x)$ and, at the same time or in a related pass, computes derivative information such as

$$
f'(x),
\qquad
\nabla f(x),
\qquad
J_f(x)v,
\qquad
v^T J_f(x).
$$

The exact object depends on the AD mode and the shape of the function.

## Programs as Compositions

A program can be viewed as a sequence of elementary operations.

For example,

$$
y = \sin(x^2 + 1)
$$

can be written as

$$
v_1 = x^2,
$$

$$
v_2 = v_1 + 1,
$$

$$
v_3 = \sin(v_2),
$$

$$
y = v_3.
$$

Each step is simple. Each step has a known local derivative. Automatic differentiation combines these local derivatives.

For the same computation,

$$
\frac{dv_1}{dx} = 2x,
$$

$$
\frac{dv_2}{dv_1} = 1,
$$

$$
\frac{dv_3}{dv_2} = \cos(v_2).
$$

The full derivative is the product of local derivatives:

$$
\frac{dy}{dx} =
\frac{dv_3}{dv_2}
\frac{dv_2}{dv_1}
\frac{dv_1}{dx} =
\cos(x^2+1) \cdot 2x.
$$

AD does this mechanically for large computations.
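
The same bookkeeping can be written out by hand. A minimal Python sketch of this trace stores each intermediate value, pairs it with its local derivative, and multiplies the local derivatives along the chain:

```python
import math

def f_and_grad(x):
    # Evaluation trace for y = sin(x^2 + 1), one primitive per line.
    v1 = x * x          # local derivative dv1/dx  = 2x
    v2 = v1 + 1.0       # local derivative dv2/dv1 = 1
    v3 = math.sin(v2)   # local derivative dv3/dv2 = cos(v2)

    # Chain rule: multiply the local derivatives along the trace.
    dy_dx = math.cos(v2) * 1.0 * (2.0 * x)
    return v3, dy_dx

print(f_and_grad(1.5))  # value sin(3.25) and derivative cos(3.25) * 3.0
```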

## Elementary Operations

An AD system needs derivative rules for primitive operations. These operations may include arithmetic, transcendental functions, comparisons, indexing, reductions, matrix multiplication, convolution, and other library kernels.

For scalar arithmetic, the rules are familiar:

| Operation | Value | Local derivative |
|---|---:|---:|
| Add | $z=x+y$ | $dz=dx+dy$ |
| Subtract | $z=x-y$ | $dz=dx-dy$ |
| Multiply | $z=xy$ | $dz=y\,dx+x\,dy$ |
| Divide | $z=x/y$ | $dz=(y\,dx-x\,dy)/y^2$ |
| Sine | $z=\sin x$ | $dz=\cos x\,dx$ |
| Exponential | $z=\exp x$ | $dz=\exp x\,dx$ |
| Logarithm | $z=\log x$ | $dz=dx/x$ |

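One way to organize these rules, sketched here as a hypothetical lookup table rather than any particular library's design, is to pair each primitive with a function that returns its local partial derivatives:

```python
import math

# Hypothetical rule table: primitive name -> (value function, local partials).
RULES = {
    "add": (lambda x, y: x + y,    lambda x, y: (1.0, 1.0)),
    "sub": (lambda x, y: x - y,    lambda x, y: (1.0, -1.0)),
    "mul": (lambda x, y: x * y,    lambda x, y: (y, x)),
    "div": (lambda x, y: x / y,    lambda x, y: (1.0 / y, -x / y ** 2)),
    "sin": (lambda x: math.sin(x), lambda x: (math.cos(x),)),
    "exp": (lambda x: math.exp(x), lambda x: (math.exp(x),)),
    "log": (lambda x: math.log(x), lambda x: (1.0 / x,)),
}

value, partials = RULES["div"]
print(value(1.0, 4.0), partials(1.0, 4.0))  # 0.25 and (0.25, -0.0625)
```
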
For array operations, the same idea applies, but the derivative rule may involve tensor shapes, broadcasting, transposes, reductions, or sparse structure.

For example, if

$$
Y = AX,
$$

then a perturbation satisfies

$$
dY = A\,dX + dA\,X.
$$

This local rule is enough to propagate derivatives through a matrix multiplication.
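
A quick numerical check of this rule, sketched in NumPy with small random matrices, compares the exact change in $Y$ to the first-order prediction; the difference is the second-order term $dA\,dX$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 2))

eps = 1e-6
dA = eps * rng.standard_normal((3, 4))
dX = eps * rng.standard_normal((4, 2))

exact_change = (A + dA) @ (X + dX) - A @ X
first_order = A @ dX + dA @ X            # the local rule dY = A dX + dA X

# The residual is dA @ dX, so it shrinks like eps**2.
print(np.max(np.abs(exact_change - first_order)))
```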

## Forward Mode

Forward mode propagates derivatives in the same direction as ordinary evaluation.

For each variable $v$, the system carries both the value $v$ and a tangent $\dot{v}$. The tangent describes how $v$ changes with respect to a selected input perturbation.

If the input is seeded with

$$
\dot{x} = 1,
$$

then the final tangent $\dot{y}$ equals

$$
\frac{dy}{dx}.
$$

For a multivariate function, the seed can be a direction $u$. Forward mode computes the Jacobian-vector product

$$
J_f(x)u.
$$

This is efficient when only a few input directions are needed, which is typical when the number of inputs is small.

A simple forward-mode trace for

$$
y = \sin(x^2 + 1)
$$

looks like this:

| Variable | Value | Tangent |
|---|---:|---:|
| $x$ | $x$ | $1$ |
| $v_1=x^2$ | $x^2$ | $2x$ |
| $v_2=v_1+1$ | $x^2+1$ | $2x$ |
| $y=\sin v_2$ | $\sin(x^2+1)$ | $\cos(x^2+1) \cdot 2x$ |

The derivative appears at the output.
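
Forward mode can be sketched with dual numbers: each quantity carries its value together with a tangent, and every primitive propagates both. The class below is a minimal illustration, not a complete operator set:

```python
import math

class Dual:
    """A value paired with its tangent, propagated by forward-mode rules."""
    def __init__(self, value, tangent=0.0):
        self.value, self.tangent = value, tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.value * other.tangent + self.tangent * other.value)

def sin(d):
    return Dual(math.sin(d.value), math.cos(d.value) * d.tangent)

# Seed the input with tangent 1 and evaluate y = sin(x^2 + 1).
x = Dual(1.5, 1.0)
y = sin(x * x + 1.0)
print(y.value, y.tangent)   # sin(3.25) and cos(3.25) * 2 * 1.5
```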

## Reverse Mode

Reverse mode propagates derivative information backward from the output to the inputs.

During the forward evaluation, the system records enough information to replay derivative rules backward. This record is often called a tape, trace, or computational graph.

For each intermediate variable $v$, reverse mode computes an adjoint

$$
\bar{v} = \frac{\partial y}{\partial v},
$$

where $y$ is the scalar output being differentiated.

The final output is seeded with

$$
\bar{y} = 1.
$$

Then adjoints are propagated backward.

For the same computation,

$$
v_1 = x^2,
\qquad
v_2 = v_1 + 1,
\qquad
y = \sin v_2,
$$

reverse mode proceeds as follows:

$$
\bar{y}=1,
$$

$$
\bar{v}_2 += \bar{y}\cos(v_2),
$$

$$
\bar{v}_1 += \bar{v}_2,
$$

$$
\bar{x} += \bar{v}_1 \cdot 2x.
$$

At the end,

$$
\bar{x} =
\frac{dy}{dx} =
\cos(x^2+1) \cdot 2x.
$$

Reverse mode is especially important for scalar-output functions with many inputs:

$$
f : \mathbb{R}^n \to \mathbb{R}.
$$

It computes the full gradient at a cost that is a small constant multiple of the cost of one evaluation of $f$, independent of the number of inputs. This is the computational reason reverse mode is central to neural network training.
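
Reverse mode can be sketched with a tape: the forward pass records each primitive together with its inputs and local derivatives, and the backward pass replays the tape in reverse order, accumulating adjoints. The toy scalar version below assumes only the primitives used in this example:

```python
import math

class Var:
    """A scalar value with an adjoint slot, produced by taped primitives."""
    def __init__(self, value):
        self.value = value
        self.adjoint = 0.0

tape = []   # records (output Var, [(input Var, local derivative), ...])

def mul(a, b):
    out = Var(a.value * b.value)
    tape.append((out, [(a, b.value), (b, a.value)]))
    return out

def add_const(a, c):
    out = Var(a.value + c)
    tape.append((out, [(a, 1.0)]))
    return out

def sin(a):
    out = Var(math.sin(a.value))
    tape.append((out, [(a, math.cos(a.value))]))
    return out

# Forward pass records the tape; backward pass replays it in reverse.
x = Var(1.5)
y = sin(add_const(mul(x, x), 1.0))   # y = sin(x^2 + 1)

y.adjoint = 1.0
for out, inputs in reversed(tape):
    for inp, local in inputs:
        inp.adjoint += out.adjoint * local

print(y.value, x.adjoint)            # sin(3.25) and cos(3.25) * 2 * 1.5
```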

## AD Computes Exact Derivatives of the Executed Program

Automatic differentiation is often described as exact. The precise statement is more careful.

AD computes exact derivatives of the sequence of primitive operations that the program executes, subject to floating point arithmetic.

If the program computes using floating point numbers, AD differentiates that floating point computation. This may differ from the derivative of an ideal real-number mathematical function.

For example, a real function may be smooth, while its floating point implementation has rounding, overflow, underflow, branching, and saturation behavior.

AD follows the implementation.

This is usually desirable. Numerical software runs as code, not as ideal algebra. The derivative used by an optimizer should correspond to the computation actually producing the value.

## AD and Control Flow

Automatic differentiation can handle control flow by differentiating the executed path.

Consider:

```text
function f(x):
    if x > 0:
        return x * x
    else:
        return -x
```

For $x>0$, the executed branch is

$$
f(x)=x^2,
$$

so AD gives

$$
f'(x)=2x.
$$

For $x<0$, the executed branch is

$$
f(x)=-x,
$$

so AD gives

$$
f'(x)=-1.
$$

At $x=0$, the mathematical function may have a kink or branch-dependent behavior. AD returns the derivative of the branch that runs, assuming the primitive derivative rule is defined there.
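
A minimal sketch of this behavior propagates a tangent through whichever branch actually executes; the comparison itself uses only the primal value:

```python
def f_with_tangent(x, dx):
    # The branch is chosen from the value of x; the tangent then follows
    # the derivative rule of that branch only.
    if x > 0:
        return x * x, 2.0 * x * dx    # executed branch: f(x) = x^2
    else:
        return -x, -dx                # executed branch: f(x) = -x

print(f_with_tangent(3.0, 1.0))    # (9.0, 6.0)  ->  f'(3)  = 6
print(f_with_tangent(-2.0, 1.0))   # (2.0, -1.0) ->  f'(-2) = -1
```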

This trace-based behavior is one reason AD works well for programs. It does not need to construct a global symbolic formula for every possible path.

## AD and Data Structures

AD becomes more complex when programs use arrays, mutation, aliasing, sparse structures, or external kernels.

A tensor operation such as

```text
z = sum(x * y)
```

has a simple derivative:

$$
\frac{\partial z}{\partial x} = y,
\qquad
\frac{\partial z}{\partial y} = x.
$$

But the implementation must account for shape, dtype, memory layout, broadcasting, and reduction axes.

For example:

```text
z = sum(x + b)
```

Here $x$ is a matrix and $b$ is a vector broadcast across its rows. The reverse derivative with respect to $b$ must therefore sum over the broadcast dimension.
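
A NumPy sketch of that reverse rule, assuming $x$ has shape $(m, n)$ and $b$ has shape $(n,)$:

```python
import numpy as np

x = np.arange(6.0).reshape(2, 3)    # shape (2, 3)
b = np.array([10.0, 20.0, 30.0])    # shape (3,), broadcast across the rows

z = np.sum(x + b)                   # forward: a single scalar

# Reverse: the adjoint of every element of (x + b) is 1.
grad_elems = np.ones_like(x)

grad_x = grad_elems                 # same shape as x
grad_b = grad_elems.sum(axis=0)     # sum over the broadcast (row) dimension

print(grad_b)                       # [2. 2. 2.]: each b_j appears in 2 rows
```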

Thus, AD systems are not only mathematical tools. They are runtime and compiler systems. They must understand the semantics of the operations they differentiate.

## AD Modes as Linear Maps

A derivative is a linear map between tangent spaces.

For

$$
f : \mathbb{R}^n \to \mathbb{R}^m,
$$

the derivative at $x$ is the Jacobian

$$
J_f(x) : \mathbb{R}^n \to \mathbb{R}^m.
$$

Forward mode applies this map to a vector:

$$
u \mapsto J_f(x)u.
$$

Reverse mode applies the transpose map to a covector:

$$
w \mapsto J_f(x)^T w.
$$

These are the two most important derivative products in numerical computing.

Forward mode computes Jacobian-vector products.

Reverse mode computes vector-Jacobian products, or equivalently Jacobian-transpose-vector products.

Many systems avoid constructing the full Jacobian because it may be enormous. Instead, they provide efficient products with the Jacobian or its transpose.
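
As a sketch of such matrix-free products, take the illustrative function $f(x) = A\tanh(x)$, whose Jacobian is $A\,\operatorname{diag}(1 - \tanh(x)^2)$. Both products can be applied without ever forming that Jacobian, and the identity $w^T (J u) = (J^T w)^T u$ provides a cheap consistency check:

```python
import numpy as np

def make_products(A, x):
    # For f(x) = A @ tanh(x), the Jacobian is J_f(x) = A @ diag(1 - tanh(x)^2).
    s = 1.0 - np.tanh(x) ** 2        # elementwise derivative of tanh

    def jvp(u):                      # forward mode: u -> J u
        return A @ (s * u)

    def vjp(w):                      # reverse mode: w -> J^T w
        return s * (A.T @ w)

    return jvp, vjp

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))      # f : R^5 -> R^3
x = rng.standard_normal(5)
jvp, vjp = make_products(A, x)

u = rng.standard_normal(5)
w = rng.standard_normal(3)
print(np.isclose(w @ jvp(u), vjp(w) @ u))   # True: w^T (J u) == (J^T w)^T u
```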

## Why AD Matters

Automatic differentiation matters because it makes derivatives available at program scale.

A programmer can write a numerical computation once. The AD system supplies derivative computation from the same source of truth. This reduces errors, improves maintainability, and enables algorithms that would otherwise be impractical.

AD is now used in:

| Area | Derivative use |
|---|---|
| Machine learning | Gradients for training |
| Optimization | Objective and constraint derivatives |
| Scientific simulation | Sensitivities and inverse problems |
| Control | Linearization and policy gradients |
| Robotics | Kinematics and dynamics derivatives |
| Finance | Greeks and risk sensitivities |
| Graphics | Differentiable rendering |
| Probabilistic programming | Gradients of log densities |

The common structure is the same: a program computes a quantity, and another algorithm needs to know how that quantity changes.

Automatic differentiation provides that missing layer. It turns ordinary numerical programs into differentiable programs.

