# Chapter 4. Core Theory of Automatic Differentiation

Automatic differentiation is built on a simple observation: a complicated derivative can be computed by composing many small local derivatives. Instead of manipulating a full symbolic expression, or approximating derivatives numerically, AD propagates derivative information through each elementary operation in a program.

This chapter develops the central mechanism underlying all forms of automatic differentiation: local derivative propagation.

## Programs as Compositions

Consider a function:

$$
f(x)=\sin(x^2+3x)
$$

This function is not evaluated as a single indivisible object. A program computes it as a sequence of intermediate operations:

| Step | Expression | Variable |
|---|---|---|
| 1 | $x^2$ | $v_1$ |
| 2 | $3x$ | $v_2$ |
| 3 | $v_1 + v_2$ | $v_3$ |
| 4 | $\sin(v_3)$ | $v_4$ |

The final result is:

$$
f(x)=v_4
$$

Each operation has a local derivative rule: the partial derivative of its output with respect to each of its inputs.

| Operation | Local Derivatives (per input) |
|---|---|
| $a+b$ | $1,\ 1$ |
| $a \cdot b$ | $b,\ a$ |
| $\sin(a)$ | $\cos(a)$ |
| $a^2$ | $2a$ |

Automatic differentiation applies these rules incrementally across the computation graph.
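
To make the decomposition concrete, here is a minimal Python sketch that evaluates the same trace step by step; the names `v1` through `v4` mirror the table above.

```python
import math

def f_trace(x):
    """Evaluate f(x) = sin(x^2 + 3x) as the explicit sequence of
    elementary steps from the table above."""
    v1 = x * x         # step 1: x^2
    v2 = 3 * x         # step 2: 3x
    v3 = v1 + v2       # step 3: v1 + v2
    v4 = math.sin(v3)  # step 4: sin(v3)
    return v4

print(f_trace(2.0))  # same value as math.sin(2.0**2 + 3 * 2.0)
```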

## The Chain Rule as Propagation

The chain rule connects local derivatives into global derivatives.

For nested functions:

$$
y=f(g(h(x)))
$$

the derivative becomes:

$$
\frac{dy}{dx} =
\frac{dy}{dg}
\frac{dg}{dh}
\frac{dh}{dx}
$$

Automatic differentiation operationalizes this rule mechanically.

Instead of reasoning symbolically about the full expression, the system computes:

1. local values
2. local derivatives
3. propagated sensitivities

This decomposition is the key idea.
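
As a sketch of these three steps, the following code differentiates $f(x)=\sin(x^2+3x)$ from the previous section by computing local values, local derivatives, and the propagated sensitivity; no symbolic expression for the full derivative is ever formed.

```python
import math

def df_chain(x):
    """Differentiate f(x) = sin(x^2 + 3x) by multiplying local
    derivatives along the trace."""
    # 1. local values
    v1, v2 = x * x, 3 * x
    v3 = v1 + v2
    # 2. local derivatives of each step with respect to its inputs
    dv1_dx, dv2_dx = 2 * x, 3.0
    dv3_dv1, dv3_dv2 = 1.0, 1.0
    dv4_dv3 = math.cos(v3)
    # 3. propagated sensitivity via the chain rule
    return dv4_dv3 * (dv3_dv1 * dv1_dx + dv3_dv2 * dv2_dx)

print(df_chain(2.0))  # equals cos(10) * 7
```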

## Computational Graph Representation

A program can be represented as a directed acyclic graph (DAG): nodes represent operations or variables, and edges represent data dependencies.

For:

$$
f(x,y)=x y+\sin(x)
$$

the graph contains:

- input nodes $x,y$
- multiplication node $xy$
- sine node $\sin(x)$
- addition node

Derivative information flows along graph edges.

The graph separates:
- primal computation
- derivative propagation

This separation allows AD systems to transform programs systematically.
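
One way to make this representation concrete is a small node structure. The `Node` class below is a hypothetical sketch, not any particular framework's API; the graph for $f(x,y)=xy+\sin(x)$ is built by hand.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """One graph node: an operation name, input nodes, and a primal value."""
    op: str
    inputs: list = field(default_factory=list)
    value: float = 0.0

# Hand-built graph for f(x, y) = x*y + sin(x).
x = Node("input", value=2.0)
y = Node("input", value=5.0)
mul = Node("mul", [x, y], x.value * y.value)
sin = Node("sin", [x], math.sin(x.value))
add = Node("add", [mul, sin], mul.value + sin.value)

print(add.value)  # primal result; derivative information would flow along the same edges
```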

## Local Jacobians

Each operation defines a small Jacobian relating its outputs to its inputs. For scalar operations these are ordinary derivatives; for vector operations they become matrices.

Example:

$$
z=x y
$$

Local derivatives:

$$
\frac{\partial z}{\partial x}=y
$$

$$
\frac{\partial z}{\partial y}=x
$$

The local Jacobian is:

$$
J=
\begin{bmatrix}
y & x
\end{bmatrix}
$$

AD systems never need to materialize the full global Jacobian explicitly. Instead, they propagate products involving local Jacobians.

This distinction is critical for scalability.
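
The following sketch shows what propagating such products might look like for $z = xy$: a Jacobian-vector product for forward mode and a vector-Jacobian product for reverse mode. The function names `mul_jvp` and `mul_vjp` are chosen here for illustration.

```python
def mul_jvp(x, y, dx, dy):
    """Forward mode: Jacobian-vector product J @ [dx, dy] for z = x*y,
    where J = [y, x]. The 1x2 Jacobian is never stored as a matrix."""
    return y * dx + x * dy

def mul_vjp(x, y, z_bar):
    """Reverse mode: vector-Jacobian product z_bar @ J for z = x*y."""
    return y * z_bar, x * z_bar

print(mul_jvp(2.0, 5.0, 1.0, 0.0))  # 5.0 = dz/dx
print(mul_vjp(2.0, 5.0, 1.0))       # (5.0, 2.0) = adjoints of x and y
```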

## Forward Propagation of Derivatives

In forward mode, derivatives move in the same direction as computation.

Suppose:

$$
v_1=x^2
$$

Then:

$$
\dot v_1 =
2x \dot x
$$

The dot notation $\dot v$ denotes a tangent value: the derivative of $v$ with respect to the input $x$.

Every variable carries:
- its primal value
- its derivative value

Evaluation proceeds step-by-step.

Example:

$$
f(x)=\sin(x^2)
$$

with $x=3$.

Primal computation:

| Variable | Value |
|---|---|
| $x$ | 3 |
| $v_1=x^2$ | 9 |
| $v_2=\sin(v_1)$ | $\sin(9)$ |

Derivative propagation:

| Variable | Derivative |
|---|---|
| $\dot x$ | 1 |
| $\dot v_1$ | $2x\dot x=6$ |
| $\dot v_2$ | $\cos(v_1)\dot v_1=6\cos(9)$ |

Final derivative:

$$
f'(3)=6\cos(9)
$$

No symbolic algebra is required.
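
Forward mode is often implemented with dual numbers, which carry the (primal, tangent) pair through every operation. The `Dual` class below is a minimal sketch supporting only the two operations this example needs.

```python
import math
from dataclasses import dataclass

@dataclass
class Dual:
    """A (primal, tangent) pair; each operation applies its forward rule."""
    val: float  # primal value
    dot: float  # tangent value

    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dual_sin(a):
    return Dual(math.sin(a.val), math.cos(a.val) * a.dot)

x = Dual(3.0, 1.0)  # seed the input tangent with 1
v1 = x * x          # val 9.0, dot 6.0
v2 = dual_sin(v1)   # val sin(9), dot 6*cos(9)
print(v2.dot)       # f'(3) = 6*cos(9)
```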

## Reverse Propagation of Derivatives

Reverse mode propagates sensitivities backward.

Instead of computing:

$$
\frac{\partial v_i}{\partial x}
$$

it computes:

$$
\frac{\partial y}{\partial v_i}
$$

These quantities, written $\bar v_i = \partial y / \partial v_i$, are called adjoints.

Suppose:

$$
y=\sin(x^2)
$$

Forward pass:
- compute intermediate values
- store them

Reverse pass:
- propagate gradients backward

Intermediate variables:

$$
v_1=x^2
$$

$$
v_2=\sin(v_1)
$$

Backward propagation:

$$
\bar v_2=1
$$

$$
\bar v_1 =
\bar v_2 \cos(v_1)
$$

$$
\bar x =
\bar v_1 (2x)
$$

Final result:

$$
\frac{dy}{dx} =
2x\cos(x^2)
$$

Reverse mode is especially efficient when:
- outputs are few
- inputs are many

This property explains its dominance in deep learning.
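
A hand-written version of this two-pass pattern for $y=\sin(x^2)$ might look as follows; a real engine would record the tape automatically rather than spelling out both passes.

```python
import math

def grad_sin_square(x):
    """Reverse-mode gradient of y = sin(x^2), written out by hand."""
    # forward pass: compute and store intermediate values
    v1 = x * x
    v2 = math.sin(v1)
    # reverse pass: propagate adjoints backward through the trace
    v2_bar = 1.0                    # seed: dy/dy = 1
    v1_bar = v2_bar * math.cos(v1)  # dy/dv1
    x_bar = v1_bar * (2 * x)        # dy/dx
    return v2, x_bar

y, dydx = grad_sin_square(3.0)
print(dydx, 2 * 3.0 * math.cos(9.0))  # both equal 6*cos(9)
```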

## Propagation Rules for Elementary Operations

Each primitive operation defines propagation equations.

Addition:

$$
z=x+y
$$

Forward:

$$
\dot z=\dot x+\dot y
$$

Reverse (the $+=$ denotes accumulation, since a variable may feed several operations):

$$
\bar x += \bar z
$$

$$
\bar y += \bar z
$$

Multiplication:

$$
z=xy
$$

Forward:

$$
\dot z=y\dot x+x\dot y
$$

Reverse:

$$
\bar x += y\bar z
$$

$$
\bar y += x\bar z
$$

Sine:

$$
z=\sin(x)
$$

Forward:

$$
\dot z=\cos(x)\dot x
$$

Reverse:

$$
\bar x += \cos(x)\bar z
$$

These rules form the operational core of AD engines.
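
One way an engine might tabulate these rules is as paired forward and reverse functions keyed by operation name. The `RULES` dictionary below is purely illustrative, not any particular engine's internals.

```python
import math

# Paired tangent (forward) and adjoint (reverse) rules per primitive.
RULES = {
    "add": {"fwd": lambda x, y, dx, dy: dx + dy,
            "rev": lambda x, y, z_bar: (z_bar, z_bar)},
    "mul": {"fwd": lambda x, y, dx, dy: y * dx + x * dy,
            "rev": lambda x, y, z_bar: (y * z_bar, x * z_bar)},
    "sin": {"fwd": lambda x, dx: math.cos(x) * dx,
            "rev": lambda x, z_bar: (math.cos(x) * z_bar,)},
}

print(RULES["mul"]["rev"](2.0, 5.0, 1.0))  # adjoints (5.0, 2.0)
```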

## Locality and Modularity

A major advantage of AD is locality.

Each operation only needs:
- local inputs
- local outputs
- local derivative rules

An operation never needs knowledge of the full program.

This locality enables:
- composability
- compiler transformations
- efficient runtime systems
- extensible operator libraries

New differentiable operations can be added independently.
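
For example, under the illustrative `RULES` layout sketched in the previous section, registering a new primitive such as $\exp$ requires only its own local rules:

```python
import math

RULES = {}  # illustrative rule table, as in the previous sketch

def register(name, fwd, rev):
    """Attach local forward and reverse rules to a named primitive."""
    RULES[name] = {"fwd": fwd, "rev": rev}

# Registering exp touches nothing else in the system: no knowledge
# of the rest of the program or of any other operation is needed.
register("exp",
         fwd=lambda x, dx: math.exp(x) * dx,
         rev=lambda x, z_bar: (math.exp(x) * z_bar,))

print(RULES["exp"]["fwd"](0.0, 1.0))  # 1.0: derivative of exp at 0
```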

## Exactness and Floating Point Semantics

Automatic differentiation computes derivatives that are exact up to floating-point roundoff.

This differs fundamentally from numerical approximation by finite differences.

Finite difference approximation:

$$
f'(x)\approx\frac{f(x+h)-f(x)}{h}
$$

introduces:
- truncation error
- cancellation error
- sensitivity to step size

AD avoids these issues because it differentiates the computational procedure itself.

The derivative computation is exact relative to the executed floating point operations.
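
The step-size dilemma is easy to demonstrate: too large an $h$ gives truncation error, too small gives cancellation. The sketch below compares a forward difference against the exact derivative $2x\cos(x^2)$, which is what AD would produce for $f(x)=\sin(x^2)$.

```python
import math

def f(x):
    return math.sin(x * x)

def f_prime(x):
    return 2 * x * math.cos(x * x)  # exact rule -- what AD evaluates

x = 3.0
for h in (1e-4, 1e-8, 1e-12):
    fd = (f(x + h) - f(x)) / h      # forward-difference approximation
    print(h, abs(fd - f_prime(x)))  # error depends strongly on h
```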

## Complexity Perspective

Suppose a program requires $C$ operations.

Forward mode derivative computation costs approximately:

$$
O(C)
$$

for one directional derivative.

Reverse mode computes the full gradient at a cost that is a small constant multiple of the original computation, typically a factor of about three to four; this is sometimes called the cheap gradient principle.

This efficiency is one of the most important results in automatic differentiation theory.

For scalar-output functions:

$$
f:\mathbb{R}^n \to \mathbb{R}
$$

reverse mode computes the full gradient in roughly the same asymptotic complexity as evaluating the function itself.

## Local Propagation as the Core Abstraction

Everything in automatic differentiation reduces to local propagation.

Regardless of:
- programming language
- compiler strategy
- tensor system
- graph representation
- runtime architecture

the core mechanism remains unchanged:

1. decompose computation into primitive operations
2. associate local derivative rules
3. propagate derivative information through the graph

Forward mode propagates tangents.

Reverse mode propagates adjoints.

Higher-order systems propagate richer algebraic structures.

But all of them inherit the same foundational principle:
local derivative propagation through compositional programs.

