# Backpropagation

Backpropagation is reverse mode automatic differentiation applied to neural networks. In most machine learning writing, the term refers to the whole training procedure: run a forward pass, compute a loss, propagate gradients backward, then update parameters. Strictly speaking, though, backpropagation is only the derivative computation step within that procedure.

A neural network is a composition of functions:

$$
f_\theta(x) =
f_k(
f_{k-1}(
\cdots
f_2(f_1(x))
)).
$$

Each layer produces an intermediate activation:

$$
a_0 = x,
\qquad
a_i = f_i(a_{i-1}; \theta_i),
\qquad
\hat{y} = a_k.
$$

The loss is a scalar:

$$
L = \ell(\hat{y}, y).
$$

Backpropagation computes derivatives of $L$ with respect to every parameter block $\theta_i$. It does this by walking the computation backward and applying the chain rule.

## Forward Pass

The forward pass evaluates the model and stores intermediate values needed by the backward pass.

For a simple feedforward network:

$$
z_i = W_i a_{i-1} + b_i,
$$

$$
a_i = \sigma(z_i).
$$

The values $a_{i-1}$, $z_i$, and $W_i$ (and sometimes masks or shape metadata) may be needed later to compute gradients.

A reverse mode AD system therefore saves selected intermediate values. These saved values are often called residuals, activations, or tape entries depending on the implementation.
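
As a concrete sketch, here is a forward pass for a small tanh MLP that returns both the output and the saved residuals. The function name, the `(W, b)` parameter layout, and the choice of tanh are illustrative assumptions, not a fixed convention.

```python
import numpy as np

def forward_with_residuals(params, x):
    """Forward pass that also returns the values the backward pass will need.

    `params` is a list of (W, b) pairs; `x` is one input vector.
    Each saved (a_prev, z) pair is the residual for the corresponding layer.
    """
    residuals = []
    a = x
    for W, b in params:
        z = W @ a + b              # pre-activation
        residuals.append((a, z))   # saved for the backward pass
        a = np.tanh(z)             # activation
    return a, residuals
```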

## Backward Pass

The backward pass starts from the scalar loss. Since

$$
\frac{\partial L}{\partial L} = 1,
$$

the initial adjoint is one.

For each intermediate variable $v$, backpropagation computes its adjoint:

$$
\bar{v} = \frac{\partial L}{\partial v}.
$$

If a variable feeds into several later operations, its adjoint receives contributions from all uses. The contributions are summed:

$$
\bar{v} =
\sum_j
\bar{u}_j
\frac{\partial u_j}{\partial v}.
$$

This accumulation rule is the reason gradient buffers are additive. A parameter used many times in a computation graph receives one contribution per use.

## A Single Layer

Consider one layer:

$$
z = Wa + b,
\qquad
h = \sigma(z).
$$

Suppose the backward pass receives

$$
\bar{h} = \frac{\partial L}{\partial h}.
$$

The derivative through the activation is:

$$
\bar{z} =
\bar{h} \odot \sigma'(z).
$$

The parameter gradients are:

$$
\bar{W} =
\frac{\partial L}{\partial W} =
\bar{z}a^\top,
$$

$$
\bar{b} =
\frac{\partial L}{\partial b} =
\bar{z}.
$$

The adjoint passed to the previous activation is:

$$
\bar{a} =
W^\top \bar{z}.
$$

This is the local structure of backpropagation. Each layer receives an upstream adjoint, computes gradients for its own parameters, and returns a downstream adjoint for earlier layers.
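
A minimal NumPy sketch of this local rule, assuming $\sigma = \tanh$ so that $\sigma'(z) = 1 - \tanh^2(z)$; the function name and argument order are illustrative.

```python
import numpy as np

def layer_backward(h_bar, W, a, z):
    """Backward rule for h = tanh(W @ a + b), given the upstream adjoint h_bar.

    Returns the parameter gradients and the adjoint for the previous activation.
    """
    z_bar = h_bar * (1.0 - np.tanh(z) ** 2)  # chain rule through the activation
    W_bar = np.outer(z_bar, a)               # dL/dW = z_bar a^T
    b_bar = z_bar                            # dL/db = z_bar
    a_bar = W.T @ z_bar                      # adjoint passed to the previous layer
    return W_bar, b_bar, a_bar
```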

## Batched Backpropagation

In practice, layers process batches. Let

$$
A \in \mathbb{R}^{d_{\text{in}} \times B}
$$

contain $B$ input activations, one column per example. A linear layer computes:

$$
Z = WA + b\mathbf{1}^\top.
$$

If the upstream adjoint after the activation is

$$
\bar{Z} \in \mathbb{R}^{d_{\text{out}} \times B},
$$

then the gradients are:

$$
\bar{W} =
\bar{Z}A^\top,
$$

$$
\bar{b} =
\bar{Z}\mathbf{1},
$$

$$
\bar{A} =
W^\top \bar{Z}.
$$

These formulas explain why backpropagation maps efficiently to dense matrix multiplication. Training neural networks is largely a sequence of forward matrix products and backward matrix products.
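
A small NumPy sketch with made-up shapes shows that the batched gradients are three matrix products whose outputs match the shapes of $W$, $b$, and $A$.

```python
import numpy as np

d_in, d_out, B = 4, 3, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(d_in, B))        # one column per example
Z_bar = rng.normal(size=(d_out, B))   # upstream adjoint, stand-in values

W_bar = Z_bar @ A.T                   # shape (d_out, d_in), same as W
b_bar = Z_bar @ np.ones(B)            # sum over the batch dimension, same shape as b
A_bar = W.T @ Z_bar                   # shape (d_in, B), same as A
```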

## Computational Graph View

Backpropagation does not require the program to be written as layers. It only needs a graph of operations and dependencies.

A computation graph contains nodes for values and operations:

```text
x -> linear -> relu -> linear -> loss
```

The backward graph runs in reverse topological order:

```text
loss -> linear backward -> relu backward -> linear backward -> x
```

Each primitive operation has a local backward rule. For example, multiplication, matrix multiplication, convolution, normalization, and indexing each define how incoming adjoints are transformed into adjoints for their inputs.
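
For instance, a relu primitive might save a mask in the forward pass and reuse it in its local backward rule. The sketch below is schematic, not any particular framework's API.

```python
import numpy as np

def relu_forward(x):
    # Save the mask that the backward rule needs.
    mask = x > 0
    return np.maximum(x, 0.0), mask

def relu_backward(y_bar, mask):
    # Local rule: pass the adjoint through where the input was positive.
    return y_bar * mask
```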

The global gradient emerges from composing these local rules.

## Backpropagation and Reverse Mode AD

Backpropagation is sometimes presented as a special neural network algorithm. In AD terms, it is an instance of reverse accumulation.

For a function

$$
L = F(\theta),
$$

where $L$ is scalar and $\theta$ contains many parameters, reverse mode computes:

$$
\nabla_\theta L.
$$

The cost is usually within a small constant factor of evaluating $F$, assuming saved intermediates are available.

This cost property explains the importance of backpropagation. A neural network may have billions of parameters, but one scalar loss. Reverse mode computes all parameter gradients with one backward traversal.

## Memory Cost

Backpropagation requires memory for saved forward values. For deep networks, activation memory can exceed parameter memory.

The standard tradeoff is:

| Strategy | Memory | Compute |
|---|---:|---:|
| Save all activations | High | Low |
| Recompute activations | Low | High |
| Checkpoint selected layers | Medium | Medium |

Checkpointing stores only some intermediate values. During the backward pass, missing values are recomputed from the nearest checkpoint. This reduces memory at the cost of extra forward computation.
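
A schematic sketch of the idea, reusing the tanh MLP from earlier: the forward pass keeps only every few activations as checkpoints, and a helper recomputes the residuals of a segment from the nearest checkpoint when the backward pass needs them. A real implementation would interleave this recomputation with the backward sweep; the names and layout here are illustrative.

```python
import numpy as np

def forward_checkpointed(params, x, every=2):
    """Forward pass that saves only every `every`-th activation as a checkpoint."""
    checkpoints = {0: x}
    a = x
    for i, (W, b) in enumerate(params, start=1):
        a = np.tanh(W @ a + b)
        if i % every == 0:
            checkpoints[i] = a
    return a, checkpoints

def recompute_segment(params, checkpoints, start, end):
    """Recompute the residuals for layers start+1 .. end from the checkpoint at `start`."""
    a = checkpoints[start]
    residuals = []
    for W, b in params[start:end]:
        z = W @ a + b
        residuals.append((a, z))
        a = np.tanh(z)
    return residuals
```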

This tradeoff belongs to AD runtime design, not to the mathematical chain rule. The derivative is the same. The execution plan changes.

## Custom Backward Rules

Many systems allow custom gradients. A custom backward rule replaces the default derivative of a primitive or composite operation.

This is useful when the mathematically direct derivative is inefficient, unstable, or inconvenient. Examples include softmax with cross-entropy, normalization layers, clipping operations, and fused GPU kernels.

A custom backward rule must obey the same contract: given upstream adjoints and saved residuals, return adjoints for differentiable inputs.

Incorrect custom gradients are dangerous because the AD system will trust them. Testing usually compares the custom gradient against finite differences or a simpler reference implementation.
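
A minimal finite-difference check might look like the following sketch; `finite_difference_check`, its step size, and its tolerance are illustrative choices rather than a standard interface.

```python
import numpy as np

def finite_difference_check(f, grad_f, theta, eps=1e-6, tol=1e-4):
    """Compare an analytic gradient against central finite differences.

    `f` maps a parameter vector to a scalar loss; `grad_f` is the gradient under test.
    """
    analytic = grad_f(theta)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        numeric[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return np.max(np.abs(analytic - numeric)) < tol

# Example: the gradient of L(theta) = ||theta||^2 is 2 * theta.
theta0 = np.array([0.5, -1.0, 2.0])
assert finite_difference_check(lambda t: float(t @ t), lambda t: 2 * t, theta0)
```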

## Backpropagation Through Shared Parameters

Some models use the same parameter in multiple places. Recurrent networks, tied embeddings, and weight sharing all have this structure.

If a parameter $W$ is used at several graph locations, then its total gradient is the sum of all contributions:

$$
\frac{\partial L}{\partial W} =
\sum_t
\frac{\partial L_t}{\partial W}.
$$

The AD system must accumulate into the same gradient buffer. This is a graph property, not a layer property. Shared parameters make it clear that backpropagation operates over dependency graphs rather than over a simple list of layers.
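
A tiny scalar example makes the accumulation concrete: if $L = w \cdot (w \cdot x)$, the parameter $w$ appears twice in the graph, and its gradient is the sum of one contribution per use.

```python
# L = w * (w * x): the parameter w is used twice in the graph.
w, x = 3.0, 2.0

h = w * x          # first use of w
y = w * h          # second use of w

# Backward pass, accumulating one contribution per use of w.
y_bar = 1.0
w_grad = 0.0
w_grad += y_bar * h          # contribution from y = w * h
h_bar = y_bar * w
w_grad += h_bar * x          # contribution from h = w * x

assert w_grad == 2 * w * x   # matches d/dw (w^2 x) = 2 w x
```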

## Backpropagation Through Time

For recurrent computation, the same transition is applied repeatedly:

$$
h_t = f_\theta(h_{t-1}, x_t).
$$

Unrolling the recurrence gives an ordinary computation graph:

$$
h_0 \to h_1 \to h_2 \to \cdots \to h_T.
$$

Backpropagation through time applies reverse mode to this unrolled graph. Gradients with respect to $\theta$ accumulate across time steps.

The memory cost grows with sequence length because hidden states and other residuals are needed for the backward pass. Long sequences often require truncation, checkpointing, or specialized architectures.
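
A scalar sketch of backpropagation through time, assuming the recurrence $h_t = \tanh(w h_{t-1} + u x_t)$ with loss $L = h_T$; the shared parameters $w$ and $u$ accumulate one gradient contribution per time step.

```python
import numpy as np

# Scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t), loss L = h_T.
w, u = 0.8, 0.5
xs = [1.0, -0.5, 2.0, 0.3]

# Forward pass, saving the hidden states needed by the backward pass.
hs = [0.0]
for x in xs:
    hs.append(np.tanh(w * hs[-1] + u * x))

# Backward through time: one contribution to w_grad and u_grad per step.
w_grad, u_grad = 0.0, 0.0
h_bar = 1.0                     # dL/dh_T
for t in reversed(range(len(xs))):
    z_bar = h_bar * (1.0 - hs[t + 1] ** 2)
    w_grad += z_bar * hs[t]     # shared parameter: accumulate across time
    u_grad += z_bar * xs[t]
    h_bar = z_bar * w           # adjoint for the previous hidden state
```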

## Numerical Issues

Backpropagation multiplies many local derivative factors. This can cause vanishing or exploding gradients.

For a repeated scalar recurrence,

$$
h_t = a h_{t-1},
$$

the derivative across $T$ steps is:

$$
\frac{\partial h_T}{\partial h_0} =
a^T.
$$

If $|a| < 1$, the derivative decays; if $|a| > 1$, it grows. With $a = 0.9$ and $T = 100$, for example, the factor is $0.9^{100} \approx 2.7 \times 10^{-5}$, while $a = 1.1$ gives $1.1^{100} \approx 1.4 \times 10^{4}$. Deep networks and recurrent networks have higher-dimensional versions of the same issue, where products of Jacobians control gradient scale.

AD computes these products according to the program. It does not guarantee that the products are numerically well-conditioned.

## Minimal Backpropagation Engine

A small reverse mode engine stores values, parents, and local backward functions.

```text
class Node:
    value
    grad
    parents
    backward
```

A computation creates nodes:

```text
z = add(matmul(W, x), b)
a = relu(z)
loss = mse(a, y)
```

Calling backward on `loss` proceeds as follows:

```text
topological_order = sort_graph(loss)

loss.grad = 1

for node in reverse(topological_order):
    node.backward(node.grad)
```

Each `backward` function adds gradient contributions to parent nodes.
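
A runnable version of this design fits in a few dozen lines of Python. The sketch below follows the pseudocode above but commits to scalar values and a handful of primitives; all names are illustrative.

```python
class Node:
    """A scalar value in the computation graph, with bookkeeping for backprop."""
    def __init__(self, value, parents=(), backward=lambda grad: None):
        self.value = value
        self.grad = 0.0
        self.parents = parents      # nodes this value was computed from
        self.backward = backward    # adds contributions to the parents' grads

def add(a, b):
    out = Node(a.value + b.value, (a, b))
    def backward(grad):
        a.grad += grad              # d(a+b)/da = 1
        b.grad += grad              # d(a+b)/db = 1
    out.backward = backward
    return out

def mul(a, b):
    out = Node(a.value * b.value, (a, b))
    def backward(grad):
        a.grad += grad * b.value    # d(ab)/da = b
        b.grad += grad * a.value    # d(ab)/db = a
    out.backward = backward
    return out

def relu(a):
    out = Node(max(a.value, 0.0), (a,))
    def backward(grad):
        a.grad += grad * (1.0 if a.value > 0 else 0.0)
    out.backward = backward
    return out

def sort_graph(root):
    """Topological order of the graph reachable from root."""
    order, seen = [], set()
    def visit(node):
        if node not in seen:
            seen.add(node)
            for p in node.parents:
                visit(p)
            order.append(node)
    visit(root)
    return order

def backward(loss):
    loss.grad = 1.0
    for node in reversed(sort_graph(loss)):
        node.backward(node.grad)

# Usage: L = relu(w * x + b), then dL/dw and dL/db.
w, x, b = Node(2.0), Node(3.0), Node(-1.0)
loss = relu(add(mul(w, x), b))
backward(loss)
print(w.grad, b.grad)   # 3.0 1.0
```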

This minimal design is enough to express the essence of backpropagation. Production systems add tensor kernels, memory planning, alias tracking, distributed execution, mixed precision, custom operators, and graph compilation.

## Backpropagation as an Interface Boundary

Backpropagation connects the model and the optimizer.

The model defines the forward computation:

```text
loss = model_loss(theta, batch)
```

The AD system derives the gradient computation:

```text
grad = backward(loss, theta)
```

The optimizer applies an update:

```text
theta = optimizer_update(theta, grad, state)
```

Keeping this boundary clear helps system design. A model should not need to know whether the optimizer is SGD or Adam. An optimizer should not need to know whether the gradient came from a tape, a compiler pass, or a manually written backward function.
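
A sketch of this boundary, with a hand-written mean squared error gradient standing in for the AD system and plain SGD standing in for the optimizer; the function names are illustrative.

```python
import numpy as np

def model_loss_and_grad(theta, batch):
    # Stand-in for whatever produces the loss and gradient: a tape,
    # a compiler pass, or a hand-written backward function.
    X, y = batch
    residual = X @ theta - y
    loss = float(residual @ residual) / len(y)   # mean squared error
    grad = 2.0 * X.T @ residual / len(y)         # its gradient w.r.t. theta
    return loss, grad

def sgd_update(theta, grad, lr=0.1):
    # The optimizer sees only parameters and gradients, nothing about the model.
    return theta - lr * grad

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 3)), rng.normal(size=32)
theta = np.zeros(3)
for _ in range(100):
    loss, grad = model_loss_and_grad(theta, (X, y))
    theta = sgd_update(theta, grad)
```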

Backpropagation is therefore both a mathematical procedure and a systems protocol. It is the standard way a differentiable program exposes parameter sensitivities to an optimization algorithm.

