# Backpropagation

Backpropagation is the algorithm used to compute gradients in neural networks efficiently. It applies reverse-mode differentiation to a computational graph. The algorithm evaluates the network forward to compute predictions and loss, then traverses the graph backward to compute gradients with respect to parameters.

Without backpropagation, training modern neural networks would be computationally infeasible. A large language model may contain billions of parameters. Backpropagation computes all parameter gradients in roughly the same order of cost as one forward pass.

### The Goal of Backpropagation

Suppose a neural network has parameters \(\theta\), input data \(X\), predictions \(\hat{y}=f_\theta(X)\), and scalar loss \(L=\ell(\hat{y}, y)\). Training requires the gradient

$$
\nabla_\theta L.
$$

This gradient tells us how changing each parameter changes the loss.

The optimizer then updates the parameters. For gradient descent:

$$
\theta \leftarrow \theta - \eta \nabla_\theta L,
$$

where \(\eta\) is the learning rate.
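As a toy illustration of this rule (the quadratic loss, starting value, and learning rate below are made up for the example), a single scalar parameter can be driven to its optimum by repeated updates:

```python id="k2d9qz"
import torch

# One scalar parameter theta and a simple quadratic loss L = (theta - 3)^2.
theta = torch.tensor(0.0, requires_grad=True)
eta = 0.1  # learning rate

for _ in range(100):
    loss = (theta - 3.0) ** 2
    loss.backward()                  # compute dL/dtheta
    with torch.no_grad():
        theta -= eta * theta.grad    # gradient descent update
    theta.grad.zero_()               # clear the gradient for the next step

print(theta.item())  # close to 3.0, the minimizer of the loss
```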

The challenge is efficiency. A neural network may contain millions of intermediate computations. Naively differentiating every parameter separately would repeat enormous amounts of work. Backpropagation avoids this by reusing intermediate derivatives.

### A Small Neural Network

Consider a simple neural network with one hidden layer:

$$
h = W_1x + b_1,
$$

$$
a = \sigma(h),
$$

$$
o = W_2a + b_2,
$$

$$
L = \ell(o, y).
$$

Here:

| Symbol | Meaning |
|---|---|
| \(x\) | Input vector |
| \(W_1, b_1\) | First-layer parameters |
| \(h\) | Hidden pre-activation |
| \(\sigma\) | Activation function |
| \(a\) | Hidden activation |
| \(W_2, b_2\) | Output-layer parameters |
| \(o\) | Output logits or predictions |
| \(L\) | Scalar loss |

The forward pass computes these values from top to bottom. The backward pass computes gradients from bottom to top.

### Forward Pass

The forward pass evaluates the computational graph.

For example:

$$
x
\longrightarrow
h
\longrightarrow
a
\longrightarrow
o
\longrightarrow
L.
$$

Each operation produces intermediate tensors needed later during the backward pass.

In PyTorch:

```python id="ay8q2v"
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(3, 4),
    nn.ReLU(),
    nn.Linear(4, 1),
)

x = torch.randn(2, 3)
target = torch.randn(2, 1)

pred = model(x)
loss = ((pred - target) ** 2).mean()

print(loss)
```

During this forward computation, PyTorch records the operations needed for gradient computation.

### Backward Pass

The backward pass starts from the scalar loss and applies the chain rule backward through the graph.

We begin with

$$
\frac{\partial L}{\partial L}=1.
$$

Then gradients propagate through each operation.

For the output layer:

$$
o = W_2a + b_2.
$$

The gradients are

$$
\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial o}\,a^\top,
\quad
\frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial o},
\quad
\frac{\partial L}{\partial a} = W_2^\top \frac{\partial L}{\partial o},
$$

treating \(\partial L/\partial o\) as a column vector.

The gradient with respect to \(a\) becomes the upstream gradient for the previous layer.

Next:

$$
a=\sigma(h).
$$

The chain rule gives

$$
\frac{\partial L}{\partial h} =
\frac{\partial L}{\partial a}
\odot
\sigma'(h),
$$

where \(\odot\) denotes elementwise multiplication.

Then:

$$
h = W_1x+b_1.
$$

By the same pattern as the output layer, this produces

$$
\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h}\,x^\top,
\quad
\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial h},
\quad
\frac{\partial L}{\partial x} = W_1^\top \frac{\partial L}{\partial h}.
$$

The backward pass continues until all required gradients have been computed.
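These backward equations can be checked numerically. The sketch below runs the small network with sigmoid as \(\sigma\) and a squared-error loss (both choices are assumptions for the example), computes every gradient by hand, and compares against autograd:

```python id="mw3b8f"
import torch

torch.manual_seed(0)

x = torch.randn(3)
W1 = torch.randn(4, 3, requires_grad=True)
b1 = torch.randn(4, requires_grad=True)
W2 = torch.randn(1, 4, requires_grad=True)
b2 = torch.randn(1, requires_grad=True)
y = torch.randn(1)

# Forward pass, keeping intermediates for the backward pass.
h = W1 @ x + b1
a = torch.sigmoid(h)
o = W2 @ a + b2
L = ((o - y) ** 2).sum()
L.backward()

# Manual backward pass, mirroring the chain rule step by step.
dL_do = (2 * (o - y)).detach()
dL_dW2 = dL_do.unsqueeze(1) * a.detach().unsqueeze(0)  # outer product
dL_db2 = dL_do
dL_da = W2.detach().t() @ dL_do
sig = a.detach()
dL_dh = dL_da * sig * (1 - sig)   # elementwise: sigma'(h) = a (1 - a)
dL_dW1 = dL_dh.unsqueeze(1) * x.unsqueeze(0)
dL_db1 = dL_dh

print(torch.allclose(dL_dW2, W2.grad))  # True
print(torch.allclose(dL_dW1, W1.grad))  # True
```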

### Backpropagation as Message Passing

Backpropagation can be viewed as gradient messages flowing backward through the graph.

Each node receives an upstream gradient from later computations. It multiplies that upstream gradient by its local derivative and sends resulting gradients to earlier nodes.

For example, suppose

$$
z = x^2.
$$

The local derivative is

$$
\frac{dz}{dx}=2x.
$$

If the upstream gradient is

$$
\bar{z},
$$

then the gradient passed backward is

$$
\bar{x} =
\bar{z}\cdot 2x.
$$

Every operation follows the same pattern:

1. Receive upstream gradient.
2. Apply local derivative rule.
3. Pass gradients to inputs.

This local structure makes backpropagation scalable.
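The pattern can be sketched as a tiny hand-rolled operation (a toy class for illustration, not PyTorch's internal machinery):

```python id="q7zt4n"
class Square:
    """Toy node computing z = x ** 2, following the message-passing pattern."""

    def forward(self, x):
        self.x = x  # save the input; the backward pass needs it
        return x ** 2

    def backward(self, upstream):
        # 1. receive upstream gradient, 2. apply local rule dz/dx = 2x,
        # 3. pass the result to the input.
        return upstream * 2 * self.x

op = Square()
z = op.forward(3.0)
grad_x = op.backward(1.0)  # upstream gradient of 1 from the loss

print(z)       # 9.0
print(grad_x)  # 6.0
```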

### A Manual Backpropagation Example

Consider:

$$
x=2,
\qquad
w=3,
\qquad
b=1,
$$

$$
y = wx+b,
\qquad
L = y^2.
$$

First compute the forward pass:

$$
y = 3\cdot2 + 1 = 7,
$$

$$
L = 7^2 = 49.
$$

Now compute gradients.

Start with

$$
\frac{\partial L}{\partial L}=1.
$$

Since

$$
L = y^2,
$$

we get

$$
\frac{\partial L}{\partial y}=2y=14.
$$

Since

$$
y = wx+b,
$$

the local derivatives are

$$
\frac{\partial y}{\partial w}=x=2,
$$

$$
\frac{\partial y}{\partial x}=w=3,
$$

$$
\frac{\partial y}{\partial b}=1.
$$

Therefore:

$$
\frac{\partial L}{\partial w} =
14\cdot2 =
28,
$$

$$
\frac{\partial L}{\partial x} =
14\cdot3 =
42,
$$

$$
\frac{\partial L}{\partial b} =
14.
$$

In PyTorch:

```python id="vq9e95"
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

y = w * x + b
L = y ** 2

L.backward()

print(w.grad)  # tensor(28.)
print(x.grad)  # tensor(42.)
print(b.grad)  # tensor(14.)
```

The computed gradients match the manual derivation.

### Matrix Backpropagation

Neural networks operate primarily on matrices and tensors.

Consider a linear layer:

$$
Y = XW^\top + b.
$$

Assume:

$$
X\in\mathbb{R}^{B\times d},
$$

$$
W\in\mathbb{R}^{h\times d},
$$

$$
b\in\mathbb{R}^{h},
$$

$$
Y\in\mathbb{R}^{B\times h}.
$$

Suppose the upstream gradient is

$$
\bar{Y} =
\frac{\partial L}{\partial Y}.
$$

Then the backward equations are:

$$
\frac{\partial L}{\partial X} =
\bar{Y}W,
$$

$$
\frac{\partial L}{\partial W} =
\bar{Y}^\top X,
$$

$$
\frac{\partial L}{\partial b} =
\sum_{i=1}^{B}\bar{Y}_i.
$$

These formulas appear throughout deep learning systems.

In PyTorch:

```python id="m5ig8u"
layer = nn.Linear(3, 4)

X = torch.randn(5, 3)
Y = layer(X)

loss = Y.sum()
loss.backward()

print(layer.weight.grad.shape)  # torch.Size([4, 3])
print(layer.bias.grad.shape)    # torch.Size([4])
```
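The three formulas can also be verified against autograd by feeding an arbitrary upstream gradient \(\bar{Y}\) into `Y.backward` (the shapes below are made up for the check):

```python id="x1c6vp"
import torch

torch.manual_seed(0)

B, d, H = 5, 3, 4
X = torch.randn(B, d, requires_grad=True)
W = torch.randn(H, d, requires_grad=True)
b = torch.randn(H, requires_grad=True)

Y = X @ W.t() + b           # Y = X W^T + b, shape (B, H)

grad_Y = torch.randn(B, H)  # an arbitrary upstream gradient Y-bar
Y.backward(grad_Y)

# Manual backward equations for the linear layer.
print(torch.allclose(X.grad, grad_Y @ W.detach()))      # True
print(torch.allclose(W.grad, grad_Y.t() @ X.detach()))  # True
print(torch.allclose(b.grad, grad_Y.sum(dim=0)))        # True
```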

### Backpropagation Through Activations

Activation functions apply nonlinear transformations.

For ReLU:

$$
a = \max(0,h).
$$

The derivative is

$$
\frac{da}{dh} =
\begin{cases}
1, & h>0, \\
0, & h<0,
\end{cases}
$$

and at \(h=0\) the derivative is undefined; frameworks such as PyTorch use 0 there by convention.

During backpropagation, gradients pass through positive activations and are blocked at negative activations.

Example:

```python id="vdyhh5"
x = torch.tensor([-2.0, 1.0, 3.0], requires_grad=True)

y = torch.relu(x)
loss = y.sum()

loss.backward()

print(x.grad)
```

Output:

```text id="nz7xyi"
tensor([0., 1., 1.])
```

Negative entries receive zero gradient because ReLU is flat there.

For sigmoid:

$$
\sigma(x)=\frac{1}{1+e^{-x}},
$$

the derivative is

$$
\sigma'(x)=\sigma(x)(1-\sigma(x)).
$$

This derivative becomes small when the sigmoid saturates near 0 or 1. Repeated multiplication of small derivatives can lead to vanishing gradients.
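The identity can be confirmed with autograd:

```python id="b9rm2s"
import torch

x = torch.linspace(-5, 5, 11, requires_grad=True)

s = torch.sigmoid(x)
s.sum().backward()

# Autograd's gradient should equal sigma(x) * (1 - sigma(x)).
manual = (s * (1 - s)).detach()
print(torch.allclose(x.grad, manual))  # True
print(x.grad.max())  # at most 0.25, attained at x = 0
```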

### Backpropagation Through Batches

Training usually uses minibatches.

Suppose

$$
X\in\mathbb{R}^{B\times d}.
$$

The forward pass computes predictions for all examples simultaneously.

The backward pass computes gradients accumulated across the batch. For example, if the loss is averaged over the batch:

$$
L =
\frac{1}{B}
\sum_{i=1}^{B}
L_i,
$$

then parameter gradients combine contributions from all examples.

In PyTorch:

```python id="t8hfs7"
model = nn.Linear(10, 1)

X = torch.randn(32, 10)
y = torch.randn(32, 1)

pred = model(X)
loss = ((pred - y) ** 2).mean()

loss.backward()
```

The resulting parameter gradients already include the whole batch contribution.
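Because the loss is a mean over the batch, the parameter gradient equals the average of the per-example gradients. A quick check (batch size and shapes are arbitrary):

```python id="h4ky7e"
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Linear(10, 1)
X = torch.randn(4, 10)
y = torch.randn(4, 1)

# Gradient from one batched backward pass with a mean-reduced loss.
loss = ((model(X) - y) ** 2).mean()
loss.backward()
batched_grad = model.weight.grad.clone()

# Average of per-example gradients, computed one example at a time.
grads = []
for i in range(4):
    loss_i = ((model(X[i:i + 1]) - y[i:i + 1]) ** 2).mean()
    grads.append(torch.autograd.grad(loss_i, model.weight)[0])
avg_grad = torch.stack(grads).mean(dim=0)

print(torch.allclose(batched_grad, avg_grad, atol=1e-6))  # True
```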

### Backpropagation and Memory

Backpropagation requires intermediate forward activations.

For example, to differentiate

$$
y=x^2,
$$

the backward pass needs the value of \(x\).

A deep network therefore stores activations from the forward pass so they can be reused later.

This is why training uses more memory than inference.

For large models, activation memory dominates GPU usage. Several techniques reduce this cost:

| Technique | Idea |
|---|---|
| Mixed precision | Use smaller data types |
| Gradient checkpointing | Recompute activations during backward |
| Activation offloading | Move tensors between GPU and CPU |
| Smaller batch sizes | Reduce stored activations |
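The graph, and the activations it holds onto, is recorded only when gradients may be needed. Wrapping inference in `torch.no_grad()` skips this bookkeeping, which is one reason inference needs less memory:

```python id="p8wn3j"
import torch
from torch import nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
x = torch.randn(2, 8)

# Training-style forward: the output carries a grad_fn, so a graph
# (with saved activations) was recorded.
y_train = model(x)
print(y_train.grad_fn is not None)  # True

# Inference-style forward: no graph, no saved activations.
with torch.no_grad():
    y_infer = model(x)
print(y_infer.grad_fn is None)  # True
```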

### Gradient Checkpointing

Checkpointing trades computation for memory.

Instead of storing every intermediate activation, the system stores only selected checkpoints. Missing activations are recomputed during the backward pass.

In PyTorch:

```python id="4mh2lr"
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# Stand-in block; in practice this is one segment of a larger model.
model_block = nn.Sequential(nn.Linear(8, 8), nn.ReLU())

x = torch.randn(2, 8, requires_grad=True)

# Activations inside the block are not stored; they are recomputed
# during the backward pass.
y = checkpoint(model_block, x, use_reentrant=False)
```

This reduces memory usage but increases computation because some forward operations are repeated.

Checkpointing is widely used in transformer training.

### Gradient Flow

Backpropagation depends on gradients flowing through many layers.

Suppose a deep network computes a chain of hidden states \(h_0 = x \rightarrow h_1 \rightarrow \cdots \rightarrow h_n\), so that the chain rule yields a long product of derivatives:

$$
\frac{\partial L}{\partial x} =
\frac{\partial L}{\partial h_n}
\prod_{i=1}^{n}
\frac{\partial h_i}{\partial h_{i-1}}.
$$

If these derivatives are often smaller than one, gradients shrink exponentially. This is the vanishing gradient problem.

If they are larger than one, gradients grow exponentially. This is the exploding gradient problem.

These problems become severe in deep recurrent networks and poorly initialized models.

Modern architectures reduce them using:

| Technique | Purpose |
|---|---|
| ReLU activations | Reduce saturation |
| Residual connections | Shorten gradient paths |
| Batch normalization | Stabilize activations |
| Layer normalization | Stabilize transformer training |
| Gradient clipping | Limit exploding gradients |
| Careful initialization | Preserve variance |
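The shrinking product is easy to observe with a chain of sigmoids, since each link contributes a derivative of at most 0.25:

```python id="u5fd9c"
import torch

x = torch.tensor(0.0, requires_grad=True)

# Pass the value through 20 sigmoids in a row.
h = x
for _ in range(20):
    h = torch.sigmoid(h)

h.backward()
print(x.grad)  # tiny: a product of 20 derivatives, each at most 0.25
```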

### Backpropagation in PyTorch

PyTorch automates backpropagation through autograd.

A typical training step is:

```python id="6ylng7"
optimizer.zero_grad()

pred = model(X)
loss = loss_fn(pred, y)

loss.backward()

optimizer.step()
```

Step-by-step:

| Step | Purpose |
|---|---|
| `zero_grad()` | Clear old gradients |
| Forward pass | Compute predictions |
| Loss computation | Produce scalar objective |
| `backward()` | Run backpropagation |
| `step()` | Update parameters |

After `backward()`, each parameter tensor contains gradients in `.grad`.

```python id="rqm1g4"
for name, param in model.named_parameters():
    print(name, param.grad.shape)
```

### Common Backpropagation Errors

#### Forgetting `zero_grad()`

```python id="td7h6z"
loss.backward()
optimizer.step()
```

This accumulates gradients across iterations unintentionally.

Correct version:

```python id="dhrl7x"
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
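The accumulation itself is easy to demonstrate: a second `backward()` without clearing adds to the stored gradient. A minimal sketch (`retain_graph=True` just lets the same graph be reused):

```python id="e2gs6m"
import torch

w = torch.tensor(2.0, requires_grad=True)
loss = w ** 2

loss.backward(retain_graph=True)  # w.grad is now 2w = 4
loss.backward()                   # without zero_grad, grads add: 4 + 4

print(w.grad)  # tensor(8.)
```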

#### Breaking the Graph

```python id="sbhx5j"
loss_value = loss.item()
loss_value.backward()  # AttributeError: 'float' object has no attribute 'backward'
```

`.item()` converts the tensor into a Python number and destroys gradient information.

#### In-Place Modification

```python id="4nn2wc"
x = torch.randn(3, requires_grad=True)

y = x ** 2
x.add_(1.0)

loss = y.sum()
loss.backward()
```

The backward pass may fail because the original value of `x` was overwritten.

#### Detached Tensors

```python id="h7pln8"
h = encoder(x)
h = h.detach()  # gradient flow stops here

out = decoder(h)
loss = loss_fn(out, y)
loss.backward()
```

This prevents gradients from reaching the encoder.

### Computational Complexity

Suppose the forward pass costs \(C\) operations.

The backward pass typically costs about twice that: each matrix multiplication in the forward pass requires two matrix multiplications in the backward pass, one for the input gradient and one for the weight gradient. One training iteration therefore costs roughly

$$
\text{forward} + \text{backward}
\approx
3C.
$$

Some operations are even more expensive. Attention layers, normalization layers, and convolutions often require additional intermediate computations in their backward passes.

Training is therefore substantially more expensive than inference.

### Backpropagation Through Time

Recurrent neural networks reuse parameters across time steps:

$$
h_t = f(h_{t-1}, x_t).
$$

The computational graph unfolds across time:

$$
h_0
\rightarrow
h_1
\rightarrow
h_2
\rightarrow
\cdots
\rightarrow
h_T.
$$

Backpropagation through time applies the chain rule across all time steps.

The resulting gradients contain long products of derivatives:

$$
\prod_{t=1}^{T}
\frac{\partial h_t}{\partial h_{t-1}}.
$$

This makes recurrent networks especially vulnerable to vanishing and exploding gradients.
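The long product can be watched in a scalar recurrence. With inputs omitted for simplicity and \(h_t = w\tanh(h_{t-1})\), each unrolled step contributes a factor of \(w\,\tanh'(h_{t-1})\) to the gradient at \(h_0\):

```python id="t7lq1x"
import torch

T = 50
w = torch.tensor(0.5)
h0 = torch.tensor(1.0, requires_grad=True)

# Unroll h_t = w * tanh(h_{t-1}) across T steps.
h = h0
for _ in range(T):
    h = w * torch.tanh(h)

h.backward()
print(h0.grad)  # vanishingly small: 50 factors, each below 0.5 in magnitude
```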

### Summary

Backpropagation is reverse-mode differentiation applied to neural network computational graphs. The forward pass computes activations and loss. The backward pass applies the chain rule in reverse order to compute gradients efficiently.

Each operation contributes a local derivative rule. PyTorch autograd records forward operations automatically and executes the backward pass when `loss.backward()` is called.

Backpropagation makes large-scale neural network training practical. Almost all modern deep learning systems depend on it.

