
Backpropagation

Backpropagation is the algorithm used to compute gradients in neural networks efficiently. It applies reverse-mode differentiation to a computational graph. The algorithm evaluates the network forward to compute predictions and loss, then traverses the graph backward to compute gradients with respect to parameters.

Without backpropagation, training modern neural networks would be computationally infeasible. A large language model may contain billions of parameters. Backpropagation computes all parameter gradients in roughly the same order of cost as one forward pass.

The Goal of Backpropagation

Suppose a neural network has parameters \theta, input data X, predictions \hat{y} = f_\theta(X), and scalar loss L = \ell(\hat{y}, y).

Training requires the gradient

\nabla_\theta L.

This gradient tells us how changing each parameter changes the loss.

The optimizer then updates the parameters. For gradient descent:

\theta \leftarrow \theta - \eta \nabla_\theta L,

where \eta is the learning rate.
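
As a minimal sketch, assuming a toy scalar loss, a manual gradient-descent step in PyTorch looks like this (in practice an optimizer such as torch.optim.SGD performs the same update):

import torch

eta = 0.1                           # learning rate
theta = torch.randn(5, requires_grad=True)

loss = (theta ** 2).sum()           # stand-in scalar loss
loss.backward()                     # fills theta.grad with the gradient of the loss

with torch.no_grad():               # update without recording into the graph
    theta -= eta * theta.grad       # theta <- theta - eta * grad
theta.grad.zero_()                  # clear the gradient for the next step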

The challenge is efficiency. A neural network may contain millions of intermediate computations. Naively differentiating every parameter separately would repeat enormous amounts of work. Backpropagation avoids this by reusing intermediate derivatives.

A Small Neural Network

Consider a simple neural network with one hidden layer:

h = W_1 x + b_1,
a = \sigma(h),
o = W_2 a + b_2,
L = \ell(o, y).

Here:

Symbol      Meaning
x           Input vector
W_1, b_1    First-layer parameters
h           Hidden pre-activation
\sigma      Activation function
a           Hidden activation
W_2, b_2    Output-layer parameters
o           Output logits or predictions
L           Scalar loss

The forward pass computes these values from top to bottom. The backward pass computes gradients from bottom to top.

Forward Pass

The forward pass evaluates the computational graph.

For example:

x \longrightarrow h \longrightarrow a \longrightarrow o \longrightarrow L.

Each operation produces intermediate tensors needed later during the backward pass.

In PyTorch:

import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(3, 4),
    nn.ReLU(),
    nn.Linear(4, 1),
)

x = torch.randn(2, 3)
target = torch.randn(2, 1)

pred = model(x)
loss = ((pred - target) ** 2).mean()

print(loss)

During this forward computation, PyTorch records the operations needed for gradient computation.

Backward Pass

The backward pass starts from the scalar loss and applies the chain rule backward through the graph.

We begin with

\frac{\partial L}{\partial L} = 1.

Then gradients propagate through each operation.

For the output layer:

o = W_2 a + b_2.

The gradients are

\frac{\partial L}{\partial W_2}, \quad \frac{\partial L}{\partial b_2}, \quad \frac{\partial L}{\partial a}.

The gradient with respect to a becomes the upstream gradient for the previous layer.

Next:

a = \sigma(h).

The chain rule gives

\frac{\partial L}{\partial h} = \frac{\partial L}{\partial a} \odot \sigma'(h),

where \odot denotes elementwise multiplication.

Then:

h = W_1 x + b_1.

This produces gradients for

W_1, \quad b_1, \quad x.

The backward pass continues until all required gradients have been computed.
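
As a check, the sketch below writes out this backward pass by hand for the small network (a sigmoid activation and a squared-error loss are assumed here, since the article leaves \sigma and \ell generic) and compares the results with autograd:

torch.manual_seed(0)
x = torch.randn(3)
y = torch.randn(1)
W1 = torch.randn(4, 3, requires_grad=True)
b1 = torch.randn(4, requires_grad=True)
W2 = torch.randn(1, 4, requires_grad=True)
b2 = torch.randn(1, requires_grad=True)

# Forward pass
h = W1 @ x + b1
a = torch.sigmoid(h)
o = W2 @ a + b2
L = ((o - y) ** 2).sum()
L.backward()

# Manual backward pass, reusing the forward intermediates
dL_do = 2 * (o - y)                    # dL/do
dL_dW2 = dL_do[:, None] * a[None, :]   # outer product, shape (1, 4)
dL_db2 = dL_do
dL_da = W2.T @ dL_do                   # upstream gradient for the hidden layer
dL_dh = dL_da * a * (1 - a)            # sigmoid'(h) = a * (1 - a)
dL_dW1 = dL_dh[:, None] * x[None, :]   # shape (4, 3)
dL_db1 = dL_dh

print(torch.allclose(dL_dW2, W2.grad))  # True
print(torch.allclose(dL_db2, b2.grad))  # True
print(torch.allclose(dL_dW1, W1.grad))  # True
print(torch.allclose(dL_db1, b1.grad))  # True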

Backpropagation as Message Passing

Backpropagation can be viewed as gradient messages flowing backward through the graph.

Each node receives an upstream gradient from later computations. It multiplies that upstream gradient by its local derivative and sends resulting gradients to earlier nodes.

For example, suppose

z = x^2.

The local derivative is

\frac{dz}{dx} = 2x.

If the upstream gradient is

\bar{z},

then the gradient passed backward is

\bar{x} = \bar{z} \cdot 2x.

Every operation follows the same pattern:

  1. Receive upstream gradient.
  2. Apply local derivative rule.
  3. Pass gradients to inputs.

This local structure makes backpropagation scalable.
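
This pattern can be made explicit with a custom autograd function. The sketch below implements the z = x^2 node; grad_output plays the role of the upstream gradient \bar{z}:

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # keep x for the backward pass
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x      # upstream gradient times local derivative

x = torch.tensor(3.0, requires_grad=True)
z = Square.apply(x)
z.backward()

print(x.grad)  # tensor(6.)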

A Manual Backpropagation Example

Consider:

x = 2, \quad w = 3, \quad b = 1, \quad y = wx + b, \quad L = y^2.

First compute the forward pass:

y = 3 \cdot 2 + 1 = 7, \quad L = 7^2 = 49.

Now compute gradients.

Start with

\frac{\partial L}{\partial L} = 1.

Since

L = y^2,

we get

\frac{\partial L}{\partial y} = 2y = 14.

Since

y = wx + b,

the local derivatives are

\frac{\partial y}{\partial w} = x = 2, \quad \frac{\partial y}{\partial x} = w = 3, \quad \frac{\partial y}{\partial b} = 1.

Therefore:

\frac{\partial L}{\partial w} = 14 \cdot 2 = 28, \quad \frac{\partial L}{\partial x} = 14 \cdot 3 = 42, \quad \frac{\partial L}{\partial b} = 14.

In PyTorch:

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

y = w * x + b
L = y ** 2

L.backward()

print(w.grad)  # tensor(28.)
print(x.grad)  # tensor(42.)
print(b.grad)  # tensor(14.)

The computed gradients match the manual derivation.

Matrix Backpropagation

Neural networks operate primarily on matrices and tensors.

Consider a linear layer:

Y = X W^\top + b.

Assume:

X \in \mathbb{R}^{B \times d}, \quad W \in \mathbb{R}^{h \times d}, \quad b \in \mathbb{R}^{h}, \quad Y \in \mathbb{R}^{B \times h}.

Suppose the upstream gradient is

\bar{Y} = \frac{\partial L}{\partial Y}.

Then the backward equations are:

\frac{\partial L}{\partial X} = \bar{Y} W, \quad \frac{\partial L}{\partial W} = \bar{Y}^\top X, \quad \frac{\partial L}{\partial b} = \sum_{i=1}^{B} \bar{Y}_i.

These formulas appear throughout deep learning systems.

In PyTorch:

layer = nn.Linear(3, 4)

X = torch.randn(5, 3)
Y = layer(X)

loss = Y.sum()
loss.backward()

print(layer.weight.grad.shape)  # torch.Size([4, 3])
print(layer.bias.grad.shape)    # torch.Size([4])
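
To connect these shapes back to the formulas, a small sketch (assuming a plain sum loss, so the upstream gradient \bar{Y} is a matrix of ones) can compare the manual expressions with what autograd computes:

X = torch.randn(5, 3, requires_grad=True)
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(4, requires_grad=True)

Y = X @ W.T + b
Y.sum().backward()

Y_bar = torch.ones(5, 4)                        # dL/dY for a plain sum
print(torch.allclose(X.grad, Y_bar @ W))        # True
print(torch.allclose(W.grad, Y_bar.T @ X))      # True
print(torch.allclose(b.grad, Y_bar.sum(dim=0))) # True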

Backpropagation Through Activations

Activation functions apply nonlinear transformations.

For ReLU:

a = \max(0, h).

The derivative is

\frac{da}{dh} = \begin{cases} 1, & h > 0, \\ 0, & h < 0. \end{cases}

During backpropagation, gradients pass through positive activations and are blocked for negative activations.

Example:

x = torch.tensor([-2.0, 1.0, 3.0], requires_grad=True)

y = torch.relu(x)
loss = y.sum()

loss.backward()

print(x.grad)

Output:

tensor([0., 1., 1.])

Negative entries receive zero gradient because ReLU is flat there.

For sigmoid:

\sigma(x) = \frac{1}{1 + e^{-x}},

the derivative is

\sigma'(x) = \sigma(x)(1 - \sigma(x)).

This derivative becomes small when the sigmoid saturates near 0 or 1. Repeated multiplication of small derivatives can lead to vanishing gradients.
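
A quick check of this saturation effect (input values chosen so the outer entries saturate):

x = torch.tensor([-10.0, 0.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()

print(x.grad)  # roughly tensor([4.5e-05, 2.5e-01, 4.5e-05])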

Backpropagation Through Batches

Training usually uses minibatches.

Suppose

X \in \mathbb{R}^{B \times d}.

The forward pass computes predictions for all examples simultaneously.

The backward pass computes gradients accumulated across the batch. For example, if the loss is averaged over the batch:

L = \frac{1}{B} \sum_{i=1}^{B} L_i,

then parameter gradients combine contributions from all examples.

In PyTorch:

model = nn.Linear(10, 1)

X = torch.randn(32, 10)
y = torch.randn(32, 1)

pred = model(X)
loss = ((pred - y) ** 2).mean()

loss.backward()

The resulting parameter gradients already include the whole batch contribution.
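
As a sanity check, a small sketch (reusing the same kind of model and data) shows that the gradient of a mean loss over the batch equals the average of the per-example gradients:

model = nn.Linear(10, 1)
X = torch.randn(4, 10)
y = torch.randn(4, 1)

# Gradient from the whole batch at once
((model(X) - y) ** 2).mean().backward()
batch_grad = model.weight.grad.clone()

# Average of per-example gradients
per_example = []
for i in range(4):
    model.zero_grad()
    ((model(X[i:i+1]) - y[i:i+1]) ** 2).mean().backward()
    per_example.append(model.weight.grad.clone())

print(torch.allclose(batch_grad, torch.stack(per_example).mean(dim=0)))  # True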

Backpropagation and Memory

Backpropagation requires intermediate forward activations.

For example, to differentiate

y = x^2,

the backward pass needs the value of x.

A deep network therefore stores activations from the forward pass so they can be reused later.

This is why training uses more memory than inference.

For large models, activation memory dominates GPU usage. Several techniques reduce this cost:

Technique                 Idea
Mixed precision           Use smaller data types
Gradient checkpointing    Recompute activations during backward
Activation offloading     Move tensors between GPU and CPU
Smaller batch sizes       Reduce stored activations

Gradient Checkpointing

Checkpointing trades computation for memory.

Instead of storing every intermediate activation, the system stores only selected checkpoints. Missing activations are recomputed during the backward pass.

In PyTorch:

from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(8, 8), nn.ReLU())  # any sub-module can be checkpointed
x = torch.randn(2, 8, requires_grad=True)

y = checkpoint(block, x)

This reduces memory usage but increases computation because some forward operations are repeated.

Checkpointing is widely used in transformer training.

Gradient Flow

Backpropagation depends on gradients flowing through many layers.

Suppose a deep network repeatedly multiplies by derivatives:

\frac{\partial L}{\partial x} = \frac{\partial L}{\partial h_n} \prod_{i=1}^{n} \frac{\partial h_i}{\partial h_{i-1}}.

If these derivatives are often smaller than one, gradients shrink exponentially. This is the vanishing gradient problem.

If they are larger than one, gradients grow exponentially. This is the exploding gradient problem.

These problems become severe in deep recurrent networks and poorly initialized models.
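
A minimal illustration of vanishing gradients stacks a saturating nonlinearity many times, with no weights at all, so each backward step multiplies by \sigma'(h) \le 0.25:

x = torch.randn(4, requires_grad=True)
h = x
for _ in range(50):
    h = torch.sigmoid(h)   # local derivative of each step is at most 0.25

h.sum().backward()
print(x.grad.abs().max())  # on the order of 0.25**50: the gradient has vanished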

Modern architectures reduce them using:

Technique                Purpose
ReLU activations         Reduce saturation
Residual connections     Shorten gradient paths
Batch normalization      Stabilize activations
Layer normalization      Stabilize transformer training
Gradient clipping        Limit exploding gradients
Careful initialization   Preserve variance

Backpropagation in PyTorch

PyTorch automates backpropagation through autograd.

A typical training step is:

optimizer.zero_grad()

pred = model(X)
loss = loss_fn(pred, y)

loss.backward()

optimizer.step()

Step-by-step:

Step                Purpose
zero_grad()         Clear old gradients
Forward pass        Compute predictions
Loss computation    Produce scalar objective
backward()          Run backpropagation
step()              Update parameters

After backward(), each parameter tensor contains gradients in .grad.

for name, param in model.named_parameters():
    print(name, param.grad.shape)

Common Backpropagation Errors

Forgetting zero_grad()

loss.backward()
optimizer.step()

This accumulates gradients across iterations unintentionally.

Correct version:

optimizer.zero_grad()
loss.backward()
optimizer.step()

Breaking the Graph

loss_value = loss.item()
loss_value.backward()

.item() converts the tensor into a Python number and destroys gradient information.

In-Place Modification

x = torch.randn(3, requires_grad=True)

y = x * 2
z = y ** 2
y.add_(1.0)   # in-place change to a tensor saved for backward

loss = z.sum()
loss.backward()

The backward pass fails because the value of y that backward needs for the gradient of z has been overwritten. (In-place operations on a leaf tensor that requires gradients, such as x itself, are rejected by autograd even earlier.)

Detached Tensors

h = encoder(x)
h = h.detach()            # cuts the graph: no gradient flows past this point

out = decoder(h)
loss = loss_fn(out, y)
loss.backward()

This prevents gradients from reaching the encoder.

Computational Complexity

Suppose the forward pass costs C operations.

The backward pass usually costs roughly another C operations. Thus one training iteration typically costs:

\text{forward} + \text{backward} \approx 2C.

Some operations have more expensive backward passes. Attention layers, normalization layers, and convolutions often require additional intermediate computations.

Training is therefore substantially more expensive than inference.

Backpropagation Through Time

Recurrent neural networks reuse parameters across time steps:

h_t = f(h_{t-1}, x_t).

The computational graph unfolds across time:

h_0 \rightarrow h_1 \rightarrow h_2 \rightarrow \cdots \rightarrow h_T.

Backpropagation through time applies the chain rule across all time steps.

The resulting gradients contain long products of derivatives:

\prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}}.

This makes recurrent networks especially vulnerable to vanishing and exploding gradients.
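
A sketch of backpropagation through time with a hand-unrolled recurrence (the tanh cell, dimensions, and sequence length here are illustrative):

W = torch.nn.Parameter(0.1 * torch.randn(4, 4))
xs = torch.randn(20, 4)          # a sequence of 20 input vectors
h = torch.zeros(4)

for t in range(20):              # unroll the recurrence across time steps
    h = torch.tanh(h @ W + xs[t])

h.sum().backward()               # chain rule applied through all 20 steps
print(W.grad.shape)              # torch.Size([4, 4]); sums contributions over time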

Summary

Backpropagation is reverse-mode differentiation applied to neural network computational graphs. The forward pass computes activations and loss. The backward pass applies the chain rule in reverse order to compute gradients efficiently.

Each operation contributes a local derivative rule. PyTorch autograd records forward operations automatically and executes the backward pass when loss.backward() is called.

Backpropagation makes large-scale neural network training practical. Almost all modern deep learning systems depend on it.