Backpropagation is the algorithm used to compute gradients in neural networks efficiently. It applies reverse-mode differentiation to a computational graph. The algorithm evaluates the network forward to compute predictions and loss, then traverses the graph backward to compute gradients with respect to parameters.
Without backpropagation, training modern neural networks would be computationally infeasible. A large language model may contain billions of parameters. Backpropagation computes all parameter gradients in roughly the same order of cost as one forward pass.
The Goal of Backpropagation
Suppose a neural network has parameters $\theta$, input data $x$, predictions $\hat{y}$, and scalar loss $L$.

Training requires the gradient

$$\nabla_\theta L = \left( \frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \dots \right).$$

This gradient tells us how changing each parameter changes the loss.

The optimizer then updates the parameters. For gradient descent:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L,$$

where $\eta$ is the learning rate.
The challenge is efficiency. A neural network may contain millions of intermediate computations. Naively differentiating every parameter separately would repeat enormous amounts of work. Backpropagation avoids this by reusing intermediate derivatives.
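The contrast can be made concrete with a quick check (a sketch; the tiny `nn.Linear` model and the tolerance are illustrative choices): a central finite difference needs two extra forward passes per parameter, while a single backward pass yields every gradient at once.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(5, 1)
x = torch.randn(3, 5)
y = torch.randn(3, 1)

def f():
    return ((model(x) - y) ** 2).mean()

# Backpropagation: one forward and one backward pass gives ALL gradients
loss = f()
loss.backward()
autograd_g = model.weight.grad[0, 0].item()

# Finite differences: two extra forward passes PER parameter
eps = 1e-3
with torch.no_grad():
    model.weight[0, 0] += eps
    up = f().item()
    model.weight[0, 0] -= 2 * eps
    down = f().item()
    model.weight[0, 0] += eps  # restore the original value
numeric_g = (up - down) / (2 * eps)

print(abs(autograd_g - numeric_g) < 1e-2)  # True
```

Both methods estimate the same derivative, but the finite-difference approach would need millions of forward passes for a model with millions of parameters.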
A Small Neural Network
Consider a simple neural network with one hidden layer:

$$z_1 = W_1 x + b_1, \qquad h = \sigma(z_1), \qquad \hat{y} = W_2 h + b_2, \qquad L = \ell(\hat{y}, y).$$

Here:

| Symbol | Meaning |
|---|---|
| $x$ | Input vector |
| $W_1, b_1$ | First-layer parameters |
| $z_1$ | Hidden pre-activation |
| $\sigma$ | Activation function |
| $h$ | Hidden activation |
| $W_2, b_2$ | Output-layer parameters |
| $\hat{y}$ | Output logits or predictions |
| $L$ | Scalar loss |
The forward pass computes these values from top to bottom. The backward pass computes gradients from bottom to top.
Forward Pass
The forward pass evaluates the computational graph.
For example:

$$z_1 = W_1 x + b_1, \qquad h = \mathrm{ReLU}(z_1), \qquad \hat{y} = W_2 h + b_2, \qquad L = \|\hat{y} - y\|^2.$$
Each operation produces intermediate tensors needed later during the backward pass.
In PyTorch:

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(3, 4),
    nn.ReLU(),
    nn.Linear(4, 1),
)

x = torch.randn(2, 3)
target = torch.randn(2, 1)

pred = model(x)
loss = ((pred - target) ** 2).mean()
print(loss)
```

During this forward computation, PyTorch records the operations needed for gradient computation.
Backward Pass
The backward pass starts from the scalar loss and applies the chain rule backward through the graph.
We begin with

$$\frac{\partial L}{\partial L} = 1.$$

Then gradients propagate through each operation.

For the output layer:

$$\hat{y} = W_2 h + b_2.$$

The gradients are

$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}} \, h^\top, \qquad \frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial \hat{y}}, \qquad \frac{\partial L}{\partial h} = W_2^\top \frac{\partial L}{\partial \hat{y}}.$$

The gradient with respect to $h$ becomes the upstream gradient for the previous layer.

Next:

$$h = \sigma(z_1).$$

The chain rule gives

$$\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial h} \odot \sigma'(z_1),$$

where $\odot$ denotes elementwise multiplication.

Then:

$$z_1 = W_1 x + b_1.$$

This produces gradients for $W_1$, $b_1$, and $x$:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial z_1} \, x^\top, \qquad \frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial z_1}, \qquad \frac{\partial L}{\partial x} = W_1^\top \frac{\partial L}{\partial z_1}.$$
The backward pass continues until all required gradients have been computed.
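These equations can be replayed by hand and checked against autograd (a minimal sketch, assuming the two-layer ReLU network above with a squared-error loss on a single example):

```python
import torch

torch.manual_seed(0)

# The network above: z1 = W1 x + b1, h = ReLU(z1), yhat = W2 h + b2
W1 = torch.randn(4, 3, requires_grad=True)
b1 = torch.randn(4, requires_grad=True)
W2 = torch.randn(1, 4, requires_grad=True)
b2 = torch.randn(1, requires_grad=True)
x = torch.randn(3)
y = torch.randn(1)

z1 = W1 @ x + b1
h = torch.relu(z1)
yhat = W2 @ h + b2
L = ((yhat - y) ** 2).sum()
L.backward()

# Replay the backward equations by hand
with torch.no_grad():
    g_yhat = 2 * (yhat - y)              # dL/dyhat for squared error
    g_W2 = g_yhat[:, None] * h[None, :]  # dL/dW2 = (dL/dyhat) h^T
    g_b2 = g_yhat.clone()
    g_h = W2.t() @ g_yhat                # upstream gradient for the hidden layer
    g_z1 = g_h * (z1 > 0).float()        # ReLU gate
    g_W1 = g_z1[:, None] * x[None, :]
    g_b1 = g_z1.clone()

print(torch.allclose(g_W2, W2.grad), torch.allclose(g_W1, W1.grad))  # True True
```

Every manual gradient matches what `L.backward()` stored in the `.grad` fields.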
Backpropagation as Message Passing
Backpropagation can be viewed as gradient messages flowing backward through the graph.
Each node receives an upstream gradient from later computations. It multiplies that upstream gradient by its local derivative and sends resulting gradients to earlier nodes.
For example, suppose

$$y = x^2.$$

The local derivative is

$$\frac{\partial y}{\partial x} = 2x.$$

If the upstream gradient is

$$g = \frac{\partial L}{\partial y},$$

then the gradient passed backward is

$$\frac{\partial L}{\partial x} = g \cdot 2x.$$
Every operation follows the same pattern:
- Receive upstream gradient.
- Apply local derivative rule.
- Pass gradients to inputs.
This local structure makes backpropagation scalable.
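A toy scalar autograd engine makes the pattern concrete (the `Node` class and helper functions below are illustrative, not part of any library):

```python
class Node:
    """Scalar graph node: a value, an accumulated gradient, and a rule
    that forwards gradient messages to its inputs."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self.send_back = lambda upstream: None  # leaf: nothing to send

def mul(a, b):
    out = Node(a.data * b.data)
    def send_back(upstream):
        a.grad += upstream * b.data  # local derivative d(ab)/da = b
        b.grad += upstream * a.data  # local derivative d(ab)/db = a
    out.send_back = send_back
    return out

def add(a, b):
    out = Node(a.data + b.data)
    def send_back(upstream):
        a.grad += upstream  # d(a+b)/da = 1
        b.grad += upstream  # d(a+b)/db = 1
    out.send_back = send_back
    return out

# Build y = w*x + b and L = y*y, then pass messages in reverse order
x, w, b = Node(2.0), Node(3.0), Node(1.0)
wx = mul(w, x)
y = add(wx, b)
L = mul(y, y)

L.send_back(1.0)       # seed dL/dL = 1; deposits dL/dy = 2 * y.data = 14
y.send_back(y.grad)    # forwards 14 to wx and b
wx.send_back(wx.grad)  # forwards 14 to w and x

print(w.grad, x.grad, b.grad)  # 28.0 42.0 14.0
```

Each node only knows its local derivative rule; the global gradient emerges from passing messages in reverse topological order.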
A Manual Backpropagation Example
Consider:

$$y = w x + b, \qquad L = y^2,$$

with $x = 2$, $w = 3$, $b = 1$.

First compute the forward pass:

$$y = 3 \cdot 2 + 1 = 7, \qquad L = 7^2 = 49.$$

Now compute gradients.

Start with

$$\frac{\partial L}{\partial L} = 1.$$

Since

$$L = y^2,$$

we get

$$\frac{\partial L}{\partial y} = 2y = 14.$$

Since

$$y = w x + b,$$

the local derivatives are

$$\frac{\partial y}{\partial w} = x, \qquad \frac{\partial y}{\partial x} = w, \qquad \frac{\partial y}{\partial b} = 1.$$

Therefore:

$$\frac{\partial L}{\partial w} = 14 \cdot 2 = 28, \qquad \frac{\partial L}{\partial x} = 14 \cdot 3 = 42, \qquad \frac{\partial L}{\partial b} = 14.$$
In PyTorch:

```python
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

y = w * x + b
L = y ** 2
L.backward()

print(w.grad)  # tensor(28.)
print(x.grad)  # tensor(42.)
print(b.grad)  # tensor(14.)
```

The computed gradients match the manual derivation.
Matrix Backpropagation
Neural networks operate primarily on matrices and tensors.
Consider a linear layer (using the `nn.Linear` convention, where $W$ stores one row per output feature):

$$Y = X W^\top + b.$$

Assume:

$$X \in \mathbb{R}^{N \times d_{\text{in}}}, \qquad W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}, \qquad Y \in \mathbb{R}^{N \times d_{\text{out}}}.$$

Suppose the upstream gradient is

$$G = \frac{\partial L}{\partial Y} \in \mathbb{R}^{N \times d_{\text{out}}}.$$

Then the backward equations are:

$$\frac{\partial L}{\partial W} = G^\top X, \qquad \frac{\partial L}{\partial b} = \sum_{n=1}^{N} G_{n,:}, \qquad \frac{\partial L}{\partial X} = G W.$$
These formulas appear throughout deep learning systems.
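As a sanity check, the three backward equations can be verified numerically against autograd (a sketch; the quadratic loss is chosen only because its upstream gradient $G = 2Y$ is easy to write down):

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(3, 4)
X = torch.randn(5, 3, requires_grad=True)

Y = layer(X)
loss = (Y ** 2).sum()
loss.backward()

G = 2 * Y.detach()  # upstream gradient dL/dY for this particular loss

# dL/dW = G^T X, dL/db = column sums of G, dL/dX = G W
assert torch.allclose(layer.weight.grad, G.t() @ X.detach(), atol=1e-5)
assert torch.allclose(layer.bias.grad, G.sum(dim=0), atol=1e-5)
assert torch.allclose(X.grad, G @ layer.weight.detach(), atol=1e-5)
print("backward formulas verified")
```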
In PyTorch:

```python
layer = nn.Linear(3, 4)
X = torch.randn(5, 3)

Y = layer(X)
loss = Y.sum()
loss.backward()

print(layer.weight.grad.shape)  # torch.Size([4, 3])
print(layer.bias.grad.shape)    # torch.Size([4])
```

Backpropagation Through Activations
Activation functions apply nonlinear transformations.
For ReLU:

$$\mathrm{ReLU}(x) = \max(0, x).$$

The derivative is

$$\mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0. \end{cases}$$
During backpropagation, gradients pass through positive activations and are blocked for negative activations.
Example:

```python
x = torch.tensor([-2.0, 1.0, 3.0], requires_grad=True)
y = torch.relu(x)
loss = y.sum()
loss.backward()
print(x.grad)
```

Output:

```
tensor([0., 1., 1.])
```

Negative entries receive zero gradient because ReLU is flat there.
For sigmoid:

$$\sigma(x) = \frac{1}{1 + e^{-x}},$$

the derivative is

$$\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr).$$
This derivative becomes small when the sigmoid saturates near 0 or 1. Repeated multiplication of small derivatives can lead to vanishing gradients.
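A quick numerical illustration (a hypothetical chain of ten composed sigmoids, not a realistic network, just the saturation effect in isolation):

```python
import torch

x = torch.tensor(0.0, requires_grad=True)
h = x
for _ in range(10):
    h = torch.sigmoid(h)  # compose ten sigmoids
h.backward()

# The gradient is a product of ten local derivatives, each at most 0.25,
# so it is on the order of 1e-7
print(x.grad)
```

Ten layers are already enough to shrink the gradient by more than six orders of magnitude.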
Backpropagation Through Batches
Training usually uses minibatches.
Suppose

$$X \in \mathbb{R}^{N \times d}$$

holds a minibatch of $N$ examples.

The forward pass computes predictions for all examples simultaneously.

The backward pass computes gradients accumulated across the batch. For example, if the loss is averaged over the batch:

$$L = \frac{1}{N} \sum_{i=1}^{N} L_i,$$

then parameter gradients combine contributions from all examples:

$$\frac{\partial L}{\partial \theta} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial L_i}{\partial \theta}.$$
In PyTorch:

```python
model = nn.Linear(10, 1)
X = torch.randn(32, 10)
y = torch.randn(32, 1)

pred = model(X)
loss = ((pred - y) ** 2).mean()
loss.backward()
```

The resulting parameter gradients already include the whole batch contribution.
Backpropagation and Memory
Backpropagation requires intermediate forward activations.
For example, to differentiate

$$h = \sigma(z),$$

the backward pass needs the value of $z$.
A deep network therefore stores activations from the forward pass so they can be reused later.
This is why training uses more memory than inference.
For large models, activation memory dominates GPU usage. Several techniques reduce this cost:
| Technique | Idea |
|---|---|
| Mixed precision | Use smaller data types |
| Gradient checkpointing | Recompute activations during backward |
| Activation offloading | Move tensors between GPU and CPU |
| Smaller batch sizes | Reduce stored activations |
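Mixed precision, for example, can be enabled with `torch.autocast` (a sketch running bfloat16 on CPU; on GPU one would typically use `device_type="cuda"` together with a gradient scaler):

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
x = torch.randn(4, 10)
y = torch.randn(4, 1)

# Run the forward pass in bfloat16; parameters and gradients stay float32
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    pred = model(x)
    loss = ((pred - y) ** 2).mean()

loss.backward()
print(pred.dtype)               # torch.bfloat16
print(model.weight.grad.dtype)  # torch.float32
```

Activations stored for the backward pass occupy half the memory of their float32 equivalents.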
Gradient Checkpointing
Checkpointing trades computation for memory.
Instead of storing every intermediate activation, the system stores only selected checkpoints. Missing activations are recomputed during the backward pass.
In PyTorch:

```python
from torch.utils.checkpoint import checkpoint

def block(x):
    return model_block(x)  # model_block: any sub-network (an nn.Module)

# use_reentrant=False is the recommended mode in recent PyTorch versions
y = checkpoint(block, x, use_reentrant=False)
```

This reduces memory usage but increases computation because some forward operations are repeated.
Checkpointing is widely used in transformer training.
Gradient Flow
Backpropagation depends on gradients flowing through many layers.
Suppose a deep network repeatedly multiplies by derivatives:

$$\frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_T} \prod_{t=1}^{T} \frac{\partial x_t}{\partial x_{t-1}}.$$
If these derivatives are often smaller than one, gradients shrink exponentially. This is the vanishing gradient problem.
If they are larger than one, gradients grow exponentially. This is the exploding gradient problem.
These problems become severe in deep recurrent networks and poorly initialized models.
Modern architectures reduce them using:
| Technique | Purpose |
|---|---|
| ReLU activations | Reduce saturation |
| Residual connections | Shorten gradient paths |
| Batch normalization | Stabilize activations |
| Layer normalization | Stabilize transformer training |
| Gradient clipping | Limit exploding gradients |
| Careful initialization | Preserve variance |
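Gradient clipping, for instance, is a single call between `backward()` and the optimizer step (a sketch with an illustrative model and `max_norm`):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 10)
y = torch.randn(8, 1)

opt.zero_grad()
loss = ((model(x) - y) ** 2).mean()
loss.backward()

# Rescale all gradients so their global L2 norm is at most max_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()

total = torch.norm(torch.stack([p.grad.detach().norm() for p in model.parameters()]))
print(total <= 1.0)  # tensor(True)
```

Clipping does not fix the underlying instability, but it prevents a single exploding batch from destroying the parameters.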
Backpropagation in PyTorch
PyTorch automates backpropagation through autograd.
A typical training step is:

```python
optimizer.zero_grad()
pred = model(X)
loss = loss_fn(pred, y)
loss.backward()
optimizer.step()
```

Step-by-step:
| Step | Purpose |
|---|---|
| `zero_grad()` | Clear old gradients |
| Forward pass | Compute predictions |
| Loss computation | Produce scalar objective |
| `backward()` | Run backpropagation |
| `step()` | Update parameters |
After `backward()`, each parameter tensor contains gradients in `.grad`:

```python
for name, param in model.named_parameters():
    print(name, param.grad.shape)
```

Common Backpropagation Errors
Forgetting `zero_grad()`

```python
loss.backward()
optimizer.step()
```

This accumulates gradients across iterations unintentionally.

Correct version:

```python
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Breaking the Graph
```python
loss_value = loss.item()
loss_value.backward()  # AttributeError: a Python float has no backward()
```

`.item()` converts the tensor into a Python number and destroys gradient information.
In-Place Modification
```python
x = torch.randn(3, requires_grad=True)
y = x ** 2
x.add_(1.0)  # in-place update

loss = y.sum()
loss.backward()
```

PyTorch raises an error here: modifying a leaf tensor that requires grad in place is forbidden, and overwriting an intermediate value that the backward pass still needs fails similarly during `backward()`.
Detached Tensors
```python
h = encoder(x)
h = h.detach()
out = decoder(h)
loss.backward()
```

This prevents gradients from reaching the encoder.
Computational Complexity
Suppose the forward pass costs $F$ operations.

The backward pass usually costs roughly another $2F$ operations, since each forward operation contributes gradient computations for both its inputs and its parameters. Thus one training iteration typically costs:

$$\approx 3F \ \text{operations}.$$
Some operations have more expensive backward passes. Attention layers, normalization layers, and convolutions often require additional intermediate computations.
Training is therefore substantially more expensive than inference.
Backpropagation Through Time
Recurrent neural networks reuse parameters across time steps:

$$h_t = f(h_{t-1}, x_t; \theta).$$

The computational graph unfolds across time:

$$h_0 \to h_1 \to h_2 \to \cdots \to h_T \to L.$$

Backpropagation through time applies the chain rule across all time steps.

The resulting gradients contain long products of derivatives:

$$\frac{\partial L}{\partial h_0} = \frac{\partial L}{\partial h_T} \prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}}.$$
This makes recurrent networks especially vulnerable to vanishing and exploding gradients.
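The effect shows up even in a scalar toy recurrence (illustrative; `w = 0.5` keeps every local derivative below one):

```python
import torch

# Unroll h_t = tanh(w * h_{t-1}) for 50 "time steps" with a shared parameter
w = torch.tensor(0.5)
h0 = torch.tensor(1.0, requires_grad=True)
h = h0
for _ in range(50):
    h = torch.tanh(w * h)
h.backward()

# dL/dh0 is a product of 50 local derivatives w * (1 - tanh^2), each < 0.5,
# so the gradient has essentially vanished
print(h0.grad)
```

With `w` larger than one the same chain explodes instead, which is why recurrent training relies on clipping, gating, or careful initialization.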
Summary
Backpropagation is reverse-mode differentiation applied to neural network computational graphs. The forward pass computes activations and loss. The backward pass applies the chain rule in reverse order to compute gradients efficiently.
Each operation contributes a local derivative rule. PyTorch autograd records forward operations automatically and executes the backward pass when loss.backward() is called.
Backpropagation makes large-scale neural network training practical. Almost all modern deep learning systems depend on it.