
Backpropagation

Backpropagation is the algorithm used to compute gradients in neural networks efficiently. It applies reverse-mode differentiation to a computational graph. The algorithm evaluates the network forward to compute predictions and loss, then traverses the graph backward to compute gradients with respect to parameters.

Without backpropagation, training modern neural networks would be computationally infeasible. A large language model may contain billions of parameters. Backpropagation computes all parameter gradients in roughly the same order of cost as one forward pass.

The Goal of Backpropagation

Suppose a neural network has parameters \theta, input data X, predictions \hat{y} = f_\theta(X), and scalar loss L = \ell(\hat{y}, y).

Training requires the gradient

\nabla_\theta L.

This gradient tells us how changing each parameter changes the loss.

The optimizer then updates the parameters. For gradient descent:

\theta \leftarrow \theta - \eta \nabla_\theta L,

where \eta is the learning rate.
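
As a minimal sketch, assuming a toy scalar loss, a manual gradient-descent step in PyTorch looks like this (in practice an optimizer such as torch.optim.SGD performs the same update):

import torch

eta = 0.1                           # learning rate
theta = torch.randn(5, requires_grad=True)

loss = (theta ** 2).sum()           # stand-in scalar loss
loss.backward()                     # fills theta.grad with the gradient of the loss

with torch.no_grad():               # update without recording into the graph
    theta -= eta * theta.grad       # theta <- theta - eta * grad
theta.grad.zero_()                  # clear the gradient for the next step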

The challenge is efficiency. A neural network may contain millions of intermediate computations. Naively differentiating every parameter separately would repeat enormous amounts of work. Backpropagation avoids this by reusing intermediate derivatives.

A Small Neural Network

Consider a simple neural network with one hidden layer:

h = W_1 x + b_1,
a = \sigma(h),
o = W_2 a + b_2,
L = \ell(o, y).

Here:

Symbol      Meaning
x           Input vector
W_1, b_1    First-layer parameters
h           Hidden pre-activation
\sigma      Activation function
a           Hidden activation
W_2, b_2    Output-layer parameters
o           Output logits or predictions
L           Scalar loss

The forward pass computes these values from top to bottom. The backward pass computes gradients from bottom to top.

Forward Pass

The forward pass evaluates the computational graph.

For example:

x \longrightarrow h \longrightarrow a \longrightarrow o \longrightarrow L.

Each operation produces intermediate tensors needed later during the backward pass.

In PyTorch:

import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(3, 4),
    nn.ReLU(),
    nn.Linear(4, 1),
)

x = torch.randn(2, 3)
target = torch.randn(2, 1)

pred = model(x)
loss = ((pred - target) ** 2).mean()

print(loss)

During this forward computation, PyTorch records the operations needed for gradient computation.

Backward Pass

The backward pass starts from the scalar loss and applies the chain rule backward through the graph.

We begin with

\frac{\partial L}{\partial L} = 1.

Then gradients propagate through each operation.

For the output layer:

o = W_2 a + b_2.

The gradients are

\frac{\partial L}{\partial W_2}, \quad \frac{\partial L}{\partial b_2}, \quad \frac{\partial L}{\partial a}.

The gradient with respect to a becomes the upstream gradient for the previous layer.

Next:

a = \sigma(h).

The chain rule gives

\frac{\partial L}{\partial h} = \frac{\partial L}{\partial a} \odot \sigma'(h),

where \odot denotes elementwise multiplication.

Then:

h = W_1 x + b_1.

This produces gradients for

W_1, \quad b_1, \quad x.

The backward pass continues until all required gradients have been computed.
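
As a check, the sketch below writes out this backward pass by hand for the small network (a sigmoid activation and a squared-error loss are assumed here, since the article leaves \sigma and \ell generic) and compares the results with autograd:

torch.manual_seed(0)
x = torch.randn(3)
y = torch.randn(1)
W1 = torch.randn(4, 3, requires_grad=True)
b1 = torch.randn(4, requires_grad=True)
W2 = torch.randn(1, 4, requires_grad=True)
b2 = torch.randn(1, requires_grad=True)

# Forward pass
h = W1 @ x + b1
a = torch.sigmoid(h)
o = W2 @ a + b2
L = ((o - y) ** 2).sum()
L.backward()

# Manual backward pass, reusing the forward intermediates
dL_do = 2 * (o - y)                    # dL/do
dL_dW2 = dL_do[:, None] * a[None, :]   # outer product, shape (1, 4)
dL_db2 = dL_do
dL_da = W2.T @ dL_do                   # upstream gradient for the hidden layer
dL_dh = dL_da * a * (1 - a)            # sigmoid'(h) = a * (1 - a)
dL_dW1 = dL_dh[:, None] * x[None, :]   # shape (4, 3)
dL_db1 = dL_dh

print(torch.allclose(dL_dW2, W2.grad))  # True
print(torch.allclose(dL_db2, b2.grad))  # True
print(torch.allclose(dL_dW1, W1.grad))  # True
print(torch.allclose(dL_db1, b1.grad))  # True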

Backpropagation as Message Passing

Backpropagation can be viewed as gradient messages flowing backward through the graph.

Each node receives an upstream gradient from later computations. It multiplies that upstream gradient by its local derivative and sends resulting gradients to earlier nodes.

For example, suppose

z = x^2.

The local derivative is

\frac{dz}{dx} = 2x.

If the upstream gradient is

\bar{z},

then the gradient passed backward is

\bar{x} = \bar{z} \cdot 2x.

Every operation follows the same pattern:

  1. Receive upstream gradient.
  2. Apply local derivative rule.
  3. Pass gradients to inputs.

This local structure makes backpropagation scalable.
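
This pattern can be made explicit with a custom autograd function. The sketch below implements the z = x^2 node; grad_output plays the role of the upstream gradient \bar{z}:

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # keep x for the backward pass
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x      # upstream gradient times local derivative

x = torch.tensor(3.0, requires_grad=True)
z = Square.apply(x)
z.backward()

print(x.grad)  # tensor(6.)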

A Manual Backpropagation Example

Consider:

x = 2, \quad w = 3, \quad b = 1, \quad y = wx + b, \quad L = y^2.

First compute the forward pass:

y = 3 \cdot 2 + 1 = 7, \quad L = 7^2 = 49.

Now compute gradients.

Start with

\frac{\partial L}{\partial L} = 1.

Since

L = y^2,

we get

\frac{\partial L}{\partial y} = 2y = 14.

Since

y = wx + b,

the local derivatives are

\frac{\partial y}{\partial w} = x = 2, \quad \frac{\partial y}{\partial x} = w = 3, \quad \frac{\partial y}{\partial b} = 1.

Therefore:

\frac{\partial L}{\partial w} = 14 \cdot 2 = 28, \quad \frac{\partial L}{\partial x} = 14 \cdot 3 = 42, \quad \frac{\partial L}{\partial b} = 14.

In PyTorch:

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

y = w * x + b
L = y ** 2

L.backward()

print(w.grad)  # tensor(28.)
print(x.grad)  # tensor(42.)
print(b.grad)  # tensor(14.)

The computed gradients match the manual derivation.

Matrix Backpropagation

Neural networks operate primarily on matrices and tensors.

Consider a linear layer:

Y = X W^\top + b.

Assume:

X \in \mathbb{R}^{B \times d}, \quad W \in \mathbb{R}^{h \times d}, \quad b \in \mathbb{R}^{h}, \quad Y \in \mathbb{R}^{B \times h}.

Suppose the upstream gradient is

\bar{Y} = \frac{\partial L}{\partial Y}.

Then the backward equations are:

\frac{\partial L}{\partial X} = \bar{Y} W, \quad \frac{\partial L}{\partial W} = \bar{Y}^\top X, \quad \frac{\partial L}{\partial b} = \sum_{i=1}^{B} \bar{Y}_i.

These formulas appear throughout deep learning systems.

In PyTorch:

layer = nn.Linear(3, 4)

X = torch.randn(5, 3)
Y = layer(X)

loss = Y.sum()
loss.backward()

print(layer.weight.grad.shape)  # torch.Size([4, 3])
print(layer.bias.grad.shape)    # torch.Size([4])
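
To connect these shapes back to the formulas, a small sketch (assuming a plain sum loss, so the upstream gradient \bar{Y} is a matrix of ones) can compare the manual expressions with what autograd computes:

X = torch.randn(5, 3, requires_grad=True)
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(4, requires_grad=True)

Y = X @ W.T + b
Y.sum().backward()

Y_bar = torch.ones(5, 4)                        # dL/dY for a plain sum
print(torch.allclose(X.grad, Y_bar @ W))        # True
print(torch.allclose(W.grad, Y_bar.T @ X))      # True
print(torch.allclose(b.grad, Y_bar.sum(dim=0))) # True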

Backpropagation Through Activations

Activation functions apply nonlinear transformations.

For ReLU:

a = \max(0, h).

The derivative is

\frac{da}{dh} = \begin{cases} 1, & h > 0, \\ 0, & h < 0. \end{cases}

During backpropagation, gradients pass through positive activations and are blocked for negative activations.

Example:

x = torch.tensor([-2.0, 1.0, 3.0], requires_grad=True)

y = torch.relu(x)
loss = y.sum()

loss.backward()

print(x.grad)

Output:

tensor([0., 1., 1.])

Negative entries receive zero gradient because ReLU is flat there.

For sigmoid:

\sigma(x) = \frac{1}{1 + e^{-x}},

the derivative is

\sigma'(x) = \sigma(x)(1 - \sigma(x)).

This derivative becomes small when the sigmoid saturates near 0 or 1. Repeated multiplication of small derivatives can lead to vanishing gradients.
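
A quick check of this saturation effect (input values chosen so the outer entries saturate):

x = torch.tensor([-10.0, 0.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()

print(x.grad)  # roughly tensor([4.5e-05, 2.5e-01, 4.5e-05])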

Backpropagation Through Batches

Training usually uses minibatches.

Suppose

X \in \mathbb{R}^{B \times d}.

The forward pass computes predictions for all examples simultaneously.

The backward pass computes gradients accumulated across the batch. For example, if the loss is averaged over the batch:

L = \frac{1}{B} \sum_{i=1}^{B} L_i,

then parameter gradients combine contributions from all examples.

In PyTorch:

model = nn.Linear(10, 1)

X = torch.randn(32, 10)
y = torch.randn(32, 1)

pred = model(X)
loss = ((pred - y) ** 2).mean()

loss.backward()

The resulting parameter gradients already include the whole batch contribution.
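
As a sanity check, a small sketch (reusing the same kind of model and data) shows that the gradient of a mean loss over the batch equals the average of the per-example gradients:

model = nn.Linear(10, 1)
X = torch.randn(4, 10)
y = torch.randn(4, 1)

# Gradient from the whole batch at once
((model(X) - y) ** 2).mean().backward()
batch_grad = model.weight.grad.clone()

# Average of per-example gradients
per_example = []
for i in range(4):
    model.zero_grad()
    ((model(X[i:i+1]) - y[i:i+1]) ** 2).mean().backward()
    per_example.append(model.weight.grad.clone())

print(torch.allclose(batch_grad, torch.stack(per_example).mean(dim=0)))  # True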

Backpropagation and Memory

Backpropagation requires intermediate forward activations.

For example, to differentiate

y = x^2,

the backward pass needs the value of x.

A deep network therefore stores activations from the forward pass so they can be reused later.

This is why training uses more memory than inference.

For large models, activation memory dominates GPU usage. Several techniques reduce this cost:

Technique                 Idea
Mixed precision           Use smaller data types
Gradient checkpointing    Recompute activations during backward
Activation offloading     Move tensors between GPU and CPU
Smaller batch sizes       Reduce stored activations

Gradient Checkpointing

Checkpointing trades computation for memory.

Instead of storing every intermediate activation, the system stores only selected checkpoints. Missing activations are recomputed during the backward pass.

In PyTorch:

from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(8, 8), nn.ReLU())  # any sub-module can be checkpointed
x = torch.randn(2, 8, requires_grad=True)

y = checkpoint(block, x)

This reduces memory usage but increases computation because some forward operations are repeated.

Checkpointing is widely used in transformer training.

Gradient Flow

Backpropagation depends on gradients flowing through many layers.

Suppose a deep network repeatedly multiplies by derivatives:

\frac{\partial L}{\partial x} = \frac{\partial L}{\partial h_n} \prod_{i=1}^{n} \frac{\partial h_i}{\partial h_{i-1}}.

If these derivatives are often smaller than one, gradients shrink exponentially. This is the vanishing gradient problem.

If they are larger than one, gradients grow exponentially. This is the exploding gradient problem.

These problems become severe in deep recurrent networks and poorly initialized models.
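
A minimal illustration of vanishing gradients stacks a saturating nonlinearity many times, with no weights at all, so each backward step multiplies by \sigma'(h) \le 0.25:

x = torch.randn(4, requires_grad=True)
h = x
for _ in range(50):
    h = torch.sigmoid(h)   # local derivative of each step is at most 0.25

h.sum().backward()
print(x.grad.abs().max())  # on the order of 0.25**50: the gradient has vanished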

Modern architectures reduce them using:

Technique                Purpose
ReLU activations         Reduce saturation
Residual connections     Shorten gradient paths
Batch normalization      Stabilize activations
Layer normalization      Stabilize transformer training
Gradient clipping        Limit exploding gradients
Careful initialization   Preserve variance

Backpropagation in PyTorch

PyTorch automates backpropagation through autograd.

A typical training step is:

optimizer.zero_grad()

pred = model(X)
loss = loss_fn(pred, y)

loss.backward()

optimizer.step()

Step-by-step:

Step                Purpose
zero_grad()         Clear old gradients
Forward pass        Compute predictions
Loss computation    Produce scalar objective
backward()          Run backpropagation
step()              Update parameters

After backward(), each parameter tensor contains gradients in .grad.

for name, param in model.named_parameters():
    print(name, param.grad.shape)

Common Backpropagation Errors

Forgetting zero_grad()

loss.backward()
optimizer.step()

This accumulates gradients across iterations unintentionally.

Correct version:

optimizer.zero_grad()
loss.backward()
optimizer.step()

Breaking the Graph

loss_value = loss.item()
loss_value.backward()

.item() converts the tensor into a Python number and destroys gradient information.

In-Place Modification

x = torch.randn(3, requires_grad=True)

y = x * 2
z = y ** 2
y.add_(1.0)   # in-place change to a tensor saved for backward

loss = z.sum()
loss.backward()

The backward pass fails because the value of y that backward needs for the gradient of z has been overwritten. (In-place operations on a leaf tensor that requires gradients, such as x itself, are rejected by autograd even earlier.)

Detached Tensors

h = encoder(x)
h = h.detach()            # cuts the graph: no gradient flows past this point

out = decoder(h)
loss = loss_fn(out, y)
loss.backward()

This prevents gradients from reaching the encoder.

Computational Complexity

Suppose the forward pass costs C operations.

The backward pass usually costs roughly another C operations. Thus one training iteration typically costs:

\text{forward} + \text{backward} \approx 2C.

Some operations have more expensive backward passes. Attention layers, normalization layers, and convolutions often require additional intermediate computations.

Training is therefore substantially more expensive than inference.

Backpropagation Through Time

Recurrent neural networks reuse parameters across time steps:

h_t = f(h_{t-1}, x_t).

The computational graph unfolds across time:

h_0 \rightarrow h_1 \rightarrow h_2 \rightarrow \cdots \rightarrow h_T.

Backpropagation through time applies the chain rule across all time steps.

The resulting gradients contain long products of derivatives:

\prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}}.

This makes recurrent networks especially vulnerable to vanishing and exploding gradients.
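
A sketch of backpropagation through time with a hand-unrolled recurrence (the tanh cell, dimensions, and sequence length here are illustrative):

W = torch.nn.Parameter(0.1 * torch.randn(4, 4))
xs = torch.randn(20, 4)          # a sequence of 20 input vectors
h = torch.zeros(4)

for t in range(20):              # unroll the recurrence across time steps
    h = torch.tanh(h @ W + xs[t])

h.sum().backward()               # chain rule applied through all 20 steps
print(W.grad.shape)              # torch.Size([4, 4]); sums contributions over time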

Summary

Backpropagation is reverse-mode differentiation applied to neural network computational graphs. The forward pass computes activations and loss. The backward pass applies the chain rule in reverse order to compute gradients efficiently.

Each operation contributes a local derivative rule. PyTorch autograd records forward operations automatically and executes the backward pass when loss.backward() is called.

Backpropagation makes large-scale neural network training practical. Almost all modern deep learning systems depend on it.