Backpropagation is reverse mode automatic differentiation applied to neural networks. In most machine learning writing, the term refers to the whole training procedure: run a forward pass, compute a loss, propagate gradients backward, then update parameters. More precisely, backpropagation is the derivative computation step.
A neural network is a composition of functions:

$$ f = f_N \circ f_{N-1} \circ \cdots \circ f_1 $$

Each layer produces an intermediate activation:

$$ a_\ell = f_\ell(a_{\ell-1}; \theta_\ell), \qquad a_0 = x $$

The loss is a scalar:

$$ L = \mathcal{L}(a_N, y) $$

Backpropagation computes derivatives of $L$ with respect to every parameter block $\theta_\ell$. It does this by walking the computation backward and applying the chain rule.
Forward Pass
The forward pass evaluates the model and stores intermediate values needed by the backward pass.
For a simple feedforward network:

$$ z = Wx + b, \qquad a = \sigma(z), \qquad L = \mathcal{L}(a, y) $$

The values $x$, $z$, $W$, and sometimes masks or shape metadata may be needed later to compute gradients.
A reverse mode AD system therefore saves selected intermediate values. These saved values are often called residuals, activations, or tape entries depending on the implementation.
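As an illustration, a single linear + ReLU layer might save its residuals like this (a minimal NumPy sketch; `linear_relu_forward` and the `residuals` dictionary are illustrative names, not the API of any particular framework):

```python
import numpy as np

def linear_relu_forward(W, b, x):
    """Forward pass for one linear + ReLU layer, saving residuals.

    Returns the activation together with the values the backward pass
    will need later: the layer input x and the pre-activation z.
    """
    z = W @ x + b
    a = np.maximum(z, 0.0)
    residuals = {"x": x, "z": z}   # saved for the backward pass
    return a, residuals
```

The residuals dictionary plays the role of a tape entry: it is whatever the layer's backward rule will consume.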
Backward Pass
The backward pass starts from the scalar loss. Since

$$ \frac{\partial L}{\partial L} = 1, $$

the initial adjoint is one.

For each intermediate variable $v$, backpropagation computes its adjoint:

$$ \bar{v} = \frac{\partial L}{\partial v} $$

If a variable feeds into several later operations, its adjoint receives contributions from all uses. The contributions are summed:

$$ \bar{v} = \sum_{u \in \mathrm{uses}(v)} \bar{u}\, \frac{\partial u}{\partial v} $$
This accumulation rule is the reason gradient buffers are additive. A parameter used many times in a computation graph receives one contribution per use.
A Single Layer
Consider one layer:

$$ z = Wx + b, \qquad a = \sigma(z) $$

Suppose the backward pass receives

$$ \bar{a} = \frac{\partial L}{\partial a} $$

The derivative through the activation is:

$$ \bar{z} = \bar{a} \odot \sigma'(z) $$

The parameter gradients are:

$$ \bar{W} = \bar{z}\, x^\top, \qquad \bar{b} = \bar{z} $$

The adjoint passed to the previous activation is:

$$ \bar{x} = W^\top \bar{z} $$
This is the local structure of backpropagation. Each layer receives an upstream adjoint, computes gradients for its own parameters, and returns a downstream adjoint for earlier layers.
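The three steps above can be written directly as a NumPy sketch (ReLU is assumed as the activation, and `layer_backward` is an illustrative name):

```python
import numpy as np

def layer_backward(a_bar, residuals, W):
    """Backward pass for z = W @ x + b, a = relu(z).

    a_bar is the upstream adjoint dL/da.  Returns gradients for W and b
    and the downstream adjoint dL/dx.
    """
    x, z = residuals["x"], residuals["z"]
    z_bar = a_bar * (z > 0)      # a_bar ⊙ σ'(z), with σ = ReLU
    W_bar = np.outer(z_bar, x)   # z_bar x^T
    b_bar = z_bar                # dL/db
    x_bar = W.T @ z_bar          # W^T z_bar, passed to the previous layer
    return W_bar, b_bar, x_bar
```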
Batched Backpropagation
In practice, layers process batches. Let

$$ X \in \mathbb{R}^{d \times B} $$

contain input activations, one column per example. A linear layer computes:

$$ Z = WX + b\mathbf{1}^\top, \qquad A = \sigma(Z) $$

If the upstream adjoint after the activation is $\bar{A}$, then, with

$$ \bar{Z} = \bar{A} \odot \sigma'(Z), $$

the gradients are:

$$ \bar{W} = \bar{Z} X^\top, \qquad \bar{b} = \bar{Z}\mathbf{1}, \qquad \bar{X} = W^\top \bar{Z} $$
These formulas explain why backpropagation maps efficiently to dense matrix multiplication. Training neural networks is largely a sequence of forward matrix products and backward matrix products.
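A batched backward rule is just those matrix products (a NumPy sketch; `batched_linear_backward` is an illustrative name):

```python
import numpy as np

def batched_linear_backward(Z_bar, X, W):
    """Gradients for Z = W @ X + b, where columns of X are examples.

    Z_bar is the upstream adjoint dL/dZ.  All three outputs are dense
    matrix products, which is why batched backprop maps well to
    matmul hardware.
    """
    W_bar = Z_bar @ X.T          # sums the per-example outer products
    b_bar = Z_bar.sum(axis=1)    # bias is shared across the batch
    X_bar = W.T @ Z_bar          # adjoint passed to earlier layers
    return W_bar, b_bar, X_bar
```

Note that `W_bar` implicitly sums one outer-product contribution per example, which is the batch form of the accumulation rule.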
Computational Graph View
Backpropagation does not require the program to be written as layers. It only needs a graph of operations and dependencies.
A computation graph contains nodes for values and operations:
```
x -> linear -> relu -> linear -> loss
```

The backward graph runs in reverse topological order:

```
loss -> linear backward -> relu backward -> linear backward -> x
```

Each primitive operation has a local backward rule. For example, multiplication, matrix multiplication, convolution, normalization, and indexing each define how incoming adjoints are transformed into adjoints for their inputs.
The global gradient emerges from composing these local rules.
Backpropagation and Reverse Mode AD
Backpropagation is sometimes presented as a special neural network algorithm. In AD terms, it is an instance of reverse accumulation.
For a function

$$ L = f(\theta), \qquad f : \mathbb{R}^n \to \mathbb{R}, $$

where $L$ is scalar and $\theta$ contains many parameters, reverse mode computes:

$$ \nabla_\theta L = \left( \frac{\partial L}{\partial \theta_1}, \dots, \frac{\partial L}{\partial \theta_n} \right) $$

The cost is usually within a small constant factor of evaluating $f$, assuming saved intermediates are available.
This cost property explains the importance of backpropagation. A neural network may have billions of parameters, but one scalar loss. Reverse mode computes all parameter gradients with one backward traversal.
Memory Cost
Backpropagation requires memory for saved forward values. For deep networks, activation memory can exceed parameter memory.
The standard tradeoff is:
| Strategy | Memory | Compute |
|---|---|---|
| Save all activations | High | Low |
| Recompute activations | Low | High |
| Checkpoint selected layers | Medium | Medium |
Checkpointing stores only some intermediate values. During the backward pass, missing values are recomputed from the nearest checkpoint. This reduces memory at the cost of extra forward computation.
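The recomputation idea can be sketched without any gradient machinery: save only every $k$-th layer input, and rebuild a discarded activation from the nearest earlier checkpoint when it is needed (function names here are illustrative):

```python
def forward_with_checkpoints(layers, x, every=2):
    """Run layers in order, saving the input of every `every`-th layer."""
    checkpoints = {}
    for i, layer in enumerate(layers):
        if i % every == 0:
            checkpoints[i] = x   # checkpoint: input to layer i
        x = layer(x)             # intermediate activations are discarded
    return x, checkpoints

def recompute_activation(layers, checkpoints, i, every=2):
    """Rebuild the input to layer i from the nearest earlier checkpoint."""
    start = (i // every) * every
    x = checkpoints[start]
    for j in range(start, i):
        x = layers[j](x)         # extra forward compute, traded for memory
    return x
```

A real backward pass would call `recompute_activation` just before running each layer's backward rule, then free the rebuilt values again.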
This tradeoff belongs to AD runtime design, not to the mathematical chain rule. The derivative is the same. The execution plan changes.
Custom Backward Rules
Many systems allow custom gradients. A custom backward rule replaces the default derivative of a primitive or composite operation.
This is useful when the mathematically direct derivative is inefficient, unstable, or inconvenient. Examples include softmax with cross-entropy, normalization layers, clipping operations, and fused GPU kernels.
A custom backward rule must obey the same contract: given upstream adjoints and saved residuals, return adjoints for differentiable inputs.
Incorrect custom gradients are dangerous because the AD system will trust them. Testing usually compares the custom gradient against finite differences or a simpler reference implementation.
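A finite-difference check is short enough to sketch directly (a central-difference estimator; the toy function and names are illustrative):

```python
import numpy as np

def finite_difference_grad(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of scalar f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = eps
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# A "custom" gradient to validate: f(x) = sum(x**2) has gradient 2x.
def f(x):
    return float((x ** 2).sum())

def analytic_grad(x):
    return 2 * x
```

Agreement to a few decimal places is the usual pass criterion; exact equality is not expected because the estimator itself has truncation and rounding error.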
Backpropagation Through Shared Parameters
Some models use the same parameter in multiple places. Recurrent networks, tied embeddings, and weight sharing all have this structure.
If a parameter $\theta$ is used at several graph locations, then its total gradient is the sum of all contributions:

$$ \frac{\partial L}{\partial \theta} = \sum_{i} \frac{\partial L}{\partial \theta^{(i)}}, $$

where $\theta^{(i)}$ denotes the $i$-th use of $\theta$ in the graph.
The AD system must accumulate into the same gradient buffer. This is a graph property, not a layer property. Shared parameters make it clear that backpropagation operates over dependency graphs rather than over a simple list of layers.
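A tiny worked case makes the accumulation concrete. For $L = (w x_1)(w x_2)$, the parameter $w$ is used twice, and each use sends its own contribution into the same gradient buffer (hand-derived contributions; the function name is illustrative):

```python
def shared_param_grad(w, x1, x2):
    """Gradient of L = (w*x1) * (w*x2) by summing per-use contributions.

    The two uses of w are distinct graph locations; reverse mode
    accumulates one contribution per use into the same buffer.
    """
    u, v = w * x1, w * x2   # two uses of the shared parameter w
    L = u * v
    grad_w = 0.0
    grad_w += v * x1        # contribution through the use u = w*x1
    grad_w += u * x2        # contribution through the use v = w*x2
    return L, grad_w        # grad_w equals 2*w*x1*x2, the analytic answer
```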
Backpropagation Through Time
For recurrent computation, the same transition is applied repeatedly:

$$ h_t = f(h_{t-1}, x_t; \theta) $$

Unrolling the recurrence gives an ordinary computation graph:

$$ h_0 \to h_1 \to h_2 \to \cdots \to h_T \to L $$

Backpropagation through time applies reverse mode to this unrolled graph. Gradients with respect to $\theta$ accumulate across time steps.
The memory cost grows with sequence length because hidden states and other residuals are needed for the backward pass. Long sequences often require truncation, checkpointing, or specialized architectures.
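For the simplest possible recurrence, $h_t = w\, h_{t-1}$ with $L = h_T$, backpropagation through time can be written out in a few lines (a scalar sketch; the analytic answer is $\partial L / \partial w = T\, w^{T-1} h_0$):

```python
def bptt_scalar(w, h0, T):
    """BPTT for h_t = w * h_{t-1}, loss L = h_T.

    The forward pass stores every hidden state (the memory cost that
    grows with sequence length); the backward pass accumulates one
    contribution to w's gradient per time step.
    """
    hs = [h0]
    for _ in range(T):
        hs.append(w * hs[-1])        # forward: save all hidden states
    h_bar = 1.0                       # adjoint of L = h_T
    w_bar = 0.0
    for t in range(T, 0, -1):
        w_bar += h_bar * hs[t - 1]   # w's contribution at step t
        h_bar = h_bar * w            # adjoint for h_{t-1}
    return hs[-1], w_bar
```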
Numerical Issues
Backpropagation multiplies many local derivative factors. This can cause vanishing or exploding gradients.
For a repeated scalar recurrence,

$$ h_t = w\, h_{t-1}, $$

the derivative across $T$ steps is:

$$ \frac{\partial h_T}{\partial h_0} = w^T $$

If $|w| < 1$, the derivative decays. If $|w| > 1$, it grows. Deep networks and recurrent networks have higher-dimensional versions of the same issue, where products of Jacobians control gradient scale.
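The decay and growth are easy to see numerically: multiplying even mildly contractive or expansive factors a hundred times produces gradients that are far too small or far too large to be useful.

```python
def gradient_through_time(w, T):
    """Product of T identical local derivative factors: w**T."""
    g = 1.0
    for _ in range(T):
        g *= w   # one local derivative factor per step
    return g
```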
AD computes these products according to the program. It does not guarantee that the products are numerically well-conditioned.
Minimal Backpropagation Engine
A small reverse mode engine stores values, parents, and local backward functions.
```
class Node:
    value
    grad
    parents
    backward
```

A computation creates nodes:

```
z = add(matmul(W, x), b)
a = relu(z)
loss = mse(a, y)
```

Calling backward on loss proceeds as follows:

```
topological_order = sort_graph(loss)
loss.grad = 1
for node in reverse(topological_order):
    node.backward(node.grad)
```

Each backward function adds gradient contributions to parent nodes.
This minimal design is enough to express the essence of backpropagation. Production systems add tensor kernels, memory planning, alias tracking, distributed execution, mixed precision, custom operators, and graph compilation.
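One runnable version of this sketch, restricted to scalar values and two primitives (only `add` and `mul` are implemented here; all names are illustrative):

```python
class Node:
    """Minimal reverse-mode node: value, gradient buffer, parents, local rule."""
    def __init__(self, value, parents=(), backward_rule=None):
        self.value = value
        self.grad = 0.0
        self.parents = parents
        # Given the upstream adjoint g, return one adjoint per parent.
        self.backward_rule = backward_rule or (lambda g: ())

def add(a, b):
    return Node(a.value + b.value, (a, b), lambda g: (g, g))

def mul(a, b):
    return Node(a.value * b.value, (a, b),
                lambda g: (g * b.value, g * a.value))

def backward(loss):
    """Reverse-mode sweep: topological sort, then accumulate adjoints."""
    order, seen = [], set()
    def visit(n):
        if n not in seen:
            seen.add(n)
            for p in n.parents:
                visit(p)
            order.append(n)
    visit(loss)
    loss.grad = 1.0   # dL/dL = 1
    for node in reversed(order):
        for parent, g in zip(node.parents, node.backward_rule(node.grad)):
            parent.grad += g   # additive accumulation, one term per use
```

For `loss = add(mul(w, x), w)` with `w = 2`, `x = 3`, the engine accumulates both uses of `w` and returns `w.grad = x + 1 = 4`.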
Backpropagation as an Interface Boundary
Backpropagation connects the model and the optimizer.
The model defines the forward computation:
```
loss = model_loss(theta, batch)
```

The AD system derives the gradient computation:

```
grad = backward(loss, theta)
```

The optimizer applies an update:

```
theta = optimizer_update(theta, grad, state)
```

Keeping this boundary clear helps system design. A model should not need to know whether the optimizer is SGD or Adam. An optimizer should not need to know whether the gradient came from a tape, a compiler pass, or a manually written backward function.
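The boundary can be made concrete as a training step that only sees the protocol, not the implementations on either side (a toy sketch: `loss_and_grad` stands in for the model-plus-AD side, and the update line is a plain SGD optimizer; both could be swapped independently):

```python
def train_step(theta, batch, loss_and_grad, lr=0.1):
    """One step across the model / AD / optimizer boundary."""
    loss, grad = loss_and_grad(theta, batch)            # model + AD side
    theta = [t - lr * g for t, g in zip(theta, grad)]   # optimizer side (SGD)
    return theta, loss

# A toy model with a hand-written gradient: loss = sum((theta_i - y_i)^2).
def toy_loss_and_grad(theta, batch):
    loss = sum((t - y) ** 2 for t, y in zip(theta, batch))
    grad = [2 * (t - y) for t, y in zip(theta, batch)]
    return loss, grad
```

Because `train_step` only consumes `(loss, grad)`, replacing the hand-written gradient with a tape-based one, or SGD with Adam, changes nothing else.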
Backpropagation is therefore both a mathematical procedure and a systems protocol. It is the standard way a differentiable program exposes parameter sensitivities to an optimization algorithm.