Reverse-mode differentiation is the method used by backpropagation. It computes derivatives by first evaluating a function forward, then propagating gradient information backward from the output to the inputs.
This method is especially useful in deep learning because neural networks usually have many parameters and one scalar loss. Reverse-mode differentiation can compute the gradient of one scalar output with respect to millions or billions of parameters efficiently.
The Problem Setting
Assume we have a function
L = f(θ),
where θ ∈ ℝⁿ is a vector of parameters and L ∈ ℝ is a scalar loss.
If
θ = (θ₁, θ₂, …, θₙ),
then the gradient is
∇θ L = (∂L/∂θ₁, ∂L/∂θ₂, …, ∂L/∂θₙ).
A deep model may have millions of parameters. Computing each partial derivative separately would be too expensive. Reverse-mode differentiation avoids that cost by reusing intermediate derivative information.
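To make the cost difference concrete, the sketch below approximates each partial derivative by finite differences, which needs one extra forward pass per parameter, and compares the result against a single backward pass. The quadratic loss and the parameter count are illustrative choices, not anything from a real model.

```python
import torch

n = 200
theta = torch.randn(n, dtype=torch.float64, requires_grad=True)

def loss_fn(t):
    return (t ** 2).sum()

# Finite differences: one extra forward pass per parameter (n + 1 passes total).
eps = 1e-6
base = loss_fn(theta.detach())
fd_grad = torch.empty(n, dtype=torch.float64)
for i in range(n):
    perturbed = theta.detach().clone()
    perturbed[i] += eps
    fd_grad[i] = (loss_fn(perturbed) - base) / eps

# Reverse mode: one forward pass and one backward pass, regardless of n.
loss = loss_fn(theta)
loss.backward()

print(torch.allclose(theta.grad, fd_grad, atol=1e-4))  # True
```

For n in the millions, the finite-difference loop becomes millions of forward passes, while reverse mode still needs only two passes through the model.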
Forward Mode Versus Reverse Mode
There are two broad modes of automatic differentiation: forward mode and reverse mode.
Forward mode propagates derivatives from inputs to outputs. It answers: if one input changes, how does each later value change?
Reverse mode propagates derivatives from outputs to inputs. It answers: if the final output changes, how much did each earlier value contribute?
For a function
f : ℝⁿ → ℝᵐ,
forward mode is efficient when n is small. Reverse mode is efficient when m is small.
Deep learning training usually has
L = f(θ), θ ∈ ℝⁿ,
where n may be very large. The output is one scalar loss. This is the ideal case for reverse mode.
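This asymmetry can be seen directly with torch.func (available in PyTorch 2.x), which provides both forward-mode (jacfwd) and reverse-mode (jacrev) differentiation. For the many-inputs, one-output function below, both agree on the result; they differ only in how the work is organized.

```python
import torch
from torch.func import jacfwd, jacrev

# A function from R^3 to R: the typical "many inputs, one output" shape.
def f(x):
    return (x ** 2).sum()

x = torch.tensor([1.0, 2.0, 3.0])

g_fwd = jacfwd(f)(x)  # forward mode: one pass per input dimension
g_rev = jacrev(f)(x)  # reverse mode: one pass per output dimension (here: one)

print(g_fwd)  # tensor([2., 4., 6.])
print(g_rev)  # tensor([2., 4., 6.])
```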
A Simple Example
Consider
z = (x + y)²
at x = 2, y = 3. Break it into intermediate variables:
a = x + y, z = a².
The forward pass computes values:
a = 5, z = 25.
The reverse pass computes derivatives of the final output with respect to each intermediate value.
We start with
∂z/∂z = 1.
Then
∂z/∂a = 2a = 10.
Since
a = x + y,
we have
∂a/∂x = 1, ∂a/∂y = 1.
Therefore
∂z/∂x = (∂z/∂a)(∂a/∂x) = 10,
and
∂z/∂y = (∂z/∂a)(∂a/∂y) = 10.
In PyTorch:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
a = x + y
z = a ** 2
z.backward()
print(x.grad) # tensor(10.)
print(y.grad) # tensor(10.)
PyTorch performs this reverse traversal automatically.
Adjoints
Reverse-mode differentiation often uses the term adjoint. The adjoint of a variable is the derivative of the final scalar output with respect to that variable.
For a variable v, its adjoint is written as
v̄ = ∂L/∂v.
Here L is the final scalar loss.
For the computation
a = x + y, z = a²,
the adjoints are
z̄ = 1, ā = z̄ · 2a, x̄ = ā, ȳ = ā.
At x = 2, y = 3, we get
ā = 10, x̄ = 10, ȳ = 10.
The adjoint notation is useful because it describes the backward pass locally. Each operation receives an upstream adjoint and distributes it to its inputs.
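The adjoint bookkeeping for the example above can be written out by hand in a few lines of plain Python. The names z_bar, a_bar, and so on stand for the adjoints z̄, ā:

```python
# Forward pass for z = (x + y) ** 2 at x = 2, y = 3.
x, y = 2.0, 3.0
a = x + y               # a = 5
z = a ** 2              # z = 25

# Reverse pass: each adjoint is dz/d(variable).
z_bar = 1.0             # dz/dz = 1
a_bar = z_bar * 2 * a   # dz/da = 2a = 10
x_bar = a_bar * 1.0     # da/dx = 1
y_bar = a_bar * 1.0     # da/dy = 1

print(x_bar, y_bar)  # 10.0 10.0
```

Each line of the reverse pass uses only the upstream adjoint and a local derivative, which is exactly what the next section formalizes.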
Local Backward Rules
Reverse-mode differentiation depends on local backward rules. Each primitive operation knows how to send gradients to its inputs.
For addition,
c = a + b.
If the upstream gradient is c̄, then
ā += c̄, b̄ += c̄.
For multiplication,
c = a · b.
The local derivatives are
∂c/∂a = b, ∂c/∂b = a.
Thus
ā += c̄ · b, b̄ += c̄ · a.
For squaring,
c = a².
The backward rule is
ā += c̄ · 2a.
The symbol += means that gradients are accumulated. A variable may influence the output through multiple paths, so all contributions must be added.
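These local rules are enough to build a toy reverse-mode system. The Var class below is an illustrative sketch, not how PyTorch is implemented; it only shows each operation distributing its upstream adjoint to its inputs with +=.

```python
# A minimal sketch of local backward rules with += accumulation.
class Var:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0
        self._backward = lambda: None

def add(a, b):
    out = Var(a.value + b.value)
    def _backward():
        a.grad += out.grad  # d(a+b)/da = 1
        b.grad += out.grad  # d(a+b)/db = 1
    out._backward = _backward
    return out

def mul(a, b):
    out = Var(a.value * b.value)
    def _backward():
        a.grad += out.grad * b.value  # d(a*b)/da = b
        b.grad += out.grad * a.value  # d(a*b)/db = a
    out._backward = _backward
    return out

# z = x*y + x at x = 2, y = 3  ->  dz/dx = y + 1 = 4, dz/dy = x = 2
x, y = Var(2.0), Var(3.0)
u = mul(x, y)
z = add(u, x)
z.grad = 1.0
z._backward()  # distributes z.grad to u and x
u._backward()  # distributes u.grad to x and y

print(x.grad, y.grad)  # 4.0 2.0
```

Note that x receives two contributions, one from the multiplication and one from the addition, which is the accumulation behavior discussed next.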
Why Gradients Accumulate
Consider
z = x² + x.
Break it into steps:
u = x², v = x, z = u + v.
The variable x contributes to z through two paths: one through u, and one through v.
The derivative is
dz/dx = 2x + 1.
Reverse mode obtains this by adding contributions from both paths.
At x = 3,
dz/dx = 2 · 3 + 1 = 7.
In PyTorch:
x = torch.tensor(3.0, requires_grad=True)
z = x ** 2 + x
z.backward()
print(x.grad) # tensor(7.)
This accumulation behavior is the reason PyTorch adds gradients into .grad fields rather than replacing them automatically.
Reverse Pass on a Larger Graph
Consider the computation
u = x · y, v = x + y, L = u + v.
The forward pass computes u, v, and L. The reverse pass starts with
L̄ = 1.
Since
L = u + v,
we get
ū = 1, v̄ = 1.
Since
u = x · y,
we add
x̄ += ū · y, ȳ += ū · x.
Since
v = x + y,
we add
x̄ += v̄, ȳ += v̄.
Thus
x̄ = y + 1, ȳ = x + 1.
At x = 2, y = 3,
x̄ = 4, ȳ = 3.
PyTorch:
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
u = x * y
v = x + y
L = u + v
L.backward()
print(x.grad) # tensor(4.)
print(y.grad) # tensor(3.)
Vector-Jacobian Products
Reverse mode can be understood through vector-Jacobian products.
Suppose
y = f(x),
where x ∈ ℝⁿ and y ∈ ℝᵐ. The Jacobian is
J ∈ ℝ^(m×n), with J_ij = ∂y_i/∂x_j.
If a later scalar loss L depends on y, then the upstream gradient is
v = ∂L/∂y ∈ ℝᵐ.
The gradient with respect to x is
∂L/∂x = Jᵀ v.
This operation is a vector-Jacobian product. It avoids explicitly forming the full Jacobian. This matters because Jacobians in deep learning can be extremely large.
PyTorch’s backward pass is primarily a system for computing vector-Jacobian products efficiently.
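The VJP view can be exercised directly with torch.func.vjp (PyTorch 2.x). In the sketch below the function is an elementwise square, so its 3×3 Jacobian is diag(2x); the product with v is computed without ever materializing that matrix.

```python
import torch
from torch.func import vjp

def f(x):
    return x ** 2  # elementwise: Jacobian is diag(2x)

x = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([1.0, 1.0, 1.0])

y, vjp_fn = vjp(f, x)
(grad_x,) = vjp_fn(v)  # computes J^T v without building J

print(grad_x)  # tensor([2., 4., 6.])
```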
Non-Scalar Outputs
When the output is scalar, PyTorch implicitly uses an upstream gradient of 1.
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2
y.backward()
This means y.backward() is equivalent to y.backward(torch.tensor(1.0)).
When the output is not scalar, the user must provide the upstream gradient.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
y.backward(torch.tensor([1.0, 1.0, 1.0]))
print(x.grad) # tensor([2., 4., 6.])
Here the provided vector acts as v. Since
y_i = x_i²,
the Jacobian is diagonal with entries 2x_i, and the vector-Jacobian product gives
(x.grad)_i = v_i · 2x_i.
With v = (1, 1, 1), this gives x.grad = (2, 4, 6).
A different upstream gradient gives a different vector-Jacobian product:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
y.backward(torch.tensor([10.0, 1.0, 0.1]))
print(x.grad) # tensor([20.0000, 4.0000, 0.6000])
Reverse Mode in Neural Networks
A neural network is a composition of many operations:
L = loss(f_k(⋯ f_2(f_1(x)) ⋯), target).
The forward pass computes predictions and loss. The reverse pass computes
∂L/∂θ for every parameter θ.
For a multilayer network,
ŷ = W₃ σ(W₂ σ(W₁ x)), L = loss(ŷ, target).
Reverse mode starts from ∂L/∂L = 1, then moves backward through the loss, final linear layer, activation, hidden linear layer, another activation, and first linear layer. Each operation applies its local backward rule.
PyTorch code:
import torch
from torch import nn
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
X = torch.randn(16, 10)
target = torch.randn(16, 1)
pred = model(X)
loss = ((pred - target) ** 2).mean()
loss.backward()
for name, param in model.named_parameters():
    print(name, param.grad.shape)
The programmer does not write the derivative rules for the whole network. PyTorch composes local backward rules automatically.
Cost of Reverse Mode
Reverse mode has two important costs.
The first cost is computation. The backward pass usually costs the same order of magnitude as the forward pass. For many neural networks, one training step costs roughly one forward pass plus one backward pass.
The second cost is memory. Reverse mode needs intermediate values from the forward pass. These values are stored so that backward rules can use them later.
For example, the derivative of
y = x²
needs the value of x. The derivative of ReLU needs to know which entries were positive. The derivative of batch normalization needs normalization statistics. The derivative of attention needs intermediate tensors related to queries, keys, values, attention scores, and probabilities.
This is why training uses more memory than inference.
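One way to observe these saved values is through the saved-tensors hooks in torch.autograd.graph. The sketch below records the shape of every tensor autograd stores for the backward pass, and contrasts this with a forward pass under torch.no_grad(), where no graph is built and nothing is saved.

```python
import torch
from torch.autograd.graph import saved_tensors_hooks

saved = []

def pack(t):
    saved.append(t.shape)  # record what autograd stores for backward
    return t

def unpack(t):
    return t

x = torch.randn(8, 4, requires_grad=True)
w = torch.randn(4, 4)

# With gradients enabled, autograd saves the tensors its backward rules need.
with saved_tensors_hooks(pack, unpack):
    y = torch.relu(x @ w)
    loss = (y ** 2).sum()

print(len(saved) > 0)  # True: intermediate values were stored

# Under no_grad, no graph is recorded and nothing is saved.
with torch.no_grad():
    y2 = torch.relu(x @ w)
print(y2.grad_fn)  # None
```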
Checkpointing
Gradient checkpointing reduces memory use by storing fewer intermediate activations during the forward pass. During the backward pass, missing activations are recomputed.
This trades computation for memory.
Without checkpointing, the system stores every intermediate activation from the forward pass:
h₁, h₂, …, h_k.
With checkpointing, the system stores only a subset, such as the inputs to each checkpointed segment, and recomputes the rest during the backward pass.
In PyTorch, checkpointing can be applied with torch.utils.checkpoint:
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint
layer1 = nn.Linear(128, 128)
layer2 = nn.Linear(128, 128)
def block(x):
    return layer2(torch.relu(layer1(x)))
x = torch.randn(16, 128, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
loss = y.sum()
loss.backward()
Checkpointing is common when training large models that would otherwise exceed GPU memory.
In-Place Operations and Reverse Mode
Reverse mode depends on saved forward values. In-place operations can overwrite those values and break gradient computation.
Example:
x = torch.randn(4, requires_grad=True)
y = x ** 2
x.add_(1.0) # unsafe
loss = y.sum()
loss.backward()
The backward rule for y = x ** 2 needs the original value of x. If that value is modified in place, PyTorch detects the problem and raises an error; in this example the in-place update itself fails, because x is a leaf tensor that requires grad.
A safer version uses out-of-place operations:
x = torch.randn(4, requires_grad=True)
y = x ** 2
x2 = x + 1.0
loss = y.sum()
loss.backward()
In-place operations are sometimes useful for memory efficiency, but they should be used carefully when gradients are involved.
Detaching and Stopping Gradients
Sometimes a computation should stop gradient flow. PyTorch uses .detach() for this.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
z = y.detach()
w = 3 * z
print(w.requires_grad) # False
The tensor z has the same numerical value as y, but it has no connection to the graph that produced y.
This is used in target networks, contrastive learning, reinforcement learning, teacher-student methods, logging, and some optimization algorithms.
A common pattern is:
with torch.no_grad():
    target = teacher_model(x)
pred = student_model(x)
loss = loss_fn(pred, target)
loss.backward()
The target is treated as a fixed value. Gradients update the student model but not the teacher computation.
Summary
Reverse-mode differentiation computes gradients by traversing a computational graph backward from a scalar output. It starts with an upstream gradient of 1 at the loss, then applies local backward rules for each operation.
This method is efficient for deep learning because training usually requires the gradient of one scalar loss with respect to many parameters. PyTorch implements reverse mode through autograd. It records operations during the forward pass, stores needed intermediate values, and computes vector-Jacobian products during the backward pass.