Gradient computation is the process of measuring how a scalar output changes when its input values change. In deep learning, the scalar output is usually the loss, and the inputs are usually the model parameters.
If a model has parameters $\theta$ and loss $L(\theta)$, training needs the gradient

$$\nabla_\theta L = \left( \frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \dots, \frac{\partial L}{\partial \theta_n} \right).$$
This gradient tells the optimizer how to update the parameters. If a parameter change increases the loss, the optimizer usually moves in the opposite direction. If a parameter change decreases the loss, the optimizer favors that direction.
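As a small illustration of that behavior, the toy example below (the quadratic loss and step size are arbitrary choices, not from this section) shows that stepping against the gradient lowers the loss while stepping with it raises it:

import torch

theta = torch.tensor(2.0, requires_grad=True)
loss = (theta - 1.0) ** 2            # toy loss, minimized at theta = 1
loss.backward()                      # theta.grad = 2 * (theta - 1) = 2

step = 0.1 * theta.grad
with torch.no_grad():
    loss_down = ((theta - step) - 1.0) ** 2   # move against the gradient
    loss_up = ((theta + step) - 1.0) ** 2     # move with the gradient
print(loss.item(), loss_down.item(), loss_up.item())  # approximately: 1.0 0.64 1.44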
Derivatives of Scalar Functions
For a scalar function

$$y = f(x),$$

the derivative is

$$\frac{dy}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}.$$

It measures the local rate of change of $y$ with respect to $x$.

For example, if

$$y = x^2,$$

then

$$\frac{dy}{dx} = 2x.$$

At $x = 3$, the derivative is $6$. A small increase in $x$ will increase $y$ by roughly six times that small amount.
In PyTorch:
import torch
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2
y.backward()
print(x.grad) # tensor(6.)

The call to backward() computes the derivative of y with respect to x.
Gradients of Multivariable Functions
Most neural network functions depend on many variables. For a scalar function

$$L = f(x_1, x_2, \dots, x_n),$$

the gradient is the vector of partial derivatives:

$$\nabla L = \left( \frac{\partial L}{\partial x_1}, \frac{\partial L}{\partial x_2}, \dots, \frac{\partial L}{\partial x_n} \right).$$

Each entry says how the loss changes when one input changes and the others are held fixed.

Consider

$$L = x^2 + 3y.$$

Then

$$\frac{\partial L}{\partial x} = 2x, \qquad \frac{\partial L}{\partial y} = 3.$$

At $x = 2$ and $y = 4$,

$$\frac{\partial L}{\partial x} = 4, \qquad \frac{\partial L}{\partial y} = 3.$$
In PyTorch:
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
L = x ** 2 + 3 * y
L.backward()
print(x.grad) # tensor(4.)
print(y.grad) # tensor(3.)

Gradients with Respect to Tensors
A tensor gradient has the same shape as the tensor it differentiates.
If

$$W \in \mathbb{R}^{3 \times 4}$$

and $L$ is a scalar loss, then

$$\frac{\partial L}{\partial W} \in \mathbb{R}^{3 \times 4}.$$

The entry at position $(i, j)$ is

$$\left( \frac{\partial L}{\partial W} \right)_{ij} = \frac{\partial L}{\partial W_{ij}}.$$
In PyTorch:
W = torch.randn(3, 4, requires_grad=True)
L = (W ** 2).sum()
L.backward()
print(W.shape) # torch.Size([3, 4])
print(W.grad.shape) # torch.Size([3, 4])

Since

$$L = \sum_{i,j} W_{ij}^2,$$

the gradient is

$$\frac{\partial L}{\partial W_{ij}} = 2 W_{ij}.$$

Thus W.grad contains 2 * W.
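As a quick check, the autograd result can be compared against the analytic gradient 2 * W (a sketch repeating the snippet above):

import torch

W = torch.randn(3, 4, requires_grad=True)
L = (W ** 2).sum()
L.backward()

# The autograd gradient should match the analytic result 2 * W exactly.
print(torch.allclose(W.grad, 2 * W))  # True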
Why the Loss Must Usually Be Scalar
PyTorch’s backward() is simplest when called on a scalar tensor.
x = torch.tensor(2.0, requires_grad=True)
loss = x ** 2
loss.backward()

This works because loss contains one number.
If the output is a vector, PyTorch needs to know which scalar quantity should be differentiated. For example:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
# y.backward() would fail because y is not scalar

One common solution is to reduce the vector to a scalar:
loss = y.sum()
loss.backward()
print(x.grad) # tensor([2., 4., 6.])

Here

$$L = \sum_i x_i^2, \qquad \frac{\partial L}{\partial x_i} = 2 x_i.$$
Another option is to provide an upstream gradient:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
y.backward(torch.tensor([1.0, 1.0, 1.0]))
print(x.grad) # tensor([2., 4., 6.])

This tells PyTorch how to combine the vector outputs into a scalar derivative. This idea will be formalized later as vector-Jacobian products.
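For intuition, the upstream gradient need not be all ones; each entry weights the corresponding output before it flows back. The weights below are arbitrary, chosen only for illustration:

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

# Equivalent to differentiating the scalar 1*y[0] + 0.5*y[1] + 2*y[2].
y.backward(torch.tensor([1.0, 0.5, 2.0]))
print(x.grad)  # tensor([ 2.,  2., 12.])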
The Chain Rule
Deep learning depends on the chain rule. Neural networks are compositions of functions. If

$$z = g(y), \qquad y = f(x),$$

then

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}.$$

For example, consider

$$z = (x + 1)^2.$$

Let

$$y = x + 1.$$

Then

$$z = y^2.$$

The derivatives are

$$\frac{dz}{dy} = 2y, \qquad \frac{dy}{dx} = 1.$$

Therefore

$$\frac{dz}{dx} = 2y \cdot 1 = 2(x + 1).$$

At $x = 3$, the derivative is $2 \cdot 4 = 8$.
In PyTorch:
x = torch.tensor(3.0, requires_grad=True)
y = x + 1
z = y ** 2
z.backward()
print(x.grad) # tensor(8.)

PyTorch records the intermediate operation y = x + 1, then applies the chain rule automatically during the backward pass.
Local Gradients and Upstream Gradients
Each operation in a computational graph has a local gradient. During backpropagation, this local gradient is multiplied by an upstream gradient.
Consider

$$a = x + y, \qquad z = a^2.$$

The local derivative of $z$ with respect to $a$ is

$$\frac{\partial z}{\partial a} = 2a.$$

The local derivatives of $a$ are

$$\frac{\partial a}{\partial x} = 1, \qquad \frac{\partial a}{\partial y} = 1.$$

The upstream gradient arriving at $a$ is

$$\frac{\partial z}{\partial a} = 2a.$$

The gradients passed to $x$ and $y$ are

$$\frac{\partial z}{\partial x} = 2a \cdot 1, \qquad \frac{\partial z}{\partial y} = 2a \cdot 1.$$
In code:
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
a = x + y
z = a ** 2
z.backward()
print(x.grad) # tensor(10.)
print(y.grad) # tensor(10.)

The value $a = 5$, so the upstream gradient at $a$ is $2a = 10$. Since both local derivatives are $1$, both input gradients are $10$.
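The same numbers can be reproduced by hand, multiplying the upstream gradient at a by each local derivative (a sketch mirroring the snippet above):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
a = x + y
z = a ** 2
z.backward()

# Reproduce the backward pass manually: upstream gradient at a times local derivatives.
upstream = 2 * a.detach()          # dz/da = 2a = 10
local_x, local_y = 1.0, 1.0        # da/dx = da/dy = 1
print(upstream * local_x, x.grad)  # tensor(10.) tensor(10.)
print(upstream * local_y, y.grad)  # tensor(10.) tensor(10.)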
Gradients of Linear Layers
A linear layer computes

$$Y = X W^\top + b.$$

Let

$$X \in \mathbb{R}^{B \times d}, \qquad W \in \mathbb{R}^{h \times d}, \qquad b \in \mathbb{R}^{h}, \qquad Y \in \mathbb{R}^{B \times h}.$$

Suppose the loss is

$$L = \sum_{i,j} Y_{ij}.$$

Then every output entry contributes equally to the loss.

The gradient with respect to the bias is

$$\frac{\partial L}{\partial b_j} = B,$$

since each bias entry is added to all $B$ rows. The gradient with respect to the weight is

$$\frac{\partial L}{\partial W_{jk}} = \sum_{i=1}^{B} X_{ik}.$$
In PyTorch:
B = 5
d = 3
h = 4
layer = torch.nn.Linear(d, h)
X = torch.randn(B, d)
Y = layer(X)
L = Y.sum()
L.backward()
print(layer.weight.grad.shape) # torch.Size([4, 3])
print(layer.bias.grad.shape) # torch.Size([4])

The shapes match the parameter shapes.
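To connect the shapes back to the formulas, a small check (a sketch reusing the same toy dimensions) can confirm that the bias gradient is filled with B and that each row of the weight gradient is the column sum of X:

import torch

B, d, h = 5, 3, 4
layer = torch.nn.Linear(d, h)
X = torch.randn(B, d)
L = layer(X).sum()
L.backward()

# Check against the formulas above: dL/db_j = B and dL/dW_jk = sum_i X_ik.
print(torch.allclose(layer.bias.grad, torch.full((h,), float(B))))     # True
print(torch.allclose(layer.weight.grad, X.sum(dim=0).expand(h, d)))    # True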
Gradient Accumulation
PyTorch accumulates gradients by default. This means that each call to backward() adds to the .grad field instead of replacing it.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
y.backward()
print(x.grad) # tensor(4.)
z = 3 * x
z.backward()
print(x.grad) # tensor(7.)

The second gradient is added to the first. Since

$$\frac{dy}{dx} = 2x = 4$$

at $x = 2$, and

$$\frac{dz}{dx} = 3,$$

the accumulated gradient is $4 + 3 = 7$.
This behavior is useful for gradient accumulation over multiple microbatches. In ordinary training loops, gradients should usually be cleared before each optimization step:
optimizer.zero_grad()
loss.backward()
optimizer.step()

Without zero_grad(), gradients from previous batches contaminate the current update.
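Returning to the microbatch case mentioned above, gradients from several small batches can be accumulated before a single optimizer step. A minimal sketch (the toy model, data, and loss below are illustrative):

import torch

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

X = torch.randn(8, 3)
y = torch.randn(8, 1)
accum_steps = 4                                       # microbatches per optimizer step

optimizer.zero_grad()
for X_mb, y_mb in zip(X.chunk(accum_steps), y.chunk(accum_steps)):
    loss = loss_fn(model(X_mb), y_mb) / accum_steps   # scale so the accumulated gradient matches a full batch
    loss.backward()                                   # adds this microbatch's gradient to .grad
optimizer.step()                                      # single update with the accumulated gradient
optimizer.zero_grad()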
Disabling Gradients
During evaluation, gradient computation wastes memory and computation. PyTorch provides torch.no_grad():
model.eval()
with torch.no_grad():
pred = model(X)

Inside this block, PyTorch does not record operations for autograd.
For inference-only code, torch.inference_mode() is often stronger:
model.eval()
with torch.inference_mode():
pred = model(X)

Both are used to avoid building a computational graph when gradients are not needed.
Gradients and Optimizers
Gradient computation alone does not update model parameters. It only fills the .grad fields.
The optimizer performs the update.
For stochastic gradient descent, the update is

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L.$$

Here $\eta$ is the learning rate.
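Written out by hand, the update is a loop over parameters. The sketch below shows roughly what an SGD optimizer does internally, ignoring momentum and weight decay; it is not the actual torch.optim implementation:

import torch

model = torch.nn.Linear(3, 1)
X = torch.randn(8, 3)
y = torch.randn(8, 1)

loss = ((model(X) - y) ** 2).mean()
loss.backward()

lr = 0.01
with torch.no_grad():                 # the update itself should not be tracked by autograd
    for p in model.parameters():
        p -= lr * p.grad              # theta <- theta - eta * grad
        p.grad = None                 # clear the gradient for the next step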
In PyTorch:
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
X = torch.randn(8, 3)
y = torch.randn(8, 1)
pred = model(X)
loss = ((pred - y) ** 2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

The call sequence matters:
- Clear old gradients.
- Compute predictions and loss.
- Run backpropagation.
- Update parameters.
A common training step is therefore:
pred = model(X)
loss = loss_fn(pred, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

Gradient Checking
Gradient checking compares automatic gradients with finite-difference approximations.
For a scalar function $f(x)$, the derivative can be approximated by

$$f'(x) \approx \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}.$$
This is useful when implementing custom layers or custom autograd functions.
Example:
def f(x):
return x ** 2
x = torch.tensor(3.0, requires_grad=True)
y = f(x)
y.backward()
autograd_grad = x.grad.item()
eps = 1e-4
finite_diff_grad = (f(torch.tensor(3.0 + eps)) - f(torch.tensor(3.0 - eps))) / (2 * eps)
print(autograd_grad)
print(finite_diff_grad.item())

Both values should be close to $6$.
Gradient checking is slower than autograd and should not be used in normal training. Its purpose is debugging.
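PyTorch also provides torch.autograd.gradcheck, which automates this comparison against finite differences; it expects double-precision inputs with requires_grad=True. A minimal sketch:

import torch

def f(x):
    return (x ** 2).sum()

x = torch.randn(4, dtype=torch.float64, requires_grad=True)

# Compares autograd gradients of f against finite differences; returns True if they match.
print(torch.autograd.gradcheck(f, (x,)))  # True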
Common Gradient Problems
The first common problem is a missing gradient. This happens when a tensor does not require gradients, or when the computation has been detached from the graph.
x = torch.tensor(2.0)
y = x ** 2
# y.backward() fails because y does not require gradients

The fix is:
x = torch.tensor(2.0, requires_grad=True)

The second common problem is a stale gradient. This happens when gradients are not cleared between optimization steps.
loss.backward()
optimizer.step()
# next iteration
loss.backward()
optimizer.step()

The fix is:
optimizer.zero_grad()
loss.backward()
optimizer.step()

The third common problem is an in-place operation that modifies a value needed for backward computation.
x = torch.randn(3, requires_grad=True)
y = x ** 2
x.add_(1.0) # may break autograd
loss = y.sum()
loss.backward()

Autograd may fail because the old value of x was needed to compute the derivative of y = x ** 2. Avoid in-place operations on tensors that participate in gradient computation unless you know they are safe.
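One safe alternative is to replace the in-place update with an out-of-place one, which creates a new tensor and leaves the original value available for backward (a sketch of that fix):

import torch

x = torch.randn(3, requires_grad=True)
y = x ** 2
x_shifted = x + 1.0   # out-of-place: builds a new tensor, leaving x intact for backward
loss = y.sum()
loss.backward()
print(x.grad)         # equals 2 * x, as expected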
Summary
Gradient computation measures how a scalar loss changes with respect to tensors in the computational graph. In PyTorch, tensors with requires_grad=True participate in autograd. Calling backward() on a scalar loss computes gradients and stores them in leaf tensors.
Gradients have the same shape as the tensors they differentiate. PyTorch accumulates gradients by default, so training loops usually call optimizer.zero_grad() before loss.backward(). The optimizer then uses the gradients to update parameters.