Gradient Computation

Gradient computation is the process of measuring how a scalar output changes when its input values change. In deep learning, the scalar output is usually the loss, and the inputs are usually the model parameters.

If a model has parameters $\theta$ and loss $L$, training needs the gradient

$$\nabla_\theta L.$$

This gradient tells the optimizer how to update the parameters. If a parameter change increases the loss, the optimizer usually moves in the opposite direction. If a parameter change decreases the loss, the optimizer favors that direction.
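This update rule can be sketched with a single parameter. The toy quadratic loss and the parameter name `theta` below are invented for illustration; any differentiable loss would work the same way:

```python
import torch

# A single parameter and a toy quadratic loss L = (theta - 5)^2,
# which is minimized at theta = 5.
theta = torch.tensor(0.0, requires_grad=True)
lr = 0.1

for _ in range(50):
    loss = (theta - 5.0) ** 2
    loss.backward()                # fills theta.grad with dL/dtheta
    with torch.no_grad():
        theta -= lr * theta.grad   # move against the gradient
    theta.grad.zero_()             # clear for the next step

print(theta.item())  # approaches 5.0
```

Each step moves `theta` against the gradient, so the loss shrinks toward its minimum.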

Derivatives of Scalar Functions

For a scalar function

$$y = f(x),$$

the derivative is

$$\frac{dy}{dx}.$$

It measures the local rate of change of $y$ with respect to $x$.

For example, if

$$y = x^2,$$

then

$$\frac{dy}{dx} = 2x.$$

At $x=3$, the derivative is $6$. A small increase in $x$ will increase $y$ by roughly six times that small amount.

In PyTorch:

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

y.backward()

print(x.grad)  # tensor(6.)

The call to backward() computes the derivative of y with respect to x.

Gradients of Multivariable Functions

Most neural network functions depend on many variables. For a scalar function

$$L = f(x_1, x_2, \ldots, x_n),$$

the gradient is the vector of partial derivatives:

$$\nabla_x L = \begin{bmatrix} \frac{\partial L}{\partial x_1} \\ \frac{\partial L}{\partial x_2} \\ \vdots \\ \frac{\partial L}{\partial x_n} \end{bmatrix}.$$

Each entry says how the loss changes when one input changes and the others are held fixed.

Consider

$$L = x^2 + 3y.$$

Then

$$\frac{\partial L}{\partial x} = 2x, \quad \frac{\partial L}{\partial y} = 3.$$

At $x=2$ and $y=4$,

$$\nabla L = \begin{bmatrix} 4 \\ 3 \end{bmatrix}.$$

In PyTorch:

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

L = x ** 2 + 3 * y
L.backward()

print(x.grad)  # tensor(4.)
print(y.grad)  # tensor(3.)

Gradients with Respect to Tensors

A tensor gradient has the same shape as the tensor it differentiates.

If

$$W \in \mathbb{R}^{m \times n}$$

and $L$ is a scalar loss, then

$$\nabla_W L \in \mathbb{R}^{m \times n}.$$

The entry at position $(i,j)$ is

$$\frac{\partial L}{\partial W_{ij}}.$$

In PyTorch:

W = torch.randn(3, 4, requires_grad=True)

L = (W ** 2).sum()
L.backward()

print(W.shape)       # torch.Size([3, 4])
print(W.grad.shape)  # torch.Size([3, 4])

Since

$$L = \sum_{i,j} W_{ij}^2,$$

the gradient is

$$\frac{\partial L}{\partial W_{ij}} = 2W_{ij}.$$

Thus W.grad contains 2 * W.
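This can be checked directly against the analytic gradient (a quick sanity check, not part of any training loop):

```python
import torch

W = torch.randn(3, 4, requires_grad=True)
L = (W ** 2).sum()
L.backward()

# The analytic gradient of sum(W**2) is 2*W, entry by entry.
print(torch.allclose(W.grad, 2 * W))  # True
```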

Why the Loss Must Usually Be Scalar

PyTorch’s backward() is simplest when called on a scalar tensor.

x = torch.tensor(2.0, requires_grad=True)
loss = x ** 2

loss.backward()

This works because loss contains one number.

If the output is a vector, PyTorch needs to know which scalar quantity should be differentiated. For example:

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

# y.backward() would fail because y is not scalar

One common solution is to reduce the vector to a scalar:

loss = y.sum()
loss.backward()

print(x.grad)  # tensor([2., 4., 6.])

Here

$$L = \sum_i x_i^2.$$

Another option is to provide an upstream gradient:

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

y.backward(torch.tensor([1.0, 1.0, 1.0]))

print(x.grad)  # tensor([2., 4., 6.])

This tells PyTorch how to combine the vector outputs into a scalar derivative. This idea will be formalized later as vector-Jacobian products.
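With a non-uniform upstream gradient, each output component is weighted differently before being combined; the weights below are arbitrary, chosen only to make the effect visible:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

# Differentiate the weighted sum 0.1*y[0] + 1.0*y[1] + 10.0*y[2].
y.backward(torch.tensor([0.1, 1.0, 10.0]))

# Each entry is weight_i * 2 * x_i.
print(x.grad)  # tensor([ 0.2000,  4.0000, 60.0000])
```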

The Chain Rule

Deep learning depends on the chain rule. Neural networks are compositions of functions. If

$$z = f(y), \quad y = g(x),$$

then

$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}.$$

For example,

$$z = (x+1)^2.$$

Let

$$y = x + 1.$$

Then

$$z = y^2.$$

The derivatives are

$$\frac{dz}{dy} = 2y, \quad \frac{dy}{dx} = 1.$$

Therefore

$$\frac{dz}{dx} = 2y = 2(x+1).$$

At $x=3$, the derivative is $8$.

In PyTorch:

x = torch.tensor(3.0, requires_grad=True)

y = x + 1
z = y ** 2

z.backward()

print(x.grad)  # tensor(8.)

PyTorch records the intermediate operation y = x + 1, then applies the chain rule automatically during the backward pass.
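The intermediate factor $dz/dy$ can be inspected directly with retain_grad(), which asks autograd to keep the gradient of a non-leaf tensor:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

y = x + 1
y.retain_grad()  # keep the gradient of the intermediate tensor
z = y ** 2

z.backward()

print(y.grad)  # tensor(8.)  dz/dy = 2y = 8
print(x.grad)  # tensor(8.)  dz/dx = dz/dy * dy/dx = 8 * 1
```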

Local Gradients and Upstream Gradients

Each operation in a computational graph has a local gradient. During backpropagation, this local gradient is multiplied by an upstream gradient.

Consider

$$z = a^2, \quad a = x + y.$$

The local derivative of $z$ with respect to $a$ is

$$\frac{\partial z}{\partial a} = 2a.$$

The local derivatives of $a$ are

$$\frac{\partial a}{\partial x} = 1, \quad \frac{\partial a}{\partial y} = 1.$$

The upstream gradient arriving at $a$ is

$$\frac{\partial z}{\partial a}.$$

The gradients passed to $x$ and $y$ are

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial a}\,\frac{\partial a}{\partial x}, \quad \frac{\partial z}{\partial y} = \frac{\partial z}{\partial a}\,\frac{\partial a}{\partial y}.$$

In code:

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

a = x + y
z = a ** 2

z.backward()

print(x.grad)  # tensor(10.)
print(y.grad)  # tensor(10.)

The value $a = 5$, so the upstream gradient at $a$ is $10$. Since both local derivatives are $1$, both input gradients are $10$.

Gradients of Linear Layers

A linear layer computes

$$Y = XW^\top + b.$$

Let

$$X \in \mathbb{R}^{B \times d}, \quad W \in \mathbb{R}^{h \times d}, \quad b \in \mathbb{R}^{h}, \quad Y \in \mathbb{R}^{B \times h}.$$

Suppose the loss is

$$L = \sum_{b=1}^{B} \sum_{j=1}^{h} Y_{bj}.$$

Then every output entry contributes equally to the loss.

The gradient with respect to the bias is

$$\frac{\partial L}{\partial b_j} = B.$$

The gradient with respect to the weight is

$$\frac{\partial L}{\partial W_{jk}} = \sum_{b=1}^{B} X_{bk}.$$

In PyTorch:

B = 5
d = 3
h = 4

layer = torch.nn.Linear(d, h)

X = torch.randn(B, d)
Y = layer(X)

L = Y.sum()
L.backward()

print(layer.weight.grad.shape)  # torch.Size([4, 3])
print(layer.bias.grad.shape)    # torch.Size([4])

The shapes match the parameter shapes.
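The values match the formulas as well: with this loss, every row of the weight gradient equals the column sums of X, and every bias gradient entry equals B. A sanity check on the same setup:

```python
import torch

B, d, h = 5, 3, 4
layer = torch.nn.Linear(d, h)

X = torch.randn(B, d)
L = layer(X).sum()
L.backward()

# dL/dW_{jk} = sum_b X_{bk}: every row of weight.grad is X.sum(dim=0).
expected_w = X.sum(dim=0).expand(h, d)
print(torch.allclose(layer.weight.grad, expected_w))  # True

# dL/db_j = B for every j.
print(torch.allclose(layer.bias.grad, torch.full((h,), float(B))))  # True
```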

Gradient Accumulation

PyTorch accumulates gradients by default. This means that each call to backward() adds to the .grad field instead of replacing it.

x = torch.tensor(2.0, requires_grad=True)

y = x ** 2
y.backward()

print(x.grad)  # tensor(4.)

z = 3 * x
z.backward()

print(x.grad)  # tensor(7.)

The second gradient is added to the first. Since

$$\frac{d}{dx} x^2 = 2x = 4$$

at $x=2$, and

$$\frac{d}{dx} 3x = 3,$$

the accumulated gradient is $7$.

This behavior is useful for gradient accumulation over multiple microbatches. In ordinary training loops, gradients should usually be cleared before each optimization step:

optimizer.zero_grad()
loss.backward()
optimizer.step()

Without zero_grad(), gradients from previous batches contaminate the current update.
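Deliberate accumulation over microbatches uses the same mechanism on purpose. A sketch, where the split into four microbatches and the scaling of each loss are illustrative choices:

```python
import torch

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

# Pretend one large batch has been split into 4 microbatches.
micro_batches = [(torch.randn(2, 3), torch.randn(2, 1)) for _ in range(4)]

optimizer.zero_grad()
for X, y in micro_batches:
    # Scale each loss so the accumulated gradient averages the microbatches.
    loss = loss_fn(model(X), y) / len(micro_batches)
    loss.backward()   # adds into .grad
optimizer.step()      # one update using the accumulated gradients
```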

Disabling Gradients

During evaluation, gradient computation wastes memory and computation. PyTorch provides torch.no_grad():

model.eval()

with torch.no_grad():
    pred = model(X)

Inside this block, PyTorch does not record operations for autograd.

For inference-only code, torch.inference_mode() is often stronger:

model.eval()

with torch.inference_mode():
    pred = model(X)

Both are used to avoid building a computational graph when gradients are not needed.
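The effect is visible on the output tensor, which no longer tracks gradients. A minimal check, using a small model defined just for this purpose:

```python
import torch

model = torch.nn.Linear(3, 1)
X = torch.randn(4, 3)

with torch.no_grad():
    pred = model(X)

# The output is detached from autograd: no grad_fn, no gradient tracking.
print(pred.requires_grad)  # False
print(pred.grad_fn)        # None
```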

Gradients and Optimizers

Gradient computation alone does not update model parameters. It only fills the .grad fields.

The optimizer performs the update.

For stochastic gradient descent, the update is

$$\theta \leftarrow \theta - \eta \nabla_\theta L.$$

Here $\eta$ is the learning rate.

In PyTorch:

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(8, 3)
y = torch.randn(8, 1)

pred = model(X)
loss = ((pred - y) ** 2).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()

The call sequence matters:

  1. Clear old gradients.
  2. Compute predictions and loss.
  3. Run backpropagation.
  4. Update parameters.

A common training step is therefore:

pred = model(X)
loss = loss_fn(pred, y)

optimizer.zero_grad()
loss.backward()
optimizer.step()
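The update that optimizer.step() performs can also be written out by hand, which makes the formula concrete. A sketch, using torch.no_grad() so the update itself is not recorded by autograd:

```python
import torch

model = torch.nn.Linear(3, 1)
lr = 0.01

X = torch.randn(8, 3)
y = torch.randn(8, 1)

loss = ((model(X) - y) ** 2).mean()
loss.backward()

# theta <- theta - lr * grad, applied to every parameter.
with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad
        p.grad.zero_()
```

This loop is exactly what torch.optim.SGD with no momentum does for each parameter.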

Gradient Checking

Gradient checking compares automatic gradients with finite-difference approximations.

For a scalar function $f(x)$, the derivative can be approximated by

$$\frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}.$$

This is useful when implementing custom layers or custom autograd functions.

Example:

def f(x):
    return x ** 2

x = torch.tensor(3.0, requires_grad=True)
y = f(x)
y.backward()

autograd_grad = x.grad.item()

eps = 1e-4
finite_diff_grad = (f(torch.tensor(3.0 + eps)) - f(torch.tensor(3.0 - eps))) / (2 * eps)

print(autograd_grad)
print(finite_diff_grad.item())

Both values should be close to $6$.

Gradient checking is slower than autograd and should not be used in normal training. Its purpose is debugging.
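PyTorch also ships a built-in checker, torch.autograd.gradcheck, which runs the same finite-difference comparison automatically; it expects double-precision inputs:

```python
import torch

def f(x):
    return (x ** 2).sum()

# gradcheck compares autograd gradients against numerical estimates
# and returns True when they agree within tolerance.
x = torch.randn(3, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(f, (x,)))  # True
```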

Common Gradient Problems

The first common problem is a missing gradient. This happens when a tensor does not require gradients, or when the computation has been detached from the graph.

x = torch.tensor(2.0)
y = x ** 2

# y.backward() fails because y does not require gradients

The fix is:

x = torch.tensor(2.0, requires_grad=True)

The second common problem is a stale gradient. This happens when gradients are not cleared between optimization steps.

loss.backward()
optimizer.step()

# next iteration
loss.backward()
optimizer.step()

The fix is:

optimizer.zero_grad()
loss.backward()
optimizer.step()

The third common problem is an in-place operation that modifies a value needed for backward computation.

a = torch.randn(3, requires_grad=True)
x = a * 2        # non-leaf tensor
y = x ** 2       # the backward of ** saves x
x.add_(1.0)      # in-place change invalidates the saved value
loss = y.sum()
loss.backward()  # RuntimeError

Autograd fails because the original value of $x$ is needed to compute the derivative of $x^2$, and the in-place add_ has overwritten it. (Calling add_ directly on a leaf tensor that requires gradients fails even earlier, at the in-place operation itself.) Avoid in-place operations on tensors that participate in gradient computation unless you know they are safe.

Summary

Gradient computation measures how a scalar loss changes with respect to tensors in the computational graph. In PyTorch, tensors with requires_grad=True participate in autograd. Calling backward() on a scalar loss computes gradients and stores them in leaf tensors.

Gradients have the same shape as the tensors they differentiate. PyTorch accumulates gradients by default, so training loops usually call optimizer.zero_grad() before loss.backward(). The optimizer then uses the gradients to update parameters.