Gradient Computation

Gradient computation is the process of measuring how a scalar output changes when its input values change. In deep learning, the scalar output is usually the loss, and the inputs are usually the model parameters.

If a model has parameters $\theta$ and loss $L$, training needs the gradient

$$\nabla_\theta L.$$

This gradient tells the optimizer how to update the parameters. If a parameter change increases the loss, the optimizer usually moves in the opposite direction. If a parameter change decreases the loss, the optimizer favors that direction.
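This update rule can be sketched with a single parameter. The toy quadratic loss and the parameter name `theta` below are invented for illustration; any differentiable loss would work the same way:

```python
import torch

# A single parameter and a toy quadratic loss L = (theta - 5)^2,
# which is minimized at theta = 5.
theta = torch.tensor(0.0, requires_grad=True)
lr = 0.1

for _ in range(50):
    loss = (theta - 5.0) ** 2
    loss.backward()                # fills theta.grad with dL/dtheta
    with torch.no_grad():
        theta -= lr * theta.grad   # move against the gradient
    theta.grad.zero_()             # clear for the next step

print(theta.item())  # approaches 5.0
```

Each step moves `theta` against the gradient, so the loss shrinks toward its minimum.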

Derivatives of Scalar Functions

For a scalar function

$$y = f(x),$$

the derivative is

$$\frac{dy}{dx}.$$

It measures the local rate of change of $y$ with respect to $x$.

For example, if

$$y = x^2,$$

then

$$\frac{dy}{dx} = 2x.$$

At $x=3$, the derivative is $6$. A small increase in $x$ will increase $y$ by roughly six times that small amount.

In PyTorch:

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

y.backward()

print(x.grad)  # tensor(6.)

The call to backward() computes the derivative of y with respect to x.

Gradients of Multivariable Functions

Most neural network functions depend on many variables. For a scalar function

$$L = f(x_1, x_2, \ldots, x_n),$$

the gradient is the vector of partial derivatives:

$$\nabla_x L = \begin{bmatrix} \frac{\partial L}{\partial x_1} \\ \frac{\partial L}{\partial x_2} \\ \vdots \\ \frac{\partial L}{\partial x_n} \end{bmatrix}.$$

Each entry says how the loss changes when one input changes and the others are held fixed.

Consider

$$L = x^2 + 3y.$$

Then

$$\frac{\partial L}{\partial x} = 2x, \quad \frac{\partial L}{\partial y} = 3.$$

At $x=2$ and $y=4$,

$$\nabla L = \begin{bmatrix} 4 \\ 3 \end{bmatrix}.$$

In PyTorch:

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

L = x ** 2 + 3 * y
L.backward()

print(x.grad)  # tensor(4.)
print(y.grad)  # tensor(3.)

Gradients with Respect to Tensors

A tensor gradient has the same shape as the tensor it differentiates.

If

$$W \in \mathbb{R}^{m \times n}$$

and $L$ is a scalar loss, then

$$\nabla_W L \in \mathbb{R}^{m \times n}.$$

The entry at position $(i,j)$ is

$$\frac{\partial L}{\partial W_{ij}}.$$

In PyTorch:

W = torch.randn(3, 4, requires_grad=True)

L = (W ** 2).sum()
L.backward()

print(W.shape)       # torch.Size([3, 4])
print(W.grad.shape)  # torch.Size([3, 4])

Since

$$L = \sum_{i,j} W_{ij}^2,$$

the gradient is

$$\frac{\partial L}{\partial W_{ij}} = 2W_{ij}.$$

Thus W.grad contains 2 * W.
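This can be checked directly against the analytic gradient (a quick sanity check, not part of any training loop):

```python
import torch

W = torch.randn(3, 4, requires_grad=True)
L = (W ** 2).sum()
L.backward()

# The analytic gradient of sum(W**2) is 2*W, entry by entry.
print(torch.allclose(W.grad, 2 * W))  # True
```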

Why the Loss Must Usually Be Scalar

PyTorch’s backward() is simplest when called on a scalar tensor.

x = torch.tensor(2.0, requires_grad=True)
loss = x ** 2

loss.backward()

This works because loss contains one number.

If the output is a vector, PyTorch needs to know which scalar quantity should be differentiated. For example:

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

# y.backward() would fail because y is not scalar

One common solution is to reduce the vector to a scalar:

loss = y.sum()
loss.backward()

print(x.grad)  # tensor([2., 4., 6.])

Here

$$L = \sum_i x_i^2.$$

Another option is to provide an upstream gradient:

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

y.backward(torch.tensor([1.0, 1.0, 1.0]))

print(x.grad)  # tensor([2., 4., 6.])

This tells PyTorch how to combine the vector outputs into a scalar derivative. This idea will be formalized later as vector-Jacobian products.
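With a non-uniform upstream gradient, each output component is weighted differently before being combined; the weights below are arbitrary, chosen only to make the effect visible:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

# Differentiate the weighted sum 0.1*y[0] + 1.0*y[1] + 10.0*y[2].
y.backward(torch.tensor([0.1, 1.0, 10.0]))

# Each entry is weight_i * 2 * x_i.
print(x.grad)  # tensor([ 0.2000,  4.0000, 60.0000])
```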

The Chain Rule

Deep learning depends on the chain rule. Neural networks are compositions of functions. If

$$z = f(y), \quad y = g(x),$$

then

$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}.$$

For example,

$$z = (x+1)^2.$$

Let

$$y = x + 1.$$

Then

$$z = y^2.$$

The derivatives are

$$\frac{dz}{dy} = 2y, \quad \frac{dy}{dx} = 1.$$

Therefore

$$\frac{dz}{dx} = 2y = 2(x+1).$$

At $x=3$, the derivative is $8$.

In PyTorch:

x = torch.tensor(3.0, requires_grad=True)

y = x + 1
z = y ** 2

z.backward()

print(x.grad)  # tensor(8.)

PyTorch records the intermediate operation y = x + 1, then applies the chain rule automatically during the backward pass.
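The intermediate factor $dz/dy$ can be inspected directly with retain_grad(), which asks autograd to keep the gradient of a non-leaf tensor:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

y = x + 1
y.retain_grad()  # keep the gradient of the intermediate tensor
z = y ** 2

z.backward()

print(y.grad)  # tensor(8.)  dz/dy = 2y = 8
print(x.grad)  # tensor(8.)  dz/dx = dz/dy * dy/dx = 8 * 1
```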

Local Gradients and Upstream Gradients

Each operation in a computational graph has a local gradient. During backpropagation, this local gradient is multiplied by an upstream gradient.

Consider

$$z = a^2, \quad a = x + y.$$

The local derivative of $z$ with respect to $a$ is

$$\frac{\partial z}{\partial a} = 2a.$$

The local derivatives of $a$ are

$$\frac{\partial a}{\partial x} = 1, \quad \frac{\partial a}{\partial y} = 1.$$

The upstream gradient arriving at $a$ is

$$\frac{\partial z}{\partial a}.$$

The gradients passed to $x$ and $y$ are

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial a}\,\frac{\partial a}{\partial x}, \quad \frac{\partial z}{\partial y} = \frac{\partial z}{\partial a}\,\frac{\partial a}{\partial y}.$$

In code:

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

a = x + y
z = a ** 2

z.backward()

print(x.grad)  # tensor(10.)
print(y.grad)  # tensor(10.)

The value $a = 5$, so the upstream gradient at $a$ is $10$. Since both local derivatives are $1$, both input gradients are $10$.

Gradients of Linear Layers

A linear layer computes

$$Y = XW^\top + b.$$

Let

$$X \in \mathbb{R}^{B \times d}, \quad W \in \mathbb{R}^{h \times d}, \quad b \in \mathbb{R}^{h}, \quad Y \in \mathbb{R}^{B \times h}.$$

Suppose the loss is

$$L = \sum_{b=1}^{B} \sum_{j=1}^{h} Y_{bj}.$$

Then every output entry contributes equally to the loss.

The gradient with respect to the bias is

$$\frac{\partial L}{\partial b_j} = B.$$

The gradient with respect to the weight is

$$\frac{\partial L}{\partial W_{jk}} = \sum_{b=1}^{B} X_{bk}.$$

In PyTorch:

B = 5
d = 3
h = 4

layer = torch.nn.Linear(d, h)

X = torch.randn(B, d)
Y = layer(X)

L = Y.sum()
L.backward()

print(layer.weight.grad.shape)  # torch.Size([4, 3])
print(layer.bias.grad.shape)    # torch.Size([4])

The shapes match the parameter shapes.
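The values match the formulas as well: with this loss, every row of the weight gradient equals the column sums of X, and every bias gradient entry equals B. A sanity check on the same setup:

```python
import torch

B, d, h = 5, 3, 4
layer = torch.nn.Linear(d, h)

X = torch.randn(B, d)
L = layer(X).sum()
L.backward()

# dL/dW_{jk} = sum_b X_{bk}: every row of weight.grad is X.sum(dim=0).
expected_w = X.sum(dim=0).expand(h, d)
print(torch.allclose(layer.weight.grad, expected_w))  # True

# dL/db_j = B for every j.
print(torch.allclose(layer.bias.grad, torch.full((h,), float(B))))  # True
```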

Gradient Accumulation

PyTorch accumulates gradients by default. This means that each call to backward() adds to the .grad field instead of replacing it.

x = torch.tensor(2.0, requires_grad=True)

y = x ** 2
y.backward()

print(x.grad)  # tensor(4.)

z = 3 * x
z.backward()

print(x.grad)  # tensor(7.)

The second gradient is added to the first. Since

$$\frac{d}{dx} x^2 = 2x = 4$$

at $x=2$, and

$$\frac{d}{dx} 3x = 3,$$

the accumulated gradient is $7$.

This behavior is useful for gradient accumulation over multiple microbatches. In ordinary training loops, gradients should usually be cleared before each optimization step:

optimizer.zero_grad()
loss.backward()
optimizer.step()

Without zero_grad(), gradients from previous batches contaminate the current update.
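Deliberate accumulation over microbatches uses the same mechanism on purpose. A sketch, where the split into four microbatches and the scaling of each loss are illustrative choices:

```python
import torch

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

# Pretend one large batch has been split into 4 microbatches.
micro_batches = [(torch.randn(2, 3), torch.randn(2, 1)) for _ in range(4)]

optimizer.zero_grad()
for X, y in micro_batches:
    # Scale each loss so the accumulated gradient averages the microbatches.
    loss = loss_fn(model(X), y) / len(micro_batches)
    loss.backward()   # adds into .grad
optimizer.step()      # one update using the accumulated gradients
```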

Disabling Gradients

During evaluation, gradient computation wastes memory and computation. PyTorch provides torch.no_grad():

model.eval()

with torch.no_grad():
    pred = model(X)

Inside this block, PyTorch does not record operations for autograd.

For inference-only code, torch.inference_mode() is often stronger:

model.eval()

with torch.inference_mode():
    pred = model(X)

Both are used to avoid building a computational graph when gradients are not needed.
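The effect is visible on the output tensor, which no longer tracks gradients. A minimal check, using a small model defined just for this purpose:

```python
import torch

model = torch.nn.Linear(3, 1)
X = torch.randn(4, 3)

with torch.no_grad():
    pred = model(X)

# The output is detached from autograd: no grad_fn, no gradient tracking.
print(pred.requires_grad)  # False
print(pred.grad_fn)        # None
```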

Gradients and Optimizers

Gradient computation alone does not update model parameters. It only fills the .grad fields.

The optimizer performs the update.

For stochastic gradient descent, the update is

$$\theta \leftarrow \theta - \eta \nabla_\theta L.$$

Here $\eta$ is the learning rate.

In PyTorch:

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(8, 3)
y = torch.randn(8, 1)

pred = model(X)
loss = ((pred - y) ** 2).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()

The call sequence matters:

  1. Clear old gradients.
  2. Compute predictions and loss.
  3. Run backpropagation.
  4. Update parameters.

A common training step is therefore:

pred = model(X)
loss = loss_fn(pred, y)

optimizer.zero_grad()
loss.backward()
optimizer.step()
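The update that optimizer.step() performs can also be written out by hand, which makes the formula concrete. A sketch, using torch.no_grad() so the update itself is not recorded by autograd:

```python
import torch

model = torch.nn.Linear(3, 1)
lr = 0.01

X = torch.randn(8, 3)
y = torch.randn(8, 1)

loss = ((model(X) - y) ** 2).mean()
loss.backward()

# theta <- theta - lr * grad, applied to every parameter.
with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad
        p.grad.zero_()
```

This loop is exactly what torch.optim.SGD with no momentum does for each parameter.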

Gradient Checking

Gradient checking compares automatic gradients with finite-difference approximations.

For a scalar function $f(x)$, the derivative can be approximated by

$$\frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}.$$

This is useful when implementing custom layers or custom autograd functions.

Example:

def f(x):
    return x ** 2

x = torch.tensor(3.0, requires_grad=True)
y = f(x)
y.backward()

autograd_grad = x.grad.item()

eps = 1e-4
finite_diff_grad = (f(torch.tensor(3.0 + eps)) - f(torch.tensor(3.0 - eps))) / (2 * eps)

print(autograd_grad)
print(finite_diff_grad.item())

Both values should be close to $6$.

Gradient checking is slower than autograd and should not be used in normal training. Its purpose is debugging.
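PyTorch also ships a built-in checker, torch.autograd.gradcheck, which runs the same finite-difference comparison automatically; it expects double-precision inputs:

```python
import torch

def f(x):
    return (x ** 2).sum()

# gradcheck compares autograd gradients against numerical estimates
# and returns True when they agree within tolerance.
x = torch.randn(3, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(f, (x,)))  # True
```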

Common Gradient Problems

The first common problem is a missing gradient. This happens when a tensor does not require gradients, or when the computation has been detached from the graph.

x = torch.tensor(2.0)
y = x ** 2

# y.backward() fails because y does not require gradients

The fix is:

x = torch.tensor(2.0, requires_grad=True)

The second common problem is a stale gradient. This happens when gradients are not cleared between optimization steps.

loss.backward()
optimizer.step()

# next iteration
loss.backward()
optimizer.step()

The fix is:

optimizer.zero_grad()
loss.backward()
optimizer.step()

The third common problem is an in-place operation that modifies a value needed for backward computation.

a = torch.randn(3, requires_grad=True)
x = a * 2        # non-leaf tensor
y = x ** 2       # the backward of ** saves x
x.add_(1.0)      # in-place change invalidates the saved value
loss = y.sum()
loss.backward()  # RuntimeError

Autograd fails because the original value of $x$ is needed to compute the derivative of $x^2$, and the in-place add_ has overwritten it. (Calling add_ directly on a leaf tensor that requires gradients fails even earlier, at the in-place operation itself.) Avoid in-place operations on tensors that participate in gradient computation unless you know they are safe.

Summary

Gradient computation measures how a scalar loss changes with respect to tensors in the computational graph. In PyTorch, tensors with requires_grad=True participate in autograd. Calling backward() on a scalar loss computes gradients and stores them in leaf tensors.

Gradients have the same shape as the tensors they differentiate. PyTorch accumulates gradients by default, so training loops usually call optimizer.zero_grad() before loss.backward(). The optimizer then uses the gradients to update parameters.