# Gradient Computation

Gradient computation is the process of measuring how a scalar output changes when its input values change. In deep learning, the scalar output is usually the loss, and the inputs are usually the model parameters.

If a model has parameters \(\theta\) and loss \(L\), training needs the gradient

$$
\nabla_\theta L.
$$

This gradient tells the optimizer how to update the parameters. Since the gradient points in the direction of steepest local increase of the loss, the optimizer moves the parameters in the opposite direction: components whose increase would raise the loss are decreased, and vice versa.

### Derivatives of Scalar Functions

For a scalar function

$$
y = f(x),
$$

the derivative is

$$
\frac{dy}{dx}.
$$

It measures the local rate of change of \(y\) with respect to \(x\).

For example, if

$$
y = x^2,
$$

then

$$
\frac{dy}{dx}=2x.
$$

At \(x=3\), the derivative is \(6\). A small increase in \(x\) will increase \(y\) by roughly six times that small amount.

In PyTorch:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

y.backward()

print(x.grad)  # tensor(6.)
```

The call to `backward()` computes the derivative of `y` with respect to `x`.

### Gradients of Multivariable Functions

Most neural network functions depend on many variables. For a scalar function

$$
L = f(x_1, x_2, \ldots, x_n),
$$

the gradient is the vector of partial derivatives:

$$
\nabla_x L =
\begin{bmatrix}
\frac{\partial L}{\partial x_1} \\
\frac{\partial L}{\partial x_2} \\
\vdots \\
\frac{\partial L}{\partial x_n}
\end{bmatrix}.
$$

Each entry says how the loss changes when one input changes and the others are held fixed.

Consider

$$
L = x^2 + 3y.
$$

Then

$$
\frac{\partial L}{\partial x}=2x,
\quad
\frac{\partial L}{\partial y}=3.
$$

At \(x=2\) and \(y=4\),

$$
\nabla L =
\begin{bmatrix}
4 \\
3
\end{bmatrix}.
$$

In PyTorch:

```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

L = x ** 2 + 3 * y
L.backward()

print(x.grad)  # tensor(4.)
print(y.grad)  # tensor(3.)
```

### Gradients with Respect to Tensors

The gradient of a scalar loss with respect to a tensor has the same shape as that tensor.

If

$$
W\in\mathbb{R}^{m\times n}
$$

and \(L\) is a scalar loss, then

$$
\nabla_W L\in\mathbb{R}^{m\times n}.
$$

The entry at position \((i,j)\) is

$$
\frac{\partial L}{\partial W_{ij}}.
$$

In PyTorch:

```python
W = torch.randn(3, 4, requires_grad=True)

L = (W ** 2).sum()
L.backward()

print(W.shape)       # torch.Size([3, 4])
print(W.grad.shape)  # torch.Size([3, 4])
```

Since

$$
L = \sum_{i,j} W_{ij}^2,
$$

the gradient is

$$
\frac{\partial L}{\partial W_{ij}} = 2W_{ij}.
$$

Thus `W.grad` contains `2 * W`.
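This claim is easy to verify directly with `torch.allclose`:

```python
import torch

W = torch.randn(3, 4, requires_grad=True)

L = (W ** 2).sum()
L.backward()

# The analytic gradient of sum(W_ij^2) is 2 * W_ij, entry by entry.
assert torch.allclose(W.grad, 2 * W.detach())
```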

### Why the Loss Must Usually Be Scalar

PyTorch’s `backward()` is simplest when called on a scalar tensor.

```python
x = torch.tensor(2.0, requires_grad=True)
loss = x ** 2

loss.backward()
```

This works because `loss` contains one number.

If the output is a vector, PyTorch needs to know which scalar quantity should be differentiated. For example:

```python
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

# y.backward() would fail because y is not scalar
```

One common solution is to reduce the vector to a scalar:

```python
loss = y.sum()
loss.backward()

print(x.grad)  # tensor([2., 4., 6.])
```

Here

$$
L = \sum_i x_i^2.
$$

Another option is to provide an upstream gradient:

```python
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

y.backward(torch.tensor([1.0, 1.0, 1.0]))

print(x.grad)  # tensor([2., 4., 6.])
```

The supplied tensor is the upstream gradient: each component weights the corresponding output of `y` before the contributions are summed into the input gradient. This idea will be formalized later as vector-Jacobian products.
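With a non-uniform upstream gradient, each component of `x.grad` is the local derivative \(2x_i\) scaled by the corresponding weight. The weights below are arbitrary illustrative values:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

# Each entry of x.grad becomes upstream_i * 2 * x_i.
upstream = torch.tensor([1.0, 0.5, 0.0])
y.backward(upstream)

print(x.grad)  # tensor([2., 2., 0.])
```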

### The Chain Rule

Deep learning depends on the chain rule. Neural networks are compositions of functions. If

$$
z = f(y),
\quad
y = g(x),
$$

then

$$
\frac{dz}{dx} =
\frac{dz}{dy}
\frac{dy}{dx}.
$$

For example,

$$
z = (x+1)^2.
$$

Let

$$
y = x+1.
$$

Then

$$
z = y^2.
$$

The derivatives are

$$
\frac{dz}{dy}=2y,
\quad
\frac{dy}{dx}=1.
$$

Therefore

$$
\frac{dz}{dx}=2y=2(x+1).
$$

At \(x=3\), the derivative is \(8\).

In PyTorch:

```python
x = torch.tensor(3.0, requires_grad=True)

y = x + 1
z = y ** 2

z.backward()

print(x.grad)  # tensor(8.)
```

PyTorch records the intermediate operation `y = x + 1`, then applies the chain rule automatically during the backward pass.
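The autograd result can be checked against the hand-derived formula \(2(x+1)\):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

y = x + 1
z = y ** 2
z.backward()

# Chain rule by hand: dz/dy = 2y and dy/dx = 1, so dz/dx = 2 * (x + 1).
manual = 2 * (x.detach() + 1)
assert torch.allclose(x.grad, manual)  # both equal 8.0
```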

### Local Gradients and Upstream Gradients

Each operation in a computational graph has a local gradient. During backpropagation, this local gradient is multiplied by an upstream gradient.

Consider

$$
z = a^2,
\quad
a = x + y.
$$

The local derivative of \(z\) with respect to \(a\) is

$$
\frac{\partial z}{\partial a}=2a.
$$

The local derivatives of \(a\) are

$$
\frac{\partial a}{\partial x}=1,
\quad
\frac{\partial a}{\partial y}=1.
$$

The upstream gradient arriving at \(a\) is

$$
\frac{\partial z}{\partial a}.
$$

The gradients passed to \(x\) and \(y\) are

$$
\frac{\partial z}{\partial x} =
\frac{\partial z}{\partial a}
\frac{\partial a}{\partial x},
$$

$$
\frac{\partial z}{\partial y} =
\frac{\partial z}{\partial a}
\frac{\partial a}{\partial y}.
$$

In code:

```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

a = x + y
z = a ** 2

z.backward()

print(x.grad)  # tensor(10.)
print(y.grad)  # tensor(10.)
```

The value \(a=5\), so the upstream gradient at \(a\) is \(10\). Since both local derivatives are \(1\), both input gradients are \(10\).
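The same numbers fall out of the upstream-times-local recipe computed by hand:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

a = x + y
z = a ** 2
z.backward()

# Upstream gradient at a is dz/da = 2a = 10; the local derivatives
# of a = x + y with respect to x and y are both 1.
upstream = 2 * a.detach()
assert x.grad == upstream * 1.0
assert y.grad == upstream * 1.0
```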

### Gradients of Linear Layers

A linear layer computes

$$
Y = XW^\top + b.
$$

Let

$$
X\in\mathbb{R}^{B\times d},
\quad
W\in\mathbb{R}^{h\times d},
\quad
b\in\mathbb{R}^{h},
\quad
Y\in\mathbb{R}^{B\times h}.
$$

Suppose the loss is

$$
L = \sum_{i=1}^{B}\sum_{j=1}^{h}Y_{ij}.
$$

Then every output entry contributes equally to the loss.

The gradient with respect to the bias is

$$
\frac{\partial L}{\partial b_j}=B,
$$

since \(b_j\) is added to every one of the \(B\) rows of column \(j\). The gradient with respect to the weight is

$$
\frac{\partial L}{\partial W_{jk}} =
\sum_{i=1}^{B} X_{ik}.
$$

In PyTorch:

```python
B = 5
d = 3
h = 4

layer = torch.nn.Linear(d, h)

X = torch.randn(B, d)
Y = layer(X)

L = Y.sum()
L.backward()

print(layer.weight.grad.shape)  # torch.Size([4, 3])
print(layer.bias.grad.shape)    # torch.Size([4])
```

The shapes match the parameter shapes.
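The formulas can also be checked numerically against autograd. This sketch verifies that the bias gradient is a vector of \(B\)s and that each row of the weight gradient equals the column sums of \(X\):

```python
import torch

B, d, h = 5, 3, 4
layer = torch.nn.Linear(d, h)

X = torch.randn(B, d)
L = layer(X).sum()
L.backward()

# dL/db_j = B for every j; dL/dW_jk = sum over the batch of X_ik,
# so every row of weight.grad is the vector of column sums of X.
assert torch.allclose(layer.bias.grad, torch.full((h,), float(B)))
assert torch.allclose(layer.weight.grad, X.sum(dim=0).expand(h, d))
```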

### Gradient Accumulation

PyTorch accumulates gradients by default. This means that each call to `backward()` adds to the `.grad` field instead of replacing it.

```python
x = torch.tensor(2.0, requires_grad=True)

y = x ** 2
y.backward()

print(x.grad)  # tensor(4.)

z = 3 * x
z.backward()

print(x.grad)  # tensor(7.)
```

The second gradient is added to the first. Since

$$
\frac{d}{dx}x^2 = 2x = 4
$$

at \(x=2\), and

$$
\frac{d}{dx}3x = 3,
$$

the accumulated gradient is \(7\).

This behavior is useful for gradient accumulation over multiple microbatches. In ordinary training loops, gradients should usually be cleared before each optimization step:

```python
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Without `zero_grad()`, gradients from previous batches contaminate the current update.
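When accumulation is intentional, the pattern is to call `backward()` once per microbatch and `step()` once at the end. A minimal sketch, using a stand-in model, optimizer, and synthetic microbatches:

```python
import torch

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

# Hypothetical microbatches standing in for a real data loader.
microbatches = [(torch.randn(4, 3), torch.randn(4, 1)) for _ in range(2)]

optimizer.zero_grad()
for X_mb, y_mb in microbatches:
    loss = loss_fn(model(X_mb), y_mb)
    # Scale so the accumulated gradient averages over microbatches.
    (loss / len(microbatches)).backward()
optimizer.step()
```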

### Disabling Gradients

During evaluation, gradient computation wastes memory and computation. PyTorch provides `torch.no_grad()`:

```python
model.eval()

with torch.no_grad():
    pred = model(X)
```

Inside this block, PyTorch does not record operations for autograd.

For inference-only code, `torch.inference_mode()` is a stricter and usually faster alternative: in addition to skipping graph construction, it disables view tracking and version counting on the tensors it produces:

```python
model.eval()

with torch.inference_mode():
    pred = model(X)
```

Both are used to avoid building a computational graph when gradients are not needed.
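One observable difference from ordinary execution is that outputs produced inside these blocks are detached from autograd. A quick check, with the model and input as placeholders:

```python
import torch

model = torch.nn.Linear(3, 1)
X = torch.randn(2, 3)

model.eval()
with torch.no_grad():
    pred = model(X)

# No graph was recorded, so the output does not require gradients.
print(pred.requires_grad)  # False
```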

### Gradients and Optimizers

Gradient computation alone does not update model parameters. It only fills the `.grad` fields.

The optimizer performs the update.

For stochastic gradient descent, the update is

$$
\theta \leftarrow \theta - \eta \nabla_\theta L.
$$

Here \(\eta\) is the learning rate.

In PyTorch:

```python
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(8, 3)
y = torch.randn(8, 1)

pred = model(X)
loss = ((pred - y) ** 2).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The call sequence matters:

1. Clear old gradients.
2. Compute predictions and loss.
3. Run backpropagation.
4. Update parameters.

A common training step is therefore:

```python
pred = model(X)
loss = loss_fn(pred, y)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```
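What `optimizer.step()` does for plain SGD can be sketched by hand. This is a minimal sketch of the update \(\theta \leftarrow \theta - \eta \nabla_\theta L\), with no momentum or weight decay, and with stand-in model and data:

```python
import torch

model = torch.nn.Linear(3, 1)
lr = 0.01

X = torch.randn(8, 3)
y = torch.randn(8, 1)

loss = ((model(X) - y) ** 2).mean()
loss.backward()

# Manual equivalent of SGD's optimizer.step(): subtract the scaled
# gradient from each parameter, outside of autograd tracking.
with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad
```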

### Gradient Checking

Gradient checking compares automatic gradients with finite-difference approximations.

For a scalar function \(f(x)\), the derivative can be approximated by

$$
\frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}.
$$

This is useful when implementing custom layers or custom autograd functions.

Example:

```python
def f(x):
    return x ** 2

x = torch.tensor(3.0, requires_grad=True)
y = f(x)
y.backward()

autograd_grad = x.grad.item()

eps = 1e-4
finite_diff_grad = (f(torch.tensor(3.0 + eps)) - f(torch.tensor(3.0 - eps))) / (2 * eps)

print(autograd_grad)
print(finite_diff_grad.item())
```

Both values should be close to \(6\).

Gradient checking is slower than autograd and should not be used in normal training. Its purpose is debugging.
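PyTorch also ships a built-in checker, `torch.autograd.gradcheck`, which automates this comparison across all inputs; it expects double-precision tensors for numerical stability:

```python
import torch

def f(x):
    return (x ** 2).sum()

# gradcheck compares autograd gradients against central finite
# differences and returns True when they agree within tolerance.
x = torch.randn(3, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(f, (x,)))  # True
```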

### Common Gradient Problems

The first common problem is a missing gradient. This happens when a tensor does not require gradients, or when the computation has been detached from the graph.

```python
x = torch.tensor(2.0)
y = x ** 2

# y.backward() fails because y does not require gradients
```

The fix is:

```python
x = torch.tensor(2.0, requires_grad=True)
```

The second common problem is a stale gradient. This happens when gradients are not cleared between optimization steps.

```python
loss.backward()
optimizer.step()

# next iteration
loss.backward()
optimizer.step()
```

The fix is:

```python
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The third common problem is an in-place operation that modifies a value needed for backward computation.

```python
x = torch.randn(3, requires_grad=True)
y = x ** 2
x.add_(1.0)  # in-place update overwrites a value saved for backward
loss = y.sum()
loss.backward()
```

Autograd raises an error here because the original value of `x` was saved to compute the derivative of \(x^2\), and the in-place update overwrote it. Avoid in-place operations on tensors that participate in gradient computation unless you know they are safe.
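The usual fix is the out-of-place operation, which creates a new tensor and leaves the saved value intact. A minimal sketch:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x ** 2

# Out-of-place version of x.add_(1.0): a new tensor is created,
# so the value of x saved for the backward pass is untouched.
x_shifted = x + 1.0

loss = y.sum()
loss.backward()

assert torch.allclose(x.grad, 2 * x.detach())
```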

### Summary

Gradient computation measures how a scalar loss changes with respect to tensors in the computational graph. In PyTorch, tensors with `requires_grad=True` participate in autograd. Calling `backward()` on a scalar loss computes gradients and stores them in leaf tensors.

Gradients have the same shape as the tensors they differentiate. PyTorch accumulates gradients by default, so training loops usually call `optimizer.zero_grad()` before `loss.backward()`. The optimizer then uses the gradients to update parameters.

