
Gradient Descent

Gradient descent is the basic optimization method used to train neural networks. It updates model parameters in the direction that reduces the loss.

A model has parameters $\theta$ and a loss function $L(\theta)$. The gradient of the loss is $\nabla_\theta L(\theta)$.

The gradient points in the direction of steepest increase. To reduce the loss, gradient descent moves in the opposite direction:

$$\theta \leftarrow \theta - \eta \nabla_\theta L(\theta).$$

Here $\eta$ is the learning rate. It controls the size of each update.
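
As a concrete sketch (the numbers here are illustrative), consider the one-dimensional loss $L(\theta) = \theta^2$, whose gradient is $2\theta$. One update with $\eta = 0.1$ moves the parameter toward the minimum at zero:

```python
# One hand-computed gradient descent step on L(theta) = theta^2.
theta = 1.0
eta = 0.1

grad = 2 * theta            # dL/dtheta = 2 * theta
theta = theta - eta * grad  # move against the gradient

print(theta)  # 0.8 -- closer to the minimum at theta = 0
```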

The Optimization Problem

Training a neural network means solving an optimization problem:

$$\theta^\star = \arg\min_{\theta} L(\theta).$$

The parameter vector $\theta$ contains all trainable weights and biases in the model. The loss $L(\theta)$ measures how poorly the model performs on the training data.

For supervised learning with $N$ examples, the objective is often

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i).$$

The function $f_\theta$ is the model. The function $\ell$ is the per-example loss. Gradient descent tries to find parameters that make this average loss small.
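
As a small illustration of this objective (the one-parameter linear model and squared-error loss here are hypothetical choices, not fixed by the text), the average loss can be computed directly:

```python
import torch

def f(theta, x):
    # Hypothetical one-parameter linear model.
    return theta * x

def per_example_loss(pred, target):
    # Hypothetical squared-error per-example loss.
    return (pred - target) ** 2

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])
theta = torch.tensor(1.5)

# L(theta) = (1/N) * sum_i per_example_loss(f(theta, x_i), y_i)
L = per_example_loss(f(theta, x), y).mean()
print(L.item())
```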

The Gradient

For a scalar loss and a parameter vector

$$\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_m \end{bmatrix},$$

the gradient is

$$\nabla_\theta L = \begin{bmatrix} \frac{\partial L}{\partial \theta_1} \\ \frac{\partial L}{\partial \theta_2} \\ \vdots \\ \frac{\partial L}{\partial \theta_m} \end{bmatrix}.$$

Each component tells how the loss changes when one parameter changes slightly.

If $\frac{\partial L}{\partial \theta_j} > 0$, then increasing $\theta_j$ increases the loss locally, so gradient descent decreases $\theta_j$.

If $\frac{\partial L}{\partial \theta_j} < 0$, then increasing $\theta_j$ decreases the loss locally, so gradient descent increases $\theta_j$.
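
A quick autograd check (a minimal sketch, not part of the original example) confirms this sign behavior:

```python
import torch

theta = torch.tensor([1.0, -1.0], requires_grad=True)

# L = theta_1^2 + theta_2^2, so dL/dtheta_j = 2 * theta_j.
L = (theta ** 2).sum()
L.backward()

print(theta.grad)  # tensor([ 2., -2.])

# grad > 0 for theta_1 = 1.0, so descent decreases it;
# grad < 0 for theta_2 = -1.0, so descent increases it.
with torch.no_grad():
    theta -= 0.1 * theta.grad
print(theta)  # both components moved toward 0
```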

Learning Rate

The learning rate determines the update size:

$$\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla_\theta L.$$

A small learning rate gives small updates. Training may be stable but slow.

A large learning rate gives large updates. Training may progress quickly at first, but it can overshoot good parameter values and become unstable.

For a simple one-dimensional loss, the update is

$$\theta \leftarrow \theta - \eta \frac{dL}{d\theta}.$$

If $\eta$ is too large, the parameter may jump across the minimum repeatedly. If $\eta$ is too small, many steps are needed.
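
A small numerical sketch on $L(\theta) = \theta^2$ (the learning rates here are illustrative) shows both failure modes:

```python
# Gradient descent on L(theta) = theta^2, whose gradient is 2 * theta.
def run(eta, steps=5, theta=1.0):
    path = [theta]
    for _ in range(steps):
        theta = theta - eta * 2 * theta
        path.append(theta)
    return path

print(run(eta=0.01))  # too small: creeps slowly toward 0
print(run(eta=0.45))  # moderate: converges quickly
print(run(eta=1.10))  # too large: |theta| grows each step -- divergence
```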

Manual Gradient Descent in PyTorch

Consider a simple regression problem:

$$y = 2x + 1.$$

We will learn the slope and bias from data.

```python
import torch

torch.manual_seed(0)

# Synthetic data: y = 2x + 1 plus a little noise.
x = torch.linspace(-2, 2, 100)
y = 2 * x + 1 + 0.1 * torch.randn(100)

# Scalar parameters tracked by autograd.
w = torch.randn((), requires_grad=True)
b = torch.randn((), requires_grad=True)

lr = 0.05

for step in range(200):
    y_hat = w * x + b                 # forward pass
    loss = ((y_hat - y) ** 2).mean()  # mean squared error

    loss.backward()                   # compute w.grad and b.grad

    with torch.no_grad():
        w -= lr * w.grad              # gradient descent update
        b -= lr * b.grad

        w.grad.zero_()                # clear gradients for the next step
        b.grad.zero_()

print(w.item(), b.item())
```

The parameters w and b are scalar tensors. Since requires_grad=True, PyTorch records operations involving them. Calling loss.backward() computes gradients and stores them in w.grad and b.grad.

The update is placed inside torch.no_grad() because parameter updates are not part of the model’s differentiable forward computation.

Gradient Accumulation

PyTorch accumulates gradients by default. Calling backward() adds new gradients to the existing .grad field.
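
A minimal demonstration of this accumulation behavior:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

loss = 2 * w          # dL/dw = 2
loss.backward()
print(w.grad)         # tensor(2.)

loss = 2 * w
loss.backward()       # adds to the existing gradient: 2 + 2
print(w.grad)         # tensor(4.)
```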

This means the following pattern is wrong for ordinary training, because the gradients are never cleared:

```python
loss.backward()
optimizer.step()
```

The correct pattern is:

```python
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

For manual updates:

```python
loss.backward()

with torch.no_grad():
    w -= lr * w.grad
    b -= lr * b.grad

    w.grad.zero_()
    b.grad.zero_()
```

Gradient accumulation is sometimes useful. For example, when GPU memory cannot hold a large batch, one can accumulate gradients over several smaller batches before taking an optimizer step. But by default, gradients should be cleared every training step.
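
A sketch of that pattern, assuming model, loss_fn, optimizer, and loader as in the training loops shown later, and a hypothetical accum_steps micro-batches per update:

```python
accum_steps = 4  # hypothetical number of micro-batches per optimizer step

optimizer.zero_grad()
for i, (batch_X, batch_y) in enumerate(loader):
    loss = loss_fn(model(batch_X), batch_y)

    # Scale so the accumulated gradient matches one large batch.
    (loss / accum_steps).backward()

    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```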

Using torch.optim.SGD

PyTorch provides optimizers in torch.optim. The simplest optimizer is stochastic gradient descent.

```python
import torch
from torch import nn

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

x = torch.linspace(-2, 2, 100).unsqueeze(1)
y = 2 * x + 1 + 0.1 * torch.randn(100, 1)

for step in range(200):
    y_hat = model(x)
    loss = loss_fn(y_hat, y)

    optimizer.zero_grad()  # clear old gradients
    loss.backward()        # compute new gradients
    optimizer.step()       # apply the update

print(model.weight.item(), model.bias.item())
```

The optimizer owns the update rule. The model owns the parameters. The loss owns the scalar objective. Autograd computes the gradients connecting them.
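
For plain SGD with no momentum or weight decay, the effect of optimizer.step() can be sketched by hand (a conceptual illustration, not the actual torch.optim implementation; sgd_step is a hypothetical helper):

```python
import torch

def sgd_step(params, lr):
    # Hypothetical stand-in for optimizer.step() in the plain SGD case:
    # p <- p - lr * p.grad for every parameter with a gradient.
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p -= lr * p.grad
```

Calling sgd_step(model.parameters(), lr=0.05) after loss.backward() would have the same effect as optimizer.step() in the loop above.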

Full-Batch Gradient Descent

Full-batch gradient descent computes the loss over the entire training set before each update:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i).$$

The update is based on the exact gradient of the training objective:

$$\theta \leftarrow \theta - \eta \nabla_\theta L(\theta).$$

This can be stable for small datasets. But it is expensive for large datasets because each update requires a full pass over all examples.

In deep learning, full-batch gradient descent is uncommon for large-scale training. Minibatch methods are usually preferred.

Minibatch Gradient Descent

Minibatch gradient descent uses a subset of the training data at each step. If a minibatch has $B$ examples, the batch loss is

$$L_B(\theta) = \frac{1}{B} \sum_{i=1}^{B} \ell(f_\theta(x_i), y_i).$$

The update uses $\nabla_\theta L_B(\theta)$ as an estimate of the full gradient.

Minibatches make training more efficient because each update is cheaper. They also introduce noise into the gradient estimate. This noise can help optimization escape poor regions, but it can also make training curves fluctuate.

Typical batch sizes range from small values such as 16 or 32 to very large values such as thousands or more, depending on model size, hardware, and task.
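
As a bridge to the DataLoader-based loop in the next section, here is a minimal sketch of manual minibatch sampling with torch.randperm, reusing the regression setup from the manual example above (the batch size of 16 is an illustrative choice):

```python
import torch

torch.manual_seed(0)
x = torch.linspace(-2, 2, 100)
y = 2 * x + 1 + 0.1 * torch.randn(100)

w = torch.randn((), requires_grad=True)
b = torch.randn((), requires_grad=True)
lr, B = 0.05, 16

for step in range(200):
    idx = torch.randperm(100)[:B]           # sample a random minibatch
    xb, yb = x[idx], y[idx]

    loss = ((w * xb + b - yb) ** 2).mean()  # batch loss L_B
    loss.backward()

    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(w.item(), b.item())
```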

Training with a DataLoader

A standard PyTorch training loop uses a Dataset, a DataLoader, a model, a loss function, and an optimizer.

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

N = 1000
d = 5

# Synthetic linear regression data.
X = torch.randn(N, d)
true_w = torch.tensor([2.0, -1.0, 0.5, 3.0, -2.0])
true_b = 0.7
y = X @ true_w + true_b + 0.1 * torch.randn(N)

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = nn.Linear(d, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for epoch in range(20):
    for batch_X, batch_y in loader:
        pred = model(batch_X).squeeze(-1)  # (B, 1) -> (B,)
        loss = loss_fn(pred, batch_y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The DataLoader handles batching and shuffling. Shuffling matters because it prevents the model from seeing examples in a fixed order each epoch.

Loss Curves

During training, it is useful to record the loss value over time.

```python
losses = []

for epoch in range(20):
    for batch_X, batch_y in loader:
        pred = model(batch_X).squeeze(-1)
        loss = loss_fn(pred, batch_y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.item())  # record the scalar loss value
```

A healthy loss curve usually trends downward. It may fluctuate because minibatches are noisy.
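
Assuming matplotlib is installed, the recorded values can be plotted to inspect the curve:

```python
import matplotlib.pyplot as plt

plt.plot(losses)
plt.xlabel("step")
plt.ylabel("training loss")
plt.show()
```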

If the loss becomes nan, common causes include an excessive learning rate, exploding gradients, invalid operations such as log(0), or unstable numerical computation.
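
One common mitigation for exploding gradients is to clip the gradient norm between backward() and the optimizer step, for example with torch.nn.utils.clip_grad_norm_ (the max_norm value of 1.0 here is an illustrative choice):

```python
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```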

If the loss does not decrease, common causes include a learning rate that is too small, incorrect target format, missing optimizer.step(), frozen parameters, or a model that cannot represent the target relationship.

Local Minima and Nonconvexity

For linear regression with mean squared error, the loss surface is convex. This means any local minimum is also a global minimum.
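
Because the problem is convex, the global minimum can be computed in closed form and compared with what gradient descent found. A sketch using torch.linalg.lstsq on the data from the manual example:

```python
import torch

torch.manual_seed(0)
x = torch.linspace(-2, 2, 100)
y = 2 * x + 1 + 0.1 * torch.randn(100)

# Design matrix [x, 1] so the least-squares solution is [w, b].
A = torch.stack([x, torch.ones(100)], dim=1)
sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution
print(sol.squeeze())  # approximately [2.0, 1.0]
```

These values should closely match the w and b learned by the manual loop above.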

Neural networks usually have nonconvex loss surfaces. They may contain many local minima, saddle points, and flat regions. Gradient descent does not guarantee finding the global optimum.

In practice, large neural networks often train well despite nonconvexity. Overparameterization, normalization, residual connections, adaptive optimizers, and good initialization all help optimization.

The goal in deep learning is rarely to find the exact global minimum. The practical goal is to find parameters that generalize well to unseen data.

Summary

Gradient descent trains a model by moving parameters in the negative gradient direction. The learning rate controls the update size. PyTorch computes gradients with automatic differentiation and applies updates through optimizers such as torch.optim.SGD.

Full-batch gradient descent uses all training data for every update. Minibatch gradient descent uses a subset of the data and is the standard method in deep learning. The basic training loop is: clear gradients, compute predictions, compute loss, backpropagate, and update parameters.