# Stochastic Gradient Descent

Stochastic gradient descent, usually abbreviated as SGD, is the standard form of gradient-based training used in deep learning. It updates parameters using a small random subset of the training data instead of the full dataset.

The full training objective is

$$
L(\theta) =
\frac{1}{N}
\sum_{i=1}^{N}
\ell(f_\theta(x_i), y_i).
$$

Full-batch gradient descent computes

$$
\nabla_\theta L(\theta)
$$

using all \(N\) examples. SGD estimates this gradient using one example or a minibatch.

For a minibatch \(\mathcal{B}\), the minibatch loss is

$$
L_\mathcal{B}(\theta) =
\frac{1}{|\mathcal{B}|}
\sum_{i\in\mathcal{B}}
\ell(f_\theta(x_i), y_i).
$$

The update is

$$
\theta
\leftarrow
\theta -
\eta
\nabla_\theta L_\mathcal{B}(\theta).
$$

The minibatch gradient is noisy, but it is much cheaper to compute.

### Why Use Stochastic Gradients

Modern datasets are often too large for full-batch optimization. Even when the dataset fits in memory, computing the full gradient before every update is usually inefficient.

SGD solves this by making frequent approximate updates. Each minibatch gives a rough estimate of the direction that reduces the full training loss.

This has three practical advantages.

First, each update is cheaper. A minibatch of 32 or 256 examples is much faster than a full pass over millions of examples.

Second, SGD can start improving the model immediately. It does not wait for a full dataset pass before the first update.

Third, gradient noise can help training. The noise may prevent the optimizer from following a narrow path too rigidly and may help it move through flat or saddle-like regions.

### Single-Example SGD and Minibatch SGD

In the strictest sense, SGD uses one training example per update:

$$
\theta
\leftarrow
\theta -
\eta
\nabla_\theta
\ell(f_\theta(x_i), y_i).
$$

In deep learning practice, the term SGD usually includes minibatch SGD. A minibatch contains several examples:

$$
\mathcal{B} = \{i_1,i_2,\ldots,i_B\}.
$$

The update uses the average loss over the minibatch:

$$
L_\mathcal{B}(\theta) =
\frac{1}{B}
\sum_{j=1}^{B}
\ell(f_\theta(x_{i_j}), y_{i_j}).
$$

Minibatches are preferred because they use hardware well: GPUs and other accelerators process a few large matrix operations far faster than many tiny ones.
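
To make the averaging concrete, here is a minimal sketch of one manual minibatch update for a small linear model (all names and sizes here are illustrative):

```python
import torch

torch.manual_seed(0)

# Toy data and a linear model with explicit parameters.
X = torch.randn(1000, 10)
y = torch.randn(1000)
w = torch.zeros(10, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
eta, B = 0.05, 32

# Sample a random minibatch and average the loss over it.
idx = torch.randint(0, X.size(0), (B,))
pred = X[idx] @ w + b
loss = ((pred - y[idx]) ** 2).mean()

loss.backward()
with torch.no_grad():
    w -= eta * w.grad  # theta <- theta - eta * grad of the minibatch loss
    b -= eta * b.grad
    w.grad.zero_()
    b.grad.zero_()
```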

### Epochs, Steps, and Batches

An epoch is one pass through the training dataset.

A step is one parameter update.

A batch is the subset of examples used in one step.

If the dataset has \(N\) examples and the batch size is \(B\), then the number of steps per epoch is approximately

$$
\left\lceil \frac{N}{B} \right\rceil.
$$

For example, if \(N=50{,}000\) and \(B=100\), then each epoch has 500 steps.

These terms are distinct. Training for 10 epochs with batch size 100 on 50,000 examples therefore means 5,000 optimizer updates.
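
The same bookkeeping in code (a trivial sketch):

```python
import math

N = 50_000  # dataset size
B = 100     # batch size

steps_per_epoch = math.ceil(N / B)   # 500
total_steps = 10 * steps_per_epoch   # 5,000 updates over 10 epochs
print(steps_per_epoch, total_steps)
```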

### Shuffling and Sampling

SGD depends on sampling. If minibatches are always constructed in the same fixed order, the optimizer may see biased or correlated updates.

For this reason, training data is usually shuffled at the start of each epoch.

In PyTorch:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
)
```

Shuffling makes each minibatch a better approximation to a random sample from the dataset.

For very large datasets, exact shuffling may be expensive. In streaming systems, approximate shuffling with buffers is often used. The goal is the same: reduce harmful ordering effects.
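
As a rough illustration of the buffer idea (not any particular library's API), a shuffle buffer keeps a fixed number of incoming examples and emits a random one each time a new example arrives:

```python
import random

def shuffle_buffer(stream, buffer_size=1000, seed=0):
    # Approximately shuffle an iterable using a fixed-size buffer.
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Move a randomly chosen element to the end, then emit it.
            j = rng.randrange(len(buffer))
            buffer[j], buffer[-1] = buffer[-1], buffer[j]
            yield buffer.pop()
    rng.shuffle(buffer)  # flush whatever remains, in random order
    yield from buffer

# Items arrive in order but come out locally shuffled.
print(list(shuffle_buffer(range(10), buffer_size=4)))
```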

### SGD in PyTorch

PyTorch provides SGD through `torch.optim.SGD`.

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Synthetic linear-regression data: y = X w + b + noise.
N = 1000
d = 10

X = torch.randn(N, d)
true_w = torch.randn(d)
true_b = 0.5
y = X @ true_w + true_b + 0.1 * torch.randn(N)

dataset = TensorDataset(X, y)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = nn.Linear(d, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for epoch in range(20):
    for batch_X, batch_y in loader:
        pred = model(batch_X).squeeze(-1)  # (batch, 1) -> (batch,) to match batch_y
        loss = loss_fn(pred, batch_y)

        optimizer.zero_grad()  # clear gradients from the previous step
        loss.backward()        # compute minibatch gradients
        optimizer.step()       # apply the SGD update
```

This is minibatch SGD. Each iteration uses only one batch, not the full dataset.

The important point is that the optimizer does not know whether the loss came from the full dataset or a minibatch. It simply updates parameters using the gradients currently stored in each parameter’s `.grad` field.
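
To make that concrete, here is roughly what a plain SGD step does, written by hand for the model above (a sketch, not the actual `torch.optim` implementation):

```python
lr = 0.05

# Equivalent of optimizer.step() for plain SGD, run after loss.backward().
# `model` is the network from the example above.
with torch.no_grad():
    for param in model.parameters():
        if param.grad is not None:
            param -= lr * param.grad  # theta <- theta - eta * grad
```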

### Learning Rate in SGD

The learning rate is especially important in SGD because gradients are noisy.

If the learning rate is too large, the loss may oscillate or diverge. If the learning rate is too small, training may be stable but slow.

A fixed learning rate can work for simple problems. For larger neural networks, the learning rate is usually scheduled. Common schedules include step decay, cosine decay, exponential decay, and warmup followed by decay.

A simple step schedule in PyTorch:

```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=10,  # decay every 10 epochs
    gamma=0.1,     # multiply the learning rate by 0.1
)

for epoch in range(30):
    for batch_X, batch_y in loader:
        pred = model(batch_X).squeeze(-1)
        loss = loss_fn(pred, batch_y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    scheduler.step()  # advance the schedule once per epoch
```

After every 10 epochs, the learning rate is multiplied by 0.1.
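
Cosine decay follows the same pattern. For example, `CosineAnnealingLR` smoothly anneals the learning rate over a fixed number of epochs and is stepped once per epoch, exactly as in the loop above:

```python
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=30,      # number of epochs over which to anneal
    eta_min=0.0,   # learning rate at the end of the schedule
)
```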

### Batch Size and Gradient Noise

Batch size controls the amount of noise in the gradient estimate.

Small batches produce noisy gradients. They are cheap per step and may generalize well, but training can fluctuate.

Large batches produce smoother gradients. They use hardware efficiently and may train faster in wall-clock time, but they require more memory and often need careful learning-rate tuning.

If the minibatch gradient is

$$
g_\mathcal{B} =
\nabla_\theta L_\mathcal{B}(\theta),
$$

then it is an estimate of the full gradient

$$
g =
\nabla_\theta L(\theta).
$$

As batch size increases, the estimate usually becomes less noisy; if the examples in a minibatch are sampled independently, the variance of the estimate shrinks roughly in proportion to \(1/B\).

However, after a certain point, increasing batch size gives diminishing returns. Doubling the batch size may not double the useful information in the update.
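
One way to see the effect of batch size is to measure how much minibatch gradients vary at a fixed parameter setting. A rough sketch, reusing `X`, `y`, `model`, and `loss_fn` from the earlier linear-regression example:

```python
def grad_for_batch(batch_size, seed):
    # Gradient of the linear layer's weight for one random minibatch.
    g = torch.Generator().manual_seed(seed)
    idx = torch.randint(0, X.size(0), (batch_size,), generator=g)

    model.zero_grad()
    pred = model(X[idx]).squeeze(-1)
    loss_fn(pred, y[idx]).backward()
    return model.weight.grad.clone()

for B in (1, 8, 64, 512):
    grads = torch.stack([grad_for_batch(B, s) for s in range(100)])
    # The spread of the gradient estimates shrinks as B grows.
    print(B, grads.std(dim=0).mean().item())
```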

### SGD with Momentum

Plain SGD updates parameters directly from the current gradient. Momentum adds a velocity term that accumulates past gradients.

The update is commonly written as

$$
v
\leftarrow
\mu v + g,
$$

$$
\theta
\leftarrow
\theta - \eta v.
$$

Here \(g\) is the current gradient, \(v\) is the velocity, and \(\mu\) is the momentum coefficient.

Momentum helps in two ways. It smooths noisy gradients, and it accelerates progress in directions where gradients consistently point the same way.

In PyTorch:

```python
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.05,
    momentum=0.9,
)
```
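
The same rule written out by hand for a single parameter tensor, as a self-contained sketch of the update above:

```python
import torch

torch.manual_seed(0)

# One parameter tensor and a toy quadratic loss, so that .grad is populated.
param = torch.randn(5, requires_grad=True)
velocity = torch.zeros_like(param)
lr, mu = 0.05, 0.9

for step in range(3):
    loss = (param ** 2).sum()
    loss.backward()

    velocity = mu * velocity + param.grad  # v <- mu * v + g
    with torch.no_grad():
        param -= lr * velocity             # theta <- theta - eta * v
    param.grad.zero_()
```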

Momentum is one of the most important extensions of SGD.

### Weight Decay

SGD also supports weight decay, a common form of regularization.

```python
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.05,
    momentum=0.9,
    weight_decay=1e-4,
)
```

Weight decay discourages large parameter values. For many models, this improves generalization.

For SGD, weight decay is closely related to L2 regularization. The update includes a term that pulls weights toward zero.
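
For plain SGD without momentum, this update can be written as

$$
\theta
\leftarrow
\theta -
\eta\left(\nabla_\theta L_\mathcal{B}(\theta) + \lambda\,\theta\right),
$$

where \(\lambda\) is the `weight_decay` coefficient. The extra \(\lambda\theta\) term is exactly the gradient that an L2 penalty of \(\tfrac{\lambda}{2}\lVert\theta\rVert^2\) would add to the loss.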

Biases and normalization parameters are sometimes excluded from weight decay in modern architectures. This requires creating separate optimizer parameter groups.

### Parameter Groups

PyTorch optimizers can use different settings for different parameters.

```python
decay = []
no_decay = []

for name, param in model.named_parameters():
    if not param.requires_grad:
        continue  # skip frozen parameters

    if name.endswith("bias"):
        no_decay.append(param)  # biases go in the no-decay group
    else:
        decay.append(param)

optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 1e-4},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.05,
    momentum=0.9,
)
```

Parameter groups are useful when different parts of a model need different learning rates or regularization settings.

For example, when fine-tuning a pretrained model, one may use a smaller learning rate for the pretrained backbone and a larger learning rate for the newly initialized head.
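
A sketch of that setup, assuming a model with `backbone` and `head` submodules (the attribute names are illustrative, not a fixed convention):

```python
# Assumes the model exposes .backbone and .head submodules (illustrative).
optimizer = torch.optim.SGD(
    [
        # Pretrained backbone: small learning rate to preserve learned features.
        {"params": model.backbone.parameters(), "lr": 0.001},
        # Newly initialized head: larger learning rate.
        {"params": model.head.parameters(), "lr": 0.05},
    ],
    lr=0.05,        # default for any group that does not set its own lr
    momentum=0.9,
)
```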

### Training and Validation Behavior

SGD reduces training loss by following minibatch gradients. But the training loss alone does not show whether the model generalizes.

A typical loop records both training loss and validation loss:

```python
for epoch in range(20):
    model.train()

    train_loss = 0.0
    for batch_X, batch_y in loader:
        pred = model(batch_X).squeeze(-1)
        loss = loss_fn(pred, batch_y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item() * batch_X.size(0)  # sum of per-example losses

    train_loss /= len(loader.dataset)  # average training loss for the epoch

    model.eval()
    with torch.no_grad():
        # X_val, y_val: a held-out validation split prepared beforehand.
        val_pred = model(X_val).squeeze(-1)
        val_loss = loss_fn(val_pred, y_val).item()

    print(epoch, train_loss, val_loss)
```

`model.train()` enables training behavior such as dropout and batch normalization updates. `model.eval()` switches the model to evaluation behavior. `torch.no_grad()` disables gradient tracking during validation.

### Common SGD Failure Modes

SGD can fail for several common reasons.

A learning rate that is too high may cause the loss to explode or become `nan`.

A learning rate that is too low may produce almost no progress.

Unshuffled data may lead to biased minibatches.

Poor feature scaling may make optimization slow, especially for linear models and shallow networks.

Exploding gradients may cause unstable updates in recurrent networks or very deep models.

Incorrect loss usage, such as applying softmax before `CrossEntropyLoss`, may damage numerical stability.

Most debugging starts with a small controlled experiment. Use a small dataset, confirm the model can overfit it, then scale up.
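
A minimal version of that sanity check, reusing the data, model, loss, and optimizer names from the earlier example: take one small batch and verify the loss can be driven close to zero.

```python
# Sanity check: a working model and optimizer should be able to
# memorize a single small batch almost perfectly.
# X, y, model, loss_fn, optimizer come from the earlier example.
small_X, small_y = X[:32], y[:32]

for step in range(500):
    pred = model(small_X).squeeze(-1)
    loss = loss_fn(pred, small_y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 100 == 0:
        print(step, loss.item())

# If the loss does not approach zero here, fix that before scaling up.
```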

### SGD as the Baseline Optimizer

Many modern systems use AdamW or related adaptive optimizers. Even so, SGD remains important.

It is simple, stable, memory-efficient, and theoretically easier to analyze. In computer vision, SGD with momentum has historically been a strong baseline. In large language model training, adaptive methods are more common, but the same minibatch-gradient principle remains central.

Understanding SGD makes later optimizers easier to understand. Momentum, RMSProp, Adam, and AdamW all modify how gradients are scaled, accumulated, or regularized.

### Summary

Stochastic gradient descent updates model parameters using gradients computed from randomly sampled examples or minibatches. It is cheaper than full-batch gradient descent and is the standard training method for neural networks.

In PyTorch, SGD is implemented with `torch.optim.SGD`. Practical SGD training depends on batch size, shuffling, learning rate, momentum, weight decay, and validation monitoring. The same training loop structure will be reused for deeper networks throughout this book.

