Momentum and Adaptive Methods

Stochastic gradient descent uses the current minibatch gradient to update the parameters. This is simple and effective, but the update can be noisy. Momentum and adaptive optimization methods modify the basic SGD update to make training faster, smoother, or less sensitive to feature scale.

The general training loop stays the same:

optimizer.zero_grad()
loss.backward()
optimizer.step()

The difference is inside optimizer.step(). Different optimizers use different rules for converting gradients into parameter updates.
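
As a rough mental model (not PyTorch's actual implementation), the step for plain SGD amounts to subtracting the scaled gradient from each parameter in place. A minimal sketch, assuming a toy model:

import torch

model = torch.nn.Linear(10, 1)          # stand-in model for illustration
lr = 0.1

# Conceptually what optimizer.step() does for plain SGD,
# after loss.backward() has filled param.grad.
with torch.no_grad():
    for param in model.parameters():
        if param.grad is not None:
            param -= lr * param.grad    # theta <- theta - lr * grad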

Momentum

Momentum adds memory to SGD. Instead of updating parameters using only the current gradient, it keeps a running velocity.

Let g_t be the gradient at step t. Momentum computes

v_t = \mu v_{t-1} + g_t,

then updates parameters by

\theta_t = \theta_{t-1} - \eta v_t.

Here \mu is the momentum coefficient, commonly 0.9, and \eta is the learning rate.

Momentum helps when gradients point in a consistent direction. The velocity accumulates those gradients, so progress accelerates along stable directions. It also damps oscillation in directions where gradients change sign repeatedly.

In PyTorch:

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
)

This is still SGD, but with a velocity buffer stored for each parameter.
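
A minimal sketch of that buffer, assuming a single toy parameter tensor and ignoring PyTorch options such as dampening and weight decay:

import torch

mu, lr = 0.9, 0.1
param = torch.zeros(3)                 # toy parameter tensor
velocity = torch.zeros_like(param)     # one velocity buffer per parameter

def momentum_step(grad):
    velocity.mul_(mu).add_(grad)       # v_t = mu * v_{t-1} + g_t
    param.sub_(lr * velocity)          # theta_t = theta_{t-1} - lr * v_t

for _ in range(3):
    momentum_step(torch.ones(3))       # pretend every minibatch gradient is all ones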

Nesterov Momentum

Nesterov momentum is a variation that evaluates the gradient after a lookahead step. Intuitively, it asks: if momentum is already carrying the parameters forward, what gradient will we see near that future position?

A common form is

v_t = \mu v_{t-1} + g(\theta_{t-1} - \eta \mu v_{t-1}),
\theta_t = \theta_{t-1} - \eta v_t.
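
As a toy illustration of this lookahead form, here is the update applied to a made-up scalar quadratic loss L(theta) = 0.5 * theta^2, whose gradient is simply g(theta) = theta (not a real model):

mu, lr = 0.9, 0.1
theta, v = 1.0, 0.0

def grad(x):
    return x                           # gradient of 0.5 * x**2

for _ in range(5):
    lookahead = theta - lr * mu * v    # where momentum is about to carry the parameter
    v = mu * v + grad(lookahead)       # gradient evaluated at the lookahead point
    theta = theta - lr * v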

In PyTorch:

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    nesterov=True,
)

Nesterov momentum can improve convergence in some models, especially when combined with a well-tuned learning-rate schedule.

Adaptive Learning Rates

Momentum uses past gradients to smooth the update direction. Adaptive methods also change the effective learning rate for each parameter.

A single global learning rate can be inefficient. Some parameters may receive large gradients. Others may receive small gradients. Adaptive optimizers scale updates using statistics of past gradients.

The general idea is:

\theta_t = \theta_{t-1} - \eta \frac{\text{first moment estimate}}{\text{scale estimate}}.

The numerator controls direction. The denominator controls per-parameter step size.

AdaGrad

AdaGrad accumulates squared gradients for each parameter:

s_t = s_{t-1} + g_t^2.

The update is

\theta_t = \theta_{t-1} - \eta \frac{g_t}{\sqrt{s_t} + \epsilon}.

Parameters with large historical gradients receive smaller effective learning rates. Parameters with small or rare gradients receive larger relative updates.

This is useful for sparse features, such as word counts or high-dimensional categorical inputs. But because s_t only grows, the effective learning rate can become very small after many steps.

In PyTorch:

optimizer = torch.optim.Adagrad(
    model.parameters(),
    lr=0.01,
)
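
A minimal sketch of the accumulation itself, again assuming a single toy parameter tensor:

import torch

lr, eps = 0.01, 1e-10
param = torch.zeros(3)                 # toy parameter tensor
s = torch.zeros_like(param)            # accumulated squared gradients, one per parameter

def adagrad_step(grad):
    s.add_(grad * grad)                           # s_t = s_{t-1} + g_t^2
    param.sub_(lr * grad / (s.sqrt() + eps))      # theta -= lr * g / (sqrt(s) + eps)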

RMSProp

RMSProp fixes AdaGrad’s continuously shrinking learning rate by using an exponential moving average of squared gradients:

s_t = \rho s_{t-1} + (1-\rho) g_t^2.

The update is

\theta_t = \theta_{t-1} - \eta \frac{g_t}{\sqrt{s_t} + \epsilon}.

The parameter \rho controls how quickly the moving average changes. Common values are 0.9 and 0.99.

In PyTorch:

optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=1e-3,
    alpha=0.99,
    eps=1e-8,
)

RMSProp was historically useful for recurrent networks and nonstationary objectives.
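
The only change from the AdaGrad sketch is the decaying average. A minimal sketch under the same toy-tensor assumptions:

import torch

lr, rho, eps = 1e-3, 0.99, 1e-8
param = torch.zeros(3)                 # toy parameter tensor
s = torch.zeros_like(param)            # moving average of squared gradients

def rmsprop_step(grad):
    s.mul_(rho).add_((1 - rho) * grad * grad)     # s_t = rho * s_{t-1} + (1 - rho) * g_t^2
    param.sub_(lr * grad / (s.sqrt() + eps))      # theta -= lr * g / (sqrt(s) + eps)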

Adam

Adam combines momentum with adaptive scaling. It keeps two moving averages.

The first moment estimates the mean gradient:

m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t.

The second moment estimates the mean squared gradient:

v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2.

Because both estimates start at zero, Adam uses bias correction:

\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}.

The update is

\theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.

Common defaults are

\beta_1 = 0.9, \quad \beta_2 = 0.999, \quad \epsilon = 10^{-8}.

In PyTorch:

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
)

Adam is widely used because it often works well with little tuning. It is common for transformers, generative models, reinforcement learning, and many research prototypes.
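
Putting the pieces together, a minimal sketch of the update under the same toy-tensor assumptions (no weight decay):

import torch

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
param = torch.zeros(3)                 # toy parameter tensor
m = torch.zeros_like(param)            # first moment estimate
v = torch.zeros_like(param)            # second moment estimate
t = 0                                  # step counter for bias correction

def adam_step(grad):
    global t
    t += 1
    m.mul_(beta1).add_((1 - beta1) * grad)            # m_t
    v.mul_(beta2).add_((1 - beta2) * grad * grad)     # v_t
    m_hat = m / (1 - beta1 ** t)                      # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                      # bias-corrected second moment
    param.sub_(lr * m_hat / (v_hat.sqrt() + eps))     # theta -= lr * m_hat / (sqrt(v_hat) + eps)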

AdamW

AdamW is a variant of Adam with decoupled weight decay. This distinction matters.

In ordinary L2 regularization, the penalty is added to the loss:

L_{\text{reg}}(\theta) = L(\theta) + \frac{\lambda}{2}\|\theta\|_2^2.

This changes the gradient:

\nabla_\theta L_{\text{reg}} = \nabla_\theta L + \lambda\theta.

For plain SGD, adding the L2 penalty to the loss is equivalent to weight decay. For Adam, the penalty term passes through the adaptive scaling along with the rest of the gradient, so the two are no longer equivalent. AdamW instead applies weight decay directly to the parameters, separately from the gradient-based update.
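
The difference shows up in where the decay term enters the update. A schematic sketch, with lam standing in for the decay strength and m_hat, v_hat standing in for the bias-corrected moments from the Adam sketch above:

import torch

lr, lam, eps = 3e-4, 0.01, 1e-8
param = torch.zeros(3)                  # toy parameter tensor
m_hat = torch.zeros_like(param)         # stand-in for the bias-corrected first moment
v_hat = torch.zeros_like(param)         # stand-in for the bias-corrected second moment

# L2 penalty in the loss: the decay term enters through the gradient
# (grad = grad + lam * param) and is then rescaled by the adaptive
# denominator along with everything else.

# Decoupled weight decay (AdamW): the decay term bypasses the adaptive
# scaling and shrinks the parameters directly.
param.sub_(lr * (m_hat / (v_hat.sqrt() + eps) + lam * param))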

In PyTorch:

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=0.01,
)

AdamW is the default choice for many modern transformer and diffusion models.

Choosing an Optimizer

There is no universally best optimizer. The right choice depends on the architecture, dataset, training budget, and tuning effort.

Optimizer           | Main idea                            | Common use
--------------------|--------------------------------------|---------------------------------------------------
SGD                 | Current minibatch gradient           | Simple baselines, classical CNN training
SGD with momentum   | Smoothed accumulated direction       | Vision models, stable large-batch training
AdaGrad             | Accumulated squared gradients        | Sparse features
RMSProp             | Moving average of squared gradients  | Recurrent nets, older RL systems
Adam                | Momentum plus adaptive scaling       | General deep learning
AdamW               | Adam with decoupled weight decay     | Transformers, diffusion models, foundation models

A reasonable default for many PyTorch projects is AdamW. A reasonable baseline for classical image classification is SGD with momentum.

Parameter Groups in AdamW

Modern models often apply weight decay selectively. Bias terms and normalization parameters are commonly excluded.

decay = []
no_decay = []

for name, param in model.named_parameters():
    if not param.requires_grad:
        continue

    if name.endswith("bias") or "norm" in name.lower():
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=3e-4,
)

This pattern is common for transformer training and fine-tuning.

Practical Defaults

For small models, these settings are often a useful starting point:

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4,
)

For transformer fine-tuning:

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01,
)

For CNN training from scratch:

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
)

These are starting values, not laws. Learning rate usually matters more than the specific optimizer.

Optimizer State

Adaptive methods store extra tensors for each parameter. Adam stores first and second moment estimates, so it uses more memory than SGD.

If a model has P parameters, SGD without momentum stores roughly the parameters and gradients. SGD with momentum stores an additional velocity tensor. Adam and AdamW store two additional tensors per parameter.

This matters for large models. Optimizer state can consume more memory than the model weights themselves.
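
One way to see this state directly is to inspect the optimizer after a step. In current PyTorch versions, Adam's per-parameter state typically includes tensors named exp_avg and exp_avg_sq, though the exact keys can vary across versions:

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
optimizer.step()

for param, state in optimizer.state.items():
    # For Adam, state typically holds 'exp_avg' and 'exp_avg_sq',
    # each the same shape as the parameter itself.
    print(param.shape, list(state.keys()))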

A simplified memory comparison:

Optimizer           | Extra state per parameter
--------------------|--------------------------
SGD                 | 0
SGD with momentum   | 1
RMSProp             | 1
Adam                | 2
AdamW               | 2

For very large models, memory-efficient optimizers, sharded optimizer states, or reduced-precision optimizer states may be needed.

Summary

Momentum improves SGD by accumulating a velocity across steps. Adaptive methods adjust the effective learning rate for each parameter using gradient statistics.

AdaGrad accumulates squared gradients. RMSProp uses a moving average of squared gradients. Adam combines momentum and adaptive scaling. AdamW modifies Adam by decoupling weight decay from the gradient update.

In PyTorch, these methods share the same training-loop interface. The optimizer changes, but the structure of training remains: clear gradients, compute loss, backpropagate, update parameters.