Momentum and Adaptive Methods

Stochastic gradient descent uses the current minibatch gradient to update the parameters. This is simple and effective, but the update can be noisy. Momentum and adaptive optimization methods modify the basic SGD update to make training faster, smoother, or less sensitive to feature scale.

The general training loop stays the same:

optimizer.zero_grad()
loss.backward()
optimizer.step()

The difference is inside optimizer.step(). Different optimizers use different rules for converting gradients into parameter updates.
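
As a rough mental model (not PyTorch's actual implementation), the step for plain SGD amounts to subtracting the scaled gradient from each parameter in place. A minimal sketch, assuming a toy model:

import torch

model = torch.nn.Linear(10, 1)          # stand-in model for illustration
lr = 0.1

# Conceptually what optimizer.step() does for plain SGD,
# after loss.backward() has filled param.grad.
with torch.no_grad():
    for param in model.parameters():
        if param.grad is not None:
            param -= lr * param.grad    # theta <- theta - lr * grad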

Momentum

Momentum adds memory to SGD. Instead of updating parameters using only the current gradient, it keeps a running velocity.

Let g_t be the gradient at step t. Momentum computes

v_t = \mu v_{t-1} + g_t,

then updates parameters by

\theta_t = \theta_{t-1} - \eta v_t.

Here \mu is the momentum coefficient, commonly 0.9, and \eta is the learning rate.

Momentum helps when gradients point in a consistent direction. The velocity accumulates those gradients, so progress accelerates along stable directions. It also damps oscillation in directions where gradients change sign repeatedly.

In PyTorch:

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
)

This is still SGD, but with a velocity buffer stored for each parameter.
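
A minimal sketch of that buffer, assuming a single toy parameter tensor and ignoring PyTorch options such as dampening and weight decay:

import torch

mu, lr = 0.9, 0.1
param = torch.zeros(3)                 # toy parameter tensor
velocity = torch.zeros_like(param)     # one velocity buffer per parameter

def momentum_step(grad):
    velocity.mul_(mu).add_(grad)       # v_t = mu * v_{t-1} + g_t
    param.sub_(lr * velocity)          # theta_t = theta_{t-1} - lr * v_t

for _ in range(3):
    momentum_step(torch.ones(3))       # pretend every minibatch gradient is all ones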

Nesterov Momentum

Nesterov momentum is a variation that evaluates the gradient after a lookahead step. Intuitively, it asks: if momentum is already carrying the parameters forward, what gradient will we see near that future position?

A common form is

v_t = \mu v_{t-1} + g(\theta_{t-1} - \eta \mu v_{t-1}),
\theta_t = \theta_{t-1} - \eta v_t.
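
As a toy illustration of this lookahead form, here is the update applied to a made-up scalar quadratic loss L(theta) = 0.5 * theta^2, whose gradient is simply g(theta) = theta (not a real model):

mu, lr = 0.9, 0.1
theta, v = 1.0, 0.0

def grad(x):
    return x                           # gradient of 0.5 * x**2

for _ in range(5):
    lookahead = theta - lr * mu * v    # where momentum is about to carry the parameter
    v = mu * v + grad(lookahead)       # gradient evaluated at the lookahead point
    theta = theta - lr * v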

In PyTorch:

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    nesterov=True,
)

Nesterov momentum can improve convergence in some models, especially when combined with a well-tuned learning-rate schedule.

Adaptive Learning Rates

Momentum uses past gradients to smooth the update direction. Adaptive methods also change the effective learning rate for each parameter.

A single global learning rate can be inefficient. Some parameters may receive large gradients. Others may receive small gradients. Adaptive optimizers scale updates using statistics of past gradients.

The general idea is:

\theta_t = \theta_{t-1} - \eta \frac{\text{first moment estimate}}{\text{scale estimate}}.

The numerator controls direction. The denominator controls per-parameter step size.

AdaGrad

AdaGrad accumulates squared gradients for each parameter:

s_t = s_{t-1} + g_t^2.

The update is

\theta_t = \theta_{t-1} - \eta \frac{g_t}{\sqrt{s_t} + \epsilon}.

Parameters with large historical gradients receive smaller effective learning rates. Parameters with small or rare gradients receive larger relative updates.

This is useful for sparse features, such as word counts or high-dimensional categorical inputs. But because s_t only grows, the effective learning rate can become very small after many steps.

In PyTorch:

optimizer = torch.optim.Adagrad(
    model.parameters(),
    lr=0.01,
)
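
A minimal sketch of the accumulation itself, again assuming a single toy parameter tensor:

import torch

lr, eps = 0.01, 1e-10
param = torch.zeros(3)                 # toy parameter tensor
s = torch.zeros_like(param)            # accumulated squared gradients, one per parameter

def adagrad_step(grad):
    s.add_(grad * grad)                           # s_t = s_{t-1} + g_t^2
    param.sub_(lr * grad / (s.sqrt() + eps))      # theta -= lr * g / (sqrt(s) + eps)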

RMSProp

RMSProp fixes AdaGrad’s continuously shrinking learning rate by using an exponential moving average of squared gradients:

s_t = \rho s_{t-1} + (1-\rho) g_t^2.

The update is

\theta_t = \theta_{t-1} - \eta \frac{g_t}{\sqrt{s_t} + \epsilon}.

The parameter \rho controls how quickly the moving average changes. Common values are 0.9 and 0.99.

In PyTorch:

optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=1e-3,
    alpha=0.99,
    eps=1e-8,
)

RMSProp was historically useful for recurrent networks and nonstationary objectives.
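
The only change from the AdaGrad sketch is the decaying average. A minimal sketch under the same toy-tensor assumptions:

import torch

lr, rho, eps = 1e-3, 0.99, 1e-8
param = torch.zeros(3)                 # toy parameter tensor
s = torch.zeros_like(param)            # moving average of squared gradients

def rmsprop_step(grad):
    s.mul_(rho).add_((1 - rho) * grad * grad)     # s_t = rho * s_{t-1} + (1 - rho) * g_t^2
    param.sub_(lr * grad / (s.sqrt() + eps))      # theta -= lr * g / (sqrt(s) + eps)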

Adam

Adam combines momentum with adaptive scaling. It keeps two moving averages.

The first moment estimates the mean gradient:

m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t.

The second moment estimates the mean squared gradient:

v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2.

Because both estimates start at zero, Adam uses bias correction:

\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}.

The update is

\theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.

Common defaults are

\beta_1 = 0.9, \quad \beta_2 = 0.999, \quad \epsilon = 10^{-8}.

In PyTorch:

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
)

Adam is widely used because it often works well with little tuning. It is common for transformers, generative models, reinforcement learning, and many research prototypes.
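
Putting the pieces together, a minimal sketch of the update under the same toy-tensor assumptions (no weight decay):

import torch

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
param = torch.zeros(3)                 # toy parameter tensor
m = torch.zeros_like(param)            # first moment estimate
v = torch.zeros_like(param)            # second moment estimate
t = 0                                  # step counter for bias correction

def adam_step(grad):
    global t
    t += 1
    m.mul_(beta1).add_((1 - beta1) * grad)            # m_t
    v.mul_(beta2).add_((1 - beta2) * grad * grad)     # v_t
    m_hat = m / (1 - beta1 ** t)                      # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                      # bias-corrected second moment
    param.sub_(lr * m_hat / (v_hat.sqrt() + eps))     # theta -= lr * m_hat / (sqrt(v_hat) + eps)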

AdamW

AdamW is a variant of Adam with decoupled weight decay. This distinction matters.

In ordinary L2 regularization, the penalty is added to the loss:

L_{\text{reg}}(\theta) = L(\theta) + \frac{\lambda}{2}\|\theta\|_2^2.

This changes the gradient:

\nabla_\theta L_{\text{reg}} = \nabla_\theta L + \lambda\theta.

For plain SGD, adding the L2 penalty to the loss is equivalent to weight decay. For Adam, the penalty term passes through the adaptive scaling along with the rest of the gradient, so the two are no longer equivalent. AdamW instead applies weight decay directly to the parameters, separately from the gradient-based update.
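
The difference shows up in where the decay term enters the update. A schematic sketch, with lam standing in for the decay strength and m_hat, v_hat standing in for the bias-corrected moments from the Adam sketch above:

import torch

lr, lam, eps = 3e-4, 0.01, 1e-8
param = torch.zeros(3)                  # toy parameter tensor
m_hat = torch.zeros_like(param)         # stand-in for the bias-corrected first moment
v_hat = torch.zeros_like(param)         # stand-in for the bias-corrected second moment

# L2 penalty in the loss: the decay term enters through the gradient
# (grad = grad + lam * param) and is then rescaled by the adaptive
# denominator along with everything else.

# Decoupled weight decay (AdamW): the decay term bypasses the adaptive
# scaling and shrinks the parameters directly.
param.sub_(lr * (m_hat / (v_hat.sqrt() + eps) + lam * param))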

In PyTorch:

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=0.01,
)

AdamW is the default choice for many modern transformer and diffusion models.

Choosing an Optimizer

There is no universally best optimizer. The right choice depends on the architecture, dataset, training budget, and tuning effort.

Optimizer           | Main idea                            | Common use
--------------------|--------------------------------------|---------------------------------------------------
SGD                 | Current minibatch gradient           | Simple baselines, classical CNN training
SGD with momentum   | Smoothed accumulated direction       | Vision models, stable large-batch training
AdaGrad             | Accumulated squared gradients        | Sparse features
RMSProp             | Moving average of squared gradients  | Recurrent nets, older RL systems
Adam                | Momentum plus adaptive scaling       | General deep learning
AdamW               | Adam with decoupled weight decay     | Transformers, diffusion models, foundation models

A reasonable default for many PyTorch projects is AdamW. A reasonable baseline for classical image classification is SGD with momentum.

Parameter Groups in AdamW

Modern models often apply weight decay selectively. Bias terms and normalization parameters are commonly excluded.

decay = []
no_decay = []

for name, param in model.named_parameters():
    if not param.requires_grad:
        continue

    if name.endswith("bias") or "norm" in name.lower():
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=3e-4,
)

This pattern is common for transformer training and fine-tuning.

Practical Defaults

For small models, these settings are often a useful starting point:

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4,
)

For transformer fine-tuning:

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01,
)

For CNN training from scratch:

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
)

These are starting values, not laws. Learning rate usually matters more than the specific optimizer.

Optimizer State

Adaptive methods store extra tensors for each parameter. Adam stores first and second moment estimates, so it uses more memory than SGD.

If a model has P parameters, SGD without momentum stores roughly the parameters and gradients. SGD with momentum stores an additional velocity tensor. Adam and AdamW store two additional tensors per parameter.

This matters for large models. Optimizer state can consume more memory than the model weights themselves.
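
One way to see this state directly is to inspect the optimizer after a step. In current PyTorch versions, Adam's per-parameter state typically includes tensors named exp_avg and exp_avg_sq, though the exact keys can vary across versions:

import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
optimizer.step()

for param, state in optimizer.state.items():
    # For Adam, state typically holds 'exp_avg' and 'exp_avg_sq',
    # each the same shape as the parameter itself.
    print(param.shape, list(state.keys()))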

A simplified memory comparison:

Optimizer           | Extra state per parameter
--------------------|--------------------------
SGD                 | 0
SGD with momentum   | 1
RMSProp             | 1
Adam                | 2
AdamW               | 2

For very large models, memory-efficient optimizers, sharded optimizer states, or reduced-precision optimizer states may be needed.

Summary

Momentum improves SGD by accumulating a velocity across steps. Adaptive methods adjust the effective learning rate for each parameter using gradient statistics.

AdaGrad accumulates squared gradients. RMSProp uses a moving average of squared gradients. Adam combines momentum and adaptive scaling. AdamW modifies Adam by decoupling weight decay from the gradient update.

In PyTorch, these methods share the same training-loop interface. The optimizer changes, but the structure of training remains: clear gradients, compute loss, backpropagate, update parameters.