Learning Rate Scheduling

The learning rate controls the size of each parameter update. In early training, larger updates can help the model move quickly into a useful region. Later in training, smaller updates can help the model settle into a better solution.

A learning rate schedule changes the learning rate during training.

The optimizer update has the form

\theta_t = \theta_{t-1} - \eta_t g_t,

where \eta_t is the learning rate at step t, and g_t is the gradient or optimizer-adjusted update direction.

With scheduling, \eta_t changes over time.

Why Schedule the Learning Rate

A fixed learning rate is simple, but it often leaves performance on the table.

If the learning rate is too large, training may make fast progress early but fail to converge cleanly. If it is too small, training may be stable but unnecessarily slow.

A schedule tries to get both benefits: larger steps when the model is far from a good solution, smaller steps when the model needs refinement.

Learning rate scheduling is especially important for deep networks because training dynamics change over time. At the start, parameters are often poorly adapted to the data. Later, the model may need smaller changes to improve validation loss.

Step Decay

Step decay reduces the learning rate by a fixed factor after a fixed number of epochs.

For example, the learning rate may start at 0.1, then drop to 0.01, then to 0.001.

In PyTorch:

import torch

# SGD with momentum; the scheduler below decays this base learning rate.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
)

scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=30,
    gamma=0.1,
)

for epoch in range(100):
    train_one_epoch(model, loader, optimizer)
    scheduler.step()

Here the learning rate is multiplied by 0.1 every 30 epochs.

Step decay is simple and has been widely used for CNN training.

Multi-Step Decay

Multi-step decay lowers the learning rate at specific epochs.

scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[30, 60, 90],
    gamma=0.1,
)

This gives more explicit control than StepLR. It is common when reproducing known training recipes.

For example, a training run may use learning rate 0.1 until epoch 30, 0.01 until epoch 60, 0.001 until epoch 90, then 0.0001 afterward.
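One quick way to check these decay points is to dry-run the scheduler on a throwaway optimizer and print the learning rate at a few epochs. A minimal sketch, assuming nothing beyond standard PyTorch; the single dummy parameter exists only so the optimizer has something to hold:

import torch

# Dry-run MultiStepLR on a dummy optimizer to inspect the schedule.
dummy = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([dummy], lr=0.1)

scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[30, 60, 90],
    gamma=0.1,
)

for epoch in range(100):
    if epoch in (0, 30, 60, 90):
        print(epoch, scheduler.get_last_lr()[0])
    optimizer.step()   # no-op here; keeps the step order PyTorch expects
    scheduler.step()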

Exponential Decay

Exponential decay multiplies the learning rate by a constant factor at every scheduling step:

\eta_t = \eta_0 \gamma^t.

Here \eta_0 is the initial learning rate and \gamma controls the decay speed.

In PyTorch:

scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer,
    gamma=0.95,
)

Each call to scheduler.step() multiplies the learning rate by 0.95.

Exponential decay gives a smooth decrease, but it can reduce the learning rate too aggressively if \gamma is poorly chosen.
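One way to pick \gamma is to solve \eta_0 \gamma^T = \eta_T for a target final learning rate \eta_T after T scheduling steps. A minimal sketch with illustrative endpoint values:

# Choose gamma so the learning rate decays from 0.1 to 1e-4 over 100 steps.
initial_lr = 0.1
final_lr = 1e-4
num_steps = 100

# Solve initial_lr * gamma ** num_steps == final_lr for gamma.
gamma = (final_lr / initial_lr) ** (1.0 / num_steps)
print(gamma)  # roughly 0.933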

Cosine Annealing

Cosine annealing reduces the learning rate following a cosine curve:

\eta_t = \eta_{\min} + \frac{1}{2} (\eta_{\max}-\eta_{\min}) \left( 1+\cos\left(\frac{\pi t}{T}\right) \right).

At the beginning, the learning rate is near \eta_{\max}. Near the end, it approaches \eta_{\min}.

In PyTorch:

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=100,
    eta_min=1e-6,
)

for epoch in range(100):
    train_one_epoch(model, loader, optimizer)
    scheduler.step()

Cosine schedules are common in modern vision models, transformers, and self-supervised learning.

Warmup

Warmup starts with a small learning rate and gradually increases it. This is useful when large initial updates would destabilize training.

A linear warmup schedule is

\eta_t = \eta_{\max} \frac{t}{T_{\text{warmup}}}, \quad 0 \le t \le T_{\text{warmup}}.

After warmup, the schedule usually switches to decay.

Warmup is common for transformers and large-batch training. At initialization, activations and gradients may have unstable scale. A gradual learning-rate increase gives the optimizer time to enter a stable regime.

A simple manual warmup by step:

base_lr = 3e-4
warmup_steps = 1000

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)

for step, batch in enumerate(loader):
    if step < warmup_steps:
        lr = base_lr * (step + 1) / warmup_steps
        for group in optimizer.param_groups:
            group["lr"] = lr

    loss = compute_loss(model, batch)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

For long training runs, warmup is usually combined with cosine decay or linear decay.

Linear Warmup with Cosine Decay

A common modern schedule is linear warmup followed by cosine decay.

During warmup:

\eta_t = \eta_{\max} \frac{t}{T_w}.

After warmup:

\eta_t = \eta_{\min} + \frac{1}{2} (\eta_{\max}-\eta_{\min}) \left( 1+\cos\left( \frac{\pi(t-T_w)}{T-T_w} \right) \right).

Here T_w is the number of warmup steps, and T is the total number of steps.

In PyTorch, this can be implemented with LambdaLR:

import math
import torch

total_steps = 10000
warmup_steps = 1000
min_lr_ratio = 0.1

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def lr_lambda(step):
    if step < warmup_steps:
        return float(step + 1) / float(warmup_steps)

    progress = float(step - warmup_steps) / float(
        max(1, total_steps - warmup_steps)
    )

    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr_ratio + (1.0 - min_lr_ratio) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lr_lambda,
)

for step, batch in enumerate(loader):
    loss = compute_loss(model, batch)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()

The lambda returns a multiplier, which LambdaLR applies to the optimizer’s base learning rate.
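Before committing to a long run, it can be worth dry-running the schedule and printing a few values to confirm the warmup and decay shape. A minimal sketch that reuses the lr_lambda and total_steps defined above with a dummy optimizer:

# Step the schedule without training to check its shape.
dummy = torch.nn.Parameter(torch.zeros(1))
check_opt = torch.optim.AdamW([dummy], lr=3e-4)
check_sched = torch.optim.lr_scheduler.LambdaLR(check_opt, lr_lambda=lr_lambda)

for step in range(total_steps):
    if step % 1000 == 0:
        print(step, check_sched.get_last_lr()[0])
    check_opt.step()
    check_sched.step()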

Reduce on Plateau

Some schedules depend on validation metrics rather than epoch count. ReduceLROnPlateau lowers the learning rate when a monitored metric stops improving.

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode="min",
    factor=0.1,
    patience=5,
)

for epoch in range(100):
    train_one_epoch(model, loader, optimizer)
    val_loss = evaluate(model, val_loader)

    scheduler.step(val_loss)

This is useful when the right decay time is unknown. It is common in smaller experiments and applied modeling.

Unlike most PyTorch schedulers, ReduceLROnPlateau receives the validation metric as an argument.
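ReduceLROnPlateau also exposes a few arguments that are often worth tuning, such as threshold, cooldown, and min_lr. A sketch with illustrative values:

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode="min",
    factor=0.1,
    patience=5,
    threshold=1e-4,  # how much improvement counts as progress
    cooldown=2,      # epochs to wait after a reduction before monitoring again
    min_lr=1e-6,     # lower bound on the learning rate
)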

One-Cycle Policy

The one-cycle policy increases the learning rate early, then decreases it below the initial value. It often cycles momentum in the opposite direction: lower momentum when the learning rate is high, higher momentum when the learning rate is low.

In PyTorch:

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
)

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,
    steps_per_epoch=len(loader),
    epochs=20,
)

for epoch in range(20):
    for batch in loader:
        loss = compute_loss(model, batch)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

OneCycleLR is stepped after every optimizer update. It needs either total_steps or both steps_per_epoch and epochs.
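If the number of batches per epoch is awkward to compute up front, the same scheduler can be configured with total_steps instead. A sketch with an illustrative step count:

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,
    total_steps=10000,  # illustrative: total number of optimizer updates
)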

Scheduler Step Order

For most PyTorch schedulers, the usual order is:

optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()

This updates the parameters using the current learning rate, then prepares the learning rate for the next step.

For epoch-based schedulers, call scheduler.step() once per epoch. For step-based schedulers, call it once per optimizer update.

Examples:

# Epoch-based
for epoch in range(num_epochs):
    train_one_epoch(...)
    scheduler.step()

# Step-based
for epoch in range(num_epochs):
    for batch in loader:
        ...
        optimizer.step()
        scheduler.step()

Mixing these up changes the schedule dramatically. For example, stepping StepLR(step_size=30) once per batch decays the learning rate after 30 batches instead of 30 epochs.

Inspecting the Learning Rate

It is useful to log the learning rate during training.

current_lr = optimizer.param_groups[0]["lr"]
print(current_lr)

If the optimizer has multiple parameter groups, each group may have a different learning rate:

for i, group in enumerate(optimizer.param_groups):
    print(i, group["lr"])

When training fails, always check the actual learning rate. A scheduler bug can silently make learning rates too large or too small.
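One lightweight way to catch such bugs is to record the learning rate at every step and inspect or plot the history after the run. A minimal sketch (lr_history is just an illustrative name):

lr_history = []

for batch in loader:
    # Record the learning rate that will be used for this update.
    lr_history.append(optimizer.param_groups[0]["lr"])

    loss = compute_loss(model, batch)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()

print(max(lr_history), min(lr_history))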

Choosing a Schedule

The right schedule depends on the task and training regime.

Small experiment: fixed LR or ReduceLROnPlateau
Classical CNN training: step decay, multi-step decay, or cosine decay
Transformer training: warmup plus cosine or linear decay
Large-batch training: warmup plus decay
Fast supervised training: one-cycle policy
Fine-tuning pretrained models: small LR, often linear decay with warmup

A practical default for many modern PyTorch models is AdamW with linear warmup and cosine decay. For simple baselines, fixed learning rate or step decay is enough.

Learning Rate Range Tests

A learning rate range test tries many learning rates in a short run. The learning rate starts very small and increases over time. The loss is recorded at each value.

The useful learning rate range is often where the loss begins to decrease quickly but before it becomes unstable.

A simple manual version:

min_lr = 1e-6
max_lr = 1.0
num_steps = 200

optimizer = torch.optim.SGD(model.parameters(), lr=min_lr)

for step, batch in enumerate(loader):
    if step >= num_steps:
        break

    ratio = step / (num_steps - 1)
    lr = min_lr * (max_lr / min_lr) ** ratio

    for group in optimizer.param_groups:
        group["lr"] = lr

    loss = compute_loss(model, batch)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print(step, lr, loss.item())

This test does not replace validation. It is a diagnostic tool for choosing an initial learning rate.

Summary

Learning rate scheduling changes the learning rate during training. This often improves speed, stability, and final model quality.

Common schedules include step decay, multi-step decay, exponential decay, cosine annealing, warmup, ReduceLROnPlateau, and one-cycle learning. Modern transformer-style training often uses warmup followed by cosine or linear decay.

In PyTorch, schedulers live in torch.optim.lr_scheduler. The key implementation detail is timing: call the scheduler once per epoch or once per step according to the schedule design.