Stochastic Optimization

Stochastic optimization studies optimization when the objective is accessed through samples, noisy estimates, or partial observations. In machine learning, this is the normal case. We do not usually optimize the loss over the entire data distribution. We optimize an empirical approximation, and even that empirical loss is usually accessed through mini-batches.

The ideal objective is often written as an expectation:

$$L(\theta) = \mathbb{E}_{(x,y) \sim P}[\ell(f_\theta(x), y)].$$

The distribution $P$ is unknown. We only have samples. With a dataset of $N$ examples, we approximate the objective by the empirical risk:

$$\hat{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i).$$

Full-batch gradient descent computes $\nabla \hat{L}(\theta)$. Stochastic optimization replaces this full gradient with an estimator computed from one example or a mini-batch.

For a batch $B$, the mini-batch gradient is

$$g_B(\theta) = \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \ell(f_\theta(x_i), y_i).$$

The update becomes

$$\theta_{t+1} = \theta_t - \eta_t \, g_B(\theta_t).$$

The subscript on $\eta_t$ matters. In stochastic optimization, the learning rate often changes over time.

Unbiased Gradient Estimates

If the batch is sampled uniformly from the dataset, then the mini-batch gradient is an unbiased estimator of the full empirical gradient:

$$\mathbb{E}[g_B(\theta)] = \nabla \hat{L}(\theta).$$

This does not mean every update moves in the best direction. It means the average update direction is correct relative to the empirical objective.

The variance of the estimator decreases as batch size increases. Small batches give noisy gradients. Large batches give more accurate gradients but cost more per update.

This creates a central tradeoff:

| Batch size | Gradient noise | Update cost | Hardware use |
| --- | --- | --- | --- |
| Small | High | Low | Often poor |
| Medium | Moderate | Moderate | Good |
| Large | Low | High | Often excellent |

The best batch size depends on the model, hardware, optimizer, memory capacity, and data pipeline.
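The noise side of this tradeoff is easy to check numerically. The sketch below (an illustrative numpy experiment of our own, using a least-squares loss) measures how far mini-batch gradients deviate from the full empirical gradient at two batch sizes:

```python
import numpy as np

# Illustrative check: mini-batch gradient noise shrinks as batch size grows.
rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
theta = np.zeros(d)

def batch_grad(idx):
    # Gradient of 0.5 * mean((X theta - y)^2) over the rows in idx.
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / len(idx)

full = batch_grad(np.arange(N))

def grad_noise(batch_size, trials=200):
    # Mean squared deviation of the mini-batch gradient from the full one.
    errs = [np.sum((batch_grad(rng.choice(N, batch_size, replace=False)) - full) ** 2)
            for _ in range(trials)]
    return float(np.mean(errs))

noise_small = grad_noise(8)
noise_large = grad_noise(512)
assert noise_small > noise_large  # larger batches are less noisy
```

For uniform sampling, the deviation shrinks roughly in proportion to one over the batch size, which is why the table above lists noise and cost moving in opposite directions.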

Stochastic Gradient Descent

The simplest stochastic optimizer is stochastic gradient descent, usually called SGD.

initialize theta

for each step t:
    batch = sample_batch(data)
    loss = L(theta, batch)
    grad = gradient(loss, theta)
    theta = theta - eta_t * grad

Automatic differentiation computes grad. The optimizer only consumes the resulting tensors.

SGD is attractive because it has little state. Each parameter update requires the parameter value, its gradient, and the current learning rate. This makes SGD memory-efficient and easy to implement.
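The pseudocode below can be made concrete in a few lines. Here is a minimal sketch of SGD on a least-squares problem in plain numpy; the problem setup, batch size, and learning rate are all illustrative choices, not a framework API:

```python
import numpy as np

# Minimal SGD sketch: least squares with a known true weight vector.
rng = np.random.default_rng(1)
N, d = 2_000, 3
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(N, d))
y = X @ true_w

w = np.zeros(d)
eta = 0.1
for t in range(500):
    idx = rng.choice(N, size=32, replace=False)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / len(idx)        # mini-batch gradient
    w -= eta * grad                               # SGD update

assert np.allclose(w, true_w, atol=1e-2)
```

The only optimizer state here is `w` itself, which is exactly the memory-efficiency point made above.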

For a parameter tensor $w$, the update is:

$$w \leftarrow w - \eta_t \nabla_w L_B.$$

The gradient is noisy because $L_B$ uses a subset of the data. This noise can help escape some narrow or unstable regions, but it can also slow convergence.

Learning Rate Schedules

With noisy gradients, a fixed learning rate may fail to converge to a stable point. Classical stochastic approximation often uses decreasing learning rates. A simple schedule is:

$$\eta_t = \frac{\eta_0}{1 + kt}.$$

Another common form is step decay:

$$\eta_t = \eta_0 \, \gamma^{\lfloor t / s \rfloor},$$

where $s$ is the decay interval and $0 < \gamma < 1$.

Modern neural network training often uses warmup, cosine decay, or piecewise schedules. Warmup starts with a small learning rate and increases it during early training. This is useful when large models, normalization layers, adaptive optimizers, or mixed precision make early updates unstable.
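These schedules are small pure functions of the step count. A sketch, with illustrative values for $\eta_0$, $k$, $\gamma$, $s$, and the warmup length:

```python
import math

# Illustrative learning-rate schedules as functions of the step t.
def inverse_decay(t, eta0=0.1, k=0.01):
    # eta_t = eta0 / (1 + k t)
    return eta0 / (1 + k * t)

def step_decay(t, eta0=0.1, gamma=0.5, s=1000):
    # eta_t = eta0 * gamma^(floor(t / s))
    return eta0 * gamma ** (t // s)

def warmup_cosine(t, eta0=0.1, warmup=100, total=10_000):
    # Linear warmup to eta0, then cosine decay to zero.
    if t < warmup:
        return eta0 * t / warmup
    progress = (t - warmup) / (total - warmup)   # in [0, 1]
    return eta0 * 0.5 * (1 + math.cos(math.pi * progress))

assert inverse_decay(0) == 0.1
assert step_decay(2500) == 0.025
assert warmup_cosine(100) == 0.1
```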

A schedule belongs to the optimizer layer, not the AD layer. The AD system should compute gradients for the current loss. It should not encode training policy.

Momentum

SGD can oscillate in directions of high curvature. Momentum reduces this effect by maintaining a velocity buffer:

$$v_{t+1} = \mu v_t + g_t,$$
$$\theta_{t+1} = \theta_t - \eta v_{t+1}.$$

Here, $0 \le \mu < 1$ controls how much past gradient information is retained.

Momentum smooths noisy gradients and accelerates progress along directions where gradients are consistent. It adds one auxiliary tensor per parameter, with the same shape as the parameter.

The AD system remains unchanged. It still computes $g_t$. Momentum changes how $g_t$ is transformed into a parameter update.
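A single momentum step, following the two equations above, can be sketched as follows (hyperparameter values are illustrative):

```python
import numpy as np

# Momentum update sketch: one velocity buffer per parameter tensor.
def momentum_step(theta, v, grad, eta=0.1, mu=0.9):
    v = mu * v + grad           # accumulate velocity from past gradients
    theta = theta - eta * v     # update with the velocity, not the raw gradient
    return theta, v

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)        # the auxiliary tensor mentioned above
grad = np.array([0.5, 0.5])
theta, v = momentum_step(theta, v, grad)
assert np.allclose(v, [0.5, 0.5])
assert np.allclose(theta, [0.95, -2.05])
```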

Adaptive Methods

Adaptive optimizers maintain statistics of past gradients. A common design keeps a moving average of gradients and squared gradients:

$$m_{t+1} = \beta_1 m_t + (1 - \beta_1) g_t,$$
$$v_{t+1} = \beta_2 v_t + (1 - \beta_2) g_t^2.$$

The update scales each coordinate by the estimated gradient magnitude:

$$\theta_{t+1} = \theta_t - \eta \, \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon}.$$

This is the basic shape behind Adam-like methods.

Adaptive methods are useful when parameters have gradients with very different scales. They are also convenient because they often require less learning-rate tuning than plain SGD.

The cost is additional optimizer state. Adam-style methods commonly store two extra tensors per parameter. For large models, this state can dominate memory use.
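A sketch of one such update, directly following the moment equations above (full Adam also applies bias correction to $m$ and $v$, which this simplified version omits; hyperparameter values are illustrative):

```python
import numpy as np

# Adam-style update sketch (no bias correction).
def adam_step(theta, m, v, grad, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    theta = theta - eta * m / (np.sqrt(v) + eps)
    return theta, m, v

theta = np.zeros(2)
m = np.zeros(2)     # extra state tensor 1
v = np.zeros(2)     # extra state tensor 2
# Coordinates with very different gradient scales get similar step sizes.
grad = np.array([100.0, 0.01])
theta, m, v = adam_step(theta, m, v, grad)
assert abs(theta[0] - theta[1]) < 1e-3  # per-coordinate normalization
```

The two buffers `m` and `v` are the per-parameter state whose memory cost the text describes.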

Stochasticity Sources

Stochastic optimization receives noise from several places.

Sampling noise comes from mini-batch selection. Different batches produce different gradients.

Model noise may come from dropout, randomized layers, data augmentation, or stochastic routing.

Hardware noise may come from nondeterministic parallel reductions, mixed precision arithmetic, and distributed communication.

Data pipeline noise may come from shuffling, asynchronous loading, and randomized preprocessing.

AD differentiates the executed computation. If the forward pass includes random choices, the derivative corresponds to the sampled execution path, unless the system uses a special estimator or reparameterization.

Differentiating Randomness

Randomness is subtle in differentiable programs.

Consider:

$$z \sim p(z), \qquad L(\theta) = \ell(f_\theta(z)).$$

If $z$ does not depend on $\theta$, then AD can differentiate through $f_\theta(z)$ for the sampled value of $z$. The sample is treated as an input value.

If the sampling distribution depends on θ\theta, ordinary pathwise AD may not apply directly. For example:

$$z \sim p_\theta(z).$$

The derivative of an expectation requires more care:

$$\nabla_\theta \, \mathbb{E}_{z \sim p_\theta}[\ell(z)].$$

Some cases use reparameterization. We rewrite sampling as a deterministic transformation of parameter-free noise:

$$z = h_\theta(\epsilon), \qquad \epsilon \sim p(\epsilon).$$

Then AD can differentiate through $h_\theta$.
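As a small numeric check of the reparameterized path (the Gaussian example is ours): for $z \sim \mathcal{N}(\theta, 1)$ and $\ell(z) = z^2$, the true gradient of the expectation is $2\theta$, and the pathwise estimator recovers it by writing $z = \theta + \epsilon$:

```python
import numpy as np

# Reparameterization sketch: estimate d/dtheta E_{z ~ N(theta,1)}[z^2].
rng = np.random.default_rng(2)
theta = 1.5
eps = rng.normal(size=100_000)   # parameter-free noise
z = theta + eps                  # z = h_theta(eps), with dz/dtheta = 1
# Pathwise derivative of z^2: 2 * z * dz/dtheta = 2 * z.
grad_est = np.mean(2 * z)
assert abs(grad_est - 2 * theta) < 0.05  # true gradient is 2*theta = 3.0
```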

Other cases use score-function estimators, such as REINFORCE-style gradients. These estimators are often high variance and belong more to probabilistic optimization than ordinary reverse mode AD.

Gradient Accumulation

Gradient accumulation is used when the desired batch size does not fit in memory.

Instead of one large batch, we use several micro-batches:

zero gradients

for each micro-batch:
    loss = L(theta, micro_batch) / num_micro_batches
    grad = backward(loss)
    accumulate grad

optimizer step
zero gradients

Dividing the loss by the number of micro-batches keeps the accumulated gradient equal to the average gradient over the effective batch.

This pattern relies on the fact that reverse mode adds contributions into gradient buffers. Accumulation is therefore a controlled use of mutable gradient state.
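The equality between accumulated micro-batch gradients and the full effective-batch gradient can be verified directly. A sketch for a mean-reduced least-squares loss (sizes are illustrative):

```python
import numpy as np

# Check: accumulating scaled micro-batch gradients reproduces the
# gradient of the full effective batch.
rng = np.random.default_rng(3)
X = rng.normal(size=(64, 4))
y = rng.normal(size=64)
w = rng.normal(size=4)

def grad(Xb, yb):
    # Gradient of 0.5 * mean((Xb w - yb)^2).
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y)

num_micro = 4
acc = np.zeros_like(w)
for Xm, ym in zip(np.split(X, num_micro), np.split(y, num_micro)):
    acc += grad(Xm, ym) / num_micro   # loss divided by num_micro_batches

assert np.allclose(acc, full)
```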

Distributed Stochastic Optimization

In distributed training, several workers compute gradients on different mini-batches. A common synchronous pattern is:

for each worker:
    compute local gradient

average gradients across workers

for each worker:
    apply same optimizer update

If there are $K$ workers, each with local batch size $b$, the effective batch size is $Kb$.

The AD system computes local gradients. The distributed runtime handles communication, usually through an all-reduce operation. The optimizer then uses the averaged gradient.
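The averaging step has the same structure as gradient accumulation. A single-process sketch (the mean over a list of local gradients stands in for the all-reduce):

```python
import numpy as np

# Synchronous data parallelism: averaging K local gradients equals the
# gradient of the combined batch of size K*b, for a mean-reduced loss.
rng = np.random.default_rng(4)
K, b, d = 4, 8, 3
w = rng.normal(size=d)
X = rng.normal(size=(K * b, d))
y = rng.normal(size=K * b)

def grad(Xb, yb):
    # Gradient of 0.5 * mean((Xb w - yb)^2).
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Each "worker" computes a gradient on its own shard of the data.
local = [grad(X[k*b:(k+1)*b], y[k*b:(k+1)*b]) for k in range(K)]
averaged = np.mean(local, axis=0)   # stands in for all-reduce, then divide by K

assert np.allclose(averaged, grad(X, y))
```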

Distributed stochastic optimization adds several systems concerns:

| Concern | Effect |
| --- | --- |
| Communication latency | Slows each step |
| Gradient bandwidth | Limits model scale |
| Stragglers | Delay synchronous updates |
| Nondeterminism | Complicates reproducibility |
| Large batches | May require learning-rate adjustment |

The derivative math is unchanged, but the execution model changes substantially.

AD Boundary

Automatic differentiation provides the gradient estimator for the sampled computation. It does not decide whether the estimator has low variance, whether the learning rate is stable, or whether the sampling scheme is statistically valid.

This boundary is useful:

AD layer:
    compute gradients of the current program execution

optimizer layer:
    transform gradients into parameter updates

training system:
    choose batches, schedules, precision, distribution, and logging

Keeping these layers separate makes AD engines reusable. The same reverse-mode engine can support SGD, momentum, Adam, distributed training, gradient accumulation, and custom optimizers.

Practical Checklist

For stochastic optimization, the main implementation checks are straightforward.

The loss scale should match the intended gradient scale. Averaging over a batch gives gradients that are mostly independent of batch size. Summing over a batch makes gradients grow with batch size.
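The mean-versus-sum distinction is easy to demonstrate (a tiny illustrative example of ours):

```python
import numpy as np

# Mean-reduced losses give batch-size-independent gradient scale;
# sum-reduced losses scale with batch size.
X = np.ones((8, 2))
y = np.arange(8.0)
w = np.zeros(2)
resid = X @ w - y
grad_mean = X.T @ resid / len(y)   # gradient of 0.5 * mean(resid**2)
grad_sum = X.T @ resid             # gradient of 0.5 * sum(resid**2)
assert np.allclose(grad_sum, len(y) * grad_mean)  # sum is batch-size times larger
```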

Gradient buffers should be cleared at the right time. Clearing too early drops contributions. Clearing too late accumulates unintended gradients.

Random seeds should be controlled when reproducibility matters. Some GPU kernels may remain nondeterministic even with fixed seeds.

Optimizer state should be checkpointed with model parameters. Restarting Adam without its moment buffers changes the training trajectory.

Learning-rate schedules should count the intended unit: step, token, example, epoch, or optimizer update. In gradient accumulation, micro-batches and optimizer steps differ.

Stochastic optimization is where AD becomes part of a larger training system. The derivative computation is local and mechanical. The behavior of the optimizer depends on sampling, state, scheduling, numerical precision, and hardware execution.