Stochastic Optimization

Stochastic optimization studies optimization when the objective is accessed through samples, noisy estimates, or partial observations. In machine learning, this is the normal case. We do not usually optimize the loss over the entire data distribution. We optimize an empirical approximation, and even that empirical loss is usually accessed through mini-batches.

The ideal objective is often written as an expectation:

$$L(\theta) = \mathbb{E}_{(x,y) \sim P}[\ell(f_\theta(x), y)].$$

The distribution $P$ is unknown. We only have samples. With a dataset of $N$ examples, we approximate the objective by the empirical risk:

$$\hat{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i).$$

Full-batch gradient descent computes $\nabla \hat{L}(\theta)$. Stochastic optimization replaces this full gradient with an estimator computed from one example or a mini-batch.

For a batch $B$, the mini-batch gradient is

$$g_B(\theta) = \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \ell(f_\theta(x_i), y_i).$$

The update becomes

$$\theta_{t+1} = \theta_t - \eta_t \, g_B(\theta_t).$$

The subscript on $\eta_t$ matters. In stochastic optimization, the learning rate often changes over time.

Unbiased Gradient Estimates

If the batch is sampled uniformly from the dataset, then the mini-batch gradient is an unbiased estimator of the full empirical gradient:

$$\mathbb{E}[g_B(\theta)] = \nabla \hat{L}(\theta).$$

This does not mean every update moves in the best direction. It means the average update direction is correct relative to the empirical objective.

The variance of the estimator decreases as batch size increases. Small batches give noisy gradients. Large batches give more accurate gradients but cost more per update.

This creates a central tradeoff:

| Batch size | Gradient noise | Update cost | Hardware use |
| --- | --- | --- | --- |
| Small | High | Low | Often poor |
| Medium | Moderate | Moderate | Good |
| Large | Low | High | Often excellent |

The best batch size depends on the model, hardware, optimizer, memory capacity, and data pipeline.
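The noise side of this tradeoff is easy to check numerically. The sketch below (an illustrative numpy experiment of our own, using a least-squares loss) measures how far mini-batch gradients deviate from the full empirical gradient at two batch sizes:

```python
import numpy as np

# Illustrative check: mini-batch gradient noise shrinks as batch size grows.
rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
theta = np.zeros(d)

def batch_grad(idx):
    # Gradient of 0.5 * mean((X theta - y)^2) over the rows in idx.
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / len(idx)

full = batch_grad(np.arange(N))

def grad_noise(batch_size, trials=200):
    # Mean squared deviation of the mini-batch gradient from the full one.
    errs = [np.sum((batch_grad(rng.choice(N, batch_size, replace=False)) - full) ** 2)
            for _ in range(trials)]
    return float(np.mean(errs))

noise_small = grad_noise(8)
noise_large = grad_noise(512)
assert noise_small > noise_large  # larger batches are less noisy
```

For uniform sampling, the deviation shrinks roughly in proportion to one over the batch size, which is why the table above lists noise and cost moving in opposite directions.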

Stochastic Gradient Descent

The simplest stochastic optimizer is stochastic gradient descent, usually called SGD.

initialize theta

for each step t:
    batch = sample_batch(data)
    loss = L(theta, batch)
    grad = gradient(loss, theta)
    theta = theta - eta_t * grad

Automatic differentiation computes grad. The optimizer only consumes the resulting tensors.

SGD is attractive because it has little state. Each parameter update requires the parameter value, its gradient, and the current learning rate. This makes SGD memory-efficient and easy to implement.
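The pseudocode below can be made concrete in a few lines. Here is a minimal sketch of SGD on a least-squares problem in plain numpy; the problem setup, batch size, and learning rate are all illustrative choices, not a framework API:

```python
import numpy as np

# Minimal SGD sketch: least squares with a known true weight vector.
rng = np.random.default_rng(1)
N, d = 2_000, 3
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(N, d))
y = X @ true_w

w = np.zeros(d)
eta = 0.1
for t in range(500):
    idx = rng.choice(N, size=32, replace=False)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / len(idx)        # mini-batch gradient
    w -= eta * grad                               # SGD update

assert np.allclose(w, true_w, atol=1e-2)
```

The only optimizer state here is `w` itself, which is exactly the memory-efficiency point made above.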

For a parameter tensor $w$, the update is:

$$w \leftarrow w - \eta_t \nabla_w L_B.$$

The gradient is noisy because $L_B$ uses a subset of the data. This noise can help escape some narrow or unstable regions, but it can also slow convergence.

Learning Rate Schedules

With noisy gradients, a fixed learning rate may fail to converge to a stable point. Classical stochastic approximation often uses decreasing learning rates. A simple schedule is:

$$\eta_t = \frac{\eta_0}{1 + kt}.$$

Another common form is step decay:

$$\eta_t = \eta_0 \, \gamma^{\lfloor t / s \rfloor},$$

where $s$ is the decay interval and $0 < \gamma < 1$.

Modern neural network training often uses warmup, cosine decay, or piecewise schedules. Warmup starts with a small learning rate and increases it during early training. This is useful when large models, normalization layers, adaptive optimizers, or mixed precision make early updates unstable.
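These schedules are small pure functions of the step count. A sketch, with illustrative values for $\eta_0$, $k$, $\gamma$, $s$, and the warmup length:

```python
import math

# Illustrative learning-rate schedules as functions of the step t.
def inverse_decay(t, eta0=0.1, k=0.01):
    # eta_t = eta0 / (1 + k t)
    return eta0 / (1 + k * t)

def step_decay(t, eta0=0.1, gamma=0.5, s=1000):
    # eta_t = eta0 * gamma^(floor(t / s))
    return eta0 * gamma ** (t // s)

def warmup_cosine(t, eta0=0.1, warmup=100, total=10_000):
    # Linear warmup to eta0, then cosine decay to zero.
    if t < warmup:
        return eta0 * t / warmup
    progress = (t - warmup) / (total - warmup)   # in [0, 1]
    return eta0 * 0.5 * (1 + math.cos(math.pi * progress))

assert inverse_decay(0) == 0.1
assert step_decay(2500) == 0.025
assert warmup_cosine(100) == 0.1
```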

A schedule belongs to the optimizer layer, not the AD layer. The AD system should compute gradients for the current loss. It should not encode training policy.

Momentum

SGD can oscillate in directions of high curvature. Momentum reduces this effect by maintaining a velocity buffer:

$$v_{t+1} = \mu v_t + g_t,$$
$$\theta_{t+1} = \theta_t - \eta v_{t+1}.$$

Here, $0 \le \mu < 1$ controls how much past gradient information is retained.

Momentum smooths noisy gradients and accelerates progress along directions where gradients are consistent. It adds one auxiliary tensor per parameter, with the same shape as the parameter.

The AD system remains unchanged. It still computes $g_t$. Momentum changes how $g_t$ is transformed into a parameter update.
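A single momentum step, following the two equations above, can be sketched as follows (hyperparameter values are illustrative):

```python
import numpy as np

# Momentum update sketch: one velocity buffer per parameter tensor.
def momentum_step(theta, v, grad, eta=0.1, mu=0.9):
    v = mu * v + grad           # accumulate velocity from past gradients
    theta = theta - eta * v     # update with the velocity, not the raw gradient
    return theta, v

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)        # the auxiliary tensor mentioned above
grad = np.array([0.5, 0.5])
theta, v = momentum_step(theta, v, grad)
assert np.allclose(v, [0.5, 0.5])
assert np.allclose(theta, [0.95, -2.05])
```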

Adaptive Methods

Adaptive optimizers maintain statistics of past gradients. A common design keeps a moving average of gradients and squared gradients:

$$m_{t+1} = \beta_1 m_t + (1 - \beta_1) g_t,$$
$$v_{t+1} = \beta_2 v_t + (1 - \beta_2) g_t^2.$$

The update scales each coordinate by the estimated gradient magnitude:

$$\theta_{t+1} = \theta_t - \eta \, \frac{m_{t+1}}{\sqrt{v_{t+1}} + \epsilon}.$$

This is the basic shape behind Adam-like methods.

Adaptive methods are useful when parameters have gradients with very different scales. They are also convenient because they often require less learning-rate tuning than plain SGD.

The cost is additional optimizer state. Adam-style methods commonly store two extra tensors per parameter. For large models, this state can dominate memory use.
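A sketch of one such update, directly following the moment equations above (full Adam also applies bias correction to $m$ and $v$, which this simplified version omits; hyperparameter values are illustrative):

```python
import numpy as np

# Adam-style update sketch (no bias correction).
def adam_step(theta, m, v, grad, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    theta = theta - eta * m / (np.sqrt(v) + eps)
    return theta, m, v

theta = np.zeros(2)
m = np.zeros(2)     # extra state tensor 1
v = np.zeros(2)     # extra state tensor 2
# Coordinates with very different gradient scales get similar step sizes.
grad = np.array([100.0, 0.01])
theta, m, v = adam_step(theta, m, v, grad)
assert abs(theta[0] - theta[1]) < 1e-3  # per-coordinate normalization
```

The two buffers `m` and `v` are the per-parameter state whose memory cost the text describes.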

Stochasticity Sources

Stochastic optimization receives noise from several places.

Sampling noise comes from mini-batch selection. Different batches produce different gradients.

Model noise may come from dropout, randomized layers, data augmentation, or stochastic routing.

Hardware noise may come from nondeterministic parallel reductions, mixed precision arithmetic, and distributed communication.

Data pipeline noise may come from shuffling, asynchronous loading, and randomized preprocessing.

AD differentiates the executed computation. If the forward pass includes random choices, the derivative corresponds to the sampled execution path, unless the system uses a special estimator or reparameterization.

Differentiating Randomness

Randomness is subtle in differentiable programs.

Consider:

$$z \sim p(z), \qquad L(\theta) = \ell(f_\theta(z)).$$

If $z$ does not depend on $\theta$, then AD can differentiate through $f_\theta(z)$ for the sampled value of $z$. The sample is treated as an input value.

If the sampling distribution depends on θ\theta, ordinary pathwise AD may not apply directly. For example:

$$z \sim p_\theta(z).$$

The derivative of an expectation requires more care:

$$\nabla_\theta \, \mathbb{E}_{z \sim p_\theta}[\ell(z)].$$

Some cases use reparameterization. We rewrite sampling as a deterministic transformation of parameter-free noise:

$$z = h_\theta(\epsilon), \qquad \epsilon \sim p(\epsilon).$$

Then AD can differentiate through $h_\theta$.
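As a small numeric check of the reparameterized path (the Gaussian example is ours): for $z \sim \mathcal{N}(\theta, 1)$ and $\ell(z) = z^2$, the true gradient of the expectation is $2\theta$, and the pathwise estimator recovers it by writing $z = \theta + \epsilon$:

```python
import numpy as np

# Reparameterization sketch: estimate d/dtheta E_{z ~ N(theta,1)}[z^2].
rng = np.random.default_rng(2)
theta = 1.5
eps = rng.normal(size=100_000)   # parameter-free noise
z = theta + eps                  # z = h_theta(eps), with dz/dtheta = 1
# Pathwise derivative of z^2: 2 * z * dz/dtheta = 2 * z.
grad_est = np.mean(2 * z)
assert abs(grad_est - 2 * theta) < 0.05  # true gradient is 2*theta = 3.0
```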

Other cases use score-function estimators, such as REINFORCE-style gradients. These estimators are often high variance and belong more to probabilistic optimization than ordinary reverse mode AD.

Gradient Accumulation

Gradient accumulation is used when the desired batch size does not fit in memory.

Instead of one large batch, we use several micro-batches:

zero gradients

for each micro-batch:
    loss = L(theta, micro_batch) / num_micro_batches
    grad = backward(loss)
    accumulate grad

optimizer step
zero gradients

Dividing the loss by the number of micro-batches keeps the accumulated gradient equal to the average gradient over the effective batch.

This pattern relies on the fact that reverse mode adds contributions into gradient buffers. Accumulation is therefore a controlled use of mutable gradient state.
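The equality between accumulated micro-batch gradients and the full effective-batch gradient can be verified directly. A sketch for a mean-reduced least-squares loss (sizes are illustrative):

```python
import numpy as np

# Check: accumulating scaled micro-batch gradients reproduces the
# gradient of the full effective batch.
rng = np.random.default_rng(3)
X = rng.normal(size=(64, 4))
y = rng.normal(size=64)
w = rng.normal(size=4)

def grad(Xb, yb):
    # Gradient of 0.5 * mean((Xb w - yb)^2).
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y)

num_micro = 4
acc = np.zeros_like(w)
for Xm, ym in zip(np.split(X, num_micro), np.split(y, num_micro)):
    acc += grad(Xm, ym) / num_micro   # loss divided by num_micro_batches

assert np.allclose(acc, full)
```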

Distributed Stochastic Optimization

In distributed training, several workers compute gradients on different mini-batches. A common synchronous pattern is:

for each worker:
    compute local gradient

average gradients across workers

for each worker:
    apply same optimizer update

If there are $K$ workers, each with local batch size $b$, the effective batch size is $Kb$.

The AD system computes local gradients. The distributed runtime handles communication, usually through an all-reduce operation. The optimizer then uses the averaged gradient.
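The averaging step has the same structure as gradient accumulation. A single-process sketch (the mean over a list of local gradients stands in for the all-reduce):

```python
import numpy as np

# Synchronous data parallelism: averaging K local gradients equals the
# gradient of the combined batch of size K*b, for a mean-reduced loss.
rng = np.random.default_rng(4)
K, b, d = 4, 8, 3
w = rng.normal(size=d)
X = rng.normal(size=(K * b, d))
y = rng.normal(size=K * b)

def grad(Xb, yb):
    # Gradient of 0.5 * mean((Xb w - yb)^2).
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Each "worker" computes a gradient on its own shard of the data.
local = [grad(X[k*b:(k+1)*b], y[k*b:(k+1)*b]) for k in range(K)]
averaged = np.mean(local, axis=0)   # stands in for all-reduce, then divide by K

assert np.allclose(averaged, grad(X, y))
```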

Distributed stochastic optimization adds several systems concerns:

| Concern | Effect |
| --- | --- |
| Communication latency | Slows each step |
| Gradient bandwidth | Limits model scale |
| Stragglers | Delay synchronous updates |
| Nondeterminism | Complicates reproducibility |
| Large batches | May require learning-rate adjustment |

The derivative math is unchanged, but the execution model changes substantially.

AD Boundary

Automatic differentiation provides the gradient estimator for the sampled computation. It does not decide whether the estimator has low variance, whether the learning rate is stable, or whether the sampling scheme is statistically valid.

This boundary is useful:

AD layer:
    compute gradients of the current program execution

optimizer layer:
    transform gradients into parameter updates

training system:
    choose batches, schedules, precision, distribution, and logging

Keeping these layers separate makes AD engines reusable. The same reverse-mode engine can support SGD, momentum, Adam, distributed training, gradient accumulation, and custom optimizers.

Practical Checklist

For stochastic optimization, the main implementation checks are straightforward.

The loss scale should match the intended gradient scale. Averaging over a batch gives gradients that are mostly independent of batch size. Summing over a batch makes gradients grow with batch size.
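The mean-versus-sum distinction is easy to demonstrate (a tiny illustrative example of ours):

```python
import numpy as np

# Mean-reduced losses give batch-size-independent gradient scale;
# sum-reduced losses scale with batch size.
X = np.ones((8, 2))
y = np.arange(8.0)
w = np.zeros(2)
resid = X @ w - y
grad_mean = X.T @ resid / len(y)   # gradient of 0.5 * mean(resid**2)
grad_sum = X.T @ resid             # gradient of 0.5 * sum(resid**2)
assert np.allclose(grad_sum, len(y) * grad_mean)  # sum is batch-size times larger
```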

Gradient buffers should be cleared at the right time. Clearing too early drops contributions. Clearing too late accumulates unintended gradients.

Random seeds should be controlled when reproducibility matters. Some GPU kernels may remain nondeterministic even with fixed seeds.

Optimizer state should be checkpointed with model parameters. Restarting Adam without its moment buffers changes the training trajectory.

Learning-rate schedules should count the intended unit: step, token, example, epoch, or optimizer update. In gradient accumulation, micro-batches and optimizer steps differ.

Stochastic optimization is where AD becomes part of a larger training system. The derivative computation is local and mechanical. The behavior of the optimizer depends on sampling, state, scheduling, numerical precision, and hardware execution.