Stochastic optimization studies optimization when the objective is accessed through samples, noisy estimates, or partial observations. In machine learning, this is the normal case. We do not usually optimize the loss over the entire data distribution. We optimize an empirical approximation, and even that empirical loss is usually accessed through mini-batches.
The ideal objective is often written as an expectation:

$$L(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\left[\ell(\theta, x)\right]$$

The distribution $\mathcal{D}$ is unknown. We only have samples. With a dataset of $N$ examples, we approximate the objective by the empirical risk:

$$\hat{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\theta, x_i)$$

Full-batch gradient descent computes $\nabla_\theta \hat{L}(\theta)$. Stochastic optimization replaces this full gradient with an estimator computed from one example or a mini-batch.

For a batch $B$, the mini-batch gradient is

$$g_t = \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \ell(\theta_t, x_i)$$

The update becomes

$$\theta_{t+1} = \theta_t - \eta_t \, g_t$$

The subscript on $\eta_t$ matters. In stochastic optimization, the learning rate often changes over time.
Unbiased Gradient Estimates
If the batch $B$ is sampled uniformly from the dataset, then the mini-batch gradient is an unbiased estimator of the full empirical gradient:

$$\mathbb{E}_B\left[g_t\right] = \nabla_\theta \hat{L}(\theta_t)$$
This does not mean every update moves in the best direction. It means the average update direction is correct relative to the empirical objective.
The variance of the estimator decreases as batch size increases. Small batches give noisy gradients. Large batches give more accurate gradients but cost more per update.
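As a quick sanity check, here is a small NumPy simulation (the least-squares data, batch sizes, and sample counts are illustrative choices) showing that the mini-batch gradient matches the full gradient on average while its variance shrinks as the batch grows:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                              # toy inputs
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)    # toy targets
theta = np.zeros(5)

def full_grad(theta):
    # gradient of the mean squared error over the whole dataset
    return 2 * X.T @ (X @ theta - y) / len(X)

def batch_grad(theta, idx):
    # gradient of the same loss over one sampled mini-batch
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ theta - yb) / len(idx)

for b in (4, 32, 256):
    draws = np.stack([
        batch_grad(theta, rng.choice(len(X), size=b, replace=False))
        for _ in range(2000)
    ])
    bias = np.linalg.norm(draws.mean(axis=0) - full_grad(theta))
    variance = draws.var(axis=0).mean()
    print(f"batch={b:4d}  bias~{bias:.4f}  variance~{variance:.4f}")
```

The bias stays near zero for every batch size, while the variance drops roughly in proportion to the batch size.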
This creates a central tradeoff:
| Batch size | Gradient noise | Update cost | Hardware use |
|---|---|---|---|
| Small | High | Low | Often poor |
| Medium | Moderate | Moderate | Good |
| Large | Low | High | Often excellent |
The best batch size depends on the model, hardware, optimizer, memory capacity, and data pipeline.
Stochastic Gradient Descent
The simplest stochastic optimizer is stochastic gradient descent, usually called SGD.
```
initialize theta
for each step t:
    batch = sample_batch(data)
    loss = L(theta, batch)
    grad = gradient(loss, theta)
    theta = theta - eta_t * grad
```

Automatic differentiation computes grad. The optimizer only consumes the resulting tensors.
SGD is attractive because it has little state. Each parameter update requires the parameter value, its gradient, and the current learning rate. This makes SGD memory-efficient and easy to implement.
For a parameter tensor $\theta$, the update is:

$$\theta \leftarrow \theta - \eta_t \, g_t$$

The gradient is noisy because $g_t$ uses a subset of the data. This noise can help escape some narrow or unstable regions, but it can also slow convergence.
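A minimal runnable sketch of this loop on a toy least-squares problem, with the gradient written out by hand where a real system would call AD (the dataset, batch size, and constant learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.05 * rng.normal(size=512)

theta = np.zeros(3)
for t in range(500):
    eta_t = 0.1                                        # constant rate for simplicity
    idx = rng.choice(len(X), size=32, replace=False)   # sample_batch(data)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ theta - yb) / len(idx)     # gradient(loss, theta), by hand
    theta = theta - eta_t * grad                       # SGD update

print(theta)   # close to [1.0, -2.0, 0.5]
```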
Learning Rate Schedules
With noisy gradients, a fixed learning rate may fail to converge to a stable point. Classical stochastic approximation often uses decreasing learning rates. A simple schedule is:

$$\eta_t = \frac{\eta_0}{1 + kt}$$

Another common form is step decay:

$$\eta_t = \eta_0 \, \gamma^{\lfloor t / s \rfloor}$$

where $s$ is the decay interval and $0 < \gamma < 1$.
Modern neural network training often uses warmup, cosine decay, or piecewise schedules. Warmup starts with a small learning rate and increases it during early training. This is useful when large models, normalization layers, adaptive optimizers, or mixed precision make early updates unstable.
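A sketch of one such schedule as a pure function of the step index: linear warmup followed by cosine decay. The parameter names and defaults are illustrative, not a reference to any particular library:

```python
import math

def lr_schedule(step, base_lr=1e-3, warmup_steps=1_000, total_steps=100_000):
    """Linear warmup followed by cosine decay to zero (illustrative shape)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps                # warmup phase
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)                                 # clamp past the end
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

for s in (0, 500, 1_000, 50_000, 100_000):
    print(s, lr_schedule(s))
```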
A schedule belongs to the optimizer layer, not the AD layer. The AD system should compute gradients for the current loss. It should not encode training policy.
Momentum
SGD can oscillate in directions of high curvature. Momentum reduces this effect by maintaining a velocity buffer:

$$v_{t+1} = \mu \, v_t + g_t$$
$$\theta_{t+1} = \theta_t - \eta_t \, v_{t+1}$$

Here, $\mu$ controls how much past gradient information is retained.
Momentum smooths noisy gradients and accelerates progress along directions where gradients are consistent. It adds one auxiliary tensor per parameter, with the same shape as the parameter.
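In code, momentum adds only a few lines around the plain SGD update. A sketch, assuming the gradient `grad` has already been produced by AD; the hyperparameters are illustrative:

```python
import numpy as np

def momentum_step(theta, velocity, grad, eta=0.01, mu=0.9):
    """One momentum update; velocity has the same shape as theta."""
    velocity = mu * velocity + grad          # retain a fraction mu of past gradients
    theta = theta - eta * velocity           # step along the smoothed direction
    return theta, velocity

theta = np.zeros(4)
velocity = np.zeros_like(theta)              # the single extra buffer per parameter
grad = np.array([0.5, -0.2, 0.1, 0.0])       # stand-in for a gradient from AD
theta, velocity = momentum_step(theta, velocity, grad)
```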
The AD system remains unchanged. It still computes $g_t$. Momentum changes how $g_t$ is transformed into a parameter update.
Adaptive Methods
Adaptive optimizers maintain statistics of past gradients. A common design keeps a moving average of gradients and squared gradients:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2$$

The update scales each coordinate by the estimated gradient magnitude:

$$\theta_{t+1} = \theta_t - \eta_t \, \frac{m_t}{\sqrt{v_t} + \epsilon}$$
This is the basic shape behind Adam-like methods.
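A sketch of the update above, with bias correction omitted to keep the basic shape visible; the hyperparameter values are the usual illustrative defaults, not a claim about any specific implementation:

```python
import numpy as np

def adam_like_step(theta, m, v, grad, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One adaptive step; m and v are the two extra state tensors per parameter."""
    m = beta1 * m + (1 - beta1) * grad             # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad**2          # moving average of squared gradients
    theta = theta - eta * m / (np.sqrt(v) + eps)   # per-coordinate scaled update
    return theta, m, v

theta = np.zeros(4)
m = np.zeros_like(theta)
v = np.zeros_like(theta)
grad = np.array([0.5, -0.2, 0.1, 0.0])             # stand-in for a gradient from AD
theta, m, v = adam_like_step(theta, m, v, grad)
```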
Adaptive methods are useful when parameters have gradients with very different scales. They are also convenient because they often require less learning-rate tuning than plain SGD.
The cost is additional optimizer state. Adam-style methods commonly store two extra tensors per parameter. For large models, this state can dominate memory use.
Stochasticity Sources
Stochastic optimization receives noise from several places.
Sampling noise comes from mini-batch selection. Different batches produce different gradients.
Model noise may come from dropout, randomized layers, data augmentation, or stochastic routing.
Hardware noise may come from nondeterministic parallel reductions, mixed precision arithmetic, and distributed communication.
Data pipeline noise may come from shuffling, asynchronous loading, and randomized preprocessing.
AD differentiates the executed computation. If the forward pass includes random choices, the derivative corresponds to the sampled execution path, unless the system uses a special estimator or reparameterization.
Differentiating Randomness
Randomness is subtle in differentiable programs.
Consider:

$$z \sim p(z), \qquad y = f(\theta, z)$$

If $p$ does not depend on $\theta$, then AD can differentiate through $f$ for the sampled value of $z$. The sample is treated as an input value.

If the sampling distribution depends on $\theta$, ordinary pathwise AD may not apply directly. For example:

$$z \sim p_\theta(z), \qquad y = f(z)$$

The derivative of an expectation requires more care:

$$\nabla_\theta \, \mathbb{E}_{z \sim p_\theta}\left[f(z)\right]$$

Some cases use reparameterization. We rewrite sampling as a deterministic transformation of parameter-free noise:

$$z = g(\theta, \epsilon), \qquad \epsilon \sim p(\epsilon)$$

Then AD can differentiate through $g$.
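A NumPy sketch of the Gaussian case: sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ is rewritten as $z = \mu + \sigma\epsilon$ with parameter-free $\epsilon$, and the pathwise derivative is taken through that transformation (the objective $f$ and the analytic check are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.2                     # distribution parameters we differentiate

def f(z):
    return z**2                           # illustrative objective

# Reparameterize z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1).
# The pathwise derivative is df/dmu = f'(z) * dz/dmu = 2z * 1, which AD would
# produce automatically by differentiating through the transformation.
eps = rng.normal(size=100_000)
z = mu + sigma * eps
grad_mu_estimate = (2 * z).mean()

print(grad_mu_estimate)                   # analytic value: d/dmu E[z^2] = 2*mu = 1.0
```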
Other cases use score-function estimators, such as REINFORCE-style gradients. These estimators are often high variance and belong more to probabilistic optimization than ordinary reverse mode AD.
Gradient Accumulation
Gradient accumulation is used when the desired batch size does not fit in memory.
Instead of one large batch, we use several micro-batches:
```
zero gradients
for each micro-batch:
    loss = L(theta, micro_batch) / num_micro_batches
    grad = backward(loss)
    accumulate grad
optimizer step
zero gradients
```

Dividing the loss by the number of micro-batches keeps the accumulated gradient equal to the average gradient over the effective batch.
This pattern relies on the fact that reverse mode adds contributions into gradient buffers. Accumulation is therefore a controlled use of mutable gradient state.
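A NumPy check of the scaling argument: accumulating micro-batch gradients, each computed from a loss divided by the number of micro-batches, reproduces the gradient of the full batch (toy data, hand-written gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = rng.normal(size=256)
theta = rng.normal(size=3)
num_micro_batches = 4

def grad_mse(Xb, yb):
    # gradient of the mean squared error on a batch, written out by hand
    return 2 * Xb.T @ (Xb @ theta - yb) / len(Xb)

full = grad_mse(X, y)                              # one large-batch gradient

accum = np.zeros_like(theta)                       # "zero gradients"
for Xm, ym in zip(np.split(X, num_micro_batches), np.split(y, num_micro_batches)):
    accum += grad_mse(Xm, ym) / num_micro_batches  # loss / num_micro_batches

print(np.allclose(full, accum))                    # True
```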
Distributed Stochastic Optimization
In distributed training, several workers compute gradients on different mini-batches. A common synchronous pattern is:
```
for each worker:
    compute local gradient
average gradients across workers
for each worker:
    apply same optimizer update
```

If there are $W$ workers, each with local batch size $b$, the effective batch size is $Wb$.
The AD system computes local gradients. The distributed runtime handles communication, usually through an all-reduce operation. The optimizer then uses the averaged gradient.
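A toy simulation of this synchronous pattern in plain NumPy, with the all-reduce replaced by an in-process average; the worker count, batch size, and data are illustrative and no real communication library is involved:

```python
import numpy as np

num_workers, local_batch = 4, 16
true_theta = np.array([1.0, -1.0, 0.5])
theta = np.zeros(3)

def local_gradient(worker_seed, theta):
    # each worker draws its own local mini-batch and computes a local gradient
    wrng = np.random.default_rng(worker_seed)
    Xb = wrng.normal(size=(local_batch, 3))
    yb = Xb @ true_theta
    return 2 * Xb.T @ (Xb @ theta - yb) / local_batch

# "all-reduce": average per-worker gradients; effective batch = 4 * 16 = 64
grads = [local_gradient(seed, theta) for seed in range(num_workers)]
avg_grad = np.mean(grads, axis=0)

theta = theta - 0.1 * avg_grad          # every worker applies the same update
```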
Distributed stochastic optimization adds several systems concerns:
| Concern | Effect |
|---|---|
| Communication latency | Slows each step |
| Gradient bandwidth | Limits model scale |
| Stragglers | Delay synchronous updates |
| Nondeterminism | Complicates reproducibility |
| Large batches | May require learning-rate adjustment |
The derivative math is unchanged, but the execution model changes substantially.
AD Boundary
Automatic differentiation provides the gradient estimator for the sampled computation. It does not decide whether the estimator has low variance, whether the learning rate is stable, or whether the sampling scheme is statistically valid.
This boundary is useful:
```
AD layer:
    compute gradients of the current program execution

optimizer layer:
    transform gradients into parameter updates

training system:
    choose batches, schedules, precision, distribution, and logging
```

Keeping these layers separate makes AD engines reusable. The same reverse-mode engine can support SGD, momentum, Adam, distributed training, gradient accumulation, and custom optimizers.
Practical Checklist
For stochastic optimization, the main implementation checks are straightforward.
The loss scale should match the intended gradient scale. Averaging over a batch gives gradients whose magnitude is roughly independent of batch size. Summing over a batch makes gradient magnitude grow with batch size.
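A quick numerical check of this scaling claim on toy data (hand-written least-squares gradients, arbitrary batch sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([1.0, -1.0])

for b in (8, 64, 512):
    Xb = rng.normal(size=(b, 2))
    yb = rng.normal(size=b)
    residual = Xb @ theta - yb
    grad_mean = 2 * Xb.T @ residual / b   # gradient of the batch-averaged loss
    grad_sum = 2 * Xb.T @ residual        # gradient of the batch-summed loss
    print(f"batch={b:4d}  mean-loss |grad|~{np.linalg.norm(grad_mean):.2f}"
          f"  sum-loss |grad|~{np.linalg.norm(grad_sum):.2f}")
```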
Gradient buffers should be cleared at the right time. Clearing too early drops contributions. Clearing too late accumulates unintended gradients.
Random seeds should be controlled when reproducibility matters. Some GPU kernels may remain nondeterministic even with fixed seeds.
Optimizer state should be checkpointed with model parameters. Restarting Adam without its moment buffers changes the training trajectory.
Learning-rate schedules should count the intended unit: step, token, example, epoch, or optimizer update. In gradient accumulation, micro-batches and optimizer steps differ.
Stochastic optimization is where AD becomes part of a larger training system. The derivative computation is local and mechanical. The behavior of the optimizer depends on sampling, state, scheduling, numerical precision, and hardware execution.