Gradient Descent
Gradient descent is the basic optimization procedure behind much of modern machine learning. It is simple enough to state in one line, but rich enough to expose many of the main issues in automatic differentiation: representation of objectives, computation of gradients, floating point behavior, memory use, batching, and convergence.
Given a differentiable scalar objective

$$L(\theta), \qquad \theta \in \mathbb{R}^n,$$

gradient descent updates the parameter vector by moving against the gradient:

$$\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t).$$

Here, $\eta$ is the learning rate. The gradient $\nabla L(\theta)$ points in the direction of steepest local increase of the objective. Moving in the opposite direction gives the steepest local decrease under the Euclidean geometry of the parameter space.
Automatic differentiation enters because real objectives are rarely written as closed-form formulas. They are usually programs.
A model computes a prediction:

$$\hat{y} = f(x; \theta),$$

a loss function compares prediction and target:

$$\ell = \mathrm{loss}(\hat{y}, y),$$

and the optimizer needs

$$\nabla_\theta \, \ell = \frac{\partial \ell}{\partial \theta}.$$
The central service provided by AD is to compute this derivative from the executed program, not from a manually derived symbolic expression.
Objective Functions
In machine learning, the objective is usually an empirical risk:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i; \theta),\, y_i\big).$$
The parameters may contain millions or billions of scalar values. The output is scalar. This shape matters. Reverse mode AD is especially effective for scalar-output, high-dimensional-input functions, because it computes the full gradient of one scalar loss with cost comparable to a small multiple of the original program evaluation.
For a model with many parameters and one scalar loss, forward mode would need one tangent direction per parameter to obtain the full gradient. Reverse mode propagates one adjoint from the loss backward through the computation graph and accumulates derivatives for all parameters.
This is why reverse mode AD is the standard mode for neural network training.
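The cost asymmetry can be made concrete. Forward mode delivers one directional derivative per pass, so recovering the full gradient of $f : \mathbb{R}^n \to \mathbb{R}$ takes $n$ passes. In this sketch, finite differences stand in for forward-mode tangents, and the objective is made up:

```python
import numpy as np

def f(theta):
    # A made-up scalar objective with many inputs and one output.
    return np.sum(theta ** 2)

def directional_derivative(f, theta, v, h=1e-6):
    # One forward-mode pass yields one directional derivative,
    # approximated here by a finite difference.
    return (f(theta + h * v) - f(theta)) / h

n = 5
theta = np.arange(1.0, n + 1)
# Forward mode needs n passes, one basis direction per parameter.
grad_forward = np.array([
    directional_derivative(f, theta, np.eye(n)[i]) for i in range(n)
])
# Reverse mode would produce the whole gradient (here 2 * theta)
# in a single backward pass instead of n forward passes.
assert np.allclose(grad_forward, 2.0 * theta, atol=1e-3)
```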
Local Linear Approximation
Gradient descent is based on the first-order Taylor approximation:

$$L(\theta + \Delta\theta) \approx L(\theta) + \nabla L(\theta)^{\top} \Delta\theta.$$

Choosing

$$\Delta\theta = -\eta \, \nabla L(\theta)$$

gives

$$L\big(\theta - \eta \nabla L(\theta)\big) \approx L(\theta) - \eta \, \|\nabla L(\theta)\|^2.$$

For sufficiently small $\eta$, this predicts a decrease in the loss whenever the gradient is nonzero. This argument is local. It says why the update direction is reasonable near the current point. It does not guarantee that a large step will decrease the objective.
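The first-order prediction can be checked numerically. For a small step, the actual decrease in the loss should match $\eta \|\nabla L(\theta)\|^2$ closely. The objective here is a made-up smooth function:

```python
import numpy as np

def L(theta):
    # A made-up smooth scalar objective.
    return np.sum(theta ** 2) + np.sin(theta[0])

def grad_L(theta):
    g = 2.0 * theta
    g[0] += np.cos(theta[0])
    return g

theta = np.array([1.0, -2.0])
eta = 1e-4
g = grad_L(theta)

actual_drop = L(theta) - L(theta - eta * g)
predicted_drop = eta * np.dot(g, g)   # first-order prediction

# For a small step, the two agree to high relative accuracy.
assert abs(actual_drop - predicted_drop) / predicted_drop < 1e-3
```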
The learning rate controls how much trust we place in this local linear model.
Learning Rate
The learning rate is the most important scalar hyperparameter in gradient descent.
If $\eta$ is too small, optimization progresses slowly. If $\eta$ is too large, the update may overshoot useful regions and cause divergence. The same gradient can be either helpful or destructive depending on the scale of the step.
For a simple quadratic objective,

$$L(\theta) = \frac{a}{2}\,\theta^2, \qquad a > 0,$$

the gradient is

$$\nabla L(\theta) = a\,\theta.$$

Gradient descent gives

$$\theta_{t+1} = \theta_t - \eta\, a\, \theta_t = (1 - \eta a)\,\theta_t.$$

The iteration converges when

$$|1 - \eta a| < 1,$$

which means

$$0 < \eta < \frac{2}{a}.$$
This small example shows a general principle: directions with high curvature require smaller steps. In high-dimensional problems, different directions may have very different curvature. A single scalar learning rate then becomes a compromise.
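The stability threshold $\eta < 2/a$ is easy to observe directly on the one-dimensional quadratic:

```python
# Gradient descent on L(theta) = (a/2) * theta^2, where the gradient
# is a * theta. Each step multiplies theta by (1 - eta * a), so the
# iterates converge iff |1 - eta * a| < 1, i.e. 0 < eta < 2 / a.
def run(a, eta, theta=1.0, steps=50):
    for _ in range(steps):
        theta -= eta * a * theta
    return theta

a = 10.0                              # curvature of the quadratic
assert abs(run(a, eta=0.05)) < 1e-6   # eta < 2/a: converges
assert abs(run(a, eta=0.21)) > 1.0    # eta > 2/a: diverges
```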
Full-Batch Gradient Descent
Full-batch gradient descent computes the gradient over the entire training set:

$$\nabla L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \, \ell\big(f(x_i; \theta),\, y_i\big).$$
This gradient is deterministic for a fixed dataset and parameter vector. It gives a clean mathematical object, but it is often expensive. Large datasets make full-batch updates impractical because every update requires a pass over all examples.
Full-batch gradient descent is common in small numerical optimization problems, classical regression, and some scientific computing workloads. It is less common in large-scale neural network training.
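A small classical-regression case makes the full-batch loop concrete. This is a self-contained sketch on a noiseless least-squares problem; every update touches all 200 examples:

```python
import numpy as np

# Noiseless least-squares data (made up for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_theta = np.array([3.0, -1.0])
y = X @ true_theta

theta = np.zeros(2)
eta = 0.1
for _ in range(500):
    # Deterministic gradient of the mean squared error over ALL examples.
    grad = 2.0 * X.T @ (X @ theta - y) / len(X)
    theta -= eta * grad

# With noiseless data, gradient descent recovers the true parameters.
assert np.allclose(theta, true_theta, atol=1e-4)
```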
Mini-Batch Gradient Descent
Modern machine learning usually uses mini-batches. Instead of computing the loss over all examples, we sample a batch $B \subset \{1, \dots, N\}$:

$$L_B(\theta) = \frac{1}{|B|} \sum_{i \in B} \ell\big(f(x_i; \theta),\, y_i\big).$$

Then we update using

$$\theta_{t+1} = \theta_t - \eta \, \nabla L_B(\theta_t).$$
The mini-batch gradient is an estimate of the full gradient. It introduces noise, but reduces computation per update and improves hardware utilization. Matrix and tensor operations over batches map well to GPUs and TPUs.
From the perspective of AD, mini-batching changes the executed program. The computation graph contains batched tensor operations rather than scalar operations per example. Reverse mode then computes gradients for the batched loss in one backward pass.
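That the mini-batch gradient is a noisy but unbiased estimate of the full gradient can be verified empirically: averaging many mini-batch gradients recovers the full-batch gradient. The dataset below is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
theta = np.zeros(3)

def full_grad(theta):
    # Mean-squared-error gradient over the whole dataset.
    return 2.0 * X.T @ (X @ theta - y) / len(X)

def batch_grad(theta, idx):
    # Same gradient, restricted to one sampled mini-batch.
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)

# Averaging many mini-batch gradients approaches the full gradient.
est = np.mean(
    [batch_grad(theta, rng.choice(len(X), size=32, replace=False))
     for _ in range(5000)],
    axis=0,
)
assert np.allclose(est, full_grad(theta), atol=0.1)
```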
The Role of Reverse Mode AD
A typical training step has this structure:
```python
pred = model(x)
loss = loss_fn(pred, y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Conceptually, this performs four operations.
First, the forward pass computes intermediate values needed for the loss. Second, the AD system records enough information to differentiate the computation. Third, the backward pass propagates adjoints from the scalar loss back to the parameters. Fourth, the optimizer updates parameter storage using the accumulated gradients.
For a parameter $\theta_j$, reverse mode accumulates an adjoint:

$$\bar{\theta}_j = \frac{\partial \, \mathrm{loss}}{\partial \theta_j}.$$

The optimizer then applies an update such as

$$\theta_j \leftarrow \theta_j - \eta \, \bar{\theta}_j.$$
The AD system computes gradients. The optimizer decides how to use them. This separation is important. Gradient descent, momentum, Adam, and other methods can share the same AD engine.
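The separation can be sketched with one shared gradient routine feeding two different update rules. This is a minimal illustration with $L(\theta) = \theta^2$, not a real AD engine:

```python
# One gradient routine, shared by both optimizers.
def grad(theta):
    # dL/dtheta for the toy objective L(theta) = theta^2.
    return 2.0 * theta

def sgd_step(theta, g, eta=0.1):
    return theta - eta * g

def momentum_step(theta, g, v, eta=0.1, beta=0.9):
    v = beta * v + g            # accumulate a velocity term
    return theta - eta * v, v

theta_sgd, theta_mom, v = 1.0, 1.0, 0.0
for _ in range(500):
    theta_sgd = sgd_step(theta_sgd, grad(theta_sgd))
    theta_mom, v = momentum_step(theta_mom, grad(theta_mom), v)

# Both optimizers reach the minimum using the same gradient function.
assert abs(theta_sgd) < 1e-6 and abs(theta_mom) < 1e-6
```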
Parameters, Gradients, and Mutation
In implementation, model parameters are mutable arrays. Their gradients are usually stored in parallel buffers.
A parameter tensor may be stored as `weight.data`, and its gradient in a parallel buffer such as `weight.grad`. During the backward pass, operations add contributions into `weight.grad`. After the optimizer step, the gradient buffer must be cleared or overwritten before the next step. Otherwise, gradients from multiple batches accumulate unintentionally.
This behavior is useful when deliberate. For example, gradient accumulation simulates a larger batch by running several forward/backward passes before applying one optimizer step. But when accidental, it produces incorrect updates.
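Gradient accumulation can be sketched with explicit buffers. The gradient function below is a stand-in for a backward pass; the setup is hypothetical:

```python
import numpy as np

theta = np.array([1.0, 2.0])
grad_buf = np.zeros_like(theta)       # plays the role of weight.grad

def micro_batch_grad(theta):
    # Stand-in for one forward/backward pass on a micro-batch.
    return 2.0 * theta

accumulation_steps = 4
for _ in range(accumulation_steps):
    grad_buf += micro_batch_grad(theta)   # backward() adds in place

# One optimizer step using the averaged accumulated gradient.
theta -= 0.1 * grad_buf / accumulation_steps
grad_buf[:] = 0.0                         # zero_grad(): reset the buffer

assert np.allclose(theta, np.array([0.8, 1.6]))
assert np.all(grad_buf == 0.0)
```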
Differentiability Requirements
Gradient descent assumes a useful derivative exists. Many machine learning programs contain operations that are only piecewise differentiable. A standard example is the ReLU:

$$\mathrm{relu}(x) = \max(0, x).$$

Its derivative is

$$\mathrm{relu}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0. \end{cases}$$

At $x = 0$, the classical derivative is undefined. AD systems usually choose a convention, often returning one of the subgradient values. This allows optimization to proceed, but the result should be understood as a practical derivative rule, not as a smooth mathematical derivative.
Similar issues arise with clipping, absolute values, sorting, indexing, masking, and conditional branches.
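A common convention, shown here for the ReLU, is to return the subgradient value 0 at the kink. This sketch mirrors what many AD systems do:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Convention: pick the subgradient value 0 at x == 0.
    return (x > 0).astype(float)

xs = np.array([-1.0, 0.0, 2.0])
assert np.array_equal(relu(xs), np.array([0.0, 0.0, 2.0]))
# At x = 0 the classical derivative is undefined; the rule returns 0.
assert np.array_equal(relu_grad(xs), np.array([0.0, 0.0, 1.0]))
```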
Gradient Descent as Program Transformation
Suppose the training loss is represented as a program:

```python
loss = L(theta, batch)
```

Automatic differentiation constructs another program:

```python
grad = dL_dtheta(theta, batch)
```

Gradient descent then wraps both programs:

```python
grad = dL_dtheta(theta, batch)
theta = theta - eta * grad
```

This view is useful because it separates three layers:
| Layer | Responsibility |
|---|---|
| Model program | Computes predictions and loss |
| AD transform | Computes derivatives of the loss program |
| Optimizer | Updates parameters using derivatives |
A clean AD system should make this separation explicit. The derivative transform should not be tied to one optimizer. The optimizer should not need to know the internal graph representation used by AD.
Failure Modes
Gradient descent can fail even when gradients are computed correctly.
The objective may be poorly conditioned. Some directions may change the loss sharply while others change it slowly. A single learning rate then causes either slow progress or instability.
The gradient may vanish. This means the gradient norm becomes very small before the model reaches a useful solution. Deep networks with saturating nonlinearities historically suffered from this problem.
The gradient may explode. This means the gradient norm becomes very large, causing unstable updates. Recurrent networks and very deep models often require gradient clipping or normalization.
The loss landscape may contain saddle points, flat regions, sharp valleys, and many local minima. Gradient descent only uses local first-order information, so it has limited ability to reason globally.
The AD system does not solve these optimization problems. It provides accurate derivatives of the executed program. Optimization remains a separate numerical problem.
Minimal Algorithm
A basic gradient descent loop can be written as:
```
initialize theta
for each step:
    loss = L(theta)
    grad = gradient(loss, theta)
    theta = theta - eta * grad
```

For mini-batch training:
```
initialize theta
for each step:
    batch = next_batch(data)
    loss = L(theta, batch)
    grad = gradient(loss, theta)
    theta = theta - eta * grad
```

In tensor systems, `theta` is a collection of arrays, not one flat vector. The update is applied elementwise to each parameter tensor.
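A minimal sketch of the elementwise update over a parameter collection, with made-up shapes and values:

```python
import numpy as np

# Parameters as a collection of named arrays, not one flat vector.
params = {"W": np.ones((2, 2)), "b": np.zeros(2)}
grads  = {"W": 0.5 * np.ones((2, 2)), "b": np.array([1.0, -1.0])}
eta = 0.1

# The same update rule is applied elementwise to each tensor.
for name in params:
    params[name] -= eta * grads[name]

assert np.allclose(params["W"], 0.95)
assert np.allclose(params["b"], [-0.1, 0.1])
```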
AD Correctness vs Optimization Success
It is important to distinguish derivative correctness from training success.
An AD system may compute the exact derivative of the floating point program, yet training may still fail because the learning rate is wrong, the model is poorly initialized, the loss is badly scaled, the data is noisy, or the objective is ill-conditioned.
Conversely, a model may train successfully even when the gradients are approximate, clipped, quantized, or computed in reduced precision.
Automatic differentiation gives gradient descent its main input. It does not guarantee convergence. It turns a program into a differentiable numerical object, after which optimization methods operate under the usual limits of numerical analysis.