Gradient Descent
Gradient descent is the basic optimization procedure behind much of modern machine learning. It is simple enough to state in one line, but rich enough to expose many of the main issues in automatic differentiation: representation of objectives, computation of gradients, floating point behavior, memory use, batching, and convergence.
Given a differentiable scalar objective

$$L(\theta), \qquad \theta \in \mathbb{R}^n,$$

gradient descent updates the parameter vector by moving against the gradient:

$$\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t).$$

Here, $\eta$ is the learning rate. The gradient $\nabla L(\theta)$ points in the direction of steepest local increase of the objective. Moving in the opposite direction gives the steepest local decrease under the Euclidean geometry of the parameter space.
Automatic differentiation enters because real objectives are rarely written as closed-form formulas. They are usually programs.
A model computes a prediction:

$$\hat{y} = f(x; \theta),$$

a loss function compares prediction and target:

$$\ell = \mathrm{loss}(\hat{y}, y),$$

and the optimizer needs

$$\nabla_\theta \, \ell = \frac{\partial \ell}{\partial \theta}.$$
The central service provided by AD is to compute this derivative from the executed program, not from a manually derived symbolic expression.
Objective Functions
In machine learning, the objective is usually an empirical risk:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i; \theta),\, y_i\big).$$
The parameters may contain millions or billions of scalar values. The output is scalar. This shape matters. Reverse mode AD is especially effective for scalar-output, high-dimensional-input functions, because it computes the full gradient of one scalar loss with cost comparable to a small multiple of the original program evaluation.
For a model with many parameters and one scalar loss, forward mode would need one tangent direction per parameter to obtain the full gradient. Reverse mode propagates one adjoint from the loss backward through the computation graph and accumulates derivatives for all parameters.
This is why reverse mode AD is the standard mode for neural network training.
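The cost asymmetry can be made concrete. Forward mode delivers one directional derivative per pass, so recovering the full gradient of $f : \mathbb{R}^n \to \mathbb{R}$ takes $n$ passes. In this sketch, finite differences stand in for forward-mode tangents, and the objective is made up:

```python
import numpy as np

def f(theta):
    # A made-up scalar objective with many inputs and one output.
    return np.sum(theta ** 2)

def directional_derivative(f, theta, v, h=1e-6):
    # One forward-mode pass yields one directional derivative,
    # approximated here by a finite difference.
    return (f(theta + h * v) - f(theta)) / h

n = 5
theta = np.arange(1.0, n + 1)
# Forward mode needs n passes, one basis direction per parameter.
grad_forward = np.array([
    directional_derivative(f, theta, np.eye(n)[i]) for i in range(n)
])
# Reverse mode would produce the whole gradient (here 2 * theta)
# in a single backward pass instead of n forward passes.
assert np.allclose(grad_forward, 2.0 * theta, atol=1e-3)
```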
Local Linear Approximation
Gradient descent is based on the first-order Taylor approximation:

$$L(\theta + \Delta\theta) \approx L(\theta) + \nabla L(\theta)^{\top} \Delta\theta.$$

Choosing

$$\Delta\theta = -\eta \, \nabla L(\theta)$$

gives

$$L\big(\theta - \eta \nabla L(\theta)\big) \approx L(\theta) - \eta \, \|\nabla L(\theta)\|^2.$$

For sufficiently small $\eta$, this predicts a decrease in the loss whenever the gradient is nonzero. This argument is local. It says why the update direction is reasonable near the current point. It does not guarantee that a large step will decrease the objective.
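The first-order prediction can be checked numerically. For a small step, the actual decrease in the loss should match $\eta \|\nabla L(\theta)\|^2$ closely. The objective here is a made-up smooth function:

```python
import numpy as np

def L(theta):
    # A made-up smooth scalar objective.
    return np.sum(theta ** 2) + np.sin(theta[0])

def grad_L(theta):
    g = 2.0 * theta
    g[0] += np.cos(theta[0])
    return g

theta = np.array([1.0, -2.0])
eta = 1e-4
g = grad_L(theta)

actual_drop = L(theta) - L(theta - eta * g)
predicted_drop = eta * np.dot(g, g)   # first-order prediction

# For a small step, the two agree to high relative accuracy.
assert abs(actual_drop - predicted_drop) / predicted_drop < 1e-3
```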
The learning rate controls how much trust we place in this local linear model.
Learning Rate
The learning rate is the most important scalar hyperparameter in gradient descent.
If $\eta$ is too small, optimization progresses slowly. If $\eta$ is too large, the update may overshoot useful regions and cause divergence. The same gradient can be either helpful or destructive depending on the scale of the step.
For a simple quadratic objective,

$$L(\theta) = \frac{a}{2}\,\theta^2, \qquad a > 0,$$

the gradient is

$$\nabla L(\theta) = a\,\theta.$$

Gradient descent gives

$$\theta_{t+1} = \theta_t - \eta\, a\, \theta_t = (1 - \eta a)\,\theta_t.$$

The iteration converges when

$$|1 - \eta a| < 1,$$

which means

$$0 < \eta < \frac{2}{a}.$$
This small example shows a general principle: directions with high curvature require smaller steps. In high-dimensional problems, different directions may have very different curvature. A single scalar learning rate then becomes a compromise.
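The stability threshold $\eta < 2/a$ is easy to observe directly on the one-dimensional quadratic:

```python
# Gradient descent on L(theta) = (a/2) * theta^2, where the gradient
# is a * theta. Each step multiplies theta by (1 - eta * a), so the
# iterates converge iff |1 - eta * a| < 1, i.e. 0 < eta < 2 / a.
def run(a, eta, theta=1.0, steps=50):
    for _ in range(steps):
        theta -= eta * a * theta
    return theta

a = 10.0                              # curvature of the quadratic
assert abs(run(a, eta=0.05)) < 1e-6   # eta < 2/a: converges
assert abs(run(a, eta=0.21)) > 1.0    # eta > 2/a: diverges
```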
Full-Batch Gradient Descent
Full-batch gradient descent computes the gradient over the entire training set:

$$\nabla L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \, \ell\big(f(x_i; \theta),\, y_i\big).$$
This gradient is deterministic for a fixed dataset and parameter vector. It gives a clean mathematical object, but it is often expensive. Large datasets make full-batch updates impractical because every update requires a pass over all examples.
Full-batch gradient descent is common in small numerical optimization problems, classical regression, and some scientific computing workloads. It is less common in large-scale neural network training.
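A small classical-regression case makes the full-batch loop concrete. This is a self-contained sketch on a noiseless least-squares problem; every update touches all 200 examples:

```python
import numpy as np

# Noiseless least-squares data (made up for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_theta = np.array([3.0, -1.0])
y = X @ true_theta

theta = np.zeros(2)
eta = 0.1
for _ in range(500):
    # Deterministic gradient of the mean squared error over ALL examples.
    grad = 2.0 * X.T @ (X @ theta - y) / len(X)
    theta -= eta * grad

# With noiseless data, gradient descent recovers the true parameters.
assert np.allclose(theta, true_theta, atol=1e-4)
```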
Mini-Batch Gradient Descent
Modern machine learning usually uses mini-batches. Instead of computing the loss over all examples, we sample a batch $B \subset \{1, \dots, N\}$:

$$L_B(\theta) = \frac{1}{|B|} \sum_{i \in B} \ell\big(f(x_i; \theta),\, y_i\big).$$

Then we update using

$$\theta_{t+1} = \theta_t - \eta \, \nabla L_B(\theta_t).$$
The mini-batch gradient is an estimate of the full gradient. It introduces noise, but reduces computation per update and improves hardware utilization. Matrix and tensor operations over batches map well to GPUs and TPUs.
From the perspective of AD, mini-batching changes the executed program. The computation graph contains batched tensor operations rather than scalar operations per example. Reverse mode then computes gradients for the batched loss in one backward pass.
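That the mini-batch gradient is a noisy but unbiased estimate of the full gradient can be verified empirically: averaging many mini-batch gradients recovers the full-batch gradient. The dataset below is made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
theta = np.zeros(3)

def full_grad(theta):
    # Mean-squared-error gradient over the whole dataset.
    return 2.0 * X.T @ (X @ theta - y) / len(X)

def batch_grad(theta, idx):
    # Same gradient, restricted to one sampled mini-batch.
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)

# Averaging many mini-batch gradients approaches the full gradient.
est = np.mean(
    [batch_grad(theta, rng.choice(len(X), size=32, replace=False))
     for _ in range(5000)],
    axis=0,
)
assert np.allclose(est, full_grad(theta), atol=0.1)
```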
The Role of Reverse Mode AD
A typical training step has this structure:
```python
pred = model(x)
loss = loss_fn(pred, y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Conceptually, this performs four operations.
First, the forward pass computes intermediate values needed for the loss. Second, the AD system records enough information to differentiate the computation. Third, the backward pass propagates adjoints from the scalar loss back to the parameters. Fourth, the optimizer updates parameter storage using the accumulated gradients.
For a parameter $\theta_j$, reverse mode accumulates an adjoint:

$$\bar{\theta}_j = \frac{\partial \, \mathrm{loss}}{\partial \theta_j}.$$

The optimizer then applies an update such as

$$\theta_j \leftarrow \theta_j - \eta \, \bar{\theta}_j.$$
The AD system computes gradients. The optimizer decides how to use them. This separation is important. Gradient descent, momentum, Adam, and other methods can share the same AD engine.
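The separation can be sketched with one shared gradient routine feeding two different update rules. This is a minimal illustration with $L(\theta) = \theta^2$, not a real AD engine:

```python
# One gradient routine, shared by both optimizers.
def grad(theta):
    # dL/dtheta for the toy objective L(theta) = theta^2.
    return 2.0 * theta

def sgd_step(theta, g, eta=0.1):
    return theta - eta * g

def momentum_step(theta, g, v, eta=0.1, beta=0.9):
    v = beta * v + g            # accumulate a velocity term
    return theta - eta * v, v

theta_sgd, theta_mom, v = 1.0, 1.0, 0.0
for _ in range(500):
    theta_sgd = sgd_step(theta_sgd, grad(theta_sgd))
    theta_mom, v = momentum_step(theta_mom, grad(theta_mom), v)

# Both optimizers reach the minimum using the same gradient function.
assert abs(theta_sgd) < 1e-6 and abs(theta_mom) < 1e-6
```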
Parameters, Gradients, and Mutation
In implementation, model parameters are mutable arrays. Their gradients are usually stored in parallel buffers.
A parameter tensor may be stored as `weight.data`, and its gradient in a parallel buffer such as `weight.grad`. During the backward pass, operations add contributions into `weight.grad`. After the optimizer step, the gradient buffer must be cleared or overwritten before the next step. Otherwise, gradients from multiple batches accumulate unintentionally.
This behavior is useful when deliberate. For example, gradient accumulation simulates a larger batch by running several forward/backward passes before applying one optimizer step. But when accidental, it produces incorrect updates.
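Gradient accumulation can be sketched with explicit buffers. The gradient function below is a stand-in for a backward pass; the setup is hypothetical:

```python
import numpy as np

theta = np.array([1.0, 2.0])
grad_buf = np.zeros_like(theta)       # plays the role of weight.grad

def micro_batch_grad(theta):
    # Stand-in for one forward/backward pass on a micro-batch.
    return 2.0 * theta

accumulation_steps = 4
for _ in range(accumulation_steps):
    grad_buf += micro_batch_grad(theta)   # backward() adds in place

# One optimizer step using the averaged accumulated gradient.
theta -= 0.1 * grad_buf / accumulation_steps
grad_buf[:] = 0.0                         # zero_grad(): reset the buffer

assert np.allclose(theta, np.array([0.8, 1.6]))
assert np.all(grad_buf == 0.0)
```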
Differentiability Requirements
Gradient descent assumes a useful derivative exists. Many machine learning programs contain operations that are only piecewise differentiable. A standard example is the ReLU:

$$\mathrm{relu}(x) = \max(0, x).$$

Its derivative is

$$\mathrm{relu}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0. \end{cases}$$

At $x = 0$, the classical derivative is undefined. AD systems usually choose a convention, often returning one of the subgradient values. This allows optimization to proceed, but the result should be understood as a practical derivative rule, not as a smooth mathematical derivative.
Similar issues arise with clipping, absolute values, sorting, indexing, masking, and conditional branches.
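A common convention, shown here for the ReLU, is to return the subgradient value 0 at the kink. This sketch mirrors what many AD systems do:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Convention: pick the subgradient value 0 at x == 0.
    return (x > 0).astype(float)

xs = np.array([-1.0, 0.0, 2.0])
assert np.array_equal(relu(xs), np.array([0.0, 0.0, 2.0]))
# At x = 0 the classical derivative is undefined; the rule returns 0.
assert np.array_equal(relu_grad(xs), np.array([0.0, 0.0, 1.0]))
```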
Gradient Descent as Program Transformation
Suppose the training loss is represented as a program:

```python
loss = L(theta, batch)
```

Automatic differentiation constructs another program:

```python
grad = dL_dtheta(theta, batch)
```

Gradient descent then wraps both programs:

```python
grad = dL_dtheta(theta, batch)
theta = theta - eta * grad
```

This view is useful because it separates three layers:
| Layer | Responsibility |
|---|---|
| Model program | Computes predictions and loss |
| AD transform | Computes derivatives of the loss program |
| Optimizer | Updates parameters using derivatives |
A clean AD system should make this separation explicit. The derivative transform should not be tied to one optimizer. The optimizer should not need to know the internal graph representation used by AD.
Failure Modes
Gradient descent can fail even when gradients are computed correctly.
The objective may be poorly conditioned. Some directions may change the loss sharply while others change it slowly. A single learning rate then causes either slow progress or instability.
The gradient may vanish. This means the gradient norm becomes very small before the model reaches a useful solution. Deep networks with saturating nonlinearities historically suffered from this problem.
The gradient may explode. This means the gradient norm becomes very large, causing unstable updates. Recurrent networks and very deep models often require gradient clipping or normalization.
The loss landscape may contain saddle points, flat regions, sharp valleys, and many local minima. Gradient descent only uses local first-order information, so it has limited ability to reason globally.
The AD system does not solve these optimization problems. It provides accurate derivatives of the executed program. Optimization remains a separate numerical problem.
Minimal Algorithm
A basic gradient descent loop can be written as:
```
initialize theta
for each step:
    loss = L(theta)
    grad = gradient(loss, theta)
    theta = theta - eta * grad
```

For mini-batch training:
```
initialize theta
for each step:
    batch = next_batch(data)
    loss = L(theta, batch)
    grad = gradient(loss, theta)
    theta = theta - eta * grad
```

In tensor systems, `theta` is a collection of arrays, not one flat vector. The update is applied elementwise to each parameter tensor.
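A minimal sketch of the elementwise update over a parameter collection, with made-up shapes and values:

```python
import numpy as np

# Parameters as a collection of named arrays, not one flat vector.
params = {"W": np.ones((2, 2)), "b": np.zeros(2)}
grads  = {"W": 0.5 * np.ones((2, 2)), "b": np.array([1.0, -1.0])}
eta = 0.1

# The same update rule is applied elementwise to each tensor.
for name in params:
    params[name] -= eta * grads[name]

assert np.allclose(params["W"], 0.95)
assert np.allclose(params["b"], [-0.1, 0.1])
```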
AD Correctness vs Optimization Success
It is important to distinguish derivative correctness from training success.
An AD system may compute the exact derivative of the floating point program, yet training may still fail because the learning rate is wrong, the model is poorly initialized, the loss is badly scaled, the data is noisy, or the objective is ill-conditioned.
Conversely, a model may train successfully even when the gradients are approximate, clipped, quantized, or computed in reduced precision.
Automatic differentiation gives gradient descent its main input. It does not guarantee convergence. It turns a program into a differentiable numerical object, after which optimization methods operate under the usual limits of numerical analysis.