# Reverse Mode in Deep Learning

Reverse mode automatic differentiation is the mathematical and systems basis of backpropagation. In deep learning, the objective is usually a scalar loss depending on many parameters. This is exactly the setting where reverse mode is most efficient.

The loss for a single training step can be written as

$$
\mathcal{L}(\theta) = \ell(F(x;\theta), y),
$$

where $x$ is an input batch, $y$ is the target, $F$ is the model, $\theta$ is the collection of parameters, and $\mathcal{L}$ is a scalar loss.

The goal is to compute

$$
\nabla_\theta \mathcal{L}.
$$

The number of parameters may be very large, but the loss is scalar. Reverse mode computes all parameter gradients with one forward pass and one backward pass.
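
As a minimal illustration, here is a PyTorch sketch of that one-forward, one-backward pattern. The model, shapes, and loss are arbitrary choices for the example, not anything prescribed by the discussion above:

```python
import torch

# A tiny illustrative model; sizes are arbitrary.
model = torch.nn.Sequential(
    torch.nn.Linear(32, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)
x = torch.randn(8, 32)           # input batch
y = torch.randint(0, 10, (8,))   # targets

loss = torch.nn.functional.cross_entropy(model(x), y)  # one forward pass
loss.backward()                                        # one backward pass

# Every parameter now holds its gradient, regardless of parameter count.
for name, p in model.named_parameters():
    print(name, p.grad.shape)
```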

### Backpropagation as Reverse Mode

Backpropagation is reverse mode AD applied to neural networks.

A network is a composition of layers:

$$
\begin{aligned}
h_0 &= x, \\
h_i &= F_i(h_{i-1};\theta_i), \quad i = 1, \ldots, L, \\
\mathcal{L} &= \ell(h_L, y).
\end{aligned}
$$

The forward pass computes activations $h_1,\ldots,h_L$. The backward pass propagates adjoints:

$$
\bar h_i = \frac{\partial \mathcal{L}}{\partial h_i}.
$$

Each layer receives an output adjoint $\bar h_i$ and computes an input adjoint $\bar h_{i-1}$ together with a parameter adjoint $\bar\theta_i$.

This is precisely a local vector-Jacobian product.

### Layer Pullbacks

Each layer defines a forward function:

$$
h_i = F_i(h_{i-1};\theta_i).
$$

Its reverse rule maps output sensitivity into input and parameter sensitivities:

$$
(\bar h_{i-1}, \bar\theta_i) =
F_i^\ast(h_{i-1},\theta_i,\bar h_i).
$$

The layer pullback usually needs saved forward values such as:

| Layer | Saved Values Often Needed |
|---|---|
| Linear | input activation |
| Convolution | input activation, layout metadata |
| ReLU | input or output mask |
| LayerNorm | mean, variance, normalized value |
| Attention | query, key, value, attention weights |
| Softmax | output probabilities |

Deep learning frameworks implement these pullbacks as backward kernels.

### Example: Linear Layer

For a linear layer acting on a batch of row-vector inputs $x$,

$$
z = xW + b.
$$

Given output adjoint $\bar z$, the reverse rules are

$$
\bar x = \bar z W^\top,
$$

$$
\bar W = x^\top \bar z,
$$

$$
\bar b = \sum_{\text{batch}} \bar z.
$$

These equations show the computational pattern of backpropagation. The same matrix operations used in the forward pass appear again in transposed form during the backward pass.
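
A short NumPy sketch of these three rules, with illustrative shapes and `x` holding one example per row:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32))   # batch of inputs, one row per example
W = rng.standard_normal((32, 16))  # weight
b = rng.standard_normal(16)        # bias

# Forward: z = xW + b.
z = x @ W + b

# A stand-in upstream adjoint; in practice it comes from the next layer.
z_bar = rng.standard_normal(z.shape)

# Backward rules from the text.
x_bar = z_bar @ W.T          # input adjoint
W_bar = x.T @ z_bar          # weight adjoint
b_bar = z_bar.sum(axis=0)    # bias adjoint: sum over the batch
```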

### Example: Activation Function

For an elementwise activation

$$
h = \sigma(z),
$$

the backward rule is

$$
\bar z = \bar h \odot \sigma'(z),
$$

where $\odot$ denotes elementwise multiplication.

For ReLU,

$$
\sigma(z) = \max(0,z),
$$

the derivative is

$$
\sigma'(z) =
\begin{cases}
1 & z > 0, \\
0 & z < 0.
\end{cases}
$$

At $z = 0$, the derivative is undefined. Frameworks choose a convention, usually zero.
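
A minimal NumPy sketch of this pattern for ReLU, saving a mask in the forward pass for the backward rule. The `z > 0` comparison assigns derivative zero at $z = 0$, matching the usual convention:

```python
import numpy as np

def relu_forward(z):
    # Save the mask the backward rule needs.
    mask = z > 0
    return np.where(mask, z, 0.0), mask

def relu_backward(h_bar, mask):
    # z_bar = h_bar * sigma'(z); derivative is 0 at z = 0 by convention.
    return h_bar * mask
```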

### Computation Graphs in Deep Learning

A training step builds a computation graph containing:

1. tensor operations;
2. parameter reads;
3. loss computation;
4. saved activations;
5. metadata needed for backward kernels.

In eager systems, the graph is built dynamically as Python or another host language executes.

In compiled systems, the graph may be traced, lowered to an intermediate representation, optimized, and executed as a compiled program.

Both approaches implement the same reverse-mode principle.

### Gradients of Parameters

Parameters are leaves in the computation graph.

During the backward pass, each parameter accumulates an adjoint:

$$
\bar\theta_i =
\frac{\partial \mathcal{L}}{\partial \theta_i}.
$$

Optimizers then update parameters using these gradients.

For stochastic gradient descent:

$$
\theta_i \leftarrow \theta_i - \eta \bar\theta_i.
$$

For adaptive methods, the optimizer also maintains state such as momentum and variance estimates.

The AD system computes gradients. The optimizer consumes them.
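
A minimal PyTorch sketch of this division of labor, with plain SGD standing in for the optimizer and an arbitrary toy model:

```python
import torch

model = torch.nn.Linear(4, 1)   # illustrative model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 4), torch.randn(8, 1)

opt.zero_grad()                                    # clear stale adjoints
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()                                    # AD computes gradients
opt.step()                                         # optimizer consumes them
```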

### Mini-Batches

Training usually uses mini-batches.

For batch size $B$, the loss may be written as

$$
\mathcal{L} =
\frac{1}{B}
\sum_{j=1}^{B}
\ell(F(x_j;\theta), y_j).
$$

Reverse mode propagates adjoints through the batched tensor computation.

The parameter gradient is the sum or mean of per-example gradients, depending on the loss convention.

This reduction matters because it changes gradient scale. Frameworks must define clearly whether losses are summed or averaged.
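
A small PyTorch experiment makes the scale difference concrete (model and shapes are illustrative). With `reduction="sum"` the gradient is larger than with `reduction="mean"` by the number of loss elements:

```python
import torch

model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

def grad_norm(reduction):
    model.zero_grad()
    torch.nn.functional.mse_loss(model(x), y, reduction=reduction).backward()
    return model.weight.grad.norm().item()

# The summed loss yields gradients B times larger than the averaged loss.
print(grad_norm("mean"), grad_norm("sum"))
```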

### Shared Parameters

Neural networks often reuse the same parameter in multiple places.

Examples include:

1. recurrent neural networks;
2. tied embeddings;
3. weight sharing in convolution;
4. transformer blocks with shared components.

If a parameter is used multiple times, its gradient must accumulate contributions from every use:

$$
\bar\theta =
\sum_k
\frac{\partial \mathcal{L}}{\partial \theta^{(k)}}.
$$

This is the same adjoint accumulation rule used for any shared variable.
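
In tape-based frameworks this accumulation happens automatically. A minimal PyTorch sketch in which a single weight is used twice, loosely in the style of weight tying:

```python
import torch

W = torch.randn(4, 4, requires_grad=True)
x = torch.randn(2, 4)

# W is used twice: h = (xW)W.
h = (x @ W) @ W
h.sum().backward()

# W.grad holds the summed adjoint contributions from both uses.
print(W.grad.shape)
```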

### Loss Scaling

Mixed precision training often uses loss scaling.

Small gradients may underflow in low-precision formats. To avoid this, the system computes

$$
\tilde{\mathcal{L}} = s\mathcal{L},
$$

where $s$ is a scale factor.

The backward pass computes

$$
\nabla_\theta \tilde{\mathcal{L}} =
s\nabla_\theta \mathcal{L}.
$$

Before the optimizer step, gradients are divided by $s$.

Loss scaling is a numerical technique layered on top of reverse mode. It changes gradient representation, not the underlying derivative.
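
A minimal sketch of manual loss scaling in PyTorch. The scale factor and model are illustrative, and real systems also check for overflow before unscaling:

```python
import torch

model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
scale = 1024.0  # illustrative scale factor s

loss = torch.nn.functional.mse_loss(model(x), y)
(scale * loss).backward()          # the backward pass sees s * L

with torch.no_grad():
    for p in model.parameters():
        p.grad /= scale            # unscale before the optimizer step
```

In PyTorch, `torch.cuda.amp.GradScaler` packages this pattern together with overflow detection and dynamic adjustment of the scale factor.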

### Memory Pressure

Deep learning workloads are often memory-bound because reverse mode must preserve activations.

For a transformer, major memory contributors include:

| Component | Role |
|---|---|
| parameters | learned weights |
| gradients | parameter adjoints |
| optimizer state | momentum, variance, etc. |
| activations | saved forward values |
| attention tensors | large intermediate states |

Activation checkpointing reduces memory by discarding selected activations and recomputing them during the backward pass.

This permits larger batch sizes, longer sequences, or deeper models under a fixed memory budget.
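
A minimal sketch using PyTorch's checkpoint utility. The block is an arbitrary example, and `use_reentrant=False` selects the newer non-reentrant implementation in recent PyTorch versions:

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 64)
)
x = torch.randn(8, 64, requires_grad=True)

# Activations inside `block` are discarded after the forward pass and
# recomputed during backward, trading compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```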

### Custom Gradients

Frameworks often allow users to define custom backward rules.

A custom gradient may be useful when:

1. the default AD rule is inefficient;
2. the operation has a known stable derivative;
3. the forward operation calls external code;
4. the mathematical derivative differs from the desired training signal.

For example, a numerically stable softmax-cross-entropy backward rule avoids materializing large or unstable intermediate derivatives.

Custom gradients are powerful but dangerous. An incorrect rule silently changes the optimization problem.
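
A minimal PyTorch sketch of the mechanism. `ClippedExp` is an invented example, not a real library op; the clamp stands in for whatever the custom rule is meant to do:

```python
import torch

class ClippedExp(torch.autograd.Function):
    """Illustrative custom op: exp forward, clipped gradient backward."""

    @staticmethod
    def forward(ctx, x):
        y = torch.exp(x)
        ctx.save_for_backward(y)   # save what the pullback needs
        return y

    @staticmethod
    def backward(ctx, y_bar):
        (y,) = ctx.saved_tensors
        # d/dx exp(x) = exp(x), but clamp it for illustration.
        return y_bar * y.clamp(max=10.0)

x = torch.randn(5, requires_grad=True)
ClippedExp.apply(x).sum().backward()
```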

### Non-Smooth Operations

Deep learning contains many non-smooth operations:

1. ReLU;
2. max pooling;
3. clipping;
4. argmax-like approximations;
5. masking;
6. sorting and ranking approximations.

Reverse mode propagates whatever subgradient the system defines at these points.

For example, max pooling sends gradient only to the selected maximum element.

At ties, the derivative is ambiguous. Frameworks choose deterministic or implementation-dependent conventions.
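
A NumPy sketch of this routing for one 1-D pooling window per row. NumPy's `argmax` breaks ties by taking the first maximum, which is one such convention:

```python
import numpy as np

def maxpool_backward(z, h_bar):
    """Route each row's adjoint h_bar to the argmax of that row of z."""
    z_bar = np.zeros_like(z)
    winners = z.argmax(axis=1)                 # ties: first maximum wins
    z_bar[np.arange(len(z)), winners] = h_bar
    return z_bar
```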

### Stop-Gradient

Many frameworks provide a stop-gradient operation.

Conceptually:

$$
y = \operatorname{stopgrad}(x)
$$

has forward value

$$
y = x,
$$

but backward rule

$$
\bar x = 0.
$$

This lets users block gradient flow through part of a graph.

Stop-gradient is common in:

1. target networks;
2. reinforcement learning;
3. contrastive learning;
4. self-supervised methods;
5. truncated backpropagation.

It is a deliberate modification of the derivative computation.
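
In PyTorch this operation is `detach` (JAX calls it `stop_gradient`). A minimal sketch:

```python
import torch

x = torch.randn(3, requires_grad=True)

y = x.detach()       # forward value equals x; gradient flow is cut
z = (y * x).sum()    # only the undetached factor is differentiated
z.backward()

# x_bar = y exactly: the detached branch contributes no gradient term.
print(torch.allclose(x.grad, y))  # True
```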

### Distributed Training

In distributed data-parallel training, each worker computes gradients on a local mini-batch.

Then gradients are aggregated across workers:

$$
\bar\theta =
\frac{1}{N}
\sum_{i=1}^{N}
\bar\theta_i.
$$

Reverse mode runs locally on each worker. Communication happens after or during gradient computation.

Systems overlap communication with backward execution by reducing gradients layer by layer as they become available.

This overlap is important for large-scale training efficiency.
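
A single-process simulation of the aggregation step (real systems use collective operations such as all-reduce; `N` and the model are illustrative):

```python
import torch

# Simulate N data-parallel workers in one process: each computes a
# gradient on its own local batch, then gradients are averaged,
# mimicking an all-reduce mean.
N = 4
model = torch.nn.Linear(4, 1)

local_grads = []
for _ in range(N):
    model.zero_grad()
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    torch.nn.functional.mse_loss(model(x), y).backward()
    local_grads.append([p.grad.clone() for p in model.parameters()])

# Aggregated gradient: the mean over workers, parameter by parameter.
avg_grads = [torch.stack(gs).mean(dim=0) for gs in zip(*local_grads)]
```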

### Backward Kernel Efficiency

In deep learning frameworks, backward rules are not usually scalar derivative formulas executed element by element. They are optimized tensor kernels.

For example, the backward pass for matrix multiplication uses high-performance matrix multiplication again:

$$
\bar X = \bar Y W^\top,
$$

$$
\bar W = X^\top \bar Y.
$$

Performance depends on:

1. memory layout;
2. kernel fusion;
3. tensor shapes;
4. precision format;
5. device placement.

Reverse mode provides the mathematical structure. Kernel libraries determine practical speed.

### Conceptual Summary

Reverse mode fits deep learning because training usually differentiates a scalar loss with respect to many parameters.

Backpropagation is reverse mode over a layered tensor computation. Each layer stores the forward information it needs, receives an output adjoint, and produces adjoints for its inputs and parameters.

The main systems problems are memory pressure, kernel efficiency, distributed gradient aggregation, mixed precision stability, and correct handling of non-smooth or custom operations.

