
Case Studies

This section studies reverse mode automatic differentiation through concrete examples. Each case has the same structure:

  1. evaluate the primal computation forward;
  2. seed the output adjoint;
  3. propagate adjoints backward;
  4. collect input or parameter gradients.

The examples move from scalar expressions to neural network layers and iterative programs.

Case Study 1: Scalar Expression

Consider

$$y = \log(1 + x^2).$$

Decompose the computation:

$$v_1 = x^2, \qquad v_2 = 1 + v_1, \qquad v_3 = \log(v_2), \qquad y = v_3.$$

Forward pass:

| Variable | Expression |
| --- | --- |
| $v_1$ | $x^2$ |
| $v_2$ | $1 + v_1$ |
| $v_3$ | $\log(v_2)$ |

Reverse initialization:

$$\bar v_3 = 1.$$

Backward propagation:

$$\bar v_2 \mathrel{+}= \bar v_3 \frac{1}{v_2}, \qquad \bar v_1 \mathrel{+}= \bar v_2, \qquad \bar x \mathrel{+}= \bar v_1 \cdot 2x.$$

Therefore,

$$\bar x = \frac{2x}{1+x^2}.$$

Reverse mode recovered the derivative by local rules without symbolic simplification of the original expression.
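
The same four steps can be written out literally in code. The sketch below (the function name `grad_log1px2` is illustrative, not from the text) runs the forward pass, seeds the output adjoint, and propagates it backward exactly as in the table above.

```python
import math

def grad_log1px2(x):
    # Forward pass: record the intermediates.
    v1 = x * x           # v1 = x^2
    v2 = 1.0 + v1        # v2 = 1 + v1
    v3 = math.log(v2)    # v3 = log(v2), y = v3

    # Reverse pass: seed the output adjoint and apply local rules.
    v3_bar = 1.0
    v2_bar = v3_bar * (1.0 / v2)
    v1_bar = v2_bar
    x_bar = v1_bar * 2.0 * x
    return v3, x_bar

y, x_bar = grad_log1px2(1.5)
print(x_bar, 2 * 1.5 / (1 + 1.5 ** 2))  # both ≈ 0.9231
```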

Case Study 2: Shared Intermediate

Consider

$$y = x^2 + \sin(x^2).$$

Use a shared intermediate:

$$v_1 = x^2, \qquad v_2 = \sin(v_1), \qquad v_3 = v_1 + v_2, \qquad y = v_3.$$

The value $v_1$ contributes to the output through two paths. Reverse mode must accumulate both.

Seed:

$$\bar v_3 = 1.$$

From addition:

$$\bar v_1 \mathrel{+}= \bar v_3, \qquad \bar v_2 \mathrel{+}= \bar v_3.$$

From sine:

$$\bar v_1 \mathrel{+}= \bar v_2 \cos(v_1).$$

Thus,

$$\bar v_1 = 1 + \cos(v_1).$$

From square:

$$\bar x \mathrel{+}= \bar v_1 \cdot 2x.$$

Therefore,

$$\frac{dy}{dx} = 2x\left(1+\cos(x^2)\right).$$

This example shows why adjoints are added, not assigned.
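
A short hand-coded sketch makes the accumulation visible: the adjoint of $v_1$ is updated twice, once per path, and replacing either `+=` with a plain assignment would drop one contribution.

```python
import math

def grad_shared(x):
    # Forward pass with the shared intermediate v1.
    v1 = x * x
    v2 = math.sin(v1)
    v3 = v1 + v2

    # Reverse pass: v1 accumulates adjoints from both paths.
    v3_bar = 1.0
    v1_bar, v2_bar = 0.0, 0.0
    v1_bar += v3_bar                   # from the addition
    v2_bar += v3_bar                   # from the addition
    v1_bar += v2_bar * math.cos(v1)    # from the sine
    x_bar = v1_bar * 2.0 * x           # from the square
    return x_bar

x = 0.7
print(grad_shared(x), 2 * x * (1 + math.cos(x * x)))  # should match
```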

Case Study 3: Two Inputs, One Output

Let

$$y = x_1 x_2 + \sin(x_1).$$

Wengert list:

$$v_1 = x_1 x_2, \qquad v_2 = \sin(x_1), \qquad v_3 = v_1 + v_2, \qquad y = v_3.$$

Seed:

$$\bar v_3 = 1.$$

From addition:

$$\bar v_1 = 1, \qquad \bar v_2 = 1.$$

From sine:

$$\bar x_1 \mathrel{+}= \cos(x_1).$$

From multiplication:

$$\bar x_1 \mathrel{+}= x_2, \qquad \bar x_2 \mathrel{+}= x_1.$$

Final gradient:

$$\nabla y = \begin{bmatrix} x_2 + \cos(x_1) \\ x_1 \end{bmatrix}.$$

One reverse pass computes both partial derivatives.
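
The Wengert list above translates directly into a small sketch (the function name is illustrative); a single seed on the output yields both partial derivatives in one backward sweep.

```python
import math

def grad_two_inputs(x1, x2):
    # Forward pass over the Wengert list.
    v1 = x1 * x2
    v2 = math.sin(x1)
    v3 = v1 + v2

    # Reverse pass from a single seed on v3.
    v3_bar = 1.0
    v1_bar = v3_bar
    v2_bar = v3_bar
    x1_bar = v1_bar * x2 + v2_bar * math.cos(x1)
    x2_bar = v1_bar * x1
    return x1_bar, x2_bar

print(grad_two_inputs(2.0, 3.0))  # (3 + cos(2), 2) ≈ (2.5839, 2.0)
```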

Case Study 4: Vector-Jacobian Product

Define

$$f(x_1,x_2) = \begin{bmatrix} x_1 x_2 \\ x_1 + x_2^2 \end{bmatrix}.$$

Let the output seed be

$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}.$$

Reverse mode computes

$$u^\top J_f(x).$$

Write outputs as:

$$y_1 = x_1 x_2, \qquad y_2 = x_1 + x_2^2.$$

Seed output adjoints:

$$\bar y_1 = u_1, \qquad \bar y_2 = u_2.$$

From $y_1$:

$$\bar x_1 \mathrel{+}= u_1 x_2, \qquad \bar x_2 \mathrel{+}= u_1 x_1.$$

From $y_2$:

$$\bar x_1 \mathrel{+}= u_2, \qquad \bar x_2 \mathrel{+}= 2 u_2 x_2.$$

Final result:

$$u^\top J_f(x) = \begin{bmatrix} u_1 x_2 + u_2 & u_1 x_1 + 2 u_2 x_2 \end{bmatrix}.$$

Reverse mode computes this product without constructing the full Jacobian as a stored matrix.
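
A sketch of the accumulation (names illustrative) shows the product formed entry by entry. Seeding with unit vectors recovers individual Jacobian rows, yet no 2×2 matrix is ever materialized.

```python
def vjp_f(x1, x2, u1, u2):
    # Accumulate u^T J_f(x) directly from the per-output reverse rules.
    x1_bar, x2_bar = 0.0, 0.0
    # From y1 = x1 * x2, with adjoint u1.
    x1_bar += u1 * x2
    x2_bar += u1 * x1
    # From y2 = x1 + x2^2, with adjoint u2.
    x1_bar += u2
    x2_bar += 2.0 * u2 * x2
    return x1_bar, x2_bar

print(vjp_f(2.0, 3.0, 1.0, 0.0))  # first Jacobian row: (3.0, 2.0)
print(vjp_f(2.0, 3.0, 0.0, 1.0))  # second Jacobian row: (1.0, 6.0)
```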

Case Study 5: Linear Layer

Consider a batch linear layer:

$$Y = XW + b.$$

Shapes:

| Symbol | Shape |
| --- | --- |
| $X$ | $B \times d_{\text{in}}$ |
| $W$ | $d_{\text{in}} \times d_{\text{out}}$ |
| $b$ | $d_{\text{out}}$ |
| $Y$ | $B \times d_{\text{out}}$ |

Let $\bar Y$ be the adjoint arriving from later layers.

Reverse rules:

$$\bar X = \bar Y W^\top, \qquad \bar W = X^\top \bar Y, \qquad \bar b = \sum_{i=1}^{B} \bar Y_i.$$

The backward pass has the same numerical character as the forward pass: dense matrix multiplication plus a reduction.

This is why deep learning systems implement reverse rules as tensor kernels, not scalar derivative loops.
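
Assuming NumPy for the tensor arithmetic (the chapter's own engine is scalar), the three reverse rules are a short sketch:

```python
import numpy as np

def linear_backward(X, W, Y_bar):
    # Reverse rules for Y = X @ W + b, expressed as dense kernels.
    X_bar = Y_bar @ W.T           # shape (B, d_in)
    W_bar = X.T @ Y_bar           # shape (d_in, d_out)
    b_bar = Y_bar.sum(axis=0)     # shape (d_out,)
    return X_bar, W_bar, b_bar

rng = np.random.default_rng(0)
B, d_in, d_out = 4, 3, 2
X = rng.normal(size=(B, d_in))
W = rng.normal(size=(d_in, d_out))
Y_bar = rng.normal(size=(B, d_out))
X_bar, W_bar, b_bar = linear_backward(X, W, Y_bar)
print(X_bar.shape, W_bar.shape, b_bar.shape)  # (4, 3) (3, 2) (2,)
```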

Case Study 6: ReLU Layer

Let

$$Y = \operatorname{ReLU}(X),$$

where

$$Y_{ij} = \max(0, X_{ij}).$$

The derivative is

$$\frac{\partial Y_{ij}}{\partial X_{ij}} = \begin{cases} 1 & X_{ij} > 0, \\ 0 & X_{ij} < 0. \end{cases}$$

At $X_{ij}=0$, systems choose a convention, commonly zero.

Given output adjoint $\bar Y$, reverse propagation gives

$$\bar X_{ij} = \bar Y_{ij}\, \mathbf{1}_{X_{ij} > 0}.$$

A ReLU backward pass therefore needs one of the following:

  1. the original input $X$;
  2. the output $Y$;
  3. a stored Boolean mask.

The mask often costs less memory than storing the full input, depending on representation.
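
A NumPy sketch that keeps only the Boolean mask (helper names illustrative):

```python
import numpy as np

def relu_forward(X):
    mask = X > 0                  # Boolean mask is all backward needs
    return np.where(mask, X, 0.0), mask

def relu_backward(Y_bar, mask):
    # Pass the adjoint through where the input was positive;
    # zero it elsewhere, including at exactly zero by convention.
    return np.where(mask, Y_bar, 0.0)

X = np.array([[-1.0, 0.0, 2.0]])
Y, mask = relu_forward(X)
print(relu_backward(np.ones_like(Y), mask))  # [[0. 0. 1.]]
```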

Case Study 7: Softmax Cross-Entropy

Softmax followed by cross-entropy is common in classification.

Given logits $z \in \mathbb{R}^K$, define

$$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}.$$

For a one-hot target $t$, the loss is

$$L = -\sum_i t_i \log p_i.$$

A direct graph would include exponentials, sums, division, logarithms, and multiplication. Reverse mode can differentiate this graph mechanically.

However, most systems use a fused stable backward rule:

$$\bar z = p - t.$$

For a batch with mean reduction:

$$\bar Z = \frac{1}{B}(P - T).$$

This custom backward rule is both simpler and more numerically stable than differentiating each primitive separately.
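
A NumPy sketch of the fused forward and backward rules, with the usual max-subtraction for numerical stability (the helper name is illustrative):

```python
import numpy as np

def softmax_xent(Z, T):
    # Stable softmax: subtract the per-row max before exponentiating.
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    loss = -(T * np.log(P)).sum(axis=1).mean()
    # Fused backward rule for the mean-reduced batch loss.
    Z_bar = (P - T) / Z.shape[0]
    return loss, Z_bar

Z = np.array([[2.0, 0.5, -1.0]])
T = np.array([[1.0, 0.0, 0.0]])
loss, Z_bar = softmax_xent(Z, T)
print(loss, Z_bar)
```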

Case Study 8: Simple Recurrent Computation

Consider a recurrence:

$$h_0 = x, \qquad h_{t+1} = \tanh(a h_t + b), \quad t = 0,\ldots,T-1, \qquad L = h_T.$$

Forward pass stores or checkpoints hidden states:

$$h_0, h_1, \ldots, h_T.$$

Seed:

$$\bar h_T = 1.$$

Backward through time:

$$z_t = a h_t + b, \qquad h_{t+1} = \tanh(z_t).$$

Reverse rules:

$$\bar z_t = \bar h_{t+1}(1 - h_{t+1}^2), \qquad \bar a \mathrel{+}= \bar z_t h_t, \qquad \bar b \mathrel{+}= \bar z_t, \qquad \bar h_t \mathrel{+}= \bar z_t\, a.$$

The same parameter $a$ is used at every time step, so its adjoint accumulates over all steps:

$$\bar a = \sum_{t=0}^{T-1} \bar z_t h_t.$$

This is backpropagation through time.
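
The recurrence and its backward sweep fit in a short sketch (the parameter values below are arbitrary):

```python
import math

def bptt(x, a, b, T):
    # Forward pass: store every hidden state h_0, ..., h_T.
    h = [x]
    for t in range(T):
        h.append(math.tanh(a * h[t] + b))

    # Backward through time, starting from the seed on h_T.
    h_bar, a_bar, b_bar = 1.0, 0.0, 0.0
    for t in reversed(range(T)):
        z_bar = h_bar * (1.0 - h[t + 1] ** 2)
        a_bar += z_bar * h[t]    # parameter adjoints accumulate
        b_bar += z_bar           # over all time steps
        h_bar = z_bar * a
    return h[T], a_bar, b_bar

print(bptt(x=0.5, a=0.8, b=0.1, T=5))
```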

Case Study 9: Checkpointed Chain

Consider a chain of eight layers:

$$h_0 \to h_1 \to h_2 \to \cdots \to h_8.$$

Full reverse mode stores all activations:

$$h_0, h_1, \ldots, h_8.$$

A checkpointed scheme might store:

$$h_0, h_4, h_8.$$

During backward propagation through the segment $h_4 \to h_8$, the system recomputes:

$$h_5, h_6, h_7, h_8$$

from checkpoint $h_4$, then runs the local backward pass.

Then it repeats for the segment $h_0 \to h_4$.

This reduces peak activation memory but adds extra forward computation.

The derivative remains mathematically the same if recomputation faithfully reproduces the original forward values.
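
A scalar sketch of this schedule, assuming each layer is a plain tanh and checkpoints are placed every four layers (all names illustrative), recomputes each segment from its checkpoint before backpropagating through it:

```python
import math

def segment_forward(h, n):
    # Recompute n tanh layers from a checkpoint, keeping activations.
    acts = [h]
    for _ in range(n):
        acts.append(math.tanh(acts[-1]))
    return acts

def checkpointed_grad(h0, n_layers=8, seg=4):
    # Forward pass: store only the checkpoints h_0, h_4, h_8.
    checkpoints = [h0]
    for _ in range(n_layers // seg):
        checkpoints.append(segment_forward(checkpoints[-1], seg)[-1])

    # Backward pass: last segment first, recompute then backpropagate.
    h_bar = 1.0  # seed on h_8
    for ckpt in reversed(checkpoints[:-1]):
        acts = segment_forward(ckpt, seg)     # extra forward work
        for h_out in reversed(acts[1:]):
            h_bar *= 1.0 - h_out ** 2         # d tanh(z)/dz = 1 - tanh(z)^2
    return h_bar

print(checkpointed_grad(0.3))
```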

Case Study 10: Tiny Reverse Mode Engine

A minimal scalar reverse mode system stores values, adjoints, and backward closures.

```python
import math

class Var:
    def __init__(self, value):
        self.value = float(value)   # primal value
        self.grad = 0.0             # accumulated adjoint
        self.parents = []           # (parent, pullback) pairs recorded by each op

    def backward(self):
        # Topologically sort the nodes reachable from this output.
        topo = []
        seen = set()

        def visit(v):
            if id(v) in seen:
                return
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            topo.append(v)

        visit(self)

        # Seed the output adjoint.
        self.grad = 1.0

        # Propagate adjoints in reverse topological order,
        # accumulating each pullback into the parent's grad.
        for v in reversed(topo):
            for parent, pullback in v.parents:
                parent.grad += pullback(v.grad)

def add(a, b):
    out = Var(a.value + b.value)
    out.parents = [
        (a, lambda g: g),
        (b, lambda g: g),
    ]
    return out

def mul(a, b):
    out = Var(a.value * b.value)
    out.parents = [
        (a, lambda g: g * b.value),
        (b, lambda g: g * a.value),
    ]
    return out

def sin(a):
    out = Var(math.sin(a.value))
    out.parents = [
        (a, lambda g: g * math.cos(a.value)),
    ]
    return out
```

Using it:

```python
x1 = Var(2.0)
x2 = Var(3.0)

y = add(mul(x1, x2), sin(x1))
y.backward()

print(y.value)    # 2*3 + sin(2) ≈ 6.9093
print(x1.grad)    # 3 + cos(2) ≈ 2.5839
print(x2.grad)    # 2.0
```

The gradients are:

$$\bar x_1 = x_2 + \cos(x_1), \qquad \bar x_2 = x_1.$$

This small engine illustrates the whole mechanism:

  1. the forward pass builds a graph;
  2. each operation stores a pullback;
  3. backward traversal accumulates gradients.

Chapter Summary

Reverse mode automatic differentiation computes gradients by recording a primal computation and replaying its local derivative rules backward.

The main ideas from this chapter are:

| Concept | Role |
| --- | --- |
| adjoint | sensitivity of the output to an intermediate value |
| reverse graph | dependency graph traversed backward |
| vector-Jacobian product | core operation of reverse mode |
| reverse accumulation | additive propagation of adjoints |
| tape | recorded execution trace |
| Wengert list | single-assignment linearized computation |
| checkpointing | memory reduction by recomputation |
| backpropagation | reverse mode applied to neural networks |

Reverse mode is efficient when many inputs influence a small number of outputs. This is the standard setting in optimization and deep learning: many parameters, one scalar objective.