# Case Studies

This chapter studies reverse mode automatic differentiation through concrete examples. Each case study follows the same structure:

1. evaluate the primal computation forward;
2. seed the output adjoint;
3. propagate adjoints backward;
4. collect input or parameter gradients.

The examples move from scalar expressions to neural network layers and iterative programs.

## Case Study 1: Scalar Expression

Consider

$$
y = \log(1 + x^2).
$$

Decompose the computation:

$$
v_1 = x^2,
$$

$$
v_2 = 1 + v_1,
$$

$$
v_3 = \log(v_2),
$$

$$
y = v_3.
$$

Forward pass:

| Variable | Expression |
|---|---|
| $v_1$ | $x^2$ |
| $v_2$ | $1 + v_1$ |
| $v_3$ | $\log(v_2)$ |

Reverse initialization:

$$
\bar v_3 = 1.
$$

Backward propagation:

$$
\bar v_2 \mathrel{+}= \bar v_3 \frac{1}{v_2},
$$

$$
\bar v_1 \mathrel{+}= \bar v_2,
$$

$$
\bar x \mathrel{+}= \bar v_1 2x.
$$

Therefore,

$$
\bar x =
\frac{2x}{1+x^2}.
$$

Reverse mode recovers the derivative through local rules alone, without symbolically simplifying the original expression.
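
As a check, here is a minimal sketch in plain Python (the test point is arbitrary) that runs the forward pass and the reverse sweep above by hand and compares the result with the closed form $2x/(1+x^2)$.

```python
import math

x = 1.5  # arbitrary test point

# Forward pass
v1 = x * x
v2 = 1.0 + v1
v3 = math.log(v2)

# Reverse sweep: seed the output adjoint, then apply local rules
v3_bar = 1.0
v2_bar = v3_bar * (1.0 / v2)   # d log(v2)/d v2 = 1/v2
v1_bar = v2_bar                # d (1 + v1)/d v1 = 1
x_bar = v1_bar * 2.0 * x       # d x^2/d x = 2x

print(x_bar, 2 * x / (1 + x * x))  # both ~0.923
```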

## Case Study 2: Shared Intermediate

Consider

$$
y = x^2 + \sin(x^2).
$$

Use a shared intermediate:

$$
v_1 = x^2,
$$

$$
v_2 = \sin(v_1),
$$

$$
v_3 = v_1 + v_2,
$$

$$
y = v_3.
$$

The value $v_1$ contributes to the output through two paths. Reverse mode must accumulate both.

Seed:

$$
\bar v_3 = 1.
$$

From addition:

$$
\bar v_1 \mathrel{+}= \bar v_3,
$$

$$
\bar v_2 \mathrel{+}= \bar v_3.
$$

From sine:

$$
\bar v_1 \mathrel{+}= \bar v_2 \cos(v_1).
$$

Thus,

$$
\bar v_1 = 1 + \cos(v_1).
$$

From square:

$$
\bar x \mathrel{+}= \bar v_1 2x.
$$

Therefore,

$$
\frac{dy}{dx} =
2x(1+\cos(x^2)).
$$

This example shows why adjoints are added, not assigned.
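
The accumulation can be made explicit in a short sketch (plain Python, arbitrary test point) where the adjoint of the shared intermediate $v_1$ is updated twice with `+=`.

```python
import math

x = 0.7  # arbitrary test point

# Forward pass with a shared intermediate
v1 = x * x
v2 = math.sin(v1)
v3 = v1 + v2

# Reverse sweep: v1 receives contributions from both of its uses
v3_bar = 1.0
v1_bar = 0.0
v2_bar = 0.0
v1_bar += v3_bar                 # path through the addition
v2_bar += v3_bar
v1_bar += v2_bar * math.cos(v1)  # path through the sine
x_bar = v1_bar * 2.0 * x

print(x_bar, 2 * x * (1 + math.cos(x * x)))  # identical values
```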

## Case Study 3: Two Inputs, One Output

Let

$$
y = x_1x_2 + \sin(x_1).
$$

Wengert list:

$$
v_1 = x_1x_2,
$$

$$
v_2 = \sin(x_1),
$$

$$
v_3 = v_1 + v_2,
$$

$$
y = v_3.
$$

Seed:

$$
\bar v_3 = 1.
$$

From addition:

$$
\bar v_1 = 1,
\qquad
\bar v_2 = 1.
$$

From sine:

$$
\bar x_1 \mathrel{+}= \cos(x_1).
$$

From multiplication:

$$
\bar x_1 \mathrel{+}= x_2,
$$

$$
\bar x_2 \mathrel{+}= x_1.
$$

Final gradient:

$$
\nabla y =
\begin{bmatrix}
x_2 + \cos(x_1) \\
x_1
\end{bmatrix}.
$$

One reverse pass computes both partial derivatives.
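
A quick finite-difference check of this gradient at an arbitrary point might look like the sketch below; Case Study 10 later differentiates the same expression with a small reverse mode engine.

```python
import math

def f(x1, x2):
    return x1 * x2 + math.sin(x1)

x1, x2 = 2.0, 3.0
eps = 1e-6

# Gradient from the reverse pass
g1 = x2 + math.cos(x1)
g2 = x1

# Central finite differences for comparison
fd1 = (f(x1 + eps, x2) - f(x1 - eps, x2)) / (2 * eps)
fd2 = (f(x1, x2 + eps) - f(x1, x2 - eps)) / (2 * eps)

print(g1, fd1)  # ~2.584 both
print(g2, fd2)  # ~2.0 both
```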

## Case Study 4: Vector-Jacobian Product

Define

$$
f(x_1,x_2) =
\begin{bmatrix}
x_1x_2 \\
x_1 + x_2^2
\end{bmatrix}.
$$

Let the output seed be

$$
u =
\begin{bmatrix}
u_1 \\
u_2
\end{bmatrix}.
$$

Reverse mode computes

$$
u^\top J_f(x).
$$

Write outputs as:

$$
y_1 = x_1x_2,
$$

$$
y_2 = x_1 + x_2^2.
$$

Seed output adjoints:

$$
\bar y_1 = u_1,
\qquad
\bar y_2 = u_2.
$$

From $y_1$:

$$
\bar x_1 \mathrel{+}= u_1x_2,
$$

$$
\bar x_2 \mathrel{+}= u_1x_1.
$$

From $y_2$:

$$
\bar x_1 \mathrel{+}= u_2,
$$

$$
\bar x_2 \mathrel{+}= 2u_2x_2.
$$

Final result:

$$
u^\top J_f(x) =
\begin{bmatrix}
u_1x_2 + u_2 & u_1x_1 + 2u_2x_2
\end{bmatrix}.
$$

Reverse mode computes this product without constructing the full Jacobian as a stored matrix.
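
A sketch of the same computation, assuming NumPy is available, accumulates the input adjoints directly and compares against an explicitly constructed Jacobian, which is built here only for the check.

```python
import numpy as np

x1, x2 = 1.5, -2.0
u = np.array([0.3, 0.7])  # arbitrary output seed

# Reverse pass: accumulate input adjoints without forming J
x1_bar = u[0] * x2 + u[1]
x2_bar = u[0] * x1 + u[1] * 2 * x2

# Explicit Jacobian, used only to verify the VJP
J = np.array([[x2, x1],
              [1.0, 2 * x2]])
print(np.array([x1_bar, x2_bar]))
print(u @ J)  # same vector
```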

## Case Study 5: Linear Layer

Consider a batch linear layer:

$$
Y = XW + b.
$$

Shapes:

| Symbol | Shape |
|---|---:|
| $X$ | $B \times d_{\text{in}}$ |
| $W$ | $d_{\text{in}} \times d_{\text{out}}$ |
| $b$ | $d_{\text{out}}$ |
| $Y$ | $B \times d_{\text{out}}$ |

Let $\bar Y$ be the adjoint arriving from later layers.

Reverse rules:

$$
\bar X = \bar Y W^\top,
$$

$$
\bar W = X^\top \bar Y,
$$

$$
\bar b = \sum_{i=1}^{B} \bar Y_i.
$$

The backward pass has the same numerical character as the forward pass: dense matrix multiplication plus a reduction.

This is why deep learning systems implement reverse rules as tensor kernels, not scalar derivative loops.
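
A NumPy sketch of these rules, with arbitrarily chosen shapes, shows that the backward pass is itself just matrix multiplication plus a sum over the batch axis.

```python
import numpy as np

B, d_in, d_out = 4, 3, 2
rng = np.random.default_rng(0)

X = rng.normal(size=(B, d_in))
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=(d_out,))

Y = X @ W + b                        # forward
Y_bar = rng.normal(size=(B, d_out))  # adjoint arriving from later layers

X_bar = Y_bar @ W.T                  # shape (B, d_in)
W_bar = X.T @ Y_bar                  # shape (d_in, d_out)
b_bar = Y_bar.sum(axis=0)            # shape (d_out,)
```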

## Case Study 6: ReLU Layer

Let

$$
Y = \operatorname{ReLU}(X),
$$

where

$$
Y_{ij} = \max(0, X_{ij}).
$$

The derivative is

$$
\frac{\partial Y_{ij}}{\partial X_{ij}} =
\begin{cases}
1 & X_{ij} > 0, \\
0 & X_{ij} < 0.
\end{cases}
$$

At $X_{ij}=0$, systems choose a convention, commonly zero.

Given output adjoint $\bar Y$, reverse propagation gives

$$
\bar X_{ij} =
\bar Y_{ij}
\mathbf{1}_{X_{ij} > 0}.
$$

A ReLU backward pass therefore needs one of the following:

1. the original input $X$;
2. the output $Y$ (which is positive exactly where $X$ is);
3. a stored Boolean mask.

The mask often costs less memory than storing the full input, depending on representation.
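
A NumPy sketch of the masked backward rule, assuming the convention of zero at $X_{ij}=0$ and arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))

mask = X > 0                  # Boolean mask saved during the forward pass
Y = np.where(mask, X, 0.0)    # ReLU forward

Y_bar = rng.normal(size=X.shape)
X_bar = Y_bar * mask          # adjoint passes through only where X > 0
```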

## Case Study 7: Softmax Cross-Entropy

Softmax followed by cross-entropy is common in classification.

Given logits $z \in \mathbb{R}^K$, define

$$
p_i =
\frac{e^{z_i}}{\sum_j e^{z_j}}.
$$

For one-hot target $t$, the loss is

$$
L = -\sum_i t_i \log p_i.
$$

A direct graph would include exponentials, sums, division, logarithms, and multiplication. Reverse mode can differentiate this graph mechanically.

However, most systems use a fused stable backward rule:

$$
\bar z = p - t.
$$

For a batch with mean reduction:

$$
\bar Z = \frac{1}{B}(P - T).
$$

This custom backward rule is both simpler and more numerically stable than differentiating each primitive separately.
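
A NumPy sketch of the fused rule, using a shifted softmax for stability and one-hot targets; the shapes and data here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
B, K = 4, 5
Z = rng.normal(size=(B, K))           # logits
T = np.eye(K)[rng.integers(0, K, B)]  # one-hot targets

# Numerically stable softmax: subtract the row max before exponentiating
Z_shift = Z - Z.max(axis=1, keepdims=True)
P = np.exp(Z_shift) / np.exp(Z_shift).sum(axis=1, keepdims=True)

# Mean cross-entropy loss and its fused backward rule
L = -(T * np.log(P)).sum(axis=1).mean()
Z_bar = (P - T) / B
```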

## Case Study 8: Simple Recurrent Computation

Consider a recurrence:

$$
h_0 = x,
$$

$$
h_{t+1} = \tanh(ah_t + b),
\qquad
t = 0,\ldots,T-1,
$$

$$
L = h_T.
$$

Forward pass stores or checkpoints hidden states:

$$
h_0,h_1,\ldots,h_T.
$$

Seed:

$$
\bar h_T = 1.
$$

Backward through time uses the per-step decomposition

$$
z_t = ah_t + b,
\qquad
h_{t+1} = \tanh(z_t).
$$

Reverse rules:

$$
\bar z_t =
\bar h_{t+1}(1-h_{t+1}^2),
$$

$$
\bar a \mathrel{+}= \bar z_t h_t,
$$

$$
\bar b \mathrel{+}= \bar z_t,
$$

$$
\bar h_t \mathrel{+}= \bar z_t a.
$$

The same parameter $a$ is used at every time step, so its adjoint accumulates over all steps:

$$
\bar a =
\sum_{t=0}^{T-1}
\bar z_t h_t.
$$

This is backpropagation through time.
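
A plain Python sketch of this recurrence and its backward-through-time sweep, with arbitrary values chosen for $a$, $b$, $x$, and $T$:

```python
import math

a, b, x, T = 0.8, 0.1, 0.5, 5

# Forward pass: store every hidden state
h = [x]
for t in range(T):
    h.append(math.tanh(a * h[t] + b))

# Backward through time
h_bar = 1.0   # seed: dL/dh_T
a_bar = 0.0
b_bar = 0.0
for t in reversed(range(T)):
    z_bar = h_bar * (1 - h[t + 1] ** 2)  # tanh backward uses the stored output
    a_bar += z_bar * h[t]                # parameter adjoints accumulate over steps
    b_bar += z_bar
    h_bar = z_bar * a                    # adjoint flowing to the previous state
```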

## Case Study 9: Checkpointed Chain

Consider a chain of eight layers:

$$
h_0 \to h_1 \to h_2 \to \cdots \to h_8.
$$

Full reverse mode stores all activations:

$$
h_0,h_1,\ldots,h_8.
$$

A checkpointed scheme might store:

$$
h_0,h_4,h_8.
$$

During backward propagation through the segment $h_4 \to h_8$, the system recomputes the intermediate activations

$$
h_5,h_6,h_7
$$

from checkpoint $h_4$, then runs the local backward pass over that segment.

It then repeats the procedure for the segment $h_0 \to h_4$.

This reduces peak activation memory but adds extra forward computation.

The derivative remains mathematically the same if recomputation faithfully reproduces the original forward values.
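
A schematic sketch of this schedule in plain Python; the layer here is an arbitrary placeholder, and the local backward sweep over each recomputed segment is omitted.

```python
import math

def layer(h):
    # stand-in for one layer's forward computation
    return math.tanh(h) + 0.1

h0 = 0.5
num_layers = 8
checkpoint_every = 4

# Forward pass: keep only every fourth activation
checkpoints = {0: h0}
h = h0
for i in range(1, num_layers + 1):
    h = layer(h)
    if i % checkpoint_every == 0:
        checkpoints[i] = h

# Backward pass: recompute each segment's activations from its checkpoint,
# then (not shown) run the local backward sweep over that segment
for start in (4, 0):
    segment = [checkpoints[start]]
    for _ in range(checkpoint_every):
        segment.append(layer(segment[-1]))
    # ... reverse mode over `segment` would run here ...
```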

## Case Study 10: Tiny Reverse Mode Engine

A minimal scalar reverse mode system stores values, adjoints, and backward closures.

```python
import math

class Var:
    """A scalar node in the computation graph, holding a value and an adjoint."""

    def __init__(self, value):
        self.value = float(value)
        self.grad = 0.0
        # Each entry is (parent, pullback): the pullback maps this node's
        # adjoint to the contribution added to that parent's adjoint.
        self.parents = []

    def backward(self):
        # Topologically order the subgraph reachable from this node.
        topo = []
        seen = set()

        def visit(v):
            if id(v) in seen:
                return
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            topo.append(v)

        visit(self)

        # Seed the output adjoint.
        self.grad = 1.0

        # Propagate adjoints in reverse topological order, accumulating with +=.
        for v in reversed(topo):
            for parent, pullback in v.parents:
                parent.grad += pullback(v.grad)

def add(a, b):
    out = Var(a.value + b.value)
    out.parents = [
        (a, lambda g: g),
        (b, lambda g: g),
    ]
    return out

def mul(a, b):
    out = Var(a.value * b.value)
    out.parents = [
        (a, lambda g: g * b.value),
        (b, lambda g: g * a.value),
    ]
    return out

def sin(a):
    out = Var(math.sin(a.value))
    out.parents = [
        (a, lambda g: g * math.cos(a.value)),
    ]
    return out
```

Using it:

```python
x1 = Var(2.0)
x2 = Var(3.0)

y = add(mul(x1, x2), sin(x1))
y.backward()

print(y.value)   # 2*3 + sin(2)  ~ 6.909
print(x1.grad)   # 3 + cos(2)    ~ 2.584
print(x2.grad)   # 2.0
```

The gradients are:

$$
\bar x_1 = x_2 + \cos(x_1),
$$

$$
\bar x_2 = x_1.
$$

This small engine illustrates the whole mechanism:

1. the forward pass builds a graph;
2. each operation stores a pullback;
3. backward traversal accumulates gradients.
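
Extending the engine to a new primitive only requires a value and a pullback. As a sketch that reuses `Var`, `add`, and `mul` from above (and the existing `math` import), a `log` primitive is enough to reproduce Case Study 1:

```python
def log(a):
    out = Var(math.log(a.value))
    out.parents = [
        (a, lambda g: g / a.value),  # d log(a)/da = 1/a
    ]
    return out

x = Var(1.5)
y = log(add(Var(1.0), mul(x, x)))  # y = log(1 + x^2)
y.backward()
print(x.grad)  # 2x / (1 + x^2) ~ 0.923
```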

## Chapter Summary

Reverse mode automatic differentiation computes gradients by recording a primal computation and replaying its local derivative rules backward.

The main ideas from this chapter are:

| Concept | Role |
|---|---|
| adjoint | sensitivity of output to an intermediate value |
| reverse graph | dependency graph traversed backward |
| vector-Jacobian product | core operation of reverse mode |
| reverse accumulation | additive propagation of adjoints |
| tape | recorded execution trace |
| Wengert list | single-assignment linearized computation |
| checkpointing | memory reduction by recomputation |
| backpropagation | reverse mode applied to neural networks |

Reverse mode is efficient when many inputs influence a small number of outputs. This is the standard setting in optimization and deep learning: many parameters, one scalar objective.

