
Case Studies

This section studies reverse mode automatic differentiation through concrete examples. Each case has the same structure:

  1. evaluate the primal computation forward;
  2. seed the output adjoint;
  3. propagate adjoints backward;
  4. collect input or parameter gradients.

The examples move from scalar expressions to neural network layers and iterative programs.

Case Study 1: Scalar Expression

Consider

$$y = \log(1 + x^2).$$

Decompose the computation:

$$v_1 = x^2, \qquad v_2 = 1 + v_1, \qquad v_3 = \log(v_2), \qquad y = v_3.$$

Forward pass:

| Variable | Expression |
| --- | --- |
| $v_1$ | $x^2$ |
| $v_2$ | $1 + v_1$ |
| $v_3$ | $\log(v_2)$ |

Reverse initialization:

$$\bar v_3 = 1.$$

Backward propagation:

$$\bar v_2 \mathrel{+}= \bar v_3 \frac{1}{v_2}, \qquad \bar v_1 \mathrel{+}= \bar v_2, \qquad \bar x \mathrel{+}= \bar v_1 \cdot 2x.$$

Therefore,

$$\bar x = \frac{2x}{1+x^2}.$$

Reverse mode recovered the derivative by local rules without symbolic simplification of the original expression.
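
The same four steps can be written out literally in code. The sketch below (the function name `grad_log1px2` is illustrative, not from the text) runs the forward pass, seeds the output adjoint, and propagates it backward exactly as in the table above.

```python
import math

def grad_log1px2(x):
    # Forward pass: record the intermediates.
    v1 = x * x           # v1 = x^2
    v2 = 1.0 + v1        # v2 = 1 + v1
    v3 = math.log(v2)    # v3 = log(v2), y = v3

    # Reverse pass: seed the output adjoint and apply local rules.
    v3_bar = 1.0
    v2_bar = v3_bar * (1.0 / v2)
    v1_bar = v2_bar
    x_bar = v1_bar * 2.0 * x
    return v3, x_bar

y, x_bar = grad_log1px2(1.5)
print(x_bar, 2 * 1.5 / (1 + 1.5 ** 2))  # both ≈ 0.9231
```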

Case Study 2: Shared Intermediate

Consider

$$y = x^2 + \sin(x^2).$$

Use a shared intermediate:

$$v_1 = x^2, \qquad v_2 = \sin(v_1), \qquad v_3 = v_1 + v_2, \qquad y = v_3.$$

The value $v_1$ contributes to the output through two paths. Reverse mode must accumulate both.

Seed:

$$\bar v_3 = 1.$$

From addition:

$$\bar v_1 \mathrel{+}= \bar v_3, \qquad \bar v_2 \mathrel{+}= \bar v_3.$$

From sine:

$$\bar v_1 \mathrel{+}= \bar v_2 \cos(v_1).$$

Thus,

$$\bar v_1 = 1 + \cos(v_1).$$

From square:

$$\bar x \mathrel{+}= \bar v_1 \cdot 2x.$$

Therefore,

$$\frac{dy}{dx} = 2x\left(1+\cos(x^2)\right).$$

This example shows why adjoints are added, not assigned.
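
A short hand-coded sketch makes the accumulation visible: the adjoint of $v_1$ is updated twice, once per path, and replacing either `+=` with a plain assignment would drop one contribution.

```python
import math

def grad_shared(x):
    # Forward pass with the shared intermediate v1.
    v1 = x * x
    v2 = math.sin(v1)
    v3 = v1 + v2

    # Reverse pass: v1 accumulates adjoints from both paths.
    v3_bar = 1.0
    v1_bar, v2_bar = 0.0, 0.0
    v1_bar += v3_bar                   # from the addition
    v2_bar += v3_bar                   # from the addition
    v1_bar += v2_bar * math.cos(v1)    # from the sine
    x_bar = v1_bar * 2.0 * x           # from the square
    return x_bar

x = 0.7
print(grad_shared(x), 2 * x * (1 + math.cos(x * x)))  # should match
```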

Case Study 3: Two Inputs, One Output

Let

$$y = x_1 x_2 + \sin(x_1).$$

Wengert list:

$$v_1 = x_1 x_2, \qquad v_2 = \sin(x_1), \qquad v_3 = v_1 + v_2, \qquad y = v_3.$$

Seed:

$$\bar v_3 = 1.$$

From addition:

$$\bar v_1 = 1, \qquad \bar v_2 = 1.$$

From sine:

$$\bar x_1 \mathrel{+}= \cos(x_1).$$

From multiplication:

$$\bar x_1 \mathrel{+}= x_2, \qquad \bar x_2 \mathrel{+}= x_1.$$

Final gradient:

$$\nabla y = \begin{bmatrix} x_2 + \cos(x_1) \\ x_1 \end{bmatrix}.$$

One reverse pass computes both partial derivatives.
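
The Wengert list above translates directly into a small sketch (the function name is illustrative); a single seed on the output yields both partial derivatives in one backward sweep.

```python
import math

def grad_two_inputs(x1, x2):
    # Forward pass over the Wengert list.
    v1 = x1 * x2
    v2 = math.sin(x1)
    v3 = v1 + v2

    # Reverse pass from a single seed on v3.
    v3_bar = 1.0
    v1_bar = v3_bar
    v2_bar = v3_bar
    x1_bar = v1_bar * x2 + v2_bar * math.cos(x1)
    x2_bar = v1_bar * x1
    return x1_bar, x2_bar

print(grad_two_inputs(2.0, 3.0))  # (3 + cos(2), 2) ≈ (2.5839, 2.0)
```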

Case Study 4: Vector-Jacobian Product

Define

$$f(x_1,x_2) = \begin{bmatrix} x_1 x_2 \\ x_1 + x_2^2 \end{bmatrix}.$$

Let the output seed be

$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}.$$

Reverse mode computes

$$u^\top J_f(x).$$

Write outputs as:

$$y_1 = x_1 x_2, \qquad y_2 = x_1 + x_2^2.$$

Seed output adjoints:

$$\bar y_1 = u_1, \qquad \bar y_2 = u_2.$$

From $y_1$:

$$\bar x_1 \mathrel{+}= u_1 x_2, \qquad \bar x_2 \mathrel{+}= u_1 x_1.$$

From $y_2$:

$$\bar x_1 \mathrel{+}= u_2, \qquad \bar x_2 \mathrel{+}= 2 u_2 x_2.$$

Final result:

$$u^\top J_f(x) = \begin{bmatrix} u_1 x_2 + u_2 & u_1 x_1 + 2 u_2 x_2 \end{bmatrix}.$$

Reverse mode computes this product without constructing the full Jacobian as a stored matrix.
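
A sketch of the accumulation (names illustrative) shows the product formed entry by entry. Seeding with unit vectors recovers individual Jacobian rows, yet no 2×2 matrix is ever materialized.

```python
def vjp_f(x1, x2, u1, u2):
    # Accumulate u^T J_f(x) directly from the per-output reverse rules.
    x1_bar, x2_bar = 0.0, 0.0
    # From y1 = x1 * x2, with adjoint u1.
    x1_bar += u1 * x2
    x2_bar += u1 * x1
    # From y2 = x1 + x2^2, with adjoint u2.
    x1_bar += u2
    x2_bar += 2.0 * u2 * x2
    return x1_bar, x2_bar

print(vjp_f(2.0, 3.0, 1.0, 0.0))  # first Jacobian row: (3.0, 2.0)
print(vjp_f(2.0, 3.0, 0.0, 1.0))  # second Jacobian row: (1.0, 6.0)
```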

Case Study 5: Linear Layer

Consider a batch linear layer:

$$Y = XW + b.$$

Shapes:

| Symbol | Shape |
| --- | --- |
| $X$ | $B \times d_{\text{in}}$ |
| $W$ | $d_{\text{in}} \times d_{\text{out}}$ |
| $b$ | $d_{\text{out}}$ |
| $Y$ | $B \times d_{\text{out}}$ |

Let $\bar Y$ be the adjoint arriving from later layers.

Reverse rules:

$$\bar X = \bar Y W^\top, \qquad \bar W = X^\top \bar Y, \qquad \bar b = \sum_{i=1}^{B} \bar Y_i.$$

The backward pass has the same numerical character as the forward pass: dense matrix multiplication plus a reduction.

This is why deep learning systems implement reverse rules as tensor kernels, not scalar derivative loops.
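
Assuming NumPy for the tensor arithmetic (the chapter's own engine is scalar), the three reverse rules are a short sketch:

```python
import numpy as np

def linear_backward(X, W, Y_bar):
    # Reverse rules for Y = X @ W + b, expressed as dense kernels.
    X_bar = Y_bar @ W.T           # shape (B, d_in)
    W_bar = X.T @ Y_bar           # shape (d_in, d_out)
    b_bar = Y_bar.sum(axis=0)     # shape (d_out,)
    return X_bar, W_bar, b_bar

rng = np.random.default_rng(0)
B, d_in, d_out = 4, 3, 2
X = rng.normal(size=(B, d_in))
W = rng.normal(size=(d_in, d_out))
Y_bar = rng.normal(size=(B, d_out))
X_bar, W_bar, b_bar = linear_backward(X, W, Y_bar)
print(X_bar.shape, W_bar.shape, b_bar.shape)  # (4, 3) (3, 2) (2,)
```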

Case Study 6: ReLU Layer

Let

$$Y = \operatorname{ReLU}(X),$$

where

$$Y_{ij} = \max(0, X_{ij}).$$

The derivative is

$$\frac{\partial Y_{ij}}{\partial X_{ij}} = \begin{cases} 1 & X_{ij} > 0, \\ 0 & X_{ij} < 0. \end{cases}$$

At $X_{ij}=0$, systems choose a convention, commonly zero.

Given output adjoint $\bar Y$, reverse propagation gives

$$\bar X_{ij} = \bar Y_{ij}\, \mathbf{1}_{X_{ij} > 0}.$$

A ReLU backward pass therefore needs one of the following:

  1. the original input $X$;
  2. the output $Y$;
  3. a stored Boolean mask.

The mask often costs less memory than storing the full input, depending on representation.
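
A NumPy sketch that keeps only the Boolean mask (helper names illustrative):

```python
import numpy as np

def relu_forward(X):
    mask = X > 0                  # Boolean mask is all backward needs
    return np.where(mask, X, 0.0), mask

def relu_backward(Y_bar, mask):
    # Pass the adjoint through where the input was positive;
    # zero it elsewhere, including at exactly zero by convention.
    return np.where(mask, Y_bar, 0.0)

X = np.array([[-1.0, 0.0, 2.0]])
Y, mask = relu_forward(X)
print(relu_backward(np.ones_like(Y), mask))  # [[0. 0. 1.]]
```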

Case Study 7: Softmax Cross-Entropy

Softmax followed by cross-entropy is common in classification.

Given logits $z \in \mathbb{R}^K$, define

$$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}.$$

For a one-hot target $t$, the loss is

$$L = -\sum_i t_i \log p_i.$$

A direct graph would include exponentials, sums, division, logarithms, and multiplication. Reverse mode can differentiate this graph mechanically.

However, most systems use a fused stable backward rule:

$$\bar z = p - t.$$

For a batch with mean reduction:

$$\bar Z = \frac{1}{B}(P - T).$$

This custom backward rule is both simpler and more numerically stable than differentiating each primitive separately.
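
A NumPy sketch of the fused forward and backward rules, with the usual max-subtraction for numerical stability (the helper name is illustrative):

```python
import numpy as np

def softmax_xent(Z, T):
    # Stable softmax: subtract the per-row max before exponentiating.
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    loss = -(T * np.log(P)).sum(axis=1).mean()
    # Fused backward rule for the mean-reduced batch loss.
    Z_bar = (P - T) / Z.shape[0]
    return loss, Z_bar

Z = np.array([[2.0, 0.5, -1.0]])
T = np.array([[1.0, 0.0, 0.0]])
loss, Z_bar = softmax_xent(Z, T)
print(loss, Z_bar)
```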

Case Study 8: Simple Recurrent Computation

Consider a recurrence:

$$h_0 = x, \qquad h_{t+1} = \tanh(a h_t + b), \quad t = 0,\ldots,T-1, \qquad L = h_T.$$

Forward pass stores or checkpoints hidden states:

$$h_0, h_1, \ldots, h_T.$$

Seed:

$$\bar h_T = 1.$$

Backward through time:

$$z_t = a h_t + b, \qquad h_{t+1} = \tanh(z_t).$$

Reverse rules:

$$\bar z_t = \bar h_{t+1}(1 - h_{t+1}^2), \qquad \bar a \mathrel{+}= \bar z_t h_t, \qquad \bar b \mathrel{+}= \bar z_t, \qquad \bar h_t \mathrel{+}= \bar z_t\, a.$$

The same parameter $a$ is used at every time step, so its adjoint accumulates over all steps:

$$\bar a = \sum_{t=0}^{T-1} \bar z_t h_t.$$

This is backpropagation through time.
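
The recurrence and its backward sweep fit in a short sketch (the parameter values below are arbitrary):

```python
import math

def bptt(x, a, b, T):
    # Forward pass: store every hidden state h_0, ..., h_T.
    h = [x]
    for t in range(T):
        h.append(math.tanh(a * h[t] + b))

    # Backward through time, starting from the seed on h_T.
    h_bar, a_bar, b_bar = 1.0, 0.0, 0.0
    for t in reversed(range(T)):
        z_bar = h_bar * (1.0 - h[t + 1] ** 2)
        a_bar += z_bar * h[t]    # parameter adjoints accumulate
        b_bar += z_bar           # over all time steps
        h_bar = z_bar * a
    return h[T], a_bar, b_bar

print(bptt(x=0.5, a=0.8, b=0.1, T=5))
```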

Case Study 9: Checkpointed Chain

Consider a chain of eight layers:

$$h_0 \to h_1 \to h_2 \to \cdots \to h_8.$$

Full reverse mode stores all activations:

$$h_0, h_1, \ldots, h_8.$$

A checkpointed scheme might store:

$$h_0, h_4, h_8.$$

During backward propagation through the segment $h_4 \to h_8$, the system recomputes:

$$h_5, h_6, h_7, h_8$$

from checkpoint $h_4$, then runs the local backward pass.

Then it repeats for the segment $h_0 \to h_4$.

This reduces peak activation memory but adds extra forward computation.

The derivative remains mathematically the same if recomputation faithfully reproduces the original forward values.
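
A scalar sketch of this schedule, assuming each layer is a plain tanh and checkpoints are placed every four layers (all names illustrative), recomputes each segment from its checkpoint before backpropagating through it:

```python
import math

def segment_forward(h, n):
    # Recompute n tanh layers from a checkpoint, keeping activations.
    acts = [h]
    for _ in range(n):
        acts.append(math.tanh(acts[-1]))
    return acts

def checkpointed_grad(h0, n_layers=8, seg=4):
    # Forward pass: store only the checkpoints h_0, h_4, h_8.
    checkpoints = [h0]
    for _ in range(n_layers // seg):
        checkpoints.append(segment_forward(checkpoints[-1], seg)[-1])

    # Backward pass: last segment first, recompute then backpropagate.
    h_bar = 1.0  # seed on h_8
    for ckpt in reversed(checkpoints[:-1]):
        acts = segment_forward(ckpt, seg)     # extra forward work
        for h_out in reversed(acts[1:]):
            h_bar *= 1.0 - h_out ** 2         # d tanh(z)/dz = 1 - tanh(z)^2
    return h_bar

print(checkpointed_grad(0.3))
```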

Case Study 10: Tiny Reverse Mode Engine

A minimal scalar reverse mode system stores values, adjoints, and backward closures.

```python
import math

class Var:
    def __init__(self, value):
        self.value = float(value)   # primal value
        self.grad = 0.0             # accumulated adjoint
        self.parents = []           # (parent, pullback) pairs recorded by each op

    def backward(self):
        # Topologically sort the nodes reachable from this output.
        topo = []
        seen = set()

        def visit(v):
            if id(v) in seen:
                return
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            topo.append(v)

        visit(self)

        # Seed the output adjoint.
        self.grad = 1.0

        # Propagate adjoints in reverse topological order,
        # accumulating each pullback into the parent's grad.
        for v in reversed(topo):
            for parent, pullback in v.parents:
                parent.grad += pullback(v.grad)

def add(a, b):
    out = Var(a.value + b.value)
    out.parents = [
        (a, lambda g: g),
        (b, lambda g: g),
    ]
    return out

def mul(a, b):
    out = Var(a.value * b.value)
    out.parents = [
        (a, lambda g: g * b.value),
        (b, lambda g: g * a.value),
    ]
    return out

def sin(a):
    out = Var(math.sin(a.value))
    out.parents = [
        (a, lambda g: g * math.cos(a.value)),
    ]
    return out
```

Using it:

```python
x1 = Var(2.0)
x2 = Var(3.0)

y = add(mul(x1, x2), sin(x1))
y.backward()

print(y.value)    # 2*3 + sin(2) ≈ 6.9093
print(x1.grad)    # 3 + cos(2) ≈ 2.5839
print(x2.grad)    # 2.0
```

The gradients are:

$$\bar x_1 = x_2 + \cos(x_1), \qquad \bar x_2 = x_1.$$

This small engine illustrates the whole mechanism:

  1. the forward pass builds a graph;
  2. each operation stores a pullback;
  3. backward traversal accumulates gradients.

Chapter Summary

Reverse mode automatic differentiation computes gradients by recording a primal computation and replaying its local derivative rules backward.

The main ideas from this chapter are:

| Concept | Role |
| --- | --- |
| adjoint | sensitivity of the output to an intermediate value |
| reverse graph | dependency graph traversed backward |
| vector-Jacobian product | core operation of reverse mode |
| reverse accumulation | additive propagation of adjoints |
| tape | recorded execution trace |
| Wengert list | single-assignment linearized computation |
| checkpointing | memory reduction by recomputation |
| backpropagation | reverse mode applied to neural networks |

Reverse mode is efficient when many inputs influence a small number of outputs. This is the standard setting in optimization and deep learning: many parameters, one scalar objective.