# Reverse Computational Graphs

Reverse mode automatic differentiation operates on a computational graph. The forward pass evaluates the graph from inputs to outputs. The reverse pass traverses the same graph in the opposite direction and propagates adjoints backward through the dependencies.

A computational graph represents a program as a directed acyclic structure of values and operations. Each node represents either:

1. a variable or intermediate value;
2. an elementary operation.

Edges represent data dependencies.

For example, consider

$$
y = (x_1 x_2) + \sin(x_1).
$$

The forward computation can be decomposed into intermediate values:

$$
v_1 = x_1 x_2,
$$

$$
v_2 = \sin(x_1),
$$

$$
v_3 = v_1 + v_2,
$$

$$
y = v_3.
$$

The graph structure is:

```text
x1 ----\
        (*) ---- v1 --\
x2 ----/               \
                         (+) ---- y
x1 -------- sin ---- v2 /
```

The graph explicitly records how every value depends on earlier values. Reverse mode uses this structure to apply the chain rule systematically.

### Forward Graph Construction

A forward evaluation produces two artifacts:

1. the primal values;
2. the graph structure.

Each primitive operation appends a node to the graph.

For example:

| Node | Operation | Inputs | Output |
|---|---|---|---|
| 1 | multiply | $x_1, x_2$ | $v_1$ |
| 2 | sine | $x_1$ | $v_2$ |
| 3 | add | $v_1, v_2$ | $v_3$ |

This sequence is often called a Wengert list or tape. The order matters because every operation depends only on values computed earlier in the list.

The forward graph therefore defines a topological ordering of dependencies.
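
As a concrete illustration, here is a minimal Python sketch that records such a tape while evaluating the running example. The helper names (`tape`, `values`, `leaf`, `record`) are illustrative and not part of any particular AD library.

```python
import math

# Minimal tape-construction sketch: each primitive appends one record to a
# global list as a side effect of computing its primal value.
tape = []      # recorded operations, in execution order
values = {}    # primal value of every variable id

def leaf(name, value):
    values[name] = value
    return name

def record(op, inputs, output_value):
    out = f"v{len(tape) + 1}"
    values[out] = output_value
    tape.append({"op": op, "inputs": inputs, "output": out})
    return out

# Forward evaluation of y = x1*x2 + sin(x1) builds the Wengert list.
x1 = leaf("x1", 2.0)
x2 = leaf("x2", 3.0)
v1 = record("mul", [x1, x2], values[x1] * values[x2])
v2 = record("sin", [x1], math.sin(values[x1]))
v3 = record("add", [v1, v2], values[v1] + values[v2])

for node in tape:
    print(node)            # mul, sin, add -- already in topological order
print("y =", values[v3])   # primal output
```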

### Reverse Traversal

The reverse pass traverses the graph backward.

The output node begins with seed adjoint

$$
\bar y = 1.
$$

Then the reverse pass visits nodes in reverse topological order:

$$
v_3 \rightarrow v_2 \rightarrow v_1 \rightarrow x_1,x_2.
$$

At each node, the local derivative rule distributes the current adjoint into parent nodes.

For the addition node:

$$
v_3 = v_1 + v_2,
$$

the reverse rule is

$$
\bar v_1 \mathrel{+}= \bar v_3,
$$

$$
\bar v_2 \mathrel{+}= \bar v_3.
$$

For the sine node:

$$
v_2 = \sin(x_1),
$$

the reverse rule is

$$
\bar x_1
\mathrel{+}=
\bar v_2 \cos(x_1).
$$

For the multiplication node:

$$
v_1 = x_1 x_2,
$$

the reverse rule is

$$
\bar x_1
\mathrel{+}=
\bar v_1 x_2,
$$

$$
\bar x_2
\mathrel{+}=
\bar v_1 x_1.
$$

The reverse graph traversal accumulates all contributions into the final adjoints.
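
Putting the three rules together, the following self-contained sketch runs the full reverse pass for the running example. The tape layout mirrors the forward-construction sketch above; all names are illustrative rather than taken from any specific library.

```python
import math
from collections import defaultdict

# Forward pass for y = x1*x2 + sin(x1).
x1, x2 = 2.0, 3.0
v1 = x1 * x2
v2 = math.sin(x1)
v3 = v1 + v2

# Tape: (output, op, inputs), recorded in topological order.
tape = [
    ("v1", "mul", ("x1", "x2")),
    ("v2", "sin", ("x1",)),
    ("v3", "add", ("v1", "v2")),
]
primals = {"x1": x1, "x2": x2, "v1": v1, "v2": v2, "v3": v3}

adjoint = defaultdict(float)
adjoint["v3"] = 1.0  # seed the output adjoint

# Reverse pass: visit recorded operations last-to-first and apply the
# local rule of each node, accumulating into the parents' adjoints.
for out, op, inputs in reversed(tape):
    bar = adjoint[out]
    if op == "add":
        a, b = inputs
        adjoint[a] += bar
        adjoint[b] += bar
    elif op == "mul":
        a, b = inputs
        adjoint[a] += bar * primals[b]
        adjoint[b] += bar * primals[a]
    elif op == "sin":
        (a,) = inputs
        adjoint[a] += bar * math.cos(primals[a])

print(adjoint["x1"], x2 + math.cos(x1))  # dy/dx1 = x2 + cos(x1)
print(adjoint["x2"], x1)                 # dy/dx2 = x1
```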

### Local Graph Semantics

Each node in the graph defines a local mapping:

$$
v_j = g_j(v_{p_1}, \ldots, v_{p_r}).
$$

The node receives an output adjoint

$$
\bar v_j,
$$

and propagates contributions into each parent:

$$
\bar v_{p_i}
\mathrel{+}=
\bar v_j
\frac{\partial v_j}{\partial v_{p_i}}.
$$

The reverse graph therefore acts like a backward flow of sensitivities.

The forward pass computes values. The reverse pass computes influence.
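
In code, this local contract amounts to one small function per operation: given the output adjoint and the saved primal inputs, return one contribution per parent. The function names below are illustrative.

```python
import math

# One local backward rule per operation.
def vjp_add(bar_out, a, b):
    return (bar_out, bar_out)               # d(a+b)/da = d(a+b)/db = 1

def vjp_mul(bar_out, a, b):
    return (bar_out * b, bar_out * a)       # d(ab)/da = b, d(ab)/db = a

def vjp_sin(bar_out, a):
    return (bar_out * math.cos(a),)         # d(sin a)/da = cos a

# The reverse pass adds each returned contribution into the parent's adjoint.
bar_v2 = 1.0
(contrib_x1,) = vjp_sin(bar_v2, 2.0)
print(contrib_x1)   # bar_v2 * cos(2.0)
```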

### Fan-Out and Adjoint Accumulation

Graphs frequently contain fan-out, where one value is used by several later operations.

Consider

$$
v_1 = x^2,
$$

$$
v_2 = v_1 + 1,
$$

$$
v_3 = v_1 \sin(x),
$$

$$
y = v_2 + v_3.
$$

The value $v_1$ influences the output through two separate paths:

1. $v_1 \to v_2 \to y$
2. $v_1 \to v_3 \to y$

The chain rule requires summing both effects:

$$
\bar v_1 =
\frac{\partial y}{\partial v_2}
\frac{\partial v_2}{\partial v_1}
+
\frac{\partial y}{\partial v_3}
\frac{\partial v_3}{\partial v_1}.
$$

This is why reverse mode uses additive accumulation.

Operationally:

```text
adjoint[parent] += contribution
```

never

```text
adjoint[parent] = contribution
```

unless the graph guarantees a single dependency path.
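
The hand-unrolled sketch below makes the accumulation explicit for this fan-out example; replacing either `+=` on `bar["v1"]` with `=` would silently drop one of the two paths. Variable names are illustrative.

```python
import math

# y = (x**2 + 1) + x**2 * sin(x), where v1 = x**2 feeds two later nodes.
x = 0.7
v1 = x * x
v2 = v1 + 1.0
v3 = v1 * math.sin(x)
y = v2 + v3

bar = {"v1": 0.0, "v2": 0.0, "v3": 0.0, "x": 0.0}
bar_y = 1.0

# y = v2 + v3
bar["v2"] += bar_y
bar["v3"] += bar_y
# v3 = v1 * sin(x)
bar["v1"] += bar["v3"] * math.sin(x)
bar["x"]  += bar["v3"] * v1 * math.cos(x)
# v2 = v1 + 1
bar["v1"] += bar["v2"]          # second contribution: accumulate, don't overwrite
# v1 = x * x
bar["x"]  += bar["v1"] * 2 * x

print(bar["v1"], 1.0 + math.sin(x))                             # both paths summed
print(bar["x"], 2*x + 2*x*math.sin(x) + x*x*math.cos(x))        # matches dy/dx
```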

### Reverse Graphs as Transposed Dependency Graphs

The forward graph expresses how values are constructed.

The reverse graph expresses how sensitivities propagate.

Suppose the forward dependency matrix for a local operation is

$$
J =
\frac{\partial v_{\text{out}}}{\partial v_{\text{in}}}.
$$

Forward mode propagates tangents by multiplication with $J$:

$$
\dot v_{\text{out}} = J \dot v_{\text{in}}.
$$

Reverse mode propagates adjoints using the transpose:

$$
\bar v_{\text{in}} = J^\top \bar v_{\text{out}}.
$$

The reverse graph therefore corresponds to the transpose of the local linearized graph.

This interpretation explains why reverse mode traverses edges backward.
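
A small numerical sketch of this duality for the local map $(x_1, x_2) \mapsto (x_1 x_2, \sin x_1)$, using NumPy only for the matrix algebra; the tangent and adjoint vectors are arbitrary illustrative choices.

```python
import numpy as np

# Jacobian of (x1*x2, sin(x1)) with respect to (x1, x2).
x1, x2 = 2.0, 3.0
J = np.array([
    [x2, x1],              # d(x1*x2)/dx1, d(x1*x2)/dx2
    [np.cos(x1), 0.0],     # d(sin x1)/dx1, d(sin x1)/dx2
])

x_dot = np.array([1.0, 0.0])   # input tangent
y_bar = np.array([1.0, 1.0])   # output adjoint

print(J @ x_dot)    # forward mode: tangents flow along forward edges
print(J.T @ y_bar)  # reverse mode: adjoints flow along transposed edges
```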

### Static Graphs and Dynamic Graphs

Some systems construct the computational graph before execution. Others construct it during execution.

Static graph systems build the graph once and then execute it repeatedly.

Examples include early graph-based machine learning frameworks such as TensorFlow 1.x and Theano.

Advantages:

| Property | Benefit |
|---|---|
| Whole-graph optimization | Better fusion and scheduling |
| Static memory planning | Lower allocation overhead |
| Easier compilation | Better ahead-of-time optimization |

Disadvantages:

| Property | Limitation |
|---|---|
| Fixed structure | Harder support for dynamic control flow |
| Reduced flexibility | More complex debugging |

Dynamic graph systems (sometimes called define-by-run, as in PyTorch's eager autograd) construct the graph during execution itself.

Each operation immediately appends nodes to the active graph.

Advantages:

| Property | Benefit |
|---|---|
| Natural control flow | Standard language semantics |
| Easier debugging | Immediate execution |
| Flexible graphs | Input-dependent structure |

Disadvantages:

| Property | Limitation |
|---|---|
| Runtime overhead | Dynamic graph construction cost |
| Harder optimization | Less global visibility |

Modern systems often combine both approaches through tracing or staged compilation, as in JAX's `jit` tracing or PyTorch's `torch.compile`.

### Reverse Graph Storage

The reverse pass requires information from the forward pass. Each node must retain enough information to compute local derivatives.

For example:

| Operation | Required Saved Values |
|---|---|
| $z=x+y$ | none |
| $z=xy$ | $x,y$ |
| $z=\sin(x)$ | $x$ or $z$ |
| $z=x^{-1}$ | $x$ or $z$ |

The storage structure may contain:

1. operation type;
2. pointers to inputs;
3. output value;
4. saved primal values;
5. local metadata.

This storage is commonly called the tape.
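
A tape entry can be sketched as a small record; the field names below are illustrative rather than any library's actual layout.

```python
from dataclasses import dataclass
from typing import Any, Tuple

@dataclass
class TapeNode:
    op: str                      # operation type, e.g. "mul" or "sin"
    inputs: Tuple[str, ...]      # references to parent values
    output: Any                  # primal output value
    saved: Tuple[Any, ...] = ()  # primals needed by the local derivative rule

# z = x * y must save both operands; z = x + y would save nothing.
node = TapeNode(op="mul", inputs=("x", "y"), output=6.0, saved=(2.0, 3.0))
print(node)
```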

A large reverse graph may consume substantial memory. Deep neural networks can generate graphs containing billions of primitive operations. Memory management therefore becomes a central systems problem.

### Destructive Updates and Aliasing

Programs often mutate memory:

```text
x = x + 1
```

Mutation complicates reverse graphs because values that the reverse pass still needs may be overwritten.

Reverse mode conceptually assumes immutable intermediate values. Systems that support mutation must preserve previous states or reconstruct them later.

Aliasing introduces additional complications:

```text
a = x
b = x
```

If one alias is mutated, the derivative system must still attribute adjoint contributions to the value that was actually read during the forward pass.

Many AD systems avoid these issues by using purely functional intermediate representations or SSA form.

### Graph Granularity

The computational graph may represent operations at different levels of abstraction.

Fine-grained graphs represent primitive arithmetic operations individually:

```text
t1 = x * y
t2 = sin(t1)
t3 = t2 + z
```

Coarse-grained graphs represent larger kernels:

```text
matmul(A, B)
```

Fine-grained graphs provide flexibility and exact control over differentiation.

Coarse-grained graphs reduce graph size and runtime overhead.

Modern systems usually combine both approaches:

1. large tensor kernels serve as graph primitives;
2. each kernel ships with its own derivative rule rather than being differentiated internally.

### Reverse Graph Execution Order

The reverse pass must respect dependency ordering.

If node $u$ contributes to node $v$, then the reverse pass must process $v$ before $u$.

This requires reverse topological traversal.

For a graph with nodes

$$
v_1, v_2, \ldots, v_k,
$$

generated in forward order, reverse mode typically processes:

$$
v_k, v_{k-1}, \ldots, v_1.
$$

This works because the forward trace already satisfies topological ordering.
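
The tiny sketch below checks that simply reversing a forward trace satisfies the ordering constraint; the node names and edges correspond to the running example and are purely illustrative.

```python
# Forward trace in topological order, plus the "u feeds v" edges.
tape = ["x1", "x2", "v1", "v2", "v3"]
edges = [("x1", "v1"), ("x2", "v1"),
         ("x1", "v2"), ("v1", "v3"), ("v2", "v3")]

order = list(reversed(tape))                    # reverse topological order
position = {node: i for i, node in enumerate(order)}

# Every consumer v must be processed before the value u it consumes.
assert all(position[v] < position[u] for u, v in edges)
print(order)   # ['v3', 'v2', 'v1', 'x2', 'x1']
```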

### Computational Interpretation

A reverse computational graph transforms a program into two coupled systems:

1. a primal computation graph for values;
2. a dual graph for sensitivities.

The forward graph answers:

> How is this value computed?

The reverse graph answers:

> Which earlier values influence this output, and by how much?

Reverse mode differentiation is therefore not symbolic algebra over expressions. It is execution over a graph of local derivative transformations.

The graph structure makes the chain rule operational. Each node performs a small local backward transformation, and the full gradient emerges from the composition of these local adjoint propagations.

