# Computational Graphs

A computational graph represents a calculation as nodes and edges. Nodes represent operations or values. Edges represent data dependencies. Automatic differentiation uses this graph structure to decide how derivative information should move through a computation.

A simple program,

```text
a = x * y
b = sin(a)
c = b + x
```

can be seen as the graph

```text
x ──┬─> multiply ─> a ─> sin ─> b ──┐
    │                               ├─> add ─> c
y ──┘                               │
x ──────────────────────────────────┘
```

The output $c$ depends on $x$ through two paths: directly through the final addition, and indirectly through multiplication and sine. Reverse mode must accumulate both contributions.

### Nodes and Edges

There are two common graph conventions.

In an operation graph, each operation is a node:

```text
x, y -> multiply -> sin -> add -> c
```

In a value graph, each intermediate value is a node, and operations label the edges or transitions:

```text
x, y -> a -> b -> c
```

Both sketches are linearized views; they omit the second use of $x$, which also feeds the final addition.

AD systems often use a mixed representation. A node may store the value produced by an operation, the operation type, references to parent nodes, and enough information to compute local derivatives.

For reverse mode, a node may store:

| Field | Meaning |
|---|---|
| value | primal value computed in the forward pass |
| parents | input nodes used to produce this value |
| local rule | how to propagate adjoints to parents |
| adjoint | accumulated derivative of final output with respect to this value |

The adjoint of an intermediate value $v$ is commonly written as

$$
\bar{v} = \frac{\partial L}{\partial v}
$$

where $L$ is the final scalar output, often a loss.
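These fields can be sketched as a small Python class. The sketch below is illustrative, not a real library's API; `multiply` is a hypothetical constructor showing how a local rule closes over the values it needs:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    value: float                                   # primal value from the forward pass
    parents: List["Node"] = field(default_factory=list)
    # local rule: given this node's adjoint, add contributions to each parent
    local_rule: Callable[[float], None] = lambda adj: None
    adjoint: float = 0.0                           # accumulated d(output)/d(value)

def multiply(x: Node, y: Node) -> Node:
    out = Node(value=x.value * y.value, parents=[x, y])
    def rule(adj):
        x.adjoint += y.value * adj   # d(xy)/dx = y
        y.adjoint += x.value * adj   # d(xy)/dy = x
    out.local_rule = rule
    return out
```

Note that the rule *accumulates* with `+=` rather than assigning, which matters once a value has several users.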

### Directed Acyclic Graphs

For straight-line programs, the computational graph is a directed acyclic graph, or DAG. Edges point from inputs to outputs. There are no directed cycles because each intermediate value is computed after its dependencies.

For example,

```text
x -> a -> b -> y
```

is acyclic.

Loops in source code do not necessarily create cycles in the executed graph. A loop that runs five times can be unrolled into five repeated blocks:

```text
s0 -> s1 -> s2 -> s3 -> s4 -> s5
```

The executed graph is still acyclic. It may be large, but its nodes have a clear time order.

This distinction is important. AD differentiates the executed computation. A loop gives rise to a sequence of operations, and reverse mode walks that sequence backward.

### Topological Order

A topological order lists graph nodes so that every node appears after its dependencies.

Forward evaluation follows topological order. If an operation needs $x$ and $y$, those values must be available before the operation runs.

Reverse accumulation follows reverse topological order. If an intermediate value contributes to later values, all downstream adjoint contributions must be known before its adjoint can be propagated to its parents.

For the program

```text
a = x * y
b = sin(a)
c = b + x
```

forward order is

```text
x, y, a, b, c
```

reverse order is

```text
c, b, a, y, x
```

More precisely, reverse propagation processes operation nodes backward:

```text
c = b + x
b = sin(a)
a = x * y
```
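Both orders fall out of a single post-order depth-first traversal. A minimal sketch, assuming a hypothetical `Node` that records only a name and its parent nodes:

```python
class Node:
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)

def topological_order(outputs):
    # post-order DFS: a node is emitted only after all of its parents
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)
    for out in outputs:
        visit(out)
    return order

# the example program: a = x * y, b = sin(a), c = b + x
x, y = Node("x"), Node("y")
a = Node("a", [x, y])
b = Node("b", [a])
c = Node("c", [b, x])

forward = [n.name for n in topological_order([c])]
print(forward)        # ['x', 'y', 'a', 'b', 'c']
print(forward[::-1])  # ['c', 'b', 'a', 'y', 'x']
```

Reversing the forward order gives a valid reverse order for free, which is exactly how tape-based systems replay operations backward.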

### Forward Mode on a Graph

Forward mode attaches a tangent to each value. If $v$ is a primal value, its tangent is written as $\dot{v}$.

For each operation, the AD system computes both the ordinary value and its tangent.

For

$$
a = xy
$$

the tangent rule is

$$
\dot{a} = \dot{x}y + x\dot{y}
$$

For

$$
b = \sin a
$$

the tangent rule is

$$
\dot{b} = \cos(a)\dot{a}
$$

For

$$
c = b + x
$$

the tangent rule is

$$
\dot{c} = \dot{b} + \dot{x}
$$

Thus forward mode computes a Jacobian-vector product by pushing one input perturbation through the graph.
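The three rules above can be fused into one sweep. A minimal sketch that carries a (value, tangent) pair through each step, seeded with $\dot{x} = 1$, $\dot{y} = 0$ so the result is $\partial c/\partial x$:

```python
import math

def forward_mode(x, y, dx=1.0, dy=0.0):
    # each step computes the primal value and its tangent together
    a, da = x * y,       dx * y + x * dy    # product rule
    b, db = math.sin(a), math.cos(a) * da   # chain rule through sin
    c, dc = b + x,       db + dx            # sum rule
    return c, dc
```

Seeding $\dot{y} = 1$ instead yields $\partial c/\partial y$: one forward sweep per input direction.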

### Reverse Mode on a Graph

Reverse mode attaches an adjoint to each value. The adjoint $\bar{v}$ measures how the final output changes when $v$ changes.

For scalar output $c$, initialize

$$
\bar{c} = 1
$$

Then process operations backward.

For

$$
c = b + x
$$

we add

$$
\bar{b} \mathrel{+}= \bar{c}
$$

$$
\bar{x} \mathrel{+}= \bar{c}
$$

For

$$
b = \sin a
$$

we add

$$
\bar{a} \mathrel{+}= \cos(a)\bar{b}
$$

For

$$
a = xy
$$

we add

$$
\bar{x} \mathrel{+}= y\bar{a}
$$

$$
\bar{y} \mathrel{+}= x\bar{a}
$$

The final $\bar{x}$ contains both the direct contribution through $c = b + x$ and the indirect contribution through $a = xy$.
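Written out as straight-line code, the whole backward sweep for this program looks as follows. This is a sketch specialized to this one program, not a general tape:

```python
import math

def reverse_mode(x, y):
    # forward pass: compute and keep the primal values
    a = x * y
    b = math.sin(a)
    c = b + x

    # backward pass: seed the output adjoint, then walk operations in reverse
    c_bar = 1.0
    x_bar = y_bar = a_bar = b_bar = 0.0
    # c = b + x
    b_bar += c_bar
    x_bar += c_bar
    # b = sin(a)
    a_bar += math.cos(a) * b_bar
    # a = x * y
    x_bar += y * a_bar
    y_bar += x * a_bar
    return c, x_bar, y_bar
```

A single backward sweep delivers both $\bar{x}$ and $\bar{y}$; note that $\bar{x}$ is accumulated twice, matching the two paths through the graph.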

### Accumulation at Shared Nodes

Shared values are common in real programs. A value may be used by several later operations.

```text
u = x * x
v = u + 1
w = sin(u)
y = v * w
```

Here, $u$ feeds both $v$ and $w$. Reverse mode must add both contributions to $\bar{u}$.

The general rule is:

$$
\bar{u} =
\sum_{r \in \operatorname{users}(u)}
\left(\frac{\partial r}{\partial u}\right)^{T} \bar{r}
$$

where the sum ranges over all immediate downstream users of $u$.

This summation is the graph form of the multivariable chain rule. It is also a practical implementation detail. Adjoint buffers must support accumulation, not simple assignment.
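For the four-line program above, the accumulation rule can be checked directly. A minimal sketch in which `u_bar` is built up with `+=` from both of its users:

```python
import math

def reverse_shared(x):
    # forward pass
    u = x * x
    v = u + 1.0
    w = math.sin(u)
    y = v * w

    # backward pass: u accumulates contributions from both of its users
    y_bar = 1.0
    v_bar = w * y_bar              # y = v * w
    w_bar = v * y_bar
    u_bar = 0.0
    u_bar += v_bar                 # user v = u + 1
    u_bar += math.cos(u) * w_bar   # user w = sin(u)
    x_bar = 2.0 * x * u_bar        # u = x * x, x appears twice as a parent
    return y, x_bar
```

Replacing either `+=` with plain assignment would silently drop one user's contribution, which is why adjoint buffers must support accumulation.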

### Static and Dynamic Graphs

Some AD systems build a graph before execution. This is a static graph. The graph can be optimized, compiled, partitioned, and scheduled before values are computed.

Other systems build the graph as the program runs. This is a dynamic graph. The executed operations determine the graph. Dynamic graphs handle ordinary host-language control flow naturally.

| Graph style | Construction time | Strength | Cost |
|---|---|---|---|
| Static graph | before execution | optimization and compilation | less flexible |
| Dynamic graph | during execution | natural control flow | runtime graph overhead |

Modern AD systems often combine both styles. A program may be traced dynamically once, converted into an intermediate representation, optimized, and then compiled.

### Graphs and Control Flow

Conditionals select which operations execute.

```text
if x > 0:
    y = x * x
else:
    y = -x
```

For a concrete value of $x$, only one branch runs. A dynamic graph records the executed branch. A static graph may represent both branches with a control-flow operator.

The derivative is branch-dependent. For $x > 0$,

$$
\frac{dy}{dx} = 2x
$$

For $x < 0$,

$$
\frac{dy}{dx} = -1
$$

At $x = 0$, the classical derivative does not exist because the two one-sided derivatives disagree.

AD differentiates the executed path and applies the derivative rule chosen by the system. It does not automatically reason about all possible branches unless the system explicitly represents symbolic control flow.
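A minimal sketch of how a dynamic-graph system behaves here: only the executed branch is recorded, so the returned derivative is the derivative of that branch:

```python
def value_and_grad(x):
    # only the branch that actually runs is recorded and differentiated
    if x > 0:
        return x * x, 2.0 * x   # dy/dx = 2x on this branch
    return -x, -1.0             # dy/dx = -1 on this branch
```

At $x = 0$ the executed branch is the `else` branch, so this sketch reports $-1$ even though no classical derivative exists there.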

### Graph Size

The size of the computational graph matters.

For reverse mode, the graph or tape must retain enough information to run the backward pass. This may include intermediate values, operation types, shapes, and parent references.

For a neural network with many layers, storing all intermediate activations can dominate memory usage. For long simulations, reverse mode may require storing a large time history.

The main options are:

| Strategy | Idea | Tradeoff |
|---|---|---|
| Store everything | keep all needed intermediates | fast backward, high memory |
| Recompute | discard some values and recompute later | lower memory, more compute |
| Checkpoint | store selected states | balanced memory and compute |
| Stream | process graph in pieces | needs careful scheduling |

Memory management is therefore part of AD design, not an implementation afterthought.
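The checkpoint row can be made concrete on a chain of `sin` applications: store every k-th state in the forward pass, then recompute each segment's discarded intermediates from its checkpoint during the backward pass. A minimal sketch with hypothetical function names:

```python
import math

def forward_with_checkpoints(x, n, k):
    # keep only every k-th intermediate instead of all n of them
    checkpoints = {0: x}
    v = x
    for i in range(1, n + 1):
        v = math.sin(v)
        if i % k == 0:
            checkpoints[i] = v
    return v, checkpoints

def backward_with_recompute(checkpoints, n, k):
    bar = 1.0  # adjoint of the output with respect to itself
    for start in reversed(range(0, n, k)):  # segment starts: 0, k, 2k, ...
        end = min(start + k, n)
        # recompute the discarded intermediates of this segment
        vals = [checkpoints[start]]
        for _ in range(start, end):
            vals.append(math.sin(vals[-1]))
        # reverse through the segment: d sin(v)/dv = cos(v)
        for v in reversed(vals[:-1]):
            bar *= math.cos(v)
    return bar
```

With $n$ steps and spacing $k$, peak storage drops from roughly $O(n)$ to $O(n/k + k)$, at the cost of one extra forward recomputation per segment.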

### Graph Granularity

A graph can represent computation at different granularities.

At fine granularity, each scalar operation is a node:

```text
multiply, add, sin, exp
```

At coarse granularity, each tensor operation is a node:

```text
matmul, convolution, layer_norm, softmax
```

Fine-grained graphs expose many derivative opportunities but carry high overhead. Coarse-grained graphs reduce overhead and map better to optimized kernels.

Deep learning systems usually operate at tensor granularity. Compiler-based AD systems may lower tensor operations into smaller IR operations for optimization.

The choice of granularity affects:

| Concern | Fine-grained graph | Coarse-grained graph |
|---|---|---|
| Overhead | high | lower |
| Optimization visibility | detailed | limited by primitive set |
| Kernel performance | poor unless fused | good |
| Custom derivatives | many small rules | fewer larger rules |
| Debuggability | detailed trace | simpler high-level trace |

### Computational Graphs as AD Infrastructure

A computational graph is not merely a visualization. It is the data structure that makes derivative propagation systematic.

It records:

1. What was computed.
2. Which values depend on which earlier values.
3. Which local derivative rule applies at each operation.
4. In which order derivative information must flow.

Forward mode can operate without storing the whole graph because tangents move with values. Reverse mode usually needs a graph, tape, or transformed program because it must revisit operations backward.

This is the main systems distinction between forward and reverse AD. Forward mode is local in time. Reverse mode needs access to the past.

