Computational Graphs

A computational graph represents a calculation as nodes and edges. Nodes represent operations or values. Edges represent data dependencies. Automatic differentiation uses this graph structure to decide how derivative information should move through a computation.

A simple program,

a = x * y
b = sin(a)
c = b + x

can be seen as the graph

x ─┬─> multiply ─> a ─> sin ─> b ─┐
   │                               ├─> add ─> c
y ─┘                               │
x ─────────────────────────────────┘

The output c depends on x through two paths: directly through the final addition, and indirectly through the multiplication and the sine. Reverse mode must accumulate both contributions.

Nodes and Edges

There are two common graph conventions.

In an operation graph, each operation is a node:

x, y -> multiply -> sin -> add -> c

In a value graph, each intermediate value is a node, and operations label the edges or transitions:

x, y -> a -> b -> c

AD systems often use a mixed representation. A node may store the value produced by an operation, the operation type, references to parent nodes, and enough information to compute local derivatives.

For reverse mode, a node may store:

Field        Meaning
value        primal value computed in the forward pass
parents      input nodes used to produce this value
local rule   how to propagate adjoints to parents
adjoint      accumulated derivative of final output with respect to this value
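
In code, such a node might look like the following minimal sketch. The names (Node, local_grads) are illustrative, not taken from any particular library.

import math

class Node:
    """One value in a reverse-mode graph (illustrative sketch)."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # primal value from the forward pass
        self.parents = parents          # input nodes used to produce this value
        self.local_grads = local_grads  # d(value)/d(parent), one entry per parent
        self.adjoint = 0.0              # accumulated dL/d(value), filled in backward

# Building the example graph a = x*y, b = sin(a), c = b + x by hand:
x = Node(2.0)
y = Node(3.0)
a = Node(x.value * y.value, parents=(x, y), local_grads=(y.value, x.value))
b = Node(math.sin(a.value), parents=(a,), local_grads=(math.cos(a.value),))
c = Node(b.value + x.value, parents=(b, x), local_grads=(1.0, 1.0))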

The adjoint of an intermediate value v is commonly written as

\bar{v} = \frac{\partial L}{\partial v}

where L is the final scalar output, often a loss.

Directed Acyclic Graphs

For straight-line programs, the computational graph is a directed acyclic graph, or DAG. Edges point from inputs to outputs. There are no directed cycles because each intermediate value is computed after its dependencies.

For example,

x -> a -> b -> y

is acyclic.

Loops in source code do not necessarily create cycles in the executed graph. A loop that runs five times can be unrolled into five repeated blocks:

s0 -> s1 -> s2 -> s3 -> s4 -> s5

The executed graph is still acyclic. It may be large, but its nodes have a clear time order.

This distinction is important. AD differentiates the executed computation. A loop gives rise to a sequence of operations, and reverse mode walks that sequence backward.

Topological Order

A topological order lists graph nodes so that every node appears after its dependencies.

Forward evaluation follows topological order. If an operation needs x and y, those values must be available before the operation runs.

Reverse accumulation follows reverse topological order. If an intermediate value contributes to later values, all downstream adjoint contributions must be known before its adjoint can be propagated to its parents.

For the program

a = x * y
b = sin(a)
c = b + x

forward order is

x, y, a, b, c

reverse order is

c, b, a, y, x

More precisely, reverse propagation processes operation nodes backward:

c = b + x
b = sin(a)
a = x * y
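
Using the Node sketch from earlier, both orders fall out of one depth-first traversal from the output: each node is appended only after its parents, so the list is a topological order, and its reverse drives the backward sweep.

def topological_order(output):
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)        # parents are emitted before the node itself
        order.append(node)

    visit(output)
    return order

forward_order = topological_order(c)            # x, y, a, b, c
backward_order = list(reversed(forward_order))  # c, b, a, y, x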

Forward Mode on a Graph

Forward mode attaches a tangent to each value. If v is a primal value, its tangent is written \dot{v}.

For each operation, the AD system computes both the ordinary value and its tangent.

For

a = xy

the tangent rule is

\dot{a} = \dot{x} y + x \dot{y}

For

b = \sin a

the tangent rule is

\dot{b} = \cos(a) \dot{a}

For

c = b + x

the tangent rule is

\dot{c} = \dot{b} + \dot{x}

Thus forward mode computes a Jacobian-vector product by pushing one input perturbation through the graph.
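
These rules can be packaged as a dual-number type that carries a value and a tangent together. The following is a minimal, self-contained sketch; the class name Dual is illustrative.

import math

class Dual:
    """A primal value paired with its tangent (forward-mode sketch)."""

    def __init__(self, value, tangent=0.0):
        self.value, self.tangent = value, tangent

    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

    def __add__(self, other):
        # sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.tangent + other.tangent)

def sin(u):
    # chain rule through sin: derivative is cos(u) * u'
    return Dual(math.sin(u.value), math.cos(u.value) * u.tangent)

x = Dual(2.0, 1.0)   # seed the perturbation: dx/dx = 1
y = Dual(3.0, 0.0)
a = x * y
b = sin(a)
c = b + x
print(c.tangent)     # dc/dx = cos(x*y)*y + 1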

Reverse Mode on a Graph

Reverse mode attaches an adjoint to each value. The adjoint \bar{v} measures how the final output changes when v changes.

For scalar output cc, initialize

\bar{c} = 1

Then process operations backward.

For

c = b + x

we add

\bar{b} \mathrel{+}= \bar{c}

\bar{x} \mathrel{+}= \bar{c}

For

b = \sin a

we add

\bar{a} \mathrel{+}= \cos(a) \bar{b}

For

a = xy

we add

\bar{x} \mathrel{+}= y \bar{a}

\bar{y} \mathrel{+}= x \bar{a}

The final \bar{x} contains both the direct contribution through c = b + x and the indirect contribution through a = xy.
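
The whole backward sweep fits in a few lines of tape-based code. The following self-contained sketch records each operation during the forward pass, then replays the tape in reverse; record and the tape layout are illustrative choices, not a standard API.

import math

tape = []  # (output name, input names, local derivatives), in execution order

def record(output, inputs, local_grads):
    tape.append((output, inputs, local_grads))

# Forward pass for a = x*y, b = sin(a), c = b + x with x = 2, y = 3.
x, y = 2.0, 3.0
a = x * y;       record('a', ['x', 'y'], [y, x])
b = math.sin(a); record('b', ['a'], [math.cos(a)])
c = b + x;       record('c', ['b', 'x'], [1.0, 1.0])

# Backward pass: initialize the output adjoint, then walk the tape backward.
adjoint = {'c': 1.0}
for out, inputs, grads in reversed(tape):
    for name, g in zip(inputs, grads):
        adjoint[name] = adjoint.get(name, 0.0) + g * adjoint[out]

print(adjoint['x'])  # cos(x*y)*y + 1: both paths accumulated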

Accumulation at Shared Nodes

Shared values are common in real programs. A value may be used by several later operations.

u = x * x
v = u + 1
w = sin(u)
y = v * w

Here, u feeds both v and w. Reverse mode must add both contributions to \bar{u}.

The general rule is:

\bar{u} = \sum_{r \in \operatorname{users}(u)} \frac{\partial r}{\partial u}^{T} \bar{r}

where the sum ranges over all immediate downstream users of u.

This summation is the graph form of the multivariable chain rule. It is also a practical implementation detail. Adjoint buffers must support accumulation, not simple assignment.
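
Hand-unrolling the backward pass for the program above makes the accumulation explicit. This sketch uses concrete numbers for illustration:

import math

x = 2.0
u = x * x
v = u + 1.0
w = math.sin(u)
y = v * w

y_bar = 1.0
v_bar = w * y_bar             # from y = v * w
w_bar = v * y_bar             # from y = v * w
u_bar = 0.0
u_bar += 1.0 * v_bar          # contribution through v = u + 1
u_bar += math.cos(u) * w_bar  # contribution through w = sin(u)
x_bar = 2.0 * x * u_bar       # u = x*x has local derivative 2x

print(x_bar)  # dy/dx = 2x*sin(x^2) + 2x*(x^2 + 1)*cos(x^2)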

Static and Dynamic Graphs

Some AD systems build a graph before execution. This is a static graph. The graph can be optimized, compiled, partitioned, and scheduled before values are computed.

Other systems build the graph as the program runs. This is a dynamic graph. The executed operations determine the graph. Dynamic graphs handle ordinary host-language control flow naturally.

Graph style     Construction time   Strength                       Cost
Static graph    before execution    optimization and compilation   less flexible
Dynamic graph   during execution    natural control flow           runtime graph overhead

Modern AD systems often combine both styles. A program may be traced dynamically once, converted into an intermediate representation, optimized, and then compiled.
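
As a concrete example of the hybrid style, JAX traces an ordinary Python function dynamically, transforms the trace, and can then compile it. A minimal sketch:

import jax
import jax.numpy as jnp

def f(x, y):
    a = x * y
    b = jnp.sin(a)
    return b + x

# jax.grad differentiates the traced computation;
# jax.jit compiles the result once for repeated calls.
df_dx = jax.jit(jax.grad(f, argnums=0))
print(df_dx(2.0, 3.0))  # cos(x*y)*y + 1

The trace is built the first time df_dx runs; later calls with the same input shapes reuse the compiled graph.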

Graphs and Control Flow

Conditionals select which operations execute.

if x > 0:
    y = x * x
else:
    y = -x

For a concrete value of x, only one branch runs. A dynamic graph records the executed branch. A static graph may represent both branches with a control-flow operator.

The derivative is branch-dependent. For x > 0,

\frac{dy}{dx} = 2x

For x < 0,

\frac{dy}{dx} = -1

At x = 0, the classical derivative does not exist because the two one-sided derivatives disagree.

AD differentiates the executed path and applies the derivative rule chosen by the system. It does not automatically reason about all possible branches unless the system explicitly represents symbolic control flow.
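
A quick way to see this, again using JAX, is that with concrete inputs jax.grad simply follows the Python branch that executes:

import jax

def f(x):
    if x > 0:
        return x * x
    return -x

print(jax.grad(f)(3.0))   # 6.0, from the x > 0 branch
print(jax.grad(f)(-2.0))  # -1.0, from the other branch

Under jax.jit, tracing is abstract, so a Python if on a traced value fails; a static graph must express both branches with a control-flow operator such as jax.lax.cond.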

Graph Size

The size of the computational graph matters.

For reverse mode, the graph or tape must retain enough information to run the backward pass. This may include intermediate values, operation types, shapes, and parent references.

For a neural network with many layers, storing all intermediate activations can dominate memory usage. For long simulations, reverse mode may require storing a large time history.

The main options are:

Strategy           Idea                                      Tradeoff
Store everything   keep all needed intermediates             fast backward, high memory
Recompute          discard some values and recompute later   lower memory, more compute
Checkpoint         store selected states                     balanced memory and compute
Stream             process graph in pieces                   needs careful scheduling

Memory management is therefore part of AD design, not an implementation afterthought.
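
To make the checkpoint row concrete, here is a minimal sketch for a chain of repeated steps s_{t+1} = f(s_t). The function f = sin is an arbitrary stand-in, and grad_with_checkpoints is an illustrative name.

import math

def f(s):
    return math.sin(s)

def df(s):
    return math.cos(s)

def grad_with_checkpoints(s0, n_steps, every):
    # Forward pass: store only every `every`-th state.
    checkpoints = {0: s0}
    s = s0
    for t in range(n_steps):
        s = f(s)
        if (t + 1) % every == 0:
            checkpoints[t + 1] = s
    # Backward pass: recompute each s_t from the nearest checkpoint,
    # then apply the chain rule through step t.
    adjoint = 1.0
    for t in reversed(range(n_steps)):
        base = (t // every) * every
        s = checkpoints[base]
        for _ in range(t - base):
            s = f(s)
        adjoint *= df(s)
    return adjoint

print(grad_with_checkpoints(0.5, n_steps=100, every=10))

Memory drops from n_steps stored states to roughly n_steps / every, at the cost of recomputing each segment once during the backward pass.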

Graph Granularity

A graph can represent computation at different granularities.

At fine granularity, each scalar operation is a node:

multiply, add, sin, exp

At coarse granularity, each tensor operation is a node:

matmul, convolution, layer_norm, softmax

Fine-grained graphs expose many derivative opportunities but carry high overhead. Coarse-grained graphs reduce overhead and map better to optimized kernels.

Deep learning systems usually operate at tensor granularity. Compiler-based AD systems may lower tensor operations into smaller IR operations for optimization.

The choice of granularity affects:

Concern                   Fine-grained graph   Coarse-grained graph
Overhead                  high                 lower
Optimization visibility   detailed             limited by primitive set
Kernel performance        poor unless fused    good
Custom derivatives        many small rules     fewer larger rules
Debuggability             detailed trace       simpler high-level trace

Computational Graphs as AD Infrastructure

A computational graph is not merely a visualization. It is the data structure that makes derivative propagation systematic.

It records:

  1. What was computed.
  2. Which values depend on which earlier values.
  3. Which local derivative rule applies at each operation.
  4. In which order derivative information must flow.

Forward mode can operate without storing the whole graph because tangents move with values. Reverse mode usually needs a graph, tape, or transformed program because it must revisit operations backward.

This is the main systems distinction between forward and reverse AD. Forward mode is local in time. Reverse mode needs access to the past.