Computational Graphs

A computational graph represents a calculation as nodes and edges. Nodes represent operations or values. Edges represent data dependencies. Automatic differentiation uses this graph structure to decide how derivative information should move through a computation.

A simple program,

a = x * y
b = sin(a)
c = b + x

can be seen as the graph

x ─┬─> multiply ─> a ─> sin ─> b ─┐
   │                               ├─> add ─> c
y ─┘                               │
x ─────────────────────────────────┘

The output c depends on x through two paths: directly through the final addition, and indirectly through the multiplication and the sine. Reverse mode must accumulate both contributions.

Nodes and Edges

There are two common graph conventions.

In an operation graph, each operation is a node:

x, y -> multiply -> sin -> add -> c

In a value graph, each intermediate value is a node, and operations label the edges or transitions:

x, y -> a -> b -> c

AD systems often use a mixed representation. A node may store the value produced by an operation, the operation type, references to parent nodes, and enough information to compute local derivatives.

For reverse mode, a node may store:

Field        Meaning
value        primal value computed in the forward pass
parents      input nodes used to produce this value
local rule   how to propagate adjoints to parents
adjoint      accumulated derivative of final output with respect to this value
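
In code, such a node might look like the following minimal sketch. The names (Node, local_grads) are illustrative, not taken from any particular library.

import math

class Node:
    """One value in a reverse-mode graph (illustrative sketch)."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # primal value from the forward pass
        self.parents = parents          # input nodes used to produce this value
        self.local_grads = local_grads  # d(value)/d(parent), one entry per parent
        self.adjoint = 0.0              # accumulated dL/d(value), filled in backward

# Building the example graph a = x*y, b = sin(a), c = b + x by hand:
x = Node(2.0)
y = Node(3.0)
a = Node(x.value * y.value, parents=(x, y), local_grads=(y.value, x.value))
b = Node(math.sin(a.value), parents=(a,), local_grads=(math.cos(a.value),))
c = Node(b.value + x.value, parents=(b, x), local_grads=(1.0, 1.0))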

The adjoint of an intermediate value v is commonly written as

\bar{v} = \frac{\partial L}{\partial v}

where L is the final scalar output, often a loss.

Directed Acyclic Graphs

For straight-line programs, the computational graph is a directed acyclic graph, or DAG. Edges point from inputs to outputs. There are no directed cycles because each intermediate value is computed after its dependencies.

For example,

x -> a -> b -> y

is acyclic.

Loops in source code do not necessarily create cycles in the executed graph. A loop that runs five times can be unrolled into five repeated blocks:

s0 -> s1 -> s2 -> s3 -> s4 -> s5

The executed graph is still acyclic. It may be large, but its nodes have a clear time order.

This distinction is important. AD differentiates the executed computation. A loop gives rise to a sequence of operations, and reverse mode walks that sequence backward.

Topological Order

A topological order lists graph nodes so that every node appears after its dependencies.

Forward evaluation follows topological order. If an operation needs x and y, those values must be available before the operation runs.

Reverse accumulation follows reverse topological order. If an intermediate value contributes to later values, all downstream adjoint contributions must be known before its adjoint can be propagated to its parents.

For the program

a = x * y
b = sin(a)
c = b + x

forward order is

x, y, a, b, c

reverse order is

c, b, a, y, x

More precisely, reverse propagation processes operation nodes backward:

c = b + x
b = sin(a)
a = x * y
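
Using the Node sketch from earlier, both orders fall out of one depth-first traversal from the output: each node is appended only after its parents, so the list is a topological order, and its reverse drives the backward sweep.

def topological_order(output):
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)        # parents are emitted before the node itself
        order.append(node)

    visit(output)
    return order

forward_order = topological_order(c)            # x, y, a, b, c
backward_order = list(reversed(forward_order))  # c, b, a, y, x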

Forward Mode on a Graph

Forward mode attaches a tangent to each value. If v is a primal value, its tangent is written \dot{v}.

For each operation, the AD system computes both the ordinary value and its tangent.

For

a = xy

the tangent rule is

\dot{a} = \dot{x} y + x \dot{y}

For

b = \sin a

the tangent rule is

\dot{b} = \cos(a) \dot{a}

For

c = b + x

the tangent rule is

\dot{c} = \dot{b} + \dot{x}

Thus forward mode computes a Jacobian-vector product by pushing one input perturbation through the graph.
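
These rules can be packaged as a dual-number type that carries a value and a tangent together. The following is a minimal, self-contained sketch; the class name Dual is illustrative.

import math

class Dual:
    """A primal value paired with its tangent (forward-mode sketch)."""

    def __init__(self, value, tangent=0.0):
        self.value, self.tangent = value, tangent

    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

    def __add__(self, other):
        # sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.tangent + other.tangent)

def sin(u):
    # chain rule through sin: derivative is cos(u) * u'
    return Dual(math.sin(u.value), math.cos(u.value) * u.tangent)

x = Dual(2.0, 1.0)   # seed the perturbation: dx/dx = 1
y = Dual(3.0, 0.0)
a = x * y
b = sin(a)
c = b + x
print(c.tangent)     # dc/dx = cos(x*y)*y + 1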

Reverse Mode on a Graph

Reverse mode attaches an adjoint to each value. The adjoint \bar{v} measures how the final output changes when v changes.

For scalar output cc, initialize

\bar{c} = 1

Then process operations backward.

For

c = b + x

we add

\bar{b} \mathrel{+}= \bar{c}

\bar{x} \mathrel{+}= \bar{c}

For

b = \sin a

we add

\bar{a} \mathrel{+}= \cos(a) \bar{b}

For

a = xy

we add

\bar{x} \mathrel{+}= y \bar{a}

\bar{y} \mathrel{+}= x \bar{a}

The final \bar{x} contains both the direct contribution through c = b + x and the indirect contribution through a = xy.
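
The whole backward sweep fits in a few lines of tape-based code. The following self-contained sketch records each operation during the forward pass, then replays the tape in reverse; record and the tape layout are illustrative choices, not a standard API.

import math

tape = []  # (output name, input names, local derivatives), in execution order

def record(output, inputs, local_grads):
    tape.append((output, inputs, local_grads))

# Forward pass for a = x*y, b = sin(a), c = b + x with x = 2, y = 3.
x, y = 2.0, 3.0
a = x * y;       record('a', ['x', 'y'], [y, x])
b = math.sin(a); record('b', ['a'], [math.cos(a)])
c = b + x;       record('c', ['b', 'x'], [1.0, 1.0])

# Backward pass: initialize the output adjoint, then walk the tape backward.
adjoint = {'c': 1.0}
for out, inputs, grads in reversed(tape):
    for name, g in zip(inputs, grads):
        adjoint[name] = adjoint.get(name, 0.0) + g * adjoint[out]

print(adjoint['x'])  # cos(x*y)*y + 1: both paths accumulated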

Accumulation at Shared Nodes

Shared values are common in real programs. A value may be used by several later operations.

u = x * x
v = u + 1
w = sin(u)
y = v * w

Here, u feeds both v and w. Reverse mode must add both contributions to \bar{u}.

The general rule is:

\bar{u} = \sum_{r \in \operatorname{users}(u)} \frac{\partial r}{\partial u}^{T} \bar{r}

where the sum ranges over all immediate downstream users of u.

This summation is the graph form of the multivariable chain rule. It is also a practical implementation detail. Adjoint buffers must support accumulation, not simple assignment.
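
Hand-unrolling the backward pass for the program above makes the accumulation explicit. This sketch uses concrete numbers for illustration:

import math

x = 2.0
u = x * x
v = u + 1.0
w = math.sin(u)
y = v * w

y_bar = 1.0
v_bar = w * y_bar             # from y = v * w
w_bar = v * y_bar             # from y = v * w
u_bar = 0.0
u_bar += 1.0 * v_bar          # contribution through v = u + 1
u_bar += math.cos(u) * w_bar  # contribution through w = sin(u)
x_bar = 2.0 * x * u_bar       # u = x*x has local derivative 2x

print(x_bar)  # dy/dx = 2x*sin(x^2) + 2x*(x^2 + 1)*cos(x^2)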

Static and Dynamic Graphs

Some AD systems build a graph before execution. This is a static graph. The graph can be optimized, compiled, partitioned, and scheduled before values are computed.

Other systems build the graph as the program runs. This is a dynamic graph. The executed operations determine the graph. Dynamic graphs handle ordinary host-language control flow naturally.

Graph style     Construction time   Strength                       Cost
Static graph    before execution    optimization and compilation   less flexible
Dynamic graph   during execution    natural control flow           runtime graph overhead

Modern AD systems often combine both styles. A program may be traced dynamically once, converted into an intermediate representation, optimized, and then compiled.
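
As a concrete example of the hybrid style, JAX traces an ordinary Python function dynamically, transforms the trace, and can then compile it. A minimal sketch:

import jax
import jax.numpy as jnp

def f(x, y):
    a = x * y
    b = jnp.sin(a)
    return b + x

# jax.grad differentiates the traced computation;
# jax.jit compiles the result once for repeated calls.
df_dx = jax.jit(jax.grad(f, argnums=0))
print(df_dx(2.0, 3.0))  # cos(x*y)*y + 1

The trace is built the first time df_dx runs; later calls with the same input shapes reuse the compiled graph.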

Graphs and Control Flow

Conditionals select which operations execute.

if x > 0:
    y = x * x
else:
    y = -x

For a concrete value of x, only one branch runs. A dynamic graph records the executed branch. A static graph may represent both branches with a control-flow operator.

The derivative is branch-dependent. For x > 0,

\frac{dy}{dx} = 2x

For x < 0,

\frac{dy}{dx} = -1

At x = 0, the classical derivative does not exist because the two one-sided derivatives disagree.

AD differentiates the executed path and applies the derivative rule chosen by the system. It does not automatically reason about all possible branches unless the system explicitly represents symbolic control flow.
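
A quick way to see this, again using JAX, is that with concrete inputs jax.grad simply follows the Python branch that executes:

import jax

def f(x):
    if x > 0:
        return x * x
    return -x

print(jax.grad(f)(3.0))   # 6.0, from the x > 0 branch
print(jax.grad(f)(-2.0))  # -1.0, from the other branch

Under jax.jit, tracing is abstract, so a Python if on a traced value fails; a static graph must express both branches with a control-flow operator such as jax.lax.cond.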

Graph Size

The size of the computational graph matters.

For reverse mode, the graph or tape must retain enough information to run the backward pass. This may include intermediate values, operation types, shapes, and parent references.

For a neural network with many layers, storing all intermediate activations can dominate memory usage. For long simulations, reverse mode may require storing a large time history.

The main options are:

Strategy           Idea                                      Tradeoff
Store everything   keep all needed intermediates             fast backward, high memory
Recompute          discard some values and recompute later   lower memory, more compute
Checkpoint         store selected states                     balanced memory and compute
Stream             process graph in pieces                   needs careful scheduling

Memory management is therefore part of AD design, not an implementation afterthought.
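
To make the checkpoint row concrete, here is a minimal sketch for a chain of repeated steps s_{t+1} = f(s_t). The function f = sin is an arbitrary stand-in, and grad_with_checkpoints is an illustrative name.

import math

def f(s):
    return math.sin(s)

def df(s):
    return math.cos(s)

def grad_with_checkpoints(s0, n_steps, every):
    # Forward pass: store only every `every`-th state.
    checkpoints = {0: s0}
    s = s0
    for t in range(n_steps):
        s = f(s)
        if (t + 1) % every == 0:
            checkpoints[t + 1] = s
    # Backward pass: recompute each s_t from the nearest checkpoint,
    # then apply the chain rule through step t.
    adjoint = 1.0
    for t in reversed(range(n_steps)):
        base = (t // every) * every
        s = checkpoints[base]
        for _ in range(t - base):
            s = f(s)
        adjoint *= df(s)
    return adjoint

print(grad_with_checkpoints(0.5, n_steps=100, every=10))

Memory drops from n_steps stored states to roughly n_steps / every, at the cost of recomputing each segment once during the backward pass.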

Graph Granularity

A graph can represent computation at different granularities.

At fine granularity, each scalar operation is a node:

multiply, add, sin, exp

At coarse granularity, each tensor operation is a node:

matmul, convolution, layer_norm, softmax

Fine-grained graphs expose many derivative opportunities but carry high overhead. Coarse-grained graphs reduce overhead and map better to optimized kernels.

Deep learning systems usually operate at tensor granularity. Compiler-based AD systems may lower tensor operations into smaller IR operations for optimization.

The choice of granularity affects:

Concern                   Fine-grained graph   Coarse-grained graph
Overhead                  high                 lower
Optimization visibility   detailed             limited by primitive set
Kernel performance        poor unless fused    good
Custom derivatives        many small rules     fewer larger rules
Debuggability             detailed trace       simpler high-level trace

Computational Graphs as AD Infrastructure

A computational graph is not merely a visualization. It is the data structure that makes derivative propagation systematic.

It records:

  1. What was computed.
  2. Which values depend on which earlier values.
  3. Which local derivative rule applies at each operation.
  4. In which order derivative information must flow.

Forward mode can operate without storing the whole graph because tangents move with values. Reverse mode usually needs a graph, tape, or transformed program because it must revisit operations backward.

This is the main systems distinction between forward and reverse AD. Forward mode is local in time. Reverse mode needs access to the past.