A graph intermediate representation models a program as nodes and edges.
Nodes represent operations or values. Edges represent data dependencies. For automatic differentiation, this form is natural because the chain rule also follows a dependency graph.
A simple function
from math import sin

def f(x):
    y = x * x
    z = sin(y)
    return z

can be represented as:

x -> mul -> y -> sin -> z

The primal graph computes values from inputs to outputs. The derivative graph computes tangent or adjoint values using the same dependency structure.
Dataflow Graphs
A dataflow graph contains operation nodes and value edges.
x0 ----\
        mul -> x1 -> sin -> x2
x0 ----/

Each operation has:
| Field | Meaning |
|---|---|
| op | Operation name, such as mul, sin, matmul |
| inputs | Values consumed by the operation |
| outputs | Values produced by the operation |
| attributes | Static metadata, such as axis, shape, dtype |
| effects | Optional state, mutation, randomness, or I/O behavior |
For AD, every differentiable operation needs a local derivative rule. The graph supplies the wiring.
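As a point of reference, a node with these fields might look like the following Python sketch. The class and field names are illustrative and do not correspond to any particular framework.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                  # operation name, e.g. "mul", "sin", "matmul"
    inputs: list             # names of values consumed by the operation
    outputs: list            # names of values produced by the operation
    attributes: dict = field(default_factory=dict)  # static metadata (axis, shape, dtype)
    effects: tuple = ()      # optional markers for state, mutation, randomness, or I/O

# Primal graph for z = sin(x * x), stored in topological order.
primal_graph = [
    Node("mul", inputs=["x", "x"], outputs=["y"]),
    Node("sin", inputs=["y"], outputs=["z"]),
]
```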
Forward mode adds tangent edges. Reverse mode adds adjoint edges and reverse operations.
Forward Graph Transformation
Forward mode transforms each primal value v into a pair:
(v, dv)

For an operation

z = op(x, y)

the transformed graph contains:

z = op(x, y)
dz = jvp_op(x, y, dx, dy)

For multiplication:

z = x * y
dz = dx * y + x * dy

The transformed graph still flows from inputs to outputs. Its structure is close to the original graph, but with additional tangent computations.
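A hand-applied version of this transformation for the example function, written as a minimal Python sketch using the multiplication and sine JVP rules above:

```python
import math

# Forward-mode sketch: every value is carried as a (primal, tangent) pair.
def mul_jvp(x, dx, y, dy):
    return x * y, dx * y + x * dy

def sin_jvp(x, dx):
    return math.sin(x), math.cos(x) * dx

def f_forward(x, dx):
    y, dy = mul_jvp(x, dx, x, dx)   # y = x * x
    z, dz = sin_jvp(y, dy)          # z = sin(y)
    return z, dz

print(f_forward(1.5, 1.0))          # dz equals 2 * x * cos(x * x)
```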
Reverse Graph Transformation
Reverse mode builds a backward graph.
For each primal value v, reverse mode introduces an adjoint value:
bar_v

For an operation

z = op(x, y)

the backward graph applies a vector-Jacobian product rule:

(bar_x, bar_y) += vjp_op(x, y, bar_z)

For multiplication:

bar_x += y * bar_z
bar_y += x * bar_z

The backward graph runs in reverse topological order. If several downstream paths contribute to the same adjoint, the graph inserts accumulation nodes.

bar_x = contribution_1 + contribution_2 + ...

This accumulation is central. Primal values usually flow forward once. Adjoint values collect contributions from all paths to the output.
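A hand-written sketch of the backward pass for the same function, z = sin(x * x), shows the VJP rules and the explicit accumulation for the two uses of x:

```python
import math

def f_reverse(x):
    # Forward pass, keeping the primal values the VJP rules need.
    y = x * x
    z = math.sin(y)

    # Backward pass in reverse topological order, seeded with bar_z = 1.
    bar_z = 1.0
    bar_y = math.cos(y) * bar_z   # vjp of sin
    bar_x = 0.0
    bar_x += x * bar_y            # contribution through the left factor of mul
    bar_x += x * bar_y            # contribution through the right factor of mul
    return z, bar_x

print(f_reverse(1.5))             # bar_x equals 2 * x * cos(x * x)
```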
Static Graphs
A static graph is built before execution.
The program first constructs a graph object. The runtime then compiles or interprets that graph.
Static graphs are useful because the system can inspect the whole computation before running it. It can optimize the graph, plan memory, fuse kernels, partition work across devices, and generate derivative graphs ahead of time.
A typical static graph pipeline is:
build graph
infer types and shapes
differentiate graph
optimize graph
lower to executable code
run

This model fits tensor programs well. Nodes such as matmul, conv2d, reduce_sum, and softmax are large enough that graph-level optimization matters.
Dynamic Graphs
A dynamic graph is built during execution.
Each operation executed by the program appends a node to a tape or graph. After the forward pass, reverse mode traverses this recorded graph backward.
This is common in eager deep learning systems. The user writes ordinary host-language code, including branches and loops. The graph records what actually happened.
run user code
record executed tensor operations
seed output adjoint
traverse recorded graph backward

Dynamic graphs are flexible. They handle runtime-dependent control flow naturally. Their cost is that optimization sees only the executed trace, often after some runtime overhead has already been paid.
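For instance, in an eager framework such as PyTorch, each tensor operation is appended to the autograd graph as it executes, and backward() then walks that recorded graph in reverse:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.sin(x * x)   # each executed op is recorded on the dynamic graph
y.backward()           # traverse the recorded graph backward from the output
print(x.grad)          # equals 2 * x * cos(x**2)
```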
Graph IR vs SSA IR
Graph IR and SSA IR are closely related.
A straight-line SSA program can be read as a graph:
x1 = mul x0 x0
x2 = sin x1
x3 = add x2 x0

The values x1, x2, and x3 are graph edges. The operations mul, sin, and add are graph nodes.
The difference is emphasis.
| Representation | Emphasis |
|---|---|
| SSA IR | Ordered instructions, control-flow blocks, compiler analysis |
| Graph IR | Dependency structure, tensor operations, scheduling freedom |
Graph IRs work well when the computation is mostly dataflow. SSA IRs work well when general control flow, mutation, and compiler lowering matter.
Many systems combine them. A graph may contain regions with SSA blocks. An SSA compiler may use graph views for optimization and differentiation.
Control Flow in Graph IRs
Pure dataflow graphs have no ordinary branch or loop syntax. Control flow must be represented explicitly.
A conditional may be represented as a special node:
y = if(cond, then_graph, else_graph)

A loop may be represented as:

y = while(init_state, cond_graph, body_graph)

These higher-order graph nodes contain subgraphs.
For AD, the derivative of an if node differentiates only the selected branch. The derivative of a while node must account for all executed iterations.
Reverse mode through loops usually requires either saved loop states or recomputation.
forward while:
run body repeatedly
save required states
reverse while:
run reverse body in opposite iteration order

A graph IR must specify how subgraphs capture values, return values, and interact with side effects.
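A minimal sketch of this save-and-reverse pattern for a loop whose body is x -> x * w; the body and its VJP rules are illustrative placeholders:

```python
def loop_forward(x, w, n):
    saved = []
    for _ in range(n):
        saved.append(x)          # save the state the backward body will need
        x = x * w
    return x, saved

def loop_backward(bar_y, w, saved):
    bar_x, bar_w = bar_y, 0.0
    for x in reversed(saved):    # run the reverse body in opposite iteration order
        bar_w += x * bar_x       # vjp of x * w with respect to w
        bar_x = w * bar_x        # vjp of x * w with respect to x
    return bar_x, bar_w

y, saved = loop_forward(2.0, 3.0, 4)
print(loop_backward(1.0, 3.0, saved))   # (w**n, n * x * w**(n-1)) = (81.0, 216.0)
```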
Shape and Type Inference
Graph IRs often represent tensor programs, so shape and type inference are core compiler passes.
For example:
C = matmul(A, B)

requires:
A: tensor<f32, [m, k]>
B: tensor<f32, [k, n]>
C: tensor<f32, [m, n]>

The reverse rules are:
bar_A = matmul(bar_C, transpose(B))
bar_B = matmul(transpose(A), bar_C)

Without shape information, the compiler cannot validate these rules or allocate buffers.
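A small NumPy check of these reverse rules and the shapes they require:

```python
import numpy as np

m, k, n = 3, 4, 5
A = np.random.randn(m, k)
B = np.random.randn(k, n)
C = A @ B                   # C: [m, n]

bar_C = np.ones((m, n))     # incoming adjoint, same shape as C
bar_A = bar_C @ B.T         # matmul(bar_C, transpose(B)) -> [m, k]
bar_B = A.T @ bar_C         # matmul(transpose(A), bar_C) -> [k, n]

assert bar_A.shape == A.shape and bar_B.shape == B.shape
```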
Shape information is also needed for broadcasting. If an input was broadcast in the forward pass, its adjoint must be reduced back to the original shape.
bar_x = reduce_sum_to_shape(bar_z, shape(x))

Graph Optimization
Graph IRs are convenient for whole-computation optimization.
Common graph optimizations include:
| Optimization | Example |
|---|---|
| Constant folding | Precompute static expressions |
| Dead node elimination | Remove unused operations |
| Algebraic simplification | Replace x * 1 with x |
| Operator fusion | Fuse elementwise chains |
| Layout propagation | Choose memory layouts consistently |
| Common subgraph elimination | Reuse repeated computations |
| Rematerialization | Recompute instead of saving |
| Device placement | Assign nodes to CPU, GPU, TPU, or remote workers |
| Graph partitioning | Split graph across devices or processes |
AD interacts with these passes. The compiler may optimize before differentiation, after differentiation, or both.
Optimizing before AD can simplify the primal graph. Optimizing after AD can remove redundant derivative work. Some transformations are best applied jointly because they affect saved activations and backward memory.
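As a small illustration of one such pass, dead node elimination over a topologically ordered node list is a single backward sweep that keeps only operations whose outputs are still live. This is a sketch, not any framework's implementation:

```python
def eliminate_dead_nodes(nodes, live_outputs):
    # nodes are (op, inputs, outputs) tuples in topological order
    live = set(live_outputs)
    kept = []
    for op, inputs, outputs in reversed(nodes):
        if any(o in live for o in outputs):
            live.update(inputs)          # inputs of a live node become live too
            kept.append((op, inputs, outputs))
    return list(reversed(kept))

graph = [
    ("mul", ["x", "x"], ["y"]),
    ("sin", ["y"], ["z"]),
    ("exp", ["x"], ["w"]),               # w never reaches the output
]
print(eliminate_dead_nodes(graph, ["z"]))   # the exp node is removed
```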
Saved Values and Graph Lifetimes
Reverse mode needs primal values during the backward pass.
A graph IR can make saved values explicit:
save x
save y

or attach saved values to backward functions:
z = op(x, y)
backward(bar_z) uses x, y

The compiler then performs liveness analysis.
It decides whether to:
| Strategy | Meaning |
|---|---|
| Save | Store the value from the forward pass |
| Recompute | Recreate the value during backward |
| Checkpoint | Save selected states and recompute between them |
| Discard | Remove values unnecessary for backward |
This is one of the main performance problems in reverse-mode AD. Graph IRs make the problem visible enough for systematic optimization.
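A minimal sketch of the checkpoint strategy for a chain of identical steps: save every k-th state, then recompute each segment from its checkpoint when the backward pass reaches it. The step function and its VJP here are placeholders.

```python
import math

def step(x):
    return math.sin(x) + x               # placeholder primal step

def step_vjp(x, bar_y):
    return (math.cos(x) + 1.0) * bar_y   # local derivative of the step

def forward_with_checkpoints(x, n, k):
    checkpoints = {}
    for i in range(n):
        if i % k == 0:
            checkpoints[i] = x           # save selected states only
        x = step(x)
    return x, checkpoints

def backward_with_recompute(bar_y, n, k, checkpoints):
    for seg_start in range(((n - 1) // k) * k, -1, -k):
        seg_end = min(seg_start + k, n)
        xs, x = [], checkpoints[seg_start]
        for _ in range(seg_start, seg_end):
            xs.append(x)                 # recompute the discarded states in this segment
            x = step(x)
        for x in reversed(xs):           # then run the backward body over the segment
            bar_y = step_vjp(x, bar_y)
    return bar_y

y, cps = forward_with_checkpoints(0.3, n=8, k=3)
print(backward_with_recompute(1.0, n=8, k=3, checkpoints=cps))
```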
Effects in Graphs
A mathematical graph is pure. A program graph may have effects.
Effects include mutation, randomness, I/O, communication, and synchronization.
For AD, effects are dangerous when they make the backward pass inconsistent with the forward pass.
For example, random sampling requires a clear policy:
same random value in backward
new random value in backward
reparameterized differentiable sample
score-function estimator
nondifferentiable boundary

A graph IR should record effect dependencies explicitly. Otherwise the optimizer may reorder operations incorrectly.
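A minimal sketch of the first policy, reusing the same random value in backward, for a dropout-like operation: the sampled mask is treated as a saved value so the backward pass sees exactly the randomness the forward pass used.

```python
import numpy as np

def dropout_forward(x, p, rng):
    mask = (rng.random(x.shape) >= p) / (1.0 - p)   # sample once, save for backward
    return x * mask, mask

def dropout_backward(bar_y, mask):
    return bar_y * mask                             # reuse the saved mask; never resample

rng = np.random.default_rng(0)
y, mask = dropout_forward(np.ones(4), 0.5, rng)
bar_x = dropout_backward(np.ones(4), mask)
```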
For distributed graphs, communication nodes such as all_reduce, send, and recv also need derivative rules. The adjoint of communication is another communication pattern, often in the opposite direction or with reduction.
Primitive Boundaries
Graph IRs usually treat complex operations as primitives.
A softmax node may remain high-level:
p = softmax(x)or it may lower to:
m = reduce_max(x)
e = exp(x - m)
s = reduce_sum(e)
p = e / s

Keeping it high-level allows a stable and efficient derivative rule. Lowering it exposes more operations to generic optimization.
The choice matters.
High-level primitives preserve mathematical intent. Low-level primitives expose implementation details. A good compiler keeps high-level structure long enough to differentiate well, then lowers when code generation needs details.
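Softmax illustrates the trade-off. If it stays a primitive with p = softmax(x), its reverse rule can be written directly as bar_x = p * (bar_p - sum(p * bar_p)), which is both compact and numerically stable. A NumPy sketch over the last axis:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))   # the lowered form shown above
    return e / np.sum(e, axis=-1, keepdims=True)

def softmax_vjp(p, bar_p):
    # high-level reverse rule for p = softmax(x)
    return p * (bar_p - np.sum(p * bar_p, axis=-1, keepdims=True))
```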
Graph IR in Deep Learning
Deep learning frameworks made graph IRs widely visible.
A model is represented as a graph of tensor operations. The loss is a scalar output. Reverse-mode AD constructs a backward graph that computes gradients of parameters.
For a training step, the graph may include:
forward model
loss computation
backward gradient graph
optimizer update

In static graph systems, this full training graph may be compiled. In dynamic systems, the forward graph is recorded for each step and then replayed backward.
Modern systems often blend both approaches. They execute eagerly for usability, then capture graphs for optimization.
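One way this blend looks in practice, sketched here with JAX: the loss is ordinary Python, jax.grad builds the gradient computation, and jax.jit captures and compiles the resulting graph. The least-squares loss is only an illustration.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

grad_step = jax.jit(jax.grad(loss))   # differentiate, then compile the whole graph

w = jnp.zeros(3)
x = jnp.ones((4, 3))
y = jnp.ones(4)
print(grad_step(w, x, y))
```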
Graph IR in Scientific Computing
Scientific computing uses graph IRs when computations are staged, optimized, or distributed.
Examples include:
finite element assembly
PDE solvers
linear algebra pipelines
differentiable rendering
inverse problems
probabilistic programs

In these settings, graph nodes may be larger than tensor kernels. A node might represent a sparse solve, a mesh operation, or a simulation step.
The AD rule for such a node may use implicit differentiation instead of differentiating all internal operations. This can be far more efficient and numerically stable.
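For example, the adjoint of a linear solve x = solve(A, b) can itself be written as a solve against the transposed system, rather than by differentiating the solver's internal iterations. A NumPy sketch of this standard rule:

```python
import numpy as np

def solve_vjp(A, x, bar_x):
    # For x = solve(A, b):  bar_b = solve(A^T, bar_x),  bar_A = -outer(bar_b, x)
    bar_b = np.linalg.solve(A.T, bar_x)
    bar_A = -np.outer(bar_b, x)
    return bar_A, bar_b

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
x = np.linalg.solve(A, b)
print(solve_vjp(A, x, np.ones(2)))
```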
Design Criteria
A graph IR for AD should answer these questions:
What are the primitive operations?
What values are differentiable?
How are types and shapes represented?
How is control flow represented?
How are effects represented?
How are saved values represented?
How are adjoints accumulated?
How are custom derivative rules attached?
How does the graph lower to executable code?

A graph that cannot answer these questions precisely may be useful for simple models but fragile for general AD.
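How custom rules are attached is framework-specific. As one concrete example, JAX lets a function register its own VJP, which the system then treats as that operation's derivative rule:

```python
import jax
import jax.numpy as jnp

@jax.custom_vjp
def log1pexp(x):
    return jnp.log1p(jnp.exp(x))

def log1pexp_fwd(x):
    return log1pexp(x), x                    # save x for the backward rule

def log1pexp_bwd(x, bar_y):
    return (bar_y * jax.nn.sigmoid(x),)      # d/dx log(1 + e^x) = sigmoid(x)

log1pexp.defvjp(log1pexp_fwd, log1pexp_bwd)

print(jax.grad(log1pexp)(2.0))
```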
Summary
Graph IRs represent programs by their dependency structure. This makes them a natural substrate for automatic differentiation.
Forward mode adds tangent edges and JVP nodes. Reverse mode adds adjoint edges, VJP nodes, and accumulation nodes. Static graphs enable ahead-of-time optimization. Dynamic graphs support ordinary runtime control flow.
The main challenge is not drawing the graph. The challenge is giving the graph precise semantics for control flow, shape, effects, saved values, memory, and lowering. A graph IR is effective for AD only when those semantics are explicit.