A graph intermediate representation models a program as nodes and edges.
Nodes represent operations or values. Edges represent data dependencies. For automatic differentiation, this form is natural because the chain rule also follows a dependency graph.
A simple function
from math import sin

def f(x):
    y = x * x
    z = sin(y)
    return z

can be represented as:

x -> mul -> y -> sin -> z

The primal graph computes values from inputs to outputs. The derivative graph computes tangent or adjoint values using the same dependency structure.
Dataflow Graphs
A dataflow graph contains operation nodes and value edges.
x0 ----\
        mul -> x1 -> sin -> x2
x0 ----/

Each operation has:
| Field | Meaning |
|---|---|
| op | Operation name, such as mul, sin, matmul |
| inputs | Values consumed by the operation |
| outputs | Values produced by the operation |
| attributes | Static metadata, such as axis, shape, dtype |
| effects | Optional state, mutation, randomness, or I/O behavior |
For AD, every differentiable operation needs a local derivative rule. The graph supplies the wiring.
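As a point of reference, a node with these fields might look like the following Python sketch. The class and field names are illustrative and do not correspond to any particular framework.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                  # operation name, e.g. "mul", "sin", "matmul"
    inputs: list             # names of values consumed by the operation
    outputs: list            # names of values produced by the operation
    attributes: dict = field(default_factory=dict)  # static metadata (axis, shape, dtype)
    effects: tuple = ()      # optional markers for state, mutation, randomness, or I/O

# Primal graph for z = sin(x * x), stored in topological order.
primal_graph = [
    Node("mul", inputs=["x", "x"], outputs=["y"]),
    Node("sin", inputs=["y"], outputs=["z"]),
]
```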
Forward mode adds tangent edges. Reverse mode adds adjoint edges and reverse operations.
Forward Graph Transformation
Forward mode transforms each primal value v into a pair:
(v, dv)

For an operation

z = op(x, y)

the transformed graph contains:

z = op(x, y)
dz = jvp_op(x, y, dx, dy)

For multiplication:

z = x * y
dz = dx * y + x * dy

The transformed graph still flows from inputs to outputs. Its structure is close to the original graph, but with additional tangent computations.
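A hand-applied version of this transformation for the example function, written as a minimal Python sketch using the multiplication and sine JVP rules above:

```python
import math

# Forward-mode sketch: every value is carried as a (primal, tangent) pair.
def mul_jvp(x, dx, y, dy):
    return x * y, dx * y + x * dy

def sin_jvp(x, dx):
    return math.sin(x), math.cos(x) * dx

def f_forward(x, dx):
    y, dy = mul_jvp(x, dx, x, dx)   # y = x * x
    z, dz = sin_jvp(y, dy)          # z = sin(y)
    return z, dz

print(f_forward(1.5, 1.0))          # dz equals 2 * x * cos(x * x)
```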
Reverse Graph Transformation
Reverse mode builds a backward graph.
For each primal value v, reverse mode introduces an adjoint value:
bar_v

For an operation

z = op(x, y)

the backward graph applies a vector-Jacobian product rule:

(bar_x, bar_y) += vjp_op(x, y, bar_z)

For multiplication:

bar_x += y * bar_z
bar_y += x * bar_z

The backward graph runs in reverse topological order. If several downstream paths contribute to the same adjoint, the graph inserts accumulation nodes.

bar_x = contribution_1 + contribution_2 + ...

This accumulation is central. Primal values usually flow forward once. Adjoint values collect contributions from all paths to the output.
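A hand-written sketch of the backward pass for the same function, z = sin(x * x), shows the VJP rules and the explicit accumulation for the two uses of x:

```python
import math

def f_reverse(x):
    # Forward pass, keeping the primal values the VJP rules need.
    y = x * x
    z = math.sin(y)

    # Backward pass in reverse topological order, seeded with bar_z = 1.
    bar_z = 1.0
    bar_y = math.cos(y) * bar_z   # vjp of sin
    bar_x = 0.0
    bar_x += x * bar_y            # contribution through the left factor of mul
    bar_x += x * bar_y            # contribution through the right factor of mul
    return z, bar_x

print(f_reverse(1.5))             # bar_x equals 2 * x * cos(x * x)
```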
Static Graphs
A static graph is built before execution.
The program first constructs a graph object. The runtime then compiles or interprets that graph.
Static graphs are useful because the system can inspect the whole computation before running it. It can optimize the graph, plan memory, fuse kernels, partition work across devices, and generate derivative graphs ahead of time.
A typical static graph pipeline is:
build graph
infer types and shapes
differentiate graph
optimize graph
lower to executable code
run

This model fits tensor programs well. Nodes such as matmul, conv2d, reduce_sum, and softmax are large enough that graph-level optimization matters.
Dynamic Graphs
A dynamic graph is built during execution.
Each operation executed by the program appends a node to a tape or graph. After the forward pass, reverse mode traverses this recorded graph backward.
This is common in eager deep learning systems. The user writes ordinary host-language code, including branches and loops. The graph records what actually happened.
run user code
record executed tensor operations
seed output adjoint
traverse recorded graph backward

Dynamic graphs are flexible. They handle runtime-dependent control flow naturally. Their cost is that optimization sees only the executed trace, often after some runtime overhead has already been paid.
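For instance, in an eager framework such as PyTorch, each tensor operation is appended to the autograd graph as it executes, and backward() then walks that recorded graph in reverse:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.sin(x * x)   # each executed op is recorded on the dynamic graph
y.backward()           # traverse the recorded graph backward from the output
print(x.grad)          # equals 2 * x * cos(x**2)
```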
Graph IR vs SSA IR
Graph IR and SSA IR are closely related.
A straight-line SSA program can be read as a graph:
x1 = mul x0 x0
x2 = sin x1
x3 = add x2 x0

The values x1, x2, and x3 are graph edges. The operations mul, sin, and add are graph nodes.
The difference is emphasis.
| Representation | Emphasis |
|---|---|
| SSA IR | Ordered instructions, control-flow blocks, compiler analysis |
| Graph IR | Dependency structure, tensor operations, scheduling freedom |
Graph IRs work well when the computation is mostly dataflow. SSA IRs work well when general control flow, mutation, and compiler lowering matter.
Many systems combine them. A graph may contain regions with SSA blocks. An SSA compiler may use graph views for optimization and differentiation.
Control Flow in Graph IRs
Pure dataflow graphs have no ordinary branch or loop syntax. Control flow must be represented explicitly.
A conditional may be represented as a special node:
y = if(cond, then_graph, else_graph)

A loop may be represented as:

y = while(init_state, cond_graph, body_graph)

These higher-order graph nodes contain subgraphs.
For AD, the derivative of an if node differentiates only the selected branch. The derivative of a while node must account for all executed iterations.
Reverse mode through loops usually requires either saved loop states or recomputation.
forward while:
run body repeatedly
save required states
reverse while:
run reverse body in opposite iteration order

A graph IR must specify how subgraphs capture values, return values, and interact with side effects.
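A minimal sketch of this save-and-reverse pattern for a loop whose body is x -> x * w; the body and its VJP rules are illustrative placeholders:

```python
def loop_forward(x, w, n):
    saved = []
    for _ in range(n):
        saved.append(x)          # save the state the backward body will need
        x = x * w
    return x, saved

def loop_backward(bar_y, w, saved):
    bar_x, bar_w = bar_y, 0.0
    for x in reversed(saved):    # run the reverse body in opposite iteration order
        bar_w += x * bar_x       # vjp of x * w with respect to w
        bar_x = w * bar_x        # vjp of x * w with respect to x
    return bar_x, bar_w

y, saved = loop_forward(2.0, 3.0, 4)
print(loop_backward(1.0, 3.0, saved))   # (w**n, n * x * w**(n-1)) = (81.0, 216.0)
```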
Shape and Type Inference
Graph IRs often represent tensor programs, so shape and type inference are core compiler passes.
For example:
C = matmul(A, B)

requires:
A: tensor<f32, [m, k]>
B: tensor<f32, [k, n]>
C: tensor<f32, [m, n]>

The reverse rules are:
bar_A = matmul(bar_C, transpose(B))
bar_B = matmul(transpose(A), bar_C)

Without shape information, the compiler cannot validate these rules or allocate buffers.
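A small NumPy check of these reverse rules and the shapes they require:

```python
import numpy as np

m, k, n = 3, 4, 5
A = np.random.randn(m, k)
B = np.random.randn(k, n)
C = A @ B                   # C: [m, n]

bar_C = np.ones((m, n))     # incoming adjoint, same shape as C
bar_A = bar_C @ B.T         # matmul(bar_C, transpose(B)) -> [m, k]
bar_B = A.T @ bar_C         # matmul(transpose(A), bar_C) -> [k, n]

assert bar_A.shape == A.shape and bar_B.shape == B.shape
```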
Shape information is also needed for broadcasting. If an input was broadcast in the forward pass, its adjoint must be reduced back to the original shape.
bar_x = reduce_sum_to_shape(bar_z, shape(x))

Graph Optimization
Graph IRs are convenient for whole-computation optimization.
Common graph optimizations include:
| Optimization | Example |
|---|---|
| Constant folding | Precompute static expressions |
| Dead node elimination | Remove unused operations |
| Algebraic simplification | Replace x * 1 with x |
| Operator fusion | Fuse elementwise chains |
| Layout propagation | Choose memory layouts consistently |
| Common subgraph elimination | Reuse repeated computations |
| Rematerialization | Recompute instead of saving |
| Device placement | Assign nodes to CPU, GPU, TPU, or remote workers |
| Graph partitioning | Split graph across devices or processes |
AD interacts with these passes. The compiler may optimize before differentiation, after differentiation, or both.
Optimizing before AD can simplify the primal graph. Optimizing after AD can remove redundant derivative work. Some transformations are best applied jointly because they affect saved activations and backward memory.
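As a small illustration of one such pass, dead node elimination over a topologically ordered node list is a single backward sweep that keeps only operations whose outputs are still live. This is a sketch, not any framework's implementation:

```python
def eliminate_dead_nodes(nodes, live_outputs):
    # nodes are (op, inputs, outputs) tuples in topological order
    live = set(live_outputs)
    kept = []
    for op, inputs, outputs in reversed(nodes):
        if any(o in live for o in outputs):
            live.update(inputs)          # inputs of a live node become live too
            kept.append((op, inputs, outputs))
    return list(reversed(kept))

graph = [
    ("mul", ["x", "x"], ["y"]),
    ("sin", ["y"], ["z"]),
    ("exp", ["x"], ["w"]),               # w never reaches the output
]
print(eliminate_dead_nodes(graph, ["z"]))   # the exp node is removed
```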
Saved Values and Graph Lifetimes
Reverse mode needs primal values during the backward pass.
A graph IR can make saved values explicit:
save x
save y

or attach saved values to backward functions:
z = op(x, y)
backward(bar_z) uses x, y

The compiler then performs liveness analysis.
It decides whether to:
| Strategy | Meaning |
|---|---|
| Save | Store the value from the forward pass |
| Recompute | Recreate the value during backward |
| Checkpoint | Save selected states and recompute between them |
| Discard | Remove values unnecessary for backward |
This is one of the main performance problems in reverse-mode AD. Graph IRs make the problem visible enough for systematic optimization.
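A minimal sketch of the checkpoint strategy for a chain of identical steps: save every k-th state, then recompute each segment from its checkpoint when the backward pass reaches it. The step function and its VJP here are placeholders.

```python
import math

def step(x):
    return math.sin(x) + x               # placeholder primal step

def step_vjp(x, bar_y):
    return (math.cos(x) + 1.0) * bar_y   # local derivative of the step

def forward_with_checkpoints(x, n, k):
    checkpoints = {}
    for i in range(n):
        if i % k == 0:
            checkpoints[i] = x           # save selected states only
        x = step(x)
    return x, checkpoints

def backward_with_recompute(bar_y, n, k, checkpoints):
    for seg_start in range(((n - 1) // k) * k, -1, -k):
        seg_end = min(seg_start + k, n)
        xs, x = [], checkpoints[seg_start]
        for _ in range(seg_start, seg_end):
            xs.append(x)                 # recompute the discarded states in this segment
            x = step(x)
        for x in reversed(xs):           # then run the backward body over the segment
            bar_y = step_vjp(x, bar_y)
    return bar_y

y, cps = forward_with_checkpoints(0.3, n=8, k=3)
print(backward_with_recompute(1.0, n=8, k=3, checkpoints=cps))
```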
Effects in Graphs
A mathematical graph is pure. A program graph may have effects.
Effects include mutation, randomness, I/O, communication, and synchronization.
For AD, effects are dangerous when they make the backward pass inconsistent with the forward pass.
For example, random sampling requires a clear policy:
same random value in backward
new random value in backward
reparameterized differentiable sample
score-function estimator
nondifferentiable boundary

A graph IR should record effect dependencies explicitly. Otherwise the optimizer may reorder operations incorrectly.
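A minimal sketch of the first policy, reusing the same random value in backward, for a dropout-like operation: the sampled mask is treated as a saved value so the backward pass sees exactly the randomness the forward pass used.

```python
import numpy as np

def dropout_forward(x, p, rng):
    mask = (rng.random(x.shape) >= p) / (1.0 - p)   # sample once, save for backward
    return x * mask, mask

def dropout_backward(bar_y, mask):
    return bar_y * mask                             # reuse the saved mask; never resample

rng = np.random.default_rng(0)
y, mask = dropout_forward(np.ones(4), 0.5, rng)
bar_x = dropout_backward(np.ones(4), mask)
```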
For distributed graphs, communication nodes such as all_reduce, send, and recv also need derivative rules. The adjoint of communication is another communication pattern, often in the opposite direction or with reduction.
Primitive Boundaries
Graph IRs usually treat complex operations as primitives.
A softmax node may remain high-level:
p = softmax(x)or it may lower to:
m = reduce_max(x)
e = exp(x - m)
s = reduce_sum(e)
p = e / s

Keeping it high-level allows a stable and efficient derivative rule. Lowering it exposes more operations to generic optimization.
The choice matters.
High-level primitives preserve mathematical intent. Low-level primitives expose implementation details. A good compiler keeps high-level structure long enough to differentiate well, then lowers when code generation needs details.
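Softmax illustrates the trade-off. If it stays a primitive with p = softmax(x), its reverse rule can be written directly as bar_x = p * (bar_p - sum(p * bar_p)), which is both compact and numerically stable. A NumPy sketch over the last axis:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))   # the lowered form shown above
    return e / np.sum(e, axis=-1, keepdims=True)

def softmax_vjp(p, bar_p):
    # high-level reverse rule for p = softmax(x)
    return p * (bar_p - np.sum(p * bar_p, axis=-1, keepdims=True))
```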
Graph IR in Deep Learning
Deep learning frameworks made graph IRs widely visible.
A model is represented as a graph of tensor operations. The loss is a scalar output. Reverse-mode AD constructs a backward graph that computes gradients of parameters.
For a training step, the graph may include:
forward model
loss computation
backward gradient graph
optimizer update

In static graph systems, this full training graph may be compiled. In dynamic systems, the forward graph is recorded for each step and then replayed backward.
Modern systems often blend both approaches. They execute eagerly for usability, then capture graphs for optimization.
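One way this blend looks in practice, sketched here with JAX: the loss is ordinary Python, jax.grad builds the gradient computation, and jax.jit captures and compiles the resulting graph. The least-squares loss is only an illustration.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

grad_step = jax.jit(jax.grad(loss))   # differentiate, then compile the whole graph

w = jnp.zeros(3)
x = jnp.ones((4, 3))
y = jnp.ones(4)
print(grad_step(w, x, y))
```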
Graph IR in Scientific Computing
Scientific computing uses graph IRs when computations are staged, optimized, or distributed.
Examples include:
finite element assembly
PDE solvers
linear algebra pipelines
differentiable rendering
inverse problems
probabilistic programs

In these settings, graph nodes may be larger than tensor kernels. A node might represent a sparse solve, a mesh operation, or a simulation step.
The AD rule for such a node may use implicit differentiation instead of differentiating all internal operations. This can be far more efficient and numerically stable.
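For example, the adjoint of a linear solve x = solve(A, b) can itself be written as a solve against the transposed system, rather than by differentiating the solver's internal iterations. A NumPy sketch of this standard rule:

```python
import numpy as np

def solve_vjp(A, x, bar_x):
    # For x = solve(A, b):  bar_b = solve(A^T, bar_x),  bar_A = -outer(bar_b, x)
    bar_b = np.linalg.solve(A.T, bar_x)
    bar_A = -np.outer(bar_b, x)
    return bar_A, bar_b

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
x = np.linalg.solve(A, b)
print(solve_vjp(A, x, np.ones(2)))
```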
Design Criteria
A graph IR for AD should answer these questions:
What are the primitive operations?
What values are differentiable?
How are types and shapes represented?
How is control flow represented?
How are effects represented?
How are saved values represented?
How are adjoints accumulated?
How are custom derivative rules attached?
How does the graph lower to executable code?

A graph that cannot answer these questions precisely may be useful for simple models but fragile for general AD.
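How custom rules are attached is framework-specific. As one concrete example, JAX lets a function register its own VJP, which the system then treats as that operation's derivative rule:

```python
import jax
import jax.numpy as jnp

@jax.custom_vjp
def log1pexp(x):
    return jnp.log1p(jnp.exp(x))

def log1pexp_fwd(x):
    return log1pexp(x), x                    # save x for the backward rule

def log1pexp_bwd(x, bar_y):
    return (bar_y * jax.nn.sigmoid(x),)      # d/dx log(1 + e^x) = sigmoid(x)

log1pexp.defvjp(log1pexp_fwd, log1pexp_bwd)

print(jax.grad(log1pexp)(2.0))
```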
Summary
Graph IRs represent programs by their dependency structure. This makes them a natural substrate for automatic differentiation.
Forward mode adds tangent edges and JVP nodes. Reverse mode adds adjoint edges, VJP nodes, and accumulation nodes. Static graphs enable ahead-of-time optimization. Dynamic graphs support ordinary runtime control flow.
The main challenge is not drawing the graph. The challenge is giving the graph precise semantics for control flow, shape, effects, saved values, memory, and lowering. A graph IR is effective for AD only when those semantics are explicit.