# Graph IRs


A graph intermediate representation models a program as nodes and edges.

Nodes represent operations or values. Edges represent data dependencies. For automatic differentiation, this form is natural because the chain rule also follows a dependency graph.

A simple function

```python
from math import sin

def f(x):
    y = x * x
    z = sin(y)
    return z
```

can be represented as:

```text
x -> mul -> y -> sin -> z
```

The primal graph computes values from inputs to outputs. The derivative graph computes tangent or adjoint values using the same dependency structure.

### Dataflow Graphs

A dataflow graph contains operation nodes and value edges.

```text
x0 ----\
       mul -> x1 -> sin -> x2
x0 ----/
```

Each operation has:

| Field | Meaning |
|---|---|
| op | Operation name, such as `mul`, `sin`, `matmul` |
| inputs | Values consumed by the operation |
| outputs | Values produced by the operation |
| attributes | Static metadata, such as axis, shape, dtype |
| effects | Optional state, mutation, randomness, or I/O behavior |

For AD, every differentiable operation needs a local derivative rule. The graph supplies the wiring.

Forward mode adds tangent edges. Reverse mode adds adjoint edges and reverse operations.
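The operation record described in the table can be sketched as a small Python structure. The names here are illustrative, not taken from any particular framework:

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    op: str          # operation name, e.g. "mul", "sin"
    inputs: list     # names of values consumed
    outputs: list    # names of values produced
    attributes: dict = field(default_factory=dict)  # static metadata

# the dataflow graph x0 -> mul -> x1 -> sin -> x2 as a list of ops
graph = [
    Op("mul", ["x0", "x0"], ["x1"]),
    Op("sin", ["x1"], ["x2"]),
]
```

Edges are implicit: an edge exists wherever one operation's output name appears among another operation's inputs.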

### Forward Graph Transformation

Forward mode transforms each primal value `v` into a pair:

```text
(v, dv)
```

For an operation

```text
z = op(x, y)
```

the transformed graph contains:

```text
z  = op(x, y)
dz = jvp_op(x, y, dx, dy)
```

For multiplication:

```text
z = x * y
dz = dx * y + x * dy
```

The transformed graph still flows from inputs to outputs. Its structure is close to the original graph, but with additional tangent computations.
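The transformation can be sketched directly for the running example `f(x) = sin(x * x)`. Each primal operation is paired with its JVP (the function names are illustrative):

```python
import math

def jvp_mul(x, y, dx, dy):
    # z = x * y,  dz = dx * y + x * dy
    return x * y, dx * y + x * dy

def jvp_sin(x, dx):
    # z = sin(x),  dz = cos(x) * dx
    return math.sin(x), math.cos(x) * dx

def f_jvp(x, dx):
    # the transformed graph: primal and tangent flow together
    y, dy = jvp_mul(x, x, dx, dx)
    z, dz = jvp_sin(y, dy)
    return z, dz

z, dz = f_jvp(2.0, 1.0)   # dz = cos(4) * 2 * 2
```

With seed tangent `dx = 1`, `dz` is the ordinary derivative `2x cos(x^2)` evaluated at `x = 2`.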

### Reverse Graph Transformation

Reverse mode builds a backward graph.

For each primal value `v`, reverse mode introduces an adjoint value:

```text
bar_v
```

For an operation

```text
z = op(x, y)
```

the backward graph applies a vector-Jacobian product rule:

```text
(bar_x, bar_y) += vjp_op(x, y, bar_z)
```

For multiplication:

```text
bar_x += y * bar_z
bar_y += x * bar_z
```

The backward graph runs in reverse topological order. If several downstream paths contribute to the same adjoint, the graph inserts accumulation nodes.

```text
bar_x = contribution_1 + contribution_2 + ...
```

This accumulation is central. Primal values usually flow forward once. Adjoint values collect contributions from all paths to the output.
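A minimal hand-written sketch shows the accumulation for `f(x0) = sin(x0 * x0) + x0`, where `x0` reaches the output along two paths and through both slots of `mul`:

```python
import math

def f_vjp(x0, bar_x3=1.0):
    # forward pass: x1 = x0*x0, x2 = sin(x1), x3 = x2 + x0
    x1 = x0 * x0
    x2 = math.sin(x1)
    x3 = x2 + x0
    # backward pass, reverse topological order
    bar_x2 = bar_x3                       # add: passes bar_x3 through
    bar_x0 = bar_x3                       # first contribution to bar_x0
    bar_x1 = math.cos(x1) * bar_x2        # sin rule
    bar_x0 += x0 * bar_x1 + x0 * bar_x1   # mul with x0 in both slots
    return x3, bar_x0

val, grad = f_vjp(1.5)   # grad = 1 + 2 * 1.5 * cos(1.5**2)
```

The `+=` lines are exactly the accumulation nodes: `bar_x0` collects one contribution from the `add` and two from the `mul`.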

### Static Graphs

A static graph is built before execution.

The program first constructs a graph object. The runtime then compiles or interprets that graph.

Static graphs are useful because the system can inspect the whole computation before running it. It can optimize the graph, plan memory, fuse kernels, partition work across devices, and generate derivative graphs ahead of time.

A typical static graph pipeline is:

```text
build graph
infer types and shapes
differentiate graph
optimize graph
lower to executable code
run
```

This model fits tensor programs well. Nodes such as `matmul`, `conv2d`, `reduce_sum`, and `softmax` are large enough that graph-level optimization matters.
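The build-then-run split can be sketched with a tiny interpreter. The graph is constructed first as data (assumed already in topological order, as the pipeline's earlier passes would ensure), then executed:

```python
import math

# a prebuilt static graph: each node is (output, op, inputs)
GRAPH = [
    ("x1", "mul", ("x0", "x0")),
    ("x2", "sin", ("x1",)),
]

OPS = {"mul": lambda a, b: a * b, "sin": math.sin}

def run(graph, env):
    # interpret the graph: each node fires once its inputs are bound
    for out, op, ins in graph:
        env[out] = OPS[op](*(env[i] for i in ins))
    return env

env = run(GRAPH, {"x0": 2.0})   # env["x2"] == sin(4.0)
```

Because `GRAPH` exists before `run` is called, a real system could differentiate, optimize, or compile it at this point instead of interpreting it.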

### Dynamic Graphs

A dynamic graph is built during execution.

Each operation executed by the program appends a node to a tape or graph. After the forward pass, reverse mode traverses this recorded graph backward.

This is common in eager deep learning systems. The user writes ordinary host-language code, including branches and loops. The graph records what actually happened.

```text
run user code
record executed tensor operations
seed output adjoint
traverse recorded graph backward
```

Dynamic graphs are flexible. They handle runtime-dependent control flow naturally. Their cost is that optimization sees only the executed trace, often after some runtime overhead has already been paid.
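A minimal taping scheme can be sketched in a few lines. Each operation creates a recorded `Value` node as a side effect of ordinary execution; `backward` then walks the recorded graph in reverse (the class and function names are illustrative):

```python
import math

class Value:
    # a recorded value: data plus the op and parents that produced it
    def __init__(self, data, op=None, parents=()):
        self.data, self.op, self.parents, self.grad = data, op, parents, 0.0

def mul(a, b):
    return Value(a.data * b.data, "mul", (a, b))

def sin(a):
    return Value(math.sin(a.data), "sin", (a,))

def backward(out):
    # order the recorded nodes, then apply VJP rules in reverse
    tape, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p in v.parents:
                visit(p)
            tape.append(v)
    visit(out)
    out.grad = 1.0
    for v in reversed(tape):
        if v.op == "mul":
            a, b = v.parents
            a.grad += b.data * v.grad
            b.grad += a.data * v.grad
        elif v.op == "sin":
            (a,) = v.parents
            a.grad += math.cos(a.data) * v.grad

x = Value(2.0)
z = sin(mul(x, x))   # ordinary code; the graph records what ran
backward(z)          # x.grad == 4 * cos(4)
```

Branches and loops need no special handling here: only the operations that actually executed appear on the tape.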

### Graph IR vs SSA IR

Graph IR and SSA IR are closely related.

A straight-line SSA program can be read as a graph:

```text
x1 = mul x0 x0
x2 = sin x1
x3 = add x2 x0
```

The values `x1`, `x2`, and `x3` are graph edges. The operations `mul`, `sin`, and `add` are graph nodes.

The difference is emphasis.

| Representation | Emphasis |
|---|---|
| SSA IR | Ordered instructions, control-flow blocks, compiler analysis |
| Graph IR | Dependency structure, tensor operations, scheduling freedom |

Graph IRs work well when the computation is mostly dataflow. SSA IRs work well when general control flow, mutation, and compiler lowering matter.

Many systems combine them. A graph may contain regions with SSA blocks. An SSA compiler may use graph views for optimization and differentiation.

### Control Flow in Graph IRs

Pure dataflow graphs have no ordinary branch or loop syntax. Control flow must be represented explicitly.

A conditional may be represented as a special node:

```text
y = if(cond, then_graph, else_graph)
```

A loop may be represented as:

```text
y = while(init_state, cond_graph, body_graph)
```

These higher-order graph nodes contain subgraphs.

For AD, the derivative of an `if` node differentiates only the selected branch. The derivative of a `while` node must account for all executed iterations.

Reverse mode through loops usually requires either saved loop states or recomputation.

```text
forward while:
    run body repeatedly
    save required states

reverse while:
    run reverse body in opposite iteration order
```

A graph IR must specify how subgraphs capture values, return values, and interact with side effects.
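The save-and-reverse pattern for loops can be sketched concretely. This is a minimal illustration, assuming the loop state is a `(x, i)` pair and the body's VJP is supplied by hand:

```python
def forward_while(state, cond, body):
    # forward while: run body repeatedly, saving each pre-body state
    saved = []
    while cond(state):
        saved.append(state)
        state = body(state)
    return state, saved

def reverse_while(bar_x, saved, body_vjp):
    # reverse while: replay saved states in opposite iteration order
    for state in reversed(saved):
        bar_x = body_vjp(state, bar_x)
    return bar_x

# loop doubling x three times: y = 8 * x
cond = lambda s: s[1] < 3                 # state = (x, i)
body = lambda s: (s[0] * 2.0, s[1] + 1)

def body_vjp(state, bar_x):
    # d(new_x)/d(x) = 2; in general the VJP needs the saved state
    return 2.0 * bar_x

final, saved = forward_while((1.5, 0), cond, body)
bar_x = reverse_while(1.0, saved, body_vjp)   # bar_x == 8.0
```

The saved list is the memory cost of this strategy; checkpointing would save only some states and recompute the rest.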

### Shape and Type Inference

Graph IRs often represent tensor programs, so shape and type inference are core compiler passes.

For example:

```text
C = matmul(A, B)
```

requires:

```text
A: tensor<f32, [m, k]>
B: tensor<f32, [k, n]>
C: tensor<f32, [m, n]>
```

The reverse rules are:

```text
bar_A = matmul(bar_C, transpose(B))
bar_B = matmul(transpose(A), bar_C)
```

Without shape information, the compiler cannot validate these rules or allocate buffers.

Shape information is also needed for broadcasting. If an input was broadcast in the forward pass, its adjoint must be reduced back to the original shape.

```text
bar_x = reduce_sum_to_shape(bar_z, shape(x))
```

### Graph Optimization

Graph IRs are convenient for whole-computation optimization.

Common graph optimizations include:

| Optimization | Example |
|---|---|
| Constant folding | Precompute static expressions |
| Dead node elimination | Remove unused operations |
| Algebraic simplification | Replace `x * 1` with `x` |
| Operator fusion | Fuse elementwise chains |
| Layout propagation | Choose memory layouts consistently |
| Common subgraph elimination | Reuse repeated computations |
| Rematerialization | Recompute instead of saving |
| Device placement | Assign nodes to CPU, GPU, TPU, or remote workers |
| Graph partitioning | Split graph across devices or processes |

AD interacts with these passes. The compiler may optimize before differentiation, after differentiation, or both.

Optimizing before AD can simplify the primal graph. Optimizing after AD can remove redundant derivative work. Some transformations are best applied jointly because they affect saved activations and backward memory.
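Dead node elimination is the simplest of these passes to sketch. With the graph as a topologically ordered list of `(output, op, inputs)` triples, a single backward sweep keeps only nodes that reach the outputs:

```python
def dead_node_elimination(graph, outputs):
    # keep only operations whose results reach the graph outputs
    live = set(outputs)
    kept = []
    for out, op, ins in reversed(graph):   # reverse topological order
        if out in live:
            kept.append((out, op, ins))
            live.update(ins)
    return list(reversed(kept))

graph = [
    ("x1", "mul", ("x0", "x0")),
    ("t",  "exp", ("x0",)),       # computed but never used
    ("x2", "sin", ("x1",)),
]
pruned = dead_node_elimination(graph, ["x2"])   # the exp node is dropped
```

After differentiation, the same pass removes derivative nodes whose adjoints are never consumed.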

### Saved Values and Graph Lifetimes

Reverse mode needs primal values during the backward pass.

A graph IR can make saved values explicit:

```text
save x
save y
```

or attach saved values to backward functions:

```text
z = op(x, y)
backward(bar_z) uses x, y
```

The compiler then performs liveness analysis.

It decides whether to:

| Strategy | Meaning |
|---|---|
| Save | Store the value from the forward pass |
| Recompute | Recreate the value during backward |
| Checkpoint | Save selected states and recompute between them |
| Discard | Remove values unnecessary for backward |

This is one of the main performance problems in reverse-mode AD. Graph IRs make the problem visible enough for systematic optimization.

### Effects in Graphs

A mathematical graph is pure. A program graph may have effects.

Effects include mutation, randomness, I/O, communication, and synchronization.

For AD, effects are dangerous when they make the backward pass inconsistent with the forward pass.

For example, random sampling requires a clear policy:

```text
same random value in backward
new random value in backward
reparameterized differentiable sample
score-function estimator
nondifferentiable boundary
```
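The first policy, reusing the same random value in backward, is usually implemented by saving the sampled values. Dropout is one concrete instance (a sketch, not from the text; helper names are illustrative):

```python
import random

def dropout_forward(x, p, rng):
    # sample a mask and save it so backward sees the same randomness
    mask = [1.0 if rng.random() >= p else 0.0 for _ in x]
    y = [xi * mi / (1.0 - p) for xi, mi in zip(x, mask)]
    return y, mask

def dropout_backward(bar_y, mask, p):
    # the saved mask keeps backward consistent with forward
    return [bi * mi / (1.0 - p) for bi, mi in zip(bar_y, mask)]

rng = random.Random(0)
y, mask = dropout_forward([1.0, 2.0, 3.0], 0.5, rng)
bar_x = dropout_backward([1.0, 1.0, 1.0], mask, 0.5)
```

Resampling in backward would make the computed gradient inconsistent with the function that was actually evaluated.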

A graph IR should record effect dependencies explicitly. Otherwise the optimizer may reorder operations incorrectly.

For distributed graphs, communication nodes such as `all_reduce`, `send`, and `recv` also need derivative rules. The adjoint of communication is another communication pattern, often in the opposite direction or with reduction.

### Primitive Boundaries

Graph IRs usually treat complex operations as primitives.

A `softmax` node may remain high-level:

```text
p = softmax(x)
```

or it may lower to:

```text
m = reduce_max(x)
e = exp(x - m)
s = reduce_sum(e)
p = e / s
```

Keeping it high-level allows a stable and efficient derivative rule. Lowering it exposes more operations to generic optimization.

The choice matters.

High-level primitives preserve mathematical intent. Low-level primitives expose implementation details. A good compiler keeps high-level structure long enough to differentiate well, then lowers when code generation needs details.
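The payoff of keeping `softmax` high-level is visible in its VJP, which can be written directly in terms of the output `p` as `bar_x = p * (bar_p - <p, bar_p>)`, rather than differentiating the four lowered operations. A sketch with NumPy:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax, matching the lowered form
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_vjp(p, bar_p):
    # high-level rule: bar_x = p * (bar_p - <p, bar_p>)
    return p * (bar_p - np.dot(p, bar_p))

x = np.array([1.0, 2.0, 3.0])
p = softmax(x)
bar_x = softmax_vjp(p, np.array([1.0, 0.0, 0.0]))
```

The rule reuses the forward output `p`, never materializes the Jacobian, and inherits the forward pass's numerical stability.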

### Graph IR in Deep Learning

Deep learning frameworks made graph IRs widely visible.

A model is represented as a graph of tensor operations. The loss is a scalar output. Reverse-mode AD constructs a backward graph that computes gradients of parameters.

For a training step, the graph may include:

```text
forward model
loss computation
backward gradient graph
optimizer update
```

In static graph systems, this full training graph may be compiled. In dynamic systems, the forward graph is recorded for each step and then replayed backward.

Modern systems often blend both approaches. They execute eagerly for usability, then capture graphs for optimization.

### Graph IR in Scientific Computing

Scientific computing uses graph IRs when computations are staged, optimized, or distributed.

Examples include:

```text
finite element assembly
PDE solvers
linear algebra pipelines
differentiable rendering
inverse problems
probabilistic programs
```

In these settings, graph nodes may be larger than tensor kernels. A node might represent a sparse solve, a mesh operation, or a simulation step.

The AD rule for such a node may use implicit differentiation instead of differentiating all internal operations. This can be far more efficient and numerically stable.
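A standard instance is a dense linear solve `x = A^{-1} b`: implicit differentiation gives `bar_b = A^{-T} bar_x` and `bar_A = -bar_b x^T`, so the adjoint is another solve rather than a differentiated solver. A sketch with NumPy:

```python
import numpy as np

def solve_vjp(A, b, bar_x):
    # implicit differentiation of x = A^{-1} b:
    #   bar_b = A^{-T} bar_x,  bar_A = -bar_b x^T
    x = np.linalg.solve(A, b)
    bar_b = np.linalg.solve(A.T, bar_x)
    bar_A = -np.outer(bar_b, x)
    return bar_A, bar_b

A = np.array([[2.0, 1.0], [0.0, 3.0]])
b = np.array([1.0, 2.0])
bar_A, bar_b = solve_vjp(A, b, np.ones(2))
```

Whatever factorization or iteration the forward solve used is irrelevant to the rule, which is both cheaper and more stable than differentiating through the solver's steps.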

### Design Criteria

A graph IR for AD should answer these questions:

```text
What are the primitive operations?
What values are differentiable?
How are types and shapes represented?
How is control flow represented?
How are effects represented?
How are saved values represented?
How are adjoints accumulated?
How are custom derivative rules attached?
How does the graph lower to executable code?
```

A graph that cannot answer these questions precisely may be useful for simple models but fragile for general AD.

### Summary

Graph IRs represent programs by their dependency structure. This makes them a natural substrate for automatic differentiation.

Forward mode adds tangent edges and JVP nodes. Reverse mode adds adjoint edges, VJP nodes, and accumulation nodes. Static graphs enable ahead-of-time optimization. Dynamic graphs support ordinary runtime control flow.

The main challenge is not drawing the graph. The challenge is giving the graph precise semantics for control flow, shape, effects, saved values, memory, and lowering. A graph IR is effective for AD only when those semantics are explicit.

