# Dependency Graphs

A dependency graph describes how values in a computation depend on earlier values. Automatic differentiation operates on these dependencies.

For a straight-line program,

```text
v1 = x1
v2 = x2
v3 = v1 * v2
v4 = sin(v3)
v5 = v4 + v1
y  = v5
```

the dependency structure is:

```text
v1 ─┬─> v3 ─> v4 ─┐
    │             ├─> v5
v2 ─┘             │
v1 ───────────────┘
```

In this drawing, `v1` appears twice only for layout; it is a single node with two outgoing edges. Each node represents a variable or operation result, and each edge represents data flow.

The graph determines how derivatives propagate.

## Directed Acyclic Structure

A dependency graph for a straight-line program is a directed acyclic graph.

The graph is directed because dependencies have orientation:

```text
v3 = v1 * v2
```

means $v_3$ depends on $v_1$ and $v_2$, not the reverse.

The graph is acyclic because each variable depends only on variables computed earlier. No variable can depend on itself through a cycle.

This acyclicity guarantees that a valid evaluation sequence exists.

If the nodes are ordered:

$$
v_1, v_2, v_3, v_4, v_5
$$

then every node appears after its dependencies.

This ordering is called a topological order.

## Dependency Graph as Function Composition

A dependency graph encodes a decomposition of a function into local maps.

For

```text
v3 = v1 * v2
v4 = sin(v3)
v5 = v4 + v1
```

the overall function is:

$$
y = g_3\bigl(g_2(g_1(x_1, x_2)),\, x_1\bigr)
$$

where:

$$
g_1(x_1, x_2) = x_1x_2
$$

$$
g_2(v_3) = \sin(v_3)
$$

$$
g_3(v_4, x_1) = v_4 + x_1
$$

The graph stores the composition structure explicitly.

Automatic differentiation applies the chain rule over this graph.
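
The same composition can be written down directly. A minimal sketch in Go, with `g1`, `g2`, and `g3` mirroring the definitions above:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Local maps mirroring g1, g2, g3 above.
	g1 := func(x1, x2 float64) float64 { return x1 * x2 }
	g2 := math.Sin
	g3 := func(v4, x1 float64) float64 { return v4 + x1 }

	x1, x2 := 2.0, 3.0
	y := g3(g2(g1(x1, x2)), x1) // y = sin(x1*x2) + x1
	fmt.Println(y)
}
```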

## Nodes and Edges

Different AD systems represent graphs differently, but the conceptual structure is similar.

| Element | Meaning |
|---|---|
| Node | A computed value |
| Edge | A dependency relation |
| Source node | Input variable |
| Sink node | Output variable |
| Parent | A node used to compute another node |
| Child | A node computed from another node |

For

```text
v3 = v1 * v2
```

the graph contains edges:

$$
v_1 \to v_3
$$

$$
v_2 \to v_3
$$

because $v_3$ depends on both inputs.

## Operation Nodes vs Value Nodes

Some systems represent operations explicitly.

Instead of:

```text
v1 ─┐
    ├─> v3
v2 ─┘
```

they use:

```text
v1 ─┐
    ├─> (*) ─> v3
v2 ─┘
```

The operation node stores:

| Field | Example |
|---|---|
| Operation type | multiply |
| Input references | `v1`, `v2` |
| Output reference | `v3` |
| Derivative rule | product rule |

Operation-centered graphs are common in compiler IRs and tensor frameworks.

Value-centered graphs are common in minimal AD engines.
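
As an illustration, here is a sketch of an operation-centered node in Go; the field names and the `Backward` closure are illustrative, not taken from any particular framework:

```go
package main

import "fmt"

// OpNode is a hypothetical operation-centered graph node.
type OpNode struct {
	Kind   string // operation type, e.g. "mul"
	Inputs []int  // IDs of the input value nodes
	Output int    // ID of the produced value node
	// Backward encodes the derivative rule: it maps the output
	// adjoint and the input values to the input adjoints.
	Backward func(outAdj float64, in []float64) []float64
}

func main() {
	// v3 = v1 * v2, with the product rule as its derivative rule.
	mul := OpNode{
		Kind:   "mul",
		Inputs: []int{1, 2},
		Output: 3,
		Backward: func(outAdj float64, in []float64) []float64 {
			return []float64{outAdj * in[1], outAdj * in[0]}
		},
	}
	fmt.Println(mul.Backward(1.0, []float64{2.0, 3.0})) // [3 2]
}
```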

## Forward Traversal

Forward evaluation follows graph direction.

For

```text
v3 = v1 * v2
v4 = sin(v3)
```

evaluation order is:

$$
v_1, v_2 \to v_3 \to v_4
$$

Forward mode AD propagates tangent information in this direction.

If:

$$
(\dot v_1, \dot v_2)
$$

are known, then:

$$
\dot v_3
$$

can be computed, followed by:

$$
\dot v_4.
$$

The graph structure guarantees all required inputs are available.
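
A minimal forward-mode sweep for these two operations, written out by hand in Go; each variable carries a value and a tangent, and the sweep visits nodes in graph order:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Seed tangents: differentiate with respect to v1.
	v1, d1 := 2.0, 1.0
	v2, d2 := 3.0, 0.0

	// v3 = v1 * v2; tangent by the product rule.
	v3 := v1 * v2
	d3 := d1*v2 + v1*d2

	// v4 = sin(v3); tangent by the chain rule.
	v4 := math.Sin(v3)
	d4 := math.Cos(v3) * d3

	fmt.Println(v4, d4) // value of v4 and d(v4)/d(v1)
}
```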

## Reverse Traversal

Reverse mode traverses the graph backward.

Starting from:

$$
\bar y = 1
$$

the adjoint propagates from outputs to inputs.

For:

```text
v4 = sin(v3)
```

reverse propagation is:

$$
\bar v_3 += \bar v_4 \cos(v_3).
$$

Then for:

```text
v3 = v1 * v2
```

reverse propagation is:

$$
\bar v_1 += \bar v_3 v_2
$$

$$
\bar v_2 += \bar v_3 v_1.
$$

The backward traversal accumulates sensitivity information along reversed edges.
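
The same two operations with a hand-written reverse sweep; a sketch assuming scalar values and a seed adjoint of 1:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Forward sweep: compute and keep the intermediate values.
	v1, v2 := 2.0, 3.0
	v3 := v1 * v2
	v4 := math.Sin(v3)
	_ = v4

	// Reverse sweep: propagate adjoints along reversed edges.
	v4bar := 1.0                  // seed: dy/dv4 = 1
	v3bar := v4bar * math.Cos(v3) // v4 = sin(v3)
	v1bar := v3bar * v2           // v3 = v1 * v2
	v2bar := v3bar * v1

	fmt.Println(v1bar, v2bar) // dy/dv1, dy/dv2
}
```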

## Fan-Out and Gradient Accumulation

A node may feed multiple later computations.

Example:

```text
v1 = x
v2 = sin(v1)
v3 = cos(v1)
v4 = v2 + v3
```

Dependency graph:

```text
        ┌─> v2 ─┐
v1 ─────┤       ├─> v4
        └─> v3 ─┘
```

The value $v_1$ influences both $v_2$ and $v_3$.

During reverse mode:

$$
\bar v_1
$$

receives contributions from both paths:

$$
\bar v_1 += \bar v_2 \cos(v_1)
$$

$$
\bar v_1 += -\bar v_3 \sin(v_1).
$$

Thus:

$$
\bar v_1 =
\bar v_2 \cos(v_1) -
\bar v_3 \sin(v_1).
$$

Gradient accumulation is fundamentally graph accumulation.
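
Written out by hand in Go, the fan-out example makes the accumulation explicit: `v1bar` starts at zero and receives one contribution per outgoing edge:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Forward sweep for v2 = sin(v1), v3 = cos(v1), v4 = v2 + v3.
	v1 := 0.5
	v2 := math.Sin(v1)
	v3 := math.Cos(v1)
	v4 := v2 + v3
	_ = v4

	// Reverse sweep with explicit accumulation at the fan-out node.
	v4bar := 1.0
	v2bar := v4bar // v4 = v2 + v3
	v3bar := v4bar

	v1bar := 0.0
	v1bar += v2bar * math.Cos(v1)  // contribution through v2
	v1bar += -v3bar * math.Sin(v1) // contribution through v3

	fmt.Println(v1bar) // cos(v1) - sin(v1)
}
```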

## Shared Subgraphs

Dependency graphs naturally represent shared computation.

For:

```text
v2 = x * x
v3 = sin(v2)
v4 = cos(v2)
```

the subexpression `x * x` appears only once in the graph.

This prevents redundant work.

Expression trees duplicate repeated subexpressions. Dependency graphs preserve sharing.

This distinction becomes important in large tensor programs where recomputation is expensive.
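
One common way to preserve sharing is hash consing: a node constructor that returns an existing node when the same operation is requested on the same inputs. A minimal sketch with hypothetical names:

```go
package main

import "fmt"

type node struct {
	id int
	op string
	in []int
}

var (
	nodes []node
	memo  = map[string]int{} // (op, inputs) -> node ID
)

// mk reuses an existing node for (op, in) if one was already built,
// so repeated subexpressions map to a single graph node.
func mk(op string, in ...int) int {
	key := fmt.Sprint(op, in)
	if id, ok := memo[key]; ok {
		return id
	}
	id := len(nodes)
	nodes = append(nodes, node{id: id, op: op, in: in})
	memo[key] = id
	return id
}

func main() {
	x := mk("input")
	v2 := mk("mul", x, x) // x * x
	v3 := mk("sin", v2)
	v4 := mk("cos", v2) // reuses the existing x*x node
	fmt.Println(v2, v3, v4, len(nodes)) // 1 2 3 4: x*x was built once
}
```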

## Tensor Dependency Graphs

In tensor systems, nodes may represent entire tensor operations.

Example:

```text
v1 = matmul(x, w1)
v2 = relu(v1)
v3 = matmul(v2, w2)
v4 = softmax(v3)
y  = cross_entropy(v4, target)
```

The graph represents tensor dependencies, not scalar dependencies.

Each node may contain:

| Field | Example |
|---|---|
| Shape | `(1024, 4096)` |
| Dtype | `float32` |
| Device | GPU 0 |
| Layout | row-major |
| Operation | `matmul` |

Reverse mode traverses the same dependency graph but applies tensor derivative rules.

## Dynamic Graphs

Some systems construct dependency graphs dynamically during execution.

Example in a dynamic language:

```python
import math

# x is assumed to be defined earlier
if x > 0:
    y = x * x
else:
    y = math.sin(x)
```

The executed branch determines the graph.

If $x > 0$:

```text
x ─> (*) ─> y
```

If $x \le 0$:

```text
x ─> sin ─> y
```

The graph exists only for the executed computation.

Frameworks such as entity["software","PyTorch","deep learning framework"] commonly use this approach.

## Static Graphs

Other systems build a graph representation before execution.

Example:

```text
Input -> MatMul -> ReLU -> MatMul -> Output
```

The graph becomes a compilable program representation.

Advantages include:

| Benefit | Reason |
|---|---|
| Optimization | Whole-graph analysis |
| Fusion | Combine operations |
| Scheduling | Global execution planning |
| Memory reuse | Lifetime analysis |
| Device partitioning | Multi-device placement |

Frameworks such as entity["software","XLA","machine learning compiler"] and entity["software","TensorFlow","machine learning framework"] heavily use static graph compilation.

## Graph Representation in Memory

A minimal graph representation might store:

```go
type Node struct {
    ID      int
    Op      Op
    Inputs  []int
    Users   []int
}
```

Where:

| Field | Meaning |
|---|---|
| `ID` | Node identifier |
| `Op` | Operation type |
| `Inputs` | Dependencies |
| `Users` | Consumers of this node |

A reverse traversal often iterates over nodes in reverse topological order.

## Reverse Topological Order

Suppose the graph order is:

$$
v_1, v_2, v_3, v_4, v_5
$$

Then reverse mode processes:

$$
v_5, v_4, v_3, v_2, v_1.
$$

This guarantees that when a node is processed, all downstream adjoint contributions have already been accumulated.

Reverse topological traversal is one of the central execution rules of reverse-mode AD.
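
For graphs built by appending each node after its inputs, creation order is already topological, so the backward pass can simply walk the node list in reverse. A minimal sketch, assuming scalar nodes that store their local partial derivatives:

```go
package main

import (
	"fmt"
	"math"
)

// tnode is a minimal tape node: each entry is appended after its
// inputs, so index order is a topological order.
type tnode struct {
	inputs []int     // IDs of input nodes
	grads  []float64 // local partials d(node)/d(input)
}

func main() {
	v1, v2 := 2.0, 3.0
	v3 := v1 * v2

	tape := []tnode{
		{}, // v1 (input)
		{}, // v2 (input)
		{inputs: []int{0, 1}, grads: []float64{v2, v1}},    // v3 = v1*v2
		{inputs: []int{2}, grads: []float64{math.Cos(v3)}}, // v4 = sin(v3)
	}

	// Reverse topological sweep: when node i is visited, every
	// consumer of i has already added its adjoint contribution.
	adj := make([]float64, len(tape))
	adj[len(tape)-1] = 1 // seed the output adjoint
	for i := len(tape) - 1; i >= 0; i-- {
		for k, in := range tape[i].inputs {
			adj[in] += adj[i] * tape[i].grads[k]
		}
	}
	fmt.Println(adj[0], adj[1]) // cos(v3)*v2, cos(v3)*v1
}
```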

## Graph Transformations

Modern AD systems frequently transform dependency graphs.

Typical transformations include:

| Transformation | Purpose |
|---|---|
| Constant folding | Precompute constants |
| Dead node elimination | Remove unused computations |
| Fusion | Combine kernels |
| Common subexpression elimination | Reuse repeated computation |
| Rematerialization | Recompute instead of storing |
| Shape propagation | Infer tensor shapes |
| Differentiation transform | Generate backward graph |

In compiler-based AD systems, differentiation itself is often implemented as a graph rewrite.
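
As a small example of such a rewrite, dead node elimination can be implemented as a backward reachability pass from the outputs; a sketch over per-node input lists like those in the `Node` struct above:

```go
package main

import "fmt"

func main() {
	// inputs[i] lists the dependencies of node i.
	inputs := [][]int{{}, {}, {0, 1}, {2}, {0}} // node 4 feeds nothing
	outputs := []int{3}

	// Mark every node reachable from an output.
	live := make([]bool, len(inputs))
	var mark func(i int)
	mark = func(i int) {
		if live[i] {
			return
		}
		live[i] = true
		for _, in := range inputs[i] {
			mark(in)
		}
	}
	for _, o := range outputs {
		mark(o)
	}
	fmt.Println(live) // [true true true true false]: node 4 is dead
}
```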

## Graphs and the Chain Rule

The dependency graph is a concrete representation of the chain rule.

For a path:

$$
x \to v_1 \to v_2 \to y
$$

the contribution to the derivative is:

$$
\frac{\partial y}{\partial v_2}
\frac{\partial v_2}{\partial v_1}
\frac{\partial v_1}{\partial x}.
$$

If multiple paths connect $x$ to $y$, the total derivative is the sum over all paths.

This explains why reverse mode accumulates contributions at merge points.

The graph structure directly determines derivative propagation.
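
As a concrete instance, the fan-out example above has two paths from $v_1$ to $v_4$, and the sum over paths reproduces the accumulated adjoint:

$$
\frac{\partial v_4}{\partial v_1}
= \underbrace{\frac{\partial v_4}{\partial v_2}\frac{\partial v_2}{\partial v_1}}_{\text{path through } v_2}
+ \underbrace{\frac{\partial v_4}{\partial v_3}\frac{\partial v_3}{\partial v_1}}_{\text{path through } v_3}
= \cos(v_1) - \sin(v_1).
$$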

## Cycles and Recurrence

Straight-line dependency graphs are acyclic, but real programs may contain loops.

Example:

```text
h_t = tanh(W h_{t-1} + U x_t)
```

This creates a recurrence relation.

AD systems usually handle recurrence by unrolling execution into a larger acyclic graph:

```text
h0 -> h1 -> h2 -> h3 -> ...
```

The resulting unrolled graph is again a DAG.

Backpropagation through time is reverse-mode AD on this expanded graph.
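
A scalar stand-in for the recurrence, with `w` and `u` in place of the matrices $W$ and $U$; executing the loop is exactly what produces the unrolled chain:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Scalar stand-in for h_t = tanh(W h_{t-1} + U x_t),
	// unrolled over three steps: h0 -> h1 -> h2 -> h3.
	w, u := 0.5, 1.0
	h := 0.0 // h0
	for _, x := range []float64{1.0, -1.0, 2.0} {
		h = math.Tanh(w*h + u*x)
	}
	fmt.Println(h) // h3
}
```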

## Core Idea

A dependency graph makes computation structure explicit. Nodes represent computed values. Edges represent data dependencies. Forward mode moves with graph direction. Reverse mode moves against it.

The graph is the operational form of the chain rule. Automatic differentiation works because derivative propagation follows the same dependency structure as the original computation.

