Dependency Graphs

A dependency graph describes how values in a computation depend on earlier values. Automatic differentiation operates on these dependencies.

For a straight-line program,

v1 = x1
v2 = x2
v3 = v1 * v2
v4 = sin(v3)
v5 = v4 + v1
y  = v5

the dependency structure is:

v1 ─┬─> v3 ─> v4 ─┐
    │             ├─> v5
v2 ─┘             │
v1 ───────────────┘

Each node represents a variable or operation result. Each edge represents data flow.

The graph determines how derivatives propagate.

Directed Acyclic Structure

A dependency graph for a straight-line program is a directed acyclic graph.

The graph is directed because dependencies have orientation:

v3 = v1 * v2

means v_3 depends on v_1 and v_2, not the reverse.

The graph is acyclic because each variable depends only on variables computed earlier. No variable can depend on itself through a cycle.

Acyclicity guarantees that a valid evaluation sequence exists.

If the nodes are ordered:

v_1, v_2, v_3, v_4, v_5

then every node appears after its dependencies.

This ordering is called a topological order.
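
A topological order can be computed directly from the dependency lists. The sketch below uses a depth-first traversal; the deps mapping and function name are illustrative, not part of any particular AD system:

def topological_order(deps):
    # Return nodes so that every node appears after its dependencies.
    order, visited = [], set()

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for parent in deps.get(node, ()):  # visit dependencies first
            visit(parent)
        order.append(node)                 # then emit the node itself

    for node in deps:
        visit(node)
    return order

# The straight-line program above:
deps = {"v1": [], "v2": [], "v3": ["v1", "v2"],
        "v4": ["v3"], "v5": ["v4", "v1"]}
print(topological_order(deps))  # ['v1', 'v2', 'v3', 'v4', 'v5']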

Dependency Graph as Function Composition

A dependency graph encodes a decomposition of a function into local maps.

For

v3 = v1 * v2
v4 = sin(v3)
v5 = v4 + v1

the overall function is:

y = g_3(g_2(g_1(x_1, x_2)), x_1)

where:

g_1(x_1, x_2) = x_1 x_2
g_2(v_3) = \sin(v_3)
g_3(v_4, x_1) = v_4 + x_1

The graph stores the composition structure explicitly.

Automatic differentiation applies the chain rule over this graph.
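
As a minimal illustration, the local maps can be composed by hand. The names g1, g2, g3 mirror the definitions above; note that x1 feeds both g1 and g3, matching the graph's fan-out:

import math

def g1(x1, x2): return x1 * x2        # v3 = v1 * v2
def g2(v3):     return math.sin(v3)   # v4 = sin(v3)
def g3(v4, x1): return v4 + x1        # v5 = v4 + v1

def f(x1, x2):
    return g3(g2(g1(x1, x2)), x1)

print(f(1.0, 2.0))  # sin(2.0) + 1.0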

Nodes and Edges

Different AD systems represent graphs differently, but the conceptual structure is similar.

Element       Meaning
Node          A computed value
Edge          A dependency relation
Source node   Input variable
Sink node     Output variable
Parent        A node used to compute another node
Child         A node computed from another node

For

v3 = v1 * v2

the graph contains edges:

v_1 \to v_3
v_2 \to v_3

because v_3 depends on both inputs.

Operation Nodes vs Value Nodes

Some systems represent operations explicitly.

Instead of:

v1 ─┐
    ├─> v3
v2 ─┘

they use:

v1 ─┐
    ├─> (*) ─> v3
v2 ─┘

The operation node stores:

Field             Example
Operation type    multiply
Input references  v1, v2
Output reference  v3
Derivative rule   product rule

Operation-centered graphs are common in compiler IRs and tensor frameworks.

Value-centered graphs are common in minimal AD engines.
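
A minimal operation-centered node might look like the sketch below. The field names and the vjp convention (a function from the output adjoint and input values to input adjoints) are illustrative assumptions:

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class OpNode:
    op: str                         # operation type, e.g. "multiply"
    inputs: List[int]               # IDs of the input value nodes
    output: int                     # ID of the output value node
    vjp: Optional[Callable] = None  # derivative rule for reverse mode

# v3 = v1 * v2, with the product rule as its derivative rule:
mul = OpNode(op="multiply", inputs=[1, 2], output=3,
             vjp=lambda g, a, b: (g * b, g * a))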

Forward Traversal

Forward evaluation follows graph direction.

For

v3 = v1 * v2
v4 = sin(v3)

evaluation order is:

v_1, v_2 \to v_3 \to v_4

Forward mode AD propagates tangent information in this direction.

If the tangents (\dot v_1, \dot v_2) are known, then \dot v_3 can be computed, followed by \dot v_4.

The graph structure guarantees all required inputs are available.
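
A forward-mode sweep over this fragment can be sketched by carrying (value, tangent) pairs through each operation in graph order; the names are illustrative:

import math

def forward_mode(x1, x2, dx1, dx2):
    v1, dv1 = x1, dx1
    v2, dv2 = x2, dx2
    v3, dv3 = v1 * v2, dv1 * v2 + v1 * dv2      # product rule
    v4, dv4 = math.sin(v3), math.cos(v3) * dv3  # chain rule
    return v4, dv4

# Seed (dx1, dx2) = (1, 0) to get the tangent with respect to x1:
print(forward_mode(1.0, 2.0, 1.0, 0.0))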

Reverse Traversal

Reverse mode traverses the graph backward.

Starting from:

\bar y = 1

the adjoint propagates from outputs to inputs.

For:

v4 = sin(v3)

reverse propagation is:

\bar v_3 += \bar v_4 \cos(v_3).

Then for:

v3 = v1 * v2

reverse propagation is:

\bar v_1 += \bar v_3 v_2
\bar v_2 += \bar v_3 v_1.

The backward traversal accumulates sensitivity information along reversed edges.
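
The same two operations in a reverse sweep, as a sketch. The forward pass stores intermediates; the backward pass then propagates adjoints (named bar_*) along reversed edges:

import math

def reverse_mode(v1, v2):
    # Forward pass: compute and store intermediate values.
    v3 = v1 * v2
    v4 = math.sin(v3)

    # Backward pass: propagate adjoints from the output to the inputs.
    bar_v4 = 1.0                    # seed
    bar_v3 = bar_v4 * math.cos(v3)  # v4 = sin(v3)
    bar_v1 = bar_v3 * v2            # v3 = v1 * v2
    bar_v2 = bar_v3 * v1
    return v4, (bar_v1, bar_v2)

print(reverse_mode(1.0, 2.0))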

Fan-Out and Gradient Accumulation

A node may feed multiple later computations.

Example:

v1 = x
v2 = sin(v1)
v3 = cos(v1)
v4 = v2 + v3

Dependency graph:

        ┌─> v2 ─┐
v1 ─────┤       ├─> v4
        └─> v3 ─┘

The value v_1 influences both v_2 and v_3.

During reverse mode, \bar v_1 receives contributions from both paths:

\bar v_1 += \bar v_2 \cos(v_1)
\bar v_1 += -\bar v_3 \sin(v_1).

Thus:

\bar v_1 = \bar v_2 \cos(v_1) - \bar v_3 \sin(v_1).

Gradient accumulation is fundamentally accumulation over paths in the graph.
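
A sketch of this accumulation for the example above, with bar_v1 built up via += from both paths:

import math

def grad_fanout(x):
    v1 = x
    v2 = math.sin(v1)
    v3 = math.cos(v1)
    v4 = v2 + v3

    bar_v4 = 1.0
    bar_v2 = bar_v4                   # v4 = v2 + v3
    bar_v3 = bar_v4
    bar_v1 = 0.0
    bar_v1 += bar_v2 * math.cos(v1)   # path through v2 = sin(v1)
    bar_v1 += -bar_v3 * math.sin(v1)  # path through v3 = cos(v1)
    return v4, bar_v1

# d/dx (sin x + cos x) = cos x - sin x:
print(grad_fanout(0.5))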

Shared Subgraphs

Dependency graphs naturally represent shared computation.

For:

v2 = x * x
v3 = sin(v2)
v4 = cos(v2)

the subexpression x * x appears once in the graph.

This prevents redundant work.

Expression trees duplicate repeated subexpressions. Dependency graphs preserve sharing.

This distinction becomes important in large tensor programs where recomputation is expensive.
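
One common way to preserve sharing while building a graph is hash-consing: before creating a node, look it up by (operation, inputs). A minimal sketch, with illustrative names:

_nodes = []   # node table: index is the node ID
_cache = {}   # (op, input IDs) -> existing node ID

def make_node(op, *inputs):
    # Reuse an identical node if one already exists.
    key = (op, inputs)
    if key not in _cache:
        _nodes.append({"op": op, "inputs": inputs})
        _cache[key] = len(_nodes) - 1
    return _cache[key]

x  = make_node("input", "x")
v2 = make_node("mul", x, x)          # x * x
v3 = make_node("sin", v2)
v4 = make_node("cos", v2)            # shares the existing x * x node
assert make_node("mul", x, x) == v2  # still a single node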

Tensor Dependency Graphs

In tensor systems, nodes may represent entire tensor operations.

Example:

v1 = matmul(x, w1)
v2 = relu(v1)
v3 = matmul(v2, w2)
v4 = softmax(v3)
y  = cross_entropy(v4, target)

The graph represents tensor dependencies, not scalar dependencies.

Each node may contain:

Field       Example
Shape       (1024, 4096)
Dtype       float32
Device      GPU 0
Layout      row-major
Operation   matmul
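
A node carrying this metadata might be sketched as follows; the field names mirror the table and the values are illustrative:

from dataclasses import dataclass
from typing import Tuple

@dataclass
class TensorNode:
    op: str                 # e.g. "matmul"
    shape: Tuple[int, ...]  # e.g. (1024, 4096)
    dtype: str              # e.g. "float32"
    device: str             # e.g. "gpu:0"
    layout: str             # e.g. "row-major"

v1 = TensorNode(op="matmul", shape=(1024, 4096),
                dtype="float32", device="gpu:0", layout="row-major")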

Reverse mode traverses the same dependency graph but applies tensor derivative rules.

Dynamic Graphs

Some systems construct dependency graphs dynamically during execution.

Example in a dynamic language:

if x > 0:
    y = x * x
else:
    y = sin(x)

The executed branch determines the graph.

If x > 0:

x ─> (*) ─> y

If x \le 0:

x ─> sin ─> y

The graph exists only for the executed computation.

Frameworks such as entity[“software”,“PyTorch”,“deep learning framework”] commonly use this approach.

Static Graphs

Other systems build a graph representation before execution.

Example:

Input -> MatMul -> ReLU -> MatMul -> Output

The graph becomes a compilable program representation.

Advantages include:

Benefit              Reason
Optimization         Whole-graph analysis
Fusion               Combine operations
Scheduling           Global execution planning
Memory reuse         Lifetime analysis
Device partitioning  Multi-device placement

Frameworks such as entity[“software”,“XLA”,“machine learning compiler”] and entity[“software”,“TensorFlow”,“machine learning framework”] heavily use static graph compilation.

Graph Representation in Memory

A minimal graph representation might store:

type Node struct {
    ID     int   // node identifier
    Op     Op    // operation type
    Inputs []int // IDs of nodes this node depends on
    Users  []int // IDs of nodes that consume this node
}

Where:

Field    Meaning
ID       Node identifier
Op       Operation type
Inputs   Dependencies
Users    Consumers of this node

A reverse traversal often iterates over nodes in reverse topological order.

Reverse Topological Order

Suppose the graph order is:

v_1, v_2, v_3, v_4, v_5

Then reverse mode processes:

v_5, v_4, v_3, v_2, v_1.

This guarantees that when a node is processed, all downstream adjoint contributions have already been accumulated.

Reverse topological traversal is one of the central execution rules of reverse-mode AD.

Graph Transformations

Modern AD systems frequently transform dependency graphs.

Typical transformations include:

Transformation                     Purpose
Constant folding                   Precompute constants
Dead node elimination              Remove unused computations
Fusion                             Combine kernels
Common subexpression elimination   Reuse repeated computation
Rematerialization                  Recompute instead of storing
Shape propagation                  Infer tensor shapes
Differentiation transform          Generate backward graph

In compiler-based AD systems, differentiation itself is often implemented as a graph rewrite.
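
As one concrete example, dead node elimination can be a reachability pass from the outputs over the input edges. A minimal sketch, with an illustrative inputs mapping:

def eliminate_dead_nodes(inputs, outputs):
    # Keep only nodes reachable from the outputs via input edges.
    live, stack = set(), list(outputs)
    while stack:
        node = stack.pop()
        if node not in live:
            live.add(node)
            stack.extend(inputs.get(node, ()))
    return live

# Node 5 computes something no output depends on:
inputs = {3: [1, 2], 4: [3], 5: [1]}
print(eliminate_dead_nodes(inputs, outputs=[4]))  # {1, 2, 3, 4}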

Graphs and the Chain Rule

The dependency graph is a concrete representation of the chain rule.

For a path:

x \to v_1 \to v_2 \to y

the contribution to the derivative is:

\frac{\partial y}{\partial v_2} \frac{\partial v_2}{\partial v_1} \frac{\partial v_1}{\partial x}.

If multiple paths connect x to y, the total derivative is the sum over all paths.

This explains why reverse mode accumulates contributions at merge points.

The graph structure directly determines derivative propagation.

Cycles and Recurrence

Straight-line dependency graphs are acyclic, but real programs may contain loops.

Example:

h_t = tanh(W h_{t-1} + U x_t)

This creates a recurrence relation.

AD systems usually handle recurrence by unrolling execution into a larger acyclic graph:

h0 -> h1 -> h2 -> h3 -> ...

The resulting unrolled graph is again a DAG.

Backpropagation through time is reverse-mode AD on this expanded graph.
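
A sketch of unrolling with scalar stand-ins for W, U, and the hidden state; each step creates a fresh value, so the recorded graph stays acyclic:

import math

def unroll(h0, xs, w, u):
    # h_t = tanh(w * h_{t-1} + u * x_t), one new node per step
    hs = [h0]
    for x_t in xs:
        hs.append(math.tanh(w * hs[-1] + u * x_t))
    return hs  # h0 -> h1 -> h2 -> ... : an acyclic chain

print(unroll(0.0, [1.0, 2.0, 3.0], w=0.5, u=0.1))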

Core Idea

A dependency graph makes computation structure explicit. Nodes represent computed values. Edges represent data dependencies. Forward mode moves with graph direction. Reverse mode moves against it.

The graph is the operational form of the chain rule. Automatic differentiation works because derivative propagation follows the same dependency structure as the original computation.