A dynamic graph is a computation graph built while the program runs. Its structure depends on ordinary runtime values: branches, loop counts, recursive calls, tensor shapes, data structures, and sometimes input data itself.
In a static graph system, the graph is mostly known before execution. In a dynamic graph system, the graph is the execution.
This matters for automatic differentiation because reverse mode needs to know which operations occurred. A dynamic graph records the actual operations from the primal run, then traverses them backward.
```
x = input()
if x > 0:
    y = x * x
else:
    y = 3 * x
z = sin(y)
```

For `x > 0`, the graph is:

```
x -> square -> sin -> z
```

For `x <= 0`, the graph is:

```
x -> multiply_by_3 -> sin -> z
```

The differentiated graph changes with the executed path.
Define-by-Run Semantics
Dynamic graph AD is often called define-by-run. The program defines the graph by running.
Each primitive operation creates a node:
```
a = x * y
b = sin(a)
c = b + 1
```

The runtime records:

```
mul(x, y) -> a
sin(a) -> b
add(b, 1) -> c
```

Each node stores enough information for the backward pass:
| Field | Purpose |
|---|---|
| Operation | Which local derivative rule to use |
| Inputs | Where adjoints must be sent |
| Output | Which value receives an adjoint |
| Saved values | Values needed by the local backward rule |
| Shape and dtype | Metadata for gradient allocation and checks |
For `sin(a)`, the backward rule needs `a`, because the local derivative is `cos(a)`:

```
a_bar += cos(a) * b_bar
```

So the node usually stores either `a` or enough information to recompute it.
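The node fields in the table above can be sketched as a small record type. This is an illustrative layout, not any particular framework's API; the names `Node`, `saved`, and `record_sin` are hypothetical:

```python
import math
from dataclasses import dataclass, field

# Illustrative sketch of a tape node; names are hypothetical, not any
# particular framework's API.
@dataclass
class Node:
    op: str                                     # which local rule to apply
    inputs: tuple = ()                          # where adjoints are sent
    saved: dict = field(default_factory=dict)   # values kept for backward

def record_sin(a_node, a_value):
    # sin saves its input because its backward rule needs cos(a)
    return Node(op="sin", inputs=(a_node,), saved={"a": a_value})

leaf = Node(op="leaf")
n = record_sin(leaf, 0.5)
local_grad = math.cos(n.saved["a"])   # backward reads the saved value
```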
Reverse Mode on a Dynamic Graph
Reverse mode proceeds in two phases.
First, the forward pass executes the program and records a graph or tape.
Second, the backward pass starts from an output seed and walks the recorded graph in reverse topological order.
Example:
```
a = x * y
b = sin(a)
c = b + x
```

Forward graph:

```
x ----\
       mul -> a -> sin -> b --\
y ----/                        add -> c
x ----------------------------/
```

Backward propagation:

```
c_bar = 1
b_bar += c_bar
x_bar += c_bar
a_bar += cos(a) * b_bar
x_bar += y * a_bar
y_bar += x * a_bar
```

The important point is accumulation. Since x contributes to c through two paths, its adjoint receives two contributions.
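The adjoint updates for this example can be replayed with concrete numbers. A minimal sketch, with arbitrarily chosen input values, checked against the closed-form derivative of `sin(x*y) + x`:

```python
import math

# Replay the backward propagation for c = sin(x*y) + x by hand.
x, y = 0.7, 1.3

# forward pass (records the values backward will need)
a = x * y
b = math.sin(a)
c = b + x

# backward pass, seeded at the output
c_bar = 1.0
b_bar = a_bar = x_bar = y_bar = 0.0
b_bar += c_bar                  # from c = b + x
x_bar += c_bar                  # first contribution to x
a_bar += math.cos(a) * b_bar    # from b = sin(a)
x_bar += y * a_bar              # second contribution to x, from a = x * y
y_bar += x * a_bar

# closed form: d/dx [sin(x*y) + x] = y*cos(x*y) + 1
assert abs(x_bar - (y * math.cos(x * y) + 1)) < 1e-12
assert abs(y_bar - x * math.cos(x * y)) < 1e-12
```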
Dynamic Control Flow
Dynamic graphs naturally support ordinary host-language control flow.
```
def f(x, n):
    y = x
    for i in range(n):
        if y > 1:
            y = y / 2
        else:
            y = y * y + 1
    return y
```

The graph depends on:

- `n`
- the intermediate values of `y`
- the branch taken at each iteration

A dynamic AD system records exactly the operations performed.

For one input, the graph may contain:

```
square, add, divide, divide
```

For another input, it may contain:

```
divide, square, add, divide
```

The backward pass follows the recorded sequence. It does not need to inspect branches again, except to replay saved graph structure.
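Recording only what actually executed can be made concrete with a traced version of `f`. This is a hedged sketch: the `tape` list of op names stands in for a real tape of backward nodes:

```python
# Record op names while running f, showing that the tape depends
# on the input. The `tape` list is illustrative.
def f_traced(x, n, tape):
    y = x
    for _ in range(n):
        if y > 1:
            tape.append("divide")
            y = y / 2
        else:
            tape.append("square")
            tape.append("add")
            y = y * y + 1
    return y

t1, t2 = [], []
f_traced(0.5, 2, t1)   # starts below 1: square/add first, then divide
f_traced(4.0, 2, t2)   # starts above 1: divide both times
```

The two tapes differ, so a backward pass built from one is not valid for the other.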
Comparison with Static Graphs
Static graphs represent computation before execution. Dynamic graphs represent computation during execution.
| Property | Static graph | Dynamic graph |
|---|---|---|
| Graph construction | Before execution | During execution |
| Control flow | Often explicit IR nodes | Native language control flow |
| Debugging | Less direct | Usually easier |
| Optimization | Strong global optimization | Harder global optimization |
| Compilation | Natural | Requires tracing, staging, or capture |
| Dynamic shapes | More constrained | More natural |
| Reverse-mode tape | Graph itself | Runtime tape or node graph |
Dynamic graphs are especially useful for models with irregular control flow, such as tree networks, variable-length loops, adaptive solvers, and programs over symbolic structures.
Static graphs are often better for whole-program optimization, deployment, batching, and device execution.
Modern systems often combine both designs. They run dynamically by default, then trace or compile hot regions.
Graph Lifetime
A dynamic graph usually has a limited lifetime.
For a single training step:
```
loss = model(batch)
loss.backward()
optimizer.step()
```

the graph is built during `model(batch)`, consumed by `backward()`, then released.
This avoids unbounded memory growth.
If the user stores graph-attached tensors across iterations, memory may grow unexpectedly:
```
losses = []
for batch in data:
    loss = model(batch)
    losses.append(loss)
```

If each loss retains its graph, the program retains all intermediate states from all batches.
A common remedy is to detach values used only for logging:
```
losses.append(detach(loss))
```

The exact API varies by system, but the principle is the same: values that should not carry derivative history must be separated from the graph.
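The retention effect can be made visible with a toy value type. This is a hedged sketch: `Value`, `detach`, and `graph_size` are illustrative names, and parent references stand in for the graph edges a real system keeps alive:

```python
# Sketch of graph retention; `Value` and `detach` are illustrative.
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.parents = parents   # edges that keep the graph alive

def mul(a, b):
    return Value(a.data * b.data, parents=(a, b))

def detach(v):
    # same primal value, no parents: drops the derivative history
    return Value(v.data)

def graph_size(v, seen=None):
    seen = set() if seen is None else seen
    if id(v) in seen:
        return 0
    seen.add(id(v))
    return 1 + sum(graph_size(p, seen) for p in v.parents)

x = Value(2.0)
loss = mul(mul(x, x), x)     # builds a small graph
kept = loss                  # retains the whole graph
logged = detach(loss)        # retains only the value

assert graph_size(kept) > graph_size(logged)
```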
Saved Intermediates
A backward rule often needs forward-pass values.
For multiplication, `z = x * y`, the local adjoints are:

```
x_bar += y * z_bar
y_bar += x * z_bar
```

So the backward node needs both x and y.
For `relu(x)`, the backward rule needs to know whether `x > 0`. It may store the input, the output, or a compact mask.

```
y = relu(x)
```

Backward:

```
x_bar += where(x > 0, y_bar, 0)
```

Saved intermediates are a central memory cost in dynamic graph AD. Large tensors saved for backward may dominate memory use.
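Saving a compact mask instead of the full input is easy to sketch. Pure-Python lists stand in for tensors here; the function names are illustrative:

```python
# relu backward using a saved boolean mask instead of the full input.
def relu_forward(x):
    y = [max(v, 0.0) for v in x]
    mask = [v > 0 for v in x]     # compact saved intermediate
    return y, mask

def relu_backward(y_bar, mask):
    # gradient passes only where the input was positive
    return [g if m else 0.0 for g, m in zip(y_bar, mask)]

y, mask = relu_forward([-1.0, 2.0, 0.0, 3.0])
x_bar = relu_backward([1.0, 1.0, 1.0, 1.0], mask)
assert x_bar == [0.0, 1.0, 0.0, 1.0]
```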
Detaching and Stopping Gradients
Dynamic systems usually provide a way to stop graph construction.
Conceptually:
```
y = detach(x)
```

means:

```
primal value of y = primal value of x
gradient edge from y to x = removed
```

So if

```
z = detach(x) * x
```

then `z = x * x` as a primal function, but AD treats the first factor as constant. The derivative returned is

```
x_bar = x
```

rather than

```
x_bar = 2 * x
```

This is intentional. Detach is a graph operation, not an ordinary mathematical identity.
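This behavior can be reproduced with a minimal value type. A hedged sketch, not any framework's API; the traversal is valid here only because the example graph is a tree:

```python
# Minimal Value with mul and detach, showing that d/dx [detach(x) * x]
# evaluates to x rather than 2*x.
class Value:
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._parents = ()

def mul(a, b):
    out = Value(a.data * b.data)
    out._parents = (a, b)
    def _backward():
        a.grad += b.data * out.grad
        b.grad += a.data * out.grad
    out._backward = _backward
    return out

def detach(v):
    return Value(v.data)   # same primal, no gradient edge

def backward(out):
    out.grad = 1.0
    # parents-after-children order; sufficient for this tree-shaped graph
    stack, order, seen = [out], [], set()
    while stack:
        v = stack.pop()
        if id(v) not in seen:
            seen.add(id(v))
            order.append(v)
            stack.extend(v._parents)
    for v in order:
        v._backward()

x = Value(3.0)
z = mul(detach(x), x)
backward(z)
assert abs(x.grad - 3.0) < 1e-12   # x, not 2*x = 6
```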
Stop-gradient is useful for targets, baselines, cached features, truncated recurrent models, and custom optimization procedures. It should be used carefully because it changes the differentiated computation.
In-Place Mutation
Dynamic graphs interact poorly with in-place mutation when a backward rule needs the old value.
Example:
```
y = x * x
x += 1
loss = y.sum()
```

The backward rule for `x * x` needs the original value of x. If mutation overwrites it, the backward pass may use the wrong value.
Systems handle this in several ways:
| Method | Description |
|---|---|
| Version counters | Detect mutation after a value was saved |
| Copy-on-write | Preserve old values when needed |
| Functionalization | Rewrite mutation into immutable updates |
| Restrictions | Reject unsafe in-place operations |
| Custom rules | Let specific mutations define safe adjoints |
The core invariant is simple: the backward pass must see the same values required by the derivative rules of the forward pass.
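The version-counter method from the table can be sketched directly. Class and method names here are illustrative: saving a value also records its version, and the backward pass checks the version before unpacking:

```python
# Sketch of version counters: mutation after save is detected at unpack.
class Tensor:
    def __init__(self, data):
        self.data = data
        self._version = 0
    def add_(self, v):          # in-place update bumps the version
        self.data += v
        self._version += 1

class Saved:
    def __init__(self, t):
        self.t = t
        self.version = t._version
    def unpack(self):
        if self.t._version != self.version:
            raise RuntimeError("tensor mutated after being saved for backward")
        return self.t.data

x = Tensor(2.0)
saved = Saved(x)     # e.g. saved by y = x * x for its backward rule
x.add_(1.0)          # in-place mutation invalidates the saved value
try:
    saved.unpack()
    ok = False
except RuntimeError:
    ok = True
assert ok
```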
Dynamic Shapes
Dynamic graphs can naturally represent computations whose tensor shapes depend on data.
```
k = count_positive(x)
y = top_k(x, k)
```

The resulting graph shape may differ between inputs.
This flexibility is useful, but differentiation becomes more subtle. Shape-changing operations often involve discrete decisions. The derivative usually flows through selected values, not through the selection count or index computation.
For example, top_k passes gradients to selected elements. It normally gives no useful gradient with respect to the integer indices that determined the selection.
Dynamic shape support therefore does not make discrete structure differentiable. It only allows the executed structure to be recorded.
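The routing of gradients through selected values, but not through the selection itself, can be sketched for `top_k`. Pure-Python lists stand in for tensors; the saved indices carry no gradient of their own:

```python
# top_k passes gradients only to the selected elements; the integer
# indices are saved for backward but receive no gradient.
def top_k(x, k):
    idx = sorted(range(len(x)), key=lambda i: x[i], reverse=True)[:k]
    return [x[i] for i in idx], idx   # idx is saved for backward

def top_k_backward(y_bar, idx, n):
    x_bar = [0.0] * n
    for g, i in zip(y_bar, idx):
        x_bar[i] += g                 # scatter adjoints to selected slots
    return x_bar

x = [0.3, 2.0, -1.0, 1.5]
y, idx = top_k(x, 2)
x_bar = top_k_backward([1.0, 1.0], idx, len(x))
assert y == [2.0, 1.5]
assert x_bar == [0.0, 1.0, 0.0, 1.0]
```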
Reusing Dynamic Graphs
A dynamic graph is usually tied to the values from one execution. Reusing it for another input is often invalid because:
- branch choices may differ
- shapes may differ
- saved intermediates differ
- aliasing and mutation histories differ
- random choices may differ
Compiled systems solve this by capturing a graph template rather than a single graph instance. The template may include guards:
```
shape(x) == (32, 128)
dtype(x) == float32
branch pattern unchanged
```

If guards hold, the compiled graph can be reused. If guards fail, the system retraces or falls back to interpretation.
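The guard-and-retrace cycle can be sketched with a small cache. The guard representation here is illustrative; real systems track many more properties (strides, device, branch outcomes):

```python
# Sketch of guarded template reuse: reuse on guard success, retrace on
# guard failure. The guard dict is illustrative.
def make_guards(shape, dtype):
    return {"shape": shape, "dtype": dtype}

def guards_hold(guards, x_shape, x_dtype):
    return guards["shape"] == x_shape and guards["dtype"] == x_dtype

cache = {}

def run(x_shape, x_dtype):
    key = "f"
    if key in cache and guards_hold(cache[key], x_shape, x_dtype):
        return "reuse compiled graph"
    cache[key] = make_guards(x_shape, x_dtype)   # retrace and cache
    return "trace"

assert run((32, 128), "float32") == "trace"
assert run((32, 128), "float32") == "reuse compiled graph"
assert run((64, 128), "float32") == "trace"      # guard failure retraces
```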
This is the basic mechanism behind many dynamic-to-static compilation systems.
Randomness and Dynamic Graphs
Random operations become part of the executed computation.
```
mask = random_bernoulli(p)
y = x * mask
```

For a fixed sampled mask, AD differentiates through the multiplication:

```
x_bar += mask * y_bar
```

But the sampling step itself is generally non-differentiable. Gradients do not flow through the random draw unless the system uses a differentiable reparameterization or a score-function estimator.
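The fixed-mask rule above can be checked with a concrete sample. A hedged sketch: the mask is drawn once during the forward pass, and backward treats it as a constant:

```python
import random

# With the sampled mask held fixed, the gradient of y = x * mask with
# respect to x is just the mask; no gradient flows into the draw.
random.seed(0)
p = 0.5
x = [1.0, 2.0, 3.0, 4.0]
mask = [1.0 if random.random() < p else 0.0 for _ in x]  # sampled once

y = [xi * mi for xi, mi in zip(x, mask)]

# backward for the multiplication, seeded with y_bar = 1 everywhere
y_bar = [1.0] * len(x)
x_bar = [mi * gi for mi, gi in zip(mask, y_bar)]

assert x_bar == mask   # gradient flows through the product, not the draw
```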
This distinction matters in dropout, stochastic computation graphs, reinforcement learning, variational inference, and randomized algorithms.
Debugging Dynamic Graphs
Dynamic graphs are usually easier to debug than static graphs because the differentiated program follows ordinary runtime behavior.
Useful checks include:
```
print(value)
print(value.requires_grad)
print(value.grad_fn)
print(value.shape)
```

The programmer can inspect actual values, branch decisions, and intermediate tensors.
However, bugs may be input-specific. A model may differentiate correctly for one input and fail for another because the second input triggers a different path.
Dynamic control flow improves expressiveness, but it expands the set of execution paths that must be tested.
Correctness Rule
For a dynamic graph system, the derivative is computed from the graph produced by the primal execution.
Forward pass:
```
execute ordinary program
record operations that occurred
```

Backward pass:

```
traverse recorded graph backward
apply local adjoint rules
accumulate gradients into shared inputs
```

Dynamic graphs make AD operationally close to normal programming. Their cost is that graph structure, memory usage, and differentiability may vary from one execution to the next.
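The whole correctness rule fits in a minimal define-by-run tape. A hedged sketch, not any framework's implementation: every operation appends a backward closure to a shared tape, and `backward` replays the tape in reverse, which is a reverse topological order of the recorded graph:

```python
import math

# Minimal define-by-run tape: forward ops append closures, backward
# replays them in reverse and accumulates into shared inputs.
class Var:
    def __init__(self, data, tape=None):
        self.data = data
        self.grad = 0.0
        self.tape = tape if tape is not None else []

def mul(a, b):
    out = Var(a.data * b.data, a.tape)
    def bwd():
        a.grad += b.data * out.grad
        b.grad += a.data * out.grad
    out.tape.append(bwd)
    return out

def sin(a):
    out = Var(math.sin(a.data), a.tape)
    def bwd():
        a.grad += math.cos(a.data) * out.grad
    out.tape.append(bwd)
    return out

def add(a, b):
    out = Var(a.data + b.data, a.tape)
    def bwd():
        a.grad += out.grad
        b.grad += out.grad
    out.tape.append(bwd)
    return out

def backward(out):
    out.grad = 1.0
    for bwd in reversed(out.tape):   # reverse recording order
        bwd()

tape = []
x, y = Var(0.7, tape), Var(1.3, tape)
c = add(sin(mul(x, y)), x)           # c = sin(x*y) + x
backward(c)
# x contributes through two paths: via the product and via the add
assert abs(x.grad - (y.data * math.cos(x.data * y.data) + 1)) < 1e-12
assert abs(y.grad - x.data * math.cos(x.data * y.data)) < 1e-12
```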