Dynamic Graphs

A dynamic graph is a computation graph built while the program runs. Its structure depends on ordinary runtime values: branches, loop counts, recursive calls, tensor shapes, data structures, and sometimes input data itself.

In a static graph system, the graph is mostly known before execution. In a dynamic graph system, the graph is the execution.

This matters for automatic differentiation because reverse mode needs to know which operations occurred. A dynamic graph records the actual operations from the primal run, then traverses them backward.

x = input()

if x > 0:
    y = x * x
else:
    y = 3 * x

z = sin(y)

For x > 0, the graph is:

x -> square -> sin -> z

For x ≤ 0, the graph is:

x -> multiply_by_3 -> sin -> z

The differentiated graph changes with the executed path.

Define-by-Run Semantics

Dynamic graph AD is often called define-by-run. The program defines the graph by running.

Each primitive operation creates a node:

a = x * y
b = sin(a)
c = b + 1

The runtime records:

mul(x, y) -> a
sin(a)    -> b
add(b, 1) -> c

Each node stores enough information for the backward pass:

Field            Purpose
Operation        Which local derivative rule to use
Inputs           Where adjoints must be sent
Output           Which value receives an adjoint
Saved values     Values needed by the local backward rule
Shape and dtype  Metadata for gradient allocation and checks

For sin(a), the backward rule needs a, because

\frac{d}{da}\sin(a) = \cos(a).

So the node usually stores either a or enough information to recompute it.
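As a rough illustration, such a node record could be written in plain Python. The names Node, record_sin, and backward_rule are made up for this sketch; they are not the API of any particular framework.

from dataclasses import dataclass, field
from typing import Any, Callable
import math

@dataclass
class Node:
    op: str                       # which local derivative rule to apply
    inputs: list                  # upstream nodes that will receive adjoints
    output: Any                   # primal value produced by this node
    saved: dict = field(default_factory=dict)   # values kept for the backward pass
    backward_rule: Callable = None               # maps output adjoint -> input adjoints

def record_sin(a_node):
    # sin saves its input, because d/da sin(a) = cos(a)
    value = math.sin(a_node.output)
    return Node(
        op="sin",
        inputs=[a_node],
        output=value,
        saved={"a": a_node.output},
        backward_rule=lambda out_bar, saved: [math.cos(saved["a"]) * out_bar],
    )

leaf = Node(op="leaf", inputs=[], output=0.5)
b = record_sin(leaf)
print(b.backward_rule(1.0, b.saved))   # [cos(0.5)] ≈ [0.8776]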

Reverse Mode on a Dynamic Graph

Reverse mode proceeds in two phases.

First, the forward pass executes the program and records a graph or tape.

Second, the backward pass starts from an output seed and walks the recorded graph in reverse topological order.

Example:

a = x * y
b = sin(a)
c = b + x

Forward graph:

x ----\
       mul -> a -> sin -> b --\
y ----/                        add -> c
x ----------------------------/

Backward propagation:

c_bar = 1

b_bar += c_bar
x_bar += c_bar

a_bar += cos(a) * b_bar

x_bar += y * a_bar
y_bar += x * a_bar

The important point is accumulation. Since x contributes to c through two paths, its adjoint receives two contributions.
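The same two-phase process, including the double contribution to x_bar, can be written out directly in plain Python (arbitrary example values, no library assumed):

import math

x, y = 2.0, 3.0

# forward pass, recording intermediates
a = x * y          # mul
b = math.sin(a)    # sin
c = b + x          # add

# backward pass in reverse recorded order
c_bar = 1.0
x_bar = y_bar = a_bar = b_bar = 0.0

b_bar += c_bar                  # add: c = b + x
x_bar += c_bar                  # first contribution to x_bar
a_bar += math.cos(a) * b_bar    # sin: b = sin(a)
x_bar += y * a_bar              # mul: a = x * y, second contribution to x_bar
y_bar += x * a_bar

# x_bar now holds dc/dx = 1 + y * cos(x * y)
print(x_bar, y_bar)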

Dynamic Control Flow

Dynamic graphs naturally support ordinary host-language control flow.

def f(x, n):
    y = x

    for i in range(n):
        if y > 1:
            y = y / 2
        else:
            y = y * y + 1

    return y

The graph depends on:

  • n
  • the intermediate values of y
  • the branch taken at each iteration

A dynamic AD system records exactly the operations performed.

For one input, the graph may contain:

square, add, divide, divide

For another input, it may contain:

divide, square, add, divide

The backward pass follows the recorded sequence. It does not need to re-evaluate the branches; it simply replays the graph structure that was recorded.
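A small sketch of this recording idea, using a plain Python list as a stand-in for the tape (the names f and tape are just for illustration):

def f(x, n, tape):
    y = x
    for _ in range(n):
        if y > 1:
            tape.append("divide")
            y = y / 2
        else:
            tape.append("square")
            tape.append("add")
            y = y * y + 1
    return y

tape_a = []
f(0.5, 2, tape_a)
print(tape_a)   # ['square', 'add', 'divide']

tape_b = []
f(4.0, 2, tape_b)
print(tape_b)   # ['divide', 'divide']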

Comparison with Static Graphs

Static graphs represent computation before execution. Dynamic graphs represent computation during execution.

Property            Static graph                 Dynamic graph
Graph construction  Before execution             During execution
Control flow        Often explicit IR nodes      Native language control flow
Debugging           Less direct                  Usually easier
Optimization        Strong global optimization   Harder global optimization
Compilation         Natural                      Requires tracing, staging, or capture
Dynamic shapes      More constrained             More natural
Reverse-mode tape   Graph itself                 Runtime tape or node graph

Dynamic graphs are especially useful for models with irregular control flow, such as tree networks, variable-length loops, adaptive solvers, and programs over symbolic structures.

Static graphs are often better for whole-program optimization, deployment, batching, and device execution.

Modern systems often combine both designs. They run dynamically by default, then trace or compile hot regions.

Graph Lifetime

A dynamic graph usually has a limited lifetime.

For a single training step:

loss = model(batch)
loss.backward()
optimizer.step()

the graph is built during model(batch), consumed by backward(), then released.

This avoids unbounded memory growth.

If the user stores graph-attached tensors across iterations, memory may grow unexpectedly:

losses = []

for batch in data:
    loss = model(batch)
    losses.append(loss)

If each loss retains its graph, the program retains all intermediate states from all batches.

A common remedy is to detach values used only for logging:

losses.append(detach(loss))

The exact API varies by system, but the principle is the same: values that should not carry derivative history must be separated from the graph.
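For example, in a PyTorch-flavored sketch (a toy model stands in for model(batch); detach() and item() are the relevant PyTorch calls, but other frameworks spell this differently):

import torch

w = torch.ones(3, requires_grad=True)
data = [torch.randn(3) for _ in range(4)]
losses = []

for batch in data:
    loss = ((w * batch).sum()) ** 2      # toy stand-in for model(batch)
    loss.backward()
    with torch.no_grad():                # manual SGD step, just for the sketch
        w -= 0.1 * w.grad
        w.grad.zero_()
    losses.append(loss.detach())         # or loss.item() for a plain float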

Saved Intermediates

A backward rule often needs forward-pass values.

For multiplication,

z = xy

the local adjoints are:

\bar{x} \mathrel{+}= y\,\bar{z}, \qquad \bar{y} \mathrel{+}= x\,\bar{z}.

So the backward node needs both x and y.

For relu(x), the backward rule needs to know whether x > 0. It may store the input, the output, or a compact mask.

y = relu(x)

Backward:

x_bar += where(x > 0, y_bar, 0)

Saved intermediates are a central memory cost in dynamic graph AD. Large tensors saved for backward may dominate memory use.
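A minimal plain-Python sketch of the mask option (relu_forward and relu_backward are illustrative names, not a real API):

def relu_forward(x):
    # save a boolean mask instead of the full input
    mask = [v > 0 for v in x]
    y = [v if m else 0.0 for v, m in zip(x, mask)]
    return y, mask

def relu_backward(y_bar, mask):
    # gradient passes only where the input was positive
    return [g if m else 0.0 for g, m in zip(y_bar, mask)]

y, mask = relu_forward([2.0, -1.0, 0.5])
print(relu_backward([1.0, 1.0, 1.0], mask))   # [1.0, 0.0, 1.0]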

Detaching and Stopping Gradients

Dynamic systems usually provide a way to stop graph construction.

Conceptually:

y = detach(x)

means:

primal value of y = primal value of x
gradient edge from y to x = removed

So if

z = detach(x) * x

then

z = x^2

as a primal function, but AD treats the first factor as constant. The derivative returned is

\frac{dz}{dx} = x

rather than

2x.

This is intentional. Detach is a graph operation, not an ordinary mathematical identity.
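This behavior is easy to check in a PyTorch-style API, where detach is spelled x.detach():

import torch

x = torch.tensor(3.0, requires_grad=True)
z = x.detach() * x        # primal value is x**2 = 9
z.backward()
print(z.item())           # 9.0
print(x.grad.item())      # 3.0, i.e. x, not 2x = 6.0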

Stop-gradient is useful for targets, baselines, cached features, truncated recurrent models, and custom optimization procedures. It should be used carefully because it changes the differentiated computation.

In-Place Mutation

Dynamic graphs interact poorly with in-place mutation when a backward rule needs the old value.

Example:

y = x * x
x += 1
loss = y.sum()

The backward rule for x * x needs the original value of x. If mutation overwrites it, the backward pass may use the wrong value.

Systems handle this in several ways:

Method             Description
Version counters   Detect mutation after a value was saved
Copy-on-write      Preserve old values when needed
Functionalization  Rewrite mutation into immutable updates
Restrictions       Reject unsafe in-place operations
Custom rules       Let specific mutations define safe adjoints

The core invariant is simple: the backward pass must see the same values required by the derivative rules of the forward pass.
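As a toy illustration of the version-counter approach (every name here is hypothetical):

class Tracked:
    # a value plus a counter that is bumped on every in-place change
    def __init__(self, data):
        self.data = data
        self.version = 0

    def add_(self, other):        # in-place update
        self.data += other
        self.version += 1

def save_for_backward(t):
    return (t, t.version)

def unpack(saved):
    t, version_at_save = saved
    if t.version != version_at_save:
        raise RuntimeError("a value saved for backward was modified in place")
    return t.data

x = Tracked(2.0)
saved = save_for_backward(x)      # saved by the backward rule of x * x
x.add_(1.0)                       # mutation after saving
try:
    unpack(saved)
except RuntimeError as err:
    print(err)                    # the stale value is detected, not silently used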

Dynamic Shapes

Dynamic graphs can naturally represent computations whose tensor shapes depend on data.

k = count_positive(x)
y = top_k(x, k)

The resulting graph, and the shapes of the tensors in it, may differ between inputs.

This flexibility is useful, but differentiation becomes more subtle. Shape-changing operations often involve discrete decisions. The derivative usually flows through selected values, not through the selection count or index computation.

For example, top_k passes gradients to selected elements. It normally gives no useful gradient with respect to the integer indices that determined the selection.
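For instance, in a PyTorch-flavored sketch (torch.topk takes an ordinary Python integer k):

import torch

x = torch.tensor([0.1, 3.0, -2.0, 5.0], requires_grad=True)
values, indices = torch.topk(x, k=2)
values.sum().backward()
print(indices)    # tensor([3, 1])
print(x.grad)     # tensor([0., 1., 0., 1.]): gradient only at the selected elements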

Dynamic shape support therefore does not make discrete structure differentiable. It only allows the executed structure to be recorded.

Reusing Dynamic Graphs

A dynamic graph is usually tied to the values from one execution. Reusing it for another input is often invalid because:

  • branch choices may differ
  • shapes may differ
  • saved intermediates differ
  • aliasing and mutation histories differ
  • random choices may differ

Compiled systems solve this by capturing a graph template rather than a single graph instance. The template may include guards:

shape(x) == (32, 128)
dtype(x) == float32
branch pattern unchanged

If guards hold, the compiled graph can be reused. If guards fail, the system retraces or falls back to interpretation.

This is the basic mechanism behind many dynamic-to-static compilation systems.
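A toy sketch of this guard-and-cache mechanism, with a fake tracer and a shape-only guard (every name here is hypothetical):

def signature(x):
    # toy guard: only the "shape" (length) of a plain list
    return (len(x),)

def trace_and_compile(fn, example):
    # stand-in for a real tracer/compiler: here it just wraps fn
    print("retracing for shape", signature(example))
    return fn

cache = {}   # guard signature -> captured graph

def run(fn, x):
    key = signature(x)
    if key not in cache:          # first call, or a guard failed
        cache[key] = trace_and_compile(fn, x)
    return cache[key](x)          # guards hold: reuse the captured graph

square = lambda xs: [v * v for v in xs]
run(square, [1.0, 2.0])          # retraces for shape (2,)
run(square, [3.0, 4.0])          # reuses the cached version
run(square, [1.0, 2.0, 3.0])     # different shape, so retrace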

Randomness and Dynamic Graphs

Random operations become part of the executed computation.

mask = random_bernoulli(p)
y = x * mask

For a fixed sampled mask, AD differentiates through the multiplication:

\bar{x} = \text{mask} \cdot \bar{y}.

But the sampling step itself is generally non-differentiable. Gradients do not flow through the random draw unless the system uses a differentiable reparameterization or a score-function estimator.
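In a PyTorch-flavored sketch of the dropout-like case above, the sampled mask is just a constant tensor as far as AD is concerned:

import torch

p = 0.5
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
mask = torch.bernoulli(torch.full_like(x, p))   # sampled once; carries no gradient
y = x * mask
y.sum().backward()
print(mask)      # e.g. tensor([1., 0., 1.])
print(x.grad)    # equal to the sampled mask: gradient flows only through the product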

This distinction matters in dropout, stochastic computation graphs, reinforcement learning, variational inference, and randomized algorithms.

Debugging Dynamic Graphs

Dynamic graphs are usually easier to debug than static graphs because the differentiated program follows ordinary runtime behavior.

Useful checks include:

print(value)
print(value.requires_grad)
print(value.grad_fn)
print(value.shape)

The programmer can inspect actual values, branch decisions, and intermediate tensors.

However, bugs may be input-specific. A model may differentiate correctly for one input and fail for another because the second input triggers a different path.

Dynamic control flow improves expressiveness, but it expands the set of execution paths that must be tested.

Correctness Rule

For a dynamic graph system, the derivative is computed from the graph produced by the primal execution.

Forward pass:
    execute ordinary program
    record operations that occurred

Backward pass:
    traverse recorded graph backward
    apply local adjoint rules
    accumulate gradients into shared inputs

Dynamic graphs make AD operationally close to normal programming. Their cost is that graph structure, memory usage, and differentiability may vary from one execution to the next.