Control flow determines which operations a program executes. Straight-line programs have a fixed sequence of operations, but ordinary programs contain branches, loops, recursion, early returns, and exceptions. Automatic differentiation must account for this structure without losing the basic rule: derivatives propagate through the operations that actually determine the output.
A small example already shows the issue:
```python
if x > 0:
    y = x * x
else:
    y = -x
```

This program does not define one straight-line trace for all inputs. It defines different traces for different regions of the input space.
For x > 0:

```python
v1 = x
v2 = v1 * v1
y = v2
```

For x <= 0:

```python
v1 = x
v2 = -v1
y = v2
```

The derivative depends on which branch is taken: dy/dx = 2x for x > 0 and dy/dx = -1 for x < 0.

At x = 0, the function is continuous but not differentiable. An AD system must choose a convention, report an error, or propagate the derivative of the executed branch.
Branches
A branch selects one subprogram from several alternatives.
```python
if condition:
    y = f(x)
else:
    y = g(x)
```

For a fixed execution, only one branch contributes to the output. Dynamic AD systems usually differentiate the executed branch only.

If the condition evaluates to true, the local computation is:

```python
y = f(x)
```

If it evaluates to false, the local computation is:

```python
y = g(x)
```

The derivative is therefore branch-local.
This is correct inside a region where the branch decision does not change. It may fail to describe behavior at the boundary where the condition changes.
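The branch-local rule can be made concrete with a minimal forward-mode sketch using a hypothetical `Dual` class (primal value plus tangent, not any particular library's API): the tangent that emerges is the derivative of whichever branch actually executed.

```python
class Dual:
    # Minimal forward-mode pair: primal value plus tangent.
    def __init__(self, val, tan):
        self.val, self.tan = val, tan
    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)
    def __neg__(self):
        return Dual(-self.val, -self.tan)

def f(x):
    # The comparison reads only the primal value; the executed
    # branch determines which derivative is propagated.
    if x.val > 0:
        return x * x
    return -x

print(f(Dual(3.0, 1.0)).tan)   # 6.0: derivative of x*x at x = 3
print(f(Dual(-2.0, 1.0)).tan)  # -1.0: derivative of -x
```

Note that the derivative of the comparison itself is never computed; the condition only selects which formula the tangents flow through.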
Branch Boundaries
A branch condition may introduce a discontinuity or a kink.
Example:
```python
if x > 0:
    y = x
else:
    y = 0
```

This is the ReLU function: y = max(x, 0).

Its derivative is 1 for x > 0 and 0 for x < 0.

At x = 0, the classical derivative is undefined.

Many machine learning systems define a conventional derivative at the boundary, often 0. This convention is useful computationally, but it is a design choice, not a theorem of calculus.
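The convention can be spelled out in code. This sketch (hypothetical `relu_grad` helper, not any library's API) hard-codes the common choice of derivative 0 at the boundary:

```python
def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    # Conventional choice: the strict comparison sends the
    # boundary point x == 0 into the zero-derivative branch.
    return 1.0 if x > 0 else 0.0

print(relu_grad(0.0))  # 0.0 by convention, not by calculus
```

Writing `x >= 0` instead would assign derivative 1 at the boundary; both are defensible conventions.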
Data-Dependent Control Flow
Control flow is data-dependent when the branch condition depends on differentiable inputs.
```python
if x > threshold:
    y = x * x
else:
    y = sin(x)
```

Within each region, AD computes the derivative of the selected formula. Across the threshold, the program may have a jump, kink, or other non-smooth behavior.
AD usually does not include the derivative of the branch decision itself. The comparison is discrete. Its output changes abruptly and has no useful ordinary derivative.
Thus AD treats the executed path as fixed during differentiation.
Loops
A loop repeats a block of computation.
```python
h = h0
for t in range(n):
    h = tanh(W * h + b)
y = h
```

For a fixed n, the loop can be unrolled into a straight-line program:

```python
h0 = input
h1 = tanh(W * h0 + b)
h2 = tanh(W * h1 + b)
h3 = tanh(W * h2 + b)
...
hn = tanh(W * h_{n-1} + b)
y = hn
```

Forward mode propagates tangents through each iteration. Reverse mode records the intermediate states and traverses the iterations backward.
This is the same mechanism used in backpropagation through time.
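As a sketch of this mechanism for a scalar version of the loop (illustrative `forward`/`backward` helpers, not a library API), the forward pass records every state and the backward pass sweeps the recorded iterations in reverse, accumulating the adjoint of the shared weight; the result is checked against a finite difference:

```python
import math

def forward(w, b, h0, n):
    # Record every intermediate state; reverse mode needs them.
    hs = [h0]
    for _ in range(n):
        hs.append(math.tanh(w * hs[-1] + b))
    return hs

def backward(w, b, hs, y_bar):
    # Walk the recorded iterations in reverse order.
    w_bar, h_bar = 0.0, y_bar
    for t in range(len(hs) - 1, 0, -1):
        pre = w * hs[t - 1] + b
        local = (1.0 - math.tanh(pre) ** 2) * h_bar  # d tanh(u)/du = 1 - tanh(u)^2
        w_bar += local * hs[t - 1]  # accumulate into the shared weight
        h_bar = local * w           # pass the adjoint to the previous state
    return w_bar

hs = forward(0.5, 0.1, 1.0, 3)
g = backward(0.5, 0.1, hs, 1.0)

# Sanity check against a central finite difference in w.
eps = 1e-6
fd = (forward(0.5 + eps, 0.1, 1.0, 3)[-1]
      - forward(0.5 - eps, 0.1, 1.0, 3)[-1]) / (2 * eps)
```

Because the weight is reused at every iteration, its adjoint is a sum of one contribution per recorded step.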
Loop-Carried Dependencies
A loop-carried dependency occurs when one iteration depends on the previous iteration.
```python
s = 0
for i in range(n):
    s = s + x[i] * x[i]
```

Here, each new value of s depends on the old value of s.
In single-assignment form:
```python
s0 = 0
s1 = s0 + x[0] * x[0]
s2 = s1 + x[1] * x[1]
s3 = s2 + x[2] * x[2]
...
```

Reverse mode propagates adjoints backward through this chain. The adjoint of each s_i contributes to the previous s_{i-1} and to the corresponding input x[i].
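A minimal sketch of this backward sweep (hypothetical helper names): since each update is a plain addition, the adjoint of s passes through the chain unchanged, and each x[i] receives 2 * x[i] times that adjoint.

```python
def sum_of_squares(x):
    s = 0.0
    for xi in x:
        s = s + xi * xi
    return s

def sum_of_squares_grad(x):
    # Reverse sweep over the chain s_i = s_{i-1} + x[i] * x[i].
    # A plain addition forwards the adjoint unchanged, so every
    # s_i has adjoint 1 and each x[i] receives 2 * x[i].
    s_bar = 1.0
    x_bar = [0.0] * len(x)
    for i in range(len(x) - 1, -1, -1):
        x_bar[i] = 2.0 * x[i] * s_bar
    return x_bar

print(sum_of_squares_grad([1.0, 2.0, 3.0]))  # [2.0, 4.0, 6.0]
```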
Dynamic Loop Counts
A loop count may depend on runtime data.
```python
while norm(x) > eps:
    x = step(x)
```

For one execution, the loop runs a specific number of iterations. A dynamic AD system differentiates the executed unrolled trace.
This derivative assumes that the number of iterations stays fixed under small perturbations. That assumption can fail near points where a small input change changes the stopping time.
Stopping criteria are therefore a source of non-smoothness.
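A small illustration of a stopping time changing under perturbation, using a hypothetical halving loop: inputs on either side of a threshold execute different numbers of iterations, so their unrolled traces, and hence their trace derivatives, differ.

```python
def iterate_count(x, eps=0.1):
    # Halve x until it falls below the tolerance; count the iterations.
    n = 0
    while abs(x) > eps:
        x = x / 2.0
        n += 1
    return n

# Inputs on either side of 0.2 stop at different times,
# so the unrolled traces have different lengths.
print(iterate_count(0.19), iterate_count(0.21))  # 1 2
```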
Recursion
Recursion is another form of repeated computation.
```python
def f(x, n):
    if n == 0:
        return x
    return sin(f(x, n - 1))
```

For a fixed call, recursion expands into a finite computation trace:

```python
v0 = x
v1 = sin(v0)
v2 = sin(v1)
...
vn = sin(v_{n-1})
```

Reverse mode can differentiate this trace if the system records the call structure and intermediate values.
Recursive differentiation requires careful handling of stack frames, captured variables, and return values.
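For the sin recursion above, the expanded trace can be differentiated directly. This sketch (illustrative `df` helper) accumulates the chain-rule product cos(v0) * cos(v1) * ... over the executed calls and checks it against a finite difference:

```python
import math

def f(x, n):
    if n == 0:
        return x
    return math.sin(f(x, n - 1))

def df(x, n):
    # Differentiate the expanded trace v_k = sin(v_{k-1}):
    # the chain rule gives the product of cos(v_k) over all executed calls.
    v, d = x, 1.0
    for _ in range(n):
        d *= math.cos(v)
        v = math.sin(v)
    return d

# Sanity check against a central finite difference.
eps = 1e-6
fd = (f(0.3 + eps, 4) - f(0.3 - eps, 4)) / (2 * eps)
```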
Early Returns
An early return exits a function before later code executes.
```python
def f(x):
    if x < 0:
        return 0
    y = x * x
    return y
```

Only the executed operations belong to the derivative trace.

If x < 0, the derivative path is constant: dy/dx = 0.

If x >= 0, the derivative path is: dy/dx = 2x.

Early returns are branch structure. AD systems treat them like conditionals.
Exceptions and Invalid Regions
Some computations are defined only on part of the input domain.
```python
y = log(x)
```

The derivative rule d/dx log(x) = 1/x requires x > 0 for real-valued computation.
Control flow often guards such operations:
```python
if x > 0:
    y = log(x)
else:
    y = 0
```

The derivative follows the executed path. However, the mathematical function may still be discontinuous or non-differentiable at the guard boundary.
AD cannot remove domain issues. It only differentiates the computation that the program performs.
Static Versus Dynamic Treatment
There are two broad strategies for control flow.
| Strategy | Method | Typical use |
|---|---|---|
| Dynamic tracing | Record the executed path | PyTorch-style eager systems |
| Static transformation | Transform control-flow constructs directly | Compiler-based AD |
Dynamic tracing is simple for ordinary host-language control flow. It records what happened.
Static transformation can reason about loops and branches as program structure. It can optimize memory, fuse operations, and generate differentiated control flow. It is harder to implement because it must preserve language semantics.
Differentiating Structured Control Flow
A compiler-based AD system may transform:
```python
if c:
    y = f(x)
else:
    y = g(x)
```

into differentiated control flow:

```python
if c:
    y, dy = df(x, dx)
else:
    y, dy = dg(x, dx)
```

For reverse mode, it may generate:

```python
if c:
    backward_f(y_bar)
else:
    backward_g(y_bar)
```

The condition must be replayed or recorded so the backward pass follows the same branch as the forward pass.
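A minimal hand-written version of this pattern (illustrative function names, with exp and cos standing in for f and g): the forward pass records the branch decision, and the backward pass replays it to select the matching derivative rule.

```python
import math

def forward(x):
    # Record the branch decision so the backward pass can replay it.
    took_true = x > 0.0
    y = math.exp(x) if took_true else math.cos(x)
    return y, took_true

def backward(x, y_bar, took_true):
    # The recorded condition selects the matching derivative rule.
    if took_true:
        return y_bar * math.exp(x)  # d/dx exp(x) = exp(x)
    return y_bar * -math.sin(x)     # d/dx cos(x) = -sin(x)

y, flag = forward(-1.0)
g = backward(-1.0, 1.0, flag)
```

Storing the boolean is cheap; re-evaluating the condition during the backward pass works too, as long as its inputs are still available.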
Control Flow and Tapes
Reverse mode must know which operations were executed. Therefore, control flow often becomes tape structure.
A tape may contain records such as:
```
enter_branch true
mul v1 v1 -> v2
return v2
```

For loops, the tape may contain one record per executed iteration.

```
iter 0: matmul ...
iter 0: tanh ...
iter 1: matmul ...
iter 1: tanh ...
...
```

The backward pass traverses these records in reverse order.
This is correct but may use large amounts of memory for long loops.
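A toy tape for a repeated-squaring loop can illustrate the reverse traversal (the record layout here is hypothetical): the forward pass appends one multiply record per iteration, and the backward pass walks the records in reverse, applying d(v*v)/dv = 2v at each one.

```python
def grad_square_chain(x, n):
    # Forward: v_{k+1} = v_k * v_k, appending one record per iteration.
    vals, tape = [x], []
    for _ in range(n):
        v = vals[-1]
        tape.append(("mul", len(vals) - 1, v))  # op, input slot, input value
        vals.append(v * v)
    # Backward: traverse the records in reverse order.
    bar = [0.0] * len(vals)
    bar[-1] = 1.0
    for k in range(len(tape) - 1, -1, -1):
        _, slot, v = tape[k]
        bar[slot] += 2.0 * v * bar[slot + 1]  # d(v*v)/dv = 2v
    return bar[0]

print(grad_square_chain(2.0, 2))  # 32.0: d/dx (x^2)^2 = 4x^3 at x = 2
```

The tape grows linearly with the number of executed iterations, which is exactly the memory concern noted above.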
Control Flow and Checkpointing
Loops amplify reverse-mode memory costs.
If a loop runs one million iterations, storing every intermediate value may be infeasible. Checkpointing stores selected states and recomputes missing values during the backward pass.
For a loop:
```
h0 -> h1 -> h2 -> ... -> hn
```

a checkpointing strategy may store:

```
h0, h1000, h2000, ..., hn
```

During backward execution, it recomputes intermediate states inside each block.
This trades extra computation for lower memory.
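A sketch of this trade-off for a scalar loop (illustrative helpers; checkpoints are kept every `every` iterations): the backward pass recomputes each block's states from the nearest stored checkpoint before sweeping that block in reverse, and the result is checked against a finite difference.

```python
import math

def step(h):
    return math.tanh(0.9 * h + 0.2)

def run(h0, n):
    h = h0
    for _ in range(n):
        h = step(h)
    return h

def grad_with_checkpoints(h0, n, every):
    # Forward: keep only every `every`-th state as a checkpoint.
    ckpts = {0: h0}
    h = h0
    for t in range(1, n + 1):
        h = step(h)
        if t % every == 0:
            ckpts[t] = h
    # Backward: recompute each block's states from its checkpoint,
    # then sweep that block in reverse.
    h_bar = 1.0
    t = n
    while t > 0:
        start = (t - 1) // every * every
        states = [ckpts[start]]
        for _ in range(start, t):
            states.append(step(states[-1]))
        for k in range(len(states) - 1, 0, -1):
            pre = 0.9 * states[k - 1] + 0.2
            h_bar *= (1.0 - math.tanh(pre) ** 2) * 0.9  # dh_k / dh_{k-1}
        t = start
    return h_bar

# Sanity check against a central finite difference.
eps = 1e-6
fd = (run(0.3 + eps, 10) - run(0.3 - eps, 10)) / (2 * eps)
```

Here only n / every states are stored, at the cost of recomputing each block once during the backward pass.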
Side Effects in Control Flow
Control flow often interacts with mutation and side effects.
```python
if x > 0:
    buffer.append(x)
else:
    buffer.append(-x)
```

The derivative of the numeric value may be simple, but the program also modifies external state.
AD systems usually restrict such effects, trace them, or separate differentiable computation from non-differentiable effects.
Side effects complicate reverse mode because the backward pass may need to undo, replay, or reason about state changes.
Non-Smooth Control Flow
Control flow frequently creates non-smooth functions.
Examples include:
```python
max(x, 0)
abs(x)
clip(x, lo, hi)
where(mask, a, b)
top_k(x)
sort(x)
argmax(x)
```

Some are piecewise differentiable. Some are discrete and have zero or undefined derivative almost everywhere.
AD can propagate derivatives through the selected numeric path, but users must understand the mathematical meaning of the result.
For discrete choices such as argmax, ordinary AD usually cannot provide a useful gradient with respect to the choice.
Core Idea
Control flow selects the computation to differentiate. For a fixed execution, branches, loops, recursion, and early returns produce a concrete trace. AD differentiates that trace.
This works well away from branch boundaries and stopping-time changes. At non-smooth points, the derivative may be undefined, conventional, or merely the derivative of the executed path.