# Control Flow

Control flow determines which operations a program executes. Straight-line programs have a fixed sequence of operations, but ordinary programs contain branches, loops, recursion, early returns, and exceptions. Automatic differentiation must account for this structure without losing the basic rule: derivatives propagate through the operations that actually determine the output.

A small example already shows the issue:

```text
if x > 0:
    y = x * x
else:
    y = -x
```

This program does not define one straight-line trace for all inputs. It defines different traces for different regions of the input space.

For $x > 0$:

```text
v1 = x
v2 = v1 * v1
y  = v2
```

For $x \le 0$:

```text
v1 = x
v2 = -v1
y  = v2
```

The derivative depends on which branch is taken:

$$
\frac{dy}{dx} =
\begin{cases}
2x, & x > 0, \\
-1, & x < 0.
\end{cases}
$$

At $x = 0$, the function is continuous but not differentiable. An AD system must choose a convention, report an error, or propagate the derivative of the executed branch.
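
A minimal forward-mode sketch in plain Python (hypothetical names, not a real AD library) shows how the derivative follows the taken branch: the condition is evaluated on the value only, and the tangent is propagated through the selected subprogram.

```python
def piecewise(x, dx=1.0):
    """Differentiate the executed branch of: y = x*x if x > 0 else -x.
    Returns (value, derivative) as a (primal, tangent) pair."""
    if x > 0:                       # condition tested on the value only
        return x * x, 2 * x * dx    # derivative of the taken branch
    else:
        return -x, -dx              # derivative of the other branch

value, deriv = piecewise(3.0)    # branch x > 0: (9.0, 6.0)
value2, deriv2 = piecewise(-2.0) # branch x <= 0: (2.0, -1.0)
```

At `x = 0.0` this sketch silently takes the `else` branch and reports `-1.0`, which is exactly the "derivative of the executed branch" convention described above.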

## Branches

A branch selects one subprogram from several alternatives.

```text
if condition:
    y = f(x)
else:
    y = g(x)
```

For a fixed execution, only one branch contributes to the output. Dynamic AD systems usually differentiate the executed branch only.

If the condition evaluates to true, the local computation is:

```text
y = f(x)
```

If it evaluates to false, the local computation is:

```text
y = g(x)
```

The derivative is therefore branch-local.

This is correct inside a region where the branch decision does not change. It may fail to describe behavior at the boundary where the condition changes.

## Branch Boundaries

A branch condition may introduce a discontinuity or a kink.

Example:

```text
if x > 0:
    y = x
else:
    y = 0
```

This is the ReLU function:

$$
y = \max(0, x).
$$

Its derivative is:

$$
\frac{dy}{dx} =
\begin{cases}
1, & x > 0, \\
0, & x < 0.
\end{cases}
$$

At $x = 0$, the classical derivative is undefined.

Many machine learning systems define a conventional derivative at the boundary, often $0$. This convention is useful computationally, but it is a design choice, not a theorem of calculus.
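
The convention can be written out directly as a sketch; the value `0.0` returned at the boundary is the design choice, not a derived fact.

```python
def relu(x):
    # y = max(0, x)
    return x if x > 0 else 0.0

def relu_grad(x):
    # Piecewise derivative: 1 for x > 0, 0 for x < 0.
    # At x == 0 the classical derivative is undefined; this picks 0 by convention.
    return 1.0 if x > 0 else 0.0
```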

## Data-Dependent Control Flow

Control flow is data-dependent when the branch condition depends on differentiable inputs.

```text
if x > threshold:
    y = x * x
else:
    y = sin(x)
```

Within each region, AD computes the derivative of the selected formula. Across the threshold, the program may have a jump, kink, or other non-smooth behavior.

AD usually does not include the derivative of the branch decision itself. The comparison `x > threshold` is discrete. Its output changes abruptly and has no useful ordinary derivative.

Thus AD treats the executed path as fixed during differentiation.

## Loops

A loop repeats a block of computation.

```text
h = h0
for t in range(n):
    h = tanh(W * h + b)
y = h
```

For a fixed $n$, the loop can be unrolled into a straight-line program:

```text
h0 = input
h1 = tanh(W * h0 + b)
h2 = tanh(W * h1 + b)
h3 = tanh(W * h2 + b)
...
hn = tanh(W * h_{n-1} + b)
y  = hn
```

Forward mode propagates tangents through each iteration. Reverse mode records the intermediate states and traverses the iterations backward.

This is the same mechanism used in backpropagation through time.
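
A scalar forward-mode sketch of this loop (hypothetical names; real systems use matrices and vectors) propagates a tangent `dh` alongside the state `h` through each iteration:

```python
import math

def loop_forward(h0, dh0, W, b, n):
    """Propagate (h, dh) through n iterations of h = tanh(W*h + b).
    Scalar sketch of forward-mode AD through an unrolled loop."""
    h, dh = h0, dh0
    for _ in range(n):
        z = W * h + b
        dz = W * dh                # linear step: dz = W * dh
        h = math.tanh(z)
        dh = (1 - h * h) * dz      # tanh'(z) = 1 - tanh(z)**2
    return h, dh
```

Reverse mode would instead record each `h` during the forward loop and traverse the iterations backward.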

## Loop-Carried Dependencies

A loop-carried dependency occurs when one iteration depends on the previous iteration.

```text
s = 0
for i in range(n):
    s = s + x[i] * x[i]
```

Here, each new value of `s` depends on the old value of `s`.

In single-assignment form:

```text
s0 = 0
s1 = s0 + x[0] * x[0]
s2 = s1 + x[1] * x[1]
s3 = s2 + x[2] * x[2]
...
```

Reverse mode propagates adjoints backward through this chain. The adjoint of each `s_i` contributes to the previous `s_{i-1}` and to the corresponding input `x[i-1]`.
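
The backward chain can be written out explicitly as a sketch in plain Python: since `s_{i} = s_{i-1} + x[i-1]**2`, each `s` adjoint passes through unchanged and each input receives `2 * x[i]`.

```python
def sum_squares_grad(x):
    """Forward accumulation of s = sum(x_i^2), then an explicit reverse pass."""
    # forward pass; the inputs are the only values the backward pass needs
    s = 0.0
    for xi in x:
        s = s + xi * xi
    # backward pass over the loop-carried chain, in reverse order
    s_bar = 1.0                        # adjoint of the output
    x_bar = [0.0] * len(x)
    for i in reversed(range(len(x))):
        x_bar[i] += s_bar * 2 * x[i]   # d(s_new)/d(x[i]) = 2*x[i]
        # d(s_new)/d(s_old) = 1, so s_bar is unchanged along the chain
    return s, x_bar
```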

## Dynamic Loop Counts

A loop count may depend on runtime data.

```text
while norm(x) > eps:
    x = step(x)
```

For one execution, the loop runs a specific number of iterations. A dynamic AD system differentiates the executed unrolled trace.

This derivative assumes that the number of iterations stays fixed under small perturbations. That assumption can fail near points where a small input change changes the stopping time.

Stopping criteria are therefore a source of non-smoothness.
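
A scalar sketch (hypothetical, halving until a tolerance) makes the stopping-time issue concrete: the trace derivative is $0.5^n$ for the executed iteration count $n$, and $n$ jumps as the input crosses a stopping boundary.

```python
def halve_until(x, eps=0.1):
    """Executed trace of: while x > eps: x = 0.5 * x.
    Differentiating the trace gives d(x_final)/d(x_initial) = 0.5**n."""
    n, dx = 0, 1.0
    while x > eps:
        x, dx = 0.5 * x, 0.5 * dx
        n += 1
    return x, dx, n
```

For inputs just below and just above a stopping boundary, `n` differs by one and the trace derivative jumps between `0.5**n` and `0.5**(n+1)`, even though each fixed-`n` trace is perfectly smooth.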

## Recursion

Recursion is another form of repeated computation.

```text
def f(x, n):
    if n == 0:
        return x
    return sin(f(x, n - 1))
```

For a fixed call, recursion expands into a finite computation trace:

```text
v0 = x
v1 = sin(v0)
v2 = sin(v1)
...
vn = sin(v_{n-1})
```

Reverse mode can differentiate this trace if the system records the call structure and intermediate values.

Recursive differentiation requires careful handling of stack frames, captured variables, and return values.
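
For the `sin` recursion above, a reverse-mode sketch (hypothetical helper names) records the intermediate values on the way down and multiplies the local derivatives $\cos(v_k)$ on the way back up:

```python
import math

def f(x, n):
    if n == 0:
        return x
    return math.sin(f(x, n - 1))

def f_grad(x, n):
    """Record the trace v0, v1, ..., vn, then apply the chain rule backward:
    dvn/dx = cos(v_{n-1}) * ... * cos(v0)."""
    vals = [x]
    for _ in range(n):
        vals.append(math.sin(vals[-1]))
    g = 1.0
    for v in reversed(vals[:-1]):   # traverse recorded values backward
        g *= math.cos(v)
    return vals[-1], g
```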

## Early Returns

An early return exits a function before later code executes.

```text
def f(x):
    if x < 0:
        return 0
    y = x * x
    return y
```

Only the executed operations belong to the derivative trace.

If $x < 0$, the executed path returns the constant $0$:

$$
\frac{dy}{dx} = 0.
$$

If $x \ge 0$, the executed path computes $y = x^2$:

$$
\frac{dy}{dx} = 2x.
$$

Early returns are branch structure. AD systems treat them like conditionals.

## Exceptions and Invalid Regions

Some computations are defined only on part of the input domain.

```text
y = log(x)
```

The derivative rule

$$
dy = \frac{1}{x} dx
$$

requires $x > 0$ for real-valued computation.

Control flow often guards such operations:

```text
if x > 0:
    y = log(x)
else:
    y = 0
```

The derivative follows the executed path. However, the mathematical function may still be discontinuous or non-differentiable at the guard boundary.

AD cannot remove domain issues. It only differentiates the computation that the program performs.

## Static Versus Dynamic Treatment

There are two broad strategies for control flow.

| Strategy | Method | Typical use |
|---|---|---|
| Dynamic tracing | Record the executed path | PyTorch-style eager systems |
| Static transformation | Transform control-flow constructs directly | Compiler-based AD |

Dynamic tracing is simple for ordinary host-language control flow. It records what happened.

Static transformation can reason about loops and branches as program structure. It can optimize memory, fuse operations, and generate differentiated control flow. It is harder to implement because it must preserve language semantics.

## Differentiating Structured Control Flow

A compiler-based AD system may transform:

```text
if c:
    y = f(x)
else:
    y = g(x)
```

into differentiated control flow:

```text
if c:
    y, dy = df(x, dx)
else:
    y, dy = dg(x, dx)
```

For reverse mode, it may generate:

```text
if c:
    backward_f(y_bar)
else:
    backward_g(y_bar)
```

The condition must be replayed or recorded so the backward pass follows the same branch as the forward pass.
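
A scalar sketch of the forward-mode transformation, with hypothetical subprograms `f(x) = x*x` and `g(x) = sin(x)`:

```python
import math

def df(x, dx):
    # differentiated f: f(x) = x*x
    return x * x, 2 * x * dx

def dg(x, dx):
    # differentiated g: g(x) = sin(x)
    return math.sin(x), math.cos(x) * dx

def branch(c, x, dx):
    # c selects which differentiated subprogram runs; it must be the same
    # condition value the forward pass evaluated.
    return df(x, dx) if c else dg(x, dx)
```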

## Control Flow and Tapes

Reverse mode must know which operations were executed. Therefore, control flow often becomes tape structure.

A tape may contain records such as:

```text
enter_branch true
mul v1 v1 -> v2
return v2
```

For loops, the tape may contain one record per executed iteration.

```text
iter 0: matmul ...
iter 0: tanh ...
iter 1: matmul ...
iter 1: tanh ...
...
```

The backward pass traverses these records in reverse order.

This is correct but may use large amounts of memory for long loops.
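
A toy tape, supporting only `mul` for brevity (an assumption; real tapes record many operation kinds), shows the mechanism: the forward pass appends one record per executed operation, and the backward pass walks the records in reverse.

```python
class Tape:
    def __init__(self):
        self.vals = []   # variable values, indexed by id
        self.ops = []    # records: (op_name, input_ids, output_id)

    def var(self, v):
        self.vals.append(v)
        return len(self.vals) - 1

    def mul(self, a, b):
        out = self.var(self.vals[a] * self.vals[b])
        self.ops.append(("mul", (a, b), out))
        return out

    def backward(self, out):
        adj = [0.0] * len(self.vals)
        adj[out] = 1.0
        for op, (a, b), o in reversed(self.ops):  # reverse traversal
            if op == "mul":                       # d(a*b): b*da + a*db
                adj[a] += adj[o] * self.vals[b]
                adj[b] += adj[o] * self.vals[a]
        return adj

t = Tape()
x = t.var(3.0)
y = t.mul(x, x)        # tape record: mul v_x v_x -> v_y
grads = t.backward(y)  # grads[x] == 6.0
```

Only the operations that actually executed appear on the tape, so branches not taken contribute nothing to the backward pass.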

## Control Flow and Checkpointing

Loops amplify reverse-mode memory costs.

If a loop runs one million iterations, storing every intermediate value may be infeasible. Checkpointing stores selected states and recomputes missing values during the backward pass.

For a loop:

```text
h0 -> h1 -> h2 -> ... -> hn
```

a checkpointing strategy may store:

```text
h0, h1000, h2000, ..., hn
```

During backward execution, it recomputes intermediate states inside each block.

This trades extra computation for lower memory.
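
A sketch of the block strategy, with a hypothetical `step` function and checkpoint spacing `k`:

```python
import math

def step(h, W=0.9, b=0.05):
    return math.tanh(W * h + b)

def checkpointed_states(h0, n, k):
    """Forward pass that stores only every k-th state (plus the endpoints)."""
    checkpoints = {0: h0}
    h = h0
    for t in range(1, n + 1):
        h = step(h)
        if t % k == 0 or t == n:
            checkpoints[t] = h
    return checkpoints

def recompute_block(checkpoints, start, end):
    """Recompute states start..end from the stored state at `start`,
    as the backward pass would do inside one block."""
    h = checkpoints[start]
    states = [h]
    for _ in range(start, end):
        h = step(h)
        states.append(h)
    return states
```

Storing roughly $n/k$ states and recomputing at most $k$ steps per block is the simplest uniform scheme; logarithmic schemes exist but follow the same store-and-recompute idea.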

## Side Effects in Control Flow

Control flow often interacts with mutation and side effects.

```text
if x > 0:
    buffer.append(x)
else:
    buffer.append(-x)
```

The derivative of the numeric value may be simple, but the program also modifies external state.

AD systems usually restrict such effects, trace them, or separate differentiable computation from non-differentiable effects.

Side effects complicate reverse mode because the backward pass may need to undo, replay, or reason about state changes.

## Non-Smooth Control Flow

Control flow frequently creates non-smooth functions.

Examples include:

```text
max(x, 0)
abs(x)
clip(x, lo, hi)
where(mask, a, b)
top_k(x)
sort(x)
argmax(x)
```

Some are piecewise differentiable. Some are discrete and have zero or undefined derivative almost everywhere.

AD can propagate derivatives through the selected numeric path, but users must understand the mathematical meaning of the result.

For discrete choices such as `argmax`, ordinary AD usually cannot provide a useful gradient with respect to the choice.
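
The piecewise cases can be made concrete with `clip` (a sketch; the value chosen at the kinks is a convention, as discussed above):

```python
def clip(x, lo, hi):
    # y = min(max(x, lo), hi)
    return min(max(x, lo), hi)

def clip_grad(x, lo, hi):
    # Piecewise derivative: 1 strictly inside (lo, hi), 0 outside.
    # At the kinks x == lo and x == hi this picks the convention 0.
    return 1.0 if lo < x < hi else 0.0
```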

## Core Idea

Control flow selects the computation to differentiate. For a fixed execution, branches, loops, recursion, and early returns produce a concrete trace. AD differentiates that trace.

This works well away from branch boundaries and stopping-time changes. At non-smooth points, the derivative may be undefined, conventional, or merely the derivative of the executed path.

