Loops
A loop repeats a computation until a condition fails or a fixed iteration count is reached. In automatic differentiation, loops are important because many numerical algorithms are fundamentally iterative:
- optimization methods
- recurrent neural networks
- numerical solvers
- simulations
- dynamic programming
- graph algorithms
A loop introduces repeated dependence between program states. The derivative of the loop therefore depends on how perturbations propagate through the sequence of iterations.
A loop can be viewed as repeated function composition.
Consider the program:
```python
x = x0
for i in range(n):
    x = f(x)
```

After $n$ iterations,

$$x_n = f^{(n)}(x_0),$$

where $f^{(n)}$ denotes $n$-fold composition.

The derivative is obtained by repeated application of the chain rule:

$$\frac{\partial x_n}{\partial x_0} = f'(x_{n-1}) \, f'(x_{n-2}) \cdots f'(x_0).$$
A loop is therefore a dynamic chain rule accumulator.
Loops as State Transitions
Most loops operate on a mutable program state. Let the loop state at iteration $i$ be $s_i$, and let the loop body define a transition function

$$s_{i+1} = T(s_i).$$

Then after $n$ iterations,

$$s_n = T^{(n)}(s_0).$$

The Jacobian of the full loop is

$$\frac{\partial s_n}{\partial s_0} = J_n \, J_{n-1} \cdots J_1.$$

This is the fundamental differentiation rule for loops.

The loop body contributes a local Jacobian at every iteration. The full derivative is the product of all local Jacobians in execution order.

For scalar states:

$$\frac{d s_n}{d s_0} = \prod_{i=1}^{n} T'(s_{i-1}).$$

For vector states:

$$\frac{\partial s_n}{\partial s_0} = J_n \cdots J_1,$$

where

$$J_i = \frac{\partial T}{\partial s}\bigg|_{s = s_{i-1}}.$$
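As a concrete illustration, here is a minimal sketch that accumulates the loop Jacobian as a product of local Jacobians. The transition `T` (a linear map followed by `tanh`) is an assumption chosen for the example, not a canonical choice:

```python
import numpy as np

def T(s, W):
    # example transition: elementwise tanh of a linear map
    return np.tanh(W @ s)

def T_jacobian(s, W):
    # local Jacobian of T at state s: diag(1 - tanh(Ws)^2) @ W
    a = W @ s
    return np.diag(1.0 - np.tanh(a) ** 2) @ W

W = np.array([[0.5, 0.2], [-0.3, 0.8]])
s = np.array([1.0, -1.0])
J = np.eye(2)
for i in range(5):
    J = T_jacobian(s, W) @ J   # multiply local Jacobians in execution order
    s = T(s, W)

print(J)   # d s_5 / d s_0
```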
Forward Mode Through Loops
Forward mode propagates tangents alongside primal values through every iteration.
Suppose:

```python
x = x0
for i in range(n):
    x = f(x)
```

Forward mode augments the state with a tangent $\dot{x}$. At iteration $i$,

$$x_{i+1} = f(x_i)$$

and

$$\dot{x}_{i+1} = f'(x_i) \, \dot{x}_i.$$

The tangent evolves according to the linearized loop dynamics.
Conceptually:
```python
x = x0
dx = dx0
for i in range(n):
    x_new = f(x)
    dx_new = df(x) * dx   # df computes f'(x)
    x = x_new
    dx = dx_new
```

The tangent update follows the same control flow and iteration count as the primal computation.
Forward mode is naturally streaming: it does not need to store previous loop states. Memory usage is therefore

$$O(1)$$

with respect to iteration count, assuming a fixed-size state.
The cost is proportional to the number of propagated tangent directions.
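A runnable sketch of this streaming propagation, assuming $f(x) = \sin(x)$ as an illustrative loop body and checking the tangent against a finite difference:

```python
import math

def f(x):
    return math.sin(x)

def df(x):
    return math.cos(x)

def loop_jvp(x0, dx0, n):
    # propagate primal and tangent together; O(1) memory in n
    x, dx = x0, dx0
    for _ in range(n):
        x, dx = f(x), df(x) * dx
    return x, dx

x0, n, eps = 0.7, 50, 1e-6
_, dx = loop_jvp(x0, 1.0, n)
fd = (loop_jvp(x0 + eps, 1.0, n)[0] - loop_jvp(x0, 1.0, n)[0]) / eps
print(dx, fd)   # should agree closely
```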
Example: Repeated Squaring
Consider:

```python
x = x0
for i in range(3):
    x = x * x
```

The iterations are:

$$x_1 = x_0^2, \quad x_2 = x_0^4, \quad x_3 = x_0^8.$$

Thus,

$$\frac{d x_3}{d x_0} = 8 x_0^7.$$

Forward mode computes:

Iteration 1: $\dot{x}_1 = 2 x_0 \, \dot{x}_0$

Iteration 2: $\dot{x}_2 = 2 x_1 \, \dot{x}_1 = 4 x_0^3 \, \dot{x}_0$

Iteration 3: $\dot{x}_3 = 2 x_2 \, \dot{x}_2 = 8 x_0^7 \, \dot{x}_0$
The tangent accumulates multiplicatively through the loop.
Reverse Mode Through Loops
Reverse mode differentiates loops by traversing iterations backward.
Suppose:

```python
x = x0
for i in range(n):
    x = f(x)
```

The forward execution produces states:

$$x_0, x_1, \ldots, x_n.$$

The reverse pass propagates adjoints backward:

$$\bar{x}_{i-1} = f'(x_{i-1})^{\top} \, \bar{x}_i, \qquad i = n, n-1, \ldots, 1,$$

using

$$\bar{x}_n = \text{seed}.$$
The backward pass therefore reverses the loop.
Conceptually:
```python
# forward
states = [x0]
x = x0
for i in range(n):
    x = f(x)
    states.append(x)

# backward
x_bar = seed
for i in reversed(range(n)):
    x_prev = states[i]
    x_bar = df(x_prev).T @ x_bar   # transpose-Jacobian product, df(x) = f'(x)
```

The reverse pass requires access to the intermediate states from the forward pass.
This introduces the major systems problem of reverse-mode loops:
Reverse mode usually needs loop-state storage.

If the loop runs for millions of iterations, storing all intermediate states may be prohibitively expensive.
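A runnable scalar instance of this pattern, again assuming $f(x) = \sin(x)$ for the loop body; for a scalar state the adjoint equals the forward-mode tangent:

```python
import math

def f(x):
    return math.sin(x)

def df(x):
    return math.cos(x)

def loop_vjp(x0, seed, n):
    # forward pass: record every intermediate state
    states = [x0]
    for _ in range(n):
        states.append(f(states[-1]))
    # backward pass: walk the recorded states in reverse
    x_bar = seed
    for i in reversed(range(n)):
        x_bar = df(states[i]) * x_bar
    return x_bar

print(loop_vjp(0.7, 1.0, 50))   # matches the forward-mode result above
```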
Wengert Lists and Dynamic Tapes
In tape-based reverse mode, each loop iteration appends operations to a tape.
For example:
```python
for i in range(n):
    x = sin(x)
```

generates a tape containing:

```
x1 = sin(x0)
x2 = sin(x1)
x3 = sin(x2)
...
xn = sin(xn-1)
```

The backward pass walks this tape in reverse order:

$$\bar{x}_{i-1} = \cos(x_{i-1}) \, \bar{x}_i.$$
The tape length grows linearly with iteration count.
If the loop body is large or the iteration count is data-dependent, memory consumption can dominate execution cost.
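A minimal tape sketch for this example. The `Tape` class and its interface are hypothetical, built only to show the record-then-reverse structure for unary operations:

```python
import math

class Tape:
    def __init__(self):
        self.records = []   # one (input_value, local_derivative) entry per op

    def sin(self, x):
        self.records.append((x, math.cos(x)))   # record local derivative data
        return math.sin(x)

    def backward(self, seed=1.0):
        x_bar = seed
        for _, d in reversed(self.records):     # walk the tape in reverse
            x_bar = d * x_bar
        return x_bar

tape = Tape()
x = 0.5
for _ in range(100):
    x = tape.sin(x)          # each iteration appends one tape entry
print(len(tape.records))     # tape length grows linearly: 100
print(tape.backward())       # d x_100 / d x_0
```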
Static vs Dynamic Loops
Loops fall into two broad categories.
Static Loops
A static loop has a compile-time known iteration count.
Example:
```python
for i in range(10):
    x = f(x)
```

Compilers may unroll such loops partially or fully.
The differentiated program becomes a finite repeated composition.
Static loops are easier to optimize because:
- iteration count is known
- memory requirements are predictable
- control flow is stable
- vectorization is easier
Dynamic Loops
A dynamic loop depends on runtime values.
Example:
```python
while norm(x) > eps:
    x = update(x)
```

The number of iterations depends on the input.
Dynamic loops are harder because:
- forward and backward passes must agree on iteration count
- loop states may vary in shape or structure
- compilation becomes more difficult
- tracing systems may specialize to observed iteration counts
The backward pass must replay exactly the same iteration structure as the forward pass.
Loop-Carried Dependencies
A loop-carried dependency occurs when one iteration depends on results from previous iterations.
Example:
```python
h = h0
for t in range(T):
    h = tanh(W @ h + b)
```

This defines a recurrence:

$$h_{t+1} = \tanh(W h_t + b).$$

The derivative across time becomes:

$$\frac{\partial h_T}{\partial h_0} = J_T \, J_{T-1} \cdots J_1,$$

where

$$J_t = \mathrm{diag}\!\left(1 - h_t^2\right) W.$$
This repeated matrix multiplication causes two important phenomena:
- vanishing gradients
- exploding gradients
If the spectral norm satisfies

$$\|J_t\| < 1,$$

then gradients shrink exponentially.

If

$$\|J_t\| > 1,$$

then gradients grow exponentially.
This is a direct consequence of loop differentiation.
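A small numpy sketch of the vanishing case. The weight scaling is an assumption chosen so the local Jacobians have norm well below one:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = 0.25 * rng.standard_normal((d, d)) / np.sqrt(d)   # small spectral norm
h = rng.standard_normal(d)
J = np.eye(d)
for t in range(100):
    h = np.tanh(W @ h)
    J = (np.diag(1.0 - h**2) @ W) @ J    # accumulate local Jacobians
    if (t + 1) % 20 == 0:
        print(t + 1, np.linalg.norm(J))  # norm decays exponentially
```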
Backpropagation Through Time
Recurrent neural networks are differentiated by unrolling loops across time.
The recurrent program:

```python
h = h0
for t in range(T):
    h = F(h, x[t])
```

is transformed into:

```
h1 = F(h0, x1)
h2 = F(h1, x2)
...
hT = F(hT-1, xT)
```

Reverse mode then propagates adjoints backward through the unrolled graph.
This procedure is called backpropagation through time (BPTT).
The loop becomes an explicit chain of repeated operations.
The cost scales linearly with sequence length:

- forward cost: $O(T)$
- backward cost: $O(T)$
- memory cost: $O(T)$

for a fixed hidden-state size.
Long sequences therefore create large memory pressure.
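A compact BPTT sketch for a tanh RNN, assuming an illustrative loss that sums the final hidden state (all names here are for the example only):

```python
import numpy as np

def bptt(h0, xs, W, U):
    # forward: store every hidden state for the backward pass
    hs = [h0]
    for x in xs:
        hs.append(np.tanh(W @ hs[-1] + U @ x))
    # loss L = sum(h_T)  ->  seed adjoint of ones
    h_bar = np.ones_like(h0)
    W_bar = np.zeros_like(W)
    for t in reversed(range(len(xs))):
        a_bar = (1.0 - hs[t + 1] ** 2) * h_bar   # back through tanh
        W_bar += np.outer(a_bar, hs[t])          # accumulate dL/dW
        h_bar = W.T @ a_bar                      # push adjoint to h_{t-1}
    return h_bar, W_bar                          # dL/dh0, dL/dW

rng = np.random.default_rng(1)
d = 4
W = 0.3 * rng.standard_normal((d, d))
U = 0.3 * rng.standard_normal((d, d))
h0 = rng.standard_normal(d)
xs = [rng.standard_normal(d) for _ in range(10)]
print(bptt(h0, xs, W, U)[0])
```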
Checkpointing for Long Loops
Reverse mode stores intermediate states so the backward pass can reconstruct local derivatives.
For long loops, storing every state may be impossible.
Checkpointing trades computation for memory.
Instead of storing every state, the system stores selected checkpoints:
```
x0 ---- x100 ---- x200 ---- x300
```

During the backward pass, missing states are recomputed from nearby checkpoints.
This changes complexity:
| Strategy | Memory | Recomputation |
|---|---|---|
| Store all states | High | None |
| Recompute everything | Low | Very high |
| Checkpointing | Moderate | Moderate |
Optimal checkpoint schedules are an important research topic in reverse-mode AD.
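A sketch of the simplest uniform schedule: store every k-th state and recompute segments on demand (`f`/`df` as in the earlier scalar example):

```python
import math

def f(x): return math.sin(x)
def df(x): return math.cos(x)

def checkpointed_vjp(x0, seed, n, k):
    # forward: keep only every k-th state
    ckpts = {0: x0}
    x = x0
    for i in range(n):
        x = f(x)
        if (i + 1) % k == 0:
            ckpts[i + 1] = x
    # backward: recompute each needed state from its checkpoint
    x_bar = seed
    for i in reversed(range(n)):
        base = (i // k) * k
        x = ckpts[base]
        for _ in range(i - base):   # recompute x_i from the checkpoint
            x = f(x)
        x_bar = df(x) * x_bar
    return x_bar

print(checkpointed_vjp(0.7, 1.0, 1000, 100))  # matches full-storage result
```

Memory drops from $O(n)$ states to $O(n/k)$ checkpoints, at the price of up to $k$ recomputation steps per backward step.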
Differentiating While Loops
A while loop introduces a termination condition:
```python
while c(x):
    x = f(x)
```

This defines a variable-length composition:

$$x_N = f^{(N)}(x_0),$$

where $N$ depends on the input.
Differentiation assumes the iteration count remains locally stable.
If small perturbations change the number of iterations, the derivative may become discontinuous.
Example:
```python
while x < 1:
    x = 2 * x
```

The number of iterations changes abruptly as the input crosses inverse powers of two ($1/2$, $1/4$, ...).
The resulting function is piecewise smooth but globally non-smooth.
Most AD systems differentiate the observed execution path and treat the iteration count as fixed during local differentiation.
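A small demonstration: on each observed path the derivative is exact ($2^N$ for this loop, since each iteration multiplies by 2), but it jumps where the iteration count changes:

```python
def doubling(x):
    n = 0
    while x < 1:
        x, n = 2 * x, n + 1
    return x, n

for x0 in (0.24, 0.26, 0.49, 0.51):
    x, n = doubling(x0)
    # path derivative: dx/dx0 = 2**n on the observed path
    print(x0, "iterations:", n, "dx/dx0:", 2 ** n)
```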
Fixed-Point Iterations
Many loops seek a fixed point:

$$x^\star = f(x^\star).$$
Example:
```python
x = x0
for i in range(max_iter):
    x = f(x)
    if converged(x):
        break
```

Naively differentiating through all iterations may be expensive.
An alternative is implicit differentiation.
If

$$x^\star = f(x^\star, \theta),$$

then differentiating gives

$$\frac{\partial x^\star}{\partial \theta} = \left(I - \frac{\partial f}{\partial x}\bigg|_{x^\star}\right)^{-1} \frac{\partial f}{\partial \theta}\bigg|_{x^\star}.$$
This avoids storing all intermediate iterations.
Implicit differentiation is discussed later in the book, but loops are the motivating structure behind it.
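A scalar sketch of the idea, assuming the illustrative contraction $f(x, \theta) = \cos(\theta x)$ and checking against a finite difference:

```python
import math

def f(x, theta):
    return math.cos(theta * x)

def fixed_point(theta, x0=0.5, tol=1e-12):
    x = x0
    while abs(f(x, theta) - x) > tol:   # simple fixed-point iteration
        x = f(x, theta)
    return x

def implicit_grad(theta):
    x = fixed_point(theta)
    df_dx = -theta * math.sin(theta * x)
    df_dtheta = -x * math.sin(theta * x)
    # scalar case of (I - df/dx)^{-1} df/dtheta: no iteration history needed
    return df_dtheta / (1.0 - df_dx)

theta, eps = 0.9, 1e-6
print(implicit_grad(theta))
print((fixed_point(theta + eps) - fixed_point(theta)) / eps)  # should agree
```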
Parallel Loops and Associativity
Some loops represent reductions:
```python
s = 0
for i in range(n):
    s += x[i]
```

The derivative is straightforward:

$$\frac{\partial s}{\partial x_i} = 1.$$

But systems implementations may parallelize the reduction as a tree:

$$s = \big((x_1 + x_2) + (x_3 + x_4)\big) + \cdots$$

instead of sequential accumulation.
Because floating-point addition is not associative, the exact primal and derivative values may vary across execution orders.
Parallel differentiation therefore introduces reproducibility concerns.
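A small illustration of the underlying floating-point issue, with values chosen to expose rounding:

```python
xs = [1e16, 1.0, -1e16, 1.0]

seq = 0.0
for v in xs:
    seq += v                               # sequential left-to-right sum

tree = (xs[0] + xs[1]) + (xs[2] + xs[3])   # pairwise (tree) reduction

print(seq, tree)   # 1.0 vs 0.0: same data, different execution order
```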
Mutation Inside Loops
Loops often mutate arrays or tensors in place:
```python
for i in range(n):
    x[i] = x[i] * 2
```

Mutation complicates reverse mode because previous values may be overwritten before gradients are computed.
A reverse pass may require:
- storing old values
- versioning arrays
- converting mutation into functional updates
- using SSA-style representations
Many differentiable systems internally lower mutable loops into immutable graph representations.
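A sketch of the first strategy, stashing overwritten values so the reverse pass can use them. A squaring update is assumed here (rather than the doubling above) so the old value genuinely matters:

```python
import numpy as np

def forward(x):
    saved = []
    for i in range(len(x)):
        saved.append(x[i])    # stash the value about to be overwritten
        x[i] = x[i] ** 2      # in-place mutation
    return x, saved

def backward(x_bar, saved):
    # reverse iteration order; use stashed values for local derivatives
    for i in reversed(range(len(x_bar))):
        x_bar[i] = 2.0 * saved[i] * x_bar[i]   # d(x^2)/dx = 2x at the OLD x
    return x_bar

x = np.array([1.0, 2.0, 3.0])
y, saved = forward(x)
print(backward(np.ones(3), saved))   # [2., 4., 6.]
```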
Loop Fusion and Optimization
AD compilers frequently optimize loops after differentiation.
Example:
```python
for i in range(n):
    y[i] = sin(x[i])
for i in range(n):
    z[i] = y[i] * 2
```

may be fused into:

```python
for i in range(n):
    z[i] = 2 * sin(x[i])
```

Fusion reduces:
- memory traffic
- intermediate allocations
- kernel launches
- tape size
Loop optimization is therefore tightly connected to AD performance.
Correctness Rule
For a loop
```python
s = s0
for i in range(n):
    s = T(s)
```

the derivative is obtained by repeated chain-rule application:

$$\frac{\partial s_n}{\partial s_0} = J_n \, J_{n-1} \cdots J_1, \qquad J_i = \frac{\partial T}{\partial s}\bigg|_{s_{i-1}}.$$
Forward mode propagates tangents through iterations in execution order.
Reverse mode propagates adjoints backward through iterations in reverse order.
The central systems problem is state management:

- Forward mode: low memory, streaming propagation.
- Reverse mode: efficient for scalar outputs, but usually requires loop-state storage.

Loops therefore transform differentiation from a local operation into a dynamical process across execution time.