Loops and Recurrence Relations

Loops express repeated computation. Recurrence relations express the same idea mathematically: each state is computed from one or more earlier states.

A loop such as

h = h0
for t in range(n):
    h = tanh(w * h + b)
y = h

corresponds to the recurrence

h_{t+1} = \tanh(w h_t + b), \qquad y = h_n.

Automatic differentiation treats this as a sequence of dependent operations. For a fixed number of iterations, the loop can be expanded into a straight-line program.

Unrolling a Loop

For n = 3, the loop becomes:

h0 = input
a0 = w * h0 + b
h1 = tanh(a0)

a1 = w * h1 + b
h2 = tanh(a1)

a2 = w * h2 + b
h3 = tanh(a2)

y = h3

The dependency chain is:

h0 -> a0 -> h1 -> a1 -> h2 -> a2 -> h3 -> y

Forward mode moves left to right. Reverse mode moves right to left.

Forward Differentiation Through a Recurrence

Let

h_{t+1} = F(h_t, \theta),

where θ is a parameter. Forward mode propagates both the state and its tangent.

If we differentiate with respect to θ, define

s_t = \frac{\partial h_t}{\partial \theta}.

Then

s_{t+1} = \frac{\partial F}{\partial h_t} s_t + \frac{\partial F}{\partial \theta}.

For

h_{t+1} = \tanh(w h_t + b),

with respect to w:

s_{t+1} = (1 - h_{t+1}^2)(h_t + w s_t).

Forward mode computes this during the same pass as the original loop.
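
As a concrete sketch, here is the same computation in plain Python (the function name and the finite-difference check are illustrative, not part of any library API):

from math import tanh

def value_and_tangent(h0, w, b, n):
    # Runs h_{t+1} = tanh(w h_t + b) while carrying s_t = dh_t/dw.
    h = h0
    s = 0.0  # h0 does not depend on w, so s_0 = 0
    for _ in range(n):
        h_new = tanh(w * h + b)
        s = (1.0 - h_new * h_new) * (h + w * s)  # s_{t+1}, using old h = h_t
        h = h_new
    return h, s  # y = h_n and dy/dw

Comparing the returned tangent against a finite difference, such as perturbing w by 1e-6 and dividing the change in y by 1e-6, should agree to several digits.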

Reverse Differentiation Through a Recurrence

Reverse mode starts from the final output and propagates adjoints backward through time.

For

h_{t+1} = F(h_t, \theta),

the adjoint of h_t receives

\bar h_t += \bar h_{t+1} \frac{\partial F}{\partial h_t}.

The parameter adjoint receives

\bar \theta += \bar h_{t+1} \frac{\partial F}{\partial \theta}.

For

h_{t+1} = \tanh(w h_t + b),

let

a_t = w h_t + b, \qquad h_{t+1} = \tanh(a_t).

Then

\bar a_t = \bar h_{t+1}(1 - h_{t+1}^2),

\bar w += \bar a_t h_t, \qquad \bar b += \bar a_t, \qquad \bar h_t += \bar a_t w.

These rules are applied from t = n-1 down to 0.
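
A minimal sketch of the full store-then-backpropagate pattern, again in plain Python with illustrative names:

from math import tanh

def reverse_grads(h0, w, b, n):
    # Forward pass: store every state h_0 .. h_n.
    hs = [h0]
    for _ in range(n):
        hs.append(tanh(w * hs[-1] + b))

    # Backward pass: seed the output adjoint, then walk t = n-1 down to 0.
    h_bar, w_bar, b_bar = 1.0, 0.0, 0.0
    for t in range(n - 1, -1, -1):
        a_bar = h_bar * (1.0 - hs[t + 1] ** 2)  # adjoint of a_t
        w_bar += a_bar * hs[t]
        b_bar += a_bar
        h_bar = a_bar * w                       # adjoint of h_t
    return hs[-1], w_bar, b_bar  # y, dy/dw, dy/db

For matching inputs, w_bar here should equal the tangent returned by the forward-mode sketch above.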

Backpropagation Through Time

Backpropagation through time is reverse-mode AD applied to an unrolled recurrence.

The word “time” does not require physical time. It refers to an ordered sequence of repeated computation steps.

Examples include:

Program structure        | Recurrence view
-------------------------|----------------------
Recurrent neural network | Hidden state update
Iterative solver         | Approximation update
Simulation               | State transition
Optimization loop        | Parameter update
Dynamic program          | Table update
Markov process           | State evolution

The same reverse-mode rule applies: store or reconstruct the states, then propagate adjoints backward through the unrolled computation.

Loop-Carried State

A loop-carried variable is a value that passes from one iteration to the next.

s = 0
for i in range(n):
    s = s + x[i] * x[i]
y = s

In single-assignment form:

s0 = 0
p0 = x0 * x0
s1 = s0 + p0

p1 = x1 * x1
s2 = s1 + p1

p2 = x2 * x2
s3 = s2 + p2

The adjoint of each state flows backward:

s3 -> s2 -> s1 -> s0

Each input receives its contribution:

\bar x_i += 2 x_i \bar s_{i+1}.

For this sum, \bar s_{i+1} = 1 for every i, so

\frac{\partial y}{\partial x_i} = 2 x_i.
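
A tiny sketch of this backward pass (illustrative names); since every addition passes the adjoint through unchanged, each input simply receives 2 x_i:

def sum_of_squares_and_grad(x):
    y = sum(v * v for v in x)
    s_bar = 1.0                        # adjoint of every partial sum
    x_bar = [2.0 * v * s_bar for v in x]
    return y, x_bar                    # x_bar[i] == 2 * x[i]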

Multiple State Variables

Loops often carry several variables.

a = a0
b = b0
for t in range(n):
    new_a = a + b
    new_b = a * b
    a = new_a
    b = new_b
y = a

The recurrence is

a_{t+1} = a_t + b_t, \qquad b_{t+1} = a_t b_t.

The Jacobian of one step is

\begin{bmatrix}
\partial a_{t+1}/\partial a_t & \partial a_{t+1}/\partial b_t \\
\partial b_{t+1}/\partial a_t & \partial b_{t+1}/\partial b_t
\end{bmatrix}
=
\begin{bmatrix}
1 & 1 \\
b_t & a_t
\end{bmatrix}.

Forward mode multiplies tangents by this Jacobian at each step. Reverse mode multiplies adjoints by its transpose.
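
A sketch of one step in each mode, with illustrative names; forward mode applies the Jacobian, reverse mode applies its transpose:

def step(a, b):
    return a + b, a * b                 # a_{t+1}, b_{t+1}

def step_jvp(a, b, da, db):
    # Tangent times the Jacobian [[1, 1], [b, a]].
    return da + db, b * da + a * db

def step_vjp(a, b, a_bar, b_bar):
    # Adjoint times the transposed Jacobian [[1, b], [1, a]].
    return a_bar + b * b_bar, a_bar + a * b_bar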

Loops With Arrays

Array loops introduce indexed dependencies.

for i in range(n):
    y[i] = x[i] * x[i]

Each output depends only on the corresponding input:

y_i = x_i^2.

The Jacobian is diagonal.

By contrast:

s = 0
for i in range(n):
    s = s + x[i]
    y[i] = s

defines a prefix sum:

y_i = \sum_{j=0}^{i} x_j.

Here, each output y_i depends on every input up to index i. The Jacobian is lower triangular.

AD does not require the programmer to build either Jacobian explicitly; it propagates derivatives through the loop structure itself.
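
For the prefix sum, the transposed lower-triangular Jacobian turns the backward pass into a suffix sum: each input adjoint collects the output adjoints at or after its index. A minimal sketch with illustrative names:

def prefix_sum_vjp(y_bar):
    # x_bar[i] = sum of y_bar[j] for j >= i (reverse cumulative sum).
    x_bar = []
    acc = 0.0
    for g in reversed(y_bar):
        acc += g
        x_bar.append(acc)
    x_bar.reverse()
    return x_bar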

Dynamic Loop Counts

A loop may stop based on runtime data.

while norm(r) > eps:
    x = update(x)
    r = residual(x)

For one execution, the loop runs k iterations. AD differentiates the trace of length k.

This derivative is valid under the assumption that small input perturbations keep the same number of iterations. Near a point where the stopping count changes, the derivative may be discontinuous or misleading.

This is common in solvers, simulations, and optimization routines.
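
As an illustrative sketch (the square-root iteration is an example chosen here, not taken from the text above), forward mode can carry a tangent through however many iterations the loop happens to run:

def sqrt_and_dc(c, eps=1e-12):
    # Babylonian iteration x_{k+1} = (x_k + c / x_k) / 2 for sqrt(c), c > 0,
    # carrying dx/dc through the k iterations of this particular run.
    x, dx = c, 1.0                 # initial guess x = c, so dx/dc = 1
    while abs(x * x - c) > eps * c:
        dx = (dx + (x - c * dx) / (x * x)) / 2.0   # tangent of the update
        x = (x + c / x) / 2.0
    return x, dx

# sqrt_and_dc(4.0) returns about (2.0, 0.25), matching d sqrt(c)/dc = 1/(2 sqrt(c)).

The tangent is the derivative of the k-step trace that actually executed, which is exactly the caveat above.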

Memory Cost of Reverse Mode

Reverse mode usually needs primal values from each iteration.

For

h_{t+1} = F(h_t, \theta),

the backward step at time t often needs h_t, h_{t+1}, and intermediate values inside F.

If the loop has n iterations and the state size is S, storing all states costs

O(nS)

memory.

This is often the limiting cost in long sequences and simulations.

Recompute Instead of Store

A system can reduce memory by recomputing states during the backward pass.

Instead of storing

h0, h1, h2, ..., hn

it may store only checkpoints:

h0, h100, h200, ..., hn

During backward execution, it recomputes missing states from the nearest checkpoint.

This trades extra compute for lower memory. The technique is essential for long recurrent networks, neural ODEs, differentiable physics, and iterative solvers.
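
A sketch of this trade for the tanh recurrence above, assuming a fixed checkpoint spacing (the spacing and names are illustrative):

from math import tanh

def checkpointed_grad(h0, w, b, n, every=100):
    def step(h):
        return tanh(w * h + b)

    # Forward pass: keep only h_0 and every `every`-th state.
    checkpoints = {0: h0}
    h = h0
    for t in range(1, n + 1):
        h = step(h)
        if t % every == 0:
            checkpoints[t] = h
    y = h

    # Backward pass: recompute each segment from its checkpoint.
    h_bar, w_bar = 1.0, 0.0
    seg_end = n
    for seg_start in sorted(checkpoints, reverse=True):
        if seg_start >= seg_end:
            continue
        hs = [checkpoints[seg_start]]
        for _ in range(seg_end - seg_start):
            hs.append(step(hs[-1]))
        for i in range(len(hs) - 2, -1, -1):
            a_bar = h_bar * (1.0 - hs[i + 1] ** 2)
            w_bar += a_bar * hs[i]
            h_bar = a_bar * w
        seg_end = seg_start
    return y, w_bar  # dy/dw; the b gradient is analogous

Peak storage drops from n states to roughly n/every checkpoints plus one segment, at the cost of recomputing each segment once.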

Loop Fusion and Differentiation

Compilers may transform loops before or after differentiation.

For example:

for i in range(n):
    y[i] = sin(x[i])

for i in range(n):
    z[i] = y[i] + 1

can be fused into:

for i in range(n):
    z[i] = sin(x[i]) + 1

Differentiation interacts with such transformations. A compiler must preserve the primal result and the derivative result.

In many cases, differentiating after optimization gives faster derivative code. In other cases, differentiating before optimization exposes more structure to the optimizer.

Recurrences and Stability

Repeated multiplication by local Jacobians can amplify or shrink derivatives.

For a scalar recurrence

h_{t+1} = F(h_t),

the derivative is

\frac{\partial h_n}{\partial h_0} = \prod_{t=0}^{n-1} F'(h_t).

If the factors tend to have magnitude greater than 1, gradients may explode. If they tend to have magnitude less than 1, gradients may vanish.

This is a mathematical property of the recurrence, not only an implementation problem.
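
A short sketch that tracks this product for the tanh recurrence (illustrative; since the derivative of tanh is at most 1, any |w| < 1 forces every factor below 1 in magnitude):

from math import tanh

def grad_h0(h0, w, b, n):
    # dh_n/dh_0 as a running product of per-step derivatives F'(h_t).
    h, grad = h0, 1.0
    for _ in range(n):
        h = tanh(w * h + b)          # h is now h_{t+1}
        grad *= w * (1.0 - h * h)    # F'(h_t) = w (1 - h_{t+1}^2)
    return grad

# grad_h0(0.5, 0.9, 0.0, 50) is already near zero: a vanishing gradient.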

Core Idea

A loop is a compact syntax for a repeated dependency graph. For a fixed execution, AD can unroll the loop into a straight-line trace and apply the same derivative propagation rules as before.

Forward mode carries tangents through each iteration. Reverse mode stores or reconstructs loop states and propagates adjoints backward. The main difficulties are dynamic stopping conditions, memory cost, and numerical stability.