Loops and Recurrence Relations

Loops express repeated computation. Recurrence relations express the same idea mathematically: each state is computed from one or more earlier states.

A loop such as

h = h0
for t in range(n):
    h = tanh(w * h + b)
y = h

corresponds to the recurrence

h_{t+1} = \tanh(w h_t + b), \qquad y = h_n.

Automatic differentiation treats this as a sequence of dependent operations. For a fixed number of iterations, the loop can be expanded into a straight-line program.

Unrolling a Loop

For n = 3, the loop becomes:

h0 = input
a0 = w * h0 + b
h1 = tanh(a0)

a1 = w * h1 + b
h2 = tanh(a1)

a2 = w * h2 + b
h3 = tanh(a2)

y = h3

The dependency chain is:

h0 -> a0 -> h1 -> a1 -> h2 -> a2 -> h3 -> y

Forward mode moves left to right. Reverse mode moves right to left.

Forward Differentiation Through a Recurrence

Let

h_{t+1} = F(h_t, \theta),

where θ is a parameter. Forward mode propagates both the state and its tangent.

If we differentiate with respect to θ, define

s_t = \frac{\partial h_t}{\partial \theta}.

Then

s_{t+1} = \frac{\partial F}{\partial h_t} s_t + \frac{\partial F}{\partial \theta}.

For

h_{t+1} = \tanh(w h_t + b),

with respect to w:

s_{t+1} = (1 - h_{t+1}^2)(h_t + w s_t).

Forward mode computes this during the same pass as the original loop.
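
As a concrete sketch, here is the same computation in plain Python (the function name and the finite-difference check are illustrative, not part of any library API):

from math import tanh

def value_and_tangent(h0, w, b, n):
    # Runs h_{t+1} = tanh(w h_t + b) while carrying s_t = dh_t/dw.
    h = h0
    s = 0.0  # h0 does not depend on w, so s_0 = 0
    for _ in range(n):
        h_new = tanh(w * h + b)
        s = (1.0 - h_new * h_new) * (h + w * s)  # s_{t+1}, using old h = h_t
        h = h_new
    return h, s  # y = h_n and dy/dw

Comparing the returned tangent against a finite difference, such as perturbing w by 1e-6 and dividing the change in y by 1e-6, should agree to several digits.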

Reverse Differentiation Through a Recurrence

Reverse mode starts from the final output and propagates adjoints backward through time.

For

h_{t+1} = F(h_t, \theta),

the adjoint of h_t receives

\bar h_t += \bar h_{t+1} \frac{\partial F}{\partial h_t}.

The parameter adjoint receives

\bar \theta += \bar h_{t+1} \frac{\partial F}{\partial \theta}.

For

h_{t+1} = \tanh(w h_t + b),

let

a_t = w h_t + b, \qquad h_{t+1} = \tanh(a_t).

Then

\bar a_t = \bar h_{t+1}(1 - h_{t+1}^2),

\bar w += \bar a_t h_t, \qquad \bar b += \bar a_t, \qquad \bar h_t += \bar a_t w.

These rules are applied from t = n-1 down to 0.
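
A minimal sketch of the full store-then-backpropagate pattern, again in plain Python with illustrative names:

from math import tanh

def reverse_grads(h0, w, b, n):
    # Forward pass: store every state h_0 .. h_n.
    hs = [h0]
    for _ in range(n):
        hs.append(tanh(w * hs[-1] + b))

    # Backward pass: seed the output adjoint, then walk t = n-1 down to 0.
    h_bar, w_bar, b_bar = 1.0, 0.0, 0.0
    for t in range(n - 1, -1, -1):
        a_bar = h_bar * (1.0 - hs[t + 1] ** 2)  # adjoint of a_t
        w_bar += a_bar * hs[t]
        b_bar += a_bar
        h_bar = a_bar * w                       # adjoint of h_t
    return hs[-1], w_bar, b_bar  # y, dy/dw, dy/db

For matching inputs, w_bar here should equal the tangent returned by the forward-mode sketch above.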

Backpropagation Through Time

Backpropagation through time is reverse-mode AD applied to an unrolled recurrence.

The word “time” does not require physical time. It refers to an ordered sequence of repeated computation steps.

Examples include:

Program structure        | Recurrence view
-------------------------|----------------------
Recurrent neural network | Hidden state update
Iterative solver         | Approximation update
Simulation               | State transition
Optimization loop        | Parameter update
Dynamic program          | Table update
Markov process           | State evolution

The same reverse-mode rule applies: store or reconstruct the states, then propagate adjoints backward through the unrolled computation.

Loop-Carried State

A loop-carried variable is a value that passes from one iteration to the next.

s = 0
for i in range(n):
    s = s + x[i] * x[i]
y = s

In single-assignment form:

s0 = 0
p0 = x0 * x0
s1 = s0 + p0

p1 = x1 * x1
s2 = s1 + p1

p2 = x2 * x2
s3 = s2 + p2

The adjoint of each state flows backward:

s3 -> s2 -> s1 -> s0

Each input receives its contribution:

\bar x_i += 2 x_i \bar s_{i+1}.

For this sum, \bar s_{i+1} = 1 for every i, so

\frac{\partial y}{\partial x_i} = 2 x_i.
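
A tiny sketch of this backward pass (illustrative names); since every addition passes the adjoint through unchanged, each input simply receives 2 x_i:

def sum_of_squares_and_grad(x):
    y = sum(v * v for v in x)
    s_bar = 1.0                        # adjoint of every partial sum
    x_bar = [2.0 * v * s_bar for v in x]
    return y, x_bar                    # x_bar[i] == 2 * x[i]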

Multiple State Variables

Loops often carry several variables.

a = a0
b = b0
for t in range(n):
    new_a = a + b
    new_b = a * b
    a = new_a
    b = new_b
y = a

The recurrence is

a_{t+1} = a_t + b_t, \qquad b_{t+1} = a_t b_t.

The Jacobian of one step is

\begin{bmatrix}
\partial a_{t+1}/\partial a_t & \partial a_{t+1}/\partial b_t \\
\partial b_{t+1}/\partial a_t & \partial b_{t+1}/\partial b_t
\end{bmatrix}
=
\begin{bmatrix}
1 & 1 \\
b_t & a_t
\end{bmatrix}.

Forward mode multiplies tangents by this Jacobian at each step. Reverse mode multiplies adjoints by its transpose.
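
A sketch of one step in each mode, with illustrative names; forward mode applies the Jacobian, reverse mode applies its transpose:

def step(a, b):
    return a + b, a * b                 # a_{t+1}, b_{t+1}

def step_jvp(a, b, da, db):
    # Tangent times the Jacobian [[1, 1], [b, a]].
    return da + db, b * da + a * db

def step_vjp(a, b, a_bar, b_bar):
    # Adjoint times the transposed Jacobian [[1, b], [1, a]].
    return a_bar + b * b_bar, a_bar + a * b_bar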

Loops With Arrays

Array loops introduce indexed dependencies.

for i in range(n):
    y[i] = x[i] * x[i]

Each output depends only on the corresponding input:

y_i = x_i^2.

The Jacobian is diagonal.

By contrast:

s = 0
for i in range(n):
    s = s + x[i]
    y[i] = s

defines a prefix sum:

y_i = \sum_{j=0}^{i} x_j.

Here, each output y_i depends on every input up to index i. The Jacobian is lower triangular.

AD does not require the programmer to build either Jacobian explicitly; it propagates derivatives through the loop structure itself.
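
For the prefix sum, the transposed lower-triangular Jacobian turns the backward pass into a suffix sum: each input adjoint collects the output adjoints at or after its index. A minimal sketch with illustrative names:

def prefix_sum_vjp(y_bar):
    # x_bar[i] = sum of y_bar[j] for j >= i (reverse cumulative sum).
    x_bar = []
    acc = 0.0
    for g in reversed(y_bar):
        acc += g
        x_bar.append(acc)
    x_bar.reverse()
    return x_bar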

Dynamic Loop Counts

A loop may stop based on runtime data.

while norm(r) > eps:
    x = update(x)
    r = residual(x)

For one execution, the loop runs k iterations. AD differentiates the trace of length k.

This derivative is valid under the assumption that small input perturbations keep the same number of iterations. Near a point where the stopping count changes, the derivative may be discontinuous or misleading.

This is common in solvers, simulations, and optimization routines.
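
As an illustrative sketch (the square-root iteration is an example chosen here, not taken from the text above), forward mode can carry a tangent through however many iterations the loop happens to run:

def sqrt_and_dc(c, eps=1e-12):
    # Babylonian iteration x_{k+1} = (x_k + c / x_k) / 2 for sqrt(c), c > 0,
    # carrying dx/dc through the k iterations of this particular run.
    x, dx = c, 1.0                 # initial guess x = c, so dx/dc = 1
    while abs(x * x - c) > eps * c:
        dx = (dx + (x - c * dx) / (x * x)) / 2.0   # tangent of the update
        x = (x + c / x) / 2.0
    return x, dx

# sqrt_and_dc(4.0) returns about (2.0, 0.25), matching d sqrt(c)/dc = 1/(2 sqrt(c)).

The tangent is the derivative of the k-step trace that actually executed, which is exactly the caveat above.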

Memory Cost of Reverse Mode

Reverse mode usually needs primal values from each iteration.

For

h_{t+1} = F(h_t, \theta),

the backward step at time t often needs h_t, h_{t+1}, and intermediate values inside F.

If the loop has n iterations and the state size is S, storing all states costs

O(nS)

memory.

This is often the limiting cost in long sequences and simulations.

Recompute Instead of Store

A system can reduce memory by recomputing states during the backward pass.

Instead of storing

h0, h1, h2, ..., hn

it may store only checkpoints:

h0, h100, h200, ..., hn

During backward execution, it recomputes missing states from the nearest checkpoint.

This trades extra compute for lower memory. The technique is essential for long recurrent networks, neural ODEs, differentiable physics, and iterative solvers.
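
A sketch of this trade for the tanh recurrence above, assuming a fixed checkpoint spacing (the spacing and names are illustrative):

from math import tanh

def checkpointed_grad(h0, w, b, n, every=100):
    def step(h):
        return tanh(w * h + b)

    # Forward pass: keep only h_0 and every `every`-th state.
    checkpoints = {0: h0}
    h = h0
    for t in range(1, n + 1):
        h = step(h)
        if t % every == 0:
            checkpoints[t] = h
    y = h

    # Backward pass: recompute each segment from its checkpoint.
    h_bar, w_bar = 1.0, 0.0
    seg_end = n
    for seg_start in sorted(checkpoints, reverse=True):
        if seg_start >= seg_end:
            continue
        hs = [checkpoints[seg_start]]
        for _ in range(seg_end - seg_start):
            hs.append(step(hs[-1]))
        for i in range(len(hs) - 2, -1, -1):
            a_bar = h_bar * (1.0 - hs[i + 1] ** 2)
            w_bar += a_bar * hs[i]
            h_bar = a_bar * w
        seg_end = seg_start
    return y, w_bar  # dy/dw; the b gradient is analogous

Peak storage drops from n states to roughly n/every checkpoints plus one segment, at the cost of recomputing each segment once.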

Loop Fusion and Differentiation

Compilers may transform loops before or after differentiation.

For example:

for i in range(n):
    y[i] = sin(x[i])

for i in range(n):
    z[i] = y[i] + 1

can be fused into:

for i in range(n):
    z[i] = sin(x[i]) + 1

Differentiation interacts with such transformations. A compiler must preserve the primal result and the derivative result.

In many cases, differentiating after optimization gives faster derivative code. In other cases, differentiating before optimization exposes more structure to the optimizer.

Recurrences and Stability

Repeated multiplication by local Jacobians can amplify or shrink derivatives.

For a scalar recurrence

h_{t+1} = F(h_t),

the derivative is

\frac{\partial h_n}{\partial h_0} = \prod_{t=0}^{n-1} F'(h_t).

If the factors tend to have magnitude greater than 1, gradients may explode. If they tend to have magnitude less than 1, gradients may vanish.

This is a mathematical property of the recurrence, not only an implementation problem.
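
A short sketch that tracks this product for the tanh recurrence (illustrative; since the derivative of tanh is at most 1, any |w| < 1 forces every factor below 1 in magnitude):

from math import tanh

def grad_h0(h0, w, b, n):
    # dh_n/dh_0 as a running product of per-step derivatives F'(h_t).
    h, grad = h0, 1.0
    for _ in range(n):
        h = tanh(w * h + b)          # h is now h_{t+1}
        grad *= w * (1.0 - h * h)    # F'(h_t) = w (1 - h_{t+1}^2)
    return grad

# grad_h0(0.5, 0.9, 0.0, 50) is already near zero: a vanishing gradient.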

Core Idea

A loop is a compact syntax for a repeated dependency graph. For a fixed execution, AD can unroll the loop into a straight-line trace and apply the same derivative propagation rules as before.

Forward mode carries tangents through each iteration. Reverse mode stores or reconstructs loop states and propagates adjoints backward. The main difficulties are dynamic stopping conditions, memory cost, and numerical stability.