# Loops and Recurrence Relations

Loops express repeated computation. Recurrence relations express the same idea mathematically: each state is computed from one or more earlier states.

A loop such as

```text
h = h0
for t in range(n):
    h = tanh(w * h + b)
y = h
```

corresponds to the recurrence

$$
h_{t+1} = \tanh(w h_t + b),
\qquad
y = h_n.
$$

Automatic differentiation treats this as a sequence of dependent operations. For a fixed number of iterations, the loop can be expanded into a straight-line program.

## Unrolling a Loop

For $n = 3$, the loop becomes:

```text
h0 = input
a0 = w * h0 + b
h1 = tanh(a0)

a1 = w * h1 + b
h2 = tanh(a1)

a2 = w * h2 + b
h3 = tanh(a2)

y = h3
```

The dependency chain is:

```text
h0 -> a0 -> h1 -> a1 -> h2 -> a2 -> h3 -> y
```

Forward mode moves left to right. Reverse mode moves right to left.

## Forward Differentiation Through a Recurrence

Let

$$
h_{t+1} = F(h_t, \theta),
$$

where $\theta$ is a parameter. Forward mode propagates both the state and its tangent.

To differentiate with respect to $\theta$, define

$$
s_t = \frac{\partial h_t}{\partial \theta}.
$$

Then

$$
s_{t+1} =
\frac{\partial F}{\partial h_t}s_t
+
\frac{\partial F}{\partial \theta}.
$$

For

$$
h_{t+1} = \tanh(w h_t + b),
$$

with respect to $w$:

$$
s_{t+1} =
(1 - h_{t+1}^2)(h_t + w s_t).
$$

Forward mode computes this during the same pass as the original loop.
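
As a concrete sketch, the tangent update can ride along inside the loop body. The code below assumes scalar values, and the function name is ours, chosen for illustration; it implements the tangent recurrence above, nothing more.

```python
import math

def forward_with_tangent(h0, w, b, n):
    # Carry the state h_t and its tangent s_t = dh_t/dw in a single pass.
    h, s = h0, 0.0  # s_0 = 0 because h_0 does not depend on w
    for _ in range(n):
        h_next = math.tanh(w * h + b)
        s = (1.0 - h_next**2) * (h + w * s)  # tangent recurrence from above
        h = h_next
    return h, s  # y = h_n and dy/dw
```

A central finite difference in $w$ should match the returned tangent to several digits.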

## Reverse Differentiation Through a Recurrence

Reverse mode starts from the final output and propagates adjoints backward through time.

For

$$
h_{t+1} = F(h_t, \theta),
$$

the adjoint of $h_t$ receives

$$
\bar h_t
+=
\bar h_{t+1}
\frac{\partial F}{\partial h_t}.
$$

The parameter adjoint receives

$$
\bar \theta
+=
\bar h_{t+1}
\frac{\partial F}{\partial \theta}.
$$

For

$$
h_{t+1} = \tanh(w h_t + b),
$$

let

$$
a_t = w h_t + b,
\qquad
h_{t+1} = \tanh(a_t).
$$

Then

$$
\bar a_t = \bar h_{t+1}(1 - h_{t+1}^2),
$$

$$
\bar w += \bar a_t h_t,
\qquad
\bar b += \bar a_t,
\qquad
\bar h_t += \bar a_t w.
$$

These rules are applied from $t = n-1$ down to $0$.
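
A minimal sketch of this backward sweep for the scalar tanh recurrence (the function name and structure are ours, for illustration):

```python
import math

def reverse_through_loop(h0, w, b, n):
    # Forward pass: record every state h_0 .. h_n for the backward pass.
    hs = [h0]
    for _ in range(n):
        hs.append(math.tanh(w * hs[-1] + b))

    # Backward pass: seed with dy/dh_n = 1, then apply the adjoint rules
    # from t = n-1 down to 0.
    h_bar, w_bar, b_bar = 1.0, 0.0, 0.0
    for t in range(n - 1, -1, -1):
        a_bar = h_bar * (1.0 - hs[t + 1] ** 2)  # adjoint of a_t = w*h_t + b
        w_bar += a_bar * hs[t]
        b_bar += a_bar
        h_bar = a_bar * w  # adjoint flowing back to h_t
    return hs[-1], w_bar, b_bar, h_bar  # y, dy/dw, dy/db, dy/dh0
```

Note that the backward loop reads the stored states `hs[t]` and `hs[t + 1]`; this is the memory cost discussed later in this section.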

## Backpropagation Through Time

Backpropagation through time is reverse-mode AD applied to an unrolled recurrence.

The word “time” does not require physical time. It refers to an ordered sequence of repeated computation steps.

Examples include:

| Program structure | Recurrence view |
|---|---|
| Recurrent neural network | Hidden state update |
| Iterative solver | Approximation update |
| Simulation | State transition |
| Optimization loop | Parameter update |
| Dynamic program | Table update |
| Markov process | State evolution |

The same reverse-mode rule applies: store or reconstruct the states, then propagate adjoints backward through the unrolled computation.

## Loop-Carried State

A loop-carried variable is a value that passes from one iteration to the next.

```text
s = 0
for i in range(n):
    s = s + x[i] * x[i]
y = s
```

In single-assignment form:

```text
s0 = 0
p0 = x0 * x0
s1 = s0 + p0

p1 = x1 * x1
s2 = s1 + p1

p2 = x2 * x2
s3 = s2 + p2

y = s3
```

The adjoint of each state flows backward:

```text
s3 -> s2 -> s1 -> s0
```

Each input receives its contribution:

$$
\bar x_i += 2x_i \bar s_{i+1}.
$$

For this sum, $\bar s_{i+1}=1$, so

$$
\frac{\partial y}{\partial x_i}=2x_i.
$$
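
The same pattern in code, as a sketch with hypothetical names:

```python
def sum_of_squares_with_grad(x):
    # Forward pass: the loop-carried sum.
    y = 0.0
    for xi in x:
        y += xi * xi

    # Backward pass: the adjoint of each partial sum s_{i+1} is 1,
    # because addition passes adjoints through unchanged.
    s_bar = 1.0
    x_bar = [2.0 * xi * s_bar for xi in x]  # x_bar[i] = 2*x[i]*s_bar
    return y, x_bar
```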

## Multiple State Variables

Loops often carry several variables.

```text
a = a0
b = b0
for t in range(n):
    new_a = a + b
    new_b = a * b
    a = new_a
    b = new_b
y = a
```

The recurrence is

$$
a_{t+1} = a_t + b_t,
$$

$$
b_{t+1} = a_t b_t.
$$

The Jacobian of one step is

$$
\begin{bmatrix}
\partial a_{t+1}/\partial a_t & \partial a_{t+1}/\partial b_t \\
\partial b_{t+1}/\partial a_t & \partial b_{t+1}/\partial b_t
\end{bmatrix} =
\begin{bmatrix}
1 & 1 \\
b_t & a_t
\end{bmatrix}.
$$

Forward mode multiplies tangents by this Jacobian at each step. Reverse mode multiplies adjoints by its transpose.
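
One step of each mode, written out as a sketch (the function names are ours):

```python
def step(a, b):
    # One iteration: a_{t+1} = a_t + b_t, b_{t+1} = a_t * b_t.
    return a + b, a * b

def step_jvp(a, b, da, db):
    # Forward mode: multiply the tangent (da, db) by the step Jacobian
    # [[1, 1], [b, a]] evaluated at the current state.
    return da + db, b * da + a * db

def step_vjp(a, b, a_bar_next, b_bar_next):
    # Reverse mode: multiply the adjoint by the transposed Jacobian
    # [[1, b], [1, a]]. This needs the primal state (a, b) of the same
    # step, which is why reverse mode stores or recomputes states.
    return a_bar_next + b * b_bar_next, a_bar_next + a * b_bar_next
```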

## Loops With Arrays

Array loops introduce indexed dependencies.

```text
for i in range(n):
    y[i] = x[i] * x[i]
```

Each output depends only on the corresponding input:

$$
y_i = x_i^2.
$$

The Jacobian is diagonal.

By contrast:

```text
s = 0
for i in range(n):
    s = s + x[i]
    y[i] = s
```

defines a prefix sum:

$$
y_i = \sum_{j=0}^{i} x_j.
$$

Here, each output $y_i$ depends on many earlier inputs. The Jacobian is lower triangular.

AD does not require the programmer to build this Jacobian explicitly. It propagates derivatives through the loop structure itself.
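
For the prefix sum, the reverse rule has a simple closed form: multiplying by the transposed lower-triangular Jacobian turns the forward cumulative sum into a backward one. A sketch:

```python
def prefix_sum_vjp(y_bar):
    # Since y_i = x_0 + ... + x_i, input x_j feeds every output with i >= j,
    # so x_bar[j] = y_bar[j] + y_bar[j+1] + ... + y_bar[n-1]: a suffix sum.
    x_bar = [0.0] * len(y_bar)
    acc = 0.0
    for i in range(len(y_bar) - 1, -1, -1):  # walk the loop backward
        acc += y_bar[i]
        x_bar[i] = acc
    return x_bar
```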

## Dynamic Loop Counts

A loop may stop based on runtime data.

```text
while norm(r) > eps:
    x = update(x)
    r = residual(x)
```

In any single execution, the loop runs some definite number of iterations, say $k$. AD differentiates the recorded trace of length $k$.

This derivative is valid under the assumption that small input perturbations keep the same number of iterations. Near a point where the stopping count changes, the derivative may be discontinuous or misleading.

This is common in solvers, simulations, and optimization routines.
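
As a sketch, consider differentiating through the Babylonian square-root iteration, a while loop whose count depends on the input (the code and names are illustrative, not a library API):

```python
def sqrt_with_grad(c, eps=1e-12):
    # Forward pass: record the trace of x_{t+1} = (x_t + c/x_t)/2 until
    # convergence. The trace length k is data-dependent. Assumes c > 0.
    xs = [c]  # x_0 = c
    while abs(xs[-1] * xs[-1] - c) > eps:
        xs.append(0.5 * (xs[-1] + c / xs[-1]))

    # Backward pass over the k recorded steps.
    x_bar, c_bar = 1.0, 0.0
    for t in range(len(xs) - 2, -1, -1):
        x = xs[t]
        c_bar += x_bar * 0.5 / x            # d x_{t+1} / d c   = 1/(2 x_t)
        x_bar *= 0.5 * (1.0 - c / (x * x))  # d x_{t+1} / d x_t
    c_bar += x_bar  # x_0 = c contributes directly
    return xs[-1], c_bar  # approx sqrt(c) and approx 1/(2*sqrt(c))
```

The returned derivative approximates $1/(2\sqrt{c})$, and the caveat above applies: it is valid only while a small change in $c$ leaves the iteration count $k$ unchanged.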

## Memory Cost of Reverse Mode

Reverse mode usually needs primal values from each iteration.

For

$$
h_{t+1} = F(h_t,\theta),
$$

the backward step at time $t$ often needs $h_t$, $h_{t+1}$, and intermediate values inside $F$.

If the loop has $n$ iterations and the state size is $S$, storing all states costs

$$
O(nS)
$$

memory.

This is often the limiting cost in long sequences and simulations. For example, $n = 10^6$ steps with a state of $10^3$ double-precision values requires about 8 GB for the stored states alone.

## Recompute Instead of Store

A system can reduce memory by recomputing states during the backward pass.

Instead of storing

```text
h0, h1, h2, ..., hn
```

it may store only checkpoints:

```text
h0, h100, h200, ..., hn
```

During backward execution, it recomputes missing states from the nearest checkpoint.

This trades extra compute for lower memory. The technique is essential for long recurrent networks, neural ODEs, differentiable physics, and iterative solvers.
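
A sketch of uniform checkpointing for the tanh recurrence from earlier (the segment length, names, and structure are ours, chosen for illustration):

```python
import math

def bptt_checkpointed(h0, w, b, n, every=100):
    # Forward pass: store only h_0 and every `every`-th state (assumes n >= 1).
    ckpts = {0: h0}
    h = h0
    for t in range(n):
        h = math.tanh(w * h + b)
        if (t + 1) % every == 0:
            ckpts[t + 1] = h

    # Backward pass: handle segments from the last one to the first.
    h_bar, w_bar, b_bar = 1.0, 0.0, 0.0
    for start in range((n - 1) // every * every, -1, -every):
        end = min(start + every, n)
        # Recompute the states of this segment from its checkpoint.
        seg = [ckpts[start]]
        for _ in range(end - start):
            seg.append(math.tanh(w * seg[-1] + b))
        # Ordinary reverse sweep over the recomputed segment.
        for t in range(end - start - 1, -1, -1):
            a_bar = h_bar * (1.0 - seg[t + 1] ** 2)
            w_bar += a_bar * seg[t]
            b_bar += a_bar
            h_bar = a_bar * w
    return w_bar, b_bar, h_bar
```

Memory drops from $O(n)$ stored states to roughly $O(n/\text{every} + \text{every})$, at the price of recomputing each state once.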

## Loop Fusion and Differentiation

Compilers may transform loops before or after differentiation.

For example:

```text
for i in range(n):
    y[i] = sin(x[i])

for i in range(n):
    z[i] = y[i] + 1
```

can be fused into:

```text
for i in range(n):
    z[i] = sin(x[i]) + 1
```

Differentiation interacts with such transformations. A compiler must preserve the primal result and the derivative result.

In many cases, differentiating after optimization gives faster derivative code. In other cases, differentiating before optimization exposes more structure to the optimizer.
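
For instance, differentiating the fused loop directly yields a single loop that carries primal and tangent together. A sketch:

```python
import math

def fused_with_tangent(x, dx):
    # Forward mode through the fused loop z[i] = sin(x[i]) + 1:
    # the tangent dz[i] = cos(x[i]) * dx[i] rides along in the same pass.
    z, dz = [], []
    for xi, dxi in zip(x, dx):
        z.append(math.sin(xi) + 1.0)
        dz.append(math.cos(xi) * dxi)
    return z, dz
```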

## Recurrences and Stability

Repeated multiplication by local Jacobians can amplify or shrink derivatives.

For a scalar recurrence

$$
h_{t+1} = F(h_t),
$$

the derivative is

$$
\frac{\partial h_n}{\partial h_0} =
\prod_{t=0}^{n-1}
F'(h_t).
$$

If the factors tend to have magnitude greater than $1$, gradients may explode. If they tend to have magnitude less than $1$, gradients may vanish.

This is a mathematical property of the recurrence, not only an implementation problem.
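
The product can be computed alongside the loop itself. For the tanh recurrence, $F'(h_t) = w\,(1 - h_{t+1}^2)$, so a sketch looks like:

```python
import math

def state_gradient(h0, w, b, n):
    # Accumulate dh_n/dh_0 as the product of per-step derivatives.
    h, grad = h0, 1.0
    for _ in range(n):
        h = math.tanh(w * h + b)
        grad *= w * (1.0 - h * h)  # F'(h_t) = w * (1 - h_{t+1}^2)
    return grad
```

Since $|1 - h^2| \le 1$, any $|w| < 1$ forces the product toward zero as $n$ grows: vanishing gradients, without any implementation error.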

## Core Idea

A loop is a compact syntax for a repeated dependency graph. For a fixed execution, AD can unroll the loop into a straight-line trace and apply the same derivative propagation rules as before.

Forward mode carries tangents through each iteration. Reverse mode stores or reconstructs loop states and propagates adjoints backward. The main difficulties are dynamic stopping conditions, memory cost, and numerical stability.

