# Memory and State

Automatic differentiation operates on computations, but computations execute inside a memory model. Variables occupy storage locations, arrays are mutated, buffers are reused, and values may outlive the operations that created them. Differentiation therefore depends not only on algebraic structure, but also on program state.

A simple mathematical expression:

$$
y = x^2 + 1
$$

contains no explicit memory model. A program implementation does.

```text
tmp = x * x
y = tmp + 1
```

The variable `tmp` occupies storage. It has a lifetime. It may later be overwritten or freed. Reverse-mode AD may need its value long after the forward computation has moved on.

Memory management becomes part of differentiation semantics.

## State in Pure and Imperative Programs

A pure functional computation has no mutable state.

```text
v1 = x * x
v2 = v1 + 1
```

Each variable is assigned once and never changes.

An imperative computation may reuse storage:

```text
x = x * x
x = x + 1
```

The second program destroys the original value of `x`.

Mathematically the computations are equivalent. Operationally they differ because reverse mode may need the old value after it has been overwritten.

This is why many AD systems internally transform programs into single-assignment form.

## Mutable State

Mutation changes the contents of a memory location.

```text
a[0] = x
a[1] = y
a[0] = a[0] + 1
```

The value of `a[0]` changes over time.

A reverse-mode system must answer several questions:

| Question | Example |
|---|---|
| Which value existed at a given point? | old `a[0]` |
| Was the value overwritten? | yes |
| Do multiple variables alias the same buffer? | possible |
| Can the backward pass reconstruct the old state? | maybe |

Mutation complicates differentiation because the backward pass conceptually runs “backward in time.”

## Reverse Mode Requires Historical State

Consider:

```text
x = 2
y = x * x
```

The backward rule for multiplication uses the primal value:

$$
\bar x += 2x \bar y.
$$

If the original value of `x` has been overwritten, reverse mode cannot evaluate the derivative correctly unless the system has saved the value or can recompute it.
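
A minimal, self-contained sketch of this failure mode and its fix (plain Go, not any framework's API): the primal is snapshotted before an in-place square so the backward rule still has something to read.

```go
package main

import "fmt"

func main() {
	x := 2.0
	savedX := x // snapshot the primal before it is destroyed
	x = x * x   // forward step overwrites x with y = x*x

	// Backward step: with y-bar = 1, the rule x-bar += 2*x*y-bar must
	// read the *original* x, which now survives only in savedX.
	yBar := 1.0
	xBar := 2 * savedX * yBar
	fmt.Println("y =", x, "dy/dx =", xBar) // y = 4, dy/dx = 4
}
```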

This requirement extends to tensor programs:

```text
y = relu(x)
```

The backward rule depends on the sign of the original `x`:

$$
\bar x =
\bar y \odot 1_{x > 0}.
$$

The backward pass therefore needs the original activation pattern.
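
The sketch below (a minimal illustration, not a particular framework's kernel) shows one way to satisfy this: the forward function returns its input alongside the output so it can be saved, and the backward function reads the saved input to rebuild the mask of positive entries.

```go
package ad

// reluForward computes y = relu(x) and also returns x itself,
// which the caller is expected to save for the backward pass.
func reluForward(x []float64) (y, saved []float64) {
	y = make([]float64, len(x))
	for i, v := range x {
		if v > 0 {
			y[i] = v
		}
	}
	return y, x
}

// reluBackward propagates gradients only where the saved input was
// positive, i.e. it applies the original activation pattern.
func reluBackward(gradY, savedX []float64) []float64 {
	gradX := make([]float64, len(gradY))
	for i, g := range gradY {
		if savedX[i] > 0 {
			gradX[i] = g
		}
	}
	return gradX
}
```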

## State as a Time Sequence

A mutable variable can be modeled as a sequence of immutable states.

Imperative form:

```text
x = 1
x = x + 2
x = x * 3
```

Equivalent single-assignment form:

```text
x0 = 1
x1 = x0 + 2
x2 = x1 * 3
```

The apparent mutation becomes an explicit dependency chain.

This transformation is central to compiler-based AD systems.

## Static Single Assignment Form

Static single assignment form, usually called SSA, assigns each variable exactly once.

Instead of:

```text
x = x + 1
x = x * 2
```

SSA form becomes:

```text
x1 = x0 + 1
x2 = x1 * 2
```

Advantages for AD:

| Property | Benefit |
|---|---|
| Explicit dependencies | Easier graph construction |
| No overwritten values | Reverse mode can reference old states |
| Simpler analysis | Easier optimization |
| Cleaner lifetime analysis | Better memory reuse |

Modern compiler IRs commonly use SSA because it aligns naturally with dependency graphs and differentiation.
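
To make the renaming concrete, here is a small sketch of SSA conversion for straight-line code (no control flow, so no phi nodes); the `Instr` type and naming scheme are illustrative assumptions, not a real compiler's IR.

```go
package main

import "fmt"

// Instr is a straight-line three-address instruction.
type Instr struct {
	Dst  string   // variable written
	Op   string   // operation name
	Args []string // variables read
}

// renameToSSA gives every definition a fresh version and points every
// use at the latest version seen so far, so no value is ever overwritten.
func renameToSSA(prog []Instr) []Instr {
	version := map[string]int{}
	name := func(v string) string { return fmt.Sprintf("%s%d", v, version[v]) }

	out := make([]Instr, 0, len(prog))
	for _, in := range prog {
		args := make([]string, len(in.Args))
		for i, a := range in.Args {
			args[i] = name(a) // uses read the current version
		}
		version[in.Dst]++ // a definition creates a new version
		out = append(out, Instr{Dst: name(in.Dst), Op: in.Op, Args: args})
	}
	return out
}

func main() {
	prog := []Instr{
		{Dst: "x", Op: "add1", Args: []string{"x"}}, // x = x + 1
		{Dst: "x", Op: "mul2", Args: []string{"x"}}, // x = x * 2
	}
	for _, in := range renameToSSA(prog) {
		fmt.Println(in.Dst, "=", in.Op, in.Args) // x1 = add1 [x0]; x2 = mul2 [x1]
	}
}
```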

## Aliasing

Aliasing occurs when multiple references point to the same memory.

```text
a = buffer
b = buffer
a[0] = 1
```

Now `b[0]` also changes.

Aliasing complicates AD because mutations propagate indirectly through shared storage.

The backward pass must know:

| Issue | Consequence |
|---|---|
| Which references share memory? | affects derivative correctness |
| Which writes overwrite previous values? | affects replay |
| Which operations are in-place? | affects tape structure |

Tensor frameworks often track aliasing explicitly to prevent invalid gradient computation.
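
In Go the same effect is easy to demonstrate with slices, which share a backing array; this toy program mirrors the pseudocode above.

```go
package main

import "fmt"

func main() {
	buffer := []float64{0, 0, 0}
	a := buffer // a and b are two references to the same storage
	b := buffer

	a[0] = 1
	fmt.Println(b[0]) // prints 1: the write through a is visible through b
}
```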

## In-Place Operations

An in-place operation modifies existing storage.

```text
x += y
```

Instead of allocating a new tensor, the program overwrites `x`.

This is efficient for memory usage, but dangerous for reverse mode.

Suppose:

```text
x = relu(x)
```

The backward pass may need the original pre-ReLU values. If they were overwritten, the information is lost.

Many AD systems therefore:

| Strategy | Meaning |
|---|---|
| Forbid some in-place ops | preserve correctness |
| Copy before mutation | preserve old state |
| Track versions | detect invalid reuse |
| Recompute values | avoid storage |

Frameworks such as PyTorch include version counters to detect illegal in-place modifications during gradient recording.
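
A minimal sketch of the version-counter idea (the types and method names here are assumptions for illustration, not PyTorch's actual internals): every in-place write bumps a counter, and a value saved for backward remembers the counter it was saved at.

```go
package ad

// Tensor carries a version counter bumped on every in-place write.
type Tensor struct {
	Data    []float64
	Version int
}

// AddInPlace overwrites t.Data and records that a mutation happened.
func (t *Tensor) AddInPlace(other []float64) {
	for i := range t.Data {
		t.Data[i] += other[i]
	}
	t.Version++
}

// Saved remembers the version a tensor had when stored for backward.
type Saved struct {
	Tensor  *Tensor
	Version int
}

// Unpack returns the saved data only if no in-place write happened since
// the save; otherwise the backward pass would silently use wrong values.
func (s Saved) Unpack() ([]float64, bool) {
	if s.Tensor.Version != s.Version {
		return nil, false
	}
	return s.Tensor.Data, true
}
```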

## Tape Memory

Reverse mode often stores execution state in a tape.

A tape record may contain:

```go
type TapeNode struct {
    Op       Op
    Inputs   []VarID
    Outputs  []VarID
    Saved    []Value
}
```

The `Saved` values contain the primal state needed for the backward rule.

For example:

```text
v = sin(a)
```

may save `a` (or a precomputed `cos(a)`) because the backward rule needs:

$$
\bar a += \bar v \cos(a).
$$

Without the saved value, reverse propagation cannot evaluate the derivative.
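
The sketch below shows how such a record might be replayed in reverse for `v = sin(a)`; the node layout is an assumption in the spirit of the `TapeNode` struct above, not a specific framework's tape format.

```go
package ad

import "math"

// VarID identifies a program variable on the tape.
type VarID int

// SinNode is one tape record for v = sin(a).
type SinNode struct {
	Input  VarID
	Output VarID
	SavedA float64 // primal value of a, captured during the forward pass
}

// Backward accumulates the adjoint of the input from the adjoint of the
// output using the saved primal: a-bar += v-bar * cos(a).
func (n SinNode) Backward(adjoint map[VarID]float64) {
	adjoint[n.Input] += adjoint[n.Output] * math.Cos(n.SavedA)
}
```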

## Activation Storage in Neural Networks

Large neural networks store many intermediate activations.

Example:

```text
v1 = matmul(x, w1)
v2 = relu(v1)
v3 = matmul(v2, w2)
v4 = relu(v3)
y  = loss(v4)
```

Reverse mode may need:

| Tensor | Needed for |
|---|---|
| `v1` | ReLU backward |
| `v2` | matmul backward |
| `v3` | ReLU backward |
| `v4` | loss backward |

For large models, activation storage dominates memory cost.

Training memory usage often scales approximately with:

$$
O(\text{number of activations})
$$

rather than parameter count alone.

## Recomputation

Instead of storing all intermediate states, a system can recompute them.

Suppose:

```text
v1 -> v2 -> v3 -> v4
```

Only `v1` and `v4` are stored.

During backward execution:

1. recompute `v2`,
2. recompute `v3`,
3. apply backward rules.

This reduces memory at the cost of additional computation.
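
A minimal sketch of this strategy, assuming a chain `v0 -> v1 -> ... -> vN` in which only every k-th value is kept and `step(j, v)` is a caller-supplied function computing `v_{j+1}` from `v_j`:

```go
package ad

// activationAt returns the value of v_i, either directly from the
// checkpoint map (which holds every k-th activation, including v_0)
// or by recomputing forward from the nearest earlier checkpoint.
func activationAt(i, k int, checkpoints map[int]float64, step func(j int, v float64) float64) float64 {
	if v, ok := checkpoints[i]; ok {
		return v // stored: no recomputation needed
	}
	start := (i / k) * k // index of the nearest earlier checkpoint
	v := checkpoints[start]
	for j := start; j < i; j++ {
		v = step(j, v) // replay the forward computation
	}
	return v
}
```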

The tradeoff is:

| Strategy | Memory | Compute |
|---|---|---|
| Store everything | high | low |
| Recompute aggressively | low | high |
| Checkpoint selectively | balanced | balanced |

## Stateful Randomness

Random number generation introduces hidden state.

```text
y = dropout(x)
```

The output depends on a random mask.

Backward propagation usually requires the same mask used during forward execution.

Systems therefore:

| Method | Purpose |
|---|---|
| Save random masks | exact replay |
| Save RNG seeds | deterministic regeneration |
| Use stateless RNGs | explicit dependency |

Randomness becomes part of differentiable state.
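
Here is a minimal sketch of the first strategy, saving the mask (inverted dropout with keep probability `1 - p`; the names and layout are illustrative, not a specific library's API):

```go
package ad

import "math/rand"

// dropoutForward zeroes each element with probability p, scales the
// survivors by 1/(1-p), and returns the mask as saved state.
func dropoutForward(x []float64, p float64, rng *rand.Rand) (out []float64, mask []bool) {
	out = make([]float64, len(x))
	mask = make([]bool, len(x))
	for i, v := range x {
		if rng.Float64() >= p {
			mask[i] = true
			out[i] = v / (1 - p)
		}
	}
	return out, mask
}

// dropoutBackward reuses the saved mask instead of drawing new randomness,
// so the gradient flows exactly where the forward pass kept values.
func dropoutBackward(gradOut []float64, mask []bool, p float64) []float64 {
	gradIn := make([]float64, len(gradOut))
	for i, g := range gradOut {
		if mask[i] {
			gradIn[i] = g / (1 - p)
		}
	}
	return gradIn
}
```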

## Stateful External Systems

Programs may interact with external systems:

```text
value = database.read(key)
```

or

```text
send_packet(x)
```

These operations may not be differentiable.

External state creates several problems:

| Problem | Reason |
|---|---|
| Non-repeatability | external state may change |
| Side effects | cannot easily reverse |
| Hidden dependencies | not represented in graph |
| Non-differentiability | discrete behavior |

AD systems often isolate differentiable kernels from external effects.

## Stateful Optimizers

Optimization algorithms themselves contain state.

Example:

```text
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad * grad
param = param - lr * m / sqrt(v)
```

The optimizer state evolves over time.

Differentiating through optimization therefore requires differentiating through the optimizer state transitions.
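
For a single scalar learning rate, unrolling one step of plain gradient descent already shows why: differentiating the update with respect to the learning rate pulls in both the gradient and the curvature at the previous parameter state,

$$
\theta_{t+1} = \theta_t - \eta \,\nabla L(\theta_t),
\qquad
\frac{\partial \theta_{t+1}}{\partial \eta}
= \frac{\partial \theta_t}{\partial \eta}
- \nabla L(\theta_t)
- \eta \,\nabla^2 L(\theta_t)\,\frac{\partial \theta_t}{\partial \eta},
$$

so the sensitivity to the learning rate accumulates recursively through every parameter state of the run.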

This appears in:

| Area | Example |
|---|---|
| Meta-learning | optimize the optimizer |
| Hyperparameter optimization | differentiate learning rates |
| Differentiable training loops | gradients through SGD |
| Neural architecture search | nested optimization |

The optimizer becomes part of the computational graph.

## Persistent Versus Ephemeral State

Some state exists only during one computation. Other state persists across executions.

| State type | Example |
|---|---|
| Ephemeral | temporary activation |
| Persistent | model parameters |
| Global | random generator |
| External | filesystem contents |

Persistent state creates long-lived dependency chains.

For example, a parameter update loop:

```text
theta_{t+1} = theta_t - lr * grad_t
```

creates a recurrence over training time itself.

## Memory Lifetime Analysis

A compiler can determine when a value is no longer needed.

Example:

```text
v1 = x * x
v2 = sin(v1)
v3 = v2 + 1
```

Once `v2` is computed, `v1` may no longer be needed in forward execution. But reverse mode may still require it.

Thus there are two lifetimes:

| Lifetime type | Purpose |
|---|---|
| Forward lifetime | primal execution |
| Reverse lifetime | backward execution |

Reverse-mode AD extends many variable lifetimes dramatically.

## Functionalization

Some AD systems transform mutable programs into purely functional representations.

Imperative form:

```text
x[0] += 1
```

Functionalized form:

```text
x2 = update(x1, 0, x1[0] + 1)
```

This removes mutation and makes dependencies explicit.
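
A minimal sketch of such an `update` helper in Go (a copy-on-write function assumed for illustration; a real compiler would often elide the copy when the old version is provably dead):

```go
package ad

// update returns a new buffer equal to x except at index, leaving the
// old version x intact so the backward pass can still refer to it.
func update(x []float64, index int, value float64) []float64 {
	out := make([]float64, len(x))
	copy(out, x)
	out[index] = value
	return out
}
```

With this helper, `x2 := update(x1, 0, x1[0]+1)` is the functionalized form of `x[0] += 1`.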

Functionalization simplifies:

| Task | Reason |
|---|---|
| Graph construction | immutable values |
| Reverse replay | no hidden overwrites |
| Parallel execution | fewer races |
| Compiler optimization | explicit dependencies |

Modern differentiable compilers increasingly use this approach internally.

## Core Idea

Automatic differentiation depends on program state because reverse propagation often requires historical values. Mutation, aliasing, in-place updates, random state, and external effects all influence whether the derivative computation is correct and reproducible.

A differentiable program is therefore not only a mathematical function. It is also a state transition system whose memory behavior determines how derivatives can be stored, reconstructed, and propagated backward through execution.

