
Memory and State

Automatic differentiation operates on computations, but computations execute inside a memory model. Variables occupy storage locations, arrays are mutated, buffers are reused, and values may outlive the operations that created them. Differentiation therefore depends not only on algebraic structure, but also on program state.

A simple mathematical expression:

y = x^2 + 1

contains no explicit memory model. A program implementation does.

tmp = x * x
y = tmp + 1

The variable tmp occupies storage. It has a lifetime. It may later be overwritten or freed. Reverse-mode AD may need its value long after the forward computation has moved on.

Memory management becomes part of differentiation semantics.

State in Pure and Imperative Programs

A pure functional computation has no mutable state.

v1 = x * x
v2 = v1 + 1

Each variable is assigned once and never changes.

An imperative computation may reuse storage:

x = x * x
x = x + 1

The second program destroys the original value of x.

Mathematically the computations are equivalent. Operationally they differ because reverse mode may need the old value after it has been overwritten.

This is why many AD systems internally transform programs into single-assignment form.
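A minimal sketch (not any framework's real API) makes the problem concrete for y = x^2 + 1: the backward rule needs the original x, so it must be saved before the in-place square destroys it.

```python
def forward_and_backward(x):
    saved_x = x            # preserve the pre-square value for the backward pass
    x = x * x              # imperative overwrite: the original x is gone
    y = x + 1
    grad_x = 2 * saved_x   # dy/dx = 2x, evaluated at the *original* x
    return y, grad_x
```

Had the gradient been computed from the overwritten `x`, it would evaluate `2 * x^2` instead of `2 * x`.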

Mutable State

Mutation changes the contents of a memory location.

a[0] = x
a[1] = y
a[0] = a[0] + 1

The value of a[0] changes over time.

A reverse-mode system must answer several questions:

| Question | Example |
|---|---|
| Which value existed at a given point? | old `a[0]` |
| Was the value overwritten? | yes |
| Do multiple variables alias the same buffer? | possible |
| Can the backward pass reconstruct the old state? | maybe |

Mutation complicates differentiation because the backward pass conceptually runs “backward in time.”

Reverse Mode Requires Historical State

Consider:

x = 2
y = x * x

The backward rule for multiplication uses the primal value:

\bar x += 2x \bar y.

If the original value of x has been overwritten, reverse mode cannot evaluate the derivative correctly unless that value was stored or can be recomputed.
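The multiplication rule can be sketched as a forward function that saves its primal inputs and a backward function that consumes them (illustrative names, not a real API):

```python
def mul_forward(a, b):
    saved = (a, b)            # primal inputs saved for the backward pass
    return a * b, saved

def mul_backward(grad_out, saved):
    a, b = saved
    # d(ab)/da = b and d(ab)/db = a, so each input's gradient
    # uses the *other* saved primal value
    return grad_out * b, grad_out * a
```

For y = x * x both operands are the same x, so the two contributions accumulate into x_bar = 2x * y_bar.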

This requirement extends to tensor programs:

y = relu(x)

The backward rule depends on the sign of the original x:

\bar x = \bar y \odot 1_{x > 0}.

The backward pass therefore needs the original activation pattern.
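A small sketch shows that the saved state is exactly the mask 1_{x>0}, not the full input:

```python
def relu_forward(xs):
    mask = [x > 0 for x in xs]    # the activation pattern to save
    ys = [x if m else 0.0 for x, m in zip(xs, mask)]
    return ys, mask

def relu_backward(grad_ys, mask):
    # gradient passes through only where the original x was positive
    return [g if m else 0.0 for g, m in zip(grad_ys, mask)]
```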

State as a Time Sequence

A mutable variable can be modeled as a sequence of immutable states.

Imperative form:

x = 1
x = x + 2
x = x * 3

Equivalent single-assignment form:

x0 = 1
x1 = x0 + 2
x2 = x1 * 3

The apparent mutation becomes an explicit dependency chain.

This transformation is central to compiler-based AD systems.
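The renaming step can be sketched as a toy transformation (a simplification of real SSA construction, which must also handle control flow): each assignment creates a fresh version of its target, and reads refer to the latest version.

```python
def to_single_assignment(stmts):
    version = {}
    out = []
    for target, tokens in stmts:     # e.g. ("x", ["x", "+", "2"])
        # reads use the current version of each variable
        renamed = [f"{t}{version[t]}" if t in version else t for t in tokens]
        # the write creates the next version of the target
        version[target] = version.get(target, -1) + 1
        out.append((f"{target}{version[target]}", renamed))
    return out
```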

Static Single Assignment Form

Static single assignment form, usually called SSA, assigns each variable exactly once.

Instead of:

x = x + 1
x = x * 2

SSA form becomes:

x1 = x0 + 1
x2 = x1 * 2

Advantages for AD:

| Property | Benefit |
|---|---|
| Explicit dependencies | Easier graph construction |
| No overwritten values | Reverse mode can reference old states |
| Simpler analysis | Easier optimization |
| Cleaner lifetime analysis | Better memory reuse |

Modern compiler IRs commonly use SSA because it aligns naturally with dependency graphs and differentiation.

Aliasing

Aliasing occurs when multiple references point to the same memory.

a = buffer
b = buffer
a[0] = 1

Now b[0] also changes.

Aliasing complicates AD because mutations propagate indirectly through shared storage.
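The example above is directly expressible in Python, where two names bound to the same list share storage:

```python
buffer = [0, 0]
a = buffer
b = buffer        # a and b alias the same underlying list
a[0] = 1
print(b[0])       # prints 1: the write through a is visible through b
```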

The backward pass must know:

| Issue | Consequence |
|---|---|
| Which references share memory? | affects derivative correctness |
| Which writes overwrite previous values? | affects replay |
| Which operations are in-place? | affects tape structure |

Tensor frameworks often track aliasing explicitly to prevent invalid gradient computation.

In-Place Operations

An in-place operation modifies existing storage.

x += y

Instead of allocating a new tensor, the program overwrites x.

This is efficient for memory usage, but dangerous for reverse mode.

Suppose:

x = relu(x)

The backward pass may need the original pre-ReLU values. If they were overwritten, the information is lost.

Many AD systems therefore:

| Strategy | Meaning |
|---|---|
| Forbid some in-place ops | preserve correctness |
| Copy before mutation | preserve old state |
| Track versions | detect invalid reuse |
| Recompute values | avoid storage |

Frameworks such as PyTorch include version counters to detect illegal in-place modifications during gradient recording.
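A toy version counter, loosely modeled on that mechanism, can be sketched as follows; the class and method names are illustrative, not PyTorch's API:

```python
class Tracked:
    def __init__(self, data):
        self.data = data
        self.version = 0

    def add_(self, other):        # in-place update bumps the version
        self.data += other
        self.version += 1

class Saved:
    def __init__(self, t):
        self.t = t
        self.version = t.version  # snapshot the version at save time

    def unpack(self):
        # an in-place write after saving invalidates the saved value
        if self.t.version != self.version:
            raise RuntimeError("saved value was modified in-place")
        return self.t.data
```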

Tape Memory

Reverse mode often stores execution state in a tape.

A tape record may contain:

type TapeNode struct {
    Op       Op
    Inputs   []VarID
    Outputs  []VarID
    Saved    []Value
}

The Saved values contain the primal state needed for the backward rule.

For example:

v = sin(a)

may save a or v because the backward rule needs:

\bar a += \bar v \cos(a).

Without the saved value, reverse propagation cannot evaluate the derivative.
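A Python counterpart of the tape record above can be sketched like this (an illustrative toy, not a real framework API): recording sin saves the primal input, and the backward walk consumes it.

```python
import math

tape = []

def record_sin(a):
    tape.append({"op": "sin", "saved": a})   # save the primal input a
    return math.sin(a)

def run_backward(grad_out):
    grads = []
    for node in reversed(tape):              # replay the tape in reverse
        if node["op"] == "sin":
            # a_bar += v_bar * cos(a), using the saved primal value
            grads.append(grad_out * math.cos(node["saved"]))
    return grads
```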

Activation Storage in Neural Networks

Large neural networks store many intermediate activations.

Example:

v1 = matmul(x, w1)
v2 = relu(v1)
v3 = matmul(v2, w2)
v4 = relu(v3)
y  = loss(v4)

Reverse mode may need:

| Tensor | Needed for |
|---|---|
| v1 | ReLU backward |
| v2 | matrix backward |
| v3 | ReLU backward |
| v4 | loss backward |

For large models, activation storage dominates memory cost.

Training memory usage often scales approximately with:

O(\text{number of activations})

rather than parameter count alone.

Recomputation

Instead of storing all intermediate states, a system can recompute them.

Suppose:

v1 -> v2 -> v3 -> v4

Only v1 and v4 are stored.

During backward execution:

  1. recompute v2,
  2. recompute v3,
  3. apply backward rules.

This reduces memory at the cost of additional computation.
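The steps above can be sketched for the concrete chain v_{k+1} = v_k^2: forward keeps only the checkpoint v1, and backward rebuilds v2 and v3 before applying the squaring backward rule, which needs them.

```python
def forward_checkpointed(x):
    v1 = x
    v4 = ((v1 ** 2) ** 2) ** 2    # v2 and v3 are computed but discarded
    return v1, v4

def backward_checkpointed(v1, grad_v4):
    v2 = v1 ** 2                  # recompute v2 from the checkpoint
    v3 = v2 ** 2                  # recompute v3
    grad_v3 = grad_v4 * 2 * v3    # backward of v4 = v3^2 needs v3
    grad_v2 = grad_v3 * 2 * v2    # backward of v3 = v2^2 needs v2
    grad_v1 = grad_v2 * 2 * v1    # backward of v2 = v1^2 needs v1
    return grad_v1
```

The result matches the analytic derivative d(x^8)/dx = 8x^7 while never storing v2 or v3 across the forward/backward boundary.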

The tradeoff is:

| Strategy | Memory | Compute |
|---|---|---|
| Store everything | high | low |
| Recompute aggressively | low | high |
| Checkpoint selectively | balanced | balanced |

Stateful Randomness

Random number generation introduces hidden state.

y = dropout(x)

The output depends on a random mask.

Backward propagation usually requires the same mask used during forward execution.

Systems therefore:

| Method | Purpose |
|---|---|
| Save random masks | exact replay |
| Save RNG seeds | deterministic regeneration |
| Use stateless RNGs | explicit dependency |

Randomness becomes part of differentiable state.
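Seed-based regeneration can be sketched as follows: the forward pass stores only the RNG seed, and the backward pass rebuilds the identical mask from it (scaling by 1/(1-p) is omitted to keep the sketch short).

```python
import random

def dropout_forward(xs, p, seed):
    rng = random.Random(seed)
    mask = [rng.random() >= p for _ in xs]
    ys = [x if m else 0.0 for x, m in zip(xs, mask)]
    return ys, seed                               # save the seed, not the mask

def dropout_backward(grad_ys, p, seed):
    rng = random.Random(seed)
    mask = [rng.random() >= p for _ in grad_ys]   # identical mask, regenerated
    return [g if m else 0.0 for g, m in zip(grad_ys, mask)]
```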

Stateful External Systems

Programs may interact with external systems:

value = database.read(key)

or

send_packet(x)

These operations may not be differentiable.

External state creates several problems:

| Problem | Reason |
|---|---|
| Non-repeatability | external state may change |
| Side effects | cannot easily reverse |
| Hidden dependencies | not represented in graph |
| Non-differentiability | discrete behavior |

AD systems often isolate differentiable kernels from external effects.

Stateful Optimizers

Optimization algorithms themselves contain state.

Example:

m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad * grad
param = param - lr * m / sqrt(v)

The optimizer state evolves over time.

Differentiating through optimization therefore requires differentiating through the optimizer state transitions.

This appears in:

| Area | Example |
|---|---|
| Meta-learning | optimize the optimizer |
| Hyperparameter optimization | differentiate learning rates |
| Differentiable training loops | gradients through SGD |
| Neural architecture search | nested optimization |

The optimizer becomes part of the computational graph.
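The simplest case, gradients through one SGD step, can be differentiated by hand. This sketch uses the toy loss f(theta) = theta^2 and computes the gradient of the final loss with respect to the learning rate via the chain rule:

```python
def loss(theta):
    return theta ** 2

def grad_loss(theta):
    return 2 * theta

def dloss_dlr(theta0, lr):
    g0 = grad_loss(theta0)       # inner gradient at theta0
    theta1 = theta0 - lr * g0    # one SGD update
    # d theta1 / d lr = -g0, so by the chain rule:
    return grad_loss(theta1) * (-g0)
```

Note that computing this gradient requires the update's input state (here g0), exactly the historical-state requirement from earlier sections.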

Persistent Versus Ephemeral State

Some state exists only during one computation. Other state persists across executions.

| State type | Example |
|---|---|
| Ephemeral | temporary activation |
| Persistent | model parameters |
| Global | random generator |
| External | filesystem contents |

Persistent state creates long-lived dependency chains.

For example, a parameter update loop:

theta_{t+1} = theta_t - lr * grad_t

creates a recurrence over training time itself.

Memory Lifetime Analysis

A compiler can determine when a value is no longer needed.

Example:

v1 = x * x
v2 = sin(v1)
v3 = v2 + 1

Once v2 is computed, v1 may no longer be needed in forward execution. But reverse mode may still require it.

Thus there are two lifetimes:

| Lifetime type | Purpose |
|---|---|
| Forward lifetime | primal execution |
| Reverse lifetime | backward execution |

Reverse-mode AD extends many variable lifetimes dramatically.

Functionalization

Some AD systems transform mutable programs into purely functional representations.

Imperative form:

x[0] += 1

Functionalized form:

x2 = update(x1, 0, x1[0] + 1)

This removes mutation and makes dependencies explicit.

Functionalization simplifies:

| Task | Reason |
|---|---|
| Graph construction | immutable values |
| Reverse replay | no hidden overwrites |
| Parallel execution | fewer races |
| Compiler optimization | explicit dependencies |

Modern differentiable compilers increasingly use this approach internally.
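The functionalized `update` from above can be written as a pure Python function that returns a fresh value and never mutates its input:

```python
def update(xs, index, value):
    out = list(xs)      # copy instead of writing in place
    out[index] = value
    return out

x1 = [5, 7]
x2 = update(x1, 0, x1[0] + 1)   # x1 is unchanged; x2 holds the new state
```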

Core Idea

Automatic differentiation depends on program state because reverse propagation often requires historical values. Mutation, aliasing, in-place updates, random state, and external effects all influence whether the derivative computation is correct and reproducible.

A differentiable program is therefore not only a mathematical function. It is also a state transition system whose memory behavior determines how derivatives can be stored, reconstructed, and propagated backward through execution.