
Memory and State

Automatic differentiation operates on computations, but computations execute inside a memory model. Variables occupy storage locations, arrays are mutated, buffers are reused, and values may outlive the operations that created them. Differentiation therefore depends not only on algebraic structure, but also on program state.

A simple mathematical expression:

y = x^2 + 1

contains no explicit memory model. A program implementation does.

tmp = x * x
y = tmp + 1

The variable tmp occupies storage. It has a lifetime. It may later be overwritten or freed. Reverse-mode AD may need its value long after the forward computation has moved on.

Memory management becomes part of differentiation semantics.

State in Pure and Imperative Programs

A pure functional computation has no mutable state.

v1 = x * x
v2 = v1 + 1

Each variable is assigned once and never changes.

An imperative computation may reuse storage:

x = x * x
x = x + 1

The second program destroys the original value of x.

Mathematically the computations are equivalent. Operationally they differ because reverse mode may need the old value after it has been overwritten.

This is why many AD systems internally transform programs into single-assignment form.
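A minimal sketch (not any framework's real API) makes the problem concrete for y = x^2 + 1: the backward rule needs the original x, so it must be saved before the in-place square destroys it.

```python
def forward_and_backward(x):
    saved_x = x            # preserve the pre-square value for the backward pass
    x = x * x              # imperative overwrite: the original x is gone
    y = x + 1
    grad_x = 2 * saved_x   # dy/dx = 2x, evaluated at the *original* x
    return y, grad_x
```

Had the gradient been computed from the overwritten `x`, it would evaluate `2 * x^2` instead of `2 * x`.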

Mutable State

Mutation changes the contents of a memory location.

a[0] = x
a[1] = y
a[0] = a[0] + 1

The value of a[0] changes over time.

A reverse-mode system must answer several questions:

| Question | Example |
|---|---|
| Which value existed at a given point? | old `a[0]` |
| Was the value overwritten? | yes |
| Do multiple variables alias the same buffer? | possible |
| Can the backward pass reconstruct the old state? | maybe |

Mutation complicates differentiation because the backward pass conceptually runs “backward in time.”

Reverse Mode Requires Historical State

Consider:

x = 2
y = x * x

The backward rule for multiplication uses the primal value:

\bar x += 2x \bar y.

If the original value of x has been overwritten, reverse mode cannot evaluate the derivative correctly unless that value was stored or can be recomputed.
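The multiplication rule can be sketched as a forward function that saves its primal inputs and a backward function that consumes them (illustrative names, not a real API):

```python
def mul_forward(a, b):
    saved = (a, b)            # primal inputs saved for the backward pass
    return a * b, saved

def mul_backward(grad_out, saved):
    a, b = saved
    # d(ab)/da = b and d(ab)/db = a, so each input's gradient
    # uses the *other* saved primal value
    return grad_out * b, grad_out * a
```

For y = x * x both operands are the same x, so the two contributions accumulate into x_bar = 2x * y_bar.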

This requirement extends to tensor programs:

y = relu(x)

The backward rule depends on the sign of the original x:

\bar x = \bar y \odot 1_{x > 0}.

The backward pass therefore needs the original activation pattern.
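A small sketch shows that the saved state is exactly the mask 1_{x>0}, not the full input:

```python
def relu_forward(xs):
    mask = [x > 0 for x in xs]    # the activation pattern to save
    ys = [x if m else 0.0 for x, m in zip(xs, mask)]
    return ys, mask

def relu_backward(grad_ys, mask):
    # gradient passes through only where the original x was positive
    return [g if m else 0.0 for g, m in zip(grad_ys, mask)]
```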

State as a Time Sequence

A mutable variable can be modeled as a sequence of immutable states.

Imperative form:

x = 1
x = x + 2
x = x * 3

Equivalent single-assignment form:

x0 = 1
x1 = x0 + 2
x2 = x1 * 3

The apparent mutation becomes an explicit dependency chain.

This transformation is central to compiler-based AD systems.
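The renaming step can be sketched as a toy transformation (a simplification of real SSA construction, which must also handle control flow): each assignment creates a fresh version of its target, and reads refer to the latest version.

```python
def to_single_assignment(stmts):
    version = {}
    out = []
    for target, tokens in stmts:     # e.g. ("x", ["x", "+", "2"])
        # reads use the current version of each variable
        renamed = [f"{t}{version[t]}" if t in version else t for t in tokens]
        # the write creates the next version of the target
        version[target] = version.get(target, -1) + 1
        out.append((f"{target}{version[target]}", renamed))
    return out
```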

Static Single Assignment Form

Static single assignment form, usually called SSA, assigns each variable exactly once.

Instead of:

x = x + 1
x = x * 2

SSA form becomes:

x1 = x0 + 1
x2 = x1 * 2

Advantages for AD:

| Property | Benefit |
|---|---|
| Explicit dependencies | Easier graph construction |
| No overwritten values | Reverse mode can reference old states |
| Simpler analysis | Easier optimization |
| Cleaner lifetime analysis | Better memory reuse |

Modern compiler IRs commonly use SSA because it aligns naturally with dependency graphs and differentiation.

Aliasing

Aliasing occurs when multiple references point to the same memory.

a = buffer
b = buffer
a[0] = 1

Now b[0] also changes.

Aliasing complicates AD because mutations propagate indirectly through shared storage.
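The example above is directly expressible in Python, where two names bound to the same list share storage:

```python
buffer = [0, 0]
a = buffer
b = buffer        # a and b alias the same underlying list
a[0] = 1
print(b[0])       # prints 1: the write through a is visible through b
```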

The backward pass must know:

| Issue | Consequence |
|---|---|
| Which references share memory? | affects derivative correctness |
| Which writes overwrite previous values? | affects replay |
| Which operations are in-place? | affects tape structure |

Tensor frameworks often track aliasing explicitly to prevent invalid gradient computation.

In-Place Operations

An in-place operation modifies existing storage.

x += y

Instead of allocating a new tensor, the program overwrites x.

This is efficient for memory usage, but dangerous for reverse mode.

Suppose:

x = relu(x)

The backward pass may need the original pre-ReLU values. If they were overwritten, the information is lost.

Many AD systems therefore:

| Strategy | Meaning |
|---|---|
| Forbid some in-place ops | preserve correctness |
| Copy before mutation | preserve old state |
| Track versions | detect invalid reuse |
| Recompute values | avoid storage |

Frameworks such as PyTorch include version counters to detect illegal in-place modifications during gradient recording.
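A toy version counter, loosely modeled on that mechanism, can be sketched as follows; the class and method names are illustrative, not PyTorch's API:

```python
class Tracked:
    def __init__(self, data):
        self.data = data
        self.version = 0

    def add_(self, other):        # in-place update bumps the version
        self.data += other
        self.version += 1

class Saved:
    def __init__(self, t):
        self.t = t
        self.version = t.version  # snapshot the version at save time

    def unpack(self):
        # an in-place write after saving invalidates the saved value
        if self.t.version != self.version:
            raise RuntimeError("saved value was modified in-place")
        return self.t.data
```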

Tape Memory

Reverse mode often stores execution state in a tape.

A tape record may contain:

type TapeNode struct {
    Op       Op
    Inputs   []VarID
    Outputs  []VarID
    Saved    []Value
}

The Saved values contain the primal state needed for the backward rule.

For example:

v = sin(a)

may save a or v because the backward rule needs:

\bar a += \bar v \cos(a).

Without the saved value, reverse propagation cannot evaluate the derivative.
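A Python counterpart of the tape record above can be sketched like this (an illustrative toy, not a real framework API): recording sin saves the primal input, and the backward walk consumes it.

```python
import math

tape = []

def record_sin(a):
    tape.append({"op": "sin", "saved": a})   # save the primal input a
    return math.sin(a)

def run_backward(grad_out):
    grads = []
    for node in reversed(tape):              # replay the tape in reverse
        if node["op"] == "sin":
            # a_bar += v_bar * cos(a), using the saved primal value
            grads.append(grad_out * math.cos(node["saved"]))
    return grads
```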

Activation Storage in Neural Networks

Large neural networks store many intermediate activations.

Example:

v1 = matmul(x, w1)
v2 = relu(v1)
v3 = matmul(v2, w2)
v4 = relu(v3)
y  = loss(v4)

Reverse mode may need:

| Tensor | Needed for |
|---|---|
| v1 | ReLU backward |
| v2 | matrix backward |
| v3 | ReLU backward |
| v4 | loss backward |

For large models, activation storage dominates memory cost.

Training memory usage often scales approximately with:

O(\text{number of activations})

rather than parameter count alone.

Recomputation

Instead of storing all intermediate states, a system can recompute them.

Suppose:

v1 -> v2 -> v3 -> v4

Only v1 and v4 are stored.

During backward execution:

  1. recompute v2,
  2. recompute v3,
  3. apply backward rules.

This reduces memory at the cost of additional computation.
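The steps above can be sketched for the concrete chain v_{k+1} = v_k^2: forward keeps only the checkpoint v1, and backward rebuilds v2 and v3 before applying the squaring backward rule, which needs them.

```python
def forward_checkpointed(x):
    v1 = x
    v4 = ((v1 ** 2) ** 2) ** 2    # v2 and v3 are computed but discarded
    return v1, v4

def backward_checkpointed(v1, grad_v4):
    v2 = v1 ** 2                  # recompute v2 from the checkpoint
    v3 = v2 ** 2                  # recompute v3
    grad_v3 = grad_v4 * 2 * v3    # backward of v4 = v3^2 needs v3
    grad_v2 = grad_v3 * 2 * v2    # backward of v3 = v2^2 needs v2
    grad_v1 = grad_v2 * 2 * v1    # backward of v2 = v1^2 needs v1
    return grad_v1
```

The result matches the analytic derivative d(x^8)/dx = 8x^7 while never storing v2 or v3 across the forward/backward boundary.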

The tradeoff is:

| Strategy | Memory | Compute |
|---|---|---|
| Store everything | high | low |
| Recompute aggressively | low | high |
| Checkpoint selectively | balanced | balanced |

Stateful Randomness

Random number generation introduces hidden state.

y = dropout(x)

The output depends on a random mask.

Backward propagation usually requires the same mask used during forward execution.

Systems therefore:

| Method | Purpose |
|---|---|
| Save random masks | exact replay |
| Save RNG seeds | deterministic regeneration |
| Use stateless RNGs | explicit dependency |

Randomness becomes part of differentiable state.
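Seed-based regeneration can be sketched as follows: the forward pass stores only the RNG seed, and the backward pass rebuilds the identical mask from it (scaling by 1/(1-p) is omitted to keep the sketch short).

```python
import random

def dropout_forward(xs, p, seed):
    rng = random.Random(seed)
    mask = [rng.random() >= p for _ in xs]
    ys = [x if m else 0.0 for x, m in zip(xs, mask)]
    return ys, seed                               # save the seed, not the mask

def dropout_backward(grad_ys, p, seed):
    rng = random.Random(seed)
    mask = [rng.random() >= p for _ in grad_ys]   # identical mask, regenerated
    return [g if m else 0.0 for g, m in zip(grad_ys, mask)]
```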

Stateful External Systems

Programs may interact with external systems:

value = database.read(key)

or

send_packet(x)

These operations may not be differentiable.

External state creates several problems:

| Problem | Reason |
|---|---|
| Non-repeatability | external state may change |
| Side effects | cannot easily reverse |
| Hidden dependencies | not represented in graph |
| Non-differentiability | discrete behavior |

AD systems often isolate differentiable kernels from external effects.

Stateful Optimizers

Optimization algorithms themselves contain state.

Example:

m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad * grad
param = param - lr * m / sqrt(v)

The optimizer state evolves over time.

Differentiating through optimization therefore requires differentiating through the optimizer state transitions.

This appears in:

| Area | Example |
|---|---|
| Meta-learning | optimize the optimizer |
| Hyperparameter optimization | differentiate learning rates |
| Differentiable training loops | gradients through SGD |
| Neural architecture search | nested optimization |

The optimizer becomes part of the computational graph.
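The simplest case, gradients through one SGD step, can be differentiated by hand. This sketch uses the toy loss f(theta) = theta^2 and computes the gradient of the final loss with respect to the learning rate via the chain rule:

```python
def loss(theta):
    return theta ** 2

def grad_loss(theta):
    return 2 * theta

def dloss_dlr(theta0, lr):
    g0 = grad_loss(theta0)       # inner gradient at theta0
    theta1 = theta0 - lr * g0    # one SGD update
    # d theta1 / d lr = -g0, so by the chain rule:
    return grad_loss(theta1) * (-g0)
```

Note that computing this gradient requires the update's input state (here g0), exactly the historical-state requirement from earlier sections.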

Persistent Versus Ephemeral State

Some state exists only during one computation. Other state persists across executions.

| State type | Example |
|---|---|
| Ephemeral | temporary activation |
| Persistent | model parameters |
| Global | random generator |
| External | filesystem contents |

Persistent state creates long-lived dependency chains.

For example, a parameter update loop:

theta_{t+1} = theta_t - lr * grad_t

creates a recurrence over training time itself.

Memory Lifetime Analysis

A compiler can determine when a value is no longer needed.

Example:

v1 = x * x
v2 = sin(v1)
v3 = v2 + 1

Once v2 is computed, v1 may no longer be needed in forward execution. But reverse mode may still require it.

Thus there are two lifetimes:

| Lifetime type | Purpose |
|---|---|
| Forward lifetime | primal execution |
| Reverse lifetime | backward execution |

Reverse-mode AD extends many variable lifetimes dramatically.

Functionalization

Some AD systems transform mutable programs into purely functional representations.

Imperative form:

x[0] += 1

Functionalized form:

x2 = update(x1, 0, x1[0] + 1)

This removes mutation and makes dependencies explicit.

Functionalization simplifies:

| Task | Reason |
|---|---|
| Graph construction | immutable values |
| Reverse replay | no hidden overwrites |
| Parallel execution | fewer races |
| Compiler optimization | explicit dependencies |

Modern differentiable compilers increasingly use this approach internally.
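The functionalized `update` from above can be written as a pure Python function that returns a fresh value and never mutates its input:

```python
def update(xs, index, value):
    out = list(xs)      # copy instead of writing in place
    out[index] = value
    return out

x1 = [5, 7]
x2 = update(x1, 0, x1[0] + 1)   # x1 is unchanged; x2 holds the new state
```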

Core Idea

Automatic differentiation depends on program state because reverse propagation often requires historical values. Mutation, aliasing, in-place updates, random state, and external effects all influence whether the derivative computation is correct and reproducible.

A differentiable program is therefore not only a mathematical function. It is also a state transition system whose memory behavior determines how derivatives can be stored, reconstructed, and propagated backward through execution.