Memory Planning

Memory planning determines where values are stored, how long they remain alive, and when storage can be reused.

For automatic differentiation, memory planning is a central systems problem because reverse mode often needs to retain large numbers of intermediate values from the forward pass until the backward pass consumes them.

A simple mathematical expression may appear small:

y = sin(x * x)

but a deep neural network, PDE solver, or simulation may produce millions of intermediate tensors. Reverse mode may need many of them later.

Without careful planning, memory usage becomes the limiting factor rather than arithmetic throughput.

Why Reverse Mode Uses So Much Memory

Forward mode propagates tangent information alongside primal values. Once an operation finishes, earlier intermediates can often be discarded.

Reverse mode works differently.

The backward pass for an operation usually needs primal values computed during the forward pass.

For multiplication:

z = x * y

the reverse rule is:

bar_x += y * bar_z
bar_y += x * bar_z

The backward pass needs x and y.

For sin(x):

bar_x += cos(x) * bar_y

the backward pass needs x. Some rules can instead be phrased in terms of the output: for exp(x), the saved output is itself the derivative factor, so no extra value is required. For sin, the input must be saved, since cos(x) cannot be recovered from sin(x) alone.

Thus reverse mode must preserve enough information from the forward pass to compute the reverse pass later.

A naive reverse-mode system stores every intermediate value.

The Tape

Many reverse-mode systems use a tape.

The tape is a runtime record of operations and saved values.

A tape entry may contain:

| Field | Purpose |
| --- | --- |
| operation type | Select backward rule |
| input references | Locate parents |
| output references | Locate produced value |
| saved primal values | Needed during backward |
| shape or dtype metadata | Needed for tensor rules |
| control-flow metadata | Needed to replay branches and loops |

Example:

tape:
    op=mul, save=(x, y)
    op=sin, save=(x)
    op=add, save=()

During backward execution, the tape is traversed in reverse order.
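The mechanics can be made concrete with a minimal sketch. The variable ids, op names, and entry layout below are illustrative, not any particular library's tape format; the program is y = sin(x * x).

```python
import math

tape = []     # list of (op, input_ids, output_id, saved_values)
values = {}   # primal values by id

def mul(a, b, out):
    values[out] = values[a] * values[b]
    tape.append(("mul", (a, b), out, (values[a], values[b])))  # save both inputs

def sin_(a, out):
    values[out] = math.sin(values[a])
    tape.append(("sin", (a,), out, (values[a],)))              # save the input

def backward(out_id):
    adj = {out_id: 1.0}
    for op, inputs, out, saved in reversed(tape):   # reverse traversal
        bar = adj.pop(out, 0.0)
        if op == "mul":
            a, b = inputs
            va, vb = saved
            adj[a] = adj.get(a, 0.0) + vb * bar     # bar_a += b * bar_out
            adj[b] = adj.get(b, 0.0) + va * bar     # bar_b += a * bar_out
        elif op == "sin":
            (a,), (va,) = inputs, saved
            adj[a] = adj.get(a, 0.0) + math.cos(va) * bar
    return adj

values["x"] = 0.5
mul("x", "x", "t")      # t = x * x
sin_("t", "y")          # y = sin(t)
grads = backward("y")   # dy/dx = cos(x*x) * 2x
```

Note that every operation that has a nontrivial backward rule saves its inputs onto the tape, which is exactly the memory cost this page is about.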

Activation Memory

In deep learning, saved forward intermediates are often called activations.

For a network:

x -> layer1 -> layer2 -> layer3 -> loss

reverse mode may need outputs from every layer.

The activation memory often scales with:

batch size
sequence length
hidden dimension
number of layers
dtype size

Modern models can require gigabytes or even terabytes of activation storage.

This is why memory planning is a first-class concern in AD systems.

Liveness

A value is live if it may still be needed later.

Memory planning starts with liveness analysis.

For a primal program:

x1 = mul x0 x0
x2 = sin x1
x3 = add x2 x0
return x3

the compiler asks:

When is x1 last needed?
When is x2 last needed?
Will reverse mode require them later?

If x1 is needed by the backward rule for sin, it remains live until the reverse pass consumes it.

Once a value is dead, its storage can be reused.
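A last-use analysis for this program can be sketched as follows, assuming each instruction is a (output, op, inputs) triple and that the set of ops whose backward rules consume their inputs is known (the set below is illustrative).

```python
program = [
    ("x1", "mul", ("x0", "x0")),
    ("x2", "sin", ("x1",)),
    ("x3", "add", ("x2", "x0")),
]
# Ops whose backward rule reads their forward inputs (illustrative set).
needed_by_backward = {"mul", "sin"}

def last_use(program):
    last = {}
    for step, (out, op, inputs) in enumerate(program):
        for v in inputs:
            last[v] = max(last.get(v, -1), step)
        if op in needed_by_backward:
            for v in inputs:
                last[v] = len(program)   # live until the reverse pass
    return last

lifetimes = last_use(program)   # x2 dies at step 2; x0 and x1 survive
```

Here x2 dies at its last forward use, but x0 and x1 must outlive the forward pass because mul and sin will read them during backward.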

Buffer Reuse

Two values whose lifetimes do not overlap can share the same memory buffer.

Example:

x1 live during steps 1-3
x2 live during steps 4-6

The compiler may allocate both into the same buffer.

buffer A:
    stores x1
    later reused for x2

This reduces peak memory usage.

Tensor programs benefit greatly from reuse because tensor allocations are large and allocation overhead can be expensive.
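A greedy interval-based assignment can be sketched like this, assuming lifetimes are given as inclusive (start, end) steps; each value takes the first buffer that is already free at its start.

```python
def assign_buffers(lifetimes):
    """lifetimes: dict name -> (start, end), inclusive steps."""
    buffers = []        # per-buffer: step at which it becomes free
    assignment = {}
    for name, (start, end) in sorted(lifetimes.items(),
                                     key=lambda kv: kv[1][0]):
        for i, free_at in enumerate(buffers):
            if free_at < start:          # lifetimes do not overlap
                buffers[i] = end
                assignment[name] = i
                break
        else:                            # no free buffer: allocate a new one
            buffers.append(end)
            assignment[name] = len(buffers) - 1
    return assignment, len(buffers)

plan, n = assign_buffers({"x1": (1, 3), "x2": (4, 6), "x3": (2, 5)})
```

Three values fit in two buffers: x1 and x2 share storage because their lifetimes are disjoint, while x3 overlaps both and needs its own.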

In-Place Updates

An operation may overwrite an input buffer instead of allocating a new output.

Example:

x = relu(x)

If the old value of x is no longer needed, the output may reuse the same storage.

However, reverse mode complicates this.

The backward rule for ReLU may need to know which elements were positive during the forward pass.

bar_x = bar_y * (x > 0)

If the in-place update destroyed the original x, the backward pass may fail.

Therefore memory planning must know:

Which values are needed later?
Can overwritten values be recomputed?
Can metadata summarize the needed information?

Many AD systems either avoid unsafe in-place mutation or track it explicitly.
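The "metadata summarizes the needed information" option can be sketched for ReLU, with plain Python lists standing in for tensors. Saving the sign mask is enough for backward, so the in-place forward update becomes safe.

```python
def relu_forward(x):
    mask = [v > 0 for v in x]        # cheap metadata for backward
    for i, v in enumerate(x):
        x[i] = v if v > 0 else 0.0   # in-place update, now safe
    return x, mask

def relu_backward(bar_y, mask):
    # bar_x = bar_y * (x > 0), using the saved mask instead of x
    return [b if m else 0.0 for b, m in zip(bar_y, mask)]

y, mask = relu_forward([-1.0, 2.0, -3.0, 4.0])
bar_x = relu_backward([1.0, 1.0, 1.0, 1.0], mask)
```

A boolean mask is much smaller than the full-precision input, so this trades a little bookkeeping for both safety and memory.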

Saved Values vs Recomputation

A value needed during backward can be:

| Strategy | Meaning |
| --- | --- |
| Saved | Store the forward value |
| Recomputed | Recompute it later |
| Checkpointed | Save selected boundaries only |
| Compressed | Store a reduced representation |

The optimal choice depends on compute cost and memory cost.

Example:

x1 = expensive(x0)
x2 = cheap(x1)

Saving x1 may consume substantial memory; recomputing it repeats the expensive work.

A compiler may choose different strategies for different operations.

Checkpointing

Checkpointing trades computation for memory.

Instead of saving every intermediate, the system saves selected checkpoints and recomputes missing values during backward.

Example:

x0 -> x1 -> x2 -> x3 -> x4

Naive reverse mode saves:

x1, x2, x3, x4

Checkpointing may save only:

x2, x4

During backward:

recompute x3 from x2
run backward for x4
run backward for x3
recompute x1 from x0
run backward for x2 and x1

This reduces memory at the cost of extra forward computation.

Checkpointing is essential for long sequences, large transformer models, neural ODEs, and scientific simulations.
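Segment checkpointing on a chain of unary ops can be sketched as follows, assuming each op is an (f, df) pair with df evaluated at the op's input. Only every k-th intermediate is saved; the rest are recomputed from the nearest checkpoint during backward.

```python
import math

def grad_chain(ops, x0, k=2):
    # Forward: save a checkpoint every k steps (plus the input).
    ckpts = {0: x0}
    x = x0
    for i, (f, _) in enumerate(ops):
        x = f(x)
        if (i + 1) % k == 0:
            ckpts[i + 1] = x
    # Backward: recompute inputs inside each segment on demand.
    bar = 1.0
    for i in reversed(range(len(ops))):
        base = (i // k) * k          # nearest checkpoint at or before step i
        x = ckpts[base]
        for j in range(base, i):     # recompute x_i from the checkpoint
            x = ops[j][0](x)
        bar *= ops[i][1](x)          # chain rule at step i
    return bar

ops = [(math.sin, math.cos), (math.exp, math.exp),
       (lambda t: t * t, lambda t: 2 * t)]
g = grad_chain(ops, 0.3, k=2)        # d/dx of (exp(sin(x)))**2 at x = 0.3
```

With k near sqrt(n) this keeps O(sqrt(n)) checkpoints at the cost of roughly one extra forward pass, the classic checkpointing tradeoff.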

Rematerialization

Recomputation of discarded values is often called rematerialization.

The compiler decides:

store value
or
recompute value later

This decision depends on:

| Factor | Effect |
| --- | --- |
| Tensor size | Large tensors are expensive to store |
| Compute cost | Expensive operations are costly to recompute |
| Reuse count | Frequently reused values favor storage |
| Device bandwidth | Memory movement may dominate runtime |
| Parallel execution | Recomputation may block scheduling |

A good planner balances arithmetic cost against memory pressure.

Reverse-Mode Scheduling

The backward pass itself has memory requirements.

Adjoints must remain alive until all downstream contributions arrive.

Example:

x1 = f(x0)
x2 = g(x0)
x3 = add(x1, x2)

The adjoint for x0 receives contributions from both paths:

bar_x0 += contribution_from_f
bar_x0 += contribution_from_g

Thus bar_x0 must stay live until all contributions are accumulated.

The order of reverse execution affects peak memory use.
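One way to bound adjoint lifetimes is to count pending contributions per value, freeing each adjoint buffer as soon as its last contribution arrives. A sketch for the graph above, with illustrative contribution values:

```python
# Consumer counts come from the graph: x0 feeds both f and g,
# so its adjoint waits for two contributions.
pending = {"x0": 2, "x1": 1, "x2": 1}
adjoints = {}

def accumulate(name, contribution):
    adjoints[name] = adjoints.get(name, 0.0) + contribution
    pending[name] -= 1
    if pending[name] == 0:
        return adjoints[name]   # complete: may be consumed and freed
    return None                 # still waiting for more contributions

accumulate("x1", 1.0)           # from backward of add
accumulate("x2", 1.0)           # from backward of add
accumulate("x0", 0.5)           # from backward of f: x0 still pending
done = accumulate("x0", 0.25)   # from backward of g: now complete
```

The counts play the same role as reference counts in an allocator: they tell the scheduler the earliest point at which each adjoint buffer can be released.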

Graph Scheduling

Graph IRs permit flexible execution order.

A graph scheduler chooses:

operation order
buffer allocation
buffer reuse
device placement
communication overlap
recompute strategy

For AD, the scheduler may jointly optimize forward and backward graphs.

For example:

forward kernel
save activation
launch backward kernel
free activation

A better schedule may overlap computation and freeing more aggressively.

Tensor Lifetimes

Large tensor systems often model lifetimes explicitly.

Example:

tensor A:
    allocated at step 10
    last used at step 35

tensor B:
    allocated at step 36
    last used at step 50

The planner can place both in the same memory region.

This becomes more important on accelerators where memory is limited and allocation is expensive.

GPU memory fragmentation can significantly reduce usable capacity even when nominal free memory exists.

Static vs Dynamic Memory Planning

Static planning determines memory layout before execution.

This is possible when:

shapes are known
control flow is fixed
graph structure is static

Static planning enables aggressive optimization.

Dynamic planning allocates memory during execution.

This is necessary when:

shapes vary at runtime
control flow depends on data
graphs are dynamic

Dynamic planning is more flexible but may have higher runtime overhead and fragmentation.

Modern systems often combine both:

static allocation for stable graph regions
dynamic allocation for variable regions

Memory Pools

Repeated allocation and deallocation are expensive.

Many AD runtimes use memory pools.

Instead of returning freed memory to the operating system, buffers remain in a reusable pool.

allocate buffer
use buffer
mark free
reuse buffer later

This reduces allocator overhead and fragmentation.

Memory pools are especially important for GPU execution because device allocations can be slow.
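A size-bucketed caching pool can be sketched like this, with `bytearray` standing in for a device allocation; real allocators add splitting, alignment, and stream tracking.

```python
class Pool:
    def __init__(self):
        self.free = {}            # size -> list of cached buffers
        self.fresh_allocs = 0     # how often the real allocator was hit

    def allocate(self, size):
        cached = self.free.get(size)
        if cached:
            return cached.pop()   # reuse without touching the allocator
        self.fresh_allocs += 1
        return bytearray(size)    # stand-in for an expensive device alloc

    def release(self, buf):
        # Keep the buffer for reuse instead of returning it to the system.
        self.free.setdefault(len(buf), []).append(buf)

pool = Pool()
a = pool.allocate(1024)
pool.release(a)
b = pool.allocate(1024)           # served from the cache, not a new alloc
```

The second allocation returns the cached buffer, so the underlying allocator is hit only once.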

Alias Analysis

Mutation introduces aliasing problems.

Example:

a = b
a[i] = 1

The update affects both names.

The memory planner must know:

Which buffers may alias?
Which updates are visible elsewhere?
Which saved values become invalid after mutation?

Incorrect alias analysis can produce wrong gradients.

Compiler-based systems often use SSA-style memory models or effect systems to make aliasing explicit.
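The hazard can be shown in a few lines, with Python lists standing in for mutable buffers: a backward rule that saved a view rather than a copy silently reads the mutated value.

```python
x = [1.0, 2.0, 3.0]
saved_view = x            # alias: no copy is made
saved_copy = list(x)      # defensive copy

x[0] = 99.0               # in-place mutation after "saving"

# The view reflects the mutation; a backward rule reading it would
# compute gradients from the wrong primal value.
stale = saved_view[0]     # mutated value, not the value at save time
correct = saved_copy[0]   # value at save time
```

Copying everything defensively is safe but expensive, which is why compilers prefer to prove non-aliasing instead.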

Compression

Saved activations may be compressed.

Strategies include:

| Compression method | Tradeoff |
| --- | --- |
| Lower precision | Less memory, reduced numerical accuracy |
| Quantization | Smaller storage, reconstruction overhead |
| Sparsification | Reduced memory for sparse activations |
| Delta encoding | Effective for slowly changing states |
| Recomputation from seeds | Minimal storage, extra compute |

Compression is increasingly important for very large models.
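Quantization can be sketched with a single-scale 8-bit scheme, using plain Python lists; real systems quantize tensors per channel or per block.

```python
def quantize(xs):
    # One shared scale mapping the largest magnitude to 127.
    scale = max(abs(v) for v in xs) / 127 or 1.0
    return [round(v / scale) for v in xs], scale

def dequantize(qs, scale):
    return [q * scale for q in qs]

acts = [0.5, -1.0, 0.25, 0.75]
q, s = quantize(acts)              # 8-bit codes, ~4x smaller than fp32
approx = dequantize(q, s)          # reconstruction with bounded error
```

The saved activation shrinks to a quarter of its fp32 size, at the cost of a reconstruction error bounded by the scale, which the backward pass then inherits.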

Distributed Memory

Large-scale training distributes activations and gradients across devices.

The planner must coordinate:

device-local buffers
communication buffers
gradient synchronization
pipeline stages
activation sharding
optimizer state placement

The derivative graph is therefore also a communication graph.

Memory planning becomes coupled with network scheduling.

Forward-Mode Memory

Forward mode uses less memory because derivatives propagate immediately.

A forward-mode computation usually needs only:

current primal values
current tangent values

No backward replay is required.

This is one reason forward mode is attractive for small-input problems and higher-order local differentiation.
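The contrast with the tape is visible in a dual-number sketch of forward mode: only the current (primal, tangent) pair is carried, and nothing is recorded for later.

```python
import math

# Duals are (primal, tangent) pairs; no tape, no saved intermediates.
def dual_mul(a, b):
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])   # product rule

def dual_sin(a):
    return (math.sin(a[0]), math.cos(a[0]) * a[1])    # chain rule

x = (0.5, 1.0)                 # seed tangent dx = 1
y = dual_sin(dual_mul(x, x))   # y = sin(x*x), dy = cos(x*x) * 2x
```

Each intermediate dual can be discarded as soon as the next operation has consumed it, which is exactly the O(1) working-set behavior described above.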

Persistent vs Ephemeral Values

Some values persist across calls:

model parameters
optimizer state
compiled kernels
memory pools

Others are ephemeral:

temporary activations
adjoints
intermediate reductions
loop-local buffers

The planner must distinguish them.

Persistent values may justify expensive layout optimization. Ephemeral values favor rapid allocation and reuse.

Memory Planning and Numerical Stability

Recomputation changes floating-point behavior.

Example:

save exact forward value
vs
recompute approximately equivalent value

Because floating-point arithmetic is not associative, recomputation may produce slightly different results.
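A two-line demonstration: the same three terms summed in a different order round differently on IEEE-754 doubles.

```python
# Floating-point addition is not associative, so a recomputed value
# evaluated in a different order can differ from the saved one.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
different = left != right   # the two orders round to different doubles
```

The discrepancy is a single unit in the last place here, but in long reductions such differences accumulate and can make recomputed gradients diverge measurably from saved ones.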

For most machine learning workloads this is acceptable. For high-precision scientific computing, deterministic replay may matter.

The planner therefore influences both performance and numerical reproducibility.

Compiler Integration

Memory planning is closely tied to compiler optimization.

The compiler needs:

shape inference
liveness analysis
effect analysis
alias analysis
scheduling
device placement
cost models

A modern AD compiler does not treat memory as an afterthought. Memory is part of the program transformation process itself.

Design Questions

A memory planning system for AD should answer:

Which values must survive for backward?
Which values can be recomputed?
When can buffers be reused?
Which operations allow in-place updates?
How are aliases tracked?
How are dynamic shapes handled?
How are buffers distributed across devices?
How is checkpointing chosen?
How are memory and compute tradeoffs modeled?

Different systems make different tradeoffs depending on workload.

Summary

Memory planning determines the practical scalability of reverse-mode automatic differentiation.

The main challenge is that backward computation depends on forward intermediates. A naive system stores everything. A scalable system analyzes lifetimes, reuses buffers, rematerializes values, checkpoints computations, and coordinates memory with execution scheduling.

In modern AD systems, memory planning is not separate from differentiation. The structure of the derivative program directly determines what must be stored, when it must remain alive, and what can be discarded or recomputed.