Memory planning determines where values are stored, how long they remain alive, and when storage can be reused.
For automatic differentiation, memory planning is a central systems problem because reverse mode often requires large numbers of intermediate values from the forward pass.
A simple mathematical expression may appear small:
```
y = sin(x * x)
```
but a deep neural network, PDE solver, or simulation may produce millions of intermediate tensors. Reverse mode may need many of them later.
Without careful planning, memory usage becomes the limiting factor rather than arithmetic throughput.
Why Reverse Mode Uses So Much Memory
Forward mode propagates tangent information alongside primal values. Once an operation finishes, earlier intermediates can often be discarded.
Reverse mode works differently.
The backward pass for an operation usually needs primal values computed during the forward pass.
For multiplication:
```
z = x * y
```
the reverse rule is:
```
bar_x += y * bar_z
bar_y += x * bar_z
```
The backward pass needs x and y.
For sin(x):
```
bar_x += cos(x) * bar_y
```
the backward pass needs x, or cos(x) if the forward pass saves the cosine instead of the input.
Thus reverse mode must preserve enough information from the forward pass to compute the reverse pass later.
A naive reverse-mode system stores every intermediate value.
The Tape
Many reverse-mode systems use a tape.
The tape is a runtime record of operations and saved values.
A tape entry may contain:
| Field | Purpose |
|---|---|
| operation type | Select backward rule |
| input references | Locate parents |
| output references | Locate produced value |
| saved primal values | Needed during backward |
| shape or dtype metadata | Needed for tensor rules |
| control-flow metadata | Needed to replay branches and loops |
Example:
```
tape:
  op=mul, save=(x, y)
  op=sin, save=(x)
  op=add, save=()
```
During backward execution, the tape is traversed in reverse order.
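To make the tape concrete, here is a minimal scalar reverse-mode sketch in Python. All names (Tape, TapeEntry, mul, sin_op) are invented for this illustration and do not correspond to any particular framework's API.

```python
import math
from dataclasses import dataclass, field

@dataclass
class TapeEntry:
    op: str          # operation type, selects the backward rule
    inputs: tuple    # indices of parent values
    output: int      # index of the produced value
    saved: tuple     # primal values needed during backward

@dataclass
class Tape:
    values: list = field(default_factory=list)   # primal values by index
    entries: list = field(default_factory=list)  # recorded operations

    def push(self, v):
        self.values.append(v)
        return len(self.values) - 1

def mul(tape, a, b):
    out = tape.push(tape.values[a] * tape.values[b])
    tape.entries.append(TapeEntry("mul", (a, b), out,
                                  (tape.values[a], tape.values[b])))
    return out

def sin_op(tape, a):
    out = tape.push(math.sin(tape.values[a]))
    tape.entries.append(TapeEntry("sin", (a,), out, (tape.values[a],)))
    return out

def backward(tape, out_idx):
    adjoints = [0.0] * len(tape.values)
    adjoints[out_idx] = 1.0
    # Walk the tape in reverse, applying each operation's backward rule.
    for e in reversed(tape.entries):
        g = adjoints[e.output]
        if e.op == "mul":
            x, y = e.saved
            adjoints[e.inputs[0]] += y * g
            adjoints[e.inputs[1]] += x * g
        elif e.op == "sin":
            (x,) = e.saved
            adjoints[e.inputs[0]] += math.cos(x) * g
    return adjoints

# d/dx sin(x * x) at x = 1.5; expected value is 2 * 1.5 * cos(1.5 * 1.5).
tape = Tape()
x = tape.push(1.5)
y = sin_op(tape, mul(tape, x, x))
print(backward(tape, y)[x])
```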
Activation Memory
In deep learning, saved forward intermediates are often called activations.
For a network:
```
x -> layer1 -> layer2 -> layer3 -> loss
```
reverse mode may need outputs from every layer.
The activation memory often scales with:
- batch size
- sequence length
- hidden dimension
- number of layers
- dtype size

Modern models can require gigabytes or even terabytes of activation storage.
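As a rough back-of-the-envelope illustration (the sizes below are hypothetical, not taken from a specific model), saving a single hidden-state tensor per layer already adds up:

```python
# Hypothetical sizes: batch 8, sequence 2048, hidden 4096, 32 layers, fp16.
batch, seq_len, hidden, layers, dtype_bytes = 8, 2048, 4096, 32, 2

bytes_per_layer = batch * seq_len * hidden * dtype_bytes   # one saved tensor
total = bytes_per_layer * layers
print(f"{total / 2**30:.1f} GiB")   # ~4.0 GiB for just one activation per layer
```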
This is why memory planning is a first-class concern in AD systems.
Liveness
A value is live if it may still be needed later.
Memory planning starts with liveness analysis.
For a primal program:
```
x1 = mul x0 x0
x2 = sin x1
x3 = add x2 x0
return x3
```
the compiler asks:
- When is x1 last needed?
- When is x2 last needed?
- Will reverse mode require them later?

If x1 is needed by the backward rule for sin, it remains live until the reverse pass consumes it.
Once a value is dead, its storage can be reused.
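A minimal liveness sketch for straight-line code, assuming instructions are represented as (output, op, inputs) tuples; the representation and the last_uses helper are invented for this example:

```python
def last_uses(instructions, backward_needs):
    """Return the index of the last instruction that reads each value.

    instructions: list of (output_name, op, input_names) tuples.
    backward_needs: values the reverse pass will still read; they stay
    live past the end of the forward program."""
    last = {}
    for i, (_, _, inputs) in enumerate(instructions):
        for name in inputs:
            last[name] = i
    end = len(instructions)
    for name in backward_needs:
        last[name] = end   # live until the backward pass consumes it
    return last

program = [
    ("x1", "mul", ("x0", "x0")),
    ("x2", "sin", ("x1",)),
    ("x3", "add", ("x2", "x0")),
]
# sin's backward rule reads x1, so x1 must outlive the forward pass.
print(last_uses(program, backward_needs={"x1"}))
```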
Buffer Reuse
Two values whose lifetimes do not overlap can share the same memory buffer.
Example:
```
x1 live during steps 1-3
x2 live during steps 4-6
```
The compiler may allocate both into the same buffer.
```
buffer A:
  stores x1
  later reused for x2
```
This reduces peak memory usage.
Tensor programs benefit greatly from reuse because tensor allocations are large and allocation overhead can be expensive.
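One simple form of this is greedy interval-based sharing: assign each value to the first buffer whose current occupant is already dead. The sketch below ignores buffer sizes, which a real planner must also match:

```python
def assign_buffers(lifetimes):
    """Greedy buffer sharing. lifetimes maps name -> (first_use, last_use).
    Values whose intervals do not overlap may share a buffer."""
    buffers = []      # last_use of the value currently occupying each buffer
    assignment = {}
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for buf_id, busy_until in enumerate(buffers):
            if busy_until < start:          # previous occupant is dead
                buffers[buf_id] = end
                assignment[name] = buf_id
                break
        else:                               # no free buffer, allocate a new one
            buffers.append(end)
            assignment[name] = len(buffers) - 1
    return assignment

print(assign_buffers({"x1": (1, 3), "x2": (4, 6), "x3": (2, 5)}))
# x1 and x2 share buffer 0; x3 overlaps both and gets buffer 1.
```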
In-Place Updates
An operation may overwrite an input buffer instead of allocating a new output.
Example:
```
x = relu(x)
```
If the old value of x is no longer needed, the output may reuse the same storage.
However, reverse mode complicates this.
The backward rule for ReLU may need to know which elements were positive during the forward pass.
```
bar_x = bar_y * (x > 0)
```
If the in-place update destroyed the original x, the backward pass may fail.
Therefore memory planning must know:
- Which values are needed later?
- Can overwritten values be recomputed?
- Can metadata summarize the needed information?

Many AD systems either avoid unsafe in-place mutation or track it explicitly.
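A small NumPy sketch of the last point: for ReLU the positivity mask can be recovered from the output, so the backward rule can be written against the output and the input buffer can safely be overwritten. This is an illustrative example, not any framework's implementation:

```python
import numpy as np

def relu_inplace(x):
    # Overwrites the input buffer; the original x is gone afterwards.
    np.maximum(x, 0.0, out=x)
    return x

def relu_backward_from_output(y, bar_y):
    # y > 0 exactly where x > 0, so the output (or a boolean mask) carries
    # all the information the backward rule needs.
    return bar_y * (y > 0)

x = np.array([-1.0, 0.5, 2.0])
y = relu_inplace(x)                                    # x and y share storage
print(relu_backward_from_output(y, np.ones_like(y)))   # [0. 1. 1.]
```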
Saved Values vs Recomputation
A value needed during backward can be:
| Strategy | Meaning |
|---|---|
| Saved | Store the forward value |
| Recomputed | Recompute it later |
| Checkpointed | Save selected boundaries only |
| Compressed | Store reduced representation |
The optimal choice depends on compute cost and memory cost.
Example:
```
x1 = expensive(x0)
x2 = cheap(x1)
```
Saving x1 may use large memory. Recomputation may repeat expensive work.
A compiler may choose different strategies for different operations.
Checkpointing
Checkpointing trades computation for memory.
Instead of saving every intermediate, the system saves selected checkpoints and recomputes missing values during backward.
Example:
```
x0 -> x1 -> x2 -> x3 -> x4
```
Naive reverse mode saves:
```
x1, x2, x3, x4
```
Checkpointing may save only:
```
x2, x4
```
During backward:
```
recompute x3 from x2
run backward for x4
run backward for x3
```
This reduces memory at the cost of extra forward computation.
Checkpointing is essential for long sequences, large transformer models, neural ODEs, and scientific simulations.
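A sketch of segment-wise checkpointing for a scalar chain of functions, assuming the chain length is a multiple of the segment size; real systems apply the same idea to tensor subgraphs and integrate it with the AD machinery rather than doing it by hand:

```python
import math

def checkpointed_grad(fns, dfns, x0, segment_size=2):
    """Gradient of fns[-1](...fns[0](x0)) w.r.t. x0, saving only segment
    boundaries and recomputing interior values during the backward pass.
    Assumes len(fns) is a multiple of segment_size for simplicity."""
    # Forward: keep only every segment_size-th value (the checkpoints).
    checkpoints = {0: x0}
    x = x0
    for i, f in enumerate(fns):
        x = f(x)
        if (i + 1) % segment_size == 0:
            checkpoints[i + 1] = x

    grad = 1.0
    # Backward: walk segments in reverse, recomputing interiors from the
    # nearest checkpoint before applying the chain rule within the segment.
    for seg_end in range(len(fns), 0, -segment_size):
        seg_start = seg_end - segment_size
        vals = [checkpoints[seg_start]]
        for i in range(seg_start, seg_end - 1):
            vals.append(fns[i](vals[-1]))            # recompute interior values
        for i in range(seg_end - 1, seg_start - 1, -1):
            grad *= dfns[i](vals[i - seg_start])     # chain rule step
    return grad

fns  = [math.sin, math.cos, math.exp, math.tanh]
dfns = [math.cos, lambda v: -math.sin(v), math.exp,
        lambda v: 1 - math.tanh(v) ** 2]
print(checkpointed_grad(fns, dfns, 0.3))
```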
Rematerialization
Recomputation of discarded values is often called rematerialization.
The compiler decides:
```
store value
or
recompute value later
```
This decision depends on:
| Factor | Effect |
|---|---|
| Tensor size | Large tensors are expensive to store |
| Compute cost | Expensive operations are costly to recompute |
| Reuse count | Frequently reused values favor storage |
| Device bandwidth | Memory movement may dominate runtime |
| Parallel execution | Recomputation may block scheduling |
A good planner balances arithmetic cost against memory pressure.
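One very simplified way to frame the per-value decision is a time comparison between storing and recomputing. The constants below are placeholders; production planners solve a global optimization over the whole graph rather than a per-value test:

```python
def should_rematerialize(bytes_stored, recompute_flops, reuse_count,
                         bytes_per_second, flops_per_second):
    """Crude heuristic: recompute a value when recomputing it for every
    backward use is cheaper than the memory traffic its storage would cost."""
    store_cost = bytes_stored / bytes_per_second                  # seconds
    recompute_cost = reuse_count * recompute_flops / flops_per_second
    return recompute_cost < store_cost

# A large, cheap-to-recompute tensor favors rematerialization.
print(should_rematerialize(bytes_stored=512 * 2**20, recompute_flops=1e8,
                           reuse_count=1, bytes_per_second=2e12,
                           flops_per_second=1e14))
```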
Reverse-Mode Scheduling
The backward pass itself has memory requirements.
Adjoints must remain alive until all downstream contributions arrive.
Example:
```
x1 = f(x0)
x2 = g(x0)
x3 = add(x1, x2)
```
The adjoint for x0 receives contributions from both paths:
```
bar_x0 += contribution_from_f
bar_x0 += contribution_from_g
```
Thus bar_x0 must stay live until all contributions are accumulated.
The order of reverse execution affects peak memory use.
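A small bookkeeping sketch of this constraint: count the pending backward contributions for each adjoint and release its buffer only when the count reaches zero. The function and names are hypothetical:

```python
def backward_with_release(consumers):
    """Simulate when adjoint buffers can be released during the backward pass.

    consumers maps a value name to the number of backward contributions its
    adjoint will receive (one per forward use of that value)."""
    pending = dict(consumers)
    events = []

    def contribute(name, source):
        pending[name] -= 1
        events.append(f"bar_{name} += contribution_from_{source}")
        if pending[name] == 0:
            events.append(f"release bar_{name}")   # buffer is now reusable

    # x0 feeds both f and g, so bar_x0 must wait for two contributions.
    contribute("x0", "f")
    contribute("x0", "g")
    return events

print("\n".join(backward_with_release({"x0": 2})))
```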
Graph Scheduling
Graph IRs permit flexible execution order.
A graph scheduler chooses:
- operation order
- buffer allocation
- buffer reuse
- device placement
- communication overlap
- recompute strategy

For AD, the scheduler may jointly optimize forward and backward graphs.
For example:
```
forward kernel
save activation
launch backward kernel
free activation
```
A better schedule may overlap computation and freeing more aggressively.
Tensor Lifetimes
Large tensor systems often model lifetimes explicitly.
Example:
```
tensor A:
  allocated at step 10
  last used at step 35
tensor B:
  allocated at step 36
  last used at step 50
```
The planner can place both in the same memory region.
This becomes more important on accelerators where memory is limited and allocation is expensive.
GPU memory fragmentation can significantly reduce usable capacity even when nominal free memory exists.
Static vs Dynamic Memory Planning
Static planning determines memory layout before execution.
This is possible when:
- shapes are known
- control flow is fixed
- graph structure is static

Static planning enables aggressive optimization.
Dynamic planning allocates memory during execution.
This is necessary when:
- shapes vary at runtime
- control flow depends on data
- graphs are dynamic

Dynamic planning is more flexible but may have higher runtime overhead and fragmentation.
Modern systems often combine both:
- static allocation for stable graph regions
- dynamic allocation for variable regions

Memory Pools
Repeated allocation and deallocation are expensive.
Many AD runtimes use memory pools.
Instead of returning freed memory to the operating system, buffers remain in a reusable pool.
```
allocate buffer
use buffer
mark free
reuse buffer later
```
This reduces allocator overhead and fragmentation.
Memory pools are especially important for GPU execution because device allocations can be slow.
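A toy size-bucketed pool illustrates the idea; real device allocators additionally split, round, and coalesce blocks to limit fragmentation:

```python
from collections import defaultdict

class BufferPool:
    """Tiny size-bucketed pool: freed buffers go back to a free list
    instead of the system allocator."""

    def __init__(self):
        self.free = defaultdict(list)    # size -> list of reusable buffers

    def allocate(self, size):
        if self.free[size]:
            return self.free[size].pop()   # reuse without new allocation
        return bytearray(size)             # stand-in for a device allocation

    def release(self, buf):
        self.free[len(buf)].append(buf)    # keep it for later reuse

pool = BufferPool()
a = pool.allocate(1024)
pool.release(a)
b = pool.allocate(1024)
print(b is a)   # True: the second request reused the first buffer
```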
Alias Analysis
Mutation introduces aliasing problems.
Example:
```
a = b
a[i] = 1
```
The update affects both names.
The memory planner must know:
- Which buffers may alias?
- Which updates are visible elsewhere?
- Which saved values become invalid after mutation?

Incorrect alias analysis can produce wrong gradients.
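A NumPy illustration of the failure mode: a tape that saves a reference rather than a copy can have its "saved" value silently changed by a later in-place write through an alias:

```python
import numpy as np

b = np.array([2.0, 3.0])
a = b                        # a and b alias the same buffer
saved_for_backward = b       # a tape might save this reference, not a copy

a[0] = 1.0                   # in-place update through the alias
print(saved_for_backward)    # [1. 3.] -- the "saved" value has changed

# A planner that knows about the alias can instead save a defensive copy
# (or bump a version counter) so the backward pass sees the original value.
safe_copy = b.copy()
```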
Compiler-based systems often use SSA-style memory models or effect systems to make aliasing explicit.
Compression
Saved activations may be compressed.
Strategies include:
| Compression method | Tradeoff |
|---|---|
| Lower precision | Less memory, reduced numerical accuracy |
| Quantization | Smaller storage, reconstruction overhead |
| Sparsification | Reduced memory for sparse activations |
| Delta encoding | Effective for slowly changing states |
| Recomputation from seeds | Minimal storage, extra compute |
Compression is increasingly important for very large models.
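As a minimal illustration of the first row (lower precision), assuming NumPy; the shapes here are arbitrary:

```python
import numpy as np

activation = np.random.randn(4, 8).astype(np.float32)

saved = activation.astype(np.float16)          # half the bytes to keep around
restored = saved.astype(np.float32)            # decompress before backward use

print(activation.nbytes, saved.nbytes)         # 128 64
print(np.max(np.abs(activation - restored)))   # small rounding error
```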
Distributed Memory
Large-scale training distributes activations and gradients across devices.
The planner must coordinate:
- device-local buffers
- communication buffers
- gradient synchronization
- pipeline stages
- activation sharding
- optimizer state placement

The derivative graph is therefore also a communication graph.
Memory planning becomes coupled with network scheduling.
Forward-Mode Memory
Forward mode uses less memory because derivatives propagate immediately.
A forward-mode computation usually needs only:
- current primal values
- current tangent values

No backward replay is required.
This is one reason forward mode is attractive for small-input problems and higher-order local differentiation.
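A short sketch of a forward-mode sweep over a chain of unary functions: only the current (primal, tangent) pair is kept, so memory does not grow with chain length. The helper and its argument format are invented for this example:

```python
import math

def forward_mode_chain(fns_with_derivs, x, dx=1.0):
    """Forward-mode sweep: each step overwrites the previous (primal, tangent)
    pair, so earlier intermediates are never retained."""
    primal, tangent = x, dx
    for f, df in fns_with_derivs:
        # Right-hand side is evaluated before assignment, so df sees the
        # old primal value, as the chain rule requires.
        primal, tangent = f(primal), df(primal) * tangent
    return primal, tangent

chain = [(math.sin, math.cos),
         (math.exp, math.exp),
         (math.tanh, lambda v: 1 - math.tanh(v) ** 2)]
print(forward_mode_chain(chain, 0.3))   # tanh(exp(sin(0.3))) and its derivative
```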
Persistent vs Ephemeral Values
Some values persist across calls:
- model parameters
- optimizer state
- compiled kernels
- memory pools

Others are ephemeral:
- temporary activations
- adjoints
- intermediate reductions
- loop-local buffers

The planner must distinguish them.
Persistent values may justify expensive layout optimization. Ephemeral values favor rapid allocation and reuse.
Memory Planning and Numerical Stability
Recomputation changes floating-point behavior.
Example:
```
save exact forward value
vs
recompute approximately equivalent value
```
Because floating-point arithmetic is not associative, a recomputation that reorders or fuses operations may produce slightly different results.
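A two-line demonstration of the underlying non-associativity:

```python
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c, a + (b + c))   # 1.0 vs 0.0: evaluation order changes the result
```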
For most machine learning workloads this is acceptable. For high-precision scientific computing, deterministic replay may matter.
The planner therefore influences both performance and numerical reproducibility.
Compiler Integration
Memory planning is closely tied to compiler optimization.
The compiler needs:
- shape inference
- liveness analysis
- effect analysis
- alias analysis
- scheduling
- device placement
- cost models

A modern AD compiler does not treat memory as an afterthought. Memory is part of the program transformation process itself.
Design Questions
A memory planning system for AD should answer:
- Which values must survive for backward?
- Which values can be recomputed?
- When can buffers be reused?
- Which operations allow in-place updates?
- How are aliases tracked?
- How are dynamic shapes handled?
- How are buffers distributed across devices?
- How is checkpointing chosen?
- How are memory and compute tradeoffs modeled?

Different systems make different tradeoffs depending on workload.
Summary
Memory planning determines the practical scalability of reverse-mode automatic differentiation.
The main challenge is that backward computation depends on forward intermediates. A naive system stores everything. A scalable system analyzes lifetimes, reuses buffers, rematerializes values, checkpoints computations, and coordinates memory with execution scheduling.
In modern AD systems, memory planning is not separate from differentiation. The structure of the derivative program directly determines what must be stored, when it must remain alive, and what can be discarded or recomputed.