# Memory Planning

Memory planning determines where values are stored, how long they remain alive, and when storage can be reused.

For automatic differentiation, memory planning is a central systems problem because reverse mode often requires large numbers of intermediate values from the forward pass.

A simple mathematical expression may appear small:

```text
y = sin(x * x)
```

but a deep neural network, PDE solver, or simulation may produce millions of intermediate tensors. Reverse mode may need many of them later.

Without careful planning, memory usage becomes the limiting factor rather than arithmetic throughput.

### Why Reverse Mode Uses So Much Memory

Forward mode propagates tangent information alongside primal values. Once an operation finishes, earlier intermediates can often be discarded.

Reverse mode works differently.

The backward pass for an operation usually needs primal values computed during the forward pass.

For multiplication:

```text
z = x * y
```

the reverse rule is:

```text
bar_x += y * bar_z
bar_y += x * bar_z
```

The backward pass needs `x` and `y`.

For `y = sin(x)`:

```text
bar_x += cos(x) * bar_y
```

the backward pass needs `x`, or equivalently `sin(x)` if the derivative rule is rewritten.

Thus reverse mode must preserve enough information from the forward pass to compute the reverse pass later.

A naive reverse-mode system stores every intermediate value.

### The Tape

Many reverse-mode systems use a tape.

The tape is a runtime record of operations and saved values.

A tape entry may contain:

| Field | Purpose |
|---|---|
| operation type | Select backward rule |
| input references | Locate parents |
| output references | Locate produced value |
| saved primal values | Needed during backward |
| shape or dtype metadata | Needed for tensor rules |
| control-flow metadata | Needed to replay branches and loops |

Example:

```text
tape:
    op=mul, save=(x, y)
    op=sin, save=(x)
    op=add, save=()
```

During backward execution, the tape is traversed in reverse order.
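As a concrete sketch (a toy tape, not any particular framework's API), tape entries can be modeled as closures whose captured primal values are exactly the tape's saved memory:

```python
import math

class Var:
    """A runtime value with a unique id so adjoints can be accumulated by id."""
    _next = 0
    def __init__(self, value):
        self.value = value
        self.id = Var._next
        Var._next += 1

def mul(tape, a, b):
    out = Var(a.value * b.value)
    def backward(bar_out, adj):      # save=(a.value, b.value) via the closure
        adj[a.id] = adj.get(a.id, 0.0) + b.value * bar_out
        adj[b.id] = adj.get(b.id, 0.0) + a.value * bar_out
    tape.append((out.id, backward))
    return out

def sin(tape, a):
    out = Var(math.sin(a.value))
    def backward(bar_out, adj):      # save=(a.value,)
        adj[a.id] = adj.get(a.id, 0.0) + math.cos(a.value) * bar_out
    tape.append((out.id, backward))
    return out

def grad(tape, output):
    adj = {output.id: 1.0}          # seed the output adjoint
    for out_id, backward in reversed(tape):   # traverse in reverse order
        backward(adj.get(out_id, 0.0), adj)
    return adj

# y = sin(x * x) at x = 2.0; dy/dx = 2x * cos(x^2)
tape = []
x = Var(2.0)
y = sin(tape, mul(tape, x, x))
bar_x = grad(tape, y)[x.id]
print(abs(bar_x - 4.0 * math.cos(4.0)) < 1e-12)   # True
```

Note that the saved primals stay alive for as long as the closures on the tape do, which is precisely the memory cost this chapter is concerned with.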

### Activation Memory

In deep learning, saved forward intermediates are often called activations.

For a network:

```text
x -> layer1 -> layer2 -> layer3 -> loss
```

reverse mode may need outputs from every layer.

The activation memory often scales with:

```text
batch size
sequence length
hidden dimension
number of layers
dtype size
```

Modern models can require gigabytes or even terabytes of activation storage.

This is why memory planning is a first-class concern in AD systems.

### Liveness

A value is live if it may still be needed later.

Memory planning starts with liveness analysis.

For a primal program:

```text
x1 = mul x0 x0
x2 = sin x1
x3 = add x2 x0
return x3
```

the compiler asks:

```text
When is x1 last needed?
When is x2 last needed?
Will reverse mode require them later?
```

If `x1` is needed by the backward rule for `sin`, it remains live until the reverse pass consumes it.

Once a value is dead, its storage can be reused.
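A minimal liveness sketch over this program can answer those questions. The save policy here is a stand-in assumption: `mul` and `sin` save their inputs for backward, `add` saves nothing.

```python
# Each instruction is (output, op, inputs).
program = [
    ("x1", "mul", ("x0", "x0")),
    ("x2", "sin", ("x1",)),
    ("x3", "add", ("x2", "x0")),
]

# Assumed save policy: whether each op's backward rule needs its inputs.
SAVES_INPUTS = {"mul": True, "sin": True, "add": False}

def last_uses(program, outputs):
    last = {}
    for i, (out, op, ins) in enumerate(program):
        for v in ins:
            last[v] = i                       # latest forward read wins
    kept = {v for _, op, ins in program if SAVES_INPUTS[op] for v in ins}
    for v in kept | set(outputs):
        last[v] = len(program)                # live into the backward pass
    return last

print(last_uses(program, {"x3"}))
# {'x0': 3, 'x1': 3, 'x2': 2, 'x3': 3}
```

Here `x2` dies at its last forward use, while `x0` and `x1` survive past the forward pass because backward rules will consume them.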

### Buffer Reuse

Two values whose lifetimes do not overlap can share the same memory buffer.

Example:

```text
x1 live during steps 1-3
x2 live during steps 4-6
```

The compiler may allocate both into the same buffer.

```text
buffer A:
    stores x1
    later reused for x2
```

This reduces peak memory usage.

Tensor programs benefit greatly from reuse because tensor allocations are large and allocation overhead can be expensive.
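The interval-sharing idea can be sketched as a greedy scan over lifetimes (a toy linear-scan allocator, not a production algorithm):

```python
def assign_buffers(lifetimes):
    """lifetimes: {name: (first_step, last_step)} -> {name: buffer_index}."""
    assignment, buffer_end = {}, {}
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        # reuse any buffer whose previous occupant died before `start`
        buf = next((b for b, e in buffer_end.items() if e < start), None)
        if buf is None:
            buf = len(buffer_end)             # open a fresh buffer
        assignment[name] = buf
        buffer_end[buf] = end
    return assignment

plan = assign_buffers({"x1": (1, 3), "x2": (4, 6), "x3": (2, 5)})
print(plan["x1"] == plan["x2"], plan["x3"] != plan["x1"])   # True True
```

`x1` and `x2` have disjoint lifetimes and share a buffer; `x3` overlaps both and gets its own.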

### In-Place Updates

An operation may overwrite an input buffer instead of allocating a new output.

Example:

```text
x = relu(x)
```

If the old value of `x` is no longer needed, the output may reuse the same storage.

However, reverse mode complicates this.

The backward rule for ReLU may need to know which elements were positive during the forward pass.

```text
bar_x = bar_y * (x > 0)
```

If the in-place update destroyed the original `x`, the backward pass may fail.

Therefore memory planning must know:

```text
Which values are needed later?
Can overwritten values be recomputed?
Can metadata summarize the needed information?
```

Many AD systems either avoid unsafe in-place mutation or track it explicitly.
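One common resolution is the metadata option above: ReLU's backward rule needs only the sign pattern of the input, so a compact mask can be captured before the buffer is overwritten. A NumPy sketch:

```python
import numpy as np

def relu_forward_inplace(x):
    mask = x > 0                  # metadata summarizing what backward needs
    np.maximum(x, 0, out=x)       # overwrite the input buffer in place
    return x, mask

def relu_backward(bar_y, mask):
    return bar_y * mask

x = np.array([-1.0, 2.0, -3.0, 4.0])
y, mask = relu_forward_inplace(x)
bar_x = relu_backward(np.ones_like(y), mask)
print(bar_x)   # [0. 1. 0. 1.]
```

The boolean mask costs one byte per element, so the in-place update becomes safe at a fraction of the memory a saved copy of `x` would require.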

### Saved Values vs Recomputation

A value needed during backward can be:

| Strategy | Meaning |
|---|---|
| Saved | Store the forward value |
| Recomputed | Recompute it later |
| Checkpointed | Save selected boundaries only |
| Compressed | Store reduced representation |

The optimal choice depends on compute cost and memory cost.

Example:

```text
x1 = expensive(x0)
x2 = cheap(x1)
```

Saving `x1` may consume substantial memory; recomputing it repeats the expensive work.

A compiler may choose different strategies for different operations.

### Checkpointing

Checkpointing trades computation for memory.

Instead of saving every intermediate, the system saves selected checkpoints and recomputes missing values during backward.

Example:

```text
x0 -> x1 -> x2 -> x3 -> x4
```

Naive reverse mode saves:

```text
x1, x2, x3, x4
```

Checkpointing may save only:

```text
x2, x4
```

During backward:

```text
recompute x3 from x2
run backward for x4
run backward for x3
```

This reduces memory at the cost of extra forward computation.

Checkpointing is essential for long sequences, large transformer models, neural ODEs, and scientific simulations.

### Rematerialization

Recomputation of discarded values is often called rematerialization.

The compiler decides:

```text
store value
or
recompute value later
```

This decision depends on:

| Factor | Effect |
|---|---|
| Tensor size | Large tensors are expensive to store |
| Compute cost | Expensive operations are costly to recompute |
| Reuse count | Frequently reused values favor storage |
| Device bandwidth | Memory movement may dominate runtime |
| Parallel execution | Recomputation may block scheduling |

A good planner balances arithmetic cost against memory pressure.
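A toy decision rule makes the tradeoff concrete. The weighting is purely illustrative, not taken from any real planner:

```python
def should_store(tensor_bytes, recompute_cost, reuse_count, alpha=1.0):
    """Store when rebuilding the value (once per reuse) would cost more
    than holding its bytes. `alpha` converts compute cost into a
    memory-equivalent unit and is an assumed, illustrative constant."""
    return tensor_bytes <= alpha * recompute_cost * reuse_count

# Large but cheap to rebuild: rematerialize.
print(should_store(tensor_bytes=4096, recompute_cost=10, reuse_count=1))     # False
# Small but expensive to rebuild and reused often: store.
print(should_store(tensor_bytes=64, recompute_cost=10_000, reuse_count=3))   # True
```

Real planners refine this with bandwidth, fragmentation, and scheduling constraints, but the basic comparison is the same.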

### Reverse-Mode Scheduling

The backward pass itself has memory requirements.

Adjoints must remain alive until all downstream contributions arrive.

Example:

```text
x1 = f(x0)
x2 = g(x0)
x3 = add(x1, x2)
```

The adjoint for `x0` receives contributions from both paths:

```text
bar_x0 += contribution_from_f
bar_x0 += contribution_from_g
```

Thus `bar_x0` must stay live until all contributions are accumulated.

The order of reverse execution affects peak memory use.
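With hypothetical `f = sin` and `g = square`, the accumulation for this fan-out looks like:

```python
import math

x0 = 1.5
# forward: x1 = sin(x0); x2 = x0 * x0; x3 = x1 + x2
bar_x3 = 1.0                        # seed
bar_x1, bar_x2 = bar_x3, bar_x3     # add passes the adjoint through
bar_x0 = 0.0                        # must stay live until both arrive
bar_x0 += math.cos(x0) * bar_x1     # contribution from f = sin
bar_x0 += 2.0 * x0 * bar_x2         # contribution from g = square
print(abs(bar_x0 - (math.cos(1.5) + 3.0)) < 1e-12)   # True
```

If the scheduler ran other backward work between the two accumulations, `bar_x0` would hold its buffer across that whole span, which is why accumulation order matters for peak memory.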

### Graph Scheduling

Graph IRs permit flexible execution order.

A graph scheduler chooses:

```text
operation order
buffer allocation
buffer reuse
device placement
communication overlap
recompute strategy
```

For AD, the scheduler may jointly optimize forward and backward graphs.

For example:

```text
forward kernel
save activation
launch backward kernel
free activation
```

A better schedule may overlap computation and freeing more aggressively.

### Tensor Lifetimes

Large tensor systems often model lifetimes explicitly.

Example:

```text
tensor A:
    allocated at step 10
    last used at step 35

tensor B:
    allocated at step 36
    last used at step 50
```

The planner can place both in the same memory region.

This becomes more important on accelerators where memory is limited and allocation is expensive.

GPU memory fragmentation can significantly reduce usable capacity even when nominal free memory exists.

### Static vs Dynamic Memory Planning

Static planning determines memory layout before execution.

This is possible when:

```text
shapes are known
control flow is fixed
graph structure is static
```

Static planning enables aggressive optimization.

Dynamic planning allocates memory during execution.

This is necessary when:

```text
shapes vary at runtime
control flow depends on data
graphs are dynamic
```

Dynamic planning is more flexible but may have higher runtime overhead and fragmentation.

Modern systems often combine both:

```text
static allocation for stable graph regions
dynamic allocation for variable regions
```

### Memory Pools

Repeated allocation and deallocation are expensive.

Many AD runtimes use memory pools.

Instead of returning freed memory to the operating system, buffers remain in a reusable pool.

```text
allocate buffer
use buffer
mark free
reuse buffer later
```

This reduces allocator overhead and fragmentation.

Memory pools are especially important for GPU execution because device allocations can be slow.
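A minimal size-bucketed pool sketch (real allocators add alignment, block splitting, and stream-ordered semantics):

```python
class Pool:
    def __init__(self):
        self.free = {}          # size -> list of reusable buffers
        self.fresh_allocs = 0   # how often we actually hit the allocator

    def allocate(self, size):
        bucket = self.free.get(size)
        if bucket:
            return bucket.pop()           # reuse without a real allocation
        self.fresh_allocs += 1
        return bytearray(size)

    def release(self, buf):
        self.free.setdefault(len(buf), []).append(buf)

pool = Pool()
a = pool.allocate(1024)
pool.release(a)
b = pool.allocate(1024)     # served from the pool: b is the same buffer as a
print(pool.fresh_allocs)    # 1
```

The second request never reaches the underlying allocator, which is the effect that matters on devices where allocation can synchronize or stall execution.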

### Alias Analysis

Mutation introduces aliasing problems.

Example:

```text
a = b
a[i] = 1
```

The update affects both names.

The memory planner must know:

```text
Which buffers may alias?
Which updates are visible elsewhere?
Which saved values become invalid after mutation?
```

Incorrect alias analysis can produce wrong gradients.

Compiler-based systems often use SSA-style memory models or effect systems to make aliasing explicit.
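The hazard is easy to reproduce with NumPy, where assignment binds an alias rather than making a copy:

```python
import numpy as np

b = np.zeros(3)
a = b                       # alias, not a copy
saved_for_backward = b      # unsafe: "saved" by reference
a[0] = 1.0
print(saved_for_backward[0])   # 1.0 -- the saved value was silently mutated

safe_copy = b.copy()        # a planner that detects the alias copies instead
a[1] = 2.0
print(safe_copy[1])         # 0.0 -- the copy is unaffected
```

A gradient computed from the mutated "saved" value would be wrong; the copy costs memory but preserves correctness, which is exactly the tradeoff alias analysis lets the planner make deliberately.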

### Compression

Saved activations may be compressed.

Strategies include:

| Compression method | Tradeoff |
|---|---|
| Lower precision | Less memory, reduced numerical accuracy |
| Quantization | Smaller storage, reconstruction overhead |
| Sparsification | Reduced memory for sparse activations |
| Delta encoding | Effective for slowly changing states |
| Recomputation from seeds | Minimal storage, extra compute |

Compression is increasingly important for very large models.

### Distributed Memory

Large-scale training distributes activations and gradients across devices.

The planner must coordinate:

```text
device-local buffers
communication buffers
gradient synchronization
pipeline stages
activation sharding
optimizer state placement
```

The derivative graph is therefore also a communication graph.

Memory planning becomes coupled with network scheduling.

### Forward-Mode Memory

Forward mode uses less memory because derivatives propagate immediately.

A forward-mode computation usually needs only:

```text
current primal values
current tangent values
```

No backward replay is required.

This is one reason forward mode is attractive for small-input problems and higher-order local differentiation.
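A dual-number sketch makes the contrast concrete: each `(primal, tangent)` pair can be discarded as soon as its consumer runs, and nothing is recorded for later.

```python
import math

# Values are (primal, tangent) pairs; derivatives propagate immediately.
def mul(a, b):
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

def sin(a):
    return (math.sin(a[0]), math.cos(a[0]) * a[1])

x = (2.0, 1.0)                     # seed tangent dx = 1
y = sin(mul(x, x))                 # y = sin(x * x), no tape required
print(abs(y[1] - 4.0 * math.cos(4.0)) < 1e-12)   # True
```

Memory use is proportional to the live primal state, independent of program length, which is the opposite of the reverse-mode situation described above.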

### Persistent vs Ephemeral Values

Some values persist across calls:

```text
model parameters
optimizer state
compiled kernels
memory pools
```

Others are ephemeral:

```text
temporary activations
adjoints
intermediate reductions
loop-local buffers
```

The planner must distinguish them.

Persistent values may justify expensive layout optimization. Ephemeral values favor rapid allocation and reuse.

### Memory Planning and Numerical Stability

Recomputation changes floating-point behavior.

Example:

```text
save exact forward value
vs
recompute approximately equivalent value
```

Because floating-point arithmetic is not associative, recomputation may produce slightly different results.
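A one-line demonstration in double precision:

```python
saved = (0.1 + 0.2) + 0.3       # forward-pass summation order
recomputed = 0.1 + (0.2 + 0.3)  # rematerialized in a different order
print(saved == recomputed)      # False: the two orders round differently
```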

For most machine learning workloads this is acceptable. For high-precision scientific computing, deterministic replay may matter.

The planner therefore influences both performance and numerical reproducibility.

### Compiler Integration

Memory planning is closely tied to compiler optimization.

The compiler needs:

```text
shape inference
liveness analysis
effect analysis
alias analysis
scheduling
device placement
cost models
```

A modern AD compiler does not treat memory as an afterthought. Memory is part of the program transformation process itself.

### Design Questions

A memory planning system for AD should answer:

```text
Which values must survive for backward?
Which values can be recomputed?
When can buffers be reused?
Which operations allow in-place updates?
How are aliases tracked?
How are dynamic shapes handled?
How are buffers distributed across devices?
How is checkpointing chosen?
How are memory and compute tradeoffs modeled?
```

Different systems make different tradeoffs depending on workload.

### Summary

Memory planning determines the practical scalability of reverse-mode automatic differentiation.

The main challenge is that backward computation depends on forward intermediates. A naive system stores everything. A scalable system analyzes lifetimes, reuses buffers, rematerializes values, checkpoints computations, and coordinates memory with execution scheduling.

In modern AD systems, memory planning is not separate from differentiation. The structure of the derivative program directly determines what must be stored, when it must remain alive, and what can be discarded or recomputed.

