# Memory Management

Memory management is the main systems problem in reverse mode automatic differentiation. The derivative rules are usually small. The hard part is deciding which primal values, intermediate tensors, gradients, graph nodes, and tape records must stay alive until the backward pass finishes.

A minimal engine can ignore most of this at first. It can allocate every intermediate value and keep everything until `Backward` returns. That design is correct, but it does not scale.

## What Must Be Stored

Reverse mode needs enough information to run each local backward rule.

For an operation:

$$
y = g(x_1, \dots, x_k)
$$

the backward pass applies:

$$
\bar{x}_i \mathrel{+}= \bar{y}\,\frac{\partial g}{\partial x_i}
$$

To do this, the engine may need:
- the output gradient
- the input values
- the output value
- operation parameters
- shape and broadcasting metadata
- references to parent nodes or slots

For example, multiplication needs both input values:

```go
case OpMul:
    grads[a] += values[b] * grads[out]
    grads[b] += values[a] * grads[out]
```

The sine operation needs the input value:

```go
case OpSin:
    grads[x] += math.Cos(values[x]) * grads[out]
```

The exponential operation can use either the input value or the already-computed output value:

```go
case OpExp:
    grads[x] += values[out] * grads[out]
```
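
For contrast, a sketch of the same rule written against the input, which recomputes the exponential during backward:

```go
case OpExp:
    grads[x] += math.Exp(values[x]) * grads[out]
```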

This choice affects memory. Saving the output may be cheaper than recomputing it. Saving every output may also be too expensive.

## Minimal Ownership Model

A small tape-based engine can give ownership of all memory to the tape:

```go
type Tape struct {
    Values []float64
    Grads  []float64
    Instrs []Instr
}
```

The tape owns:
- primal values
- gradients
- operation records

Slots are integer indexes:

```go
type Slot int
```
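
The `Instr` type is used throughout this section but never defined. A minimal sketch consistent with the fields used later (`Op`, `Out`, `A`, `B`) and the op codes used above might be:

```go
type Op uint8

const (
    OpAdd Op = iota
    OpMul
    OpSin
    OpExp
)

type Instr struct {
    Op   Op
    Out  Slot
    A, B Slot // B is ignored by unary operations
}
```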

Operations allocate fresh slots:

```go
func (t *Tape) Var(x float64) Slot {
    id := Slot(len(t.Values))

    t.Values = append(t.Values, x)
    t.Grads = append(t.Grads, 0)

    return id
}
```

This model is simple because there are no per-node heap allocations. The tape can be reset after the backward pass:

```go
func (t *Tape) Reset() {
    t.Values = t.Values[:0]
    t.Grads = t.Grads[:0]
    t.Instrs = t.Instrs[:0]
}
```

The allocated capacity stays available for the next run.

## Fresh Slots and SSA Discipline

The safest memory rule is:

```text
Every differentiable operation writes to a fresh slot.
```

This gives the tape single-assignment semantics.

For example:

```text
x0 = input
x1 = x0 + 1
x2 = x1 * x1
```

Even if the source program reuses the variable name `x`, the AD engine stores each differentiable value separately.
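
In tape terms, the same program might be built like this (assuming `Add` and `Mul` helpers in the style of the `Add` shown later):

```go
x0 := t.Var(2)              // x = input
x1 := t.Add(x0, t.Const(1)) // x += 1 becomes a fresh slot
x2 := t.Mul(x1, x1)         // x * x reads the fresh slot twice
```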

This avoids a common reverse-mode bug: overwriting a value that the backward pass still needs.

In-place mutation can be supported later, but the minimal engine should use fresh slots everywhere.

## Gradient Storage

The simplest design allocates one gradient slot for every value slot:

```go
Values []float64
Grads  []float64
```

This is wasteful for constants and temporary values whose gradients are never inspected, but it makes the backward pass uniform.

A slightly richer design tracks slot kind:

```go
type SlotKind uint8

const (
    SlotVar SlotKind = iota
    SlotConst
    SlotTemp
)

type Tape struct {
    Values []float64
    Grads  []float64
    Kinds  []SlotKind
    Instrs []Instr
}
```

The engine can then avoid reporting gradients for constants. It may still allocate gradient storage for them to keep the backward implementation simple.
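
A hypothetical accessor makes this concrete: it refuses to report gradients for constant slots while the storage itself stays uniform.

```go
// Grad reports the gradient of a slot; ok is false for constants.
// Hypothetical helper, not part of the minimal API shown later.
func (t *Tape) Grad(s Slot) (g float64, ok bool) {
    if t.Kinds[s] == SlotConst {
        return 0, false
    }
    return t.Grads[s], true
}
```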

A production engine often uses sparse gradient allocation. Gradients are allocated only for values that require them.

## Requires-Grad Flags

Most computations contain values that do not need derivatives. A minimal engine can add a `RequiresGrad` bit per slot.

```go
type Tape struct {
    Values       []float64
    Grads        []float64
    RequiresGrad []bool
    Instrs       []Instr
}
```

Constants set it to false:

```go
func (t *Tape) Const(x float64) Slot {
    id := Slot(len(t.Values))

    t.Values = append(t.Values, x)
    t.Grads = append(t.Grads, 0)
    t.RequiresGrad = append(t.RequiresGrad, false)

    return id
}
```

Variables set it to true:

```go
func (t *Tape) Var(x float64) Slot {
    id := Slot(len(t.Values))

    t.Values = append(t.Values, x)
    t.Grads = append(t.Grads, 0)
    t.RequiresGrad = append(t.RequiresGrad, true)

    return id
}
```

An operation requires gradients if any parent requires gradients:

```go
needGrad := t.RequiresGrad[a] || t.RequiresGrad[b]
```

If `needGrad` is false, the operation can skip recording a tape instruction.

```go
func (t *Tape) Add(a, b Slot) Slot {
    out := t.alloc(t.Values[a] + t.Values[b], t.RequiresGrad[a] || t.RequiresGrad[b])

    if t.RequiresGrad[out] {
        t.Instrs = append(t.Instrs, Instr{
            Op: OpAdd,
            Out: out,
            A: a,
            B: b,
        })
    }

    return out
}
```
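
The `alloc` helper used above is not shown in this section; a sketch consistent with `Var` and `Const` would be:

```go
// alloc appends one value slot with a zeroed gradient and the given flag.
func (t *Tape) alloc(v float64, needGrad bool) Slot {
    id := Slot(len(t.Values))

    t.Values = append(t.Values, v)
    t.Grads = append(t.Grads, 0)
    t.RequiresGrad = append(t.RequiresGrad, needGrad)

    return id
}
```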

This single flag is one of the most important memory optimizations. It prevents the tape from recording irrelevant computation.

## Clearing Gradients

Gradients usually accumulate during one backward pass. Before another backward pass, the engine must either reset them or intentionally accumulate them.

Minimal reset:

```go
func (t *Tape) ZeroGrad() {
    for i := range t.Grads {
        t.Grads[i] = 0
    }
}
```

In the design above the two loops cover the same range, since `Grads` grows with `Values`. If gradient storage is ever kept larger than the live value region, a variant can clear only the slots in use:

```go
func (t *Tape) ZeroGradUsed() {
    for i := range t.Values {
        t.Grads[i] = 0
    }
}
```

If the tape is reset after each step, gradient clearing is implicit because the arrays are reused from length zero and new gradients are appended as zero.
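
A typical per-step loop, sketched with hypothetical `buildGraph` and `applyUpdate` helpers and the `Backward` declared at the end of this section:

```go
for step := 0; step < steps; step++ {
    out := buildGraph(t) // hypothetical forward construction
    t.Backward(out)
    applyUpdate(t)       // hypothetical parameter update
    t.Reset()            // next step starts with implicitly zero gradients
}
```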

## Lifetime of Primal Values

Reverse mode usually needs primal values during backward. Therefore values must outlive forward evaluation.

The minimal lifetime is:

```text
allocate during forward
keep until backward finishes
release or reset after backward
```

For large tensors, this becomes expensive. A neural network may store activations for every layer. A long sequence model may store activations for every time step.

The basic tradeoff:

| Strategy | Memory | Compute |
|---|---:|---:|
| Save all values | High | Low |
| Recompute values | Low | High |
| Checkpoint selected values | Medium | Medium |

The minimal engine should save all values. Checkpointing belongs in a later layer.

## Reference Counting

A graph engine can release values when no future backward rule needs them. This requires knowing how many downstream uses remain.

A tape can compute use counts:

```go
// UseCounts returns, for each slot, how many instructions read it.
func (t *Tape) UseCounts() []int {
    uses := make([]int, len(t.Values))

    for _, ins := range t.Instrs {
        // Assumes every instruction is binary; unary ops would need
        // to skip the unused B operand or point it at a sentinel slot.
        uses[ins.A]++
        uses[ins.B]++
    }

    return uses
}
```

During reverse traversal, the engine decrements counts after using an input. When a count reaches zero, that value may be freed or returned to a pool.
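
A sketch of that traversal, assuming hypothetical `applyBackward` and `release` methods on the tape:

```go
uses := t.UseCounts()

for i := len(t.Instrs) - 1; i >= 0; i-- {
    ins := t.Instrs[i]
    t.applyBackward(ins) // hypothetical: runs the local rule for ins.Op

    for _, in := range []Slot{ins.A, ins.B} {
        uses[in]--
        if uses[in] == 0 {
            t.release(in) // hypothetical: free or pool the buffer
        }
    }
}
```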

For scalar values this is unnecessary. For tensors it is important.

## Tensor Buffers and Pools

Tensor AD engines manage large buffers. A tensor object usually contains metadata and a pointer to storage:

```go
type Tensor struct {
    Data  []float64
    Shape []int
}
```

Allocating a new backing array for every operation is expensive. A memory pool can reuse buffers:

```go
type BufferPool struct {
    free map[int][][]float64 // free lists keyed by buffer size
}

func (p *BufferPool) Get(n int) []float64 {
    xs := p.free[n]
    if len(xs) == 0 {
        return make([]float64, n)
    }

    // Reused buffers contain stale data; callers that accumulate
    // into them (for example gradients) must zero them first.
    buf := xs[len(xs)-1]
    p.free[n] = xs[:len(xs)-1]
    return buf
}

func (p *BufferPool) Put(buf []float64) {
    if p.free == nil {
        p.free = make(map[int][][]float64)
    }

    n := cap(buf)
    p.free[n] = append(p.free[n], buf[:n])
}
```

This basic pool groups buffers by size. Production systems use more careful allocation classes, device-specific allocators, and stream-aware reuse.
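
Usage is symmetric: an operation borrows its output buffer from the pool and returns input buffers once their use counts reach zero. A sketch for an elementwise product:

```go
out := Tensor{Data: pool.Get(len(a.Data)), Shape: a.Shape}

for i := range out.Data {
    out.Data[i] = a.Data[i] * b.Data[i]
}

// Later, once reference counting says a is dead:
pool.Put(a.Data)
```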

## In-Place Operations

In-place operations reduce memory but complicate reverse mode.

Example:

```text
x += 1
y = x * x
```

If the backward rule needs the old value of `x`, overwriting it destroys information.

There are three safe approaches:

| Approach | Description |
|---|---|
| Forbid in-place mutation | Simplest and safest |
| Save overwritten values | Correct but costs memory |
| Prove value is no longer needed | Requires alias and liveness analysis |

A minimal engine should forbid in-place mutation for differentiable values.

A practical error message is better than silent wrong gradients:

```go
// AddInPlace is rejected outright: overwriting dst could destroy a
// primal value that the backward pass still needs.
func (t *Tape) AddInPlace(dst, src Slot) {
    panic("in-place differentiable operations are not supported")
}
```

## Retaining a Graph

Some systems allow repeated backward passes through the same graph. That requires retaining intermediate values and tape instructions after the first backward pass.

The default rule should be stricter:

```text
A tape may be used for one backward pass unless explicitly retained.
```

After backward, the engine may release saved values or invalidate the tape.

A retained graph costs memory. The user should request it deliberately.
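
One way to enforce the single-use rule, assuming hypothetical `used` and `retain` fields on the tape:

```go
func (t *Tape) Backward(out Slot) {
    if t.used && !t.retain {
        panic("tape already consumed; set retain before reusing it")
    }
    t.used = true
    // ... reverse traversal ...
}
```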

## Minimal Backward Memory Contract

A small reverse-mode engine should document this contract:

```text
The tape owns all intermediate values and gradients.
Each differentiable operation writes a fresh slot.
The tape keeps all recorded values until Backward completes.
Reset invalidates all slots.
```

This is simple enough to reason about and strong enough to avoid most correctness bugs.

## Practical Minimal Design

A useful minimal implementation has:

```go
type Tape struct {
    Values       []float64
    Grads        []float64
    RequiresGrad []bool
    Instrs       []Instr
}

func (t *Tape) Reset()               // invalidate all slots, keep capacity
func (t *Tape) ZeroGrad()            // clear accumulated gradients
func (t *Tape) Var(x float64) Slot   // new differentiable input
func (t *Tape) Const(x float64) Slot // new non-differentiable value
func (t *Tape) Backward(out Slot)    // reverse pass seeded at out
```

This gives:
- compact memory layout
- low allocation overhead
- clear lifetime rules
- optional skipping of irrelevant operations
- a direct path to tensor buffers later
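
As a final usage sketch (assuming `Add` and `Mul` helpers in the style shown earlier), computing $y = (x+1)^2$ at $x = 3$, where $\mathrm{d}y/\mathrm{d}x = 2(x+1) = 8$:

```go
t := &Tape{}

x := t.Var(3)
x1 := t.Add(x, t.Const(1))
y := t.Mul(x1, x1)

t.Backward(y)
fmt.Println(t.Grads[x]) // 8
```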

Memory management should start conservative. Correctness depends on preserving the values needed by the backward pass. Optimization can remove storage only after the engine can prove that the value is unused, recomputable, or safe to overwrite.

