Tape Design

A tape is an append-only record of the operations executed during the forward pass. Reverse mode uses the tape to replay derivative rules backward.

A graph stores nodes and edges as linked objects. A tape stores instructions in execution order.

forward pass:
input -> op 0 -> op 1 -> op 2 -> output

tape:
[op 0, op 1, op 2]

backward pass:
op 2 -> op 1 -> op 0

The tape is one of the simplest practical representations for reverse mode AD. It is compact, linear, and easy to traverse in reverse.

Tape Entries

A minimal tape entry records:

the operation kind
the output slot
the input slots
any saved values needed by the backward rule

For scalar AD:

type OpKind int

const (
    OpAdd OpKind = iota
    OpMul
    OpSin
)

type Instr struct {
    Op  OpKind
    Out int
    A   int
    B   int
}

The slots refer to arrays of values and gradients.

type Tape struct {
    Values []float64
    Grads  []float64
    Instrs []Instr
}

A new variable allocates a value slot:

func (t *Tape) Var(x float64) int {
    id := len(t.Values)

    t.Values = append(t.Values, x)
    t.Grads = append(t.Grads, 0)

    return id
}

Each operation creates a new output slot and appends one instruction.

Forward Operations

Addition:

func (t *Tape) Add(a, b int) int {
    out := t.Var(t.Values[a] + t.Values[b])

    t.Instrs = append(t.Instrs, Instr{
        Op:  OpAdd,
        Out: out,
        A:   a,
        B:   b,
    })

    return out
}

Multiplication:

func (t *Tape) Mul(a, b int) int {
    out := t.Var(t.Values[a] * t.Values[b])

    t.Instrs = append(t.Instrs, Instr{
        Op:  OpMul,
        Out: out,
        A:   a,
        B:   b,
    })

    return out
}

Sine:

func (t *Tape) Sin(x int) int {
    out := t.Var(math.Sin(t.Values[x]))

    t.Instrs = append(t.Instrs, Instr{
        Op:  OpSin,
        Out: out,
        A:   x,
    })

    return out
}

The user program works with integer handles, not pointers. The tape owns the storage.

Backward Pass

The backward pass starts by seeding the output gradient:

t.Grads[out] = 1

Then it walks instructions in reverse order.

func (t *Tape) Backward(out int) {
    t.Grads[out] = 1

    for i := len(t.Instrs) - 1; i >= 0; i-- {
        ins := t.Instrs[i]

        switch ins.Op {
        case OpAdd:
            t.Grads[ins.A] += t.Grads[ins.Out]
            t.Grads[ins.B] += t.Grads[ins.Out]

        case OpMul:
            t.Grads[ins.A] += t.Values[ins.B] * t.Grads[ins.Out]
            t.Grads[ins.B] += t.Values[ins.A] * t.Grads[ins.Out]

        case OpSin:
            t.Grads[ins.A] += math.Cos(t.Values[ins.A]) * t.Grads[ins.Out]
        }
    }
}

This is reverse mode without graph traversal. The tape already gives a valid reverse topological order because instructions were appended in forward execution order.

Complete Example

For:

f(x) = x^2 + \sin(x)

func main() {
    var t Tape

    x := t.Var(2)

    x2 := t.Mul(x, x)
    sx := t.Sin(x)
    y := t.Add(x2, sx)

    t.Backward(y)

    fmt.Println("value:", t.Values[y])
    fmt.Println("grad:", t.Grads[x])
}

The derivative is:

f'(x) = 2x + \cos(x)

At x = 2, the tape computes this by replaying local rules backward.

Why a Tape Is Efficient

A tape has good locality. Values and gradients live in contiguous arrays. Instructions live in another contiguous array.

This avoids many costs of pointer-based graph nodes:

fewer heap objects
fewer pointer traversals
simpler traversal
better cache locality
easier serialization
easier memory reuse

A pointer graph is easier to inspect. A tape is often easier to execute quickly.

Saved Values

Some backward rules need values from the forward pass.

For multiplication, the backward pass needs both input values:

a.Grad += b.Value * out.Grad
b.Grad += a.Value * out.Grad

In the tape version, these are read from:

t.Values[ins.A]
t.Values[ins.B]

For operations whose backward rule needs extra data, the instruction can store parameters.

Example for power:

type Instr struct {
    Op  OpKind
    Out int
    A   int
    B   int

    Param float64
}

Then:

func (t *Tape) Pow(x int, p float64) int {
    out := t.Var(math.Pow(t.Values[x], p))

    t.Instrs = append(t.Instrs, Instr{
        Op:    OpPow,
        Out:   out,
        A:     x,
        Param: p,
    })

    return out
}

Backward:

case OpPow:
    x := t.Values[ins.A]
    p := ins.Param
    t.Grads[ins.A] += p * math.Pow(x, p-1) * t.Grads[ins.Out]

The tape must retain whatever the backward rule cannot reconstruct cheaply or safely.

Constants

Constants can be represented as value slots too.

func (t *Tape) Const(x float64) int {
    return t.Var(x)
}

This is simple, but gradients will also be allocated for constants.

A slightly better design separates variables from constants:

type SlotKind int

const (
    SlotVar SlotKind = iota
    SlotConst
    SlotTemp
)

type Tape struct {
    Values []float64
    Grads  []float64
    Kinds  []SlotKind
    Instrs []Instr
}

Then the backward pass may skip accumulating gradients into constants. For a minimal engine, the simpler representation is usually enough.

Reset and Reuse

A tape is naturally reusable if its arrays are cleared between evaluations.

func (t *Tape) Reset() {
    t.Values = t.Values[:0]
    t.Grads = t.Grads[:0]
    t.Instrs = t.Instrs[:0]
}

This preserves allocated capacity and avoids repeated allocation.

For iterative optimization, this matters:

for step := 0; step < steps; step++ {
    t.Reset()

    x := t.Var(param)
    y := loss(&t, x)

    t.Backward(y)

    param -= lr * t.Grads[x]
}

A tape-based design therefore fits tight numerical loops well.

Persistent vs Ephemeral Tapes

There are two broad tape lifetimes.

Tape style	Meaning	Typical use
Ephemeral tape	Built for one forward pass, discarded after backward	Dynamic eager AD
Persistent tape	Recorded once, replayed many times	Static computation, repeated kernels

An ephemeral tape is simple and matches ordinary program execution.

A persistent tape requires stable inputs, stable control flow, and a way to update input values before replay. It can improve performance when the same computation runs many times.

Handling Control Flow

A tape records the operations that actually executed.

func H(t *Tape, x int) int {
    if t.Values[x] > 0 {
        return t.Mul(x, x)
    }

    return t.Mul(t.Const(-1), x)
}

For positive x, the tape contains the square branch. For non-positive x, it contains the negation branch.

The derivative is local to that execution path. This matches dynamic graph AD.

At a branch boundary, the mathematical function may have a kink. The tape still returns the derivative of the executed branch.

Aliasing and Mutation

Tape design becomes harder when the source language allows mutation.

Consider:

x = x + 1
y = x * x

A tape should not overwrite the old value if the backward pass may need it. Each assignment should create a new slot, or the tape must explicitly save overwritten values.

A safe minimal rule is:

Every differentiable operation writes to a fresh slot.

This gives single-assignment behavior, similar to SSA form.

Mutation can be compiled into fresh slots plus environment updates:

x0 = input
x1 = x0 + 1
y  = x1 * x1

This makes reverse mode straightforward.

Tape vs Graph

Property	Tape	Graph
Storage	Linear arrays	Nodes and edges
Traversal	Reverse instruction order	Reverse topological order
Allocation	Compact	Often many objects
Debugging	Less visual	More inspectable
Control flow	Records executed path	Records executed path
Serialization	Simple	More complex
Optimization	Good for local passes	Good for structural passes

A tape is usually the better first production representation. A graph is usually the better first teaching representation.

Minimal Tape Invariant

A correct tape must satisfy one invariant:

For every instruction, all input slots were created before the output slot, and the backward rule has access to every primal value it needs.

Because instructions are appended during forward execution, reverse iteration gives a valid dependency order.

The backward rule for each instruction applies:

\bar{x_i} \mathrel{+}= \bar{y}\frac{\partial y}{\partial x_i}

The tape is therefore a linearized computation graph. Its advantage is that the linearization is already done during execution.