A tape is an append-only record of the operations executed during the forward pass. Reverse mode uses the tape to replay derivative rules backward.
A tape is an append-only record of the operations executed during the forward pass. Reverse mode uses the tape to replay derivative rules backward.
A graph stores nodes and edges as linked objects. A tape stores instructions in execution order.
forward pass:
input -> op 0 -> op 1 -> op 2 -> output
tape:
[op 0, op 1, op 2]
backward pass:
op 2 -> op 1 -> op 0The tape is one of the simplest practical representations for reverse mode AD. It is compact, linear, and easy to traverse in reverse.
Tape Entries
A minimal tape entry records:
- the operation kind
- the output slot
- the input slots
- any saved values needed by the backward rule
For scalar AD:
type OpKind int
const (
OpAdd OpKind = iota
OpMul
OpSin
)
type Instr struct {
Op OpKind
Out int
A int
B int
}The slots refer to arrays of values and gradients.
type Tape struct {
Values []float64
Grads []float64
Instrs []Instr
}A new variable allocates a value slot:
func (t *Tape) Var(x float64) int {
id := len(t.Values)
t.Values = append(t.Values, x)
t.Grads = append(t.Grads, 0)
return id
}Each operation creates a new output slot and appends one instruction.
Forward Operations
Addition:
func (t *Tape) Add(a, b int) int {
out := t.Var(t.Values[a] + t.Values[b])
t.Instrs = append(t.Instrs, Instr{
Op: OpAdd,
Out: out,
A: a,
B: b,
})
return out
}Multiplication:
func (t *Tape) Mul(a, b int) int {
out := t.Var(t.Values[a] * t.Values[b])
t.Instrs = append(t.Instrs, Instr{
Op: OpMul,
Out: out,
A: a,
B: b,
})
return out
}Sine:
func (t *Tape) Sin(x int) int {
out := t.Var(math.Sin(t.Values[x]))
t.Instrs = append(t.Instrs, Instr{
Op: OpSin,
Out: out,
A: x,
})
return out
}The user program works with integer handles, not pointers. The tape owns the storage.
Backward Pass
The backward pass starts by seeding the output gradient:
t.Grads[out] = 1Then it walks instructions in reverse order.
func (t *Tape) Backward(out int) {
t.Grads[out] = 1
for i := len(t.Instrs) - 1; i >= 0; i-- {
ins := t.Instrs[i]
switch ins.Op {
case OpAdd:
t.Grads[ins.A] += t.Grads[ins.Out]
t.Grads[ins.B] += t.Grads[ins.Out]
case OpMul:
t.Grads[ins.A] += t.Values[ins.B] * t.Grads[ins.Out]
t.Grads[ins.B] += t.Values[ins.A] * t.Grads[ins.Out]
case OpSin:
t.Grads[ins.A] += math.Cos(t.Values[ins.A]) * t.Grads[ins.Out]
}
}
}This is reverse mode without graph traversal. The tape already gives a valid reverse topological order because instructions were appended in forward execution order.
Complete Example
For:
func main() {
var t Tape
x := t.Var(2)
x2 := t.Mul(x, x)
sx := t.Sin(x)
y := t.Add(x2, sx)
t.Backward(y)
fmt.Println("value:", t.Values[y])
fmt.Println("grad:", t.Grads[x])
}The derivative is:
At x = 2, the tape computes this by replaying local rules backward.
Why a Tape Is Efficient
A tape has good locality. Values and gradients live in contiguous arrays. Instructions live in another contiguous array.
This avoids many costs of pointer-based graph nodes:
- fewer heap objects
- fewer pointer traversals
- simpler traversal
- better cache locality
- easier serialization
- easier memory reuse
A pointer graph is easier to inspect. A tape is often easier to execute quickly.
Saved Values
Some backward rules need values from the forward pass.
For multiplication, the backward pass needs both input values:
a.Grad += b.Value * out.Grad
b.Grad += a.Value * out.GradIn the tape version, these are read from:
t.Values[ins.A]
t.Values[ins.B]For operations whose backward rule needs extra data, the instruction can store parameters.
Example for power:
type Instr struct {
Op OpKind
Out int
A int
B int
Param float64
}Then:
func (t *Tape) Pow(x int, p float64) int {
out := t.Var(math.Pow(t.Values[x], p))
t.Instrs = append(t.Instrs, Instr{
Op: OpPow,
Out: out,
A: x,
Param: p,
})
return out
}Backward:
case OpPow:
x := t.Values[ins.A]
p := ins.Param
t.Grads[ins.A] += p * math.Pow(x, p-1) * t.Grads[ins.Out]The tape must retain whatever the backward rule cannot reconstruct cheaply or safely.
Constants
Constants can be represented as value slots too.
func (t *Tape) Const(x float64) int {
return t.Var(x)
}This is simple, but gradients will also be allocated for constants.
A slightly better design separates variables from constants:
type SlotKind int
const (
SlotVar SlotKind = iota
SlotConst
SlotTemp
)
type Tape struct {
Values []float64
Grads []float64
Kinds []SlotKind
Instrs []Instr
}Then the backward pass may skip accumulating gradients into constants. For a minimal engine, the simpler representation is usually enough.
Reset and Reuse
A tape is naturally reusable if its arrays are cleared between evaluations.
func (t *Tape) Reset() {
t.Values = t.Values[:0]
t.Grads = t.Grads[:0]
t.Instrs = t.Instrs[:0]
}This preserves allocated capacity and avoids repeated allocation.
For iterative optimization, this matters:
for step := 0; step < steps; step++ {
t.Reset()
x := t.Var(param)
y := loss(&t, x)
t.Backward(y)
param -= lr * t.Grads[x]
}A tape-based design therefore fits tight numerical loops well.
Persistent vs Ephemeral Tapes
There are two broad tape lifetimes.
| Tape style | Meaning | Typical use |
|---|---|---|
| Ephemeral tape | Built for one forward pass, discarded after backward | Dynamic eager AD |
| Persistent tape | Recorded once, replayed many times | Static computation, repeated kernels |
An ephemeral tape is simple and matches ordinary program execution.
A persistent tape requires stable inputs, stable control flow, and a way to update input values before replay. It can improve performance when the same computation runs many times.
Handling Control Flow
A tape records the operations that actually executed.
func H(t *Tape, x int) int {
if t.Values[x] > 0 {
return t.Mul(x, x)
}
return t.Mul(t.Const(-1), x)
}For positive x, the tape contains the square branch. For non-positive x, it contains the negation branch.
The derivative is local to that execution path. This matches dynamic graph AD.
At a branch boundary, the mathematical function may have a kink. The tape still returns the derivative of the executed branch.
Aliasing and Mutation
Tape design becomes harder when the source language allows mutation.
Consider:
x = x + 1
y = x * xA tape should not overwrite the old value if the backward pass may need it. Each assignment should create a new slot, or the tape must explicitly save overwritten values.
A safe minimal rule is:
Every differentiable operation writes to a fresh slot.This gives single-assignment behavior, similar to SSA form.
Mutation can be compiled into fresh slots plus environment updates:
x0 = input
x1 = x0 + 1
y = x1 * x1This makes reverse mode straightforward.
Tape vs Graph
| Property | Tape | Graph |
|---|---|---|
| Storage | Linear arrays | Nodes and edges |
| Traversal | Reverse instruction order | Reverse topological order |
| Allocation | Compact | Often many objects |
| Debugging | Less visual | More inspectable |
| Control flow | Records executed path | Records executed path |
| Serialization | Simple | More complex |
| Optimization | Good for local passes | Good for structural passes |
A tape is usually the better first production representation. A graph is usually the better first teaching representation.
Minimal Tape Invariant
A correct tape must satisfy one invariant:
For every instruction, all input slots were created before the output slot, and the backward rule has access to every primal value it needs.Because instructions are appended during forward execution, reverse iteration gives a valid dependency order.
The backward rule for each instruction applies:
The tape is therefore a linearized computation graph. Its advantage is that the linearization is already done during execution.