Performance Benchmarking

Performance benchmarking measures whether an automatic differentiation engine is fast, memory-efficient, and scalable under realistic workloads. It also protects the engine from regressions. Without benchmarks, small implementation changes can silently increase allocation count, tape size, kernel launch count, or backward latency.

A benchmark should answer concrete questions:

  • How fast is the forward pass?
  • How fast is the backward pass?
  • How much memory is retained?
  • How many allocations occur?
  • How does cost scale with input size?
  • Which operators dominate runtime?

A useful benchmark suite does not try to measure everything. It measures the operations and execution patterns the engine claims to support well.

What to Measure

A minimal AD benchmark should separate primal execution from derivative execution.

Metric             Meaning
Forward time       Time to evaluate the original computation and record AD metadata
Backward time      Time to propagate gradients
Total time         Forward plus backward
Peak memory        Maximum retained memory during evaluation and backward
Allocation count   Number of heap allocations
Tape size          Number of recorded instructions and saved values
Throughput         Operations, samples, or tensors processed per second
Gradient accuracy  Difference from known or checked derivatives

Forward and backward should be reported separately because they stress different parts of the system. Forward pass performance depends on operator kernels and tape recording. Backward pass performance depends on gradient accumulation, traversal order, and saved intermediate access.

Microbenchmarks

Microbenchmarks isolate individual operations.

For scalar AD, examples include:

  • repeated addition
  • repeated multiplication
  • polynomial evaluation
  • transcendental functions
  • small expression trees

Example benchmark target:

f(x) = x^2 + \sin(x)

A Go benchmark:

func BenchmarkScalarForwardBackward(b *testing.B) {
    for i := 0; i < b.N; i++ {
        var t Tape

        x := t.Var(2.0)
        x2 := t.Mul(x, x)
        sx := t.Sin(x)
        y := t.Add(x2, sx)

        t.Backward(y)
    }
}

This benchmark is simple, but it may mostly measure allocation overhead. A second version should reuse the tape:

func BenchmarkScalarForwardBackwardReuse(b *testing.B) {
    var t Tape

    for i := 0; i < b.N; i++ {
        t.Reset()

        x := t.Var(2.0)
        x2 := t.Mul(x, x)
        sx := t.Sin(x)
        y := t.Add(x2, sx)

        t.Backward(y)
    }
}

The difference between these two benchmarks shows whether tape reuse matters.

Allocation Benchmarks

A small AD engine should track allocations carefully. Pointer-based graph engines often allocate one object per operation. Tape-based engines can reduce this substantially by using slices.
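
As a concrete illustration, a slice-backed tape might look like the following sketch. The field names match the Stats example later in this section, but the Instr layout and the exact representation are assumptions, not a fixed design:

type Slot int // index into the tape's value and gradient slices

type Instr struct {
    Op        int  // operation code
    A, B, Out Slot // input and output slots
}

type Tape struct {
    Values []float64 // primal values, one entry per slot
    Grads  []float64 // adjoints, indexed like Values
    Instrs []Instr   // operations in execution order
}

// Reset truncates the slices without releasing their backing arrays,
// so a reused tape stops allocating once capacity has grown to fit.
func (t *Tape) Reset() {
    t.Values = t.Values[:0]
    t.Grads = t.Grads[:0]
    t.Instrs = t.Instrs[:0]
}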

In Go, allocation reporting is built into benchmarks:

func BenchmarkTapeAllocations(b *testing.B) {
    b.ReportAllocs()

    var t Tape

    for i := 0; i < b.N; i++ {
        t.Reset()

        x := t.Var(2.0)
        y := t.Add(t.Mul(x, x), t.Sin(x))

        t.Backward(y)
    }
}

Expected output shape:

BenchmarkTapeAllocations-10    10000000    120 ns/op    0 B/op    0 allocs/op

The exact numbers depend on the machine and implementation. The important part is the trend:

  • allocation count should remain stable
  • reusable tapes should approach zero allocations per iteration
  • adding new metadata should not create hidden per-op heap allocation

Scaling Benchmarks

A benchmark should test cost as the computation grows.

Polynomial chain:

// Poly records an n-step Horner-style evaluation on the tape, so the
// recorded computation grows linearly with n.
func Poly(t *Tape, x Slot, n int) Slot {
    y := t.Const(0)

    for i := 0; i < n; i++ {
        y = t.Add(t.Mul(y, x), t.Const(float64(i+1)))
    }

    return y
}

Benchmark:

func BenchmarkPoly(b *testing.B) {
    sizes := []int{10, 100, 1000, 10000}

    for _, n := range sizes {
        b.Run(fmt.Sprintf("n=%d", n), func(b *testing.B) {
            var t Tape

            b.ReportAllocs()

            for i := 0; i < b.N; i++ {
                t.Reset()

                x := t.Var(1.01)
                y := Poly(&t, x, n)

                t.Backward(y)
            }
        })
    }
}
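
Running the suite with the standard Go tooling reports time and allocations per size:

go test -bench=Poly -benchmem

If scaling is linear, ns/op should grow roughly tenfold between adjacent sizes.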

This benchmark should scale roughly linearly with n, because the tape records a linear number of operations and the backward pass traverses them once.

If time grows superlinearly, common causes include:

  • repeated graph traversal
  • inefficient slice growth (see the sketch after this list)
  • recursive traversal overhead
  • duplicate visits of shared nodes
  • expensive gradient clearing
  • hidden allocations inside operators
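
Slice growth is the easiest of these to rule out. If the instruction count is known or can be estimated up front, the tape's slices can be grown once before the hot loop. A possible sketch, assuming the slice-backed Tape shown earlier:

// Grow reserves room for n more instructions so that appends inside
// the hot loop never trigger a reallocate-and-copy cycle.
func (t *Tape) Grow(n int) {
    if cap(t.Instrs)-len(t.Instrs) < n {
        grown := make([]Instr, len(t.Instrs), len(t.Instrs)+n)
        copy(grown, t.Instrs)
        t.Instrs = grown
    }
}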

Tape Size

Runtime alone is not enough. A tape can be fast but store too much.

Expose simple counters:

type Stats struct {
    Values int
    Grads  int
    Instrs int
}

func (t *Tape) Stats() Stats {
    return Stats{
        Values: len(t.Values),
        Grads:  len(t.Grads),
        Instrs: len(t.Instrs),
    }
}

Then record them in tests:

func TestTapeStats(t *testing.T) {
    var tape Tape

    x := tape.Var(2)
    y := tape.Add(tape.Mul(x, x), tape.Sin(x))

    stats := tape.Stats()

    t.Logf("values=%d grads=%d instrs=%d", stats.Values, stats.Grads, stats.Instrs)

    tape.Backward(y)
}

For a small expression, these counts should be predictable. Unexpected tape growth usually indicates unnecessary recording of constants or non-differentiable operations.
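
One common fix is to stop recording operations whose results can never need a gradient. A sketch of a non-recording constant, again assuming the slice-backed tape shown earlier (whether an engine's real Const behaves this way is an assumption):

// Const stores a value but records no instruction: a constant has no
// inputs to propagate gradients into, so backward can skip it entirely.
func (t *Tape) Const(v float64) Slot {
    t.Values = append(t.Values, v)
    t.Grads = append(t.Grads, 0)
    return Slot(len(t.Values) - 1)
}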

Forward vs Backward Ratio

In reverse-mode engines, backward time is typically comparable to, or larger than, forward time; the exact ratio depends on the operator mix.

Measure them separately:

func BenchmarkForwardOnly(b *testing.B) {
    var t Tape

    for i := 0; i < b.N; i++ {
        t.Reset()

        x := t.Var(2)
        _ = t.Add(t.Mul(x, x), t.Sin(x))
    }
}

func BenchmarkBackwardOnly(b *testing.B) {
    var t Tape

    x := t.Var(2)
    y := t.Add(t.Mul(x, x), t.Sin(x))

    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        t.ZeroGrad()
        t.Backward(y)
    }
}

BenchmarkBackwardOnly reuses the same tape on every iteration. That only works if the tape is persistent and the recorded values do not change. For dynamic tapes, forward and backward should be measured together.

The forward/backward split helps identify whether the cost is in evaluation or gradient propagation.

Accuracy Benchmarks

Performance benchmarks should not ignore correctness. Fast wrong gradients are worse than slow correct gradients.

A small gradient checker compares AD with finite differences.

// FiniteDiff approximates f'(x) with a symmetric (central) difference.
func FiniteDiff(f func(float64) float64, x float64) float64 {
    eps := 1e-6
    return (f(x+eps) - f(x-eps)) / (2 * eps)
}
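
A central difference is used because its truncation error is O(eps^2) rather than the O(eps) of a one-sided difference; for float64, an eps near 1e-6 roughly balances truncation error against rounding error.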

For:

f(x) = x^2 + \sin(x)

the exact derivative is:

f'(x) = 2x + \cos(x)

A check:

func TestGradAccuracy(t *testing.T) {
    f := func(xval float64) float64 {
        var tape Tape

        x := tape.Var(xval)
        y := tape.Add(tape.Mul(x, x), tape.Sin(x))

        return tape.Values[y]
    }

    g := func(xval float64) float64 {
        var tape Tape

        x := tape.Var(xval)
        y := tape.Add(tape.Mul(x, x), tape.Sin(x))

        tape.Backward(y)

        return tape.Grads[x]
    }

    x := 2.0

    got := g(x)
    want := FiniteDiff(f, x)

    if math.Abs(got-want) > 1e-5 {
        t.Fatalf("gradient mismatch: got %g want %g", got, want)
    }
}

Finite differences are approximate, so they should be used as a sanity check, not as a proof.

Tensor Benchmarks

For tensor engines, benchmarks should include both compute-heavy and memory-heavy operators.

Benchmark              Stress point
Elementwise chain      Tape overhead, memory bandwidth
Matrix multiplication  Kernel performance
Reduction              Gradient broadcasting
Softmax cross entropy  Numerical stability and fusion
Convolution            Layout and kernel efficiency
Transformer block      Realistic operator mix

Tensor benchmarks should report:

  • shape
  • dtype
  • device
  • batch size
  • forward time
  • backward time
  • peak memory

Without shape and dtype, tensor benchmark results are mostly meaningless.
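
One way to keep that information attached to the numbers is to encode it in sub-benchmark names. A sketch, where NewTensor and MatMul are hypothetical stand-ins for whatever constructor and kernel the engine actually exposes:

func BenchmarkMatMul(b *testing.B) {
    shapes := [][2]int{{64, 64}, {256, 256}, {1024, 1024}}

    for _, s := range shapes {
        // Shape and dtype go into the name so results stay interpretable.
        b.Run(fmt.Sprintf("%dx%d/float64", s[0], s[1]), func(b *testing.B) {
            x := NewTensor(s[0], s[1]) // hypothetical constructor
            y := NewTensor(s[0], s[1]) // hypothetical constructor

            b.ReportAllocs()
            b.ResetTimer()

            for i := 0; i < b.N; i++ {
                _ = MatMul(x, y) // hypothetical forward-only kernel
            }
        })
    }
}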

Warmup and Measurement

For JIT or GPU systems, benchmark timing must account for warmup.

The first run may include:

  • compilation
  • kernel loading
  • memory pool initialization
  • autotuning
  • cache population

A benchmark should separate warmup from measured runs:

for i := 0; i < warmup; i++ {
    run()
}

start := time.Now()

for i := 0; i < iters; i++ {
    run()
}

elapsed := time.Since(start)

For GPU execution, timing must synchronize before stopping the clock. Otherwise the benchmark only measures launch overhead, not actual execution.
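
In sketch form, with a hypothetical synchronize call standing in for whatever the device runtime actually provides:

start := time.Now()

for i := 0; i < iters; i++ {
    run() // enqueues device work asynchronously
}

synchronize() // hypothetical: block until all queued kernels finish

elapsed := time.Since(start) // now covers execution, not just launches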

Regression Benchmarks

Benchmarks should be part of continuous development. Store baseline results and compare across commits.

Track:

  • time per operation
  • allocations per operation
  • peak memory
  • tape instruction count
  • gradient error

A regression budget can be simple:

  • No benchmark should slow down by more than 5 percent without review.
  • No benchmark should add allocations in tight scalar loops.
  • No gradient check should exceed tolerance.

The exact thresholds depend on the project, but the policy should be explicit.
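
For Go benchmarks, one established workflow is benchstat from golang.org/x/perf, which compares repeated runs and reports statistically significant deltas:

go test -bench=. -count=10 > old.txt
# make changes, then:
go test -bench=. -count=10 > new.txt
benchstat old.txt new.txt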

Benchmark Design Rules

A useful benchmark should be:

  • deterministic
  • small enough to run often
  • large enough to expose scaling behavior
  • separated by operator class
  • clear about input shapes
  • clear about whether compilation or warmup is included
  • paired with correctness checks

Avoid benchmarks that only test one tiny expression. They are good for allocation checks, but they do not represent real workloads.

Avoid benchmarks that only test large end-to-end models. They are realistic, but poor at explaining why performance changed.

A good suite has both.

Minimal Benchmark Suite

For a minimal scalar reverse-mode engine:

Benchmark                Purpose
scalar expression        Basic overhead
tape reuse               Allocation behavior
polynomial chain         Linear scaling
shared subexpression     Gradient accumulation
many constants           Requires-grad filtering
finite difference check  Correctness sanity

For a tensor engine:

Benchmark              Purpose
elementwise chain      Tape and memory bandwidth
matmul                 Compute kernel
broadcast add          Shape-aware backward
reduce sum             Reduction backward
softplus or logsumexp  Numerical stability
small MLP              End-to-end training step

Interpreting Results

A benchmark result is useful only when connected to a hypothesis.

Example:

Tape reuse reduces allocations.

Then measure with and without reuse.

Example:

Requires-grad filtering reduces tape size.

Then compare tape instruction count when constants dominate the computation.

Example:

Backward is slower than expected.

Then split by operator and inspect which backward rules allocate or perform redundant work.

Benchmarking is not only scorekeeping. It is diagnosis.

Minimal Performance Contract

A small AD engine should make a few clear promises:

  • Scalar tape recording is O(number of executed differentiable operations).
  • Backward traversal is O(number of recorded instructions).
  • Reusable tapes do not allocate in steady state.
  • Gradient accumulation is exact up to floating point arithmetic.

These promises are easy to test.
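
The steady-state allocation promise, for example, can be checked directly with testing.AllocsPerRun. A sketch, reusing the expression from earlier sections:

func TestSteadyStateAllocs(t *testing.T) {
    var tape Tape

    run := func() {
        tape.Reset()
        x := tape.Var(2)
        tape.Backward(tape.Add(tape.Mul(x, x), tape.Sin(x)))
    }

    run() // warm up once so slice capacities reach steady state

    if allocs := testing.AllocsPerRun(100, run); allocs != 0 {
        t.Fatalf("expected 0 steady-state allocations, got %g", allocs)
    }
}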

Once they hold, optimization can proceed from evidence rather than intuition.