Performance Benchmarking

Performance benchmarking measures whether an automatic differentiation engine is fast, memory-efficient, and scalable under realistic workloads. It also protects the engine from regressions. Without benchmarks, small implementation changes can silently increase allocation count, tape size, kernel launch count, or backward latency.

A benchmark should answer concrete questions:

  • How fast is the forward pass?
  • How fast is the backward pass?
  • How much memory is retained?
  • How many allocations occur?
  • How does cost scale with input size?
  • Which operators dominate runtime?

A useful benchmark suite does not try to measure everything. It measures the operations and execution patterns the engine claims to support well.

What to Measure

A minimal AD benchmark should separate primal execution from derivative execution.

Metric             Meaning
Forward time       Time to evaluate the original computation and record AD metadata
Backward time      Time to propagate gradients
Total time         Forward plus backward
Peak memory        Maximum retained memory during evaluation and backward
Allocation count   Number of heap allocations
Tape size          Number of recorded instructions and saved values
Throughput         Operations, samples, or tensors processed per second
Gradient accuracy  Difference from known or checked derivatives

Forward and backward should be reported separately because they stress different parts of the system. Forward pass performance depends on operator kernels and tape recording. Backward pass performance depends on gradient accumulation, traversal order, and saved intermediate access.

Microbenchmarks

Microbenchmarks isolate individual operations.

For scalar AD, examples include:

  • repeated addition
  • repeated multiplication
  • polynomial evaluation
  • transcendental functions
  • small expression trees

Example benchmark target:

f(x) = x^2 + \sin(x)

A Go benchmark:

func BenchmarkScalarForwardBackward(b *testing.B) {
    for i := 0; i < b.N; i++ {
        var t Tape

        x := t.Var(2.0)
        x2 := t.Mul(x, x)
        sx := t.Sin(x)
        y := t.Add(x2, sx)

        t.Backward(y)
    }
}

This benchmark is simple, but it may mostly measure allocation overhead. A second version should reuse the tape:

func BenchmarkScalarForwardBackwardReuse(b *testing.B) {
    var t Tape

    for i := 0; i < b.N; i++ {
        t.Reset()

        x := t.Var(2.0)
        x2 := t.Mul(x, x)
        sx := t.Sin(x)
        y := t.Add(x2, sx)

        t.Backward(y)
    }
}

The difference between these two benchmarks shows whether tape reuse matters.

Allocation Benchmarks

A small AD engine should track allocations carefully. Pointer-based graph engines often allocate one object per operation. Tape-based engines can reduce this substantially by using slices.
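
As a concrete illustration, a slice-backed tape might look like the following sketch. The field names match the Stats example later in this section, but the Instr layout and the exact representation are assumptions, not a fixed design:

type Slot int // index into the tape's value and gradient slices

type Instr struct {
    Op        int  // operation code
    A, B, Out Slot // input and output slots
}

type Tape struct {
    Values []float64 // primal values, one entry per slot
    Grads  []float64 // adjoints, indexed like Values
    Instrs []Instr   // operations in execution order
}

// Reset truncates the slices without releasing their backing arrays,
// so a reused tape stops allocating once capacity has grown to fit.
func (t *Tape) Reset() {
    t.Values = t.Values[:0]
    t.Grads = t.Grads[:0]
    t.Instrs = t.Instrs[:0]
}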

In Go, allocation reporting is built into benchmarks:

func BenchmarkTapeAllocations(b *testing.B) {
    b.ReportAllocs()

    var t Tape

    for i := 0; i < b.N; i++ {
        t.Reset()

        x := t.Var(2.0)
        y := t.Add(t.Mul(x, x), t.Sin(x))

        t.Backward(y)
    }
}

Expected output shape:

BenchmarkTapeAllocations-10    10000000    120 ns/op    0 B/op    0 allocs/op

The exact numbers depend on the machine and implementation. The important part is the trend:

  • allocation count should remain stable
  • reusable tapes should approach zero allocations per iteration
  • adding new metadata should not create hidden per-op heap allocation

Scaling Benchmarks

A benchmark should test cost as the computation grows.

Polynomial chain:

// Poly records an n-step Horner-style evaluation on the tape, so the
// recorded computation grows linearly with n.
func Poly(t *Tape, x Slot, n int) Slot {
    y := t.Const(0)

    for i := 0; i < n; i++ {
        y = t.Add(t.Mul(y, x), t.Const(float64(i+1)))
    }

    return y
}

Benchmark:

func BenchmarkPoly(b *testing.B) {
    sizes := []int{10, 100, 1000, 10000}

    for _, n := range sizes {
        b.Run(fmt.Sprintf("n=%d", n), func(b *testing.B) {
            var t Tape

            b.ReportAllocs()

            for i := 0; i < b.N; i++ {
                t.Reset()

                x := t.Var(1.01)
                y := Poly(&t, x, n)

                t.Backward(y)
            }
        })
    }
}
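
Running the suite with the standard Go tooling reports time and allocations per size:

go test -bench=Poly -benchmem

If scaling is linear, ns/op should grow roughly tenfold between adjacent sizes.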

This benchmark should scale roughly linearly with n, because the tape records a linear number of operations and the backward pass traverses them once.

If time grows superlinearly, common causes include:

  • repeated graph traversal
  • inefficient slice growth (see the sketch after this list)
  • recursive traversal overhead
  • duplicate visits of shared nodes
  • expensive gradient clearing
  • hidden allocations inside operators
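
Slice growth is the easiest of these to rule out. If the instruction count is known or can be estimated up front, the tape's slices can be grown once before the hot loop. A possible sketch, assuming the slice-backed Tape shown earlier:

// Grow reserves room for n more instructions so that appends inside
// the hot loop never trigger a reallocate-and-copy cycle.
func (t *Tape) Grow(n int) {
    if cap(t.Instrs)-len(t.Instrs) < n {
        grown := make([]Instr, len(t.Instrs), len(t.Instrs)+n)
        copy(grown, t.Instrs)
        t.Instrs = grown
    }
}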

Tape Size

Runtime alone is not enough. A tape can be fast but store too much.

Expose simple counters:

type Stats struct {
    Values int
    Grads  int
    Instrs int
}

func (t *Tape) Stats() Stats {
    return Stats{
        Values: len(t.Values),
        Grads:  len(t.Grads),
        Instrs: len(t.Instrs),
    }
}

Then record them in tests:

func TestTapeStats(t *testing.T) {
    var tape Tape

    x := tape.Var(2)
    y := tape.Add(tape.Mul(x, x), tape.Sin(x))

    stats := tape.Stats()

    t.Logf("values=%d grads=%d instrs=%d", stats.Values, stats.Grads, stats.Instrs)

    tape.Backward(y)
}

For a small expression, these counts should be predictable. Unexpected tape growth usually indicates unnecessary recording of constants or non-differentiable operations.
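
One common fix is to stop recording operations whose results can never need a gradient. A sketch of a non-recording constant, again assuming the slice-backed tape shown earlier (whether an engine's real Const behaves this way is an assumption):

// Const stores a value but records no instruction: a constant has no
// inputs to propagate gradients into, so backward can skip it entirely.
func (t *Tape) Const(v float64) Slot {
    t.Values = append(t.Values, v)
    t.Grads = append(t.Grads, 0)
    return Slot(len(t.Values) - 1)
}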

Forward vs Backward Ratio

In reverse-mode engines, backward time is typically comparable to, or larger than, forward time; the exact ratio depends on the operator mix.

Measure them separately:

func BenchmarkForwardOnly(b *testing.B) {
    var t Tape

    for i := 0; i < b.N; i++ {
        t.Reset()

        x := t.Var(2)
        _ = t.Add(t.Mul(x, x), t.Sin(x))
    }
}

func BenchmarkBackwardOnly(b *testing.B) {
    var t Tape

    x := t.Var(2)
    y := t.Add(t.Mul(x, x), t.Sin(x))

    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        t.ZeroGrad()
        t.Backward(y)
    }
}

BenchmarkBackwardOnly reuses the same tape on every iteration. That only works if the tape is persistent and the recorded values do not change. For dynamic tapes, forward and backward should be measured together.

The forward/backward split helps identify whether the cost is in evaluation or gradient propagation.

Accuracy Benchmarks

Performance benchmarks should not ignore correctness. Fast wrong gradients are worse than slow correct gradients.

A small gradient checker compares AD with finite differences.

// FiniteDiff approximates f'(x) with a symmetric (central) difference.
func FiniteDiff(f func(float64) float64, x float64) float64 {
    eps := 1e-6
    return (f(x+eps) - f(x-eps)) / (2 * eps)
}
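
A central difference is used because its truncation error is O(eps^2) rather than the O(eps) of a one-sided difference; for float64, an eps near 1e-6 roughly balances truncation error against rounding error.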

For:

f(x) = x^2 + \sin(x)

the exact derivative is:

f'(x) = 2x + \cos(x)

A check:

func TestGradAccuracy(t *testing.T) {
    f := func(xval float64) float64 {
        var tape Tape

        x := tape.Var(xval)
        y := tape.Add(tape.Mul(x, x), tape.Sin(x))

        return tape.Values[y]
    }

    g := func(xval float64) float64 {
        var tape Tape

        x := tape.Var(xval)
        y := tape.Add(tape.Mul(x, x), tape.Sin(x))

        tape.Backward(y)

        return tape.Grads[x]
    }

    x := 2.0

    got := g(x)
    want := FiniteDiff(f, x)

    if math.Abs(got-want) > 1e-5 {
        t.Fatalf("gradient mismatch: got %g want %g", got, want)
    }
}

Finite differences are approximate, so they should be used as a sanity check, not as a proof.

Tensor Benchmarks

For tensor engines, benchmarks should include both compute-heavy and memory-heavy operators.

Benchmark              Stress point
Elementwise chain      Tape overhead, memory bandwidth
Matrix multiplication  Kernel performance
Reduction              Gradient broadcasting
Softmax cross entropy  Numerical stability and fusion
Convolution            Layout and kernel efficiency
Transformer block      Realistic operator mix

Tensor benchmarks should report:

  • shape
  • dtype
  • device
  • batch size
  • forward time
  • backward time
  • peak memory

Without shape and dtype, tensor benchmark results are mostly meaningless.
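
One way to keep that information attached to the numbers is to encode it in sub-benchmark names. A sketch, where NewTensor and MatMul are hypothetical stand-ins for whatever constructor and kernel the engine actually exposes:

func BenchmarkMatMul(b *testing.B) {
    shapes := [][2]int{{64, 64}, {256, 256}, {1024, 1024}}

    for _, s := range shapes {
        // Shape and dtype go into the name so results stay interpretable.
        b.Run(fmt.Sprintf("%dx%d/float64", s[0], s[1]), func(b *testing.B) {
            x := NewTensor(s[0], s[1]) // hypothetical constructor
            y := NewTensor(s[0], s[1]) // hypothetical constructor

            b.ReportAllocs()
            b.ResetTimer()

            for i := 0; i < b.N; i++ {
                _ = MatMul(x, y) // hypothetical forward-only kernel
            }
        })
    }
}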

Warmup and Measurement

For JIT or GPU systems, benchmark timing must account for warmup.

The first run may include:

  • compilation
  • kernel loading
  • memory pool initialization
  • autotuning
  • cache population

A benchmark should separate warmup from measured runs:

for i := 0; i < warmup; i++ {
    run()
}

start := time.Now()

for i := 0; i < iters; i++ {
    run()
}

elapsed := time.Since(start)

For GPU execution, timing must synchronize before stopping the clock. Otherwise the benchmark only measures launch overhead, not actual execution.
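
In sketch form, with a hypothetical synchronize call standing in for whatever the device runtime actually provides:

start := time.Now()

for i := 0; i < iters; i++ {
    run() // enqueues device work asynchronously
}

synchronize() // hypothetical: block until all queued kernels finish

elapsed := time.Since(start) // now covers execution, not just launches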

Regression Benchmarks

Benchmarks should be part of continuous development. Store baseline results and compare across commits.

Track:

  • time per operation
  • allocations per operation
  • peak memory
  • tape instruction count
  • gradient error

A regression budget can be simple:

  • No benchmark should slow down by more than 5 percent without review.
  • No benchmark should add allocations in tight scalar loops.
  • No gradient check should exceed tolerance.

The exact thresholds depend on the project, but the policy should be explicit.
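
For Go benchmarks, one established workflow is benchstat from golang.org/x/perf, which compares repeated runs and reports statistically significant deltas:

go test -bench=. -count=10 > old.txt
# make changes, then:
go test -bench=. -count=10 > new.txt
benchstat old.txt new.txt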

Benchmark Design Rules

A useful benchmark should be:

  • deterministic
  • small enough to run often
  • large enough to expose scaling behavior
  • separated by operator class
  • clear about input shapes
  • clear about whether compilation or warmup is included
  • paired with correctness checks

Avoid benchmarks that only test one tiny expression. They are good for allocation checks, but they do not represent real workloads.

Avoid benchmarks that only test large end-to-end models. They are realistic, but poor at explaining why performance changed.

A good suite has both.

Minimal Benchmark Suite

For a minimal scalar reverse-mode engine:

Benchmark                Purpose
scalar expression        Basic overhead
tape reuse               Allocation behavior
polynomial chain         Linear scaling
shared subexpression     Gradient accumulation
many constants           Requires-grad filtering
finite difference check  Correctness sanity

For a tensor engine:

Benchmark              Purpose
elementwise chain      Tape and memory bandwidth
matmul                 Compute kernel
broadcast add          Shape-aware backward
reduce sum             Reduction backward
softplus or logsumexp  Numerical stability
small MLP              End-to-end training step

Interpreting Results

A benchmark result is useful only when connected to a hypothesis.

Example:

Tape reuse reduces allocations.

Then measure with and without reuse.

Example:

Requires-grad filtering reduces tape size.

Then compare tape instruction count when constants dominate the computation.

Example:

Backward is slower than expected.

Then split by operator and inspect which backward rules allocate or perform redundant work.

Benchmarking is not only scorekeeping. It is diagnosis.

Minimal Performance Contract

A small AD engine should make a few clear promises:

  • Scalar tape recording is O(number of executed differentiable operations).
  • Backward traversal is O(number of recorded instructions).
  • Reusable tapes do not allocate in steady state.
  • Gradient accumulation is exact up to floating point arithmetic.

These promises are easy to test.
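
The steady-state allocation promise, for example, can be checked directly with testing.AllocsPerRun. A sketch, reusing the expression from earlier sections:

func TestSteadyStateAllocs(t *testing.T) {
    var tape Tape

    run := func() {
        tape.Reset()
        x := tape.Var(2)
        tape.Backward(tape.Add(tape.Mul(x, x), tape.Sin(x)))
    }

    run() // warm up once so slice capacities reach steady state

    if allocs := testing.AllocsPerRun(100, run); allocs != 0 {
        t.Fatalf("expected 0 steady-state allocations, got %g", allocs)
    }
}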

Once they hold, optimization can proceed from evidence rather than intuition.