Testing Derivatives

An automatic differentiation engine is only useful if its derivatives are correct. A small mistake in a backward rule can silently corrupt optimization, training, or scientific computation. Derivative testing therefore belongs at the center of AD system design.

A correct primal computation with incorrect gradients is usually worse than a runtime error. Numerical optimization may still appear to progress while converging to meaningless results.

Derivative testing should validate:

  • local operator rules
  • graph traversal logic
  • gradient accumulation
  • broadcasting semantics
  • tensor reductions
  • higher-order behavior
  • numerical stability

The most important principle:

Every primitive operator should be testable independently.

Classes of Errors

Derivative bugs usually fall into a few categories.

Error type                 Example
Wrong derivative formula   Missing factor or sign
Incorrect accumulation     Using = instead of +=
Shape mismatch             Incorrect reduction axes
Broadcasting bug           Missing sum during backward
Aliasing bug               Overwritten saved value
Numerical instability      Overflow in backward
Traversal bug              Visiting node multiple times
Missing gradient path      Detached edge
Stateful inconsistency     Different random mask

Many of these errors produce gradients with plausible values. That is why testing matters.

Finite Difference Approximation

The simplest validation method compares AD gradients against finite differences.

For a scalar function

f : \mathbb{R} \to \mathbb{R}

the central difference approximation is:

f'(x) \approx \frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}

A helper:

func FiniteDiff(
    f func(float64) float64,
    x float64,
) float64 {
    eps := 1e-6

    return (f(x+eps) - f(x-eps)) / (2 * eps)
}

Example function:

f(x) = x^2 + \sin(x)

func scalarFunction(x float64) float64 {
    return x*x + math.Sin(x)
}

AD version:

func scalarGrad(xval float64) float64 {
    var t Tape

    x := t.Var(xval)

    y := t.Add(
        t.Mul(x, x),
        t.Sin(x),
    )

    t.Backward(y)

    return t.Grads[x]
}

Test:

func TestScalarGrad(t *testing.T) {
    x := 2.0

    fd := FiniteDiff(scalarFunction, x)
    ad := scalarGrad(x)

    if math.Abs(fd-ad) > 1e-5 {
        t.Fatalf(
            "gradient mismatch: fd=%g ad=%g",
            fd,
            ad,
        )
    }
}

This is the basic derivative sanity check.

Choosing Epsilon

Finite differences are approximate, and the choice of the step size

\epsilon

matters.

Too small:

  • cancellation error dominates

Too large:

  • truncation error dominates

Typical values:

  • 1e-4
  • 1e-5
  • 1e-6

Central differences are preferred over forward differences because they have smaller truncation error.

Forward difference:

\frac{f(x+\epsilon)-f(x)}{\epsilon}

Central difference:

\frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}

The forward difference has truncation error of order \epsilon, while the central difference has error of order \epsilon^2, which makes it substantially more accurate at the same step size.
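
One common refinement is to scale the step to the magnitude of the input instead of using a fixed constant. The helper below is a minimal sketch of that idea.

func FiniteDiffScaled(
    f func(float64) float64,
    x float64,
) float64 {
    // Scale eps with |x| so the step stays meaningful
    // for both tiny and huge inputs.
    eps := 1e-6 * math.Max(1, math.Abs(x))

    return (f(x+eps) - f(x-eps)) / (2 * eps)
}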

Relative Error

Absolute error alone can be misleading.

Better comparison:

\text{relative error} = \frac{|a-b|}{\max(1, |a|, |b|)}

Helper:

func RelativeError(a, b float64) float64 {
    return math.Abs(a-b) /
        math.Max(
            1,
            math.Max(math.Abs(a), math.Abs(b)),
        )
}

Then:

if RelativeError(fd, ad) > 1e-5 {
    t.Fatalf("gradient mismatch")
}

Relative error scales better across small and large gradients.

Operator-Level Testing

Each primitive operator should have isolated tests.

Example for multiplication:

f(x, y) = xy

Expected derivatives:

\frac{\partial f}{\partial x} = y

\frac{\partial f}{\partial y} = x

Test:

func TestMulGrad(t *testing.T) {
    var tape Tape

    x := tape.Var(3)
    y := tape.Var(5)

    z := tape.Mul(x, y)

    tape.Backward(z)

    if tape.Grads[x] != 5 {
        t.Fatalf("wrong grad dx")
    }

    if tape.Grads[y] != 3 {
        t.Fatalf("wrong grad dy")
    }
}

Primitive operator tests should be:

  • small
  • exact when possible
  • easy to inspect

These tests catch most implementation bugs early.
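
In Go, table-driven tests keep these checks small and uniform. The sketch below assumes the same Tape API used throughout and checks each binary operator against its analytic partials at fixed inputs.

func TestBinaryOpGrads(t *testing.T) {
    cases := []struct {
        name   string
        op     func(*Tape, Slot, Slot) Slot
        dx, dy float64 // analytic partials at x=3, y=5
    }{
        {"add", (*Tape).Add, 1, 1},
        {"mul", (*Tape).Mul, 5, 3},
    }

    for _, c := range cases {
        var tape Tape

        x := tape.Var(3)
        y := tape.Var(5)

        z := c.op(&tape, x, y)

        tape.Backward(z)

        if tape.Grads[x] != c.dx || tape.Grads[y] != c.dy {
            t.Fatalf(
                "%s: got (%g, %g) want (%g, %g)",
                c.name,
                tape.Grads[x], tape.Grads[y],
                c.dx, c.dy,
            )
        }
    }
}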

Shared Subexpression Testing

Gradient accumulation is a common failure point.

Example:

f(x) = x^2 + x^2

Derivative:

f'(x) = 4x

Test:

func TestSharedNodeAccumulation(t *testing.T) {
    var tape Tape

    x := tape.Var(3)

    a := tape.Mul(x, x)
    y := tape.Add(a, a)

    tape.Backward(y)

    want := 12.0

    if tape.Grads[x] != want {
        t.Fatalf(
            "got %g want %g",
            tape.Grads[x],
            want,
        )
    }
}

This catches incorrect overwrite behavior:

grad = ...

instead of:

grad += ...

Traversal Testing

Reverse traversal should visit nodes exactly once while still accumulating every edge contribution.

A graph with sharing is a good test case:

x -> a -> y
 \        /
  --------

Tests should verify:

  • no duplicate backward execution
  • no missing edge contributions
  • stable topological ordering

An incorrect traversal often produces gradients that are:

  • doubled
  • partially missing
  • dependent on node allocation order

These bugs are subtle without dedicated tests.
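
A concrete version, assuming the Tape API from earlier: f(x) = x^2 + x has exactly this diamond shape, since x feeds both the product and the final sum, and its gradient 2x + 1 breaks in characteristic ways under traversal bugs.

func TestDiamondTraversal(t *testing.T) {
    var tape Tape

    x := tape.Var(3)

    a := tape.Mul(x, x) // x feeds a ...
    y := tape.Add(a, x) // ... and feeds y directly

    tape.Backward(y)

    // f'(3) = 2*3 + 1 = 7. A dropped edge reports 6 or 1;
    // duplicate visits inflate the total.
    if tape.Grads[x] != 7 {
        t.Fatalf("got %g want 7", tape.Grads[x])
    }
}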

Tensor Shape Testing

Tensor AD systems need shape-aware tests.

Broadcasting example:

x shape: [2, 3]
b shape: [3]

y = x + b

Expected backward:

grad_b = reduce_sum(grad_y, axis=0)

A shape test should verify:

  • output shape
  • gradient shape
  • numerical values

Shape mismatches are among the most common tensor AD bugs.
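
A sketch of such a test, using a hypothetical BroadcastAddBackward helper that reduces the output gradient over the broadcast axis:

func TestBroadcastAddBackward(t *testing.T) {
    gradY := Tensor{
        Data:  []float64{1, 2, 3, 4, 5, 6},
        Shape: []int{2, 3},
    }

    // Hypothetical helper: sums grad_y over axis 0 so that
    // grad_b matches b's shape [3].
    gradB := BroadcastAddBackward(gradY, []int{3})

    if len(gradB.Shape) != 1 || gradB.Shape[0] != 3 {
        t.Fatalf("bad grad_b shape: %v", gradB.Shape)
    }

    want := []float64{5, 7, 9} // column sums of grad_y

    for i := range want {
        if gradB.Data[i] != want[i] {
            t.Fatalf(
                "grad_b[%d] = %g, want %g",
                i, gradB.Data[i], want[i],
            )
        }
    }
}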

Reduction Testing

Reduction operators require careful backward logic.

Example:

y = \sum_i x_i

Backward should broadcast the output gradient back to all inputs.

Test:

func TestSumBackward(t *testing.T) {
    x := Tensor{
        Data: []float64{1, 2, 3},
        Shape: []int{3},
    }

    gradOut := 2.0

    gradX := SumBackward(x, gradOut)

    want := []float64{2, 2, 2}

    for i := range want {
        if gradX.Data[i] != want[i] {
            t.Fatalf("bad sum grad")
        }
    }
}

Reduction tests should cover:

  • axis reduction (see the sketch after this list)
  • keepdims behavior
  • empty dimensions
  • singleton dimensions
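
Axis reduction can be sketched with a hypothetical SumAxisBackward helper; summing over axis 0 maps shape [2, 3] to [3], so the backward pass must broadcast the [3] output gradient back to [2, 3].

func TestSumAxisBackward(t *testing.T) {
    gradOut := Tensor{
        Data:  []float64{1, 2, 3},
        Shape: []int{3},
    }

    // Hypothetical helper for the backward pass of a
    // sum over axis 0 of a [2, 3] input.
    gradX := SumAxisBackward(gradOut, []int{2, 3}, 0)

    want := []float64{1, 2, 3, 1, 2, 3}

    for i := range want {
        if gradX.Data[i] != want[i] {
            t.Fatalf(
                "gradX[%d] = %g, want %g",
                i, gradX.Data[i], want[i],
            )
        }
    }
}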

Numerical Stability Testing

Some operators have mathematically correct but numerically unstable derivatives.

Example:

\log(1 + e^x)

For large positive x, a naive implementation overflows when computing e^x.
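
The standard fix is to rewrite both the forward and backward computations in overflow-safe form. The sketch below shows the usual reformulation, not necessarily how any particular engine implements Softplus.

// Stable softplus: log(1 + e^x) = max(x, 0) + log1p(e^{-|x|}),
// which never exponentiates a large positive argument.
func softplus(x float64) float64 {
    return math.Max(x, 0) + math.Log1p(math.Exp(-math.Abs(x)))
}

// Its derivative is the sigmoid, also computed without overflow.
func softplusGrad(x float64) float64 {
    if x >= 0 {
        return 1 / (1 + math.Exp(-x))
    }
    e := math.Exp(x)
    return e / (1 + e)
}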

Tests should include:

  • very large inputs
  • very small inputs
  • zero
  • infinities
  • NaNs

Example:

func TestSoftplusStability(t *testing.T) {
    xs := []float64{
        -100,
        -10,
        0,
        10,
        100,
    }

    for _, x := range xs {
        var tape Tape

        v := tape.Var(x)
        y := tape.Softplus(v)

        tape.Backward(y)

        if math.IsNaN(tape.Grads[v]) {
            t.Fatalf("nan gradient at %g", x)
        }
    }
}

Stability tests are often more important than ordinary correctness tests in real optimization systems.

Randomized Gradient Checks

A useful strategy:

  1. generate random inputs
  2. compare AD with finite differences
  3. repeat many times

Example:

func RandomGradCheck(
    f func(*Tape, Slot) Slot,
) {
    for i := 0; i < 1000; i++ {
        x := rand.Float64()*10 - 5

        fd := finiteDiffScalar(f, x)
        ad := adScalar(f, x)

        if RelativeError(fd, ad) > 1e-5 {
            panic("gradient mismatch")
        }
    }
}
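
The two helpers referenced above can be defined once against the Tape API. Minimal sketches, assuming the tape exposes the primal value of a slot (called tape.Value here, which may differ from your engine's accessor):

func finiteDiffScalar(
    f func(*Tape, Slot) Slot,
    x float64,
) float64 {
    eval := func(v float64) float64 {
        var tape Tape
        y := f(&tape, tape.Var(v))
        // Assumed accessor for the primal value of y.
        return tape.Value(y)
    }
    return FiniteDiff(eval, x)
}

func adScalar(
    f func(*Tape, Slot) Slot,
    x float64,
) float64 {
    var tape Tape
    xv := tape.Var(x)
    tape.Backward(f(&tape, xv))
    return tape.Grads[xv]
}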

Random testing explores edge cases humans often miss.

Useful distributions:

  • small values
  • large values
  • near-zero values
  • mixed signs
  • powers of two
  • denormalized ranges

Higher-Order Testing

Higher-order AD requires additional validation.

Example:

f(x) = x^3

Derivatives:

f'(x) = 3x^2

f''(x) = 6x

A second-order system should be checked against known analytic results.
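
Even before full higher-order support exists, the second derivative can be cross-checked by applying finite differences to the first-order AD gradient. A sketch for the cubic, reusing FiniteDiff and RelativeError from above:

func TestSecondDerivativeCube(t *testing.T) {
    // First-order AD gradient of f(x) = x^3.
    grad := func(x float64) float64 {
        var tape Tape
        v := tape.Var(x)
        y := tape.Mul(tape.Mul(v, v), v)
        tape.Backward(y)
        return tape.Grads[v]
    }

    x := 2.0

    // A finite difference of the gradient approximates f''(x) = 6x.
    fd := FiniteDiff(grad, x)

    if RelativeError(fd, 6*x) > 1e-4 {
        t.Fatalf("second derivative: got %g want %g", fd, 6*x)
    }
}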

Higher-order systems often fail because:

  • saved gradients are incorrect
  • perturbation confusion occurs
  • nested tapes interfere
  • backward rules are not themselves differentiable

These bugs usually appear only in nested differentiation tests.

Stateful Operator Testing

Stateful operators require consistency between forward and backward.

Dropout example:

forward uses random mask M
backward must reuse M

Test:

func TestDropoutConsistency(t *testing.T) {
    seed := int64(42)

    forwardMask := GenerateMask(seed)
    backwardMask := GenerateMask(seed)

    if !Equal(forwardMask, backwardMask) {
        t.Fatalf("mask mismatch")
    }
}

Without this consistency, gradients become random noise.

Property Testing

Some derivative properties hold independently of specific values.

Example:

  • differentiation is linear
  • the derivative of a constant is zero
  • the chain rule composes correctly

Property testing checks algebraic invariants rather than individual numeric examples.

Example:

grad(a*x + b*y) = a*grad(x) + b*grad(y)

Property tests are useful because they scale across many input combinations.
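
A sketch of the linearity property as a randomized test, assuming the same Tape API (Const also appears in the end-to-end example below):

func TestLinearityProperty(t *testing.T) {
    for i := 0; i < 100; i++ {
        a := rand.Float64()*4 - 2
        b := rand.Float64()*4 - 2

        var tape Tape

        x := tape.Var(rand.NormFloat64())
        y := tape.Var(rand.NormFloat64())

        z := tape.Add(
            tape.Mul(tape.Const(a), x),
            tape.Mul(tape.Const(b), y),
        )

        tape.Backward(z)

        // d/dx (a*x + b*y) = a and d/dy = b for every sample.
        if RelativeError(tape.Grads[x], a) > 1e-9 ||
            RelativeError(tape.Grads[y], b) > 1e-9 {
            t.Fatalf("linearity violated at a=%g b=%g", a, b)
        }
    }
}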

Graph Integrity Testing

The graph itself should be validated.

Checks:

  • no dangling references
  • no cycles in acyclic graphs
  • valid slot indexes
  • backward only visits reachable nodes
  • gradients reset correctly
  • tape reuse preserves correctness

Minimal invariant:

Every backward rule reads valid primal values and writes valid gradient slots.

Violating this usually produces silent corruption.
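
A structural checker is easy to sketch if the tape stores nodes in topological order, each recording the slots of its inputs; the field names below (Nodes, Inputs) are assumptions about the engine's internals.

// CheckTape validates that every input slot refers to an earlier
// node, which rules out dangling references and cycles at once.
func CheckTape(t *Tape) error {
    for i, n := range t.Nodes { // assumed internal layout
        for _, in := range n.Inputs {
            if int(in) < 0 || int(in) >= i {
                return fmt.Errorf(
                    "node %d has invalid input slot %d", i, in,
                )
            }
        }
    }
    return nil
}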

Test Organization

A practical AD engine should separate tests by level.

Test type                      Purpose
Operator unit tests            Validate local derivative rules
Graph tests                    Validate traversal and accumulation
Numerical tests                Validate stability
Randomized tests               Explore broad input space
Tensor shape tests             Validate broadcasting and reductions
End-to-end optimization tests  Validate realistic training behavior
Regression tests               Prevent performance and correctness regressions

Operator-level tests should dominate. Most AD failures originate there.

End-to-End Validation

Eventually the engine should optimize a real objective successfully.

Simple example:

f(x) = (x-3)^2

Gradient descent:

x := 0.0

for i := 0; i < 100; i++ {
    var tape Tape

    xv := tape.Var(x)

    diff := tape.Sub(xv, tape.Const(3))
    loss := tape.Mul(diff, diff)

    tape.Backward(loss)

    x -= 0.1 * tape.Grads[xv]
}

Expected:

x approaches 3

This does not prove the engine is correct, but it validates that gradients are directionally useful.
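
Wrapped in a test function, the loop only needs a final assertion. With a step size of 0.1, the error shrinks by a factor of 0.8 per iteration, so 100 iterations leave x well within any reasonable tolerance:

if math.Abs(x-3) > 1e-3 {
    t.Fatalf("did not converge: x=%g", x)
}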

Minimal Correctness Contract

A small reverse-mode engine should guarantee:

Backward computes gradients consistent with the chain rule and local operator derivatives, up to floating point arithmetic.

Testing exists to validate this contract operationally.

No finite collection of tests proves correctness completely. But a layered testing strategy:

  • operator tests
  • finite differences
  • randomized checks
  • graph invariants
  • end-to-end optimization

can make derivative failures rare, localizable, and reproducible.