Testing Derivatives

An automatic differentiation engine is only useful if its derivatives are correct. A small mistake in a backward rule can silently corrupt optimization, training, or scientific computation. Derivative testing therefore belongs at the center of AD system design.

A correct primal computation with incorrect gradients is usually worse than a runtime error. Numerical optimization may still appear to progress while converging to meaningless results.

Derivative testing should validate:

  • local operator rules
  • graph traversal logic
  • gradient accumulation
  • broadcasting semantics
  • tensor reductions
  • higher-order behavior
  • numerical stability

The most important principle:

Every primitive operator should be testable independently.

Classes of Errors

Derivative bugs usually fall into a few categories.

Error type                 Example
Wrong derivative formula   Missing factor or sign
Incorrect accumulation     Using = instead of +=
Shape mismatch             Incorrect reduction axes
Broadcasting bug           Missing sum during backward
Aliasing bug               Overwritten saved value
Numerical instability      Overflow in backward
Traversal bug              Visiting node multiple times
Missing gradient path      Detached edge
Stateful inconsistency     Different random mask

Many of these errors produce gradients with plausible values. That is why testing matters.

Finite Difference Approximation

The simplest validation method compares AD gradients against finite differences.

For a scalar function

f : \mathbb{R} \to \mathbb{R}

the central difference approximation is:

f'(x) \approx \frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}

A helper:

func FiniteDiff(
    f func(float64) float64,
    x float64,
) float64 {
    eps := 1e-6

    return (f(x+eps) - f(x-eps)) / (2 * eps)
}

Example function:

f(x) = x^2 + \sin(x)

func scalarFunction(x float64) float64 {
    return x*x + math.Sin(x)
}

AD version:

func scalarGrad(xval float64) float64 {
    var t Tape

    x := t.Var(xval)

    y := t.Add(
        t.Mul(x, x),
        t.Sin(x),
    )

    t.Backward(y)

    return t.Grads[x]
}

Test:

func TestScalarGrad(t *testing.T) {
    x := 2.0

    fd := FiniteDiff(scalarFunction, x)
    ad := scalarGrad(x)

    if math.Abs(fd-ad) > 1e-5 {
        t.Fatalf(
            "gradient mismatch: fd=%g ad=%g",
            fd,
            ad,
        )
    }
}

This is the basic derivative sanity check.

Choosing Epsilon

Finite differences are approximate, and the choice of the step size

\epsilon

matters.

Too small:

  • cancellation error dominates

Too large:

  • truncation error dominates

Typical values:

  • 1e-4
  • 1e-5
  • 1e-6

Central differences are preferred over forward differences because they have smaller truncation error.

Forward difference:

\frac{f(x+\epsilon)-f(x)}{\epsilon}

Central difference:

\frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}

The forward difference has truncation error of order \epsilon, while the central difference has error of order \epsilon^2, which makes it substantially more accurate at the same step size.
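
One common refinement is to scale the step to the magnitude of the input instead of using a fixed constant. The helper below is a minimal sketch of that idea.

func FiniteDiffScaled(
    f func(float64) float64,
    x float64,
) float64 {
    // Scale eps with |x| so the step stays meaningful
    // for both tiny and huge inputs.
    eps := 1e-6 * math.Max(1, math.Abs(x))

    return (f(x+eps) - f(x-eps)) / (2 * eps)
}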

Relative Error

Absolute error alone can be misleading.

Better comparison:

\text{relative error} = \frac{|a-b|}{\max(1, |a|, |b|)}

Helper:

func RelativeError(a, b float64) float64 {
    return math.Abs(a-b) /
        math.Max(
            1,
            math.Max(math.Abs(a), math.Abs(b)),
        )
}

Then:

if RelativeError(fd, ad) > 1e-5 {
    t.Fatalf("gradient mismatch")
}

Relative error scales better across small and large gradients.

Operator-Level Testing

Each primitive operator should have isolated tests.

Example for multiplication:

f(x, y) = xy

Expected derivatives:

\frac{\partial f}{\partial x} = y

\frac{\partial f}{\partial y} = x

Test:

func TestMulGrad(t *testing.T) {
    var tape Tape

    x := tape.Var(3)
    y := tape.Var(5)

    z := tape.Mul(x, y)

    tape.Backward(z)

    if tape.Grads[x] != 5 {
        t.Fatalf("wrong grad dx")
    }

    if tape.Grads[y] != 3 {
        t.Fatalf("wrong grad dy")
    }
}

Primitive operator tests should be:

  • small
  • exact when possible
  • easy to inspect

These tests catch most implementation bugs early.
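
In Go, table-driven tests keep these checks small and uniform. The sketch below assumes the same Tape API used throughout and checks each binary operator against its analytic partials at fixed inputs.

func TestBinaryOpGrads(t *testing.T) {
    cases := []struct {
        name   string
        op     func(*Tape, Slot, Slot) Slot
        dx, dy float64 // analytic partials at x=3, y=5
    }{
        {"add", (*Tape).Add, 1, 1},
        {"mul", (*Tape).Mul, 5, 3},
    }

    for _, c := range cases {
        var tape Tape

        x := tape.Var(3)
        y := tape.Var(5)

        z := c.op(&tape, x, y)

        tape.Backward(z)

        if tape.Grads[x] != c.dx || tape.Grads[y] != c.dy {
            t.Fatalf(
                "%s: got (%g, %g) want (%g, %g)",
                c.name,
                tape.Grads[x], tape.Grads[y],
                c.dx, c.dy,
            )
        }
    }
}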

Shared Subexpression Testing

Gradient accumulation is a common failure point.

Example:

f(x) = x^2 + x^2

Derivative:

f'(x) = 4x

Test:

func TestSharedNodeAccumulation(t *testing.T) {
    var tape Tape

    x := tape.Var(3)

    a := tape.Mul(x, x)
    y := tape.Add(a, a)

    tape.Backward(y)

    want := 12.0

    if tape.Grads[x] != want {
        t.Fatalf(
            "got %g want %g",
            tape.Grads[x],
            want,
        )
    }
}

This catches incorrect overwrite behavior:

grad = ...

instead of:

grad += ...

Traversal Testing

Reverse traversal should visit nodes exactly once while still accumulating every edge contribution.

A graph with sharing is a good test case:

x -> a -> y
 \        /
  --------

Tests should verify:

  • no duplicate backward execution
  • no missing edge contributions
  • stable topological ordering

An incorrect traversal often produces gradients that are:

  • doubled
  • partially missing
  • dependent on node allocation order

These bugs are subtle without dedicated tests.
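
A concrete version, assuming the Tape API from earlier: f(x) = x^2 + x has exactly this diamond shape, since x feeds both the product and the final sum, and its gradient 2x + 1 breaks in characteristic ways under traversal bugs.

func TestDiamondTraversal(t *testing.T) {
    var tape Tape

    x := tape.Var(3)

    a := tape.Mul(x, x) // x feeds a ...
    y := tape.Add(a, x) // ... and feeds y directly

    tape.Backward(y)

    // f'(3) = 2*3 + 1 = 7. A dropped edge reports 6 or 1;
    // duplicate visits inflate the total.
    if tape.Grads[x] != 7 {
        t.Fatalf("got %g want 7", tape.Grads[x])
    }
}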

Tensor Shape Testing

Tensor AD systems need shape-aware tests.

Broadcasting example:

x shape: [2, 3]
b shape: [3]

y = x + b

Expected backward:

grad_b = reduce_sum(grad_y, axis=0)

A shape test should verify:

  • output shape
  • gradient shape
  • numerical values

Shape mismatches are among the most common tensor AD bugs.
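
A sketch of such a test, using a hypothetical BroadcastAddBackward helper that reduces the output gradient over the broadcast axis:

func TestBroadcastAddBackward(t *testing.T) {
    gradY := Tensor{
        Data:  []float64{1, 2, 3, 4, 5, 6},
        Shape: []int{2, 3},
    }

    // Hypothetical helper: sums grad_y over axis 0 so that
    // grad_b matches b's shape [3].
    gradB := BroadcastAddBackward(gradY, []int{3})

    if len(gradB.Shape) != 1 || gradB.Shape[0] != 3 {
        t.Fatalf("bad grad_b shape: %v", gradB.Shape)
    }

    want := []float64{5, 7, 9} // column sums of grad_y

    for i := range want {
        if gradB.Data[i] != want[i] {
            t.Fatalf(
                "grad_b[%d] = %g, want %g",
                i, gradB.Data[i], want[i],
            )
        }
    }
}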

Reduction Testing

Reduction operators require careful backward logic.

Example:

y = \sum_i x_i

Backward should broadcast the output gradient back to all inputs.

Test:

func TestSumBackward(t *testing.T) {
    x := Tensor{
        Data: []float64{1, 2, 3},
        Shape: []int{3},
    }

    gradOut := 2.0

    gradX := SumBackward(x, gradOut)

    want := []float64{2, 2, 2}

    for i := range want {
        if gradX.Data[i] != want[i] {
            t.Fatalf("bad sum grad")
        }
    }
}

Reduction tests should cover:

  • axis reduction (see the sketch after this list)
  • keepdims behavior
  • empty dimensions
  • singleton dimensions
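
Axis reduction can be sketched with a hypothetical SumAxisBackward helper; summing over axis 0 maps shape [2, 3] to [3], so the backward pass must broadcast the [3] output gradient back to [2, 3].

func TestSumAxisBackward(t *testing.T) {
    gradOut := Tensor{
        Data:  []float64{1, 2, 3},
        Shape: []int{3},
    }

    // Hypothetical helper for the backward pass of a
    // sum over axis 0 of a [2, 3] input.
    gradX := SumAxisBackward(gradOut, []int{2, 3}, 0)

    want := []float64{1, 2, 3, 1, 2, 3}

    for i := range want {
        if gradX.Data[i] != want[i] {
            t.Fatalf(
                "gradX[%d] = %g, want %g",
                i, gradX.Data[i], want[i],
            )
        }
    }
}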

Numerical Stability Testing

Some operators have mathematically correct but numerically unstable derivatives.

Example:

\log(1 + e^x)

For large positive x, a naive implementation overflows when computing e^x.
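
The standard fix is to rewrite both the forward and backward computations in overflow-safe form. The sketch below shows the usual reformulation, not necessarily how any particular engine implements Softplus.

// Stable softplus: log(1 + e^x) = max(x, 0) + log1p(e^{-|x|}),
// which never exponentiates a large positive argument.
func softplus(x float64) float64 {
    return math.Max(x, 0) + math.Log1p(math.Exp(-math.Abs(x)))
}

// Its derivative is the sigmoid, also computed without overflow.
func softplusGrad(x float64) float64 {
    if x >= 0 {
        return 1 / (1 + math.Exp(-x))
    }
    e := math.Exp(x)
    return e / (1 + e)
}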

Tests should include:

  • very large inputs
  • very small inputs
  • zero
  • infinities
  • NaNs

Example:

func TestSoftplusStability(t *testing.T) {
    xs := []float64{
        -100,
        -10,
        0,
        10,
        100,
    }

    for _, x := range xs {
        var tape Tape

        v := tape.Var(x)
        y := tape.Softplus(v)

        tape.Backward(y)

        if math.IsNaN(tape.Grads[v]) {
            t.Fatalf("nan gradient at %g", x)
        }
    }
}

Stability tests are often more important than ordinary correctness tests in real optimization systems.

Randomized Gradient Checks

A useful strategy:

  1. generate random inputs
  2. compare AD with finite differences
  3. repeat many times

Example:

func RandomGradCheck(
    f func(*Tape, Slot) Slot,
) {
    for i := 0; i < 1000; i++ {
        x := rand.Float64()*10 - 5

        fd := finiteDiffScalar(f, x)
        ad := adScalar(f, x)

        if RelativeError(fd, ad) > 1e-5 {
            panic("gradient mismatch")
        }
    }
}
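
The two helpers referenced above can be defined once against the Tape API. Minimal sketches, assuming the tape exposes the primal value of a slot (called tape.Value here, which may differ from your engine's accessor):

func finiteDiffScalar(
    f func(*Tape, Slot) Slot,
    x float64,
) float64 {
    eval := func(v float64) float64 {
        var tape Tape
        y := f(&tape, tape.Var(v))
        // Assumed accessor for the primal value of y.
        return tape.Value(y)
    }
    return FiniteDiff(eval, x)
}

func adScalar(
    f func(*Tape, Slot) Slot,
    x float64,
) float64 {
    var tape Tape
    xv := tape.Var(x)
    tape.Backward(f(&tape, xv))
    return tape.Grads[xv]
}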

Random testing explores edge cases humans often miss.

Useful distributions:

  • small values
  • large values
  • near-zero values
  • mixed signs
  • powers of two
  • denormalized ranges

Higher-Order Testing

Higher-order AD requires additional validation.

Example:

f(x) = x^3

Derivatives:

f'(x) = 3x^2

f''(x) = 6x

A second-order system should be checked against known analytic results.
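
Even before full higher-order support exists, the second derivative can be cross-checked by applying finite differences to the first-order AD gradient. A sketch for the cubic, reusing FiniteDiff and RelativeError from above:

func TestSecondDerivativeCube(t *testing.T) {
    // First-order AD gradient of f(x) = x^3.
    grad := func(x float64) float64 {
        var tape Tape
        v := tape.Var(x)
        y := tape.Mul(tape.Mul(v, v), v)
        tape.Backward(y)
        return tape.Grads[v]
    }

    x := 2.0

    // A finite difference of the gradient approximates f''(x) = 6x.
    fd := FiniteDiff(grad, x)

    if RelativeError(fd, 6*x) > 1e-4 {
        t.Fatalf("second derivative: got %g want %g", fd, 6*x)
    }
}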

Higher-order systems often fail because:

  • saved gradients are incorrect
  • perturbation confusion occurs
  • nested tapes interfere
  • backward rules are not themselves differentiable

These bugs usually appear only in nested differentiation tests.

Stateful Operator Testing

Stateful operators require consistency between forward and backward.

Dropout example:

forward uses random mask M
backward must reuse M

Test:

func TestDropoutConsistency(t *testing.T) {
    seed := int64(42)

    forwardMask := GenerateMask(seed)
    backwardMask := GenerateMask(seed)

    if !Equal(forwardMask, backwardMask) {
        t.Fatalf("mask mismatch")
    }
}

Without this consistency, gradients become random noise.

Property Testing

Some derivative properties hold independently of specific values.

Example:

  • differentiation is linear
  • the derivative of a constant is zero
  • the chain rule composes correctly

Property testing checks algebraic invariants rather than individual numeric examples.

Example:

grad(a*x + b*y) = a*grad(x) + b*grad(y)

Property tests are useful because they scale across many input combinations.
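
A sketch of the linearity property as a randomized test, assuming the same Tape API (Const also appears in the end-to-end example below):

func TestLinearityProperty(t *testing.T) {
    for i := 0; i < 100; i++ {
        a := rand.Float64()*4 - 2
        b := rand.Float64()*4 - 2

        var tape Tape

        x := tape.Var(rand.NormFloat64())
        y := tape.Var(rand.NormFloat64())

        z := tape.Add(
            tape.Mul(tape.Const(a), x),
            tape.Mul(tape.Const(b), y),
        )

        tape.Backward(z)

        // d/dx (a*x + b*y) = a and d/dy = b for every sample.
        if RelativeError(tape.Grads[x], a) > 1e-9 ||
            RelativeError(tape.Grads[y], b) > 1e-9 {
            t.Fatalf("linearity violated at a=%g b=%g", a, b)
        }
    }
}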

Graph Integrity Testing

The graph itself should be validated.

Checks:

  • no dangling references
  • no cycles in acyclic graphs
  • valid slot indexes
  • backward only visits reachable nodes
  • gradients reset correctly
  • tape reuse preserves correctness

Minimal invariant:

Every backward rule reads valid primal values and writes valid gradient slots.

Violating this usually produces silent corruption.
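
A structural checker is easy to sketch if the tape stores nodes in topological order, each recording the slots of its inputs; the field names below (Nodes, Inputs) are assumptions about the engine's internals.

// CheckTape validates that every input slot refers to an earlier
// node, which rules out dangling references and cycles at once.
func CheckTape(t *Tape) error {
    for i, n := range t.Nodes { // assumed internal layout
        for _, in := range n.Inputs {
            if int(in) < 0 || int(in) >= i {
                return fmt.Errorf(
                    "node %d has invalid input slot %d", i, in,
                )
            }
        }
    }
    return nil
}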

Test Organization

A practical AD engine should separate tests by level.

Test type                      Purpose
Operator unit tests            Validate local derivative rules
Graph tests                    Validate traversal and accumulation
Numerical tests                Validate stability
Randomized tests               Explore broad input space
Tensor shape tests             Validate broadcasting and reductions
End-to-end optimization tests  Validate realistic training behavior
Regression tests               Prevent performance and correctness regressions

Operator-level tests should dominate. Most AD failures originate there.

End-to-End Validation

Eventually the engine should optimize a real objective successfully.

Simple example:

f(x) = (x-3)^2

Gradient descent:

x := 0.0

for i := 0; i < 100; i++ {
    var tape Tape

    xv := tape.Var(x)

    diff := tape.Sub(xv, tape.Const(3))
    loss := tape.Mul(diff, diff)

    tape.Backward(loss)

    x -= 0.1 * tape.Grads[xv]
}

Expected:

x approaches 3

This does not prove the engine is correct, but it validates that gradients are directionally useful.
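
Wrapped in a test function, the loop only needs a final assertion. With a step size of 0.1, the error shrinks by a factor of 0.8 per iteration, so 100 iterations leave x well within any reasonable tolerance:

if math.Abs(x-3) > 1e-3 {
    t.Fatalf("did not converge: x=%g", x)
}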

Minimal Correctness Contract

A small reverse-mode engine should guarantee:

Backward computes gradients consistent with the chain rule and local operator derivatives, up to floating point arithmetic.

Testing exists to validate this contract operationally.

No finite collection of tests proves correctness completely. But a layered testing strategy:

  • operator tests
  • finite differences
  • randomized checks
  • graph invariants
  • end-to-end optimization

can make derivative failures rare, localizable, and reproducible.