An automatic differentiation engine is only useful if its derivatives are correct. A small mistake in a backward rule can silently corrupt optimization, training, or scientific computation. Derivative testing therefore belongs at the center of AD system design.
A correct primal computation with incorrect gradients is usually worse than a runtime error. Numerical optimization may still appear to progress while converging to meaningless results.
Derivative testing should validate:
- local operator rules
- graph traversal logic
- gradient accumulation
- broadcasting semantics
- tensor reductions
- higher-order behavior
- numerical stability
The most important principle:
Every primitive operator should be testable independently.
Classes of Errors
Derivative bugs usually fall into a few categories.
| Error type | Example |
|---|---|
| Wrong derivative formula | Missing factor or sign |
| Incorrect accumulation | Using = instead of += |
| Shape mismatch | Incorrect reduction axes |
| Broadcasting bug | Missing sum during backward |
| Aliasing bug | Overwritten saved value |
| Numerical instability | Overflow in backward |
| Traversal bug | Visiting node multiple times |
| Missing gradient path | Detached edge |
| Stateful inconsistency | Different random mask |
Many of these errors produce gradients with plausible values. That is why testing matters.
Finite Difference Approximation
The simplest validation method compares AD gradients against finite differences.
For a scalar function f(x), the derivative can be checked against the central difference approximation:
f'(x) ≈ (f(x + eps) - f(x - eps)) / (2 * eps)
A helper:
func FiniteDiff(
f func(float64) float64,
x float64,
) float64 {
eps := 1e-6
return (f(x+eps) - f(x-eps)) / (2 * eps)
}
Example function:
func scalarFunction(x float64) float64 {
return x*x + math.Sin(x)
}
AD version:
func scalarGrad(xval float64) float64 {
var t Tape
x := t.Var(xval)
y := t.Add(
t.Mul(x, x),
t.Sin(x),
)
t.Backward(y)
return t.Grads[x]
}
Test:
func TestScalarGrad(t *testing.T) {
x := 2.0
fd := FiniteDiff(scalarFunction, x)
ad := scalarGrad(x)
if math.Abs(fd-ad) > 1e-5 {
t.Fatalf(
"gradient mismatch: fd=%g ad=%g",
fd,
ad,
)
}
}
This is the basic derivative sanity check.
Choosing Epsilon
Finite differences are approximate. The choice of the step size eps matters.
Too small:
- cancellation error dominates
Too large:
- truncation error dominates
Typical values: 1e-4, 1e-5, 1e-6.
Central differences are preferred over forward differences because they have smaller truncation error.
Forward difference:
(f(x + eps) - f(x)) / eps, with truncation error O(eps)
Central difference:
(f(x + eps) - f(x - eps)) / (2 * eps), with truncation error O(eps^2)
Central difference is usually much more accurate for testing.
Relative Error
Absolute error alone can be misleading.
Better comparison:
|a - b| / max(1, max(|a|, |b|))
Helper:
func RelativeError(a, b float64) float64 {
return math.Abs(a-b) /
math.Max(
1,
math.Max(math.Abs(a), math.Abs(b)),
)
}
Then:
if RelativeError(fd, ad) > 1e-5 {
t.Fatalf("gradient mismatch")
}
Relative error scales better across small and large gradients.
Operator-Level Testing
Each primitive operator should have isolated tests.
Example for multiplication:
z = x * y
Expected derivatives:
dz/dx = y
dz/dy = x
Test:
func TestMulGrad(t *testing.T) {
var tape Tape
x := tape.Var(3)
y := tape.Var(5)
z := tape.Mul(x, y)
tape.Backward(z)
if tape.Grads[x] != 5 {
t.Fatalf("wrong grad dx")
}
if tape.Grads[y] != 3 {
t.Fatalf("wrong grad dy")
}
}
Primitive operator tests should be:
- small
- exact when possible
- easy to inspect
These tests catch most implementation bugs early.
Shared Subexpression Testing
Gradient accumulation is a common failure point.
Example:
a = x * x
y = a + a
Derivative:
dy/dx = 4x, which equals 12 at x = 3.
Test:
func TestSharedNodeAccumulation(t *testing.T) {
var tape Tape
x := tape.Var(3)
a := tape.Mul(x, x)
y := tape.Add(a, a)
tape.Backward(y)
want := 12.0
if tape.Grads[x] != want {
t.Fatalf(
"got %g want %g",
tape.Grads[x],
want,
)
}
}
This catches incorrect overwrite behavior:
grad = ...
instead of:
grad += ...
Traversal Testing
Reverse traversal should visit nodes exactly once while still accumulating every edge contribution.
A graph with sharing is a good test case:
x -> a -> y
 \        /
  --------
Tests should verify:
- no duplicate backward execution
- no missing edge contributions
- stable topological ordering
An incorrect traversal often produces gradients that are:
- doubled
- partially missing
- dependent on node allocation order
These bugs are subtle without dedicated tests.
Tensor Shape Testing
Tensor AD systems need shape-aware tests.
Broadcasting example:
x shape: [2, 3]
b shape: [3]
y = x + b
Expected backward:
grad_b = reduce_sum(grad_y, axis=0)
A shape test should verify:
- output shape
- gradient shape
- numerical values
Shape mismatches are among the most common tensor AD bugs.
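The broadcast-add backward above can be sketched concretely. This is an illustrative standalone function (hypothetical name `addBroadcastBackward`, assuming row-major storage), not the engine's tensor API: it reduces grad_y of shape [2, 3] over axis 0 to recover grad_b of shape [3]:

```go
package main

import "fmt"

// addBroadcastBackward computes grad_b for y = x + b, where x has shape
// [rows, cols] and b has shape [cols]. Broadcasting replicated b across
// rows in the forward pass, so backward must sum grad_y over axis 0.
func addBroadcastBackward(gradY []float64, rows, cols int) []float64 {
	gradB := make([]float64, cols)
	for r := 0; r < rows; r++ {
		for c := 0; c < cols; c++ {
			gradB[c] += gradY[r*cols+c] // reduce_sum over axis 0
		}
	}
	return gradB
}

func main() {
	gradY := []float64{1, 2, 3, 4, 5, 6}      // shape [2, 3], row-major
	fmt.Println(addBroadcastBackward(gradY, 2, 3)) // [5 7 9]
}
```

A shape test built on this checks all three properties at once: the output length equals cols, the gradient shape matches b, and the numerical values equal the column sums. Forgetting the reduction is exactly the "missing sum during backward" bug from the error table.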
Reduction Testing
Reduction operators require careful backward logic.
Example:
y = sum(x)
Backward should broadcast the output gradient back to all inputs.
Test:
func TestSumBackward(t *testing.T) {
x := Tensor{
Data: []float64{1, 2, 3},
Shape: []int{3},
}
gradOut := 2.0
gradX := SumBackward(x, gradOut)
want := []float64{2, 2, 2}
for i := range want {
if gradX.Data[i] != want[i] {
t.Fatalf("bad sum grad")
}
}
}
Reduction tests should cover:
- axis reduction
- keepdims behavior
- empty dimensions
- singleton dimensions
Numerical Stability Testing
Some operators have mathematically correct but numerically unstable derivatives.
Example: softplus(x) = log(1 + exp(x)), whose derivative is the sigmoid 1 / (1 + exp(-x)).
A naive implementation computes exp(x) directly, so large positive x may overflow.
Tests should include:
- very large inputs
- very small inputs
- zero
- infinities
- NaNs
Example:
func TestSoftplusStability(t *testing.T) {
xs := []float64{
-100,
-10,
0,
10,
100,
}
for _, x := range xs {
var tape Tape
v := tape.Var(x)
y := tape.Softplus(v)
tape.Backward(y)
if math.IsNaN(tape.Grads[v]) {
t.Fatalf("nan gradient at %g", x)
}
}
}
Stability tests are often more important than ordinary correctness tests in real optimization systems.
Randomized Gradient Checks
A useful strategy:
- generate random inputs
- compare AD with finite differences
- repeat many times
Example:
func RandomGradCheck(
f func(*Tape, Slot) Slot,
) {
for i := 0; i < 1000; i++ {
x := rand.Float64()*10 - 5
fd := finiteDiffScalar(f, x)
ad := adScalar(f, x)
if RelativeError(fd, ad) > 1e-5 {
panic("gradient mismatch")
}
}
}
Random testing explores edge cases humans often miss.
Useful distributions:
- small values
- large values
- near-zero values
- mixed signs
- powers of two
- denormalized ranges
Higher-Order Testing
Higher-order AD requires additional validation.
Example: f(x) = sin(x)
Derivatives: f'(x) = cos(x), f''(x) = -sin(x)
A second-order system should be checked against known analytic results.
Higher-order systems often fail because:
- saved gradients are incorrect
- perturbation confusion occurs
- nested tapes interfere
- backward rules are not themselves differentiable
These bugs usually appear only in nested differentiation tests.
Stateful Operator Testing
Stateful operators require consistency between forward and backward.
Dropout example:
forward uses random mask M
backward must reuse M
Test:
func TestDropoutConsistency(t *testing.T) {
seed := int64(42)
forwardMask := GenerateMask(seed)
backwardMask := GenerateMask(seed)
if !Equal(forwardMask, backwardMask) {
t.Fatalf("mask mismatch")
}
}
Without this consistency, gradients become random noise.
Property Testing
Some derivative properties hold independently of specific values.
Example:
- derivative of addition is linear
- derivative of constant is zero
- chain rule should compose correctly
Property testing checks algebraic invariants rather than individual numeric examples.
Example:
grad(a*x + b*y) = a*grad(x) + b*grad(y)
Property tests are useful because they scale across many input combinations.
Graph Integrity Testing
The graph itself should be validated.
Checks:
- no dangling references
- no cycles in acyclic graphs
- valid slot indexes
- backward only visits reachable nodes
- gradients reset correctly
- tape reuse preserves correctness
Minimal invariant:
Every backward rule reads valid primal values and writes valid gradient slots.
Violating this usually produces silent corruption.
Test Organization
A practical AD engine should separate tests by level.
| Test type | Purpose |
|---|---|
| Operator unit tests | Validate local derivative rules |
| Graph tests | Validate traversal and accumulation |
| Numerical tests | Validate stability |
| Randomized tests | Explore broad input space |
| Tensor shape tests | Validate broadcasting and reductions |
| End-to-end optimization tests | Validate realistic training behavior |
| Regression tests | Prevent performance and correctness regressions |
Operator-level tests should dominate. Most AD failures originate there.
End-to-End Validation
Eventually the engine should optimize a real objective successfully.
Simple example:
Gradient descent:
x := 0.0
for i := 0; i < 100; i++ {
var tape Tape
xv := tape.Var(x)
diff := tape.Sub(xv, tape.Const(3))
loss := tape.Mul(diff, diff)
tape.Backward(loss)
x -= 0.1 * tape.Grads[xv]
}
Expected:
x approaches 3
This does not prove the engine is correct, but it validates that gradients are directionally useful.
Minimal Correctness Contract
A small reverse-mode engine should guarantee:
Backward computes gradients consistent with the chain rule and local operator derivatives, up to floating point arithmetic.
Testing exists to validate this contract operationally.
No finite collection of tests proves correctness completely. But a layered testing strategy:
- operator tests
- finite differences
- randomized checks
- graph invariants
- end-to-end optimization
can make derivative failures rare, localizable, and reproducible.