# Testing Derivatives

An automatic differentiation engine is only useful if its derivatives are correct. A small mistake in a backward rule can silently corrupt optimization, training, or scientific computation. Derivative testing therefore belongs at the center of AD system design.

A correct primal computation with incorrect gradients is usually worse than a runtime error. Numerical optimization may still appear to progress while converging to meaningless results.

Derivative testing should validate:
- local operator rules
- graph traversal logic
- gradient accumulation
- broadcasting semantics
- tensor reductions
- higher-order behavior
- numerical stability

The most important principle:

```text
Every primitive operator should be testable independently.
```

## Classes of Errors

Derivative bugs usually fall into a few categories.

| Error type | Example |
|---|---|
| Wrong derivative formula | Missing factor or sign |
| Incorrect accumulation | Using `=` instead of `+=` |
| Shape mismatch | Incorrect reduction axes |
| Broadcasting bug | Missing sum during backward |
| Aliasing bug | Overwritten saved value |
| Numerical instability | Overflow in backward |
| Traversal bug | Visiting a node multiple times |
| Missing gradient path | Detached edge |
| Stateful inconsistency | Different random mask |

Many of these errors produce gradients with plausible values. That is why testing matters.

## Finite Difference Approximation

The simplest validation method compares AD gradients against finite differences.

For a scalar function

$$
f : \mathbb{R} \to \mathbb{R},
$$

the central difference approximation is:

$$
f'(x)
\approx
\frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}
$$

A helper:

```go
// FiniteDiff estimates f'(x) with a central difference.
func FiniteDiff(f func(float64) float64, x float64) float64 {
    const eps = 1e-6

    return (f(x+eps) - f(x-eps)) / (2 * eps)
}
```

Example function:

$$
f(x) = x^2 + \sin(x)
$$

```go
// f(x) = x^2 + sin(x), with analytic derivative 2x + cos(x).
func scalarFunction(x float64) float64 {
    return x*x + math.Sin(x)
}
```

AD version:

```go
// scalarGrad computes f'(x) for the same function via the tape.
func scalarGrad(xval float64) float64 {
    var t Tape

    x := t.Var(xval)

    // y = x*x + sin(x)
    y := t.Add(
        t.Mul(x, x),
        t.Sin(x),
    )

    t.Backward(y)

    return t.Grads[x]
}
```

Test:

```go
func TestScalarGrad(t *testing.T) {
    x := 2.0

    fd := FiniteDiff(scalarFunction, x)
    ad := scalarGrad(x)

    if math.Abs(fd-ad) > 1e-5 {
        t.Fatalf(
            "gradient mismatch: fd=%g ad=%g",
            fd,
            ad,
        )
    }
}
```

This is the basic derivative sanity check.

## Choosing Epsilon

Finite differences are approximate, and the choice of the step size $\epsilon$ matters.

Too small:
- cancellation error dominates

Too large:
- truncation error dominates

Typical values:
- `1e-4`
- `1e-5`
- `1e-6`

Central differences are preferred over forward differences because they have smaller truncation error.

Forward difference:

$$
\frac{f(x+\epsilon)-f(x)}{\epsilon}
$$

Central difference:

$$
\frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}
$$

The forward difference has truncation error $O(\epsilon)$, while the central difference has $O(\epsilon^2)$, which makes it much more accurate for testing.
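
The trade-off is easy to see empirically. The sketch below (plain Go, independent of the tape) sweeps step sizes for $f(x) = e^x$ at $x = 1$, whose exact derivative is $e$. The error shrinks with $\epsilon$ until cancellation takes over; in double precision the minimum sits near $\epsilon \approx 10^{-5}$.

```go
func main() {
    exact := math.E

    for _, eps := range []float64{1e-2, 1e-4, 1e-6, 1e-8, 1e-10} {
        // Central difference of exp at x = 1.
        fd := (math.Exp(1+eps) - math.Exp(1-eps)) / (2 * eps)

        fmt.Printf("eps=%g error=%g\n", eps, math.Abs(fd-exact))
    }
}
```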

## Relative Error

Absolute error alone can be misleading.

Better comparison:

$$
\text{relative error} =
\frac{|a-b|}
{\max(1, |a|, |b|)}
$$

Helper:

```go
func RelativeError(a, b float64) float64 {
    return math.Abs(a-b) /
        math.Max(
            1,
            math.Max(math.Abs(a), math.Abs(b)),
        )
}
```

Then:

```go
if RelativeError(fd, ad) > 1e-5 {
    t.Fatalf("gradient mismatch")
}
```

Relative error scales better across small and large gradients.
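
The two helpers combine naturally. The wrapper below is hypothetical but mirrors how the remaining tests are structured: compute both estimates, compare with relative error, fail loudly.

```go
// CheckGrad fails the test when the AD gradient and the finite
// difference estimate of f'(x) disagree.
func CheckGrad(
    t *testing.T,
    f func(float64) float64,
    grad func(float64) float64,
    x float64,
) {
    t.Helper()

    if RelativeError(FiniteDiff(f, x), grad(x)) > 1e-5 {
        t.Fatalf("gradient mismatch at x=%g", x)
    }
}
```

With this, the earlier scalar test body reduces to `CheckGrad(t, scalarFunction, scalarGrad, 2.0)`.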

## Operator-Level Testing

Each primitive operator should have isolated tests.

Example for multiplication:

$$
f(x, y) = xy
$$

Expected derivatives:

$$
\frac{\partial f}{\partial x} = y
$$

$$
\frac{\partial f}{\partial y} = x
$$

Test:

```go
func TestMulGrad(t *testing.T) {
    var tape Tape

    x := tape.Var(3)
    y := tape.Var(5)

    z := tape.Mul(x, y)

    tape.Backward(z)

    if tape.Grads[x] != 5 {
        t.Fatalf("wrong grad dx")
    }

    if tape.Grads[y] != 3 {
        t.Fatalf("wrong grad dy")
    }
}
```

Primitive operator tests should be:
- small
- exact when possible
- easy to inspect

These tests catch most implementation bugs early.
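
Table-driven tests keep this scalable as the operator set grows. The sketch below assumes the Tape methods shown earlier use pointer receivers, and compares each rule against its analytic derivative at several points.

```go
func TestUnaryOps(t *testing.T) {
    cases := []struct {
        name string
        op   func(*Tape, Slot) Slot
        want func(float64) float64 // analytic derivative
    }{
        {"sin", (*Tape).Sin, math.Cos},
        {
            "square",
            func(tp *Tape, x Slot) Slot { return tp.Mul(x, x) },
            func(x float64) float64 { return 2 * x },
        },
    }

    for _, c := range cases {
        for _, x := range []float64{-2, -0.5, 0, 0.5, 2} {
            var tape Tape

            v := tape.Var(x)
            tape.Backward(c.op(&tape, v))

            if RelativeError(tape.Grads[v], c.want(x)) > 1e-12 {
                t.Fatalf("%s: bad grad at x=%g", c.name, x)
            }
        }
    }
}
```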

## Shared Subexpression Testing

Gradient accumulation is a common failure point.

Example:

$$
f(x) = x^2 + x^2
$$

Derivative:

$$
f'(x) = 4x
$$

Test:

```go
func TestSharedNodeAccumulation(t *testing.T) {
    var tape Tape

    x := tape.Var(3)

    a := tape.Mul(x, x)
    y := tape.Add(a, a)

    tape.Backward(y)

    want := 12.0

    if tape.Grads[x] != want {
        t.Fatalf(
            "got %g want %g",
            tape.Grads[x],
            want,
        )
    }
}
```

This catches incorrect overwrite behavior:

```go
grad = ...
```

instead of:

```go
grad += ...
```

## Traversal Testing

Reverse traversal should visit nodes exactly once while still accumulating every edge contribution.

A graph with sharing is a good test case:

```text
x -> a -> y
 \       /
  ------- 
```

Tests should verify:
- no duplicate backward execution
- no missing edge contributions
- stable topological ordering

An incorrect traversal often produces gradients that are:
- doubled
- partially missing
- dependent on node allocation order

These bugs are subtle without dedicated tests.
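
A concrete version of the diamond case, using the same hypothetical Tape API as before: $f(x) = x \cdot (x \cdot x) = x^3$, so $f'(x) = 3x^2$. A traversal that double-visits a node or drops an edge produces a different number.

```go
func TestDiamondTraversal(t *testing.T) {
    var tape Tape

    x := tape.Var(2)
    a := tape.Mul(x, x) // x feeds two edges here
    y := tape.Mul(a, x) // and a third edge here

    tape.Backward(y)

    // f'(2) = 3 * 2^2 = 12, exact in floating point.
    if tape.Grads[x] != 12 {
        t.Fatalf("got %g want 12", tape.Grads[x])
    }
}
```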

## Tensor Shape Testing

Tensor AD systems need shape-aware tests.

Broadcasting example:

```text
x shape: [2, 3]
b shape: [3]

y = x + b
```

Expected backward:

```text
grad_b = reduce_sum(grad_y, axis=0)
```

A shape test should verify:
- output shape
- gradient shape
- numerical values

Shape mismatches are among the most common tensor AD bugs.
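
A sketch of such a test, using the `Tensor` type from the next section and a hypothetical `AddBroadcastBackward` helper that returns the input gradients given the output gradient and both input shapes:

```go
func TestBroadcastAddBackward(t *testing.T) {
    // grad_y is all ones with shape [2, 3].
    gradY := Tensor{
        Data:  []float64{1, 1, 1, 1, 1, 1},
        Shape: []int{2, 3},
    }

    _, gradB := AddBroadcastBackward(gradY, []int{2, 3}, []int{3})

    if len(gradB.Shape) != 1 || gradB.Shape[0] != 3 {
        t.Fatalf("bad grad_b shape %v", gradB.Shape)
    }

    for i, v := range gradB.Data {
        if v != 2 { // each column sums two ones
            t.Fatalf("bad grad_b value at %d: %g", i, v)
        }
    }
}
```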

## Reduction Testing

Reduction operators require careful backward logic.

Example:

$$
y = \sum_i x_i
$$

Backward should broadcast the output gradient back to all inputs.

Test:

```go
func TestSumBackward(t *testing.T) {
    x := Tensor{
        Data: []float64{1, 2, 3},
        Shape: []int{3},
    }

    gradOut := 2.0

    gradX := SumBackward(x, gradOut)

    want := []float64{2, 2, 2}

    for i := range want {
        if gradX.Data[i] != want[i] {
            t.Fatalf("bad sum grad")
        }
    }
}
```

Reduction tests should cover:
- axis reduction
- keepdims behavior
- empty dimensions
- singleton dimensions
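
Mean reductions deserve the same treatment: each input receives `gradOut / n`, and the `1/n` factor is easy to forget when adapting the sum rule. `MeanBackward` here is a hypothetical helper mirroring `SumBackward` above.

```go
func TestMeanBackward(t *testing.T) {
    x := Tensor{
        Data:  []float64{1, 2, 3, 4},
        Shape: []int{4},
    }

    gradX := MeanBackward(x, 1.0)

    for i, v := range gradX.Data {
        if v != 0.25 { // 1.0 / 4 elements
            t.Fatalf("bad mean grad at %d: %g", i, v)
        }
    }
}
```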

## Numerical Stability Testing

Some operators have mathematically correct but numerically unstable derivatives.

Example:

$$
\log(1 + e^x)
$$

Large positive `x` may overflow.
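
The usual remedy is to compute both the value and the derivative in overflow-safe form. The sketch below (plain helpers, not tied to the tape) uses the identities $\log(1+e^x) = x + \log(1+e^{-x})$ for $x > 0$ and $\frac{d}{dx}\log(1+e^x) = \sigma(x)$.

```go
// Overflow-safe softplus: exp is only applied to non-positive inputs.
func softplus(x float64) float64 {
    if x > 0 {
        return x + math.Log1p(math.Exp(-x))
    }
    return math.Log1p(math.Exp(x))
}

// Its derivative is the sigmoid, again with exp kept bounded.
func softplusGrad(x float64) float64 {
    if x > 0 {
        return 1 / (1 + math.Exp(-x))
    }
    e := math.Exp(x)
    return e / (1 + e)
}
```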

Tests should include:
- very large inputs
- very small inputs
- zero
- infinities
- NaNs

Example:

```go
func TestSoftplusStability(t *testing.T) {
    xs := []float64{
        -100,
        -10,
        0,
        10,
        100,
    }

    for _, x := range xs {
        var tape Tape

        v := tape.Var(x)
        y := tape.Softplus(v)

        tape.Backward(y)

        if math.IsNaN(tape.Grads[v]) {
            t.Fatalf("nan gradient at %g", x)
        }
    }
}
```

Stability tests are often more important than ordinary correctness tests in real optimization systems.

## Randomized Gradient Checks

A useful strategy:
1. generate random inputs
2. compare AD with finite differences
3. repeat many times

Example:

```go
func RandomGradCheck(
    t *testing.T,
    buildF func(*Tape, Slot) Slot,
    scalarF func(float64) float64, // plain version of the same function
) {
    for i := 0; i < 1000; i++ {
        x := rand.Float64()*10 - 5

        fd := FiniteDiff(scalarF, x)

        var tape Tape
        v := tape.Var(x)
        tape.Backward(buildF(&tape, v))

        if RelativeError(fd, tape.Grads[v]) > 1e-5 {
            t.Fatalf("gradient mismatch at x=%g", x)
        }
    }
}
```

Random testing explores edge cases humans often miss.

Useful distributions:
- small values
- large values
- near-zero values
- mixed signs
- powers of two
- denormalized ranges

## Higher-Order Testing

Higher-order AD requires additional validation.

Example:

$$
f(x) = x^3
$$

Derivatives:

$$
f'(x) = 3x^2
$$

$$
f''(x) = 6x
$$

A second-order system should be checked against known analytic results.
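
Even without nested tapes, a second derivative can be validated by differencing the first-order AD gradient. The sketch below reuses the Tape API for $f(x) = x^3$; `cubeGrad` is a hypothetical helper returning $f'(x)$.

```go
func TestSecondOrderCube(t *testing.T) {
    x, eps := 1.5, 1e-5

    // f''(x) ~ (f'(x+eps) - f'(x-eps)) / (2*eps), exact value 6x.
    fdd := (cubeGrad(x+eps) - cubeGrad(x-eps)) / (2 * eps)

    if RelativeError(fdd, 6*x) > 1e-4 {
        t.Fatalf("second derivative mismatch: got %g want %g", fdd, 6*x)
    }
}

func cubeGrad(xval float64) float64 {
    var tape Tape

    x := tape.Var(xval)
    y := tape.Mul(tape.Mul(x, x), x)

    tape.Backward(y)

    return tape.Grads[x]
}
```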

Higher-order systems often fail because:
- saved gradients are incorrect
- perturbation confusion occurs
- nested tapes interfere
- backward rules are not themselves differentiable

These bugs usually appear only in nested differentiation tests.

## Stateful Operator Testing

Stateful operators require consistency between forward and backward.

Dropout example:

```text
forward uses random mask M
backward must reuse M
```

Test:

```go
func TestDropoutConsistency(t *testing.T) {
    seed := int64(42)

    forwardMask := GenerateMask(seed)
    backwardMask := GenerateMask(seed)

    if !Equal(forwardMask, backwardMask) {
        t.Fatalf("mask mismatch")
    }
}
```

Without this consistency, gradients become random noise.
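
The same property can be tested through the op itself. The sketch below assumes hypothetical `DropoutForward` and `DropoutBackward` helpers where forward returns its mask and backward applies it: zeros in the output and zeros in the input gradient must coincide.

```go
func TestDropoutBackwardUsesSavedMask(t *testing.T) {
    in := []float64{1, 2, 3, 4} // all nonzero on purpose

    out, mask := DropoutForward(in, 0.5, 42)
    gradIn := DropoutBackward([]float64{1, 1, 1, 1}, mask)

    for i := range in {
        if (out[i] == 0) != (gradIn[i] == 0) {
            t.Fatalf("mask mismatch at index %d", i)
        }
    }
}
```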

## Property Testing

Some derivative properties hold independently of specific values.

Examples:
- gradient computation is linear
- the derivative of a constant is zero
- the chain rule composes correctly

Property testing checks algebraic invariants rather than individual numeric examples.

Example:

```text
grad(a*x + b*y)
=
a*grad(x) + b*grad(y)
```

Property tests are useful because they scale across many input combinations.
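
A direct encoding of the linearity invariant on the tape: for $z = ax + by$ the gradients must equal $a$ and $b$, whatever values are sampled. Assuming the standard backward rules for `Mul` and `Add`, the comparison is exact.

```go
func TestLinearityProperty(t *testing.T) {
    for i := 0; i < 100; i++ {
        a := rand.Float64()*4 - 2
        b := rand.Float64()*4 - 2

        var tape Tape

        x := tape.Var(rand.Float64())
        y := tape.Var(rand.Float64())

        z := tape.Add(
            tape.Mul(tape.Const(a), x),
            tape.Mul(tape.Const(b), y),
        )

        tape.Backward(z)

        if tape.Grads[x] != a || tape.Grads[y] != b {
            t.Fatalf("linearity violated: got (%g, %g) want (%g, %g)",
                tape.Grads[x], tape.Grads[y], a, b)
        }
    }
}
```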

## Graph Integrity Testing

The graph itself should be validated.

Checks:
- no dangling references
- no cycles in acyclic graphs
- valid slot indexes
- backward only visits reachable nodes
- gradients reset correctly
- tape reuse preserves correctness

Minimal invariant:

```text
Every backward rule reads valid primal values and writes valid gradient slots.
```

Violating this usually produces silent corruption.
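
One invariant worth an explicit test: calling `Backward` twice must not change the answer. This sketch assumes the engine is expected to reset gradient slots on each call; an engine that accumulates across calls would report double the gradient on the second pass.

```go
func TestBackwardIsRepeatable(t *testing.T) {
    var tape Tape

    x := tape.Var(2)
    y := tape.Mul(x, x)

    tape.Backward(y)
    first := tape.Grads[x]

    tape.Backward(y)

    if tape.Grads[x] != first {
        t.Fatalf("backward not repeatable: %g then %g", first, tape.Grads[x])
    }
}
```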

## Test Organization

A practical AD engine should separate tests by level.

| Test type | Purpose |
|---|---|
| Operator unit tests | Validate local derivative rules |
| Graph tests | Validate traversal and accumulation |
| Numerical tests | Validate stability |
| Randomized tests | Explore broad input space |
| Tensor shape tests | Validate broadcasting and reductions |
| End-to-end optimization tests | Validate realistic training behavior |
| Regression tests | Prevent performance and correctness regressions |

Operator-level tests should dominate. Most AD failures originate there.

## End-to-End Validation

Eventually the engine should optimize a real objective successfully.

Simple example:

$$
f(x) = (x-3)^2
$$

Gradient descent:

```go
func TestGradientDescentConverges(t *testing.T) {
    x := 0.0

    for i := 0; i < 100; i++ {
        var tape Tape

        xv := tape.Var(x)

        diff := tape.Sub(xv, tape.Const(3))
        loss := tape.Mul(diff, diff) // (x - 3)^2

        tape.Backward(loss)

        x -= 0.1 * tape.Grads[xv] // fixed learning rate
    }

    if math.Abs(x-3) > 1e-3 {
        t.Fatalf("did not converge: x=%g", x)
    }
}
```

Expected:

```text
x approaches 3
```

This does not prove the engine is correct, but it validates that gradients are directionally useful.

## Minimal Correctness Contract

A small reverse-mode engine should guarantee:

```text
Backward computes gradients consistent with the chain rule and local operator derivatives, up to floating point arithmetic.
```

Testing exists to validate this contract operationally.

No finite collection of tests proves correctness completely. But a layered testing strategy:
- operator tests
- finite differences
- randomized checks
- graph invariants
- end-to-end optimization

can make derivative failures rare, localizable, and reproducible.

