Perturbation Confusion

Perturbation confusion is a correctness bug that appears in nested automatic differentiation, especially nested forward mode. It happens when two derivative computations accidentally use the same infinitesimal perturbation.

Forward mode is often explained with dual numbers:

x + \epsilon \dot{x},

where

\epsilon^2 = 0.

The symbol \epsilon marks the tangent part of the computation. In a single forward-mode pass, this works cleanly. In nested AD, there may be several active derivative computations at once. Each one needs its own perturbation identity.

If two levels share the same \epsilon, the system cannot distinguish their tangents. The result can be mathematically wrong.

The Basic Failure

Suppose a system represents all forward-mode derivatives with the same dual symbol:

a + b\epsilon.

Now an outer derivative computation calls an inner derivative computation. If both use the same \epsilon, their tangent components are merged.

Instead of having two independent infinitesimals,

\epsilon_1 \quad \text{and} \quad \epsilon_2,

the system uses only one:

\epsilon.

Then terms from different derivative levels collapse into the same coefficient.

The system loses the distinction between:

b\epsilon_1

and

c\epsilon_2.

Both become:

(b+c)\epsilon.

This is perturbation confusion.

Why Nested AD Needs Fresh Perturbations

Each invocation of forward-mode AD should introduce a fresh perturbation tag.

For an outer derivative, we might use:

\epsilon_1.

For an inner derivative, we should use:

\epsilon_2.

These obey:

\epsilon_1^2 = 0, \quad \epsilon_2^2 = 0,

but their product may carry mixed derivative information:

\epsilon_1\epsilon_2.

So the correct nested algebra is:

a + b\epsilon_1 + c\epsilon_2 + d\epsilon_1\epsilon_2.

This representation preserves derivative levels separately.
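As a quick worked check with f(x) = x^2, lift x along both perturbations at once:

f(x + \epsilon_1 + \epsilon_2) = (x + \epsilon_1 + \epsilon_2)^2 = x^2 + 2x\epsilon_1 + 2x\epsilon_2 + 2\epsilon_1\epsilon_2.

The \epsilon_1 and \epsilon_2 coefficients each recover f'(x) = 2x, and the mixed \epsilon_1\epsilon_2 coefficient recovers f''(x) = 2.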

A Small Example

Let

f(x) = x^2.

The derivative is

f'(x) = 2x.

The second derivative is

f''(x) = 2.

To compute the second derivative using nested forward mode, we differentiate the derivative computation. This requires two independent perturbations.

Use:

x + \epsilon_1.

Then:

f(x+\epsilon_1) = (x+\epsilon_1)^2 = x^2 + 2x\epsilon_1.

The first derivative is the coefficient of \epsilon_1:

2x.

Now differentiate 2x with respect to x. This needs a new perturbation:

x + \epsilon_2.

Then:

2(x+\epsilon_2) = 2x + 2\epsilon_2.

So the second derivative is:

2.

If the inner and outer perturbations are confused, the system may extract the wrong coefficient or mix tangent information across levels.

Function Closures and Hidden Perturbations

Perturbation confusion often appears with higher-order functions.

Consider the pattern:

def derivative(f):
    # dual(x, 1) attaches the single, global perturbation;
    # tangent_part reads it back out of the result
    return lambda x: tangent_part(f(dual(x, 1)))

This looks harmless.

Now suppose it is used inside another derivative computation:

g = derivative(lambda x: x * x)
h = derivative(g)

The outer call to derivative creates a perturbation. The inner call also creates a perturbation. If both use the same global perturbation identity, the nested result can be wrong.

The bug is subtle because the first derivative may work correctly. The error appears only when differentiation is nested.
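The failure can be reproduced in a few lines. Below is a deliberately buggy sketch (the `Dual` class and `derivative` wrapper are illustrative, not any particular library's API) in which every transform shares one implicit perturbation:

```python
# Minimal forward-mode AD with a single shared, untagged perturbation.
# This implementation is deliberately wrong under nesting.

class Dual:
    def __init__(self, primal, tangent=0.0):
        self.primal, self.tangent = primal, tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

    __rmul__ = __mul__

def derivative(f):
    # Every call reuses the *same* implicit epsilon: the bug.
    return lambda x: f(Dual(x, 1.0)).tangent

# d/dx [ x * d/dy (x + y) ] at x = 1.
# The inner derivative is 1, so the expression is just x: the answer is 1.
inner = lambda x: x * derivative(lambda y: x + y)(x)
print(derivative(inner)(1.0))  # prints 2.0, not 1.0
```

The first derivative alone is correct (`derivative(lambda x: x * x)(3.0)` gives 6.0), but the nested call returns 2.0 instead of 1.0: the inner extraction picks up the outer tangent as well as its own.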

Tags

The standard solution is tagging.

Each AD transform allocates a fresh tag:

tag = fresh()

A dual number then stores tangent components by tag:

Dual {
    primal
    tangents: map[tag]value
}

When arithmetic combines dual numbers, tangent components with the same tag combine. Tangent components with different tags remain separate.

Conceptually:

x + \dot{x}_1\epsilon_1 + \dot{x}_2\epsilon_2 + \cdots.

A derivative transform extracts only the tangent associated with its own tag.

This prevents inner derivatives from consuming outer tangents, and outer derivatives from confusing inner tangent values with their own.
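A minimal tagged implementation might look like the following sketch (all names are illustrative): duals may nest, foreign-tag duals are treated as constants, and extraction recurses past other tags.

```python
import itertools

_fresh = itertools.count()  # global fresh-tag supply

class Dual:
    """Represents primal + tangent * eps_tag.

    primal and tangent may themselves be Duals carrying other tags,
    which is how nested derivative levels stay separate."""
    def __init__(self, primal, tangent, tag):
        self.primal, self.tangent, self.tag = primal, tangent, tag

    def _parts(self, other):
        # Split `other` along this Dual's tag; anything without this tag
        # at top level (numbers, other-tag Duals) acts as a constant.
        if isinstance(other, Dual) and other.tag == self.tag:
            return other.primal, other.tangent
        return other, 0.0

    def __add__(self, other):
        op, ot = self._parts(other)
        return Dual(self.primal + op, self.tangent + ot, self.tag)
    __radd__ = __add__

    def __mul__(self, other):
        op, ot = self._parts(other)
        return Dual(self.primal * op,
                    self.primal * ot + self.tangent * op,
                    self.tag)
    __rmul__ = __mul__

def _tangent(y, tag):
    # Coefficient of eps_tag inside y, recursing past other tags.
    if not isinstance(y, Dual):
        return 0.0
    if y.tag == tag:
        return y.tangent
    return Dual(_tangent(y.primal, tag), _tangent(y.tangent, tag), y.tag)

def derivative(f):
    def df(x):
        tag = next(_fresh)                  # fresh perturbation per transform
        return _tangent(f(Dual(x, 1.0, tag)), tag)
    return df

d2 = derivative(derivative(lambda x: x * x))
print(d2(3.0))   # 2.0

# The classic confusion example now comes out right:
g = derivative(lambda x: x * derivative(lambda y: x + y)(x))
print(g(1.0))    # 1.0 (a tagless implementation reports 2.0)
```

The recursion in `_tangent` matters: a tag can end up buried inside a dual carrying a different tag, and extraction must still find only its own coefficient.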

Lexical Scope of Perturbations

Fresh perturbation tags should behave like lexically scoped variables.

A derivative transform introduces a tag, runs the function, extracts the derivative for that tag, and then closes the scope.

derivative(f, x):
    tag = fresh()            # fresh for this transform only
    y = f(lift(x, tag, 1))   # seed x with tangent 1 under this tag
    return tangent(y, tag)   # read back only this tag's tangent

The tag should not be visible outside the derivative transform except through the computed result.

This mirrors variable binding in programming languages. Reusing a tag across scopes is like accidentally reusing a local variable from another function call.

Higher-Order Terms

For second-order mixed derivatives, the system must also represent products of perturbations.

For example:

\epsilon_1\epsilon_2

is distinct from both:

\epsilon_1

and

\epsilon_2.

A simple map from tag to tangent is enough for independent first-order nested derivatives, but full higher-order forward mode needs a richer representation.

One possible representation is keyed by sets or multisets of perturbation tags:

coeff[{}]          = primal
coeff[{tag1}]      = first derivative for tag1
coeff[{tag2}]      = first derivative for tag2
coeff[{tag1,tag2}] = mixed second derivative

For higher orders, the number of tag combinations grows quickly.

This is why Taylor mode and jet representations are often preferable for systematic higher-order forward AD.
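The set-keyed representation above can be sketched directly for low orders; `MDual` and `variable` are hypothetical names, not a real library's API:

```python
from itertools import count

_fresh = count()

class MDual:
    """Coefficients keyed by frozensets of perturbation tags.

    eps_S * eps_T = eps_{S union T} when S and T are disjoint,
    and 0 otherwise (each eps squares to zero)."""
    def __init__(self, coeff):
        self.coeff = dict(coeff)   # frozenset of tags -> float

    def __add__(self, other):
        out = dict(self.coeff)
        for s, c in other.coeff.items():
            out[s] = out.get(s, 0.0) + c
        return MDual(out)

    def __mul__(self, other):
        out = {}
        for s, a in self.coeff.items():
            for t, b in other.coeff.items():
                if s & t:          # shared tag: eps^2 = 0 kills the term
                    continue
                u = s | t
                out[u] = out.get(u, 0.0) + a * b
        return MDual(out)

def variable(x, tag):
    return MDual({frozenset(): x, frozenset([tag]): 1.0})

# f(x, y) = x^2 * y, so d2f/dxdy = 2x.
t1, t2 = next(_fresh), next(_fresh)
x = variable(3.0, t1)
y = variable(5.0, t2)
f = x * x * y
print(f.coeff[frozenset([t1, t2])])   # 6.0  (= 2x at x = 3)
```

The disjointness check in `__mul__` is exactly the nilpotency rule: any product that would repeat a tag vanishes, while products of distinct tags populate the mixed-derivative coefficients.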

Reverse Mode Analogue

Perturbation confusion is usually discussed for forward mode, but nested reverse mode has a related issue.

Reverse mode uses cotangents instead of tangents. A nested reverse-mode system must keep cotangent levels separate.

An outer gradient computation may contain an inner gradient computation. The cotangents from the inner reverse pass should not be accumulated into the cotangent storage of the outer pass unless the semantics require it.

This requires explicit derivative levels:

level 0: primal computation
level 1: first derivative
level 2: second derivative

If cotangent storage is global or poorly scoped, nested reverse mode can produce incorrect gradients, duplicate accumulation, or missing contributions.

Dynamic Graphs

Perturbation confusion is especially dangerous in dynamic graph systems.

In a dynamic graph, the derivative trace is built while the program runs. Nested derivative calls may occur inside ordinary control flow.

Example:

def f(x):
    if derivative(g)(x) > 0:
        return x * x
    return x

Here the derivative computation affects which branch is taken.

A correct system must know which AD level each value belongs to. Otherwise, tangents from the inner derivative may leak into the outer trace or change the meaning of branch conditions.

Custom Derivative Rules

Custom gradients can also introduce perturbation confusion.

A custom rule may call derivative operations internally:

backward(x, y_bar):
    return derivative(auxiliary)(x) * y_bar

If this backward rule is used under nested AD, it may allocate perturbations while another derivative transform is already active.

The custom rule must respect the active AD levels and allocate fresh tags. Otherwise, higher-order derivatives become unreliable.

Production systems should document whether custom derivative rules are higher-order safe.

Common Symptoms

Perturbation confusion can be difficult to diagnose because results are often plausible.

Typical symptoms include:

| Symptom | Likely cause |
| --- | --- |
| first derivatives correct, second derivatives wrong | shared perturbation tag |
| nested gradients depend on call order | global AD state |
| custom gradient works alone, fails when nested | derivative level leak |
| Hessian asymmetric for smooth function | mixed derivative bookkeeping bug |
| derivative of constant expression is nonzero | tangent leakage |
| zero higher-order derivative when it should be nonzero | inner tangent consumed too early |

The bug often appears only in small nested examples, so test suites should include explicit nested AD cases.

Testing for Perturbation Confusion

A minimal test is:

f(x) = x^2.

A nested derivative should give:

\frac{d}{dx}\frac{d}{dx}x^2 = 2.

Another useful test is:

f(x) = x^3.

Then:

f'(x) = 3x^2, \quad f''(x) = 6x, \quad f^{(3)}(x) = 6.

A system should compute these correctly under repeated nesting.

For multivariate tests, use:

f(x,y) = x^2y.

The mixed derivative is:

\frac{\partial^2 f}{\partial x \partial y} = 2x.

The result should be independent of whether the system computes the x derivative first or the y derivative first.
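These identities can be bundled into a reusable harness. The sketch below assumes a `derivative` transform that maps functions to functions; `check_nested` and `fd_derivative` are illustrative names, and central differences serve as a slow but correctly nesting stand-in:

```python
def check_nested(derivative):
    """Nested-AD sanity checks that any forward-mode `derivative`
    transform (a function-to-function wrapper) should pass."""
    f = lambda x: x * x * x

    d1 = derivative(f)
    d2 = derivative(d1)
    d3 = derivative(d2)
    assert abs(d1(2.0) - 12.0) < 1e-3   # f'(x)   = 3x^2
    assert abs(d2(2.0) - 12.0) < 1e-3   # f''(x)  = 6x
    assert abs(d3(2.0) - 6.0) < 1e-3    # f'''(x) = 6

    # Mixed partials of f(x, y) = x^2 y must agree in either order.
    g = lambda x, y: x * x * y
    x0, y0 = 3.0, 5.0
    dxy = derivative(lambda y: derivative(lambda x: g(x, y))(x0))(y0)
    dyx = derivative(lambda x: derivative(lambda y: g(x, y))(y0))(x0)
    assert abs(dxy - dyx) < 1e-3
    assert abs(dxy - 2 * x0) < 1e-3     # d2f/dxdy = 2x

# Central differences nest correctly (if slowly and inaccurately),
# which makes them a useful reference for the harness itself.
def fd_derivative(f, h=1e-3):
    return lambda x: (f(x + h) - f(x - h)) / (2 * h)

check_nested(fd_derivative)
print("all nested-derivative checks passed")
```

An AD system's own `derivative` transform can be dropped into the same harness; a tagless implementation will typically fail the second-derivative or mixed-partial assertions.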

Implementation Discipline

A robust nested AD implementation needs four properties.

| Property | Requirement |
| --- | --- |
| fresh tags | every AD transform gets a distinct identity |
| scoped extraction | derivative calls extract only their own tag |
| level-aware storage | tangents and cotangents belong to explicit levels |
| compositional rules | primitives preserve derivative structure under nesting |

These rules prevent accidental mixing of derivative information.

Practical Design Principle

Perturbation confusion is a scoping error.

The perturbation introduced by a derivative transform should behave like a local variable. It must be fresh, scoped, and inaccessible to unrelated derivative computations.

When this discipline is followed, nested AD becomes compositional. When it is not, higher-order derivatives may be wrong even though all first-order tests pass.