Perturbation Confusion

Perturbation confusion is a correctness bug that appears in nested automatic differentiation, especially nested forward mode. It happens when two derivative computations accidentally use the same infinitesimal perturbation.

Forward mode is often explained with dual numbers:

x + \epsilon \dot{x},

where

\epsilon^2 = 0.

The symbol \epsilon marks the tangent part of the computation. In a single forward-mode pass, this works cleanly. In nested AD, there may be several active derivative computations at once. Each one needs its own perturbation identity.

If two levels share the same \epsilon, the system cannot distinguish their tangents. The result can be mathematically wrong.

The Basic Failure

Suppose a system represents all forward-mode derivatives with the same dual symbol:

a + b\epsilon.

Now an outer derivative computation calls an inner derivative computation. If both use the same \epsilon, their tangent components are merged.

Instead of having two independent infinitesimals,

\epsilon_1 \quad \text{and} \quad \epsilon_2,

the system uses only one:

\epsilon.

Then terms from different derivative levels collapse into the same coefficient.

The system loses the distinction between:

b\epsilon_1

and

c\epsilon_2.

Both become:

(b+c)\epsilon.

This is perturbation confusion.

Why Nested AD Needs Fresh Perturbations

Each invocation of forward-mode AD should introduce a fresh perturbation tag.

For an outer derivative, we might use:

\epsilon_1.

For an inner derivative, we should use:

\epsilon_2.

These obey:

\epsilon_1^2 = 0, \quad \epsilon_2^2 = 0,

but their product may carry mixed derivative information:

\epsilon_1\epsilon_2.

So the correct nested algebra is:

a + b\epsilon_1 + c\epsilon_2 + d\epsilon_1\epsilon_2.

This representation preserves derivative levels separately.
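As a quick worked check with f(x) = x^2, lift x along both perturbations at once:

f(x + \epsilon_1 + \epsilon_2) = (x + \epsilon_1 + \epsilon_2)^2 = x^2 + 2x\epsilon_1 + 2x\epsilon_2 + 2\epsilon_1\epsilon_2.

The \epsilon_1 and \epsilon_2 coefficients each recover f'(x) = 2x, and the mixed \epsilon_1\epsilon_2 coefficient recovers f''(x) = 2.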

A Small Example

Let

f(x) = x^2.

The derivative is

f'(x) = 2x.

The second derivative is

f''(x) = 2.

To compute the second derivative using nested forward mode, we differentiate the derivative computation. This requires two independent perturbations.

Use:

x + \epsilon_1.

Then:

f(x+\epsilon_1) = (x+\epsilon_1)^2 = x^2 + 2x\epsilon_1.

The first derivative is the coefficient of \epsilon_1:

2x.

Now differentiate 2x with respect to x. This needs a new perturbation:

x + \epsilon_2.

Then:

2(x+\epsilon_2) = 2x + 2\epsilon_2.

So the second derivative is:

2.

If the inner and outer perturbations are confused, the system may extract the wrong coefficient or mix tangent information across levels.

Function Closures and Hidden Perturbations

Perturbation confusion often appears with higher-order functions.

Consider the pattern:

def derivative(f):
    # dual(x, 1) attaches the single, global perturbation;
    # tangent_part reads it back out of the result
    return lambda x: tangent_part(f(dual(x, 1)))

This looks harmless.

Now suppose it is used inside another derivative computation:

g = derivative(lambda x: x * x)
h = derivative(g)

The outer call to derivative creates a perturbation. The inner call also creates a perturbation. If both use the same global perturbation identity, the nested result can be wrong.

The bug is subtle because the first derivative may work correctly. The error appears only when differentiation is nested.
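The failure can be reproduced in a few lines. Below is a deliberately buggy sketch (the `Dual` class and `derivative` wrapper are illustrative, not any particular library's API) in which every transform shares one implicit perturbation:

```python
# Minimal forward-mode AD with a single shared, untagged perturbation.
# This implementation is deliberately wrong under nesting.

class Dual:
    def __init__(self, primal, tangent=0.0):
        self.primal, self.tangent = primal, tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

    __rmul__ = __mul__

def derivative(f):
    # Every call reuses the *same* implicit epsilon: the bug.
    return lambda x: f(Dual(x, 1.0)).tangent

# d/dx [ x * d/dy (x + y) ] at x = 1.
# The inner derivative is 1, so the expression is just x: the answer is 1.
inner = lambda x: x * derivative(lambda y: x + y)(x)
print(derivative(inner)(1.0))  # prints 2.0, not 1.0
```

The first derivative alone is correct (`derivative(lambda x: x * x)(3.0)` gives 6.0), but the nested call returns 2.0 instead of 1.0: the inner extraction picks up the outer tangent as well as its own.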

Tags

The standard solution is tagging.

Each AD transform allocates a fresh tag:

tag = fresh()

A dual number then stores tangent components by tag:

Dual {
    primal
    tangents: map[tag]value
}

When arithmetic combines dual numbers, tangent components with the same tag combine. Tangent components with different tags remain separate.

Conceptually:

x + \dot{x}_1\epsilon_1 + \dot{x}_2\epsilon_2 + \cdots.

A derivative transform extracts only the tangent associated with its own tag.

This prevents inner derivatives from consuming outer tangents, and outer derivatives from confusing inner tangent values with their own.
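A minimal tagged implementation might look like the following sketch (all names are illustrative): duals may nest, foreign-tag duals are treated as constants, and extraction recurses past other tags.

```python
import itertools

_fresh = itertools.count()  # global fresh-tag supply

class Dual:
    """Represents primal + tangent * eps_tag.

    primal and tangent may themselves be Duals carrying other tags,
    which is how nested derivative levels stay separate."""
    def __init__(self, primal, tangent, tag):
        self.primal, self.tangent, self.tag = primal, tangent, tag

    def _parts(self, other):
        # Split `other` along this Dual's tag; anything without this tag
        # at top level (numbers, other-tag Duals) acts as a constant.
        if isinstance(other, Dual) and other.tag == self.tag:
            return other.primal, other.tangent
        return other, 0.0

    def __add__(self, other):
        op, ot = self._parts(other)
        return Dual(self.primal + op, self.tangent + ot, self.tag)
    __radd__ = __add__

    def __mul__(self, other):
        op, ot = self._parts(other)
        return Dual(self.primal * op,
                    self.primal * ot + self.tangent * op,
                    self.tag)
    __rmul__ = __mul__

def _tangent(y, tag):
    # Coefficient of eps_tag inside y, recursing past other tags.
    if not isinstance(y, Dual):
        return 0.0
    if y.tag == tag:
        return y.tangent
    return Dual(_tangent(y.primal, tag), _tangent(y.tangent, tag), y.tag)

def derivative(f):
    def df(x):
        tag = next(_fresh)                  # fresh perturbation per transform
        return _tangent(f(Dual(x, 1.0, tag)), tag)
    return df

d2 = derivative(derivative(lambda x: x * x))
print(d2(3.0))   # 2.0

# The classic confusion example now comes out right:
g = derivative(lambda x: x * derivative(lambda y: x + y)(x))
print(g(1.0))    # 1.0 (a tagless implementation reports 2.0)
```

The recursion in `_tangent` matters: a tag can end up buried inside a dual carrying a different tag, and extraction must still find only its own coefficient.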

Lexical Scope of Perturbations

Fresh perturbation tags should behave like lexically scoped variables.

A derivative transform introduces a tag, runs the function, extracts the derivative for that tag, and then closes the scope.

derivative(f, x):
    tag = fresh()            # fresh for this transform only
    y = f(lift(x, tag, 1))   # seed x with tangent 1 under this tag
    return tangent(y, tag)   # read back only this tag's tangent

The tag should not be visible outside the derivative transform except through the computed result.

This mirrors variable binding in programming languages. Reusing a tag across scopes is like accidentally reusing a local variable from another function call.

Higher-Order Terms

For second-order mixed derivatives, the system must also represent products of perturbations.

For example:

\epsilon_1\epsilon_2

is distinct from both:

\epsilon_1

and

\epsilon_2.

A simple map from tag to tangent is enough for independent first-order nested derivatives, but full higher-order forward mode needs a richer representation.

One possible representation is keyed by sets or multisets of perturbation tags:

coeff[{}]          = primal
coeff[{tag1}]      = first derivative for tag1
coeff[{tag2}]      = first derivative for tag2
coeff[{tag1,tag2}] = mixed second derivative

For higher orders, the number of tag combinations grows quickly.

This is why Taylor mode and jet representations are often preferable for systematic higher-order forward AD.
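The set-keyed representation above can be sketched directly for low orders; `MDual` and `variable` are hypothetical names, not a real library's API:

```python
from itertools import count

_fresh = count()

class MDual:
    """Coefficients keyed by frozensets of perturbation tags.

    eps_S * eps_T = eps_{S union T} when S and T are disjoint,
    and 0 otherwise (each eps squares to zero)."""
    def __init__(self, coeff):
        self.coeff = dict(coeff)   # frozenset of tags -> float

    def __add__(self, other):
        out = dict(self.coeff)
        for s, c in other.coeff.items():
            out[s] = out.get(s, 0.0) + c
        return MDual(out)

    def __mul__(self, other):
        out = {}
        for s, a in self.coeff.items():
            for t, b in other.coeff.items():
                if s & t:          # shared tag: eps^2 = 0 kills the term
                    continue
                u = s | t
                out[u] = out.get(u, 0.0) + a * b
        return MDual(out)

def variable(x, tag):
    return MDual({frozenset(): x, frozenset([tag]): 1.0})

# f(x, y) = x^2 * y, so d2f/dxdy = 2x.
t1, t2 = next(_fresh), next(_fresh)
x = variable(3.0, t1)
y = variable(5.0, t2)
f = x * x * y
print(f.coeff[frozenset([t1, t2])])   # 6.0  (= 2x at x = 3)
```

The disjointness check in `__mul__` is exactly the nilpotency rule: any product that would repeat a tag vanishes, while products of distinct tags populate the mixed-derivative coefficients.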

Reverse Mode Analogue

Perturbation confusion is usually discussed for forward mode, but nested reverse mode has a related issue.

Reverse mode uses cotangents instead of tangents. A nested reverse-mode system must keep cotangent levels separate.

An outer gradient computation may contain an inner gradient computation. The cotangents from the inner reverse pass should not be accumulated into the cotangent storage of the outer pass unless the semantics require it.

This requires explicit derivative levels:

level 0: primal computation
level 1: first derivative
level 2: second derivative

If cotangent storage is global or poorly scoped, nested reverse mode can produce incorrect gradients, duplicate accumulation, or missing contributions.

Dynamic Graphs

Perturbation confusion is especially dangerous in dynamic graph systems.

In a dynamic graph, the derivative trace is built while the program runs. Nested derivative calls may occur inside ordinary control flow.

Example:

def f(x):
    if derivative(g)(x) > 0:
        return x * x
    return x

Here the derivative computation affects which branch is taken.

A correct system must know which AD level each value belongs to. Otherwise, tangents from the inner derivative may leak into the outer trace or change the meaning of branch conditions.

Custom Derivative Rules

Custom gradients can also introduce perturbation confusion.

A custom rule may call derivative operations internally:

backward(x, y_bar):
    return derivative(auxiliary)(x) * y_bar

If this backward rule is used under nested AD, it may allocate perturbations while another derivative transform is already active.

The custom rule must respect the active AD levels and allocate fresh tags. Otherwise, higher-order derivatives become unreliable.

Production systems should document whether custom derivative rules are higher-order safe.

Common Symptoms

Perturbation confusion can be difficult to diagnose because results are often plausible.

Typical symptoms include:

| Symptom | Likely cause |
| --- | --- |
| first derivatives correct, second derivatives wrong | shared perturbation tag |
| nested gradients depend on call order | global AD state |
| custom gradient works alone, fails when nested | derivative level leak |
| Hessian asymmetric for smooth function | mixed derivative bookkeeping bug |
| derivative of constant expression is nonzero | tangent leakage |
| zero higher-order derivative when it should be nonzero | inner tangent consumed too early |

The bug often appears only in small nested examples, so test suites should include explicit nested AD cases.

Testing for Perturbation Confusion

A minimal test is:

f(x) = x^2.

A nested derivative should give:

\frac{d}{dx}\frac{d}{dx}x^2 = 2.

Another useful test is:

f(x) = x^3.

Then:

f'(x) = 3x^2, \quad f''(x) = 6x, \quad f^{(3)}(x) = 6.

A system should compute these correctly under repeated nesting.

For multivariate tests, use:

f(x,y) = x^2y.

The mixed derivative is:

\frac{\partial^2 f}{\partial x \partial y} = 2x.

The result should be independent of whether the system computes the x derivative first or the y derivative first.
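These identities can be bundled into a reusable harness. The sketch below assumes a `derivative` transform that maps functions to functions; `check_nested` and `fd_derivative` are illustrative names, and central differences serve as a slow but correctly nesting stand-in:

```python
def check_nested(derivative):
    """Nested-AD sanity checks that any forward-mode `derivative`
    transform (a function-to-function wrapper) should pass."""
    f = lambda x: x * x * x

    d1 = derivative(f)
    d2 = derivative(d1)
    d3 = derivative(d2)
    assert abs(d1(2.0) - 12.0) < 1e-3   # f'(x)   = 3x^2
    assert abs(d2(2.0) - 12.0) < 1e-3   # f''(x)  = 6x
    assert abs(d3(2.0) - 6.0) < 1e-3    # f'''(x) = 6

    # Mixed partials of f(x, y) = x^2 y must agree in either order.
    g = lambda x, y: x * x * y
    x0, y0 = 3.0, 5.0
    dxy = derivative(lambda y: derivative(lambda x: g(x, y))(x0))(y0)
    dyx = derivative(lambda x: derivative(lambda y: g(x, y))(y0))(x0)
    assert abs(dxy - dyx) < 1e-3
    assert abs(dxy - 2 * x0) < 1e-3     # d2f/dxdy = 2x

# Central differences nest correctly (if slowly and inaccurately),
# which makes them a useful reference for the harness itself.
def fd_derivative(f, h=1e-3):
    return lambda x: (f(x + h) - f(x - h)) / (2 * h)

check_nested(fd_derivative)
print("all nested-derivative checks passed")
```

An AD system's own `derivative` transform can be dropped into the same harness; a tagless implementation will typically fail the second-derivative or mixed-partial assertions.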

Implementation Discipline

A robust nested AD implementation needs four properties.

| Property | Requirement |
| --- | --- |
| fresh tags | every AD transform gets a distinct identity |
| scoped extraction | derivative calls extract only their own tag |
| level-aware storage | tangents and cotangents belong to explicit levels |
| compositional rules | primitives preserve derivative structure under nesting |

These rules prevent accidental mixing of derivative information.

Practical Design Principle

Perturbation confusion is a scoping error.

The perturbation introduced by a derivative transform should behave like a local variable. It must be fresh, scoped, and inaccessible to unrelated derivative computations.

When this discipline is followed, nested AD becomes compositional. When it is not, higher-order derivatives may be wrong even though all first-order tests pass.