# Perturbation Confusion

Perturbation confusion is a correctness bug that appears in nested automatic differentiation, especially nested forward mode. It happens when two derivative computations accidentally use the same infinitesimal perturbation.

Forward mode is often explained with dual numbers:

$$
x + \epsilon \dot{x},
$$

where

$$
\epsilon^2 = 0.
$$
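
Propagating such a value through a differentiable function $f$ amounts to a first-order Taylor expansion that $\epsilon^2 = 0$ truncates exactly:

$$
f(x + \epsilon \dot{x}) = f(x) + f'(x)\,\dot{x}\,\epsilon.
$$

The coefficient of $\epsilon$ in the output is the derivative scaled by the input tangent.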

The symbol $\epsilon$ marks the tangent part of the computation. In a single forward-mode pass, this works cleanly. In nested AD, there may be several active derivative computations at once. Each one needs its own perturbation identity.

If two levels share the same $\epsilon$, the system cannot distinguish their tangents. The result can be mathematically wrong.

## The Basic Failure

Suppose a system represents all forward-mode derivatives with the same dual symbol:

$$
a + b\epsilon.
$$

Now an outer derivative computation calls an inner derivative computation. If both use the same $\epsilon$, their tangent components are merged.

Instead of having two independent infinitesimals,

$$
\epsilon_1
\quad \text{and} \quad
\epsilon_2,
$$

the system uses only one:

$$
\epsilon.
$$

Then terms from different derivative levels collapse into the same coefficient.

The system loses the distinction between:

$$
b\epsilon_1
$$

and

$$
c\epsilon_2.
$$

Together they collapse into the single term:

$$
(b+c)\epsilon.
$$

This is perturbation confusion.
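
To make the collapse concrete, here is a minimal sketch of an untagged dual-number implementation; all names (`Dual`, `parts`, `add`, `mul`, `derivative`) are illustrative rather than any particular library's API. The nested call below is mathematically the identity function, so its derivative should be 1, but the shared perturbation reports 2.

```python
class Dual:
    """Untagged dual number: primal + tangent * eps, one anonymous eps."""
    def __init__(self, primal, tangent):
        self.primal, self.tangent = primal, tangent

def parts(v):
    # View any value as (primal, tangent); plain numbers carry a zero tangent.
    return (v.primal, v.tangent) if isinstance(v, Dual) else (v, 0)

def add(a, b):
    if not isinstance(a, Dual) and not isinstance(b, Dual):
        return a + b
    (ap, at), (bp, bt) = parts(a), parts(b)
    return Dual(add(ap, bp), add(at, bt))

def mul(a, b):
    if not isinstance(a, Dual) and not isinstance(b, Dual):
        return a * b
    (ap, at), (bp, bt) = parts(a), parts(b)
    return Dual(mul(ap, bp), add(mul(ap, bt), mul(at, bp)))

def derivative(f):
    # BUG: every call seeds "the" same anonymous epsilon, so an outer
    # perturbation is indistinguishable from an inner one.
    def df(x):
        y = f(Dual(x, 1))
        return y.tangent if isinstance(y, Dual) else 0
    return df

# d/dx [ x * d/dy (x + y) ] should be 1, since the inner derivative is 1.
f = lambda x: mul(x, derivative(lambda y: add(x, y))(1.0))
print(derivative(f)(3.0))   # prints 2.0, not 1.0: the tangent of the outer x
                            # leaked into the inner derivative and was counted twice
```

The first derivative of an ordinary function is still correct here; for example `derivative(lambda x: mul(x, x))(3.0)` returns 6.0, which is exactly why the bug is easy to miss.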

## Why Nested AD Needs Fresh Perturbations

Each invocation of forward-mode AD should introduce a fresh perturbation tag.

For an outer derivative, we might use:

$$
\epsilon_1.
$$

For an inner derivative, we should use:

$$
\epsilon_2.
$$

These obey:

$$
\epsilon_1^2 = 0,
\quad
\epsilon_2^2 = 0,
$$

but their product may carry mixed derivative information:

$$
\epsilon_1\epsilon_2.
$$

So the correct nested algebra is:

$$
a
+
b\epsilon_1
+
c\epsilon_2
+
d\epsilon_1\epsilon_2.
$$

This representation preserves derivative levels separately.
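
For instance, multiplying two such elements (writing the second with primed coefficients) distributes as usual and drops every term containing $\epsilon_1^2$ or $\epsilon_2^2$:

$$
(a + b\epsilon_1 + c\epsilon_2 + d\epsilon_1\epsilon_2)
(a' + b'\epsilon_1 + c'\epsilon_2 + d'\epsilon_1\epsilon_2)
=
aa'
+ (ab' + ba')\epsilon_1
+ (ac' + ca')\epsilon_2
+ (ad' + bc' + cb' + da')\epsilon_1\epsilon_2.
$$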

## A Small Example

Let

$$
f(x) = x^2.
$$

The derivative is

$$
f'(x) = 2x.
$$

The second derivative is

$$
f''(x) = 2.
$$

To compute the second derivative using nested forward mode, we differentiate the derivative computation. This requires two independent perturbations.

Use:

$$
x + \epsilon_1.
$$

Then:

$$
f(x+\epsilon_1) =
(x+\epsilon_1)^2 =
x^2 + 2x\epsilon_1.
$$

The first derivative is the coefficient of $\epsilon_1$:

$$
2x.
$$

Now differentiate $2x$ with respect to $x$. This needs a new perturbation:

$$
x + \epsilon_2.
$$

Then:

$$
2(x+\epsilon_2)=2x+2\epsilon_2.
$$

So the second derivative is:

$$
2.
$$
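
Equivalently, with both perturbations live at once, as they are in an actual nested pass, the expansion keeps the levels separate:

$$
f(x + \epsilon_1 + \epsilon_2)
= (x + \epsilon_1 + \epsilon_2)^2
= x^2 + 2x\epsilon_1 + 2x\epsilon_2 + 2\epsilon_1\epsilon_2,
$$

and the second derivative $f''(x) = 2$ appears as the coefficient of $\epsilon_1\epsilon_2$.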

If the inner and outer perturbations are confused, the system may extract the wrong coefficient or mix tangent information across levels.

## Function Closures and Hidden Perturbations

Perturbation confusion often appears with higher-order functions.

Consider the pattern:

```text
def derivative(f):
    return lambda x: tangent_part(f(dual(x, 1)))
```

This looks harmless.

Now suppose it is used inside another derivative computation:

```text
g = derivative(lambda x: x * x)
h = derivative(g)
```

The outer call to `derivative` creates a perturbation. The inner call also creates a perturbation. If both use the same global perturbation identity, the nested result can be wrong.

The bug is subtle because the first derivative may work correctly. The error appears only when differentiation is nested.

## Tags

The standard solution is tagging.

Each AD transform allocates a fresh tag:

```text
tag = fresh()
```

A dual number then stores tangent components by tag:

```text
Dual {
    primal
    tangents: map[tag]value
}
```

When arithmetic combines dual numbers, tangent components with the same tag combine. Tangent components with different tags remain separate.

Conceptually:

$$
x
+
\dot{x}_1\epsilon_1
+
\dot{x}_2\epsilon_2
+
\cdots.
$$

A derivative transform extracts only the tangent associated with its own tag.

This prevents inner derivatives from consuming outer tangents, and outer derivatives from confusing inner tangent values with their own.
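
As one concrete, purely illustrative way to realize this, the sketch below uses nested single-tag duals instead of the tag-keyed map above: each `Dual` carries one tag, and values for other tags simply nest inside its primal and tangent parts. All names are assumptions, not any particular library's API.

```python
import itertools

_fresh = itertools.count()          # global supply of fresh perturbation tags

class Dual:
    """primal + tangent * eps_tag; primal and tangent may themselves be Duals."""
    def __init__(self, tag, primal, tangent):
        self.tag, self.primal, self.tangent = tag, primal, tangent

def parts(v, tag):
    # View v as (primal, tangent) with respect to eps_tag; anything that is
    # not a Dual carrying exactly this tag is a constant for this tag.
    if isinstance(v, Dual) and v.tag == tag:
        return v.primal, v.tangent
    return v, 0

def add(a, b):
    if not isinstance(a, Dual) and not isinstance(b, Dual):
        return a + b
    tag = a.tag if isinstance(a, Dual) else b.tag
    (ap, at), (bp, bt) = parts(a, tag), parts(b, tag)
    return Dual(tag, add(ap, bp), add(at, bt))

def mul(a, b):
    if not isinstance(a, Dual) and not isinstance(b, Dual):
        return a * b
    tag = a.tag if isinstance(a, Dual) else b.tag
    (ap, at), (bp, bt) = parts(a, tag), parts(b, tag)
    return Dual(tag, mul(ap, bp), add(mul(ap, bt), mul(at, bp)))

def tangent(y, tag):
    # Extract the coefficient of eps_tag, looking through duals of other tags.
    if not isinstance(y, Dual):
        return 0
    if y.tag == tag:
        return y.tangent
    return Dual(y.tag, tangent(y.primal, tag), tangent(y.tangent, tag))

def derivative(f):
    def df(x):
        tag = next(_fresh)                      # fresh perturbation per call
        return tangent(f(Dual(tag, x, 1)), tag)
    return df

square = lambda x: mul(x, x)
print(derivative(derivative(square))(3.0))                  # 2.0

g = lambda x: mul(x, derivative(lambda y: add(x, y))(1.0))
print(derivative(g)(3.0))                                   # 1.0
```

The second call is the same nested-closure example that reported 2 with an untagged implementation; with fresh tags and tag-scoped extraction it returns the correct 1.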

## Lexical Scope of Perturbations

Fresh perturbation tags should behave like lexically scoped variables.

A derivative transform introduces a tag, runs the function, extracts the derivative for that tag, and then closes the scope.

```text
derivative(f):
    def df(x):
        tag = fresh()
        y = f(lift(x, tag, 1))
        return tangent(y, tag)
    return df
```

The `tag` should not be visible outside the derivative transform except through the computed result.

This mirrors variable binding in programming languages. Reusing a tag across scopes is like accidentally reusing a local variable from another function call.

## Higher-Order Terms

For second-order mixed derivatives, the system must also represent products of perturbations.

For example:

$$
\epsilon_1\epsilon_2
$$

is distinct from both:

$$
\epsilon_1
$$

and

$$
\epsilon_2.
$$

A simple map from tag to tangent is enough for independent first-order nested derivatives, but full higher-order forward mode needs a richer representation.

One possible representation is keyed by sets or multisets of perturbation tags:

```text
coeff[{}]          = primal
coeff[{tag1}]      = first derivative for tag1
coeff[{tag2}]      = first derivative for tag2
coeff[{tag1,tag2}] = mixed second derivative
```
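
A minimal sketch of this keyed representation, with all names assumed: coefficients live in a dict keyed by frozensets of tags, and multiplication drops any product that would repeat a tag, since $\epsilon^2 = 0$.

```python
def poly_add(p, q):
    # Addition merges coefficients key by key.
    out = dict(p)
    for key, c in q.items():
        out[key] = out.get(key, 0) + c
    return out

def poly_mul(p, q):
    out = {}
    for k1, c1 in p.items():
        for k2, c2 in q.items():
            if k1 & k2:                 # shared tag => eps^2 = 0, term vanishes
                continue
            key = k1 | k2
            out[key] = out.get(key, 0) + c1 * c2
    return out

# f(x, y) = x^2 * y at x = 3, y = 5, with x seeded by tag "t1" and y by "t2".
x = {frozenset(): 3.0, frozenset({"t1"}): 1.0}
y = {frozenset(): 5.0, frozenset({"t2"}): 1.0}
f = poly_mul(poly_mul(x, x), y)

print(f[frozenset()])               # 45.0 : primal value x^2 y
print(f[frozenset({"t1"})])         # 30.0 : df/dx = 2xy
print(f[frozenset({"t2"})])         #  9.0 : df/dy = x^2
print(f[frozenset({"t1", "t2"})])   #  6.0 : mixed second derivative = 2x
```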

For higher orders, the number of tag combinations grows quickly.

This is why Taylor mode and jet representations are often preferable for systematic higher-order forward AD.

## Reverse Mode Analogue

Perturbation confusion is usually discussed for forward mode, but nested reverse mode has a related issue.

Reverse mode uses cotangents instead of tangents. A nested reverse-mode system must keep cotangent levels separate.

An outer gradient computation may contain an inner gradient computation. The cotangents from the inner reverse pass should not be accumulated into the cotangent storage of the outer pass unless the semantics require it.

This requires explicit derivative levels:

```text
level 0: primal computation
level 1: first derivative
level 2: second derivative
```

If cotangent storage is global or poorly scoped, nested reverse mode can produce incorrect gradients, duplicate accumulation, or missing contributions.

## Dynamic Graphs

Perturbation confusion is especially dangerous in dynamic graph systems.

In a dynamic graph, the derivative trace is built while the program runs. Nested derivative calls may occur inside ordinary control flow.

Example:

```text
def f(x):
    if derivative(g)(x) > 0:
        return x * x
    return x
```

Here the derivative computation affects which branch is taken.

A correct system must know which AD level each value belongs to. Otherwise, tangents from the inner derivative may leak into the outer trace or change the meaning of branch conditions.

## Custom Derivative Rules

Custom gradients can also introduce perturbation confusion.

A custom rule may call derivative operations internally:

```text
backward(x, y_bar):
    return derivative(auxiliary)(x) * y_bar
```

If this backward rule is used under nested AD, it may allocate perturbations while another derivative transform is already active.

The custom rule must respect the active AD levels and allocate fresh tags. Otherwise, higher-order derivatives become unreliable.

Production systems should document whether custom derivative rules are higher-order safe.

## Common Symptoms

Perturbation confusion can be difficult to diagnose because results are often plausible.

Typical symptoms include:

| Symptom | Likely cause |
|---|---|
| first derivatives correct, second derivatives wrong | shared perturbation tag |
| nested gradients depend on call order | global AD state |
| custom gradient works alone, fails when nested | derivative level leak |
| Hessian asymmetric for smooth function | mixed derivative bookkeeping bug |
| derivative of constant expression is nonzero | tangent leakage |
| zero higher-order derivative when it should be nonzero | inner tangent consumed too early |

The bug often appears only in small nested examples, so test suites should include explicit nested AD cases.

## Testing for Perturbation Confusion

A minimal test is:

$$
f(x)=x^2.
$$

A nested derivative should give:

$$
\frac{d}{dx}\frac{d}{dx}x^2 = 2.
$$

Another useful test is:

$$
f(x)=x^3.
$$

Then:

$$
f'(x)=3x^2,
\quad
f''(x)=6x,
\quad
f^{(3)}(x)=6.
$$

A system should compute these correctly under repeated nesting.
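
As a concrete sketch, assuming the hypothetical tagged `derivative`, `mul`, and `add` from the Tags section above:

```python
square = lambda x: mul(x, x)
cube   = lambda x: mul(mul(x, x), x)

assert derivative(derivative(square))(3.0) == 2.0
assert derivative(derivative(cube))(2.0) == 12.0        # f''(x) = 6x at x = 2
assert derivative(derivative(derivative(cube)))(2.0) == 6.0
```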

For multivariate tests, use:

$$
f(x,y)=x^2y.
$$

The mixed derivative is:

$$
\frac{\partial^2 f}{\partial x \partial y}=2x.
$$

The result should be independent of whether the system computes the $x$ derivative first or the $y$ derivative first.
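
Continuing with the same hypothetical tagged sketch, both nesting orders should agree:

```python
f = lambda x, y: mul(mul(x, x), y)          # f(x, y) = x^2 * y

dx_then_dy = derivative(lambda y: derivative(lambda x: f(x, y))(3.0))(5.0)
dy_then_dx = derivative(lambda x: derivative(lambda y: f(x, y))(5.0))(3.0)

assert dx_then_dy == dy_then_dx == 6.0      # mixed derivative 2x at x = 3
```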

## Implementation Discipline

A robust nested AD implementation needs four properties.

| Property | Requirement |
|---|---|
| fresh tags | every AD transform gets a distinct identity |
| scoped extraction | derivative calls extract only their own tag |
| level-aware storage | tangents and cotangents belong to explicit levels |
| compositional rules | primitives preserve derivative structure under nesting |

These rules prevent accidental mixing of derivative information.

## Practical Design Principle

Perturbation confusion is a scoping error.

The perturbation introduced by a derivative transform should behave like a local variable. It must be fresh, scoped, and inaccessible to unrelated derivative computations.

When this discipline is followed, nested AD becomes compositional. When it is not, higher-order derivatives may be wrong even though all first-order tests pass.

