# Mixed-Mode Differentiation

Mixed-mode differentiation combines forward accumulation and reverse accumulation in the same derivative computation. It is used when neither pure forward mode nor pure reverse mode alone yields the lowest cost.

Forward mode computes Jacobian-vector products:

$$
J_f(x)v
$$

Reverse mode computes vector-Jacobian products:

$$
u^T J_f(x)
$$

Mixed mode composes these two operations to compute more structured derivatives efficiently.
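The two building blocks can be sketched with JAX's `jvp` and `vjp` combinators. The function `f` and the vectors `x`, `v`, `u` below are illustrative choices, not from the text.

```python
import jax
import jax.numpy as jnp

def f(x):
    # illustrative map R^2 -> R^2
    return jnp.stack([x[0] * x[1], jnp.sin(x[0])])

x = jnp.array([1.0, 2.0])
v = jnp.array([1.0, 0.0])   # tangent vector
u = jnp.array([0.0, 1.0])   # cotangent vector

# Forward mode: Jacobian-vector product J_f(x) v
y, jv = jax.jvp(f, (x,), (v,))

# Reverse mode: vector-Jacobian product u^T J_f(x)
y2, vjp_fn = jax.vjp(f, x)
(uJ,) = vjp_fn(u)
```

Here `jv` is one column-space probe of the Jacobian and `uJ` is one row-space probe; mixed mode builds richer derivative objects out of exactly these two calls.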

## Why Mix Modes?

For a function:

$$
f:\mathbb{R}^n \to \mathbb{R}^m
$$

forward mode is efficient when $n$ is small.

Reverse mode is efficient when $m$ is small.

But many computations require objects more complex than one gradient:

| Target | Common Efficient Method |
|---|---|
| directional derivative | forward mode |
| gradient of scalar function | reverse mode |
| Jacobian-vector product | forward mode |
| vector-Jacobian product | reverse mode |
| Hessian-vector product | forward-over-reverse |
| Jacobian of gradients | reverse-over-forward or forward-over-reverse |
| full Hessian | mixed repeated passes |

Mixed mode appears when the derivative object has nested structure.

## Derivatives as Linear Maps

The derivative of a function at a point is a linear map.

If:

$$
f:\mathbb{R}^n \to \mathbb{R}^m
$$

then:

$$
Df(x):\mathbb{R}^n \to \mathbb{R}^m
$$

Forward mode applies this linear map to a tangent vector.

Reverse mode applies its transpose to a cotangent vector.

Mixed mode differentiates programs that already compute derivatives.

That gives nested maps such as:

$$
D(Df)
$$

or:

$$
D(\nabla f)
$$

These are the basis of Hessian-vector products and higher-order sensitivities.

## Forward Over Reverse

Forward-over-reverse means:

1. build a reverse-mode computation for a gradient
2. differentiate that gradient computation using forward mode

For a scalar function:

$$
f:\mathbb{R}^n \to \mathbb{R}
$$

reverse mode computes:

$$
\nabla f(x)
$$

Applying forward mode to this gradient in direction $v$ gives:

$$
D(\nabla f)(x)v
$$

which is the Hessian-vector product:

$$
H_f(x)v
$$

This avoids forming the full Hessian matrix.
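A minimal forward-over-reverse sketch in JAX; the scalar function `f` and the point and direction are illustrative assumptions:

```python
import jax
import jax.numpy as jnp

def f(x):
    # illustrative scalar function; Hessian is diag(6x)
    return jnp.sum(x ** 3)

def hvp(f, x, v):
    # Forward-over-reverse: push tangent v through the gradient function.
    # jax.jvp returns (grad(f)(x), D(grad f)(x) v); keep the tangent.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([1.0, 2.0])
v = jnp.array([1.0, 1.0])
result = hvp(f, x, v)   # H v = 6x * v = [6., 12.]
```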

## Hessian-Vector Product

The Hessian is:

$$
H_f(x) =
\nabla^2 f(x)
$$

For large $n$, explicitly constructing $H_f(x)\in\mathbb{R}^{n\times n}$ is usually infeasible.

Mixed mode computes:

$$
H_f(x)v
$$

directly.

This is critical in:
- Newton-CG methods
- second-order optimization
- implicit differentiation
- sensitivity analysis
- uncertainty estimation
- meta-learning

The cost is usually comparable to a small constant number of gradient evaluations.

## Reverse Over Forward

Reverse-over-forward means:

1. build a forward-mode computation for a directional derivative
2. differentiate that scalar directional derivative using reverse mode

For scalar:

$$
f:\mathbb{R}^n \to \mathbb{R}
$$

forward mode computes:

$$
g(x)=J_f(x)v
$$

Since $f$ is scalar, this is:

$$
g(x)=\nabla f(x)^T v
$$

Then reverse mode computes:

$$
\nabla g(x)
$$

which is:

$$
\nabla g(x) = H_f(x)^T v
$$

and this equals $H_f(x)v$ whenever $f$ is twice continuously differentiable, since the Hessian is then symmetric.

So both forward-over-reverse and reverse-over-forward can compute Hessian-vector products, but their practical costs differ depending on the implementation and shape of the program.
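The reverse-over-forward order can be sketched the same way; the function and inputs are again illustrative:

```python
import jax
import jax.numpy as jnp

def f(x):
    # illustrative scalar function; Hessian is diag(6x)
    return jnp.sum(x ** 3)

v = jnp.array([1.0, 1.0])

def g(x):
    # Forward mode first: the scalar directional derivative g(x) = grad f(x) . v
    return jax.jvp(f, (x,), (v,))[1]

# Reverse mode over the scalar g gives H_f(x) v
x = jnp.array([1.0, 2.0])
hv = jax.grad(g)(x)   # [6., 12.], matching forward-over-reverse
```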

## Jacobian of a Vector Function

For:

$$
f:\mathbb{R}^n \to \mathbb{R}^m
$$

the full Jacobian can be computed by repeated passes.

Forward mode gives columns:

$$
J_f(x)e_i
$$

Reverse mode gives rows:

$$
e_j^T J_f(x)
$$
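Assembling the full Jacobian from these probes looks like the following sketch; `f` and `x` are illustrative:

```python
import jax
import jax.numpy as jnp

def f(x):
    # illustrative map R^2 -> R^2
    return jnp.stack([x[0] * x[1], x[0] + x[1] ** 2])

x = jnp.array([1.0, 2.0])
n = x.shape[0]

# Columns via forward mode: one JVP per basis vector e_i
cols = [jax.jvp(f, (x,), (jnp.eye(n)[i],))[1] for i in range(n)]
J_fwd = jnp.stack(cols, axis=1)

# Rows via reverse mode: one VJP per basis covector e_j
y, vjp_fn = jax.vjp(f, x)
rows = [vjp_fn(jnp.eye(y.shape[0])[j])[0] for j in range(y.shape[0])]
J_rev = jnp.stack(rows, axis=0)
```

The two assemblies agree; the choice between them is $n$ passes versus $m$ passes, which is exactly where mixed strategies enter.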

Mixed strategies are useful when the Jacobian has block structure.

For example, if $f$ decomposes into independent groups:

$$
f(x)=
\begin{bmatrix}
f_1(x_1)\\
f_2(x_2)\\
\vdots\\
f_k(x_k)
\end{bmatrix}
$$

then a system can use forward mode within blocks and reverse mode across scalar losses built from those blocks.

## Nested AD

Mixed mode is often implemented by nesting AD transformations.

Example notation:

```text
jvp(grad(f), x, v)
```

means:
- compute the gradient function with reverse mode
- apply a forward-mode JVP to that gradient

Another form:

```text
grad(lambda x: jvp(f, x, v))
```

means:
- compute a JVP with forward mode
- differentiate the resulting scalar with reverse mode

These nested transformations must carefully separate derivative levels.

A common implementation issue is perturbation confusion, where tangent information from different derivative levels is accidentally mixed.

## Perturbation Tags

Forward mode often uses perturbation tags to separate derivative contexts.

Each tangent belongs to a specific derivative level.

Conceptually:

$$
x + \dot x \varepsilon_1
$$

and:

$$
x + \dot x \varepsilon_2
$$

must remain distinct if they arise from different nested AD calls.

The system must enforce:

$$
\varepsilon_1 \neq \varepsilon_2
$$

so that the product:

$$
\varepsilon_1\varepsilon_2
$$

represents a genuine higher-order mixed term, not an accidental collision.

This matters for correct higher-order differentiation.
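A classic test for perturbation confusion nests one derivative inside another that closes over the outer variable. The concrete function here is an illustrative assumption; a correct tagging scheme (as in JAX) keeps the levels apart:

```python
import jax

# h(x) = x * d/dy (x * y) = x * x, so h'(x) = 2x.
# The inner grad is with respect to y only; it must not capture
# the outer derivative level's tangent on x.
def h(x):
    inner = jax.grad(lambda y: x * y)(2.0)  # equals x, for any y
    return x * inner

result = jax.grad(h)(3.0)   # correct answer: 2 * 3 = 6
```

A system with perturbation confusion conflates the two levels and returns the wrong value here, which is why distinct tags are enforced.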

## Mode Choice by Shape

A useful rule is based on input and output dimensions.

For:

$$
f:\mathbb{R}^n\to\mathbb{R}^m
$$

| Shape | Preferred Mode |
|---|---|
| $n$ small, $m$ large | forward |
| $n$ large, $m$ small | reverse |
| both small | either |
| both large | exploit structure |
| scalar loss, many parameters | reverse |
| Hessian-vector product | mixed |
| sparse Jacobian | colored forward or structured mixed |

Mixed mode becomes necessary when the object being computed has shape beyond one row or one column of the Jacobian.

## Example: Hessian-Vector Product

Let:

$$
f(x,y)=x^2y+\sin(y)
$$

Gradient:

$$
\nabla f(x,y) =
\begin{bmatrix}
2xy\\
x^2+\cos(y)
\end{bmatrix}
$$

Hessian:

$$
H_f(x,y) =
\begin{bmatrix}
2y & 2x\\
2x & -\sin(y)
\end{bmatrix}
$$

For direction:

$$
v=
\begin{bmatrix}
v_x\\
v_y
\end{bmatrix}
$$

the Hessian-vector product is:

$$
H_f(x,y)v =
\begin{bmatrix}
2yv_x+2xv_y\\
2xv_x-\sin(y)v_y
\end{bmatrix}
$$

Mixed-mode AD computes this product directly, without constructing the Hessian explicitly.
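The worked example above can be checked numerically with a forward-over-reverse HVP; the evaluation point and direction are illustrative choices:

```python
import jax
import jax.numpy as jnp

def f(z):
    x, y = z
    return x ** 2 * y + jnp.sin(y)

def hvp(f, z, v):
    # forward-over-reverse Hessian-vector product
    return jax.jvp(jax.grad(f), (z,), (v,))[1]

z = jnp.array([1.0, 2.0])    # (x, y), illustrative point
v = jnp.array([0.5, -1.0])   # illustrative direction
got = hvp(f, z, v)

# Analytic H v = [2y v_x + 2x v_y, 2x v_x - sin(y) v_y]
x, y = z
expected = jnp.array([2*y*v[0] + 2*x*v[1],
                      2*x*v[0] - jnp.sin(y)*v[1]])
```

The AD result matches the hand-derived Hessian-vector product without the Hessian ever being materialized.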

## Implementation Pattern

A practical mixed-mode system exposes small derivative combinators:

```text
grad(f)
jvp(f, x, v)
vjp(f, x, u)
jacobian(f)
hvp(f, x, v)
```

Then higher derivatives are assembled by composition:

```text
hvp(f, x, v) = jvp(grad(f), x, v)
```

or:

```text
hvp(f, x, v) = grad(lambda x: jvp(f, x, v), x)
```

This design keeps the API small while allowing complex derivative objects.
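Both compositions of `hvp` above translate directly into JAX; the loss function and inputs below are illustrative:

```python
import jax
import jax.numpy as jnp

def f(x):
    # illustrative scalar loss; Hessian is diag(2cos(x) - x sin(x))
    return jnp.sum(jnp.sin(x) * x)

def hvp_fr(f, x, v):
    # forward-over-reverse: hvp(f, x, v) = jvp(grad(f), x, v)
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

def hvp_rf(f, x, v):
    # reverse-over-forward: hvp(f, x, v) = grad(lambda x: jvp(f, x, v), x)
    return jax.grad(lambda x: jax.jvp(f, (x,), (v,))[1])(x)

x = jnp.array([0.3, 1.2, -0.7])
v = jnp.array([1.0, -2.0, 0.5])
a = hvp_fr(f, x, v)
b = hvp_rf(f, x, v)
analytic = (2 * jnp.cos(x) - x * jnp.sin(x)) * v
```

The two combinators agree on the result; which one is cheaper depends on the tradeoffs discussed in the next section.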

## Cost and Memory Tradeoffs

Mixed mode inherits costs from both modes.

Forward-over-reverse:
- requires reverse-mode tape or graph
- propagates tangents through the reverse computation
- usually efficient for Hessian-vector products of scalar functions

Reverse-over-forward:
- records a forward-mode derivative computation
- applies reverse accumulation to that computation
- can be useful when the directional derivative is scalar

The best implementation depends on:
- program structure
- input dimension
- output dimension
- memory availability
- compiler support
- operator coverage

## Practical Uses

Mixed-mode differentiation is used when first-order gradients are insufficient.

Common cases include:
- second-order optimization
- natural gradient approximations
- curvature-vector products
- implicit differentiation through fixed points
- differentiating through optimization algorithms
- meta-learning inner loops
- sensitivity of gradients to hyperparameters
- scientific computing with parameter sensitivities

These workloads need derivative operations as first-class program transformations.

## Summary

Mixed-mode differentiation combines forward and reverse accumulation.

Forward mode supplies Jacobian-vector products.

Reverse mode supplies vector-Jacobian products.

Their composition gives efficient methods for Hessian-vector products, structured Jacobians, and higher-order derivatives.

The main design requirement is clean nesting: each derivative level must keep its tangents, adjoints, tapes, and perturbation tags separate.

