Mixed-Mode Differentiation

Mixed-mode differentiation combines forward accumulation and reverse accumulation in the same derivative computation. It is used when neither pure forward mode nor pure reverse mode gives the best cost model.

Forward mode computes Jacobian-vector products:

J_f(x) v

Reverse mode computes vector-Jacobian products:

u^T J_f(x)

Mixed mode composes these two operations to compute more structured derivatives efficiently.
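
As a concrete illustration of the two primitives (the section itself is library-agnostic; JAX is assumed here only as one AD system that exposes both modes):

```python
import jax
import jax.numpy as jnp

def f(x):
    # f: R^3 -> R^2
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([0.1, 0.0, 0.0])   # tangent direction in input space
u = jnp.array([1.0, -1.0])       # cotangent direction in output space

# Forward mode: Jacobian-vector product J_f(x) v
_, jvp_out = jax.jvp(f, (x,), (v,))

# Reverse mode: vector-Jacobian product u^T J_f(x)
_, vjp_fn = jax.vjp(f, x)
(vjp_out,) = vjp_fn(u)
```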

Why Mix Modes?

For a function:

f: \mathbb{R}^n \to \mathbb{R}^m

forward mode is efficient when n is small.

Reverse mode is efficient when m is small.

But many computations require objects more complex than one gradient:

Target | Common Efficient Method
--- | ---
directional derivative | forward mode
gradient of a scalar function | reverse mode
Jacobian-vector product | forward mode
vector-Jacobian product | reverse mode
Hessian-vector product | forward-over-reverse
Jacobian of gradients | reverse-over-forward or forward-over-reverse
full Hessian | mixed repeated passes

Mixed mode appears when the derivative object has nested structure.

Derivatives as Linear Maps

The derivative of a function at a point is a linear map.

If:

f: \mathbb{R}^n \to \mathbb{R}^m

then:

Df(x): \mathbb{R}^n \to \mathbb{R}^m

Forward mode applies this linear map to a tangent vector.

Reverse mode applies its transpose to a cotangent vector.

Mixed mode differentiates programs that already compute derivatives.

That gives nested maps such as:

D(Df)

or:

D(\nabla f)

These are the basis of Hessian-vector products and higher-order sensitivities.

Forward Over Reverse

Forward-over-reverse means:

  1. build a reverse-mode computation for a gradient
  2. differentiate that gradient computation using forward mode

For a scalar function:

f: \mathbb{R}^n \to \mathbb{R}

reverse mode computes:

\nabla f(x)

Applying forward mode to this gradient in direction v gives:

D(\nabla f)(x) v

which is the Hessian-vector product:

H_f(x) v

This avoids forming the full Hessian matrix.
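
A minimal sketch of forward-over-reverse, assuming JAX as the example system (the nesting pattern is the same in any framework with composable derivative transformations):

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 3) + jnp.dot(x, x)

def hvp_forward_over_reverse(f, x, v):
    # Differentiate the reverse-mode gradient along direction v with forward mode.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([1.0, 2.0])
v = jnp.array([0.5, -1.0])
print(hvp_forward_over_reverse(f, x, v))  # H_f(x) v, Hessian never formed
```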

Hessian-Vector Product

The Hessian is:

H_f(x) = \nabla^2 f(x)

For large n, explicitly constructing H_f(x) \in \mathbb{R}^{n \times n} is usually infeasible.

Mixed mode computes:

H_f(x) v

directly.

This is critical in:

  • Newton-CG methods
  • second-order optimization
  • implicit differentiation
  • sensitivity analysis
  • uncertainty estimation
  • meta-learning

The cost of one Hessian-vector product is usually a small constant multiple of the cost of one gradient evaluation.
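
For illustration, a sketch of the Newton-CG pattern: the conjugate-gradient solver only needs a function that applies H_f(x) to a vector, so the Hessian itself is never materialized. The test function and the use of jax.scipy.sparse.linalg.cg are assumptions made for this sketch.

```python
import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

def f(x):
    # A strictly convex test function, so the Newton system is well posed.
    return jnp.sum((x - 1.0) ** 2) + 0.1 * jnp.sum(x ** 4)

x = jnp.array([3.0, -2.0, 0.5])
g = jax.grad(f)(x)

# Matrix-free Hessian operator: v -> H_f(x) v via forward-over-reverse.
hvp = lambda v: jax.jvp(jax.grad(f), (x,), (v,))[1]

# Newton direction p solves H_f(x) p = -grad f(x); CG only ever calls hvp.
p, _ = cg(hvp, -g)
```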

Reverse Over Forward

Reverse-over-forward means:

  1. build a forward-mode computation for a directional derivative
  2. differentiate that scalar directional derivative using reverse mode

For scalar:

f: \mathbb{R}^n \to \mathbb{R}

forward mode computes:

g(x) = J_f(x) v

Since f is scalar, this is:

g(x) = \nabla f(x)^T v

Then reverse mode computes:

\nabla g(x)

which equals:

H_f(x) v

when f is sufficiently smooth, so that H_f is symmetric.

So both forward-over-reverse and reverse-over-forward can compute Hessian-vector products, but their practical costs differ depending on the implementation and shape of the program.
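
A matching sketch of reverse-over-forward, again assuming JAX as the example system:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 3) + jnp.dot(x, x)

def hvp_reverse_over_forward(f, x, v):
    # g(x) = grad f(x) . v via forward mode, then reverse mode on the scalar g.
    g = lambda x: jax.jvp(f, (x,), (v,))[1]
    return jax.grad(g)(x)

x = jnp.array([1.0, 2.0])
v = jnp.array([0.5, -1.0])
print(hvp_reverse_over_forward(f, x, v))  # matches the forward-over-reverse result
```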

Jacobian of a Vector Function

For:

f: \mathbb{R}^n \to \mathbb{R}^m

the full Jacobian can be computed by repeated passes.

Forward mode gives columns:

J_f(x) e_i

Reverse mode gives rows:

e_j^T J_f(x)
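
For example, in JAX-style notation (assumed here only as an example system), the columns and rows can be assembled by mapping a JVP or VJP over basis vectors:

```python
import jax
import jax.numpy as jnp

def f(x):
    # f: R^3 -> R^2
    return jnp.array([x[0] * x[1], x[1] + jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
n, m = 3, 2

# Columns via forward mode: J_f(x) e_i for each input basis vector e_i.
cols = jax.vmap(lambda e: jax.jvp(f, (x,), (e,))[1])(jnp.eye(n))
J_from_columns = cols.T                     # shape (m, n)

# Rows via reverse mode: e_j^T J_f(x) for each output basis vector e_j.
_, vjp_fn = jax.vjp(f, x)
J_from_rows = jax.vmap(lambda e: vjp_fn(e)[0])(jnp.eye(m))  # shape (m, n)
```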

Mixed strategies are useful when the Jacobian has block structure.

For example, if f decomposes into independent groups:

f(x) = \begin{bmatrix} f_1(x_1) \\ f_2(x_2) \\ \vdots \\ f_k(x_k) \end{bmatrix}

then a system can use forward mode within blocks and reverse mode across scalar losses built from those blocks.

Nested AD

Mixed mode is often implemented by nesting AD transformations.

Example notation:

jvp(grad(f), x, v)

means:

  • compute the gradient function with reverse mode
  • apply a forward-mode JVP to that gradient

Another form:

grad(lambda x: jvp(f, x, v))

means:

  • compute a JVP with forward mode
  • differentiate the resulting scalar with reverse mode

These nested transformations must carefully separate derivative levels.

A common implementation issue is perturbation confusion, where tangent information from different derivative levels is accidentally mixed.

Perturbation Tags

Forward mode often uses perturbation tags to separate derivative contexts.

Each tangent belongs to a specific derivative level.

Conceptually:

x + \dot{x}\,\varepsilon_1

and:

x + \dot{x}\,\varepsilon_2

must remain distinct if they arise from different nested AD calls.

The system must enforce:

\varepsilon_1 \neq \varepsilon_2

and usually:

\varepsilon_1 \varepsilon_2

represents a higher-order mixed term, not an accidental collision.

This matters for correct higher-order differentiation.
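
A classic test case: d/dx [ x · d/dy (x + y) ] is identically 1, because the inner derivative is 1 regardless of x; an implementation that confuses the inner and outer perturbations reports 2. A sketch in JAX, assumed here as an example of a system that tags derivative levels correctly:

```python
import jax

# d/dx [ x * d/dy (x + y) ] = d/dx [ x * 1 ] = 1 for all x.
# An implementation that mixes the two perturbation levels returns 2.
inner = lambda x: x * jax.grad(lambda y: x + y)(1.0)
print(jax.grad(inner)(1.0))  # 1.0 when derivative levels are kept separate
```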

Mode Choice by Shape

A useful rule is based on input and output dimensions.

For:

f: \mathbb{R}^n \to \mathbb{R}^m

Shape | Preferred Mode
--- | ---
n small, m large | forward
n large, m small | reverse
both small | either
both large | exploit structure
scalar loss, many parameters | reverse
Hessian-vector product | mixed
sparse Jacobian | colored forward or structured mixed

Mixed mode becomes necessary when the object being computed has shape beyond one row or one column of the Jacobian.
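
A tiny, hypothetical helper (not a standard API) that applies the first two rows of the table, sketched in JAX:

```python
import jax
import jax.numpy as jnp

def jacobian_by_shape(f, x):
    n = x.size
    m = f(x).size
    # Few inputs: forward mode builds the Jacobian column by column.
    # Few outputs: reverse mode builds it row by row.
    return jax.jacfwd(f)(x) if n <= m else jax.jacrev(f)(x)

f = lambda x: jnp.array([jnp.sum(x ** 2), jnp.prod(x)])
print(jacobian_by_shape(f, jnp.array([1.0, 2.0, 3.0])))  # n=3 > m=2, so reverse mode
```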

Example: Hessian-Vector Product

Let:

f(x, y) = x^2 y + \sin(y)

Gradient:

\nabla f(x, y) = \begin{bmatrix} 2xy \\ x^2 + \cos(y) \end{bmatrix}

Hessian:

H_f(x, y) = \begin{bmatrix} 2y & 2x \\ 2x & -\sin(y) \end{bmatrix}

For direction:

v = \begin{bmatrix} v_x \\ v_y \end{bmatrix}

the Hessian-vector product is:

H_f(x, y) v = \begin{bmatrix} 2y v_x + 2x v_y \\ 2x v_x - \sin(y) v_y \end{bmatrix}

Mixed-mode AD computes this product directly, without constructing the Hessian explicitly.
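
A short numerical check of this worked example, assuming JAX and arbitrary test values for (x, y) and (v_x, v_y):

```python
import jax
import jax.numpy as jnp

def f(z):
    x, y = z
    return x ** 2 * y + jnp.sin(y)

z = jnp.array([1.5, 0.7])    # arbitrary test point (x, y)
v = jnp.array([0.3, -0.2])   # direction (v_x, v_y)

# Forward-over-reverse Hessian-vector product.
hvp = jax.jvp(jax.grad(f), (z,), (v,))[1]

# Closed form from the text: [2y v_x + 2x v_y, 2x v_x - sin(y) v_y].
x, y = z
expected = jnp.array([2 * y * v[0] + 2 * x * v[1],
                      2 * x * v[0] - jnp.sin(y) * v[1]])
print(hvp, expected)  # the two agree to floating-point tolerance
```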

Implementation Pattern

A practical mixed-mode system exposes small derivative combinators:

grad(f)
jvp(f, x, v)
vjp(f, x, u)
jacobian(f)
hvp(f, x, v)

Then higher derivatives are assembled by composition:

hvp(f, x, v) = jvp(grad(f), x, v)

or:

hvp(f, x, v) = grad(lambda x: jvp(f, x, v), x)

This design keeps the API small while allowing complex derivative objects.
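
One possible realization of this combinator set on top of JAX primitives (a sketch, not a canonical API; the function names mirror the pseudocode above):

```python
import jax

grad = jax.grad
jacobian = jax.jacobian

def jvp(f, x, v):
    # Tangent component of forward-mode propagation of f at x along v.
    return jax.jvp(f, (x,), (v,))[1]

def vjp(f, x, u):
    # Cotangent u pulled back through f at x.
    return jax.vjp(f, x)[1](u)[0]

def hvp(f, x, v):
    # Forward-over-reverse; the reverse-over-forward alternative is
    # grad(lambda x: jvp(f, x, v))(x).
    return jvp(jax.grad(f), x, v)
```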

Cost and Memory Tradeoffs

Mixed mode inherits costs from both modes.

Forward-over-reverse:

  • requires reverse-mode tape or graph
  • propagates tangents through the reverse computation
  • usually efficient for Hessian-vector products of scalar functions

Reverse-over-forward:

  • records a forward-mode derivative computation
  • applies reverse accumulation to that computation
  • can be useful when the directional derivative is scalar

The best implementation depends on:

  • program structure
  • input dimension
  • output dimension
  • memory availability
  • compiler support
  • operator coverage

Practical Uses

Mixed-mode differentiation is used when first-order gradients are insufficient.

Common cases include:

  • second-order optimization
  • natural gradient approximations
  • curvature-vector products
  • implicit differentiation through fixed points
  • differentiating through optimization algorithms
  • meta-learning inner loops
  • sensitivity of gradients to hyperparameters
  • scientific computing with parameter sensitivities

These workloads need derivative operations as first-class program transformations.

Summary

Mixed-mode differentiation combines forward and reverse accumulation.

Forward mode supplies Jacobian-vector products.

Reverse mode supplies vector-Jacobian products.

Their composition gives efficient methods for Hessian-vector products, structured Jacobians, and higher-order derivatives.

The main design requirement is clean nesting: each derivative level must keep its tangents, adjoints, tapes, and perturbation tags separate.