Mixed-Mode Differentiation

Mixed-mode differentiation combines forward accumulation and reverse accumulation in the same derivative computation. It is used when neither pure forward mode nor pure reverse mode gives the best cost model.

Forward mode computes Jacobian-vector products:

J_f(x) v

Reverse mode computes vector-Jacobian products:

u^T J_f(x)

Mixed mode composes these two operations to compute more structured derivatives efficiently.
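
As a concrete illustration of the two primitives (the section itself is library-agnostic; JAX is assumed here only as one AD system that exposes both modes):

```python
import jax
import jax.numpy as jnp

def f(x):
    # f: R^3 -> R^2
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([0.1, 0.0, 0.0])   # tangent direction in input space
u = jnp.array([1.0, -1.0])       # cotangent direction in output space

# Forward mode: Jacobian-vector product J_f(x) v
_, jvp_out = jax.jvp(f, (x,), (v,))

# Reverse mode: vector-Jacobian product u^T J_f(x)
_, vjp_fn = jax.vjp(f, x)
(vjp_out,) = vjp_fn(u)
```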

Why Mix Modes?

For a function:

f: \mathbb{R}^n \to \mathbb{R}^m

forward mode is efficient when n is small.

Reverse mode is efficient when m is small.

But many computations require objects more complex than one gradient:

Target | Common Efficient Method
--- | ---
directional derivative | forward mode
gradient of a scalar function | reverse mode
Jacobian-vector product | forward mode
vector-Jacobian product | reverse mode
Hessian-vector product | forward-over-reverse
Jacobian of gradients | reverse-over-forward or forward-over-reverse
full Hessian | mixed repeated passes

Mixed mode appears when the derivative object has nested structure.

Derivatives as Linear Maps

The derivative of a function at a point is a linear map.

If:

f: \mathbb{R}^n \to \mathbb{R}^m

then:

Df(x): \mathbb{R}^n \to \mathbb{R}^m

Forward mode applies this linear map to a tangent vector.

Reverse mode applies its transpose to a cotangent vector.

Mixed mode differentiates programs that already compute derivatives.

That gives nested maps such as:

D(Df)

or:

D(\nabla f)

These are the basis of Hessian-vector products and higher-order sensitivities.

Forward Over Reverse

Forward-over-reverse means:

  1. build a reverse-mode computation for a gradient
  2. differentiate that gradient computation using forward mode

For a scalar function:

f: \mathbb{R}^n \to \mathbb{R}

reverse mode computes:

\nabla f(x)

Applying forward mode to this gradient in direction v gives:

D(\nabla f)(x) v

which is the Hessian-vector product:

H_f(x) v

This avoids forming the full Hessian matrix.
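
A minimal sketch of forward-over-reverse, assuming JAX as the example system (the nesting pattern is the same in any framework with composable derivative transformations):

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 3) + jnp.dot(x, x)

def hvp_forward_over_reverse(f, x, v):
    # Differentiate the reverse-mode gradient along direction v with forward mode.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([1.0, 2.0])
v = jnp.array([0.5, -1.0])
print(hvp_forward_over_reverse(f, x, v))  # H_f(x) v, Hessian never formed
```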

Hessian-Vector Product

The Hessian is:

H_f(x) = \nabla^2 f(x)

For large n, explicitly constructing H_f(x) \in \mathbb{R}^{n \times n} is usually infeasible.

Mixed mode computes:

H_f(x) v

directly.

This is critical in:

  • Newton-CG methods
  • second-order optimization
  • implicit differentiation
  • sensitivity analysis
  • uncertainty estimation
  • meta-learning

The cost of one Hessian-vector product is usually a small constant multiple of the cost of one gradient evaluation.
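
For illustration, a sketch of the Newton-CG pattern: the conjugate-gradient solver only needs a function that applies H_f(x) to a vector, so the Hessian itself is never materialized. The test function and the use of jax.scipy.sparse.linalg.cg are assumptions made for this sketch.

```python
import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

def f(x):
    # A strictly convex test function, so the Newton system is well posed.
    return jnp.sum((x - 1.0) ** 2) + 0.1 * jnp.sum(x ** 4)

x = jnp.array([3.0, -2.0, 0.5])
g = jax.grad(f)(x)

# Matrix-free Hessian operator: v -> H_f(x) v via forward-over-reverse.
hvp = lambda v: jax.jvp(jax.grad(f), (x,), (v,))[1]

# Newton direction p solves H_f(x) p = -grad f(x); CG only ever calls hvp.
p, _ = cg(hvp, -g)
```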

Reverse Over Forward

Reverse-over-forward means:

  1. build a forward-mode computation for a directional derivative
  2. differentiate that scalar directional derivative using reverse mode

For scalar:

f: \mathbb{R}^n \to \mathbb{R}

forward mode computes:

g(x) = J_f(x) v

Since f is scalar, this is:

g(x) = \nabla f(x)^T v

Then reverse mode computes:

\nabla g(x)

which equals:

H_f(x) v

when f is sufficiently smooth, so that H_f is symmetric.

So both forward-over-reverse and reverse-over-forward can compute Hessian-vector products, but their practical costs differ depending on the implementation and shape of the program.
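
A matching sketch of reverse-over-forward, again assuming JAX as the example system:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 3) + jnp.dot(x, x)

def hvp_reverse_over_forward(f, x, v):
    # g(x) = grad f(x) . v via forward mode, then reverse mode on the scalar g.
    g = lambda x: jax.jvp(f, (x,), (v,))[1]
    return jax.grad(g)(x)

x = jnp.array([1.0, 2.0])
v = jnp.array([0.5, -1.0])
print(hvp_reverse_over_forward(f, x, v))  # matches the forward-over-reverse result
```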

Jacobian of a Vector Function

For:

f: \mathbb{R}^n \to \mathbb{R}^m

the full Jacobian can be computed by repeated passes.

Forward mode gives columns:

J_f(x) e_i

Reverse mode gives rows:

e_j^T J_f(x)
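
For example, in JAX-style notation (assumed here only as an example system), the columns and rows can be assembled by mapping a JVP or VJP over basis vectors:

```python
import jax
import jax.numpy as jnp

def f(x):
    # f: R^3 -> R^2
    return jnp.array([x[0] * x[1], x[1] + jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
n, m = 3, 2

# Columns via forward mode: J_f(x) e_i for each input basis vector e_i.
cols = jax.vmap(lambda e: jax.jvp(f, (x,), (e,))[1])(jnp.eye(n))
J_from_columns = cols.T                     # shape (m, n)

# Rows via reverse mode: e_j^T J_f(x) for each output basis vector e_j.
_, vjp_fn = jax.vjp(f, x)
J_from_rows = jax.vmap(lambda e: vjp_fn(e)[0])(jnp.eye(m))  # shape (m, n)
```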

Mixed strategies are useful when the Jacobian has block structure.

For example, if f decomposes into independent groups:

f(x) = \begin{bmatrix} f_1(x_1) \\ f_2(x_2) \\ \vdots \\ f_k(x_k) \end{bmatrix}

then a system can use forward mode within blocks and reverse mode across scalar losses built from those blocks.

Nested AD

Mixed mode is often implemented by nesting AD transformations.

Example notation:

jvp(grad(f), x, v)

means:

  • compute the gradient function with reverse mode
  • apply a forward-mode JVP to that gradient

Another form:

grad(lambda x: jvp(f, x, v))

means:

  • compute a JVP with forward mode
  • differentiate the resulting scalar with reverse mode

These nested transformations must carefully separate derivative levels.

A common implementation issue is perturbation confusion, where tangent information from different derivative levels is accidentally mixed.

Perturbation Tags

Forward mode often uses perturbation tags to separate derivative contexts.

Each tangent belongs to a specific derivative level.

Conceptually:

x + \dot{x}\,\varepsilon_1

and:

x + \dot{x}\,\varepsilon_2

must remain distinct if they arise from different nested AD calls.

The system must enforce:

\varepsilon_1 \neq \varepsilon_2

and usually:

\varepsilon_1 \varepsilon_2

represents a higher-order mixed term, not an accidental collision.

This matters for correct higher-order differentiation.
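
A classic test case: d/dx [ x · d/dy (x + y) ] is identically 1, because the inner derivative is 1 regardless of x; an implementation that confuses the inner and outer perturbations reports 2. A sketch in JAX, assumed here as an example of a system that tags derivative levels correctly:

```python
import jax

# d/dx [ x * d/dy (x + y) ] = d/dx [ x * 1 ] = 1 for all x.
# An implementation that mixes the two perturbation levels returns 2.
inner = lambda x: x * jax.grad(lambda y: x + y)(1.0)
print(jax.grad(inner)(1.0))  # 1.0 when derivative levels are kept separate
```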

Mode Choice by Shape

A useful rule is based on input and output dimensions.

For:

f: \mathbb{R}^n \to \mathbb{R}^m

Shape | Preferred Mode
--- | ---
n small, m large | forward
n large, m small | reverse
both small | either
both large | exploit structure
scalar loss, many parameters | reverse
Hessian-vector product | mixed
sparse Jacobian | colored forward or structured mixed

Mixed mode becomes necessary when the object being computed has shape beyond one row or one column of the Jacobian.
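
A tiny, hypothetical helper (not a standard API) that applies the first two rows of the table, sketched in JAX:

```python
import jax
import jax.numpy as jnp

def jacobian_by_shape(f, x):
    n = x.size
    m = f(x).size
    # Few inputs: forward mode builds the Jacobian column by column.
    # Few outputs: reverse mode builds it row by row.
    return jax.jacfwd(f)(x) if n <= m else jax.jacrev(f)(x)

f = lambda x: jnp.array([jnp.sum(x ** 2), jnp.prod(x)])
print(jacobian_by_shape(f, jnp.array([1.0, 2.0, 3.0])))  # n=3 > m=2, so reverse mode
```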

Example: Hessian-Vector Product

Let:

f(x, y) = x^2 y + \sin(y)

Gradient:

\nabla f(x, y) = \begin{bmatrix} 2xy \\ x^2 + \cos(y) \end{bmatrix}

Hessian:

H_f(x, y) = \begin{bmatrix} 2y & 2x \\ 2x & -\sin(y) \end{bmatrix}

For direction:

v = \begin{bmatrix} v_x \\ v_y \end{bmatrix}

the Hessian-vector product is:

H_f(x, y) v = \begin{bmatrix} 2y v_x + 2x v_y \\ 2x v_x - \sin(y) v_y \end{bmatrix}

Mixed-mode AD computes this product directly, without constructing the Hessian explicitly.
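
A short numerical check of this worked example, assuming JAX and arbitrary test values for (x, y) and (v_x, v_y):

```python
import jax
import jax.numpy as jnp

def f(z):
    x, y = z
    return x ** 2 * y + jnp.sin(y)

z = jnp.array([1.5, 0.7])    # arbitrary test point (x, y)
v = jnp.array([0.3, -0.2])   # direction (v_x, v_y)

# Forward-over-reverse Hessian-vector product.
hvp = jax.jvp(jax.grad(f), (z,), (v,))[1]

# Closed form from the text: [2y v_x + 2x v_y, 2x v_x - sin(y) v_y].
x, y = z
expected = jnp.array([2 * y * v[0] + 2 * x * v[1],
                      2 * x * v[0] - jnp.sin(y) * v[1]])
print(hvp, expected)  # the two agree to floating-point tolerance
```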

Implementation Pattern

A practical mixed-mode system exposes small derivative combinators:

grad(f)
jvp(f, x, v)
vjp(f, x, u)
jacobian(f)
hvp(f, x, v)

Then higher derivatives are assembled by composition:

hvp(f, x, v) = jvp(grad(f), x, v)

or:

hvp(f, x, v) = grad(lambda x: jvp(f, x, v), x)

This design keeps the API small while allowing complex derivative objects.
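
One possible realization of this combinator set on top of JAX primitives (a sketch, not a canonical API; the function names mirror the pseudocode above):

```python
import jax

grad = jax.grad
jacobian = jax.jacobian

def jvp(f, x, v):
    # Tangent component of forward-mode propagation of f at x along v.
    return jax.jvp(f, (x,), (v,))[1]

def vjp(f, x, u):
    # Cotangent u pulled back through f at x.
    return jax.vjp(f, x)[1](u)[0]

def hvp(f, x, v):
    # Forward-over-reverse; the reverse-over-forward alternative is
    # grad(lambda x: jvp(f, x, v))(x).
    return jvp(jax.grad(f), x, v)
```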

Cost and Memory Tradeoffs

Mixed mode inherits costs from both modes.

Forward-over-reverse:

  • requires reverse-mode tape or graph
  • propagates tangents through the reverse computation
  • usually efficient for Hessian-vector products of scalar functions

Reverse-over-forward:

  • records a forward-mode derivative computation
  • applies reverse accumulation to that computation
  • can be useful when the directional derivative is scalar

The best implementation depends on:

  • program structure
  • input dimension
  • output dimension
  • memory availability
  • compiler support
  • operator coverage

Practical Uses

Mixed-mode differentiation is used when first-order gradients are insufficient.

Common cases include:

  • second-order optimization
  • natural gradient approximations
  • curvature-vector products
  • implicit differentiation through fixed points
  • differentiating through optimization algorithms
  • meta-learning inner loops
  • sensitivity of gradients to hyperparameters
  • scientific computing with parameter sensitivities

These workloads need derivative operations as first-class program transformations.

Summary

Mixed-mode differentiation combines forward and reverse accumulation.

Forward mode supplies Jacobian-vector products.

Reverse mode supplies vector-Jacobian products.

Their composition gives efficient methods for Hessian-vector products, structured Jacobians, and higher-order derivatives.

The main design requirement is clean nesting: each derivative level must keep its tangents, adjoints, tapes, and perturbation tags separate.