
Complexity Analysis


Forward mode automatic differentiation has a simple cost model. It evaluates the original program and, at the same time, evaluates the tangent program. Each primitive operation computes its primal result and its derivative contribution.

For one tangent direction, the runtime is usually within a small constant factor of the original function evaluation.

If the primal computation costs

$$C_f,$$

then one forward-mode derivative pass usually costs

$$O(C_f).$$

The exact constant depends on the primitive operations, data layout, compiler optimizations, and tangent representation.

Scalar tangent cost

Consider a scalar operation:

$$z = xy.$$

The primal operation needs one multiplication. The tangent rule is

$$\dot{z} = \dot{x}y + x\dot{y}.$$

This adds two multiplications and one addition.

So multiplication changes from:

z = x * y

to:

z  = x * y
dz = dx * y + x * dy

The operation count increases, but the asymptotic structure is unchanged. The differentiated program follows the same execution path as the primal program.

For many scalar numerical programs, a forward-mode pass costs roughly two to five times the primal evaluation. This is a practical rule of thumb, not a theorem. Expensive primitives may have lower relative overhead because the derivative can reuse primal work.

For example, in

$$z = \exp(x),$$

the primal result

$$z = \exp(x)$$

can be reused in the tangent rule:

$$\dot{z} = z\dot{x}.$$

The derivative adds only a multiplication after computing the exponential.
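
To make the bookkeeping concrete, here is a minimal dual-number sketch in Python. The Dual class and the free function exp are illustrative names, not any particular library's API; the point is that multiplication adds two multiplications and one addition, while the exponential reuses its primal result.

import math

class Dual:
    """Primal value paired with one tangent component."""
    def __init__(self, val, tan=0.0):
        self.val = val   # primal value
        self.tan = tan   # directional derivative

    def __mul__(self, other):
        # Primal: one multiplication.
        # Tangent: two multiplications and one addition.
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

def exp(d):
    # The expensive primal exponential is computed once and reused
    # in the tangent rule, which adds only one multiplication.
    e = math.exp(d.val)
    return Dual(e, e * d.tan)

# Seed x with tangent 1 to differentiate with respect to x.
x = Dual(2.0, 1.0)
y = Dual(3.0, 0.0)
z = exp(x * y)
print(z.val, z.tan)   # exp(6) and 3 * exp(6) = d/dx exp(xy)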

Cost for one JVP

For

$$f : \mathbb{R}^n \to \mathbb{R}^m,$$

one forward-mode pass computes

$$J_f(x)v.$$

The cost is approximately one primal evaluation plus tangent propagation:

$$C_{\mathrm{jvp}} \approx \alpha C_f,$$

where $\alpha$ is a small implementation-dependent constant.

The memory cost is also close to the primal memory cost. Each live variable needs its primal value and its tangent value. If both are scalar floating point values, the live numerical storage is roughly doubled.

Forward mode does not need to retain the whole execution trace for a later pass. This makes its memory behavior much simpler than reverse mode.
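
As an illustration of this cost model, the following sketch uses JAX's jax.jvp, which returns the primal output and one Jacobian-vector product from a single forward pass; the function f, the point x, and the direction v are made up for the example.

import jax
import jax.numpy as jnp

def f(x):
    # Arbitrary R^n -> R^m example function.
    return jnp.stack([jnp.sum(jnp.sin(x)), jnp.prod(x), x[0] * x[-1]])

x = jnp.arange(1.0, 6.0)   # evaluation point, n = 5
v = jnp.ones_like(x)       # one tangent direction

# One forward pass returns f(x) and J_f(x) v together,
# at roughly the cost of a single evaluation of f.
y, jv = jax.jvp(f, (x,), (v,))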

Cost for full Jacobians

A single forward-mode pass computes one Jacobian-vector product. To build the full Jacobian, seed the inputs with basis vectors:

$$e_1, e_2, \ldots, e_n.$$

Each pass computes one column:

$$J_f(x)e_i.$$

Therefore, the cost of a full dense Jacobian by scalar forward mode is approximately

$$n C_{\mathrm{jvp}}.$$

So

$$C_{\mathrm{Jacobian}} \approx O(n C_f).$$

The full Jacobian itself has

$$mn$$

entries, so storing it also costs

$$O(mn).$$

This is acceptable for small $n$. It is often too expensive when the input dimension is large.
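
A sketch of the column-by-column construction, again with JAX and an illustrative function: each basis vector is pushed through one JVP, and jax.jacfwd packages the same idea.

import jax
import jax.numpy as jnp

def f(x):
    return jnp.stack([jnp.sum(jnp.sin(x)), jnp.prod(x)])   # R^3 -> R^2

x = jnp.arange(1.0, 4.0)
n = x.shape[0]

# One forward pass per basis vector e_i yields one Jacobian column.
cols = [jax.jvp(f, (x,), (jnp.eye(n)[i],))[1] for i in range(n)]
J = jnp.stack(cols, axis=1)           # shape (m, n)

# jax.jacfwd builds the Jacobian the same way.
assert jnp.allclose(J, jax.jacfwd(f)(x))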

Comparison with reverse mode

Forward and reverse mode have complementary complexity.

For

$$f : \mathbb{R}^n \to \mathbb{R}^m,$$

forward mode computes one column-oriented product:

$$Jv.$$

Reverse mode computes one row-oriented product:

$$w^\top J.$$

The cost profiles are:

| Goal | Forward mode cost | Reverse mode cost |
| --- | --- | --- |
| One JVP $Jv$ | $O(C_f)$ | inefficient or requires extra transforms |
| One VJP $w^\top J$ | inefficient or requires extra transforms | $O(C_f)$ |
| Full Jacobian | $O(nC_f)$ | $O(mC_f)$ |
| Gradient of $f:\mathbb{R}^n \to \mathbb{R}$ | $O(nC_f)$ | $O(C_f)$ |
| Derivative of $f:\mathbb{R} \to \mathbb{R}^m$ | $O(C_f)$ | $O(mC_f)$ |

This is the central rule:

Forward mode is best when the number of input directions is small. Reverse mode is best when the number of output directions is small.

For scalar-output optimization, reverse mode usually wins. For low-dimensional parameter studies, sensitivity analysis, and functions with many outputs but few inputs, forward mode is often better.
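
The rule translates directly into tool choice. A small JAX sketch, with made-up functions, of the two favorable regimes:

import jax
import jax.numpy as jnp

# Few inputs, many outputs: forward mode needs only n = 2 passes.
def few_in_many_out(p):
    t = jnp.linspace(0.0, 1.0, 1000)
    return p[0] * jnp.sin(t) + p[1] * jnp.cos(t)    # R^2 -> R^1000

J = jax.jacfwd(few_in_many_out)(jnp.array([1.0, 2.0]))   # shape (1000, 2)

# Many inputs, one scalar output: reverse mode needs one backward pass.
def loss(w):
    return jnp.sum(jnp.tanh(w) ** 2)                      # R^n -> R

g = jax.grad(loss)(jnp.ones(10_000))                      # shape (10000,)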

Multiple tangent directions

Forward mode can propagate $k$ tangent directions in one run. Instead of storing

$$\dot{x} \in \mathbb{R},$$

each variable stores

$$\dot{x} \in \mathbb{R}^k.$$

The output gives

$$J_f(x)V,$$

where

$$V \in \mathbb{R}^{n \times k}.$$

The runtime is roughly proportional to $k$:

$$C_{\mathrm{jvp}, k} \approx O(k C_f).$$

However, vectorized hardware can reduce the constant factor. A tangent vector of length $k$ may map well to SIMD lanes, GPU tensor operations, or batched kernels.

The memory cost also scales with $k$, since each live variable carries $k$ tangent components:

$$M_{\mathrm{forward}, k} \approx O(k M_f),$$

where $M_f$ is the live memory footprint of the primal computation.
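
One way to realize this in practice, sketched here with JAX, is to batch the tangent argument of a JVP; the function and the matrix V below are illustrative.

import jax
import jax.numpy as jnp

def f(x):
    return jnp.stack([jnp.sum(x ** 2), jnp.sum(jnp.sin(x))])   # R^5 -> R^2

x = jnp.arange(1.0, 6.0)
V = jnp.eye(5)[:, :3]        # k = 3 tangent directions, shape (n, k)

# Push all k directions through one batched forward pass:
# the result is J_f(x) V, column by column.
push = jax.vmap(lambda v: jax.jvp(f, (x,), (v,))[1], in_axes=1, out_axes=1)
JV = push(V)                 # shape (m, k)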

Dense versus sparse tangent storage

If tangent vectors are dense, the cost grows linearly with the number of propagated directions.

If tangent vectors are sparse, many tangent components are zero. A sparse forward-mode implementation can exploit this by storing only nonzero tangent entries.

For example, if each variable depends on only a small subset of inputs, then its tangent vector may stay sparse. This occurs in:

| Problem type | Structure |
| --- | --- |
| PDE discretizations | local stencil dependencies |
| circuit simulation | sparse component coupling |
| graph algorithms | local neighborhood dependencies |
| robotics kinematics | chain or tree sparsity |
| structured optimization | block-sparse variables |

Sparse tangent propagation can reduce both runtime and memory, but it introduces overhead for sparse data structures. It works best when the derivative sparsity is strong and stable.
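
A minimal sketch of sparse tangent propagation in Python: each variable stores its tangent as a dictionary of nonzero entries, so a variable that touches only a few inputs keeps a small tangent. The SparseDual class is illustrative, not a library type.

class SparseDual:
    """Primal value plus a sparse tangent: {input index: derivative}."""
    def __init__(self, val, tan=None):
        self.val = val
        self.tan = tan or {}

    def __add__(self, other):
        tan = dict(self.tan)
        for i, d in other.tan.items():
            tan[i] = tan.get(i, 0.0) + d
        return SparseDual(self.val + other.val, tan)

    def __mul__(self, other):
        tan = {i: d * other.val for i, d in self.tan.items()}
        for i, d in other.tan.items():
            tan[i] = tan.get(i, 0.0) + self.val * d
        return SparseDual(self.val * other.val, tan)

# Each input carries a single nonzero tangent entry.
x = [SparseDual(float(i + 1), {i: 1.0}) for i in range(4)]

# A stencil-like computation: each output touches only two inputs,
# so its tangent stays sparse regardless of the total input count.
y = [x[i] * x[i + 1] + x[i] for i in range(3)]
print([sorted(yi.tan) for yi in y])   # [[0, 1], [1, 2], [2, 3]]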

Memory complexity

Forward mode has streaming memory behavior. Once a primitive has produced its output tangent, it does not need to keep local derivative information unless the primal program itself needs it.

For straight-line scalar code, if the primal evaluation uses live memory

$$M_f,$$

then scalar forward mode uses roughly

$$O(M_f)$$

memory, with a constant factor for tangents.

For $k$ tangent directions:

$$O(k M_f).$$

This contrasts with reverse mode, which usually needs to store intermediate primal values for the backward pass. Reverse mode may require memory proportional to the length of the computation unless checkpointing or recomputation is used.

Forward mode therefore behaves well for long-running simulations, streaming computations, and programs with many loop iterations.

Cost through loops

Loops do not change the basic model. If the primal loop executes $T$ iterations and each iteration costs $c$, then the primal cost is

$$C_f = O(Tc).$$

Forward mode adds tangent work to each iteration:

$$C_{\mathrm{jvp}} = O(Tc).$$

The constant is larger, but the loop count is the same.

For a recurrence

$$x_{t+1} = g(x_t, \theta),$$

forward mode propagates

$$\dot{x}_{t+1} = \frac{\partial g}{\partial x_t}\dot{x}_t + \frac{\partial g}{\partial \theta}\dot{\theta}.$$

This is efficient when the number of tangent directions for $\theta$ is small. It becomes expensive when $\theta$ has many dimensions and all sensitivities are required.
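
A sketch of forward sensitivity through a recurrence, with a made-up step function g: the tangent update runs in lockstep with the primal loop, carries one number per tangent direction, and stores no trace.

import numpy as np

def g(x, theta):
    # Example step; dg/dx = cos(x), dg/dtheta = 1.
    return np.sin(x) + theta

def forward_sensitivity(x0, theta, T):
    x, x_dot = x0, 0.0        # seed: d x0 / d theta = 0
    theta_dot = 1.0           # differentiate with respect to theta
    for _ in range(T):
        # Tangent update uses the partials of g at the current x_t.
        x_dot = np.cos(x) * x_dot + theta_dot
        x = g(x, theta)
    return x, x_dot

x_T, dxT_dtheta = forward_sensitivity(0.5, 0.1, T=1000)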

Cost through branches

Forward mode follows the branch taken by the primal program. If the primal program executes one branch of a conditional, the tangent program executes the corresponding derivative branch.

Thus the cost is tied to the realized execution path, not to all possible paths.

For example:

if x > 0:
    y = expensive_a(x)
else:
    y = expensive_b(x)

Forward mode differentiates only the branch selected by the primal value of $x$. It does not evaluate both branches unless the original program does.

At branch boundaries, derivative existence is a mathematical question. The complexity model still follows the executed program path.

Cost of primitive operations

Different primitives have different derivative overhead.

| Primitive | Primal cost | Tangent overhead |
| --- | --- | --- |
| Addition | low | low |
| Multiplication | low | moderate |
| Division | moderate | moderate |
| Exponential | high | low if primal result reused |
| Logarithm | high | low |
| Matrix multiplication | high | one or two extra matrix products |
| Linear solve | high | one extra solve using same factorization |
| SVD/eigendecomposition | high | derivative rule can be costly and unstable |

For linear algebra primitives, the quality of the derivative rule matters. A naive rule may be much slower than necessary.

For example, for

$$Ax = b,$$

the tangent equation is

$$A\dot{x} = \dot{b} - \dot{A}x.$$

If the primal solve already factored $A$, the tangent solve can often reuse that factorization. Reuse can make the tangent much cheaper than an entirely separate solve.
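
A sketch with SciPy's LU routines, using made-up data: the cubic-cost factorization is paid once by the primal solve, and the tangent solve is only a pair of triangular solves against the same factors.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 50
A = rng.normal(size=(n, n)) + n * np.eye(n)   # well-conditioned example
b = rng.normal(size=n)
A_dot = rng.normal(size=(n, n))               # tangent of A
b_dot = rng.normal(size=n)                    # tangent of b

# Primal solve: factor A once.
lu, piv = lu_factor(A)
x = lu_solve((lu, piv), b)

# Tangent solve: A x_dot = b_dot - A_dot x, reusing the factorization.
x_dot = lu_solve((lu, piv), b_dot - A_dot @ x)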

Full gradient cost in forward mode

For a scalar function

$$f : \mathbb{R}^n \to \mathbb{R},$$

the gradient has $n$ components:

$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}.$$

Forward mode with scalar tangents computes one component per basis seed. Therefore:

$$C_{\nabla f}^{\mathrm{forward}} \approx O(n C_f).$$

This is why forward mode is usually a poor default for training large neural networks. A model may have millions or billions of parameters. Computing one gradient component per forward pass would be infeasible.

Reverse mode computes the gradient of a scalar loss with cost proportional to one or a few primal evaluations, so it dominates this use case.
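
The asymmetry is easy to see in code. A JAX sketch with an illustrative scalar loss: forward mode needs one JVP per input component, while reverse mode recovers the whole gradient in a single backward pass.

import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * x)   # scalar loss, R^n -> R

x = jnp.arange(1.0, 101.0)           # n = 100 parameters
n = x.shape[0]

# Forward mode: n passes, one per basis direction.
grad_fwd = jnp.stack([jax.jvp(f, (x,), (jnp.eye(n)[i],))[1] for i in range(n)])

# Reverse mode: one forward pass plus one backward pass.
grad_rev = jax.grad(f)(x)

assert jnp.allclose(grad_fwd, grad_rev)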

When forward mode wins

Forward mode is effective when the derivative query has few input directions.

Examples:

| Function shape | Forward-mode advantage |
| --- | --- |
| $f:\mathbb{R}\to\mathbb{R}^m$ | one pass gives all output derivatives |
| $f:\mathbb{R}^k\to\mathbb{R}^m$, small $k$ | full Jacobian costs $k$ passes |
| directional sensitivity | one pass per direction |
| Jacobian-free solvers | direct JVP operator |
| nested higher-order methods | useful inside reverse mode |
| parameter sweeps with few parameters | cheap sensitivities |

Forward mode also has simpler memory management. When memory is the limiting factor, a forward-mode method may be preferable even if it performs more arithmetic.

Summary

The complexity of forward mode is governed by the number of input tangent directions. One forward pass computes one Jacobian-vector product at cost roughly proportional to the original program evaluation. Full Jacobian construction requires one pass per input basis direction unless multiple directions are vectorized.

Forward mode has favorable memory behavior because it propagates derivatives alongside the primal computation and does not require a separate backward pass. Its main weakness is sensitivity analysis over high-dimensional inputs, where reverse mode is usually more efficient.