
Complexity Analysis


Forward mode automatic differentiation has a simple cost model. It evaluates the original program and, at the same time, evaluates the tangent program. Each primitive operation computes its primal result and its derivative contribution.

For one tangent direction, the runtime is usually within a small constant factor of the original function evaluation.

If the primal computation costs

$$C_f,$$

then one forward-mode derivative pass usually costs

$$O(C_f).$$

The exact constant depends on the primitive operations, data layout, compiler optimizations, and tangent representation.

Scalar tangent cost

Consider a scalar operation:

$$z = xy.$$

The primal operation needs one multiplication. The tangent rule is

$$\dot{z} = \dot{x}y + x\dot{y}.$$

This adds two multiplications and one addition.

So multiplication changes from:

z = x * y

to:

z  = x * y
dz = dx * y + x * dy

The operation count increases, but the asymptotic structure is unchanged. The differentiated program follows the same execution path as the primal program.

For many scalar numerical programs, a forward-mode pass costs roughly two to five times the primal evaluation. This is a practical rule of thumb, not a theorem. Expensive primitives may have lower relative overhead because the derivative can reuse primal work.

For example, in

$$z = \exp(x),$$

the primal result

$$z = \exp(x)$$

can be reused in the tangent rule:

$$\dot{z} = z\dot{x}.$$

The derivative adds only a multiplication after computing the exponential.
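
To make the bookkeeping concrete, here is a minimal dual-number sketch in Python. The Dual class and the free function exp are illustrative names, not any particular library's API; the point is that multiplication adds two multiplications and one addition, while the exponential reuses its primal result.

import math

class Dual:
    """Primal value paired with one tangent component."""
    def __init__(self, val, tan=0.0):
        self.val = val   # primal value
        self.tan = tan   # directional derivative

    def __mul__(self, other):
        # Primal: one multiplication.
        # Tangent: two multiplications and one addition.
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

def exp(d):
    # The expensive primal exponential is computed once and reused
    # in the tangent rule, which adds only one multiplication.
    e = math.exp(d.val)
    return Dual(e, e * d.tan)

# Seed x with tangent 1 to differentiate with respect to x.
x = Dual(2.0, 1.0)
y = Dual(3.0, 0.0)
z = exp(x * y)
print(z.val, z.tan)   # exp(6) and 3 * exp(6) = d/dx exp(xy)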

Cost for one JVP

For

$$f : \mathbb{R}^n \to \mathbb{R}^m,$$

one forward-mode pass computes

$$J_f(x)v.$$

The cost is approximately one primal evaluation plus tangent propagation:

$$C_{\mathrm{jvp}} \approx \alpha C_f,$$

where $\alpha$ is a small implementation-dependent constant.

The memory cost is also close to the primal memory cost. Each live variable needs its primal value and its tangent value. If both are scalar floating point values, the live numerical storage is roughly doubled.

Forward mode does not need to retain the whole execution trace for a later pass. This makes its memory behavior much simpler than reverse mode.
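
As an illustration of this cost model, the following sketch uses JAX's jax.jvp, which returns the primal output and one Jacobian-vector product from a single forward pass; the function f, the point x, and the direction v are made up for the example.

import jax
import jax.numpy as jnp

def f(x):
    # Arbitrary R^n -> R^m example function.
    return jnp.stack([jnp.sum(jnp.sin(x)), jnp.prod(x), x[0] * x[-1]])

x = jnp.arange(1.0, 6.0)   # evaluation point, n = 5
v = jnp.ones_like(x)       # one tangent direction

# One forward pass returns f(x) and J_f(x) v together,
# at roughly the cost of a single evaluation of f.
y, jv = jax.jvp(f, (x,), (v,))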

Cost for full Jacobians

A single forward-mode pass computes one Jacobian-vector product. To build the full Jacobian, seed the inputs with basis vectors:

$$e_1, e_2, \ldots, e_n.$$

Each pass computes one column:

$$J_f(x)e_i.$$

Therefore, the cost of a full dense Jacobian by scalar forward mode is approximately

$$n C_{\mathrm{jvp}}.$$

So

$$C_{\mathrm{Jacobian}} \approx O(n C_f).$$

The full Jacobian itself has

$$mn$$

entries, so storing it also costs

$$O(mn).$$

This is acceptable for small $n$. It is often too expensive when the input dimension is large.
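
A sketch of the column-by-column construction, again with JAX and an illustrative function: each basis vector is pushed through one JVP, and jax.jacfwd packages the same idea.

import jax
import jax.numpy as jnp

def f(x):
    return jnp.stack([jnp.sum(jnp.sin(x)), jnp.prod(x)])   # R^3 -> R^2

x = jnp.arange(1.0, 4.0)
n = x.shape[0]

# One forward pass per basis vector e_i yields one Jacobian column.
cols = [jax.jvp(f, (x,), (jnp.eye(n)[i],))[1] for i in range(n)]
J = jnp.stack(cols, axis=1)           # shape (m, n)

# jax.jacfwd builds the Jacobian the same way.
assert jnp.allclose(J, jax.jacfwd(f)(x))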

Comparison with reverse mode

Forward and reverse mode have complementary complexity.

For

$$f : \mathbb{R}^n \to \mathbb{R}^m,$$

forward mode computes one column-oriented product:

$$Jv.$$

Reverse mode computes one row-oriented product:

$$w^\top J.$$

The cost profiles are:

| Goal | Forward mode cost | Reverse mode cost |
| --- | --- | --- |
| One JVP $Jv$ | $O(C_f)$ | inefficient or requires extra transforms |
| One VJP $w^\top J$ | inefficient or requires extra transforms | $O(C_f)$ |
| Full Jacobian | $O(nC_f)$ | $O(mC_f)$ |
| Gradient of $f:\mathbb{R}^n \to \mathbb{R}$ | $O(nC_f)$ | $O(C_f)$ |
| Derivative of $f:\mathbb{R} \to \mathbb{R}^m$ | $O(C_f)$ | $O(mC_f)$ |

This is the central rule:

Forward mode is best when the number of input directions is small. Reverse mode is best when the number of output directions is small.

For scalar-output optimization, reverse mode usually wins. For low-dimensional parameter studies, sensitivity analysis, and functions with many outputs but few inputs, forward mode is often better.
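
The rule translates directly into tool choice. A small JAX sketch, with made-up functions, of the two favorable regimes:

import jax
import jax.numpy as jnp

# Few inputs, many outputs: forward mode needs only n = 2 passes.
def few_in_many_out(p):
    t = jnp.linspace(0.0, 1.0, 1000)
    return p[0] * jnp.sin(t) + p[1] * jnp.cos(t)    # R^2 -> R^1000

J = jax.jacfwd(few_in_many_out)(jnp.array([1.0, 2.0]))   # shape (1000, 2)

# Many inputs, one scalar output: reverse mode needs one backward pass.
def loss(w):
    return jnp.sum(jnp.tanh(w) ** 2)                      # R^n -> R

g = jax.grad(loss)(jnp.ones(10_000))                      # shape (10000,)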

Multiple tangent directions

Forward mode can propagate $k$ tangent directions in one run. Instead of storing

$$\dot{x} \in \mathbb{R},$$

each variable stores

$$\dot{x} \in \mathbb{R}^k.$$

The output gives

$$J_f(x)V,$$

where

$$V \in \mathbb{R}^{n \times k}.$$

The runtime is roughly proportional to $k$:

$$C_{\mathrm{jvp}, k} \approx O(k C_f).$$

However, vectorized hardware can reduce the constant factor. A tangent vector of length $k$ may map well to SIMD lanes, GPU tensor operations, or batched kernels.

The memory cost also scales with $k$, since each live variable carries $k$ tangent components:

$$M_{\mathrm{forward}, k} \approx O(k M_f),$$

where $M_f$ is the live memory footprint of the primal computation.
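
One way to realize this in practice, sketched here with JAX, is to batch the tangent argument of a JVP; the function and the matrix V below are illustrative.

import jax
import jax.numpy as jnp

def f(x):
    return jnp.stack([jnp.sum(x ** 2), jnp.sum(jnp.sin(x))])   # R^5 -> R^2

x = jnp.arange(1.0, 6.0)
V = jnp.eye(5)[:, :3]        # k = 3 tangent directions, shape (n, k)

# Push all k directions through one batched forward pass:
# the result is J_f(x) V, column by column.
push = jax.vmap(lambda v: jax.jvp(f, (x,), (v,))[1], in_axes=1, out_axes=1)
JV = push(V)                 # shape (m, k)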

Dense versus sparse tangent storage

If tangent vectors are dense, the cost grows linearly with the number of propagated directions.

If tangent vectors are sparse, many tangent components are zero. A sparse forward-mode implementation can exploit this by storing only nonzero tangent entries.

For example, if each variable depends on only a small subset of inputs, then its tangent vector may stay sparse. This occurs in:

| Problem type | Structure |
| --- | --- |
| PDE discretizations | local stencil dependencies |
| circuit simulation | sparse component coupling |
| graph algorithms | local neighborhood dependencies |
| robotics kinematics | chain or tree sparsity |
| structured optimization | block-sparse variables |

Sparse tangent propagation can reduce both runtime and memory, but it introduces overhead for sparse data structures. It works best when the derivative sparsity is strong and stable.
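
A minimal sketch of sparse tangent propagation in Python: each variable stores its tangent as a dictionary of nonzero entries, so a variable that touches only a few inputs keeps a small tangent. The SparseDual class is illustrative, not a library type.

class SparseDual:
    """Primal value plus a sparse tangent: {input index: derivative}."""
    def __init__(self, val, tan=None):
        self.val = val
        self.tan = tan or {}

    def __add__(self, other):
        tan = dict(self.tan)
        for i, d in other.tan.items():
            tan[i] = tan.get(i, 0.0) + d
        return SparseDual(self.val + other.val, tan)

    def __mul__(self, other):
        tan = {i: d * other.val for i, d in self.tan.items()}
        for i, d in other.tan.items():
            tan[i] = tan.get(i, 0.0) + self.val * d
        return SparseDual(self.val * other.val, tan)

# Each input carries a single nonzero tangent entry.
x = [SparseDual(float(i + 1), {i: 1.0}) for i in range(4)]

# A stencil-like computation: each output touches only two inputs,
# so its tangent stays sparse regardless of the total input count.
y = [x[i] * x[i + 1] + x[i] for i in range(3)]
print([sorted(yi.tan) for yi in y])   # [[0, 1], [1, 2], [2, 3]]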

Memory complexity

Forward mode has streaming memory behavior. Once a primitive has produced its output tangent, it does not need to keep local derivative information unless the primal program itself needs it.

For straight-line scalar code, if the primal evaluation uses live memory

$$M_f,$$

then scalar forward mode uses roughly

$$O(M_f)$$

memory, with a constant factor for tangents.

For $k$ tangent directions:

$$O(k M_f).$$

This contrasts with reverse mode, which usually needs to store intermediate primal values for the backward pass. Reverse mode may require memory proportional to the length of the computation unless checkpointing or recomputation is used.

Forward mode therefore behaves well for long-running simulations, streaming computations, and programs with many loop iterations.

Cost through loops

Loops do not change the basic model. If the primal loop executes $T$ iterations and each iteration costs $c$, then the primal cost is

$$C_f = O(Tc).$$

Forward mode adds tangent work to each iteration:

$$C_{\mathrm{jvp}} = O(Tc).$$

The constant is larger, but the loop count is the same.

For a recurrence

$$x_{t+1} = g(x_t, \theta),$$

forward mode propagates

$$\dot{x}_{t+1} = \frac{\partial g}{\partial x_t}\dot{x}_t + \frac{\partial g}{\partial \theta}\dot{\theta}.$$

This is efficient when the number of tangent directions for $\theta$ is small. It becomes expensive when $\theta$ has many dimensions and all sensitivities are required.
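
A sketch of forward sensitivity through a recurrence, with a made-up step function g: the tangent update runs in lockstep with the primal loop, carries one number per tangent direction, and stores no trace.

import numpy as np

def g(x, theta):
    # Example step; dg/dx = cos(x), dg/dtheta = 1.
    return np.sin(x) + theta

def forward_sensitivity(x0, theta, T):
    x, x_dot = x0, 0.0        # seed: d x0 / d theta = 0
    theta_dot = 1.0           # differentiate with respect to theta
    for _ in range(T):
        # Tangent update uses the partials of g at the current x_t.
        x_dot = np.cos(x) * x_dot + theta_dot
        x = g(x, theta)
    return x, x_dot

x_T, dxT_dtheta = forward_sensitivity(0.5, 0.1, T=1000)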

Cost through branches

Forward mode follows the branch taken by the primal program. If the primal program executes one branch of a conditional, the tangent program executes the corresponding derivative branch.

Thus the cost is tied to the realized execution path, not to all possible paths.

For example:

if x > 0:
    y = expensive_a(x)
else:
    y = expensive_b(x)

Forward mode differentiates only the branch selected by the primal value of $x$. It does not evaluate both branches unless the original program does.

At branch boundaries, derivative existence is a mathematical question. The complexity model still follows the executed program path.

Cost of primitive operations

Different primitives have different derivative overhead.

| Primitive | Primal cost | Tangent overhead |
| --- | --- | --- |
| Addition | low | low |
| Multiplication | low | moderate |
| Division | moderate | moderate |
| Exponential | high | low if primal result reused |
| Logarithm | high | low |
| Matrix multiplication | high | one or two extra matrix products |
| Linear solve | high | one extra solve using same factorization |
| SVD/eigendecomposition | high | derivative rule can be costly and unstable |

For linear algebra primitives, the quality of the derivative rule matters. A naive rule may be much slower than necessary.

For example, for

$$Ax = b,$$

the tangent equation is

$$A\dot{x} = \dot{b} - \dot{A}x.$$

If the primal solve already factored $A$, the tangent solve can often reuse that factorization. Reuse can make the tangent much cheaper than an entirely separate solve.
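
A sketch with SciPy's LU routines, using made-up data: the cubic-cost factorization is paid once by the primal solve, and the tangent solve is only a pair of triangular solves against the same factors.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 50
A = rng.normal(size=(n, n)) + n * np.eye(n)   # well-conditioned example
b = rng.normal(size=n)
A_dot = rng.normal(size=(n, n))               # tangent of A
b_dot = rng.normal(size=n)                    # tangent of b

# Primal solve: factor A once.
lu, piv = lu_factor(A)
x = lu_solve((lu, piv), b)

# Tangent solve: A x_dot = b_dot - A_dot x, reusing the factorization.
x_dot = lu_solve((lu, piv), b_dot - A_dot @ x)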

Full gradient cost in forward mode

For a scalar function

$$f : \mathbb{R}^n \to \mathbb{R},$$

the gradient has $n$ components:

$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}.$$

Forward mode with scalar tangents computes one component per basis seed. Therefore:

$$C_{\nabla f}^{\mathrm{forward}} \approx O(n C_f).$$

This is why forward mode is usually a poor default for training large neural networks. A model may have millions or billions of parameters. Computing one gradient component per forward pass would be infeasible.

Reverse mode computes the gradient of a scalar loss with cost proportional to one or a few primal evaluations, so it dominates this use case.
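
The asymmetry is easy to see in code. A JAX sketch with an illustrative scalar loss: forward mode needs one JVP per input component, while reverse mode recovers the whole gradient in a single backward pass.

import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) * x)   # scalar loss, R^n -> R

x = jnp.arange(1.0, 101.0)           # n = 100 parameters
n = x.shape[0]

# Forward mode: n passes, one per basis direction.
grad_fwd = jnp.stack([jax.jvp(f, (x,), (jnp.eye(n)[i],))[1] for i in range(n)])

# Reverse mode: one forward pass plus one backward pass.
grad_rev = jax.grad(f)(x)

assert jnp.allclose(grad_fwd, grad_rev)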

When forward mode wins

Forward mode is effective when the derivative query has few input directions.

Examples:

| Function shape | Forward-mode advantage |
| --- | --- |
| $f:\mathbb{R}\to\mathbb{R}^m$ | one pass gives all output derivatives |
| $f:\mathbb{R}^k\to\mathbb{R}^m$, small $k$ | full Jacobian costs $k$ passes |
| directional sensitivity | one pass per direction |
| Jacobian-free solvers | direct JVP operator |
| nested higher-order methods | useful inside reverse mode |
| parameter sweeps with few parameters | cheap sensitivities |

Forward mode also has simpler memory management. When memory is the limiting factor, a forward-mode method may be preferable even if it performs more arithmetic.

Summary

The complexity of forward mode is governed by the number of input tangent directions. One forward pass computes one Jacobian-vector product at cost roughly proportional to the original program evaluation. Full Jacobian construction requires one pass per input basis direction unless multiple directions are vectorized.

Forward mode has favorable memory behavior because it propagates derivatives alongside the primal computation and does not require a separate backward pass. Its main weakness is sensitivity analysis over high-dimensional inputs, where reverse mode is usually more efficient.