Forward mode automatic differentiation has a simple cost model. It evaluates the original program and, at the same time, evaluates the tangent program. Each primitive operation computes its primal result and its derivative contribution.
For one tangent direction, the runtime is usually within a small constant factor of the original function evaluation.
If the primal computation costs $T(f)$, then one forward-mode derivative pass usually costs $c \cdot T(f)$ for a small constant $c$.
The exact constant depends on the primitive operations, data layout, compiler optimizations, and tangent representation.
Scalar tangent cost
Consider a scalar operation $z = x \cdot y$. The primal operation needs one multiplication. The tangent rule is

$$\dot z = \dot x \, y + x \, \dot y.$$
This adds two multiplications and one addition.
So multiplication changes from:

```
z = x * y
```

to:

```
z = x * y
dz = dx * y + x * dy
```

The operation count increases, but the asymptotic structure is unchanged. The differentiated program follows the same execution path as the primal program.
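This tangent rule can be sketched with a minimal dual-number class, where each value carries its primal and tangent components side by side. The `Dual` class and names here are illustrative, not a library API:

```python
# Minimal dual-number sketch of forward-mode propagation.
class Dual:
    def __init__(self, primal, tangent=0.0):
        self.primal = primal
        self.tangent = tangent

    def __mul__(self, other):
        # Product rule: dz = dx * y + x * dy
        return Dual(self.primal * other.primal,
                    self.tangent * other.primal + self.primal * other.tangent)

x = Dual(3.0, 1.0)   # seed dx = 1: differentiate with respect to x
y = Dual(4.0)        # tangent 0: y is held constant
z = x * y
# z.primal is 12.0; z.tangent is 4.0, i.e. d(x*y)/dx = y
```

Each `__mul__` call does one primal multiplication plus two multiplications and one addition for the tangent, matching the operation count above.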
For many scalar numerical programs, a forward-mode pass costs roughly two to five times the primal evaluation. This is a practical rule of thumb, not a theorem. Expensive primitives may have lower relative overhead because the derivative can reuse primal work.
For example, in $z = \exp(x)$, the primal result $z$ can be reused in the tangent rule $\dot z = z \, \dot x$. The derivative adds only a multiplication after computing the exponential.
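A small sketch of this reuse (the function name `exp_jvp` is illustrative):

```python
import math

def exp_jvp(x, dx):
    z = math.exp(x)   # primal: the expensive part
    dz = z * dx       # tangent reuses z: only one extra multiply
    return z, dz

z, dz = exp_jvp(1.0, 1.0)
# dz equals z here, since d/dx exp(x) = exp(x)
```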
Cost for one JVP
For $f : \mathbb{R}^n \to \mathbb{R}^m$, one forward-mode pass computes the Jacobian-vector product $J_f(x)\,v$ for a seed direction $v \in \mathbb{R}^n$.
The cost is approximately one primal evaluation plus tangent propagation:

$$T(\text{JVP}) \approx c \cdot T(f),$$

where $c$ is a small implementation-dependent constant.
The memory cost is also close to the primal memory cost. Each live variable needs its primal value and its tangent value. If both are scalar floating point values, the live numerical storage is roughly doubled.
Forward mode does not need to retain the whole execution trace for a later pass. This makes its memory behavior much simpler than reverse mode.
Cost for full Jacobians
A single forward-mode pass computes one Jacobian-vector product. To build the full Jacobian, seed the inputs with the standard basis vectors $e_1, \dots, e_n$. Each pass computes one column:

$$J_f(x)\, e_j = \frac{\partial f}{\partial x_j}(x).$$

Therefore, the cost of a full dense Jacobian by scalar forward mode is approximately

$$T(J) \approx n \cdot c \cdot T(f).$$

So the runtime grows linearly with the input dimension $n$. The full Jacobian itself has $m \cdot n$ entries, so storing it also costs $O(mn)$ memory. This is acceptable for small $n$. It is often too expensive when the input dimension is large.
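The column-by-column construction can be sketched as follows. Here `jvp` is a stand-in implemented with a forward difference purely for illustration; a real forward-mode system would propagate exact tangents instead:

```python
def jvp(f, x, v, eps=1e-7):
    # Stand-in for a true JVP: finite-difference approximation of J_f(x) @ v.
    fx = f(x)
    fxv = f([xi + eps * vi for xi, vi in zip(x, v)])
    return [(a - b) / eps for a, b in zip(fxv, fx)]

def dense_jacobian(f, x):
    n = len(x)
    # One pass per basis direction e_j; each pass yields one column.
    cols = [jvp(f, x, [1.0 if i == j else 0.0 for i in range(n)])
            for j in range(n)]
    # Transpose the list of columns into row-major J.
    return [list(row) for row in zip(*cols)]

f = lambda x: [x[0] * x[1], x[0] + x[1]]   # maps R^2 -> R^2
J = dense_jacobian(f, [2.0, 3.0])
# J is approximately [[3, 2], [1, 1]]
```

The loop over `n` basis directions is exactly why the total cost scales with the input dimension.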
Comparison with reverse mode
Forward and reverse mode have complementary complexity.
For $f : \mathbb{R}^n \to \mathbb{R}^m$ with Jacobian $J \in \mathbb{R}^{m \times n}$, forward mode computes one column-oriented product, the JVP $J v$ with $v \in \mathbb{R}^n$. Reverse mode computes one row-oriented product, the VJP $u^\top J$ with $u \in \mathbb{R}^m$.
The cost profiles are:
| Goal | Forward mode cost | Reverse mode cost |
|---|---|---|
| One JVP | $O(T(f))$ | inefficient or requires extra transforms |
| One VJP | inefficient or requires extra transforms | $O(T(f))$ |
| Full Jacobian | $O(n \cdot T(f))$ | $O(m \cdot T(f))$ |
| Gradient of $f : \mathbb{R}^n \to \mathbb{R}$ | $O(n \cdot T(f))$ | $O(T(f))$ |
| Derivative of $f : \mathbb{R} \to \mathbb{R}^m$ | $O(T(f))$ | $O(m \cdot T(f))$ |
This is the central rule:
Forward mode is best when the number of input directions is small. Reverse mode is best when the number of output directions is small.
For scalar-output optimization, reverse mode usually wins. For low-dimensional parameter studies, sensitivity analysis, and functions with many outputs but few inputs, forward mode is often better.
Multiple tangent directions
Forward mode can propagate $k$ tangent directions in one run. Instead of storing a single scalar tangent $\dot v$, each variable stores a tangent vector $\dot v \in \mathbb{R}^k$. The output gives the Jacobian-matrix product $J_f(x)\,V$, where $V \in \mathbb{R}^{n \times k}$ collects the $k$ seed directions as columns. The runtime is roughly proportional to $k$:

$$T \approx (1 + c \, k) \cdot T(f).$$
However, vectorized hardware can reduce the constant factor. A tangent vector of length $k$ may map well to SIMD lanes, GPU tensor operations, or batched kernels.

The memory cost also scales with $k$, since each live variable carries $k$ tangent components:

$$M \approx (1 + k) \cdot M(f),$$

where $M(f)$ is the live memory footprint of the primal computation.
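A minimal sketch of multi-direction propagation, extending the dual-number idea so each variable carries a list of $k$ tangent components (the `MultiDual` class is illustrative, not a library API):

```python
# Each value carries its primal and k tangent components at once.
class MultiDual:
    def __init__(self, primal, tangents):
        self.primal = primal
        self.tangents = tangents   # length-k list of tangent components

    def __mul__(self, other):
        # Product rule applied componentwise across all k directions.
        return MultiDual(
            self.primal * other.primal,
            [dx * other.primal + self.primal * dy
             for dx, dy in zip(self.tangents, other.tangents)])

# Seed two directions at once: e1 for x and e2 for y.
x = MultiDual(3.0, [1.0, 0.0])
y = MultiDual(4.0, [0.0, 1.0])
z = x * y
# z.tangents is [4.0, 3.0]: both partial derivatives in a single pass
```

One pass over the program now produces $k$ columns of the Jacobian, at roughly $k$ times the tangent arithmetic per operation.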
Dense versus sparse tangent storage
If tangent vectors are dense, the cost grows linearly with the number of propagated directions.
If tangent vectors are sparse, many tangent components are zero. A sparse forward-mode implementation can exploit this by storing only nonzero tangent entries.
For example, if each variable depends on only a small subset of inputs, then its tangent vector may stay sparse. This occurs in:
| Problem type | Structure |
|---|---|
| PDE discretizations | local stencil dependencies |
| circuit simulation | sparse component coupling |
| graph algorithms | local neighborhood dependencies |
| robotics kinematics | chain or tree sparsity |
| structured optimization | block-sparse variables |
Sparse tangent propagation can reduce both runtime and memory, but it introduces overhead for sparse data structures. It works best when the derivative sparsity is strong and stable.
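One common sparse representation stores each tangent as a dictionary from input index to nonzero component. A minimal sketch (helper names `sparse_add` and `sparse_scale` are illustrative):

```python
# Sparse tangents: dict mapping input index -> nonzero tangent component.
def sparse_add(da, db):
    out = dict(da)
    for k, v in db.items():
        out[k] = out.get(k, 0.0) + v
    return out

def sparse_scale(da, s):
    return {k: s * v for k, v in da.items()}

# z = x * y, where x depends only on input 0 and y only on input 7.
x, dx = 3.0, {0: 1.0}
y, dy = 4.0, {7: 1.0}
z = x * y
dz = sparse_add(sparse_scale(dx, y), sparse_scale(dy, x))
# dz is {0: 4.0, 7: 3.0}; inputs z does not depend on never appear
```

The cost of each tangent operation scales with the number of nonzero entries rather than with the full input dimension.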
Memory complexity
Forward mode has streaming memory behavior. Once a primitive has produced its output tangent, it does not need to keep local derivative information unless the primal program itself needs it.
For straight-line scalar code, if the primal evaluation uses live memory $M(f)$, then scalar forward mode uses roughly $2 \cdot M(f)$ memory, with a constant factor for tangents.

For $k$ tangent directions:

$$M \approx (1 + k) \cdot M(f).$$
This contrasts with reverse mode, which usually needs to store intermediate primal values for the backward pass. Reverse mode may require memory proportional to the length of the computation unless checkpointing or recomputation is used.
Forward mode therefore behaves well for long-running simulations, streaming computations, and programs with many loop iterations.
Cost through loops
Loops do not change the basic model. If the primal loop executes $N$ iterations and each iteration costs $T_{\text{iter}}$, then the primal cost is $N \cdot T_{\text{iter}}$. Forward mode adds tangent work to each iteration, for a total of roughly $c \cdot N \cdot T_{\text{iter}}$. The constant is larger, but the loop count is the same.
For a recurrence $x_{t+1} = g(x_t)$, forward mode propagates the tangent recurrence $\dot x_{t+1} = g'(x_t)\,\dot x_t$ alongside the primal iteration. This is efficient when the number of tangent directions for $x_0$ is small. It becomes expensive when $x_0$ has many dimensions and all sensitivities are required.
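The recurrence can be sketched directly. The example map $g$ below is an arbitrary illustration; note that the tangent update must use the state from before the primal update:

```python
def g(x):
    return 0.5 * x * x      # example map, with derivative g'(x) = x

def rollout_with_tangent(x0, dx0, steps):
    x, dx = x0, dx0
    for _ in range(steps):
        dx = x * dx         # g'(x) * dx, using x before it is updated
        x = g(x)
    return x, dx

xN, dxN = rollout_with_tangent(1.0, 1.0, 2)
# xN is 0.125 and dxN is 0.5, matching d/dx0 of x0**4 / 8 at x0 = 1
```

The loop runs the same number of iterations as the primal rollout; only the per-iteration cost grows.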
Cost through branches
Forward mode follows the branch taken by the primal program. If the primal program executes one branch of a conditional, the tangent program executes the corresponding derivative branch.
Thus the cost is tied to the realized execution path, not to all possible paths.
For example:
```
if x > 0:
    y = expensive_a(x)
else:
    y = expensive_b(x)
```

Forward mode differentiates only the branch selected by the primal value of `x`. It does not evaluate both branches unless the original program does.
At branch boundaries, derivative existence is a mathematical question. The complexity model still follows the executed program path.
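A concrete sketch of branch-following tangent propagation, using `math.sin` and squaring as stand-ins for the two expensive branches:

```python
import math

def f_with_tangent(x, dx):
    # Only the branch selected by the primal value of x is executed,
    # and only its derivative rule is applied.
    if x > 0:
        y = math.sin(x)           # stand-in for expensive_a
        dy = math.cos(x) * dx
    else:
        y = x * x                 # stand-in for expensive_b
        dy = 2.0 * x * dx
    return y, dy

y, dy = f_with_tangent(-2.0, 1.0)
# Only the else-branch runs; the sin branch costs nothing on this input.
```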
Cost of primitive operations
Different primitives have different derivative overhead.
| Primitive | Primal cost | Tangent overhead |
|---|---|---|
| Addition | low | low |
| Multiplication | low | moderate |
| Division | moderate | moderate |
| Exponential | high | low if primal result reused |
| Logarithm | high | low |
| Matrix multiplication | high | one or two extra matrix products |
| Linear solve | high | one extra solve using same factorization |
| SVD/eigendecomposition | high | derivative rule can be costly and unstable |
For linear algebra primitives, the quality of the derivative rule matters. A naive rule may be much slower than necessary.
For example, for a linear solve $A x = b$, differentiating both sides gives $\dot A\,x + A\,\dot x = \dot b$, so the tangent equation is

$$A\,\dot x = \dot b - \dot A\,x.$$

If the primal solve already factored $A$, the tangent solve can reuse that factorization. Reuse can make the tangent much cheaper than an entirely separate solve.
Full gradient cost in forward mode
For a scalar function $f : \mathbb{R}^n \to \mathbb{R}$, the gradient has $n$ components:

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right).$$

Forward mode with scalar tangents computes one component per basis seed. Therefore:

$$T(\nabla f) \approx n \cdot c \cdot T(f).$$
This is why forward mode is usually a poor default for training large neural networks. A model may have millions or billions of parameters. Computing one gradient component per forward pass would be infeasible.
Reverse mode computes the gradient of a scalar loss with cost proportional to one or a few primal evaluations, so it dominates this use case.
When forward mode wins
Forward mode is effective when the derivative query has few input directions.
Examples:
| Function shape | Forward-mode advantage |
|---|---|
| $f : \mathbb{R} \to \mathbb{R}^m$ | one pass gives all output derivatives |
| $f : \mathbb{R}^n \to \mathbb{R}^m$, small $n$ | full Jacobian costs $n$ passes |
| directional sensitivity | one pass per direction |
| Jacobian-free solvers | direct JVP operator |
| nested higher-order methods | useful inside reverse mode |
| parameter sweeps with few parameters | cheap sensitivities |
Forward mode also has simpler memory management. When memory is the limiting factor, a forward-mode method may be preferable even if it performs more arithmetic.
Summary
The complexity of forward mode is governed by the number of input tangent directions. One forward pass computes one Jacobian-vector product at cost roughly proportional to the original program evaluation. Full Jacobian construction requires one pass per input basis direction unless multiple directions are vectorized.
Forward mode has favorable memory behavior because it propagates derivatives with the primal computation and does not require a separate backward pass. Its main weakness is high-dimensional input sensitivity, where reverse mode is usually more efficient.