Forward Accumulation

Forward accumulation is the forward-mode form of automatic differentiation. It propagates derivative information in the same order as ordinary program evaluation. Each intermediate value carries two pieces of information:

| Component | Meaning |
|---|---|
| Primal | the ordinary value computed by the program |
| Tangent | the derivative of that value with respect to a chosen input direction |

If a program computes:

$$y = f(x)$$

then forward accumulation computes both the primal output $y$ and the tangent

$$\dot y = J_f(x)\,\dot x$$

where:

  • $J_f(x)$ is the Jacobian of $f$ at $x$
  • $\dot x$ is the seed tangent
  • $\dot y$ is the propagated output tangent

Forward mode therefore computes a Jacobian-vector product.

The Tangent Pair

Forward mode evaluates every variable as a pair:

$$(v, \dot v)$$

The first component is the primal value. The second component is the tangent value.

For an input $x$, we initialize:

$$(x, \dot x)$$

The choice of $\dot x$ determines which derivative direction we compute.

For a scalar input, set:

$$\dot x = 1$$

Then the output tangent equals the ordinary derivative:

$$\dot y = \frac{dy}{dx}$$

For a vector input $x \in \mathbb{R}^n$, set:

$$\dot x = e_i$$

where $e_i$ is the $i$-th standard basis vector. Then the output tangent gives the $i$-th column of the Jacobian.

Simple Scalar Example

Consider:

$$f(x) = x^2 + 3x$$

Introduce intermediate variables:

$$v_1 = x^2 \qquad v_2 = 3x \qquad v_3 = v_1 + v_2$$

Forward accumulation evaluates primal and tangent values together.

| Variable | Primal | Tangent |
|---|---|---|
| $x$ | $x$ | $1$ |
| $v_1 = x^2$ | $x^2$ | $2x$ |
| $v_2 = 3x$ | $3x$ | $3$ |
| $v_3 = v_1 + v_2$ | $x^2 + 3x$ | $2x + 3$ |

At the end:

$$f'(x) = 2x + 3$$

The derivative appears as a byproduct of ordinary evaluation.
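The trace above can be reproduced with a short hand-written tangent-pair evaluation in Python (a sketch; the function name is illustrative):

```python
def f_forward(x, dot_x):
    # Each step computes a (primal, tangent) pair.
    v1, dot_v1 = x * x, 2 * x * dot_x        # v1 = x^2
    v2, dot_v2 = 3 * x, 3 * dot_x            # v2 = 3x
    v3, dot_v3 = v1 + v2, dot_v1 + dot_v2    # v3 = v1 + v2
    return v3, dot_v3

# Seed dot_x = 1 so the tangent equals f'(x) = 2x + 3.
y, dy = f_forward(2.0, 1.0)   # y = 10.0, dy = 7.0
```

Seeding with `dot_x = 1.0` makes the second return value the ordinary derivative, exactly as in the table.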

Operational Rules

Every primitive operation is lifted from values to tangent pairs.

Addition

$$z = x + y$$

Forward rule:

$$(z, \dot z) = (x + y,\ \dot x + \dot y)$$

Multiplication

$$z = xy$$

Forward rule:

$$(z, \dot z) = (xy,\ y\dot x + x\dot y)$$

Sine

$$z = \sin(x)$$

Forward rule:

$$(z, \dot z) = (\sin(x),\ \cos(x)\,\dot x)$$

Exponential

$$z = e^x$$

Forward rule:

$$(z, \dot z) = (e^x,\ e^x\,\dot x)$$

Each rule is local. The AD system only needs the derivative rule for the current primitive.
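These rules translate directly into code. A minimal sketch in Python, representing each variable as a `(primal, tangent)` tuple (all names are illustrative):

```python
import math

# Each primitive is lifted from values to (primal, tangent) pairs.
def add(a, b):
    (x, dx), (y, dy) = a, b
    return (x + y, dx + dy)

def mul(a, b):
    (x, dx), (y, dy) = a, b
    return (x * y, y * dx + x * dy)   # product rule

def sin(a):
    (x, dx) = a
    return (math.sin(x), math.cos(x) * dx)

def exp(a):
    (x, dx) = a
    return (math.exp(x), math.exp(x) * dx)

# Compose primitives: f(x) = e^x * sin(x), seeded with dot_x = 1.
x = (1.0, 1.0)
y, dy = mul(exp(x), sin(x))
# dy matches f'(x) = e^x (sin x + cos x)
```

Each lifted primitive only knows its own local derivative rule; the composition emerges from ordinary function application.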

Forward Mode as Program Transformation

Forward accumulation can be viewed as a program transformation.

Original program:

v1 = x * x
v2 = 3 * x
v3 = v1 + v2
return v3

Forward-transformed program:

v1      = x * x
dot_v1  = x * dot_x + x * dot_x

v2      = 3 * x
dot_v2  = 3 * dot_x

v3      = v1 + v2
dot_v3  = dot_v1 + dot_v2

return v3, dot_v3

This transformation preserves the original computation while adding tangent computation beside it.

The transformed program has the same control flow as the original program. Branches, loops, and function calls execute normally, but each differentiable value carries a tangent.
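Because control flow is preserved, loops need no special handling: the tangent update simply rides along with each iteration. A sketch (assumed names) that differentiates $x^n$ by repeated multiplication:

```python
def pow_forward(x, dot_x, n):
    # The loop runs exactly as in the primal program; each iteration
    # also updates the tangent via the product rule: d(v*x) = x dv + v dx.
    v, dot_v = 1.0, 0.0
    for _ in range(n):
        v, dot_v = v * x, x * dot_v + v * dot_x
    return v, dot_v

pow_forward(2.0, 1.0, 3)   # (8.0, 12.0): x^3 and 3x^2 at x = 2
```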

Jacobian-Vector Products

For a function:

$$f:\mathbb{R}^n \to \mathbb{R}^m$$

the derivative at a point is the Jacobian:

$$J_f(x) \in \mathbb{R}^{m \times n}$$

Forward mode computes:

$$J_f(x)\,\dot x$$

This is a directional derivative.

If:

$$\dot x = e_i$$

then:

$$J_f(x)\,e_i$$

returns the $i$-th column of the Jacobian.

Thus, computing the full Jacobian by forward mode requires $n$ passes, one for each input dimension.
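The column-by-column construction can be sketched in Python. Here `f_forward` is a hypothetical tangent-pair version of $f(x_1, x_2) = (x_1 x_2,\ x_1 + x_2)$:

```python
def f_forward(x, dot_x):
    # f(x1, x2) = (x1*x2, x1 + x2), written with tangent pairs.
    x1, x2 = x
    d1, d2 = dot_x
    y1, dy1 = x1 * x2, x2 * d1 + x1 * d2
    y2, dy2 = x1 + x2, d1 + d2
    return [y1, y2], [dy1, dy2]

def jacobian(f_fwd, x):
    n = len(x)
    cols = []
    for i in range(n):
        e_i = [1.0 if j == i else 0.0 for j in range(n)]  # seed basis vector
        _, col = f_fwd(x, e_i)   # one forward pass per input dimension
        cols.append(col)
    # cols[i] is the i-th Jacobian column; transpose into rows
    return [[cols[j][i] for j in range(n)] for i in range(len(cols[0]))]

jacobian(f_forward, [2.0, 3.0])   # [[3.0, 2.0], [1.0, 1.0]]
```

At $(x_1, x_2) = (2, 3)$ the Jacobian is $\begin{pmatrix} x_2 & x_1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 3 & 2 \\ 1 & 1 \end{pmatrix}$, which the two seeded passes recover column by column.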

Cost Model

Let $C(f)$ be the cost of evaluating $f$.

One forward-mode pass usually costs a small constant multiple of $C(f)$.

For one seed direction:

$$C_{\text{forward}} \approx k\, C(f)$$

where $k$ is often between 2 and 5, depending on the primitive set and implementation.

For a full Jacobian of:

$$f:\mathbb{R}^n \to \mathbb{R}^m$$

forward mode requires $n$ seeded evaluations:

$$C_{\text{Jacobian}} \approx n k\, C(f)$$

Forward mode is therefore well suited when:

  • the number of inputs is small
  • the number of outputs is large
  • directional derivatives are sufficient
  • Jacobian-vector products are needed directly

Multi-Seed Forward Mode

Instead of carrying one tangent direction, a system may carry several tangents at once.

Each variable becomes:

$$(v, \dot v_1, \dot v_2, \dots, \dot v_p)$$

where $p$ is the number of seed directions.

This computes:

$$J_f(x)\,S$$

where:

$$S \in \mathbb{R}^{n \times p}$$

contains multiple seed vectors.

Multi-seed forward mode amortizes overhead across directions. It is useful when computing several Jacobian columns together, especially on vector hardware.
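A sketch of multi-seed propagation for $f(x, y) = x^2 y$, carrying both seed directions through a single pass (names are illustrative):

```python
def f_multiseed(x, y, seeds_x, seeds_y):
    # f(x, y) = x^2 * y; each variable carries p tangents at once,
    # where (seeds_x[k], seeds_y[k]) is the k-th seed direction.
    v1 = x * x
    d1 = [2 * x * sx for sx in seeds_x]
    v2 = v1 * y
    d2 = [y * t1 + v1 * sy for t1, sy in zip(d1, seeds_y)]
    return v2, d2

# Seed with the 2x2 identity: one pass yields both partial derivatives.
val, grads = f_multiseed(2.0, 3.0, [1.0, 0.0], [0.0, 1.0])
# val = 12.0, grads = [12.0, 4.0]  (2xy and x^2 at (2, 3))
```

The primal work is done once and shared across all $p$ directions, which is the source of the amortization.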

Example: Two Inputs

Let:

$$f(x,y) = xy + \sin(x)$$

Intermediate variables:

$$v_1 = xy \qquad v_2 = \sin(x) \qquad v_3 = v_1 + v_2$$

To compute the derivative with respect to $x$, seed:

$$\dot x = 1, \quad \dot y = 0$$

Propagation:

$$\dot v_1 = y\dot x + x\dot y = y \qquad \dot v_2 = \cos(x)\,\dot x = \cos(x) \qquad \dot v_3 = y + \cos(x)$$

Therefore:

$$\frac{\partial f}{\partial x} = y + \cos(x)$$

To compute the derivative with respect to $y$, seed:

$$\dot x = 0, \quad \dot y = 1$$

Propagation:

$$\dot v_1 = x \qquad \dot v_2 = 0 \qquad \dot v_3 = x$$

Therefore:

$$\frac{\partial f}{\partial y} = x$$
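The two seeded passes can be checked numerically with a hand-written tangent-pair version of the same function (a sketch):

```python
import math

def f_forward(x, y, dot_x, dot_y):
    # f(x, y) = x*y + sin(x), with tangent pairs mirroring the derivation.
    v1, d1 = x * y, y * dot_x + x * dot_y
    v2, d2 = math.sin(x), math.cos(x) * dot_x
    return v1 + v2, d1 + d2

x, y = 2.0, 3.0
_, df_dx = f_forward(x, y, 1.0, 0.0)   # seed (1, 0): gives y + cos(x)
_, df_dy = f_forward(x, y, 0.0, 1.0)   # seed (0, 1): gives x
```

Each partial derivative costs one full evaluation of the tangent-pair program, one per seed direction.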

Forward Mode and Dual Numbers

Forward accumulation is naturally represented with dual numbers.

A dual number has the form:

$$a + b\varepsilon$$

where:

$$\varepsilon^2 = 0$$

Evaluate a function on:

$$x + \dot x\,\varepsilon$$

Then:

$$f(x + \dot x\,\varepsilon) = f(x) + f'(x)\,\dot x\,\varepsilon$$

The coefficient of $\varepsilon$ is the forward-mode tangent.

This gives an algebraic interpretation of forward accumulation: ordinary arithmetic is extended so that derivative information propagates automatically.
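A minimal dual-number implementation in Python makes this concrete (supporting only `+` and `*` for brevity; a full version would also overload subtraction, division, and the elementary functions):

```python
class Dual:
    """a + b*eps with eps**2 == 0; b carries the tangent."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, since eps^2 = 0
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
    __rmul__ = __mul__

def derivative(f, x):
    # Evaluate f on x + 1*eps; the eps coefficient is f'(x).
    return f(Dual(x, 1.0)).b

derivative(lambda t: t * t + 3 * t, 2.0)   # 7.0
```

Because `Dual` overloads ordinary arithmetic, any function built from the supported operators is differentiated without being rewritten.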

Control Flow

Forward accumulation follows the same branch decisions as primal execution.

Example:

if x > 0:
    y = x * x
else:
    y = -x

If $x > 0$, the derivative rule is:

$$\dot y = 2x\,\dot x$$

If $x < 0$, the derivative rule is:

$$\dot y = -\dot x$$

At $x = 0$, the program is not differentiable in the classical sense because the active branch changes. AD returns the derivative of the executed branch, not a symbolic piecewise derivative over all branches.

This distinction matters for programs with comparisons, clipping, thresholding, and discrete control.
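The per-branch behavior can be seen directly with a tangent-pair version of the branch above (an illustrative sketch):

```python
def branch_forward(x, dot_x):
    # Forward mode follows whichever branch the primal value selects.
    if x > 0:
        return x * x, 2 * x * dot_x   # derivative rule of the x*x branch
    else:
        return -x, -dot_x             # derivative rule of the -x branch

branch_forward(3.0, 1.0)    # (9.0, 6.0)
branch_forward(-3.0, 1.0)   # (3.0, -1.0)
# At x = 0 the else branch executes, so the returned tangent is -dot_x,
# even though the function is not classically differentiable there.
```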

Strengths of Forward Accumulation

Forward accumulation is simple and predictable.

It has several practical advantages:

  • no reverse tape is required
  • memory usage is low
  • implementation is straightforward
  • works naturally with loops and recursion
  • good for small-input, large-output functions
  • good for online derivative computation

Because tangents are computed immediately, forward mode does not need to store the entire computation graph for a later backward pass.

Limitations

Forward mode becomes expensive when the input dimension is large and the output dimension is small.

For:

$$f:\mathbb{R}^n \to \mathbb{R}$$

computing the full gradient with forward mode requires $n$ passes.

This is inefficient for neural networks, where $n$ may be millions or billions.

Reverse mode is preferred in that case because it computes the full gradient of a scalar-output function in one reverse pass.

Summary

Forward accumulation evaluates a program while propagating tangent values through each primitive operation.

Its core object is the pair:

$$(v, \dot v)$$

Its core computation is:

$$\dot y = J_f(x)\,\dot x$$

It is best understood as:

  • local derivative propagation
  • program transformation
  • dual-number arithmetic
  • Jacobian-vector product evaluation

Forward accumulation is the simplest operational form of automatic differentiation and the basis for many higher-order and mixed-mode techniques.