
Efficient Seeding Strategies

Forward mode automatic differentiation computes Jacobian-vector products:

$$ J_f(x)v. $$

The seed vector $v$ determines which derivative information propagates through the program. Choosing good seeds is therefore central to efficiency.

A naive approach computes one Jacobian column at a time using basis vectors:

$$ e_1, e_2, \ldots, e_n. $$

This always works, but it may waste computation when the Jacobian has structure. Efficient seeding strategies reduce the number of forward passes, reduce tangent storage, or improve hardware utilization.

Basis seeding

The simplest strategy uses standard basis vectors.

For

$$ f : \mathbb{R}^n \to \mathbb{R}^m, $$

seed:

$$ v = e_i. $$

Then forward mode computes:

$$ J_f(x)e_i, $$

which equals the $i$-th Jacobian column.

For example, if

$$ n = 3, $$

then:

$$ e_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \qquad e_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \qquad e_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}. $$

The full Jacobian is assembled column-by-column:

$$ J_f(x) = \begin{bmatrix} J_f(x)e_1 & J_f(x)e_2 & J_f(x)e_3 \end{bmatrix}. $$

This method is simple and numerically stable. Its cost is proportional to the input dimension:

$$ O(nC_f), $$

where $C_f$ is the cost of one evaluation of $f$.
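
As a minimal sketch in JAX, with `f` and `x` as hypothetical stand-ins, basis seeding computes one Jacobian column per `jax.jvp` call:

```python
import jax
import jax.numpy as jnp

# Hypothetical example function f : R^3 -> R^2.
def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
n = x.shape[0]

# One forward pass per basis vector: jvp with seed e_i yields column i.
cols = [jax.jvp(f, (x,), (jnp.eye(n)[i],))[1] for i in range(n)]
J = jnp.stack(cols, axis=1)  # Jacobian assembled column by column
```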

Dense identity seeding

Instead of one basis vector per pass, vector forward mode can seed all basis directions simultaneously.

Use:

$$ V = I_n. $$

Then:

$$ J_f(x)V = J_f(x). $$

This computes the entire Jacobian in one execution.

However, each variable now carries an $n$-dimensional tangent vector. The runtime and memory costs become

$$ O(nC_f), \qquad O(nM_f), $$

where $M_f$ is the memory required by one evaluation of $f$.

This strategy is practical only for moderate $n$.
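
A sketch of dense identity seeding in JAX (same stand-in function as above): mapping a JVP over the rows of $I_n$ propagates all tangent channels in one vectorized execution, which is essentially what `jax.jacfwd` does.

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.array([x[0] * x[1], jnp.sin(x[2])])  # stand-in function
x = jnp.array([1.0, 2.0, 3.0])
n = x.shape[0]

# All n tangent channels at once: vmap over the rows of the identity.
jvp_in = lambda v: jax.jvp(f, (x,), (v,))[1]
J = jax.vmap(jvp_in, out_axes=1)(jnp.eye(n))
# Same result as jax.jacfwd(f)(x), at O(n) tangent storage per variable.
```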

Arbitrary directional seeding

Sometimes only directional sensitivity is needed.

Suppose we want:

$$ J_f(x)v. $$

Then we directly seed:

$$ \dot{x} = v. $$

This avoids constructing the full Jacobian.

Applications include:

  • sensitivity analysis: perturbation direction
  • optimization: search direction
  • PDE solvers: linearized update
  • continuation methods: parameter path
  • implicit differentiation: perturbation constraint

Directional seeding is often the most efficient use of forward mode.
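
A directional query is a single JVP; the sketch below (hypothetical function and direction) never materializes the Jacobian.

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.array([x[0] * x[1], jnp.sin(x[2])])  # stand-in function
x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([0.5, -1.0, 2.0])  # e.g. a search direction (hypothetical)

# One forward pass computes J_f(x) v directly.
_, jv = jax.jvp(f, (x,), (v,))
```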

Random seeding

Random seeds are useful for probabilistic estimation and debugging.

Choose:

$$ v_i \sim \mathcal{N}(0,1). $$

Then compute:

$$ J_f(x)v. $$

Applications include:

  • Jacobian norm estimation: approximate the operator norm
  • randomized linear algebra: reveal low-rank structure
  • derivative testing: detect implementation bugs
  • trace estimation: Hutchinson-type methods
  • sensitivity sampling: stochastic analysis

Random directional derivatives often reveal structural problems without building the full Jacobian.
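
A common derivative test, sketched here with a hypothetical function: compare the AD directional derivative along a random probe against a central finite difference in the same direction.

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.array([x[0] * x[1], jnp.sin(x[2])])  # stand-in function
x = jnp.array([1.0, 2.0, 3.0])

v = jax.random.normal(jax.random.PRNGKey(0), x.shape)  # random probe
_, jv_ad = jax.jvp(f, (x,), (v,))

eps = 1e-4  # central finite difference along the same direction
jv_fd = (f(x + eps * v) - f(x - eps * v)) / (2 * eps)
assert jnp.allclose(jv_ad, jv_fd, atol=1e-3)  # a mismatch signals a bug
```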

Structured seeding

Many Jacobians are sparse. Basis seeding wastes work because most derivative entries are zero.

Suppose:

$$ J_f(x) = \begin{bmatrix} * & 0 & * & 0 \\ 0 & * & 0 & * \\ * & 0 & 0 & 0 \end{bmatrix}, $$

where $*$ marks a structurally nonzero entry.

Columns 1 and 2 never contribute to the same output row. Their seed directions can therefore share one tangent dimension.

Instead of:

$$ e_1, \quad e_2, \quad e_3, \quad e_4, $$

we may use compressed seeds:

$$ v_1 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}. $$

Now two forward passes recover all columns.

This is the basis of compressed Jacobian computation.

Graph coloring

Efficient sparse seeding is usually formulated as a graph coloring problem.

Define the column interference graph:

  • each Jacobian column is a node,
  • two columns share an edge if they can affect the same output row.

Columns with no conflict may share a tangent direction.

Coloring assigns independent columns the same seed channel.

If the coloring uses $k$ colors, the Jacobian can be recovered in $k$ forward passes instead of $n$.

For sparse systems:

$$ k \ll n. $$

This can reduce derivative cost dramatically.
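
As a sketch of the idea, here is a greedy coloring over a hypothetical boolean sparsity pattern `S`: two columns conflict if some row is nonzero in both, and each column takes the smallest color not used by a conflicting column.

```python
import numpy as np

def greedy_column_coloring(S):
    """Greedy coloring of the column interference graph of pattern S."""
    m, n = S.shape
    colors = [-1] * n
    for j in range(n):
        # Colors used by earlier columns that share a nonzero row with j.
        forbidden = {colors[i] for i in range(j)
                     if np.any(S[:, i] & S[:, j])}
        c = 0
        while c in forbidden:
            c += 1
        colors[j] = c
    return colors

# The 3x4 sparsity pattern from the structured seeding example.
S = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 0, 0]], dtype=bool)
print(greedy_column_coloring(S))  # [0, 0, 1, 1]: two seed channels suffice
```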

Applications:

  • PDE discretization: local coupling
  • finite element methods: mesh locality
  • circuit simulation: component connectivity
  • robotics: chain topology
  • optimization: block constraints

Example of compressed seeding

Suppose:

$$ J = \begin{bmatrix} a & 0 & c \\ 0 & b & 0 \end{bmatrix}. $$

Columns 1 and 2 do not overlap in rows.

Use compressed seed:

$$ v = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}. $$

Then:

$$ Jv = \begin{bmatrix} a \\ b \end{bmatrix}. $$

One forward pass recovers both columns simultaneously.

Next seed:

$$ v = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}. $$

This gives:

$$ \begin{bmatrix} c \\ 0 \end{bmatrix}. $$

Only two passes are needed instead of three.
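
A sketch of these two passes, using a hypothetical function that realizes the stated sparsity (with illustrative constants `a`, `b`, `c`):

```python
import jax
import jax.numpy as jnp

a, b, c = 2.0, 3.0, 5.0
g = lambda x: jnp.array([a * x[0] + c * x[2], b * x[1]])  # J = [[a,0,c],[0,b,0]]
x0 = jnp.zeros(3)

_, p1 = jax.jvp(g, (x0,), (jnp.array([1.0, 1.0, 0.0]),))  # [a, b]
_, p2 = jax.jvp(g, (x0,), (jnp.array([0.0, 0.0, 1.0]),))  # [c, 0]
# p1 holds columns 1 and 2 (in rows 1 and 2); p2 holds column 3.
```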

Block seeding

Variables often appear naturally in groups.

Suppose:

$$ x = (x^{(1)}, x^{(2)}, \ldots, x^{(k)}). $$

Each block may correspond to:

  • one physical subsystem,
  • one tensor,
  • one parameter group,
  • one mesh region,
  • one neural layer.

Block seeding assigns tangent dimensions per block rather than per scalar variable.

Advantages:

  • better cache locality: contiguous tangent access
  • vectorized kernels: tensor-friendly layout
  • simpler memory management: fewer sparse structures
  • reduced overhead: coarser granularity

This is common in tensor AD systems.

Tensor seeding

Machine learning systems usually work with tensors rather than scalar variables.

Suppose:

$$ W \in \mathbb{R}^{m \times n}. $$

Instead of seeding each scalar independently, a tangent tensor is attached:

$$ \dot{W} \in \mathbb{R}^{m \times n}. $$

Tensor-level seeding allows:

  • batched JVPs,
  • hardware tensor acceleration,
  • fused tangent kernels,
  • layerwise differentiation.

Modern frameworks frequently optimize around tensor tangent propagation rather than scalar dual arithmetic.
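
A sketch of tensor-level seeding in JAX, with a hypothetical single layer: the tangent is one matrix `W_dot` attached to `W`, not $mn$ scalar seeds.

```python
import jax
import jax.numpy as jnp

W = jax.random.normal(jax.random.PRNGKey(1), (4, 3))  # hypothetical weights
x_in = jnp.arange(3.0)

layer = lambda W: jnp.tanh(W @ x_in)         # differentiate w.r.t. W only
W_dot = jnp.ones_like(W)                     # tangent tensor attached to W
_, out_dot = jax.jvp(layer, (W,), (W_dot,))  # one fused primal-tangent pass
```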

Seeding for Hessian-vector products

Forward mode combines naturally with reverse mode.

Suppose:

$$ f : \mathbb{R}^n \to \mathbb{R}. $$

Its gradient is:

$$ \nabla f(x). $$

Differentiate the gradient in direction $v$:

$$ H_f(x)v, $$

where $H_f(x)$ is the Hessian.

A common strategy:

  1. compute gradient with reverse mode,
  2. apply forward mode to the reverse computation.

The forward seed defines the Hessian-vector direction.

This avoids explicit Hessian construction.
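
In JAX this strategy is a two-liner; the sketch below uses a simple test function whose Hessian is known.

```python
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    # Reverse mode builds the gradient; forward mode differentiates it along v.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

f = lambda x: jnp.sum(x**4)                 # Hessian is diag(12 x^2)
x = jnp.array([1.0, 2.0])
print(hvp(f, x, jnp.array([1.0, 0.0])))     # [12., 0.]: first Hessian column
```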

Seed matrices

General vector forward mode uses a seed matrix:

$$ V \in \mathbb{R}^{n \times k}. $$

Each column is one tangent direction.

The output is:

$$ J_f(x)V. $$

Choice of $V$ determines the result:

  • identity: full Jacobian
  • basis subset: selected columns
  • random: stochastic directional probes
  • sparse compressed: sparse Jacobian recovery
  • low-rank: projected sensitivities
  • block structured: subsystem derivatives

Thus seeding defines the derivative query algebraically.
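
A sketch of a general seed matrix in JAX, reusing the stand-in function from earlier: each column of `V` is one tangent channel, propagated together.

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.array([x[0] * x[1], jnp.sin(x[2])])  # stand-in function
x = jnp.array([1.0, 2.0, 3.0])

V = jnp.array([[1.0, 0.0],   # n = 3 inputs, k = 2 tangent channels
               [1.0, 0.0],
               [0.0, 1.0]])
JV = jax.vmap(lambda v: jax.jvp(f, (x,), (v,))[1],
              in_axes=1, out_axes=1)(V)  # J_f(x) V, shape (m, k)
```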

Adaptive seeding

Some systems choose seeds dynamically.

Example strategies:

  • runtime sparsity discovery: reduce tangent dimensions
  • adaptive coloring: exploit observed dependencies
  • low-rank updates: reduce the directional count
  • dynamic batching: improve GPU occupancy
  • selective propagation: avoid inactive paths

Adaptive methods are important when derivative structure changes across executions.

Seed reuse

In iterative algorithms, the same derivative directions often appear repeatedly.

Examples:

  • Newton-Krylov methods: Krylov basis vectors
  • continuation methods: path tangent
  • optimization: search direction
  • sensitivity analysis: parameter perturbations

Caching or reusing tangent allocations can reduce overhead.

Some systems preallocate tangent buffers to avoid repeated memory allocation during many JVP evaluations.
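
One hedged example of reuse in JAX, with hypothetical names: jit-compile the JVP once and call it with the same direction across iterations, so repeated evaluations avoid retracing.

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.array([x[0] * x[1], jnp.sin(x[2])])    # stand-in function
jvp_f = jax.jit(lambda x, v: jax.jvp(f, (x,), (v,))[1])  # compiled once

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, 0.0])  # direction reused every iteration
for _ in range(5):
    jv = jvp_f(x, v)            # same compiled kernel each call
    x = x - 0.1 * jnp.ones(3)   # placeholder iterative update
```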

Forward-over-forward seeding

Nested forward mode introduces hierarchical tangent structures.

Suppose:

$$ x + \epsilon_1 a + \epsilon_2 b + \epsilon_1 \epsilon_2 c. $$

Now seeds exist at multiple derivative levels.

This enables:

  • forward-over-forward: second derivatives
  • vector-forward-over-forward: Hessians
  • nested sparse tangents: structured higher-order tensors

Seed management becomes increasingly important as derivative order grows.
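
A sketch of forward-over-forward in JAX, with a hypothetical scalar function: the inner JVP seeds direction $u$, and differentiating that result along $w$ yields the second-order quantity $u^\top H_f(x)\, w$.

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.prod(x)            # hypothetical scalar function
x = jnp.array([1.0, 2.0, 3.0])
u = jnp.array([1.0, 0.0, 0.0])
w = jnp.array([0.0, 1.0, 0.0])

df_u = lambda x: jax.jvp(f, (x,), (u,))[1]   # inner seed: x -> grad f(x) . u
_, d2 = jax.jvp(df_u, (x,), (w,))            # outer seed: u^T H_f(x) w
print(d2)                                    # 3.0 (= d^2 f / dx0 dx1 = x2)
```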

Numerical conditioning

The seed direction also affects numerical behavior.

If:

$$ J_f(x)v \approx 0, $$

small floating point errors may dominate.

Large or badly scaled seed vectors can amplify instability.

Good practice includes:

  • scaling tangent directions,
  • normalizing perturbation vectors (see the sketch below),
  • avoiding seeds whose nonzero entries have extremely small magnitudes,
  • using structured seeds aligned with problem geometry.

This becomes important in ill-conditioned scientific problems.
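
Because the JVP is linear in the tangent, one simple mitigation (a sketch, with a hypothetical badly scaled direction) is to propagate a normalized seed and rescale the result afterwards:

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.array([x[0] * x[1], jnp.sin(x[2])])  # stand-in function
x = jnp.array([1.0, 2.0, 3.0])

v_raw = jnp.array([1e-8, 0.0, 3e-8])             # badly scaled direction
scale = jnp.linalg.norm(v_raw)
_, jv_unit = jax.jvp(f, (x,), (v_raw / scale,))  # propagate at unit scale
jv = scale * jv_unit                             # linearity recovers J_f(x) v_raw
```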

Seeding and hardware layout

Efficient implementations often organize tangent dimensions to match hardware.

CPU implementations may use:

  • SIMD lane packing,
  • structure-of-arrays layouts,
  • cache-aligned tangent blocks.

GPU implementations may use:

  • batched tensor kernels,
  • warp-aligned tangent dimensions,
  • fused primal-tangent kernels.

The algebraic seed structure therefore interacts directly with low-level systems design.

Tradeoffs

Different seeding strategies optimize different goals.

  • basis seeding: simple, but requires many passes
  • dense identity: full Jacobians of small problems in one pass, but memory explodes with $n$
  • directional: efficient sensitivity queries, but incomplete information
  • sparse compressed: exploits sparse systems, but pays coloring overhead
  • block seeding: fits tensor programs, but coarse granularity
  • random: cheap stochastic estimation, but approximate results
No single strategy dominates all workloads.

Summary

The seed vector or seed matrix determines which derivative information forward mode computes. Efficient seeding strategies exploit structure in the derivative problem to reduce runtime, memory usage, and tangent dimensionality.

The simplest strategy uses basis vectors to recover Jacobian columns individually. More advanced methods use compressed sparse seeds, graph coloring, block tangents, tensor seeds, or randomized projections. In large-scale systems, the efficiency of forward mode depends as much on seeding strategy as on the differentiation rules themselves.