Sparse Forward Methods

Many real-world Jacobians are sparse. Most derivative entries are zero because outputs depend only on small subsets of inputs.

Forward mode automatic differentiation can exploit this structure to reduce:

  • runtime,
  • memory usage,
  • tangent dimensionality,
  • cache pressure,
  • bandwidth cost.

Sparse forward methods are techniques for propagating only the derivative information that actually matters.

Sparse Jacobians

Suppose

$$ f : \mathbb{R}^n \to \mathbb{R}^m. $$

The Jacobian is

$$ J_f(x) = \left[ \frac{\partial f_i}{\partial x_j} \right]. $$

In many systems:

$$ \frac{\partial f_i}{\partial x_j} = 0 $$

for most pairs $(i,j)$.

Example:

$$ f(x_1,x_2,x_3,x_4) = \begin{bmatrix} x_1x_2 \\ x_3 + x_4 \end{bmatrix}. $$

The Jacobian is

$$ J_f(x) = \begin{bmatrix} x_2 & x_1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}. $$

Only four entries are nonzero.

Dense forward mode still propagates tangent information for all variables. Sparse forward mode tries to avoid this unnecessary work.

Sources of sparsity

Sparse derivatives appear naturally in many domains.

| Domain | Sparsity source |
| --- | --- |
| PDE discretization | local spatial interactions |
| finite element methods | mesh locality |
| circuit simulation | component connectivity |
| robotics | chain topology |
| graph algorithms | neighborhood structure |
| optimization | block constraints |
| databases | local relational dependencies |
| physical simulation | finite-range forces |

In these systems, each output depends on only a small subset of inputs.

Dependency propagation

Forward mode propagates tangent information through dependencies.

If:

$$ z = x + y, $$

then:

$$ \dot{z} = \dot{x} + \dot{y}. $$

If both tangents are sparse, the result stays sparse.

Example:

$$ \dot{x} = \{(1,a)\}, \qquad \dot{y} = \{(5,b)\}. $$

Then:

$$ \dot{z} = \{(1,a),(5,b)\}. $$

Only active derivative channels propagate.

Sparse tangent representation

Instead of dense tangent vectors:

$$ \dot{x} \in \mathbb{R}^k, $$

store only nonzero entries.

One representation:

type SparseTangent struct {
    Indices []int
    Values  []float64
}

A dual value becomes:

type SparseDual struct {
    Value   float64
    Tangent SparseTangent
}

Operations merge and update sparse derivative entries.

Sparse addition

Suppose:

$$ z = x + y. $$

Then:

$$ \dot{z} = \dot{x} + \dot{y}. $$

Sparse implementation:

func Add(a, b SparseDual) SparseDual {
    return SparseDual{
        Value:   a.Value + b.Value,
        Tangent: MergeAdd(a.Tangent, b.Tangent),
    }
}

Only active derivative channels are touched.
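The MergeAdd helper used above is left unspecified in the text. A minimal sketch, assuming each tangent keeps its Indices sorted in ascending order:

```go
package main

import "fmt"

// SparseTangent stores only nonzero derivative entries;
// Indices are assumed sorted in ascending order.
type SparseTangent struct {
	Indices []int
	Values  []float64
}

// MergeAdd sums two sparse tangents with a two-pointer merge
// over their sorted index lists, adding values where indices collide.
func MergeAdd(a, b SparseTangent) SparseTangent {
	var out SparseTangent
	i, j := 0, 0
	for i < len(a.Indices) && j < len(b.Indices) {
		switch {
		case a.Indices[i] < b.Indices[j]:
			out.Indices = append(out.Indices, a.Indices[i])
			out.Values = append(out.Values, a.Values[i])
			i++
		case a.Indices[i] > b.Indices[j]:
			out.Indices = append(out.Indices, b.Indices[j])
			out.Values = append(out.Values, b.Values[j])
			j++
		default: // same derivative channel active in both: add values
			out.Indices = append(out.Indices, a.Indices[i])
			out.Values = append(out.Values, a.Values[i]+b.Values[j])
			i++
			j++
		}
	}
	for ; i < len(a.Indices); i++ { // remaining tail of a
		out.Indices = append(out.Indices, a.Indices[i])
		out.Values = append(out.Values, a.Values[i])
	}
	for ; j < len(b.Indices); j++ { // remaining tail of b
		out.Indices = append(out.Indices, b.Indices[j])
		out.Values = append(out.Values, b.Values[j])
	}
	return out
}

func main() {
	a := SparseTangent{Indices: []int{1}, Values: []float64{2}}
	b := SparseTangent{Indices: []int{1, 5}, Values: []float64{3, 4}}
	fmt.Println(MergeAdd(a, b)) // channels {1, 5} with values {5, 4}
}
```

The merge costs O(|a| + |b|) in the number of active entries, which is what makes sparse addition cheap.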

Sparse multiplication

For:

$$ z = xy, $$

the tangent rule is:

$$ \dot{z} = y\dot{x} + x\dot{y}. $$

Sparse implementation scales active entries:

func Mul(a, b SparseDual) SparseDual {
    ta := Scale(a.Tangent, b.Value)
    tb := Scale(b.Tangent, a.Value)

    return SparseDual{
        Value:   a.Value * b.Value,
        Tangent: MergeAdd(ta, tb),
    }
}

The complexity depends on the number of active tangent entries, not the total input dimension.
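Assembled into a runnable sketch, the running example $f(x_1,x_2,x_3,x_4) = [x_1x_2,\; x_3+x_4]$ can be differentiated with sparse duals. Seed and the map-based MergeAdd here are illustrative helpers, not a fixed API:

```go
package main

import (
	"fmt"
	"sort"
)

type SparseTangent struct {
	Indices []int
	Values  []float64
}

type SparseDual struct {
	Value   float64
	Tangent SparseTangent
}

// Seed makes input variable number index with a unit tangent channel.
func Seed(v float64, index int) SparseDual {
	return SparseDual{v, SparseTangent{[]int{index}, []float64{1}}}
}

// Scale multiplies every active entry by c; absent entries stay absent.
func Scale(t SparseTangent, c float64) SparseTangent {
	out := SparseTangent{Indices: append([]int(nil), t.Indices...)}
	for _, v := range t.Values {
		out.Values = append(out.Values, c*v)
	}
	return out
}

// MergeAdd unions two sparse tangents (map-based for brevity;
// a sorted two-pointer merge would avoid the map overhead).
func MergeAdd(a, b SparseTangent) SparseTangent {
	m := map[int]float64{}
	for i, idx := range a.Indices {
		m[idx] += a.Values[i]
	}
	for i, idx := range b.Indices {
		m[idx] += b.Values[i]
	}
	keys := make([]int, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Ints(keys)
	var out SparseTangent
	for _, k := range keys {
		out.Indices = append(out.Indices, k)
		out.Values = append(out.Values, m[k])
	}
	return out
}

func Add(a, b SparseDual) SparseDual {
	return SparseDual{a.Value + b.Value, MergeAdd(a.Tangent, b.Tangent)}
}

func Mul(a, b SparseDual) SparseDual {
	return SparseDual{a.Value * b.Value,
		MergeAdd(Scale(a.Tangent, b.Value), Scale(b.Tangent, a.Value))}
}

func main() {
	x1, x2, x3, x4 := Seed(3, 0), Seed(5, 1), Seed(7, 2), Seed(9, 3)
	f1 := Mul(x1, x2) // row 1 of J: (x2, x1, 0, 0)
	f2 := Add(x3, x4) // row 2 of J: (0, 0, 1, 1)
	fmt.Println(f1.Tangent) // channels {0, 1}, values {5, 3}
	fmt.Println(f2.Tangent) // channels {2, 3}, values {1, 1}
}
```

Each output's tangent holds exactly the nonzero entries of its Jacobian row; the zero columns are never represented, let alone computed.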

Complexity reduction

Suppose:

  • primal cost is $C_f$,
  • dense tangent dimension is $n$,
  • each variable depends on only $s \ll n$ active directions.

Dense forward mode costs roughly:

$$ O(nC_f). $$

Sparse forward mode reduces this toward:

$$ O(sC_f). $$

The exact gain depends on:

  • sparsity stability,
  • tangent overlap,
  • data structure overhead,
  • cache locality.

Column compression

Sparse Jacobians often allow compressed seeding.

Suppose columns never overlap in output rows.

Example:

$$ J = \begin{bmatrix} * & 0 & * & 0 \\ 0 & * & 0 & * \end{bmatrix}. $$

Columns 1 and 2 are independent. Columns 3 and 4 are independent.

Instead of four basis seeds:

$$ e_1, e_2, e_3, e_4, $$

use two compressed seeds:

$$ v_1 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}. $$

Then:

$$ Jv_1 = \begin{bmatrix} * \\ * \end{bmatrix}, \qquad Jv_2 = \begin{bmatrix} * \\ * \end{bmatrix}. $$

The original columns can be reconstructed because their row supports do not overlap.

This reduces the number of forward passes.
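The reconstruction step can be sketched concretely. Below, the values 2, 5, 3, 7 stand in for the four `*` entries of the 2×4 pattern above, and the column groups mirror the compressed seeds $v_1$, $v_2$:

```go
package main

import "fmt"

// matVec is one dense forward product J v (one forward pass).
func matVec(J [2][4]float64, v [4]float64) [2]float64 {
	var y [2]float64
	for i := 0; i < 2; i++ {
		for j := 0; j < 4; j++ {
			y[i] += J[i][j] * v[j]
		}
	}
	return y
}

func main() {
	// Concrete instance of the pattern [* 0 * 0; 0 * 0 *].
	J := [2][4]float64{{2, 0, 3, 0}, {0, 5, 0, 7}}

	// Column groups with disjoint row supports (0-based: {0,1} and {2,3}).
	groups := [][]int{{0, 1}, {2, 3}}

	var rec [2][4]float64
	for _, g := range groups {
		// Compressed seed: sum of the group's basis vectors.
		var seed [4]float64
		for _, j := range g {
			seed[j] = 1
		}
		w := matVec(J, seed) // compressed product J v_g

		// Decompress: each row entry of w belongs to the unique column
		// of the group that is structurally nonzero in that row.
		for _, j := range g {
			for i := 0; i < 2; i++ {
				if J[i][j] != 0 { // structural pattern, assumed known
					rec[i][j] = w[i]
				}
			}
		}
	}
	fmt.Println(rec == J) // true: two passes recovered all four columns
}
```

Two forward passes recover all four columns exactly because no group ever places two nonzeros in the same row.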

Graph coloring

Compressed sparse forward mode is usually formulated using graph coloring.

Define the column intersection graph:

  • each Jacobian column is a node,
  • two columns are connected if they share a nonzero row.

Columns with no edge may share one tangent channel.

Coloring assigns independent columns the same seed dimension.

If the graph requires $k$ colors, the Jacobian can be computed using $k$ tangent directions instead of $n$.

For sparse systems:

$$ k \ll n. $$

This is one of the most important techniques in sparse automatic differentiation.
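A greedy sketch of this coloring, operating directly on column row-supports. Optimal coloring is NP-hard, so greedy heuristics are standard in practice; colorColumns is a hypothetical helper name:

```go
package main

import "fmt"

// colorColumns assigns each column a color so that columns sharing a
// color have disjoint row supports, i.e. no edge in the column
// intersection graph. Greedy: reuse the lowest color that still fits.
func colorColumns(supports [][]int, numRows int) []int {
	color := make([]int, len(supports))
	var rowUsed [][]bool // rowUsed[c][r]: color c already covers row r
	for j, rows := range supports {
		assigned := -1
		for c := range rowUsed {
			free := true
			for _, r := range rows {
				if rowUsed[c][r] {
					free = false
					break
				}
			}
			if free {
				assigned = c
				break
			}
		}
		if assigned == -1 { // open a new color (a new tangent channel)
			rowUsed = append(rowUsed, make([]bool, numRows))
			assigned = len(rowUsed) - 1
		}
		for _, r := range rows {
			rowUsed[assigned][r] = true
		}
		color[j] = assigned
	}
	return color
}

func main() {
	// Column supports of the 2x3 example J = [* 0 *; 0 * *] (0-based rows).
	supports := [][]int{{0}, {1}, {0, 1}}
	fmt.Println(colorColumns(supports, 2)) // [0 0 1]: two colors, not three
}
```

The number of distinct colors returned is exactly the number of compressed tangent directions needed.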

Example: coloring

Suppose:

$$ J = \begin{bmatrix} * & 0 & * \\ 0 & * & * \end{bmatrix}. $$

Columns 1 and 2 do not overlap.

Assign colors:

| Column | Color |
| --- | --- |
| 1 | red |
| 2 | red |
| 3 | blue |

Use seed matrix:

$$ V = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}. $$

Then:

$$ JV $$

contains enough information to reconstruct all columns.

Only two tangent dimensions are needed instead of three.

Structural versus numerical sparsity

Sparse forward methods usually exploit structural sparsity.

Structural sparsity asks:

Which entries can ever become nonzero?

Numerical sparsity asks:

Which entries happen to be zero for this input?

Structural sparsity is preferred because it remains stable across executions.

Example:

$$ f(x,y) = xy. $$

At x=0x=0,

$$ \frac{\partial f}{\partial y} = 0. $$

This is numerical sparsity only. Structurally, the derivative may be nonzero elsewhere.

Sparse AD systems usually optimize using structural sparsity patterns.
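Structural patterns can be propagated as pure index sets, with no values at all. A minimal sketch (the Pattern type and union helper are illustrative):

```go
package main

import "fmt"

// Pattern is the set of input indices an expression can ever depend on.
type Pattern map[int]bool

// union is the structural rule shared by addition and multiplication:
// the result may depend on anything either operand depends on.
func union(a, b Pattern) Pattern {
	out := Pattern{}
	for k := range a {
		out[k] = true
	}
	for k := range b {
		out[k] = true
	}
	return out
}

func main() {
	x := Pattern{0: true} // input x
	y := Pattern{1: true} // input y

	// f(x, y) = x*y: structurally, f depends on both inputs,
	// even at points like x = 0 where df/dy is numerically zero.
	f := union(x, y)
	fmt.Println(len(f), f[0], f[1]) // 2 true true
}
```

Because only structure is tracked, this analysis is valid for every input, which is what makes the resulting seed compression safe to reuse across executions.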

Dynamic sparsity

Some programs change dependency structure at runtime.

Example:

if x > 0:
    y = a * b
else:
    y = c * d

The active derivative structure depends on execution.

Dynamic sparsity requires:

  • runtime dependency discovery,
  • adaptive tangent structures,
  • dynamic graph coloring.

This is harder than static sparsity analysis.

Sparse matrix primitives

Linear algebra operations often preserve sparsity structure.

For:

$$ z = Ax, $$

where $A$ is sparse,

$$ \dot{z} = \dot{A}x + A\dot{x}. $$

Sparse forward systems should preserve sparse matrix representations throughout tangent propagation.

Naively densifying tangents destroys performance.
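For a constant sparse $A$ (so $\dot{A} = 0$), the rule reduces to $\dot{z} = A\dot{x}$, and the same sparse kernel serves both the primal and the tangent. A CSR sketch, with illustrative names:

```go
package main

import "fmt"

// CSR is a compressed sparse row matrix: row i owns entries
// Val[RowPtr[i]:RowPtr[i+1]] in columns Col[RowPtr[i]:RowPtr[i+1]].
type CSR struct {
	RowPtr []int
	Col    []int
	Val    []float64
}

// MatVec computes y = A x, touching only stored nonzeros.
func (a CSR) MatVec(x []float64) []float64 {
	y := make([]float64, len(a.RowPtr)-1)
	for i := 0; i+1 < len(a.RowPtr); i++ {
		for k := a.RowPtr[i]; k < a.RowPtr[i+1]; k++ {
			y[i] += a.Val[k] * x[a.Col[k]]
		}
	}
	return y
}

func main() {
	// A = [2 0; 1 3] stored in CSR form.
	A := CSR{RowPtr: []int{0, 1, 3}, Col: []int{0, 0, 1}, Val: []float64{2, 1, 3}}

	x := []float64{4, 5}
	xdot := []float64{1, 0} // tangent of x

	z := A.MatVec(x)       // primal:  [8 19]
	zdot := A.MatVec(xdot) // tangent: A x-dot = [2 1]
	fmt.Println(z, zdot)
}
```

Both products cost O(nnz), and the tangent never materializes the zero structure of $A$.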

Sparse PDE example

Consider a one-dimensional finite difference stencil:

$$ f_i(x) = x_{i-1} - 2x_i + x_{i+1}. $$

Each output depends only on neighboring inputs.

The Jacobian is tridiagonal:

$$ J = \begin{bmatrix} * & * & 0 & 0 & \cdots \\ * & * & * & 0 & \cdots \\ 0 & * & * & * & \cdots \\ \vdots & & & & \ddots \end{bmatrix}. $$

The column intersection graph has low degree. Coloring requires only a few colors independent of problem size.

Thus sparse forward mode computes derivatives with nearly constant tangent dimension even for huge systems.

This is one reason sparse forward AD is important in scientific computing.
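Column $j$ of this stencil touches only rows $j-1$, $j$, $j+1$, so columns whose indices agree mod 3 never share a row and three compressed seeds suffice for any $n$. Since the stencil is linear, $Jv$ equals the stencil applied to $v$, which keeps the sketch short:

```go
package main

import "fmt"

// stencil applies f_i(x) = x_{i-1} - 2 x_i + x_{i+1}
// (boundary terms dropped at the ends).
func stencil(x []float64) []float64 {
	n := len(x)
	f := make([]float64, n)
	for i := 0; i < n; i++ {
		if i > 0 {
			f[i] += x[i-1]
		}
		f[i] -= 2 * x[i]
		if i < n-1 {
			f[i] += x[i+1]
		}
	}
	return f
}

func main() {
	n := 8

	// Three compressed seeds: seed c sums all basis vectors e_j with j%3 == c.
	// f is linear, so J v = stencil(v): three passes total, independent of n.
	var w [3][]float64
	for c := 0; c < 3; c++ {
		v := make([]float64, n)
		for j := c; j < n; j += 3 {
			v[j] = 1
		}
		w[c] = stencil(v)
	}

	// Decompress: J[i][j] is nonzero only for |i-j| <= 1, and within one
	// seed those columns never collide, so J[i][j] = w[j%3][i].
	J := make([][]float64, n)
	for i := range J {
		J[i] = make([]float64, n)
		lo, hi := i-1, i+1
		if lo < 0 {
			lo = 0
		}
		if hi > n-1 {
			hi = n - 1
		}
		for j := lo; j <= hi; j++ {
			J[i][j] = w[j%3][i]
		}
	}
	fmt.Println(J[3]) // row 3 of J: [0 0 1 -2 1 0 0 0]
}
```

Doubling n leaves the number of forward passes at three; only the per-pass cost grows.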

Sparse Hessians

Forward mode also helps compute sparse Hessians.

Common strategies include:

| Method | Idea |
| --- | --- |
| forward-over-reverse | directional Hessian products |
| sparse coloring | compressed second derivatives |
| hyper-dual sparsity | exact second-order channels |

Sparse Hessian methods are widely used in nonlinear optimization and PDE-constrained optimization.

Tradeoffs

Sparse forward methods introduce overhead.

| Benefit | Cost |
| --- | --- |
| fewer tangent channels | sparse bookkeeping |
| reduced arithmetic | indirect memory access |
| lower memory usage | merge overhead |
| compressed Jacobians | graph coloring complexity |

Sparse methods help only when sparsity is sufficiently strong.

If tangent vectors become dense during propagation, sparse representations may become slower than dense arrays.

Fill-in

A major challenge is fill-in.

Even if inputs are sparse, operations may densify tangents.

Example:

$$ z = x + y, $$

where:

$$ \dot{x} = \{1,2,3\}, \qquad \dot{y} = \{4,5,6\}. $$

Then:

$$ \dot{z} = \{1,2,3,4,5,6\}. $$

Repeated merges can gradually destroy sparsity.

Good sparse forward systems minimize fill-in through:

  • graph ordering,
  • block decomposition,
  • local elimination strategies,
  • compressed seed scheduling.

Hybrid sparse-dense methods

Practical systems often combine sparse and dense tangent representations.

Examples:

| Regime | Representation |
| --- | --- |
| few active entries | sparse |
| moderate density | packed blocks |
| high density | dense arrays |

Adaptive switching avoids pathological overhead.

Compiler support

Sparse forward mode benefits from compiler analysis.

Compilers can infer:

  • dependency graphs,
  • sparsity structure,
  • block patterns,
  • independent variable groups.

This enables automatic seed compression and optimized tangent layouts.

Modern differentiable compilers increasingly integrate sparse derivative analysis directly into IR optimization.

Sparse forward mode versus reverse mode

Sparse forward mode is especially attractive when:

$$ n \ll m $$

or when directional derivatives dominate.

Reverse mode can also exploit sparsity, but sparse reverse accumulation is usually more complex because adjoints flow backward through the graph and require accumulation from many paths.

Forward sparsity propagation is often simpler and more local.

Summary

Sparse forward methods exploit derivative sparsity to reduce tangent dimensionality, runtime, and memory cost. Instead of propagating dense tangent vectors, they propagate only active derivative channels.

Core techniques include sparse tangent representations, compressed seeding, graph coloring, block propagation, and adaptive sparsity management. These methods are essential for large scientific and engineering systems where Jacobians are sparse and structured.