Sparse Forward Methods

Many real-world Jacobians are sparse. Most derivative entries are zero because outputs depend only on small subsets of inputs.

Forward mode automatic differentiation can exploit this structure to reduce:

  • runtime,
  • memory usage,
  • tangent dimensionality,
  • cache pressure,
  • bandwidth cost.

Sparse forward methods are techniques for propagating only the derivative information that actually matters.

Sparse Jacobians

Suppose

$$ f : \mathbb{R}^n \to \mathbb{R}^m. $$

The Jacobian is

$$ J_f(x) = \left[ \frac{\partial f_i}{\partial x_j} \right]. $$

In many systems:

$$ \frac{\partial f_i}{\partial x_j} = 0 $$

for most pairs $(i,j)$.

Example:

$$ f(x_1,x_2,x_3,x_4) = \begin{bmatrix} x_1x_2 \\ x_3 + x_4 \end{bmatrix}. $$

The Jacobian is

$$ J_f(x) = \begin{bmatrix} x_2 & x_1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}. $$

Only four entries are nonzero.

Dense forward mode still propagates tangent information for all variables. Sparse forward mode tries to avoid this unnecessary work.

Sources of sparsity

Sparse derivatives appear naturally in many domains.

| Domain | Sparsity source |
| --- | --- |
| PDE discretization | local spatial interactions |
| finite element methods | mesh locality |
| circuit simulation | component connectivity |
| robotics | chain topology |
| graph algorithms | neighborhood structure |
| optimization | block constraints |
| databases | local relational dependencies |
| physical simulation | finite-range forces |

In these systems, each output depends on only a small subset of inputs.

Dependency propagation

Forward mode propagates tangent information through dependencies.

If:

$$ z = x + y, $$

then:

$$ \dot{z} = \dot{x} + \dot{y}. $$

If both tangents are sparse, the result stays sparse.

Example:

$$ \dot{x} = \{(1,a)\}, \qquad \dot{y} = \{(5,b)\}. $$

Then:

$$ \dot{z} = \{(1,a),(5,b)\}. $$

Only active derivative channels propagate.

Sparse tangent representation

Instead of dense tangent vectors:

$$ \dot{x} \in \mathbb{R}^k, $$

store only nonzero entries.

One representation:

type SparseTangent struct {
    Indices []int
    Values  []float64
}

A dual value becomes:

type SparseDual struct {
    Value   float64
    Tangent SparseTangent
}

Operations merge and update sparse derivative entries.

Sparse addition

Suppose:

$$ z = x + y. $$

Then:

$$ \dot{z} = \dot{x} + \dot{y}. $$

Sparse implementation:

func Add(a, b SparseDual) SparseDual {
    return SparseDual{
        Value:   a.Value + b.Value,
        Tangent: MergeAdd(a.Tangent, b.Tangent),
    }
}

Only active derivative channels are touched.
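The MergeAdd helper used above is left unspecified in the text. A minimal sketch, assuming each tangent keeps its Indices sorted in ascending order:

```go
package main

import "fmt"

// SparseTangent stores only nonzero derivative entries;
// Indices are assumed sorted in ascending order.
type SparseTangent struct {
	Indices []int
	Values  []float64
}

// MergeAdd sums two sparse tangents with a two-pointer merge
// over their sorted index lists, adding values where indices collide.
func MergeAdd(a, b SparseTangent) SparseTangent {
	var out SparseTangent
	i, j := 0, 0
	for i < len(a.Indices) && j < len(b.Indices) {
		switch {
		case a.Indices[i] < b.Indices[j]:
			out.Indices = append(out.Indices, a.Indices[i])
			out.Values = append(out.Values, a.Values[i])
			i++
		case a.Indices[i] > b.Indices[j]:
			out.Indices = append(out.Indices, b.Indices[j])
			out.Values = append(out.Values, b.Values[j])
			j++
		default: // same derivative channel active in both: add values
			out.Indices = append(out.Indices, a.Indices[i])
			out.Values = append(out.Values, a.Values[i]+b.Values[j])
			i++
			j++
		}
	}
	for ; i < len(a.Indices); i++ { // remaining tail of a
		out.Indices = append(out.Indices, a.Indices[i])
		out.Values = append(out.Values, a.Values[i])
	}
	for ; j < len(b.Indices); j++ { // remaining tail of b
		out.Indices = append(out.Indices, b.Indices[j])
		out.Values = append(out.Values, b.Values[j])
	}
	return out
}

func main() {
	a := SparseTangent{Indices: []int{1}, Values: []float64{2}}
	b := SparseTangent{Indices: []int{1, 5}, Values: []float64{3, 4}}
	fmt.Println(MergeAdd(a, b)) // channels {1, 5} with values {5, 4}
}
```

The merge costs O(|a| + |b|) in the number of active entries, which is what makes sparse addition cheap.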

Sparse multiplication

For:

$$ z = xy, $$

the tangent rule is:

$$ \dot{z} = y\dot{x} + x\dot{y}. $$

Sparse implementation scales active entries:

func Mul(a, b SparseDual) SparseDual {
    ta := Scale(a.Tangent, b.Value)
    tb := Scale(b.Tangent, a.Value)

    return SparseDual{
        Value:   a.Value * b.Value,
        Tangent: MergeAdd(ta, tb),
    }
}

The complexity depends on the number of active tangent entries, not the total input dimension.
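Assembled into a runnable sketch, the running example $f(x_1,x_2,x_3,x_4) = [x_1x_2,\; x_3+x_4]$ can be differentiated with sparse duals. Seed and the map-based MergeAdd here are illustrative helpers, not a fixed API:

```go
package main

import (
	"fmt"
	"sort"
)

type SparseTangent struct {
	Indices []int
	Values  []float64
}

type SparseDual struct {
	Value   float64
	Tangent SparseTangent
}

// Seed makes input variable number index with a unit tangent channel.
func Seed(v float64, index int) SparseDual {
	return SparseDual{v, SparseTangent{[]int{index}, []float64{1}}}
}

// Scale multiplies every active entry by c; absent entries stay absent.
func Scale(t SparseTangent, c float64) SparseTangent {
	out := SparseTangent{Indices: append([]int(nil), t.Indices...)}
	for _, v := range t.Values {
		out.Values = append(out.Values, c*v)
	}
	return out
}

// MergeAdd unions two sparse tangents (map-based for brevity;
// a sorted two-pointer merge would avoid the map overhead).
func MergeAdd(a, b SparseTangent) SparseTangent {
	m := map[int]float64{}
	for i, idx := range a.Indices {
		m[idx] += a.Values[i]
	}
	for i, idx := range b.Indices {
		m[idx] += b.Values[i]
	}
	keys := make([]int, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Ints(keys)
	var out SparseTangent
	for _, k := range keys {
		out.Indices = append(out.Indices, k)
		out.Values = append(out.Values, m[k])
	}
	return out
}

func Add(a, b SparseDual) SparseDual {
	return SparseDual{a.Value + b.Value, MergeAdd(a.Tangent, b.Tangent)}
}

func Mul(a, b SparseDual) SparseDual {
	return SparseDual{a.Value * b.Value,
		MergeAdd(Scale(a.Tangent, b.Value), Scale(b.Tangent, a.Value))}
}

func main() {
	x1, x2, x3, x4 := Seed(3, 0), Seed(5, 1), Seed(7, 2), Seed(9, 3)
	f1 := Mul(x1, x2) // row 1 of J: (x2, x1, 0, 0)
	f2 := Add(x3, x4) // row 2 of J: (0, 0, 1, 1)
	fmt.Println(f1.Tangent) // channels {0, 1}, values {5, 3}
	fmt.Println(f2.Tangent) // channels {2, 3}, values {1, 1}
}
```

Each output's tangent holds exactly the nonzero entries of its Jacobian row; the zero columns are never represented, let alone computed.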

Complexity reduction

Suppose:

  • primal cost is $C_f$,
  • dense tangent dimension is $n$,
  • each variable depends on only $s \ll n$ active directions.

Dense forward mode costs roughly:

$$ O(nC_f). $$

Sparse forward mode reduces this toward:

$$ O(sC_f). $$

The exact gain depends on:

  • sparsity stability,
  • tangent overlap,
  • data structure overhead,
  • cache locality.

Column compression

Sparse Jacobians often allow compressed seeding.

Suppose columns never overlap in output rows.

Example:

$$ J = \begin{bmatrix} * & 0 & * & 0 \\ 0 & * & 0 & * \end{bmatrix}. $$

Columns 1 and 2 are independent. Columns 3 and 4 are independent.

Instead of four basis seeds:

$$ e_1, e_2, e_3, e_4, $$

use two compressed seeds:

$$ v_1 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \qquad v_2 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}. $$

Then:

$$ Jv_1 = \begin{bmatrix} * \\ * \end{bmatrix}, \qquad Jv_2 = \begin{bmatrix} * \\ * \end{bmatrix}. $$

The original columns can be reconstructed because their row supports do not overlap.

This reduces the number of forward passes.
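The reconstruction step can be sketched concretely. Below, the values 2, 5, 3, 7 stand in for the four `*` entries of the 2×4 pattern above, and the column groups mirror the compressed seeds $v_1$, $v_2$:

```go
package main

import "fmt"

// matVec is one dense forward product J v (one forward pass).
func matVec(J [2][4]float64, v [4]float64) [2]float64 {
	var y [2]float64
	for i := 0; i < 2; i++ {
		for j := 0; j < 4; j++ {
			y[i] += J[i][j] * v[j]
		}
	}
	return y
}

func main() {
	// Concrete instance of the pattern [* 0 * 0; 0 * 0 *].
	J := [2][4]float64{{2, 0, 3, 0}, {0, 5, 0, 7}}

	// Column groups with disjoint row supports (0-based: {0,1} and {2,3}).
	groups := [][]int{{0, 1}, {2, 3}}

	var rec [2][4]float64
	for _, g := range groups {
		// Compressed seed: sum of the group's basis vectors.
		var seed [4]float64
		for _, j := range g {
			seed[j] = 1
		}
		w := matVec(J, seed) // compressed product J v_g

		// Decompress: each row entry of w belongs to the unique column
		// of the group that is structurally nonzero in that row.
		for _, j := range g {
			for i := 0; i < 2; i++ {
				if J[i][j] != 0 { // structural pattern, assumed known
					rec[i][j] = w[i]
				}
			}
		}
	}
	fmt.Println(rec == J) // true: two passes recovered all four columns
}
```

Two forward passes recover all four columns exactly because no group ever places two nonzeros in the same row.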

Graph coloring

Compressed sparse forward mode is usually formulated using graph coloring.

Define the column intersection graph:

  • each Jacobian column is a node,
  • two columns are connected if they share a nonzero row.

Columns with no edge may share one tangent channel.

Coloring assigns independent columns the same seed dimension.

If the graph requires $k$ colors, the Jacobian can be computed using $k$ tangent directions instead of $n$.

For sparse systems:

$$ k \ll n. $$

This is one of the most important techniques in sparse automatic differentiation.
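A greedy sketch of this coloring, operating directly on column row-supports. Optimal coloring is NP-hard, so greedy heuristics are standard in practice; colorColumns is a hypothetical helper name:

```go
package main

import "fmt"

// colorColumns assigns each column a color so that columns sharing a
// color have disjoint row supports, i.e. no edge in the column
// intersection graph. Greedy: reuse the lowest color that still fits.
func colorColumns(supports [][]int, numRows int) []int {
	color := make([]int, len(supports))
	var rowUsed [][]bool // rowUsed[c][r]: color c already covers row r
	for j, rows := range supports {
		assigned := -1
		for c := range rowUsed {
			free := true
			for _, r := range rows {
				if rowUsed[c][r] {
					free = false
					break
				}
			}
			if free {
				assigned = c
				break
			}
		}
		if assigned == -1 { // open a new color (a new tangent channel)
			rowUsed = append(rowUsed, make([]bool, numRows))
			assigned = len(rowUsed) - 1
		}
		for _, r := range rows {
			rowUsed[assigned][r] = true
		}
		color[j] = assigned
	}
	return color
}

func main() {
	// Column supports of the 2x3 example J = [* 0 *; 0 * *] (0-based rows).
	supports := [][]int{{0}, {1}, {0, 1}}
	fmt.Println(colorColumns(supports, 2)) // [0 0 1]: two colors, not three
}
```

The number of distinct colors returned is exactly the number of compressed tangent directions needed.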

Example: coloring

Suppose:

$$ J = \begin{bmatrix} * & 0 & * \\ 0 & * & * \end{bmatrix}. $$

Columns 1 and 2 do not overlap.

Assign colors:

| Column | Color |
| --- | --- |
| 1 | red |
| 2 | red |
| 3 | blue |

Use seed matrix:

$$ V = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}. $$

Then:

$$ JV $$

contains enough information to reconstruct all columns.

Only two tangent dimensions are needed instead of three.

Structural versus numerical sparsity

Sparse forward methods usually exploit structural sparsity.

Structural sparsity asks:

Which entries can ever become nonzero?

Numerical sparsity asks:

Which entries happen to be zero for this input?

Structural sparsity is preferred because it remains stable across executions.

Example:

$$ f(x,y) = xy. $$

At x=0x=0,

$$ \frac{\partial f}{\partial y} = 0. $$

This is numerical sparsity only. Structurally, the derivative may be nonzero elsewhere.

Sparse AD systems usually optimize using structural sparsity patterns.
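Structural patterns can be propagated as pure index sets, with no values at all. A minimal sketch (the Pattern type and union helper are illustrative):

```go
package main

import "fmt"

// Pattern is the set of input indices an expression can ever depend on.
type Pattern map[int]bool

// union is the structural rule shared by addition and multiplication:
// the result may depend on anything either operand depends on.
func union(a, b Pattern) Pattern {
	out := Pattern{}
	for k := range a {
		out[k] = true
	}
	for k := range b {
		out[k] = true
	}
	return out
}

func main() {
	x := Pattern{0: true} // input x
	y := Pattern{1: true} // input y

	// f(x, y) = x*y: structurally, f depends on both inputs,
	// even at points like x = 0 where df/dy is numerically zero.
	f := union(x, y)
	fmt.Println(len(f), f[0], f[1]) // 2 true true
}
```

Because only structure is tracked, this analysis is valid for every input, which is what makes the resulting seed compression safe to reuse across executions.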

Dynamic sparsity

Some programs change dependency structure at runtime.

Example:

if x > 0:
    y = a * b
else:
    y = c * d

The active derivative structure depends on execution.

Dynamic sparsity requires:

  • runtime dependency discovery,
  • adaptive tangent structures,
  • dynamic graph coloring.

This is harder than static sparsity analysis.

Sparse matrix primitives

Linear algebra operations often preserve sparsity structure.

For:

$$ z = Ax, $$

where $A$ is sparse,

$$ \dot{z} = \dot{A}x + A\dot{x}. $$

Sparse forward systems should preserve sparse matrix representations throughout tangent propagation.

Naively densifying tangents destroys performance.
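For a constant sparse $A$ (so $\dot{A} = 0$), the rule reduces to $\dot{z} = A\dot{x}$, and the same sparse kernel serves both the primal and the tangent. A CSR sketch, with illustrative names:

```go
package main

import "fmt"

// CSR is a compressed sparse row matrix: row i owns entries
// Val[RowPtr[i]:RowPtr[i+1]] in columns Col[RowPtr[i]:RowPtr[i+1]].
type CSR struct {
	RowPtr []int
	Col    []int
	Val    []float64
}

// MatVec computes y = A x, touching only stored nonzeros.
func (a CSR) MatVec(x []float64) []float64 {
	y := make([]float64, len(a.RowPtr)-1)
	for i := 0; i+1 < len(a.RowPtr); i++ {
		for k := a.RowPtr[i]; k < a.RowPtr[i+1]; k++ {
			y[i] += a.Val[k] * x[a.Col[k]]
		}
	}
	return y
}

func main() {
	// A = [2 0; 1 3] stored in CSR form.
	A := CSR{RowPtr: []int{0, 1, 3}, Col: []int{0, 0, 1}, Val: []float64{2, 1, 3}}

	x := []float64{4, 5}
	xdot := []float64{1, 0} // tangent of x

	z := A.MatVec(x)       // primal:  [8 19]
	zdot := A.MatVec(xdot) // tangent: A x-dot = [2 1]
	fmt.Println(z, zdot)
}
```

Both products cost O(nnz), and the tangent never materializes the zero structure of $A$.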

Sparse PDE example

Consider a one-dimensional finite difference stencil:

$$ f_i(x) = x_{i-1} - 2x_i + x_{i+1}. $$

Each output depends only on neighboring inputs.

The Jacobian is tridiagonal:

$$ J = \begin{bmatrix} * & * & 0 & 0 & \cdots \\ * & * & * & 0 & \cdots \\ 0 & * & * & * & \cdots \\ \vdots & & & & \ddots \end{bmatrix}. $$

The column intersection graph has low degree. Coloring requires only a few colors independent of problem size.

Thus sparse forward mode computes derivatives with nearly constant tangent dimension even for huge systems.

This is one reason sparse forward AD is important in scientific computing.
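Column $j$ of this stencil touches only rows $j-1$, $j$, $j+1$, so columns whose indices agree mod 3 never share a row and three compressed seeds suffice for any $n$. Since the stencil is linear, $Jv$ equals the stencil applied to $v$, which keeps the sketch short:

```go
package main

import "fmt"

// stencil applies f_i(x) = x_{i-1} - 2 x_i + x_{i+1}
// (boundary terms dropped at the ends).
func stencil(x []float64) []float64 {
	n := len(x)
	f := make([]float64, n)
	for i := 0; i < n; i++ {
		if i > 0 {
			f[i] += x[i-1]
		}
		f[i] -= 2 * x[i]
		if i < n-1 {
			f[i] += x[i+1]
		}
	}
	return f
}

func main() {
	n := 8

	// Three compressed seeds: seed c sums all basis vectors e_j with j%3 == c.
	// f is linear, so J v = stencil(v): three passes total, independent of n.
	var w [3][]float64
	for c := 0; c < 3; c++ {
		v := make([]float64, n)
		for j := c; j < n; j += 3 {
			v[j] = 1
		}
		w[c] = stencil(v)
	}

	// Decompress: J[i][j] is nonzero only for |i-j| <= 1, and within one
	// seed those columns never collide, so J[i][j] = w[j%3][i].
	J := make([][]float64, n)
	for i := range J {
		J[i] = make([]float64, n)
		lo, hi := i-1, i+1
		if lo < 0 {
			lo = 0
		}
		if hi > n-1 {
			hi = n - 1
		}
		for j := lo; j <= hi; j++ {
			J[i][j] = w[j%3][i]
		}
	}
	fmt.Println(J[3]) // row 3 of J: [0 0 1 -2 1 0 0 0]
}
```

Doubling n leaves the number of forward passes at three; only the per-pass cost grows.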

Sparse Hessians

Forward mode also helps compute sparse Hessians.

Common strategies include:

| Method | Idea |
| --- | --- |
| forward-over-reverse | directional Hessian products |
| sparse coloring | compressed second derivatives |
| hyper-dual sparsity | exact second-order channels |

Sparse Hessian methods are widely used in nonlinear optimization and PDE-constrained optimization.

Tradeoffs

Sparse forward methods introduce overhead.

| Benefit | Cost |
| --- | --- |
| fewer tangent channels | sparse bookkeeping |
| reduced arithmetic | indirect memory access |
| lower memory usage | merge overhead |
| compressed Jacobians | graph coloring complexity |

Sparse methods help only when sparsity is sufficiently strong.

If tangent vectors become dense during propagation, sparse representations may become slower than dense arrays.

Fill-in

A major challenge is fill-in.

Even if inputs are sparse, operations may densify tangents.

Example:

$$ z = x + y, $$

where:

$$ \dot{x} = \{1,2,3\}, \qquad \dot{y} = \{4,5,6\}. $$

Then:

$$ \dot{z} = \{1,2,3,4,5,6\}. $$

Repeated merges can gradually destroy sparsity.

Good sparse forward systems minimize fill-in through:

  • graph ordering,
  • block decomposition,
  • local elimination strategies,
  • compressed seed scheduling.

Hybrid sparse-dense methods

Practical systems often combine sparse and dense tangent representations.

Examples:

| Regime | Representation |
| --- | --- |
| few active entries | sparse |
| moderate density | packed blocks |
| high density | dense arrays |

Adaptive switching avoids pathological overhead.

Compiler support

Sparse forward mode benefits from compiler analysis.

Compilers can infer:

  • dependency graphs,
  • sparsity structure,
  • block patterns,
  • independent variable groups.

This enables automatic seed compression and optimized tangent layouts.

Modern differentiable compilers increasingly integrate sparse derivative analysis directly into IR optimization.

Sparse forward mode versus reverse mode

Sparse forward mode is especially attractive when:

$$ n \ll m $$

or when directional derivatives dominate.

Reverse mode can also exploit sparsity, but sparse reverse accumulation is usually more complex because adjoints flow backward through the graph and require accumulation from many paths.

Forward sparsity propagation is often simpler and more local.

Summary

Sparse forward methods exploit derivative sparsity to reduce tangent dimensionality, runtime, and memory cost. Instead of propagating dense tangent vectors, they propagate only active derivative channels.

Core techniques include sparse tangent representations, compressed seeding, graph coloring, block propagation, and adaptive sparsity management. These methods are essential for large scientific and engineering systems where Jacobians are sparse and structured.