
Chapter 16. Sparse and Structured Differentiation


Sparse and structured differentiation studies how to compute derivatives without materializing dense derivative objects. Many real systems have enormous Jacobians and Hessians, but most entries are zero, repeated, blocked, low rank, or constrained by graph structure.

A dense Jacobian for a function

f : \mathbb{R}^n \to \mathbb{R}^m

has mn entries. If m and n are large, storing or computing the full matrix is often impossible.

Automatic differentiation usually avoids this by computing products:

Jv

or:

v^\top J

rather than constructing J itself.
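
A minimal sketch of the matrix-free idea, using NumPy and a finite-difference stand-in for the product (forward-mode AD computes the same Jv exactly; the function f here is illustrative):

import numpy as np

def f(x):
    # Illustrative function f : R^3 -> R^2.
    return np.array([x[0] * x[1], np.sin(x[2])])

def jvp_fd(f, x, v, eps=1e-7):
    # (f(x + eps*v) - f(x)) / eps approximates the product Jv without
    # ever forming the m-by-n matrix J; forward-mode AD computes Jv
    # exactly at comparable cost.
    return (f(x + eps * v) - f(x)) / eps

x = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 0.0, 0.0])
print(jvp_fd(f, x, v))   # ~ first column of J: [x2, 0] = [2.0, 0.0]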

Sparse and structured differentiation goes further. It uses known structure in the computation to reduce time, memory, communication, and storage.

Sparse Jacobians

A Jacobian is sparse when most entries are zero.

J_{ij} = \frac{\partial f_i}{\partial x_j}

If output f_i depends only on a small subset of inputs, then most entries in row i are zero.

Example:

y1 = f1(x1, x2)
y2 = f2(x2, x3)
y3 = f3(x3, x4)

The Jacobian has a banded structure:

[ * * 0 0 ]
[ 0 * * 0 ]
[ 0 0 * * ]

A dense AD method would treat all entries as possible. A sparse method records or exploits the dependency pattern.
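
One way to record the dependency pattern is to propagate input index sets through the program. A minimal Python sketch, with names mirroring the example above:

# Each value carries the set of input ids it depends on; set unions
# propagate the pattern through the computation.
x1, x2, x3, x4 = {1}, {2}, {3}, {4}   # each input depends on itself

y1 = x1 | x2   # f1(x1, x2) may touch inputs {1, 2}
y2 = x2 | x3   # f2(x2, x3) may touch inputs {2, 3}
y3 = x3 | x4   # f3(x3, x4) may touch inputs {3, 4}

print(y1, y2, y3)   # {1, 2} {2, 3} {3, 4}: the banded row patterns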

Sparsity from Locality

Sparse Jacobians often arise from local interactions.

System                    Sparsity Source
Finite difference PDEs    Each grid cell depends on nearby cells
Finite element models     Each element touches local nodes
Graph neural networks     Node update depends on neighbors
Robotics                  Joint dependencies follow kinematic chains
Databases                 Operators depend on selected columns
Compilers                 Dataflow edges restrict influence
Neural networks           Structured layers and masks

The derivative structure mirrors the computation graph.
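
For instance, a 1D finite-difference Laplacian reads only neighboring cells, so its Jacobian pattern is tridiagonal. A small NumPy sketch:

import numpy as np

def laplace_1d(u):
    # Interior cell i reads only u[i-1], u[i], u[i+1].
    f = np.zeros_like(u)
    f[1:-1] = u[:-2] - 2.0 * u[1:-1] + u[2:]
    return f

n = 6
pattern = np.zeros((n, n), dtype=int)
for i in range(1, n - 1):
    pattern[i, i - 1 : i + 2] = 1   # structural nonzeros of row i
print(pattern)   # tridiagonal: the pattern mirrors the stencil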

Structural Sparsity

Structural sparsity describes which derivatives may be nonzero, independent of specific numeric values.

If f_i does not depend on x_j, then:

\frac{\partial f_i}{\partial x_j} = 0

for all values of x.

This differs from accidental zeros, where a derivative happens to be zero only at a particular input.

Sparse AD systems usually exploit structural sparsity because it is stable across inputs.
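
A small worked example of the distinction: for f(x1, x2) = x1 * x2, the partial with respect to x1 is x2, which is structurally nonzero but happens to vanish at any input with x2 = 0:

def df_dx1(x1, x2):
    # Analytic partial of f(x1, x2) = x1 * x2 with respect to x1.
    return x2

print(df_dx1(3.0, 0.0))   # 0.0: an accidental zero at this particular input
print(df_dx1(3.0, 5.0))   # 5.0: nonzero elsewhere, so the pattern keeps it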

Dependency Graph

The sparsity pattern can be represented as a bipartite graph:

inputs:   x1  x2  x3  x4
          | / | / | /
outputs:  y1  y2  y3

An edge from x_j to y_i means f_i depends on x_j.

The Jacobian pattern is the adjacency matrix of this dependency graph.
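
A sketch of the pattern as an adjacency matrix, built with SciPy's COO format from the edge list above (0-indexed):

import numpy as np
from scipy.sparse import coo_matrix

# (output, input) edges of the bipartite dependency graph.
edges = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3)]
rows, cols = zip(*edges)
pattern = coo_matrix((np.ones(len(edges), dtype=int), (rows, cols)),
                     shape=(3, 4))
print(pattern.toarray())
# [[1 1 0 0]
#  [0 1 1 0]
#  [0 0 1 1]]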

Computing Sparse Jacobians

There are three common strategies.

Strategy                   Idea
Direct sparse propagation  Propagate sparse derivative maps through operations
Seed compression           Use graph coloring to combine columns or rows
Matrix-free products       Compute only Jv or v^T J as needed

Direct sparse propagation is simple but may allocate many small maps.

Seed compression is efficient when the sparsity pattern is known.

Matrix-free products are best when an optimizer only needs directional derivatives.

Sparse Forward Propagation

In forward mode, each value carries derivatives with respect to active inputs.

For sparse derivatives:

value = number
deriv = map[input_id] -> derivative

For an operation:

z = x \cdot y

the derivative map is:

dz = y\,dx + x\,dy

Only nonzero entries are stored.

This works well when each intermediate depends on few inputs. It can become expensive when dependencies grow during computation.
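
A minimal sketch of this scheme in Python: each value carries a dict from input ids to partials, and arithmetic merges the maps. The names are illustrative, and a real implementation would cover many more operations:

class SparseDual:
    def __init__(self, value, deriv=None):
        self.value = value
        self.deriv = deriv or {}   # map: input_id -> d(value)/d(input)

    def __add__(self, other):
        # Sum rule: merge the two sparse maps entry by entry.
        d = dict(self.deriv)
        for k, g in other.deriv.items():
            d[k] = d.get(k, 0.0) + g
        return SparseDual(self.value + other.value, d)

    def __mul__(self, other):
        # Product rule dz = y dx + x dy, merged over the sparse maps.
        d = {k: other.value * g for k, g in self.deriv.items()}
        for k, g in other.deriv.items():
            d[k] = d.get(k, 0.0) + self.value * g
        return SparseDual(self.value * other.value, d)

x = SparseDual(2.0, {1: 1.0})   # input x1
y = SparseDual(3.0, {2: 1.0})   # input x2
z = x * y + x                   # depends only on inputs 1 and 2
print(z.value, z.deriv)         # 8.0 {1: 4.0, 2: 2.0}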

Sparse Reverse Propagation

In reverse mode, each variable accumulates adjoints from downstream users.

Sparse reverse mode can avoid propagating through inactive edges.

If an output does not depend on an intermediate, its adjoint contribution is zero and can be skipped.

The reverse graph already gives useful sparsity: only executed dependencies participate in the backward pass.

For full sparse Jacobians, reverse mode can be run with compressed output seeds.
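
A sketch of compressed output seeds, with a hand-coded vector-Jacobian product for a function whose row patterns are {x1, x2}, {x2, x3}, {x3, x4} (the same shape as the worked example below). Rows 1 and 3 share no columns, so one seed recovers both:

import numpy as np

def vjp(x, w):
    # Hand-coded w^T J for Jacobian rows
    # [2*x1, 1, 0, 0], [0, x3, x2, 0], [0, 0, cos(x3), 1].
    return np.array([
        2.0 * x[0] * w[0],
        w[0] + x[2] * w[1],
        x[1] * w[1] + np.cos(x[2]) * w[2],
        w[2],
    ])

x = np.array([1.0, 2.0, 3.0, 4.0])
g13 = vjp(x, np.array([1.0, 0.0, 1.0]))   # rows 1 and 3 in one sweep
g2  = vjp(x, np.array([0.0, 1.0, 0.0]))   # row 2 alone
# Unscatter with the known pattern: row 1 owns columns 1-2, row 3 owns 3-4.
print("row 1:", g13[:2])   # [2*x1, 1]
print("row 2:", g2[1:3])   # [x3, x2]
print("row 3:", g13[2:])   # [cos(x3), 1]

Two reverse sweeps recover all three rows; the known sparsity pattern tells us which row owns each recovered entry.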

Sparse Storage Formats

Sparse derivative matrices are stored in formats such as:

Format           Good For
COO              Simple construction
CSR              Efficient row access
CSC              Efficient column access
Block sparse     Repeated dense blocks
Banded           Local stencil systems
Coordinate maps  Dynamic sparsity

The choice depends on access pattern. Optimization solvers often prefer CSR or CSC.
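
A short SciPy sketch of the usual workflow: construct in COO, then convert to whatever format the consumer prefers. The numeric values here are illustrative:

import numpy as np
from scipy.sparse import coo_matrix

# Illustrative entries; a real workflow fills these from an AD sweep.
rows = np.array([0, 0, 1, 1, 2, 2])
cols = np.array([0, 1, 1, 2, 2, 3])
vals = np.array([2.0, 1.0, 3.0, 2.0, -0.99, 1.0])

J = coo_matrix((vals, (rows, cols)), shape=(3, 4))   # simple to build
J_csr = J.tocsr()   # row-oriented: fast J @ v and row slicing
J_csc = J.tocsc()   # column-oriented: fast column access, some factorizations
print(J_csr.indptr, J_csr.indices, J_csr.data)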

Sparse Jacobian Example

Consider:

f_1 = x_1^2 + x_2

f_2 = x_2 x_3

f_3 = \sin(x_3) + x_4

Then:

J = \begin{bmatrix}
2x_1 & 1 & 0 & 0 \\
0 & x_3 & x_2 & 0 \\
0 & 0 & \cos(x_3) & 1
\end{bmatrix}

The nonzero structure is banded. Computing all twelve entries wastes work when only six can be nonzero.
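
A sketch that evaluates only the structurally nonzero entries and stores them in COO form (using SciPy; the hard-coded pattern stands in for one detected automatically):

import numpy as np
from scipy.sparse import coo_matrix

def sparse_jacobian(x):
    # Evaluate only the six structurally nonzero entries, row by row.
    rows = [0, 0, 1, 1, 2, 2]
    cols = [0, 1, 1, 2, 2, 3]
    vals = [2.0 * x[0], 1.0,      # row 1: 2*x1, 1
            x[2], x[1],           # row 2: x3, x2
            np.cos(x[2]), 1.0]    # row 3: cos(x3), 1
    return coo_matrix((vals, (rows, cols)), shape=(3, 4))

print(sparse_jacobian(np.array([1.0, 2.0, 3.0, 4.0])).toarray())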

Why Sparsity Changes Complexity

Dense Jacobian construction may require n forward sweeps or m reverse sweeps.

With sparsity, the number of sweeps can be reduced by grouping independent columns or rows.

If the graph coloring number is c, the cost can approach c sweeps instead of n or m.

For many PDE and graph problems, c is small compared with the number of variables.
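
A minimal greedy coloring sketch for the banded pattern above: two columns may share a color when no row contains both, and each color becomes one compressed forward sweep:

import numpy as np

pattern = np.array([[1, 1, 0, 0],
                    [0, 1, 1, 0],
                    [0, 0, 1, 1]], dtype=bool)

colors = {}
for j in range(pattern.shape[1]):
    # Colors already taken by earlier columns that share a row with j.
    used = {colors[k] for k in range(j)
            if (pattern[:, j] & pattern[:, k]).any()}
    colors[j] = min(c for c in range(j + 1) if c not in used)

print(colors)   # {0: 0, 1: 1, 2: 0, 3: 1}: c = 2 sweeps instead of n = 4

Each sweep then seeds the sum of the same-colored columns, and the pattern determines which column owns each recovered entry.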

Risks

Sparse differentiation has practical risks.

Risk                       Description
Pattern overestimation     Treating too many entries as possibly nonzero reduces benefit
Pattern underestimation    Missing a dependency gives wrong derivatives
Dynamic control flow       Pattern may change by input
Fill-in                    Sparse intermediates become dense
Small allocation overhead  Sparse maps can cost more than dense vectors for small dimensions
Hardware mismatch          GPUs often prefer dense or block-sparse kernels

Sparse AD is a systems problem as much as a calculus problem.

Core Idea

Sparse Jacobian computation avoids treating every input-output pair as dependent. It uses structural dependency information to compute only derivative entries that can matter.

The result is the same mathematical Jacobian, but represented and computed according to the sparsity of the underlying program.