
Chapter 16. Sparse and Structured Differentiation


Sparse and structured differentiation studies how to compute derivatives without materializing dense derivative objects. Many real systems have enormous Jacobians and Hessians, but most entries are zero, repeated, blocked, low rank, or constrained by graph structure.

A dense Jacobian for a function

f : \mathbb{R}^n \to \mathbb{R}^m

has mn entries. If m and n are large, storing or computing the full matrix is often impossible.

Automatic differentiation usually avoids this by computing products:

Jv

or:

v^\top J

rather than constructing J itself.
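
A minimal sketch of the matrix-free idea, using NumPy and a finite-difference stand-in for the product (forward-mode AD computes the same Jv exactly; the function f here is illustrative):

import numpy as np

def f(x):
    # Illustrative function f : R^3 -> R^2.
    return np.array([x[0] * x[1], np.sin(x[2])])

def jvp_fd(f, x, v, eps=1e-7):
    # (f(x + eps*v) - f(x)) / eps approximates the product Jv without
    # ever forming the m-by-n matrix J; forward-mode AD computes Jv
    # exactly at comparable cost.
    return (f(x + eps * v) - f(x)) / eps

x = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 0.0, 0.0])
print(jvp_fd(f, x, v))   # ~ first column of J: [x2, 0] = [2.0, 0.0]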

Sparse and structured differentiation goes further. It uses known structure in the computation to reduce time, memory, communication, and storage.

Sparse Jacobians

A Jacobian is sparse when most entries are zero.

J_{ij} = \frac{\partial f_i}{\partial x_j}

If output f_i depends only on a small subset of inputs, then most entries in row i are zero.

Example:

y1 = f1(x1, x2)
y2 = f2(x2, x3)
y3 = f3(x3, x4)

The Jacobian has a banded structure:

[ * * 0 0 ]
[ 0 * * 0 ]
[ 0 0 * * ]

A dense AD method would treat all entries as possible. A sparse method records or exploits the dependency pattern.
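
One way to record the dependency pattern is to propagate input index sets through the program. A minimal Python sketch, with names mirroring the example above:

# Each value carries the set of input ids it depends on; set unions
# propagate the pattern through the computation.
x1, x2, x3, x4 = {1}, {2}, {3}, {4}   # each input depends on itself

y1 = x1 | x2   # f1(x1, x2) may touch inputs {1, 2}
y2 = x2 | x3   # f2(x2, x3) may touch inputs {2, 3}
y3 = x3 | x4   # f3(x3, x4) may touch inputs {3, 4}

print(y1, y2, y3)   # {1, 2} {2, 3} {3, 4}: the banded row patterns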

Sparsity from Locality

Sparse Jacobians often arise from local interactions.

System                    Sparsity Source
Finite difference PDEs    Each grid cell depends on nearby cells
Finite element models     Each element touches local nodes
Graph neural networks     Node update depends on neighbors
Robotics                  Joint dependencies follow kinematic chains
Databases                 Operators depend on selected columns
Compilers                 Dataflow edges restrict influence
Neural networks           Structured layers and masks

The derivative structure mirrors the computation graph.
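
For instance, a 1D finite-difference Laplacian reads only neighboring cells, so its Jacobian pattern is tridiagonal. A small NumPy sketch:

import numpy as np

def laplace_1d(u):
    # Interior cell i reads only u[i-1], u[i], u[i+1].
    f = np.zeros_like(u)
    f[1:-1] = u[:-2] - 2.0 * u[1:-1] + u[2:]
    return f

n = 6
pattern = np.zeros((n, n), dtype=int)
for i in range(1, n - 1):
    pattern[i, i - 1 : i + 2] = 1   # structural nonzeros of row i
print(pattern)   # tridiagonal: the pattern mirrors the stencil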

Structural Sparsity

Structural sparsity describes which derivatives may be nonzero, independent of specific numeric values.

If f_i does not depend on x_j, then:

\frac{\partial f_i}{\partial x_j} = 0

for all values of x.

This differs from accidental zeros, where a derivative happens to be zero only at a particular input.

Sparse AD systems usually exploit structural sparsity because it is stable across inputs.
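
A small worked example of the distinction: for f(x1, x2) = x1 * x2, the partial with respect to x1 is x2, which is structurally nonzero but happens to vanish at any input with x2 = 0:

def df_dx1(x1, x2):
    # Analytic partial of f(x1, x2) = x1 * x2 with respect to x1.
    return x2

print(df_dx1(3.0, 0.0))   # 0.0: an accidental zero at this particular input
print(df_dx1(3.0, 5.0))   # 5.0: nonzero elsewhere, so the pattern keeps it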

Dependency Graph

The sparsity pattern can be represented as a bipartite graph:

inputs:   x1  x2  x3  x4
          | / | / | /
outputs:  y1  y2  y3

An edge from x_j to y_i means f_i depends on x_j.

The Jacobian pattern is the adjacency matrix of this dependency graph.
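
A sketch of the pattern as an adjacency matrix, built with SciPy's COO format from the edge list above (0-indexed):

import numpy as np
from scipy.sparse import coo_matrix

# (output, input) edges of the bipartite dependency graph.
edges = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3)]
rows, cols = zip(*edges)
pattern = coo_matrix((np.ones(len(edges), dtype=int), (rows, cols)),
                     shape=(3, 4))
print(pattern.toarray())
# [[1 1 0 0]
#  [0 1 1 0]
#  [0 0 1 1]]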

Computing Sparse Jacobians

There are three common strategies.

Strategy                   Idea
Direct sparse propagation  Propagate sparse derivative maps through operations
Seed compression           Use graph coloring to combine columns or rows
Matrix-free products       Compute only Jv or v^T J as needed

Direct sparse propagation is simple but may allocate many small maps.

Seed compression is efficient when the sparsity pattern is known.

Matrix-free products are best when an optimizer only needs directional derivatives.

Sparse Forward Propagation

In forward mode, each value carries derivatives with respect to active inputs.

For sparse derivatives:

value = number
deriv = map[input_id] -> derivative

For an operation:

z = x \cdot y

the derivative map is:

dz = y\,dx + x\,dy

Only nonzero entries are stored.

This works well when each intermediate depends on few inputs. It can become expensive when dependencies grow during computation.
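
A minimal sketch of this scheme in Python: each value carries a dict from input ids to partials, and arithmetic merges the maps. The names are illustrative, and a real implementation would cover many more operations:

class SparseDual:
    def __init__(self, value, deriv=None):
        self.value = value
        self.deriv = deriv or {}   # map: input_id -> d(value)/d(input)

    def __add__(self, other):
        # Sum rule: merge the two sparse maps entry by entry.
        d = dict(self.deriv)
        for k, g in other.deriv.items():
            d[k] = d.get(k, 0.0) + g
        return SparseDual(self.value + other.value, d)

    def __mul__(self, other):
        # Product rule dz = y dx + x dy, merged over the sparse maps.
        d = {k: other.value * g for k, g in self.deriv.items()}
        for k, g in other.deriv.items():
            d[k] = d.get(k, 0.0) + self.value * g
        return SparseDual(self.value * other.value, d)

x = SparseDual(2.0, {1: 1.0})   # input x1
y = SparseDual(3.0, {2: 1.0})   # input x2
z = x * y + x                   # depends only on inputs 1 and 2
print(z.value, z.deriv)         # 8.0 {1: 4.0, 2: 2.0}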

Sparse Reverse Propagation

In reverse mode, each variable accumulates adjoints from downstream users.

Sparse reverse mode can avoid propagating through inactive edges.

If an output does not depend on an intermediate, its adjoint contribution is zero and can be skipped.

The reverse graph already gives useful sparsity: only executed dependencies participate in the backward pass.

For full sparse Jacobians, reverse mode can be run with compressed output seeds.
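
A sketch of compressed output seeds, with a hand-coded vector-Jacobian product for a function whose row patterns are {x1, x2}, {x2, x3}, {x3, x4} (the same shape as the worked example below). Rows 1 and 3 share no columns, so one seed recovers both:

import numpy as np

def vjp(x, w):
    # Hand-coded w^T J for Jacobian rows
    # [2*x1, 1, 0, 0], [0, x3, x2, 0], [0, 0, cos(x3), 1].
    return np.array([
        2.0 * x[0] * w[0],
        w[0] + x[2] * w[1],
        x[1] * w[1] + np.cos(x[2]) * w[2],
        w[2],
    ])

x = np.array([1.0, 2.0, 3.0, 4.0])
g13 = vjp(x, np.array([1.0, 0.0, 1.0]))   # rows 1 and 3 in one sweep
g2  = vjp(x, np.array([0.0, 1.0, 0.0]))   # row 2 alone
# Unscatter with the known pattern: row 1 owns columns 1-2, row 3 owns 3-4.
print("row 1:", g13[:2])   # [2*x1, 1]
print("row 2:", g2[1:3])   # [x3, x2]
print("row 3:", g13[2:])   # [cos(x3), 1]

Two reverse sweeps recover all three rows; the known sparsity pattern tells us which row owns each recovered entry.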

Sparse Storage Formats

Sparse derivative matrices are stored in formats such as:

Format           Good For
COO              Simple construction
CSR              Efficient row access
CSC              Efficient column access
Block sparse     Repeated dense blocks
Banded           Local stencil systems
Coordinate maps  Dynamic sparsity

The choice depends on access pattern. Optimization solvers often prefer CSR or CSC.
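
A short SciPy sketch of the usual workflow: construct in COO, then convert to whatever format the consumer prefers. The numeric values here are illustrative:

import numpy as np
from scipy.sparse import coo_matrix

# Illustrative entries; a real workflow fills these from an AD sweep.
rows = np.array([0, 0, 1, 1, 2, 2])
cols = np.array([0, 1, 1, 2, 2, 3])
vals = np.array([2.0, 1.0, 3.0, 2.0, -0.99, 1.0])

J = coo_matrix((vals, (rows, cols)), shape=(3, 4))   # simple to build
J_csr = J.tocsr()   # row-oriented: fast J @ v and row slicing
J_csc = J.tocsc()   # column-oriented: fast column access, some factorizations
print(J_csr.indptr, J_csr.indices, J_csr.data)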

Sparse Jacobian Example

Consider:

f_1 = x_1^2 + x_2

f_2 = x_2 x_3

f_3 = \sin(x_3) + x_4

Then:

J = \begin{bmatrix}
2x_1 & 1 & 0 & 0 \\
0 & x_3 & x_2 & 0 \\
0 & 0 & \cos(x_3) & 1
\end{bmatrix}

The nonzero structure is banded. Computing all twelve entries wastes work when only six can be nonzero.
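
A sketch that evaluates only the structurally nonzero entries and stores them in COO form (using SciPy; the hard-coded pattern stands in for one detected automatically):

import numpy as np
from scipy.sparse import coo_matrix

def sparse_jacobian(x):
    # Evaluate only the six structurally nonzero entries, row by row.
    rows = [0, 0, 1, 1, 2, 2]
    cols = [0, 1, 1, 2, 2, 3]
    vals = [2.0 * x[0], 1.0,      # row 1: 2*x1, 1
            x[2], x[1],           # row 2: x3, x2
            np.cos(x[2]), 1.0]    # row 3: cos(x3), 1
    return coo_matrix((vals, (rows, cols)), shape=(3, 4))

print(sparse_jacobian(np.array([1.0, 2.0, 3.0, 4.0])).toarray())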

Why Sparsity Changes Complexity

Dense Jacobian construction may require n forward sweeps or m reverse sweeps.

With sparsity, the number of sweeps can be reduced by grouping independent columns or rows.

If the graph coloring number is c, the cost can approach c sweeps instead of n or m.

For many PDE and graph problems, c is small compared with the number of variables.
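
A minimal greedy coloring sketch for the banded pattern above: two columns may share a color when no row contains both, and each color becomes one compressed forward sweep:

import numpy as np

pattern = np.array([[1, 1, 0, 0],
                    [0, 1, 1, 0],
                    [0, 0, 1, 1]], dtype=bool)

colors = {}
for j in range(pattern.shape[1]):
    # Colors already taken by earlier columns that share a row with j.
    used = {colors[k] for k in range(j)
            if (pattern[:, j] & pattern[:, k]).any()}
    colors[j] = min(c for c in range(j + 1) if c not in used)

print(colors)   # {0: 0, 1: 1, 2: 0, 3: 1}: c = 2 sweeps instead of n = 4

Each sweep then seeds the sum of the same-colored columns, and the pattern determines which column owns each recovered entry.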

Risks

Sparse differentiation has practical risks.

Risk                       Description
Pattern overestimation     Treating too many entries as possibly nonzero reduces benefit
Pattern underestimation    Missing a dependency gives wrong derivatives
Dynamic control flow       Pattern may change by input
Fill-in                    Sparse intermediates become dense
Small allocation overhead  Sparse maps can cost more than dense vectors for small dimensions
Hardware mismatch          GPUs often prefer dense or block-sparse kernels

Sparse AD is a systems problem as much as a calculus problem.

Core Idea

Sparse Jacobian computation avoids treating every input-output pair as dependent. It uses structural dependency information to compute only derivative entries that can matter.

The result is the same mathematical Jacobian, but represented and computed according to the sparsity of the underlying program.