# Chapter 16. Sparse and Structured Differentiation

Sparse and structured differentiation studies how to compute derivatives without materializing dense derivative objects. Many real systems have enormous Jacobians and Hessians, but most entries are zero, repeated, blocked, low rank, or constrained by graph structure.

A dense Jacobian for a function

$$
f : \mathbb{R}^n \to \mathbb{R}^m
$$

has $mn$ entries. If $m$ and $n$ are large, storing or computing the full matrix is often impossible.

Automatic differentiation usually avoids this by computing products with $J$ rather than constructing $J$ itself. Forward mode computes Jacobian-vector products:

$$
Jv
$$

Reverse mode computes vector-Jacobian products:

$$
v^\top J
$$
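
As a sketch of what this looks like in practice (JAX is only one possible tool here, and the function `f` is an arbitrary illustration), both products can be formed without ever materializing $J$:

```python
# Matrix-free derivative products, sketched with JAX.
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative map from R^3 to R^2.
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, 0.0])   # direction in input space
u = jnp.array([0.0, 1.0])        # direction in output space

# Forward mode: J v without forming J.
_, jv = jax.jvp(f, (x,), (v,))

# Reverse mode: u^T J without forming J.
_, vjp_fn = jax.vjp(f, x)
(utj,) = vjp_fn(u)

print(jv)    # shape (2,): a combination of columns of J
print(utj)   # shape (3,): a combination of rows of J
```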

Sparse and structured differentiation goes further. It uses known structure in the computation to reduce time, memory, communication, and storage.

## Sparse Jacobians

A Jacobian is sparse when most entries are zero.

$$
J_{ij} = \frac{\partial f_i}{\partial x_j}
$$

If output $f_i$ depends only on a small subset of inputs, then most derivatives in row $i$ are zero.

Example:

```text
y1 = f1(x1, x2)
y2 = f2(x2, x3)
y3 = f3(x3, x4)
```

The Jacobian has a banded structure:

```text
[ * * 0 0 ]
[ 0 * * 0 ]
[ 0 0 * * ]
```

A dense AD method would treat every entry as potentially nonzero. A sparse method records the dependency pattern and exploits it.

### Sparsity from Locality

Sparse Jacobians often arise from local interactions.

| System | Sparsity Source |
|---|---|
| Finite difference PDEs | Each grid cell depends on nearby cells |
| Finite element models | Each element touches local nodes |
| Graph neural networks | Node update depends on neighbors |
| Robotics | Joint dependencies follow kinematic chains |
| Databases | Operators depend on selected columns |
| Compilers | Dataflow edges restrict influence |
| Neural networks | Structured layers and masks |

The derivative structure mirrors the computation graph.
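
As an illustration, a one-dimensional stencil map (a hypothetical example, not tied to any particular solver) yields a Jacobian whose rows each contain three nonzeros, mirroring the local neighborhood each output reads:

```python
# A 1D Laplacian-like stencil: output i depends only on x[i-1], x[i], x[i+1].
import jax
import jax.numpy as jnp

def stencil(x):
    return x[:-2] - 2.0 * x[1:-1] + x[2:]   # interior points only

x = jnp.linspace(0.0, 1.0, 8)
J = jax.jacfwd(stencil)(x)   # dense here, purely to inspect the pattern

print((J != 0).astype(int))
# Each row has exactly three nonzeros, shifted by one column per row:
# a banded pattern that mirrors the local stencil.
```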

### Structural Sparsity

Structural sparsity describes which derivatives may be nonzero, independent of specific numeric values.

If $f_i$ does not depend on $x_j$, then:

$$
\frac{\partial f_i}{\partial x_j} = 0
$$

for all values of $x$.

This differs from accidental zeros, where a derivative happens to be zero only at a particular input.

Sparse AD systems usually exploit structural sparsity because it is stable across inputs.
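
The distinction can be checked numerically. In the sketch below (an illustrative function, not this chapter's running example), $\partial f_1 / \partial x_3$ is a structural zero at every input, while $\partial f_2 / \partial x_2 = x_1$ is only accidentally zero at inputs with $x_1 = 0$:

```python
import jax
import jax.numpy as jnp

def f(x):
    # f1 never reads x[2]; f2's derivative w.r.t. x[1] equals x[0].
    return jnp.array([x[0] + x[1], x[0] * x[1]])

J = jax.jacfwd(f)

print(J(jnp.array([0.0, 5.0, 7.0])))
# [[1. 1. 0.]
#  [5. 0. 0.]]   <- J[1,1] = x[0] = 0 here: an accidental zero

print(J(jnp.array([2.0, 5.0, 7.0])))
# [[1. 1. 0.]
#  [5. 2. 0.]]   <- J[1,1] is nonzero now; J[:,2] is zero at every input
```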

### Dependency Graph

The sparsity pattern can be represented as a bipartite graph:

```text
inputs:   x1 x2 x3 x4
          | / | / | /
outputs:  y1 y2 y3
```

An edge $x_j \to y_i$ means $f_i$ depends on $x_j$.

The Jacobian pattern is the adjacency matrix of this dependency graph.
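
A sketch of reading the pattern off the dependency sets of the running example (the sets are written out by hand here; a real system would extract them from the computation graph):

```python
import numpy as np

# Which inputs each output reads, for the y1, y2, y3 example.
depends_on = {
    "y1": {"x1", "x2"},
    "y2": {"x2", "x3"},
    "y3": {"x3", "x4"},
}
inputs = ["x1", "x2", "x3", "x4"]
outputs = ["y1", "y2", "y3"]

# The Jacobian sparsity pattern: rows index outputs, columns index inputs.
pattern = np.array(
    [[xj in depends_on[yi] for xj in inputs] for yi in outputs],
    dtype=int,
)
print(pattern)
# [[1 1 0 0]
#  [0 1 1 0]
#  [0 0 1 1]]
```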

### Computing Sparse Jacobians

There are three common strategies.

| Strategy | Idea |
|---|---|
| Direct sparse propagation | Propagate sparse derivative maps through operations |
| Seed compression | Use graph coloring to combine columns or rows |
| Matrix-free products | Compute only $Jv$ or $v^\top J$ as needed |

Direct sparse propagation is simple but may allocate many small maps.

Seed compression is efficient when the sparsity pattern is known.

Matrix-free products are best when an optimizer only needs directional derivatives.
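
A sketch of seed compression on the banded running example, with a concrete stand-in function chosen only to match its pattern: columns $\{x_1, x_3\}$ never share a row, and neither do $\{x_2, x_4\}$, so two compressed forward sweeps recover all four columns:

```python
import jax
import jax.numpy as jnp

def f(x):
    # Stand-in for y1, y2, y3 with the same dependency pattern.
    x1, x2, x3, x4 = x
    return jnp.array([x1 * x2, x2 + x3**2, jnp.exp(x3) * x4])

x = jnp.array([1.0, 2.0, 3.0, 4.0])

# Two compressed seeds replace four basis-vector sweeps.
s1 = jnp.array([1.0, 0.0, 1.0, 0.0])   # columns x1 and x3 together
s2 = jnp.array([0.0, 1.0, 0.0, 1.0])   # columns x2 and x4 together

_, c1 = jax.jvp(f, (x,), (s1,))
_, c2 = jax.jvp(f, (x,), (s2,))

# Decompression: the known pattern says which column each entry belongs to.
J = jnp.zeros((3, 4))
J = J.at[0, 0].set(c1[0])   # row 1 reads x1
J = J.at[1, 2].set(c1[1])   # row 2 reads x3
J = J.at[2, 2].set(c1[2])   # row 3 reads x3
J = J.at[0, 1].set(c2[0])   # row 1 reads x2
J = J.at[1, 1].set(c2[1])   # row 2 reads x2
J = J.at[2, 3].set(c2[2])   # row 3 reads x4
print(J)
```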

### Sparse Forward Propagation

In forward mode, each value carries derivatives with respect to active inputs.

For sparse derivatives:

```text
value = number
deriv = map[input_id] -> derivative
```

For an operation:

$$
z = x \cdot y
$$

the derivative map is:

$$
dz = y\,dx + x\,dy
$$

Only nonzero entries are stored.

This works well when each intermediate depends on few inputs. It can become expensive when dependencies grow during computation.
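
A minimal sketch of this representation in plain Python (the `SparseDual` class and its two operators are illustrative, not a complete AD system):

```python
class SparseDual:
    """A value plus a sparse derivative map {input_id: d value / d input}."""

    def __init__(self, value, deriv=None):
        self.value = value
        self.deriv = deriv or {}

    def __add__(self, other):
        # Sum rule: union of the two key sets.
        d = dict(self.deriv)
        for k, v in other.deriv.items():
            d[k] = d.get(k, 0.0) + v
        return SparseDual(self.value + other.value, d)

    def __mul__(self, other):
        # Product rule: dz = y dx + x dy, keeping only keys that appear.
        d = {k: other.value * v for k, v in self.deriv.items()}
        for k, v in other.deriv.items():
            d[k] = d.get(k, 0.0) + self.value * v
        return SparseDual(self.value * other.value, d)


# Seed each input with a singleton derivative map.
x1 = SparseDual(2.0, {"x1": 1.0})
x2 = SparseDual(3.0, {"x2": 1.0})
x3 = SparseDual(5.0, {"x3": 1.0})

z = x1 * x2 + x3
print(z.value, z.deriv)   # 11.0 {'x1': 3.0, 'x2': 2.0, 'x3': 1.0}
```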

### Sparse Reverse Propagation

In reverse mode, each variable accumulates adjoints from downstream users.

Sparse reverse mode can avoid propagating through inactive edges.

If an output does not depend on an intermediate, its adjoint contribution is zero and can be skipped.

The reverse graph already gives useful sparsity: only executed dependencies participate in the backward pass.

For full sparse Jacobians, reverse mode can be run with compressed output seeds.
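
A sketch with reverse mode (using JAX's `vjp` and the same stand-in function as before): rows 1 and 3 touch disjoint column sets, so one backward sweep with the combined seed $e_1 + e_3$ recovers both rows, and a second sweep handles row 2:

```python
import jax
import jax.numpy as jnp

def f(x):
    x1, x2, x3, x4 = x
    return jnp.array([x1 * x2, x2 + x3**2, jnp.exp(x3) * x4])

x = jnp.array([1.0, 2.0, 3.0, 4.0])
_, vjp_fn = jax.vjp(f, x)

# Rows 1 and 3 touch disjoint columns, so one seed recovers both rows.
(r13,) = vjp_fn(jnp.array([1.0, 0.0, 1.0]))   # row 1 lives in cols 0-1, row 3 in cols 2-3
(r2,)  = vjp_fn(jnp.array([0.0, 1.0, 0.0]))   # row 2 on its own

row1 = jnp.array([r13[0], r13[1], 0.0, 0.0])
row3 = jnp.array([0.0, 0.0, r13[2], r13[3]])
row2 = jnp.array([0.0, r2[1], r2[2], 0.0])
print(jnp.stack([row1, row2, row3]))   # the full 3x4 Jacobian from two sweeps
```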

### Sparse Storage Formats

Sparse derivative matrices are stored in formats such as:

| Format | Good For |
|---|---|
| COO | Simple construction |
| CSR | Efficient row access |
| CSC | Efficient column access |
| Block sparse | Repeated dense blocks |
| Banded | Local stencil systems |
| Coordinate maps | Dynamic sparsity |

The choice depends on access pattern. Optimization solvers often prefer CSR or CSC.
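
A sketch using `scipy.sparse`: build the banded pattern from this section in COO form, then convert to CSR for row-oriented access (the numeric values are placeholders):

```python
import numpy as np
from scipy.sparse import coo_matrix

# COO is easy to construct; CSR is what many solvers want to consume.
rows = np.array([0, 0, 1, 1, 2, 2])
cols = np.array([0, 1, 1, 2, 2, 3])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # placeholder derivative values

J_coo = coo_matrix((vals, (rows, cols)), shape=(3, 4))
J_csr = J_coo.tocsr()

print(J_csr.toarray())
print(J_csr.indptr, J_csr.indices)   # CSR row-pointer and column-index arrays
```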

### Sparse Jacobian Example

Consider:

$$
f_1 = x_1^2 + x_2
$$

$$
f_2 = x_2 x_3
$$

$$
f_3 = \sin(x_3) + x_4
$$

Then:

$$
J =
\begin{bmatrix}
2x_1 & 1 & 0 & 0 \\
0 & x_3 & x_2 & 0 \\
0 & 0 & \cos(x_3) & 1
\end{bmatrix}
$$

The nonzero structure is banded. Computing all twelve entries wastes work when only six can be nonzero.
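
A sketch that assembles only the six structurally possible entries and cross-checks them against a dense Jacobian (the dense `jax.jacfwd` call is there purely for verification):

```python
import jax
import jax.numpy as jnp
import numpy as np
from scipy.sparse import csr_matrix

def f(x):
    x1, x2, x3, x4 = x
    return jnp.array([x1**2 + x2, x2 * x3, jnp.sin(x3) + x4])

x = np.array([1.0, 2.0, 3.0, 4.0])

# Only the six structurally nonzero entries are computed and stored.
rows = [0, 0, 1, 1, 2, 2]
cols = [0, 1, 1, 2, 2, 3]
vals = [2 * x[0], 1.0, x[2], x[1], np.cos(x[2]), 1.0]
J_sparse = csr_matrix((vals, (rows, cols)), shape=(3, 4))

# Cross-check against the dense Jacobian.
J_dense = jax.jacfwd(f)(jnp.asarray(x))
print(np.allclose(J_sparse.toarray(), np.asarray(J_dense)))   # True
```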

### Why Sparsity Changes Complexity

Dense Jacobian construction may require $n$ forward sweeps or $m$ reverse sweeps.

With sparsity, the number of sweeps can be reduced by grouping independent columns or rows.

If the columns (or rows) can be colored with $c$ colors so that no two columns (rows) of the same color share a nonzero row (column), the cost drops to roughly $c$ sweeps instead of $n$ or $m$.

For many PDE and graph problems, $c$ is small compared with the number of variables.
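
A minimal sketch of finding such a $c$ by greedy column coloring (illustrative only; production tools use more careful coloring heuristics on the column intersection graph):

```python
import numpy as np

def greedy_column_coloring(pattern):
    """Assign colors so that columns sharing a nonzero row get different colors."""
    m, n = pattern.shape
    rows_of = [set(np.nonzero(pattern[:, j])[0]) for j in range(n)]
    colors = [-1] * n
    for j in range(n):
        # Colors already taken by earlier columns that share a row with column j.
        forbidden = {colors[k] for k in range(j) if rows_of[j] & rows_of[k]}
        c = 0
        while c in forbidden:
            c += 1
        colors[j] = c
    return colors

# Banded pattern from the running example: 3 outputs, 4 inputs.
pattern = np.array([[1, 1, 0, 0],
                    [0, 1, 1, 0],
                    [0, 0, 1, 1]])
print(greedy_column_coloring(pattern))   # [0, 1, 0, 1] -> 2 sweeps instead of 4
```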

### Risks

Sparse differentiation has practical risks.

| Risk | Description |
|---|---|
| Pattern overestimation | Treating too many entries as possibly nonzero reduces benefit |
| Pattern underestimation | Missing a dependency gives wrong derivatives |
| Dynamic control flow | Pattern may change by input |
| Fill-in | Sparse intermediates become dense |
| Allocation overhead | Many small sparse maps can cost more than dense vectors for small dimensions |
| Hardware mismatch | GPUs often prefer dense or block-sparse kernels |

Sparse AD is a systems problem as much as a calculus problem.

### Core Idea

Sparse Jacobian computation avoids treating every input-output pair as dependent. It uses structural dependency information to compute only derivative entries that can matter.

The result is the same mathematical Jacobian, but represented and computed according to the sparsity of the underlying program.

