Most real computational problems are sparse. Large matrices and tensors often contain mostly zeros, structured blocks, or local interactions. Sparse representations reduce memory usage and computational cost by storing only the nonzero structure.
Automatic differentiation must preserve sparsity whenever possible. If differentiation destroys sparsity structure, the resulting gradients can become prohibitively expensive.
Sparse differentiation therefore has two goals:
- Compute correct derivatives.
- Preserve structural sparsity through the derivative pipeline.
Sparsity
A tensor is sparse if most entries are zero.
For a matrix
$$A \in \mathbb{R}^{m \times n},$$
define the number of stored nonzeros $\mathrm{nnz}(A)$. If
$$\mathrm{nnz}(A) \ll mn,$$
then sparse representations are beneficial.
Examples:
| Domain | Sparse Structure |
|---|---|
| Graphs | Sparse adjacency matrices |
| PDE discretization | Local stencil matrices |
| Recommender systems | Sparse user-item interactions |
| NLP | Sparse bag-of-words vectors |
| Scientific simulation | Sparse Jacobians |
| Optimization | Sparse constraints |
AD systems that ignore sparsity often become memory-bound long before arithmetic becomes expensive.
Sparse Storage Formats
Sparse tensors are stored structurally.
Common formats:
| Format | Structure |
|---|---|
| COO | Coordinate list |
| CSR | Compressed sparse row |
| CSC | Compressed sparse column |
| BSR | Block sparse row |
| ELLPACK | Fixed-width row storage |
| Hybrid sparse | Mixed sparse/dense layout |
For COO, three parallel arrays represent the tensor:
- `row_indices`
- `column_indices`
- `values`
Example COO storage of a $3 \times 3$ matrix:

```
rows   = [0, 1, 2, 2]
cols   = [0, 2, 0, 2]
values = [1, 2, 3, 4]
```

The derivative rules depend on whether the sparsity structure is fixed or variable.
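The same matrix can be built concretely; a minimal sketch using `scipy.sparse` (the library choice is illustrative, not prescribed by the text):

```python
import numpy as np
from scipy.sparse import coo_matrix

# The 3x3 matrix from the COO example above.
rows = np.array([0, 1, 2, 2])
cols = np.array([0, 2, 0, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0])

A = coo_matrix((vals, (rows, cols)), shape=(3, 3))

# Dense equivalent:
# [[1, 0, 0],
#  [0, 0, 2],
#  [3, 0, 4]]
x = np.array([1.0, 1.0, 1.0])
y = A @ x  # sparse matvec touches only the 4 stored entries
print(y)   # [1. 2. 7.]
```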
Sparse Structure vs Sparse Values
This distinction is critical.
A sparse tensor has:
- Structural sparsity pattern.
- Numerical values.
Example:

```
indices = fixed
values  = differentiable
```

Most AD systems differentiate the values but not the indices.
The indices are discrete objects. Ordinary differentiation does not apply to them.
Thus sparse tensor differentiation usually assumes the sparsity pattern is held fixed during differentiation.
This is analogous to pivot choices in matrix factorizations.
Sparse Matrix-Vector Product
Consider:
$$y = Ax,$$
where $A$ is sparse.
The forward differential is:
$$\dot{y} = \dot{A}\,x + A\,\dot{x}.$$
The reverse rules remain:
$$\bar{A} = \bar{y}\,x^\top, \qquad \bar{x} = A^\top \bar{y}.$$
But sparse structure changes implementation significantly.
If $A$ has a fixed sparsity pattern:
- Only stored entries receive gradients.
- Zero entries outside the pattern are not materialized.
Thus:
$$\bar{A}_{ij} = 0$$
for structurally absent entries.
The gradient inherits the sparsity mask.
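A minimal sketch of these rules, assuming a COO layout with parallel `rows`/`cols`/`vals` arrays (names are illustrative): value gradients are evaluated only at stored coordinates, so the sparsity mask is inherited automatically.

```python
import numpy as np

def sparse_matvec_vjp(rows, cols, vals, x, y_bar):
    """VJP for y = A @ x with A in COO form (fixed pattern).

    Returns gradients for the stored values only and for x;
    entries outside the pattern are never materialized.
    """
    # dL/dA_ij = y_bar_i * x_j, evaluated only at stored (i, j).
    vals_bar = y_bar[rows] * x[cols]
    # dL/dx = A^T @ y_bar, accumulated entry by entry.
    x_bar = np.zeros_like(x)
    np.add.at(x_bar, cols, vals * y_bar[rows])
    return vals_bar, x_bar

rows = np.array([0, 1, 2, 2])
cols = np.array([0, 2, 0, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([1.0, 2.0, 3.0])
y_bar = np.array([1.0, 1.0, 1.0])

vals_bar, x_bar = sparse_matvec_vjp(rows, cols, vals, x, y_bar)
```

Note that `vals_bar` has exactly the length of the stored-value array: the gradient lives in the same sparse parameter space as `vals`.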
Sparse Gradient Projection
Suppose:
$$A_{ij} = 0$$
structurally.
Even if the dense gradient formula would produce a nonzero value there, the sparse parameterization prevents that entry from existing.
Thus sparse differentiation often applies a projection:
$$\bar{A} \leftarrow P_\Omega(\bar{A}),$$
where $P_\Omega$ zeroes every entry outside the sparsity pattern $\Omega$.
This means:
The parameter space itself is sparse.
Sparse Jacobians
Many functions have sparse Jacobians.
Suppose:
$$f : \mathbb{R}^n \to \mathbb{R}^m.$$
The Jacobian:
$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$
is sparse if most outputs depend only on nearby inputs.
Examples:
| System | Jacobian Structure |
|---|---|
| PDE stencil | Local neighborhood |
| ODE integrator | Banded structure |
| Graph computation | Edge-local dependence |
| Physical simulation | Sparse interactions |
Sparse Jacobians are central to scientific computing.
Jacobian Sparsity Propagation
Sparsity propagates structurally through composition.
Suppose:
$$h = g \circ f.$$
Then:
$$J_h = J_g\, J_f.$$
Even if both factors are sparse, the product may become denser.
This is called fill-in.
Fill-in is one of the main computational problems in sparse AD.
Example:
| Matrix Type | Multiplication Result |
|---|---|
| Diagonal × diagonal | Diagonal |
| Banded × banded | Wider band |
| Sparse graph adjacency powers | Increasing density |
AD systems must balance:
- sparsity preservation
- reordering
- factorization cost
- memory growth

Forward Mode and Sparse Jacobians
Forward mode computes:
If the Jacobian is sparse, one can compute multiple directional derivatives simultaneously using graph coloring.
Suppose columns of the Jacobian do not overlap structurally. Then they can share a perturbation seed.
This reduces the number of forward passes.
Example: applying a seed matrix $S$ produces the compressed product $JS$.
Careful seed design reconstructs the sparse Jacobian from fewer evaluations.
This technique is widely used in scientific computing AD systems.
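A small sketch of seed compression for a tridiagonal Jacobian, using a linear map so the JVP is exact (the function, seed layout, and decompression loop are illustrative assumptions):

```python
import numpy as np

n = 6
# A tridiagonal Jacobian: column j only touches rows j-1, j, j+1,
# so columns j and j+3 never overlap and can share a seed.
T = np.diag(2.0 * np.ones(n)) \
    + np.diag(-1.0 * np.ones(n - 1), 1) \
    + np.diag(-1.0 * np.ones(n - 1), -1)

def jvp(v):
    # Directional derivative of f(x) = T @ x (linear, so exact).
    return T @ v

# Three seeds, one per color class {0, 3}, {1, 4}, {2, 5}.
seeds = np.zeros((n, 3))
for j in range(n):
    seeds[j, j % 3] = 1.0

compressed = np.column_stack([jvp(seeds[:, c]) for c in range(3)])  # J @ S

# Decompress: entry J[i, j] lives in column j % 3 of the compressed result,
# restricted to the known tridiagonal pattern.
J = np.zeros((n, n))
for j in range(n):
    for i in range(max(0, j - 1), min(n, j + 2)):
        J[i, j] = compressed[i, j % 3]
```

Three JVPs recover all six Jacobian columns; a dense approach would need six.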
Graph Coloring
Graph coloring reduces directional derivative evaluations.
Define a column-interference graph:
- Each Jacobian column is a node.
- Two columns share an edge if they may contribute to the same output row.
Columns with different colors can be seeded together.
If the graph can be colored with $c$ colors, only $c$ forward passes are needed instead of $n$.
Example:
| Problem | Variables | Colors Needed |
|---|---|---|
| Dense Jacobian | $n$ | $n$ |
| Tridiagonal Jacobian | $n$ | 3 |
| Grid stencil | Large | Small constant |
Sparse coloring can dramatically reduce AD cost.
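A greedy column-coloring sketch over an explicit sparsity mask (the `color_columns` helper is hypothetical, not a standard API):

```python
import numpy as np

def color_columns(pattern):
    """Greedy coloring of the column-interference graph.

    pattern: boolean (m, n) Jacobian sparsity mask.
    Two columns interfere if they share a nonzero row.
    """
    m, n = pattern.shape
    colors = -np.ones(n, dtype=int)
    for j in range(n):
        # Colors already used by earlier columns that share a row with j.
        forbidden = set()
        for k in range(j):
            if np.any(pattern[:, j] & pattern[:, k]):
                forbidden.add(colors[k])
        c = 0
        while c in forbidden:
            c += 1
        colors[j] = c
    return colors

# Tridiagonal pattern: 3 colors suffice regardless of n.
n = 8
pattern = np.zeros((n, n), dtype=bool)
for j in range(n):
    for i in range(max(0, j - 1), min(n, j + 2)):
        pattern[i, j] = True

colors = color_columns(pattern)
print(colors.max() + 1)  # 3 colors for any tridiagonal size
```

Greedy coloring is not optimal in general, but for structured patterns like banded Jacobians it typically finds the small constant the table above describes.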
Reverse Mode and Sparse Accumulation
Reverse mode computes vector-Jacobian products:
$$\bar{x} = J^\top \bar{y}.$$
Sparse reverse mode faces different challenges:
| Problem | Cause |
|---|---|
| Sparse accumulation | Many paths contribute |
| Atomic updates | Parallel accumulation |
| Fill-in | Adjoint propagation |
| Memory locality | Irregular structure |
Sparse reverse kernels often require scatter-add operations:

```
for edge (i, j):
    grad_x[j] += local_grad * grad_y[i]
```

Parallel sparse accumulation is difficult because many threads may update the same gradient location.
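The duplicate-accumulation hazard can be seen directly in NumPy, where fancy-index `+=` buffers updates while `np.add.at` performs an unbuffered scatter-add (the edge arrays are illustrative):

```python
import numpy as np

grad_y = np.array([1.0, 2.0, 3.0])
local_grad = np.array([1.0, 1.0, 1.0, 1.0])
src = np.array([0, 1, 2, 2])   # edge sources i (upstream gradients)
dst = np.array([0, 2, 0, 2])   # edge targets j (gradient outputs)

# WRONG: fancy-index += silently drops duplicate updates per index
# (only the last write to a repeated index survives).
bad = np.zeros(3)
bad[dst] += local_grad * grad_y[src]

# CORRECT: np.add.at performs an unbuffered, duplicate-safe scatter-add.
good = np.zeros(3)
np.add.at(good, dst, local_grad * grad_y[src])
```

The same last-write-wins hazard reappears in parallel kernels, which is why GPU implementations rely on atomic adds or segmented reductions.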
Sparse Hessians
Second derivatives are often sparse even when first derivatives are not.
Suppose:
$$f : \mathbb{R}^n \to \mathbb{R}.$$
The Hessian:
$$H_{ij} = \frac{\partial^2 f}{\partial x_i\, \partial x_j}$$
is sparse when variables interact locally.
Examples:
| System | Hessian Structure |
|---|---|
| Mechanical systems | Local coupling |
| PDE discretization | Sparse stencil |
| Graph optimization | Edge-local structure |
Sparse Hessians are important because dense Hessian storage scales as:
$$O(n^2).$$
Sparse methods can reduce complexity dramatically.
Hessian-Vector Products
Most systems avoid forming Hessians explicitly.
Instead they compute Hessian-vector products:
$$Hv.$$
This preserves sparsity implicitly.
Methods include:
| Method | Technique |
|---|---|
| Forward-over-reverse | JVP through reverse pass |
| Reverse-over-forward | VJP through forward pass |
| Pearlmutter trick | Efficient Hessian-vector product |
Sparse Hessian-vector products are fundamental in large-scale optimization.
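A sketch contrasting an exact Hessian-vector product with a gradient-difference approximation for a quadratic objective (the function and tridiagonal Hessian are illustrative; real AD systems compute $Hv$ exactly via forward-over-reverse rather than finite differences):

```python
import numpy as np

def grad_f(x, A):
    # Gradient of f(x) = 0.5 * x @ A @ x for symmetric A.
    return A @ x

def hvp(x, v, A, eps=1e-6):
    """Hessian-vector product via a gradient difference.

    Hv ~= (grad f(x + eps*v) - grad f(x)) / eps.  AD systems replace
    this with a JVP through the reverse pass, which is exact and
    never materializes the Hessian.
    """
    return (grad_f(x + eps * v, A) - grad_f(x, A)) / eps

# Tridiagonal (hence sparse) Hessian.
n = 5
A = np.diag(2.0 * np.ones(n)) \
    + np.diag(-np.ones(n - 1), 1) \
    + np.diag(-np.ones(n - 1), -1)
x = np.ones(n)
v = np.arange(1.0, n + 1)

approx = hvp(x, v, A)   # compare against the exact product A @ v
```

Because the Hessian here is never formed, only matvec-level storage is needed, which is what makes Hessian-vector methods viable at scale.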
Sparse Solves in Reverse Mode
Sparse linear solves appear frequently in implicit differentiation.
Suppose:
$$Ax = b,$$
with sparse $A$.
The reverse rule requires:
$$A^\top \bar{b} = \bar{x}, \qquad \bar{A} = -\bar{b}\,x^\top.$$
Thus the backward pass itself requires a sparse solve.
The cost of reverse mode may therefore approach the cost of another numerical solve.
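A sketch of this pattern with `scipy.sparse.linalg.splu`, which factorizes once and reuses the factorization for both the forward and the transposed backward solve (the matrix and cotangent are illustrative):

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

# Sparse system A x = b.
A = csc_matrix(np.array([[4.0, 1.0, 0.0],
                         [1.0, 3.0, 0.0],
                         [0.0, 0.0, 2.0]]))
b = np.array([1.0, 2.0, 3.0])

lu = splu(A)          # symbolic + numeric factorization, reusable
x = lu.solve(b)       # forward solve

# Reverse pass for x = A^{-1} b:
#   b_bar solves A^T b_bar = x_bar  (another sparse solve, same factors)
#   A_bar = -b_bar x^T              (restricted to A's pattern in practice)
x_bar = np.array([1.0, 0.0, 0.0])
b_bar = lu.solve(x_bar, trans='T')
```

Reusing the `splu` factorization is exactly the "symbolic factorization reuse" listed below: the backward solve costs only a triangular solve, not a second factorization.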
Efficient sparse AD often depends on:
- symbolic factorization reuse
- fill-reducing reorderings
- cached sparsity structure
- iterative solves

Sparse Tensor Contraction
Sparse tensor contraction generalizes sparse matrix multiplication.
Suppose:
$$C_{ik} = \sum_j A_{ij} B_{jk}.$$
Only nonzero entries contribute.
Efficient sparse contraction requires:
| Concern | Importance |
|---|---|
| Index ordering | Memory locality |
| Duplicate accumulation | Correctness |
| Intermediate fill-in | Complexity |
| Parallel scheduling | Throughput |
The reverse rules remain algebraically identical to dense contraction. The challenge is structural efficiency.
Dynamic Sparsity
Some systems change sparsity patterns dynamically.
Examples:
| System | Dynamic Structure |
|---|---|
| Attention masks | Input-dependent edges |
| Pruned neural networks | Learned sparsity |
| Adaptive meshes | Refinement/coarsening |
| Sparse routing | Conditional execution |
Dynamic sparsity creates discontinuities.
Changing whether an entry exists is a discrete operation.
AD usually treats the current sparsity pattern as fixed during a backward pass.
Sparse Attention
Transformer sparsity patterns are increasingly important.
Examples:
| Pattern | Structure |
|---|---|
| Local attention | Banded |
| Block sparse attention | Block structure |
| Routing attention | Dynamic graph |
| Long-context attention | Hierarchical sparsity |
Sparse attention derivatives require efficient scatter/gather kernels and structured reductions.
Memory bandwidth often dominates arithmetic cost.
Sparse Graph Operations
Graph neural networks naturally produce sparse operators.
A graph adjacency matrix satisfies:
$$A_{ij} \neq 0$$
if an edge exists.
Message passing often has the form:
$$h_i' = \sum_{j \in \mathcal{N}(i)} W h_j.$$
Reverse mode propagates gradients through sparse neighborhood aggregation.
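A sketch of edge-list message passing and its hand-written VJP (the function names, edge list, and single weight matrix are illustrative assumptions):

```python
import numpy as np

def message_pass(h, W, src, dst, n):
    """h'_i = sum over edges (src -> dst) of W h_src, reduced at dst."""
    msgs = h[src] @ W.T                 # one message per edge
    out = np.zeros((n, W.shape[0]))
    np.add.at(out, dst, msgs)           # sparse neighbor reduction
    return out

def message_pass_vjp(h, W, src, dst, out_bar):
    """Reverse pass: gradients flow back along reversed edges."""
    msg_bar = out_bar[dst]              # gather adjoints per edge
    h_bar = np.zeros_like(h)
    np.add.at(h_bar, src, msg_bar @ W)  # scatter-add to source nodes
    W_bar = msg_bar.T @ h[src]
    return h_bar, W_bar

# Tiny graph: edges 0->1, 1->2, 2->1.
src = np.array([0, 1, 2])
dst = np.array([1, 2, 1])
h = np.random.default_rng(0).normal(size=(3, 2))
W = np.eye(2)

out = message_pass(h, W, src, dst, n=3)
```

The backward pass is itself a message-passing step over reversed edges, which is why graph differentiation inherits every sparse-kernel concern listed below.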
Sparse graph AD must handle:
- edge accumulation
- neighbor reduction
- scatter-add
- batch graph packing
- irregular memory access

Graph differentiation is fundamentally sparse differentiation.
Structured Sparsity
Not all sparsity is random.
Structured patterns include:
| Structure | Example |
|---|---|
| Diagonal | Scaling operators |
| Banded | PDE discretization |
| Block sparse | Attention |
| Toeplitz | Signal processing |
| Kronecker | Tensor decompositions |
Structured sparsity enables specialized derivative kernels.
Example:
A banded matrix with bandwidth $b$ allows:
$$O(nb)$$
operations instead of:
$$O(n^2).$$
AD systems that preserve structure can inherit these savings.
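A sketch of a banded matvec stored by diagonals, costing $O(nb)$ work (the diagonal-storage convention here is an assumption, not a standard API):

```python
import numpy as np

def banded_matvec(diags, offsets, x):
    """y = A @ x for a banded matrix stored by diagonals.

    diags[k] holds the diagonal at offset offsets[k] (positive =
    superdiagonal); cost is O(n * bandwidth) instead of O(n^2).
    """
    n = len(x)
    y = np.zeros(n)
    for d, off in zip(diags, offsets):
        if off >= 0:
            # A[i, i+off] = d[i] contributes d[i] * x[i+off]
            y[:n - off] += d * x[off:]
        else:
            # A[i-off', i] = d[i] with off' = -off
            y[-off:] += d * x[:n + off]
    return y

n = 5
main = 2.0 * np.ones(n)
off1 = -1.0 * np.ones(n - 1)
x = np.arange(1.0, n + 1)
y = banded_matvec([main, off1, off1], [0, 1, -1], x)

# Reference: dense tridiagonal matvec for comparison.
A = np.diag(main) + np.diag(off1, 1) + np.diag(off1, -1)
```

A derivative kernel for this structure touches the same $O(nb)$ entries, so an AD system that keeps the banded representation inherits the savings automatically.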
Sparse Layout Metadata
A sparse tensor primitive must track:
- indices
- storage format
- shape
- batch axes
- duplicate semantics
- sortedness
- coalescing state

Two sparse tensors may represent the same values but differ operationally because their storage layouts differ.
Gradient kernels must preserve consistency.
Sparse Accumulation Semantics
Sparse systems often permit duplicate coordinates:

```
(i, j, val1)
(i, j, val2)
```

Semantics may define the stored entry as the sum of its duplicates: the effective value at $(i, j)$ is $val1 + val2$.
Backward accumulation must then preserve this additive interpretation.
Sparse differentiation therefore depends on precise storage semantics.
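A coalescing sketch that sums duplicates into a canonical sorted form (the `coalesce` helper is hypothetical):

```python
import numpy as np

def coalesce(rows, cols, vals, shape):
    """Sum duplicate COO coordinates into a canonical sorted form."""
    # Linearize coordinates so duplicates become equal keys.
    keys = rows * shape[1] + cols
    uniq, inverse = np.unique(keys, return_inverse=True)
    summed = np.zeros(len(uniq))
    np.add.at(summed, inverse, vals)  # additive duplicate semantics
    return uniq // shape[1], uniq % shape[1], summed

rows = np.array([0, 1, 0])
cols = np.array([2, 1, 2])          # (0, 2) appears twice
vals = np.array([1.0, 5.0, 3.0])
r, c, v = coalesce(rows, cols, vals, shape=(2, 3))
# (0, 2) now stores 1.0 + 3.0 = 4.0.  Under additive semantics, the
# backward pass gives each original duplicate the same gradient as
# the coalesced entry.
```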
Implementation Challenges
Sparse AD introduces difficult systems problems:
| Challenge | Explanation |
|---|---|
| Irregular memory access | Poor cache locality |
| Atomic gradient accumulation | Parallel conflicts |
| Fill-in | Growing intermediate density |
| Layout conversion | COO ↔ CSR ↔ CSC |
| Dynamic sparsity | Discrete structure changes |
| GPU utilization | Irregular workloads |
Sparse kernels are often memory-bound rather than compute-bound.
Practical Guidance
Prefer:
| Good Practice | Avoid |
|---|---|
| Fixed sparsity structure | Constant layout changes |
| Structured sparsity | Arbitrary irregular sparsity |
| Hessian-vector products | Dense Hessian materialization |
| Sparse solves | Dense inversion |
| Graph coloring | Full Jacobian assembly |
| Projector gradients | Materializing dense masks |
For AD implementers:
| Focus Area | Importance |
|---|---|
| Accumulation correctness | Critical |
| Sparse metadata tracking | Critical |
| Efficient scatter-add | Essential |
| Fill-in control | Major performance issue |
| Layout-aware kernels | Essential |
Summary
Sparse tensor differentiation extends ordinary AD with structural constraints. The mathematical derivative rules remain mostly unchanged, but the implementation must preserve sparsity patterns, avoid unnecessary fill-in, and manage irregular memory behavior efficiently.
The key distinction is between sparse values and sparse structure. Values are differentiable. Structural sparsity patterns are usually treated as fixed local topology. Efficient sparse AD depends on preserving that topology throughout forward and reverse computation.