
Sparse Tensor Derivatives

Most real computational problems are sparse. Large matrices and tensors often contain mostly zeros, structured blocks, or local interactions. Sparse representations reduce memory usage and computational cost by storing only the nonzero structure.

Automatic differentiation must preserve sparsity whenever possible. If differentiation destroys sparsity structure, the resulting gradients can become prohibitively expensive.

Sparse differentiation therefore has two goals:

  1. Compute correct derivatives.
  2. Preserve structural sparsity through the derivative pipeline.

Sparsity

A tensor is sparse if most entries are zero.

For a matrix

A \in \mathbb{R}^{m \times n},

define:

\operatorname{nnz}(A) = \text{number of nonzero entries}.

If

\operatorname{nnz}(A) \ll mn,

then sparse representations are beneficial.

Examples:

| Domain | Sparse Structure |
| --- | --- |
| Graphs | Sparse adjacency matrices |
| PDE discretization | Local stencil matrices |
| Recommender systems | Sparse user-item interactions |
| NLP | Sparse bag-of-words vectors |
| Scientific simulation | Sparse Jacobians |
| Optimization | Sparse constraints |

AD systems that ignore sparsity often become memory-bound long before arithmetic becomes expensive.

Sparse Storage Formats

Sparse tensors are stored by structure: only the coordinates and values of the nonzero entries are kept.

Common formats:

| Format | Structure |
| --- | --- |
| COO | Coordinate list |
| CSR | Compressed sparse row |
| CSC | Compressed sparse column |
| BSR | Block sparse row |
| ELLPACK | Fixed-width row storage |
| Hybrid sparse | Mixed sparse/dense layout |

For COO, three parallel arrays represent the tensor:

row_indices
column_indices
values

Example:

A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 2 \\ 3 & 0 & 4 \end{bmatrix}

COO storage:

rows   = [0,1,2,2]
cols   = [0,2,0,2]
values = [1,2,3,4]
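As a concrete check, here is the same matrix in SciPy's COO format (a minimal sketch assuming scipy is installed):

```python
import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 1, 2, 2])
cols = np.array([0, 2, 0, 2])
values = np.array([1.0, 2.0, 3.0, 4.0])

# The three parallel arrays fully determine the matrix.
A = coo_matrix((values, (rows, cols)), shape=(3, 3))
print(A.toarray())
# [[1. 0. 0.]
#  [0. 0. 2.]
#  [3. 0. 4.]]
```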

The derivative rules depend on whether sparsity structure is fixed or variable.

Sparse Structure vs Sparse Values

This distinction is critical.

A sparse tensor has:

  1. Structural sparsity pattern.
  2. Numerical values.

Example:

indices = fixed
values  = differentiable

Most AD systems differentiate the values but not the indices.

The indices are discrete objects. Ordinary differentiation does not apply to them.

Thus sparse tensor differentiation usually assumes:

\text{sparsity pattern is fixed locally}.

This is analogous to pivot choices in matrix factorizations.
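A minimal PyTorch sketch of this split, assuming a build with autograd support for sparse COO tensors (torch.sparse_coo_tensor and torch.sparse.mm): the indices remain fixed discrete data while the values receive gradients.

```python
import torch

# Fixed structural pattern: discrete data, never differentiated.
indices = torch.tensor([[0, 1, 2, 2],
                        [0, 2, 0, 2]])
# Differentiable numerical values living on that pattern.
values = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)

A = torch.sparse_coo_tensor(indices, values, (3, 3))
x = torch.ones(3, 1)

y = torch.sparse.mm(A, x)   # sparse matrix times dense vector
y.sum().backward()

print(values.grad)          # one gradient per stored entry; nothing else exists
```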

Sparse Matrix-Vector Product

Consider:

y = Ax,

where A is sparse.

The forward differential is:

dy = dA\,x + A\,dx.

The reverse rules remain:

\bar{A} \mathrel{+}= \bar{y} x^T, \qquad \bar{x} \mathrel{+}= A^T \bar{y}.

But sparse structure changes implementation significantly.

If A has a fixed sparsity pattern:

  • Only stored entries receive gradients.
  • Zero entries outside the pattern are not materialized.

Thus:

\bar{A}_{ij} = 0

for structurally absent entries.

The gradient inherits the sparsity mask.
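A hand-written NumPy sketch of these rules for the COO example above makes the masking explicit: the adjoint of the values is a gather over the pattern, so absent entries are never materialized.

```python
import numpy as np

rows = np.array([0, 1, 2, 2])          # fixed sparsity pattern
cols = np.array([0, 2, 0, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0])  # differentiable parameters

def spmv(vals, x):
    """y = A x with A stored as COO on the fixed (rows, cols) pattern."""
    y = np.zeros(3)
    np.add.at(y, rows, vals * x[cols])  # scatter-add over nonzeros
    return y

def spmv_vjp(ybar, vals, x):
    """Reverse rules restricted to the pattern."""
    vals_bar = ybar[rows] * x[cols]     # (ybar x^T) gathered at stored entries
    xbar = np.zeros_like(x)
    np.add.at(xbar, cols, vals * ybar[rows])  # A^T ybar
    return vals_bar, xbar

x = np.array([1.0, 2.0, 3.0])
ybar = np.ones(3)
print(spmv(vals, x))             # [ 1.  6. 15.]
print(spmv_vjp(ybar, vals, x))
```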

Sparse Gradient Projection

Suppose:

A_{ij} = 0

structurally.

Even if the dense gradient formula would produce a nonzero value there, the sparse parameterization prevents that entry from existing.

Thus sparse differentiation often applies a projection:

\bar{A} \leftarrow P_{\mathcal{S}}(\bar{A}),

where:

\mathcal{S} = \text{sparsity pattern}.

This means:

(P_{\mathcal{S}}(G))_{ij} = \begin{cases} G_{ij}, & (i,j) \in \mathcal{S}, \\ 0, & \text{otherwise}. \end{cases}

The parameter space itself is sparse.
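In COO storage the projection is simply a gather: evaluate the dense gradient formula only at stored coordinates. A minimal NumPy sketch:

```python
import numpy as np

def project_to_pattern(G, rows, cols):
    # (P_S(G))_ij kept only for (i, j) in S; as COO values this is a gather.
    return G[rows, cols]

G = np.arange(9.0).reshape(3, 3)   # a dense gradient candidate
rows = np.array([0, 1, 2, 2])
cols = np.array([0, 2, 0, 2])
print(project_to_pattern(G, rows, cols))   # [0. 5. 6. 8.]
```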

Sparse Jacobians

Many functions have sparse Jacobians.

Suppose:

f : \mathbb{R}^n \to \mathbb{R}^m.

The Jacobian:

J_{ij} = \frac{\partial f_i}{\partial x_j}

is sparse if most outputs depend only on nearby inputs.

Examples:

| System | Jacobian Structure |
| --- | --- |
| PDE stencil | Local neighborhood |
| ODE integrator | Banded structure |
| Graph computation | Edge-local dependence |
| Physical simulation | Sparse interactions |

Sparse Jacobians are central to scientific computing.

Jacobian Sparsity Propagation

Sparsity propagates structurally through composition.

Suppose:

f(x) = h(g(x)).

Then:

J_f = J_h J_g.

Even if both factors are sparse, the product may become denser.

This is called fill-in.

Fill-in is one of the main computational problems in sparse AD.

Example:

| Matrix Type | Multiplication Result |
| --- | --- |
| Diagonal × diagonal | Diagonal |
| Banded × banded | Wider band |
| Sparse graph adjacency powers | Increasing density |
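The band-widening case is easy to verify numerically; a small SciPy sketch (assuming scipy.sparse is available):

```python
import numpy as np
import scipy.sparse as sp

n = 100
ones = np.ones(n)
# Tridiagonal operator: bandwidth 1, nnz = 3n - 2.
B = sp.diags([ones[:-1], ones, ones[:-1]], offsets=[-1, 0, 1], format="csr")

print(B.nnz)             # 298
print((B @ B).nnz)       # 494: pentadiagonal, band widened on each side
print((B @ B @ B).nnz)   # 688: heptadiagonal; repeated products fill in further
```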

AD systems must balance:

  • sparsity preservation
  • reordering
  • factorization cost
  • memory growth

Forward Mode and Sparse Jacobians

Forward mode computes:

Jv.

If the Jacobian is sparse, one can compute multiple directional derivatives simultaneously using graph coloring.

Suppose columns of the Jacobian do not overlap structurally. Then they can share a perturbation seed.

This reduces the number of forward passes.

Example:

seed matrix S

produces:

JS.

Careful seed design reconstructs the sparse Jacobian from fewer evaluations.

This technique is widely used in scientific computing AD systems.

Graph Coloring

Graph coloring reduces directional derivative evaluations.

Define a column-interference graph:

  • Each Jacobian column is a node.
  • Two columns share an edge if they may contribute to the same output row.

Columns with different colors can be seeded together.

If the graph uses k colors, only k forward passes are needed instead of n.

Example:

| Problem | Variables | Colors Needed |
| --- | --- | --- |
| Dense Jacobian | n | n |
| Tridiagonal Jacobian | n | 3 |
| Grid stencil | Large n | Small constant |

Sparse coloring can dramatically reduce AD cost.
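A JAX sketch of coloring-based compression for an illustrative function with tridiagonal-with-wraparound dependence (the function f, the size n = 9, and the mod-3 coloring are all assumptions for the example):

```python
import jax
import jax.numpy as jnp

n = 9
x = jnp.arange(1.0, n + 1)

def f(x):
    # Each output touches only its circular neighbors, so the Jacobian is
    # tridiagonal with wraparound corners; 3 colors suffice since n % 3 == 0.
    return x**2 + 0.5 * jnp.roll(x, 1) + 0.25 * jnp.roll(x, -1)

# Color column j with j mod 3; one seed vector per color class.
seeds = [(jnp.arange(n) % 3 == c).astype(x.dtype) for c in range(3)]

# Compressed Jacobian J @ S: 3 forward passes instead of n = 9.
JS = jnp.stack([jax.jvp(f, (x,), (s,))[1] for s in seeds], axis=1)

# Recovery: J[i, j] sits in column j % 3 of JS, because no two columns
# of the same color touch the same output row.
J = jax.jacfwd(f)(x)
i, j = 4, 5
assert jnp.allclose(JS[i, j % 3], J[i, j])
```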

Reverse Mode and Sparse Accumulation

Reverse mode computes:

J^T u.

Sparse reverse mode faces different challenges:

| Problem | Cause |
| --- | --- |
| Sparse accumulation | Many paths contribute |
| Atomic updates | Parallel accumulation |
| Fill-in | Adjoint propagation |
| Memory locality | Irregular structure |

Sparse reverse kernels often require scatter-add operations:

for edge (i,j):
    grad_x[j] += local_grad * grad_y[i]

Parallel sparse accumulation is difficult because many threads may update the same gradient location.
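A vectorized NumPy version of this loop (with illustrative edge arrays) shows why the accumulation semantics matter:

```python
import numpy as np

src = np.array([0, 1, 2, 2])   # edge sources i
dst = np.array([1, 2, 0, 1])   # edge destinations j

grad_y = np.array([0.1, 0.2, 0.3])
local_grad = np.ones(4)

grad_x = np.zeros(3)
# np.add.at accumulates correctly under repeated indices; a fancy-indexed
# `grad_x[dst] += ...` would silently drop one of the updates to index 1.
np.add.at(grad_x, dst, local_grad * grad_y[src])
print(grad_x)   # [0.3 0.4 0.2]
```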

Sparse Hessians

Second derivatives are often sparse even when first derivatives are not.

Suppose:

f : \mathbb{R}^n \to \mathbb{R}.

The Hessian:

H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}

is sparse when variables interact locally.

Examples:

| System | Hessian Structure |
| --- | --- |
| Mechanical systems | Local coupling |
| PDE discretization | Sparse stencil |
| Graph optimization | Edge-local structure |

Sparse Hessians are important because dense Hessians scale as:

O(n^2).

Sparse methods can reduce complexity dramatically.

Hessian-Vector Products

Most systems avoid forming Hessians explicitly.

Instead they compute:

Hv.

This preserves sparsity implicitly.

Methods include:

| Method | Technique |
| --- | --- |
| Forward-over-reverse | JVP through reverse pass |
| Reverse-over-forward | VJP through forward pass |
| Pearlmutter trick | Efficient Hessian-vector product |

Sparse Hessian-vector products are fundamental in large-scale optimization.
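A JAX sketch of the forward-over-reverse composition for an illustrative objective (jax.grad builds the reverse pass; jax.jvp pushes the direction v through it):

```python
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative scalar objective with local coupling.
    return jnp.sum(jnp.sin(x) * jnp.roll(x, 1))

def hvp(f, x, v):
    # Forward-over-reverse: differentiate grad(f) along v without ever
    # materializing the n x n Hessian.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.arange(1.0, 6.0)
v = jnp.ones_like(x)
print(hvp(f, x, v))
```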

Sparse Solves in Reverse Mode

Sparse linear solves appear frequently in implicit differentiation.

Suppose:

Ax = b,

with sparse A.

The reverse rule requires:

A^T g = \bar{x}.

Thus the backward pass itself requires a sparse solve.

The cost of reverse mode may therefore approach the cost of another numerical solve.

Efficient sparse AD often depends on:

  • symbolic factorization reuse
  • fill-reducing reorderings
  • cached sparsity structure
  • iterative solves
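A sketch of factorization reuse with SciPy's SuperLU interface (splu); the example matrix is illustrative. The same reverse rule also yields \bar{b} = g and \bar{A} = -g x^T, restricted to A's pattern:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

A = sp.csc_matrix(np.array([[4.0, 1.0, 0.0],
                            [1.0, 3.0, 1.0],
                            [0.0, 1.0, 2.0]]))
b = np.array([1.0, 2.0, 3.0])

lu = splu(A)                    # symbolic + numeric factorization, done once
x = lu.solve(b)                 # forward pass: solve A x = b

xbar = np.ones(3)               # incoming adjoint of x
g = lu.solve(xbar, trans="T")   # backward pass reuses the factors: A^T g = xbar

bbar = g                        # adjoint of the right-hand side
Abar = -np.outer(g, x)          # dense form of -g x^T; a real sparse
                                # implementation projects this onto A's pattern
```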

Sparse Tensor Contraction

Sparse tensor contraction generalizes sparse matrix multiplication.

Suppose:

C_{ij} = \sum_k A_{ik} B_{kj}.

Only nonzero entries contribute.

Efficient sparse contraction requires:

| Concern | Importance |
| --- | --- |
| Index ordering | Memory locality |
| Duplicate accumulation | Correctness |
| Intermediate fill-in | Complexity |
| Parallel scheduling | Throughput |

The reverse rules remain algebraically identical to dense contraction. The challenge is structural efficiency.

Dynamic Sparsity

Some systems change sparsity patterns dynamically.

Examples:

| System | Dynamic Structure |
| --- | --- |
| Attention masks | Input-dependent edges |
| Pruned neural networks | Learned sparsity |
| Adaptive meshes | Refinement/coarsening |
| Sparse routing | Conditional execution |

Dynamic sparsity creates discontinuities.

Changing whether an entry exists is a discrete operation.

AD usually treats the current sparsity pattern as fixed during a backward pass.

Sparse Attention

Transformer sparsity patterns are increasingly important.

Examples:

| Pattern | Structure |
| --- | --- |
| Local attention | Banded |
| Block sparse attention | Block structure |
| Routing attention | Dynamic graph |
| Long-context attention | Hierarchical sparsity |

Sparse attention derivatives require efficient scatter/gather kernels and structured reductions.

Memory bandwidth often dominates arithmetic cost.

Sparse Graph Operations

Graph neural networks naturally produce sparse operators.

A graph adjacency matrix:

A_{ij} = 1

if an edge exists.

Message passing often has form:

H' = A H W.

Reverse mode propagates gradients through sparse neighborhood aggregation.
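A NumPy sketch of the forward aggregation in edge-list form (illustrative edge arrays and shapes); the backward pass runs the same gather/scatter machinery with the roles of source and destination swapped:

```python
import numpy as np

src = np.array([0, 1, 2, 2])     # edge sources j
dst = np.array([1, 2, 0, 1])     # edge destinations i

H = np.random.randn(3, 4)        # node features
W = np.random.randn(4, 4)        # weight matrix

msgs = H[src] @ W                # gather source features, then transform
H_out = np.zeros((3, 4))
np.add.at(H_out, dst, msgs)      # scatter-add: neighbor reduction per node
```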

Sparse graph AD must handle:

  • edge accumulation
  • neighbor reduction
  • scatter-add
  • batch graph packing
  • irregular memory access

Graph differentiation is fundamentally sparse differentiation.

Structured Sparsity

Not all sparsity is random.

Structured patterns include:

| Structure | Example |
| --- | --- |
| Diagonal | Scaling operators |
| Banded | PDE discretization |
| Block sparse | Attention |
| Toeplitz | Signal processing |
| Kronecker | Tensor decompositions |

Structured sparsity enables specialized derivative kernels.

Example:

A banded matrix with bandwidth k allows:

O(kn)

operations instead of:

O(n^2).

AD systems that preserve structure can inherit these savings.

Sparse Layout Metadata

A sparse tensor primitive must track:

  • indices
  • storage format
  • shape
  • batch axes
  • duplicate semantics
  • sortedness
  • coalescing state

Two sparse tensors may represent the same values but differ operationally because their storage layouts differ.

Gradient kernels must preserve consistency.

Sparse Accumulation Semantics

Sparse systems often permit duplicate coordinates:

(i,j,val1)
(i,j,val2)

Semantics may define:

A_{ij} = \text{val1} + \text{val2}.

Backward accumulation must then preserve this additive interpretation.

Sparse differentiation therefore depends on precise storage semantics.
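SciPy's COO format follows this additive convention, as a quick check shows; a correct backward pass must then route the adjoint of the coalesced entry back to both stored values:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Two stored entries at the same coordinate (0, 1).
rows = np.array([0, 0, 1])
cols = np.array([1, 1, 0])
vals = np.array([2.0, 3.0, 5.0])

A = coo_matrix((vals, (rows, cols)), shape=(2, 2))
A.sum_duplicates()        # coalesce: A[0, 1] == 2.0 + 3.0
print(A.toarray())
# [[0. 5.]
#  [5. 0.]]
```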

Implementation Challenges

Sparse AD introduces difficult systems problems:

| Challenge | Explanation |
| --- | --- |
| Irregular memory access | Poor cache locality |
| Atomic gradient accumulation | Parallel conflicts |
| Fill-in | Growing intermediate density |
| Layout conversion | COO ↔ CSR ↔ CSC |
| Dynamic sparsity | Discrete structure changes |
| GPU utilization | Irregular workloads |

Sparse kernels are often memory-bound rather than compute-bound.

Practical Guidance

Prefer:

| Good Practice | Avoid |
| --- | --- |
| Fixed sparsity structure | Constant layout changes |
| Structured sparsity | Arbitrary irregular sparsity |
| Hessian-vector products | Dense Hessian materialization |
| Sparse solves | Dense inversion |
| Graph coloring | Full Jacobian assembly |
| Projector gradients | Materializing dense masks |

For AD implementers:

| Focus Area | Importance |
| --- | --- |
| Accumulation correctness | Critical |
| Sparse metadata tracking | Critical |
| Efficient scatter-add | Essential |
| Fill-in control | Major performance issue |
| Layout-aware kernels | Essential |

Summary

Sparse tensor differentiation extends ordinary AD with structural constraints. The mathematical derivative rules remain mostly unchanged, but the implementation must preserve sparsity patterns, avoid unnecessary fill-in, and manage irregular memory behavior efficiently.

The key distinction is between sparse values and sparse structure. Values are differentiable. Structural sparsity patterns are usually treated as fixed local topology. Efficient sparse AD depends on preserving that topology throughout forward and reverse computation.