
Sparse Tensor Derivatives

Most real computational problems are sparse. Large matrices and tensors often contain mostly zeros, structured blocks, or local interactions. Sparse representations reduce memory usage and computational cost by storing only the nonzero structure.

Automatic differentiation must preserve sparsity whenever possible. If differentiation destroys sparsity structure, the resulting gradients can become prohibitively expensive.

Sparse differentiation therefore has two goals:

  1. Compute correct derivatives.
  2. Preserve structural sparsity through the derivative pipeline.

Sparsity

A tensor is sparse if most entries are zero.

For a matrix

A \in \mathbb{R}^{m \times n},

define:

\operatorname{nnz}(A) = \text{number of nonzero entries}.

If

\operatorname{nnz}(A) \ll mn,

then sparse representations are beneficial.

Examples:

| Domain | Sparse Structure |
| --- | --- |
| Graphs | Sparse adjacency matrices |
| PDE discretization | Local stencil matrices |
| Recommender systems | Sparse user-item interactions |
| NLP | Sparse bag-of-words vectors |
| Scientific simulation | Sparse Jacobians |
| Optimization | Sparse constraints |

AD systems that ignore sparsity often become memory-bound long before arithmetic becomes expensive.

Sparse Storage Formats

Sparse tensors are stored by structure: only the coordinates and values of the nonzero entries are kept.

Common formats:

| Format | Structure |
| --- | --- |
| COO | Coordinate list |
| CSR | Compressed sparse row |
| CSC | Compressed sparse column |
| BSR | Block sparse row |
| ELLPACK | Fixed-width row storage |
| Hybrid sparse | Mixed sparse/dense layout |

For COO, three parallel arrays represent the tensor:

row_indices
column_indices
values

Example:

A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 2 \\ 3 & 0 & 4 \end{bmatrix}

COO storage:

rows   = [0,1,2,2]
cols   = [0,2,0,2]
values = [1,2,3,4]
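As a concrete check, here is the same matrix in SciPy's COO format (a minimal sketch assuming scipy is installed):

```python
import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 1, 2, 2])
cols = np.array([0, 2, 0, 2])
values = np.array([1.0, 2.0, 3.0, 4.0])

# The three parallel arrays fully determine the matrix.
A = coo_matrix((values, (rows, cols)), shape=(3, 3))
print(A.toarray())
# [[1. 0. 0.]
#  [0. 0. 2.]
#  [3. 0. 4.]]
```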

The derivative rules depend on whether sparsity structure is fixed or variable.

Sparse Structure vs Sparse Values

This distinction is critical.

A sparse tensor has:

  1. Structural sparsity pattern.
  2. Numerical values.

Example:

indices = fixed
values  = differentiable

Most AD systems differentiate the values but not the indices.

The indices are discrete objects. Ordinary differentiation does not apply to them.

Thus sparse tensor differentiation usually assumes:

\text{sparsity pattern is fixed locally}.

This is analogous to pivot choices in matrix factorizations.
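A minimal PyTorch sketch of this split, assuming a build with autograd support for sparse COO tensors (torch.sparse_coo_tensor and torch.sparse.mm): the indices remain fixed discrete data while the values receive gradients.

```python
import torch

# Fixed structural pattern: discrete data, never differentiated.
indices = torch.tensor([[0, 1, 2, 2],
                        [0, 2, 0, 2]])
# Differentiable numerical values living on that pattern.
values = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)

A = torch.sparse_coo_tensor(indices, values, (3, 3))
x = torch.ones(3, 1)

y = torch.sparse.mm(A, x)   # sparse matrix times dense vector
y.sum().backward()

print(values.grad)          # one gradient per stored entry; nothing else exists
```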

Sparse Matrix-Vector Product

Consider:

y = Ax,

where A is sparse.

The forward differential is:

dy = dA\,x + A\,dx.

The reverse rules remain:

\bar{A} \mathrel{+}= \bar{y} x^T, \qquad \bar{x} \mathrel{+}= A^T \bar{y}.

But sparse structure changes implementation significantly.

If A has a fixed sparsity pattern:

  • Only stored entries receive gradients.
  • Zero entries outside the pattern are not materialized.

Thus:

\bar{A}_{ij} = 0

for structurally absent entries.

The gradient inherits the sparsity mask.
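A hand-written NumPy sketch of these rules for the COO example above makes the masking explicit: the adjoint of the values is a gather over the pattern, so absent entries are never materialized.

```python
import numpy as np

rows = np.array([0, 1, 2, 2])          # fixed sparsity pattern
cols = np.array([0, 2, 0, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0])  # differentiable parameters

def spmv(vals, x):
    """y = A x with A stored as COO on the fixed (rows, cols) pattern."""
    y = np.zeros(3)
    np.add.at(y, rows, vals * x[cols])  # scatter-add over nonzeros
    return y

def spmv_vjp(ybar, vals, x):
    """Reverse rules restricted to the pattern."""
    vals_bar = ybar[rows] * x[cols]     # (ybar x^T) gathered at stored entries
    xbar = np.zeros_like(x)
    np.add.at(xbar, cols, vals * ybar[rows])  # A^T ybar
    return vals_bar, xbar

x = np.array([1.0, 2.0, 3.0])
ybar = np.ones(3)
print(spmv(vals, x))             # [ 1.  6. 15.]
print(spmv_vjp(ybar, vals, x))
```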

Sparse Gradient Projection

Suppose:

A_{ij} = 0

structurally.

Even if the dense gradient formula would produce a nonzero value there, the sparse parameterization prevents that entry from existing.

Thus sparse differentiation often applies a projection:

\bar{A} \leftarrow P_{\mathcal{S}}(\bar{A}),

where:

\mathcal{S} = \text{sparsity pattern}.

This means:

(P_{\mathcal{S}}(G))_{ij} = \begin{cases} G_{ij}, & (i,j) \in \mathcal{S}, \\ 0, & \text{otherwise}. \end{cases}

The parameter space itself is sparse.
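In COO storage the projection is simply a gather: evaluate the dense gradient formula only at stored coordinates. A minimal NumPy sketch:

```python
import numpy as np

def project_to_pattern(G, rows, cols):
    # (P_S(G))_ij kept only for (i, j) in S; as COO values this is a gather.
    return G[rows, cols]

G = np.arange(9.0).reshape(3, 3)   # a dense gradient candidate
rows = np.array([0, 1, 2, 2])
cols = np.array([0, 2, 0, 2])
print(project_to_pattern(G, rows, cols))   # [0. 5. 6. 8.]
```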

Sparse Jacobians

Many functions have sparse Jacobians.

Suppose:

f : \mathbb{R}^n \to \mathbb{R}^m.

The Jacobian:

J_{ij} = \frac{\partial f_i}{\partial x_j}

is sparse if most outputs depend only on nearby inputs.

Examples:

| System | Jacobian Structure |
| --- | --- |
| PDE stencil | Local neighborhood |
| ODE integrator | Banded structure |
| Graph computation | Edge-local dependence |
| Physical simulation | Sparse interactions |

Sparse Jacobians are central to scientific computing.

Jacobian Sparsity Propagation

Sparsity propagates structurally through composition.

Suppose:

f(x) = h(g(x)).

Then:

J_f = J_h J_g.

Even if both factors are sparse, the product may become denser.

This is called fill-in.

Fill-in is one of the main computational problems in sparse AD.

Example:

| Matrix Type | Multiplication Result |
| --- | --- |
| Diagonal × diagonal | Diagonal |
| Banded × banded | Wider band |
| Sparse graph adjacency powers | Increasing density |
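The band-widening case is easy to verify numerically; a small SciPy sketch (assuming scipy.sparse is available):

```python
import numpy as np
import scipy.sparse as sp

n = 100
ones = np.ones(n)
# Tridiagonal operator: bandwidth 1, nnz = 3n - 2.
B = sp.diags([ones[:-1], ones, ones[:-1]], offsets=[-1, 0, 1], format="csr")

print(B.nnz)             # 298
print((B @ B).nnz)       # 494: pentadiagonal, band widened on each side
print((B @ B @ B).nnz)   # 688: heptadiagonal; repeated products fill in further
```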

AD systems must balance:

  • sparsity preservation
  • reordering
  • factorization cost
  • memory growth

Forward Mode and Sparse Jacobians

Forward mode computes:

Jv.

If the Jacobian is sparse, one can compute multiple directional derivatives simultaneously using graph coloring.

Suppose columns of the Jacobian do not overlap structurally. Then they can share a perturbation seed.

This reduces the number of forward passes.

Example:

seed matrix S

produces:

JS.

Careful seed design reconstructs the sparse Jacobian from fewer evaluations.

This technique is widely used in scientific computing AD systems.

Graph Coloring

Graph coloring reduces directional derivative evaluations.

Define a column-interference graph:

  • Each Jacobian column is a node.
  • Two columns share an edge if they may contribute to the same output row.

Columns with different colors can be seeded together.

If the graph uses k colors, only k forward passes are needed instead of n.

Example:

| Problem | Variables | Colors Needed |
| --- | --- | --- |
| Dense Jacobian | n | n |
| Tridiagonal Jacobian | n | 3 |
| Grid stencil | Large n | Small constant |

Sparse coloring can dramatically reduce AD cost.
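A JAX sketch of coloring-based compression for an illustrative function with tridiagonal-with-wraparound dependence (the function f, the size n = 9, and the mod-3 coloring are all assumptions for the example):

```python
import jax
import jax.numpy as jnp

n = 9
x = jnp.arange(1.0, n + 1)

def f(x):
    # Each output touches only its circular neighbors, so the Jacobian is
    # tridiagonal with wraparound corners; 3 colors suffice since n % 3 == 0.
    return x**2 + 0.5 * jnp.roll(x, 1) + 0.25 * jnp.roll(x, -1)

# Color column j with j mod 3; one seed vector per color class.
seeds = [(jnp.arange(n) % 3 == c).astype(x.dtype) for c in range(3)]

# Compressed Jacobian J @ S: 3 forward passes instead of n = 9.
JS = jnp.stack([jax.jvp(f, (x,), (s,))[1] for s in seeds], axis=1)

# Recovery: J[i, j] sits in column j % 3 of JS, because no two columns
# of the same color touch the same output row.
J = jax.jacfwd(f)(x)
i, j = 4, 5
assert jnp.allclose(JS[i, j % 3], J[i, j])
```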

Reverse Mode and Sparse Accumulation

Reverse mode computes:

J^T u.

Sparse reverse mode faces different challenges:

| Problem | Cause |
| --- | --- |
| Sparse accumulation | Many paths contribute |
| Atomic updates | Parallel accumulation |
| Fill-in | Adjoint propagation |
| Memory locality | Irregular structure |

Sparse reverse kernels often require scatter-add operations:

for edge (i,j):
    grad_x[j] += local_grad * grad_y[i]

Parallel sparse accumulation is difficult because many threads may update the same gradient location.
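A vectorized NumPy version of this loop (with illustrative edge arrays) shows why the accumulation semantics matter:

```python
import numpy as np

src = np.array([0, 1, 2, 2])   # edge sources i
dst = np.array([1, 2, 0, 1])   # edge destinations j

grad_y = np.array([0.1, 0.2, 0.3])
local_grad = np.ones(4)

grad_x = np.zeros(3)
# np.add.at accumulates correctly under repeated indices; a fancy-indexed
# `grad_x[dst] += ...` would silently drop one of the updates to index 1.
np.add.at(grad_x, dst, local_grad * grad_y[src])
print(grad_x)   # [0.3 0.4 0.2]
```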

Sparse Hessians

Second derivatives are often sparse even when first derivatives are not.

Suppose:

f : \mathbb{R}^n \to \mathbb{R}.

The Hessian:

H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}

is sparse when variables interact locally.

Examples:

| System | Hessian Structure |
| --- | --- |
| Mechanical systems | Local coupling |
| PDE discretization | Sparse stencil |
| Graph optimization | Edge-local structure |

Sparse Hessians are important because dense Hessians scale as:

O(n^2).

Sparse methods can reduce complexity dramatically.

Hessian-Vector Products

Most systems avoid forming Hessians explicitly.

Instead they compute:

Hv.

This preserves sparsity implicitly.

Methods include:

| Method | Technique |
| --- | --- |
| Forward-over-reverse | JVP through reverse pass |
| Reverse-over-forward | VJP through forward pass |
| Pearlmutter trick | Efficient Hessian-vector product |

Sparse Hessian-vector products are fundamental in large-scale optimization.
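A JAX sketch of the forward-over-reverse composition for an illustrative objective (jax.grad builds the reverse pass; jax.jvp pushes the direction v through it):

```python
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative scalar objective with local coupling.
    return jnp.sum(jnp.sin(x) * jnp.roll(x, 1))

def hvp(f, x, v):
    # Forward-over-reverse: differentiate grad(f) along v without ever
    # materializing the n x n Hessian.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.arange(1.0, 6.0)
v = jnp.ones_like(x)
print(hvp(f, x, v))
```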

Sparse Solves in Reverse Mode

Sparse linear solves appear frequently in implicit differentiation.

Suppose:

Ax = b,

with sparse A.

The reverse rule requires:

A^T g = \bar{x}.

Thus the backward pass itself requires a sparse solve.

The cost of reverse mode may therefore approach the cost of another numerical solve.

Efficient sparse AD often depends on:

  • symbolic factorization reuse
  • fill-reducing reorderings
  • cached sparsity structure
  • iterative solves
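A sketch of factorization reuse with SciPy's SuperLU interface (splu); the example matrix is illustrative. The same reverse rule also yields \bar{b} = g and \bar{A} = -g x^T, restricted to A's pattern:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

A = sp.csc_matrix(np.array([[4.0, 1.0, 0.0],
                            [1.0, 3.0, 1.0],
                            [0.0, 1.0, 2.0]]))
b = np.array([1.0, 2.0, 3.0])

lu = splu(A)                    # symbolic + numeric factorization, done once
x = lu.solve(b)                 # forward pass: solve A x = b

xbar = np.ones(3)               # incoming adjoint of x
g = lu.solve(xbar, trans="T")   # backward pass reuses the factors: A^T g = xbar

bbar = g                        # adjoint of the right-hand side
Abar = -np.outer(g, x)          # dense form of -g x^T; a real sparse
                                # implementation projects this onto A's pattern
```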

Sparse Tensor Contraction

Sparse tensor contraction generalizes sparse matrix multiplication.

Suppose:

C_{ij} = \sum_k A_{ik} B_{kj}.

Only nonzero entries contribute.

Efficient sparse contraction requires:

| Concern | Importance |
| --- | --- |
| Index ordering | Memory locality |
| Duplicate accumulation | Correctness |
| Intermediate fill-in | Complexity |
| Parallel scheduling | Throughput |

The reverse rules remain algebraically identical to dense contraction. The challenge is structural efficiency.

Dynamic Sparsity

Some systems change sparsity patterns dynamically.

Examples:

| System | Dynamic Structure |
| --- | --- |
| Attention masks | Input-dependent edges |
| Pruned neural networks | Learned sparsity |
| Adaptive meshes | Refinement/coarsening |
| Sparse routing | Conditional execution |

Dynamic sparsity creates discontinuities.

Changing whether an entry exists is a discrete operation.

AD usually treats the current sparsity pattern as fixed during a backward pass.

Sparse Attention

Transformer sparsity patterns are increasingly important.

Examples:

| Pattern | Structure |
| --- | --- |
| Local attention | Banded |
| Block sparse attention | Block structure |
| Routing attention | Dynamic graph |
| Long-context attention | Hierarchical sparsity |

Sparse attention derivatives require efficient scatter/gather kernels and structured reductions.

Memory bandwidth often dominates arithmetic cost.

Sparse Graph Operations

Graph neural networks naturally produce sparse operators.

A graph adjacency matrix:

A_{ij} = 1

if an edge exists.

Message passing often has form:

H' = A H W.

Reverse mode propagates gradients through sparse neighborhood aggregation.
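A NumPy sketch of the forward aggregation in edge-list form (illustrative edge arrays and shapes); the backward pass runs the same gather/scatter machinery with the roles of source and destination swapped:

```python
import numpy as np

src = np.array([0, 1, 2, 2])     # edge sources j
dst = np.array([1, 2, 0, 1])     # edge destinations i

H = np.random.randn(3, 4)        # node features
W = np.random.randn(4, 4)        # weight matrix

msgs = H[src] @ W                # gather source features, then transform
H_out = np.zeros((3, 4))
np.add.at(H_out, dst, msgs)      # scatter-add: neighbor reduction per node
```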

Sparse graph AD must handle:

  • edge accumulation
  • neighbor reduction
  • scatter-add
  • batch graph packing
  • irregular memory access

Graph differentiation is fundamentally sparse differentiation.

Structured Sparsity

Not all sparsity is random.

Structured patterns include:

| Structure | Example |
| --- | --- |
| Diagonal | Scaling operators |
| Banded | PDE discretization |
| Block sparse | Attention |
| Toeplitz | Signal processing |
| Kronecker | Tensor decompositions |

Structured sparsity enables specialized derivative kernels.

Example:

A banded matrix with bandwidth k allows:

O(kn)

operations instead of:

O(n^2).

AD systems that preserve structure can inherit these savings.

Sparse Layout Metadata

A sparse tensor primitive must track:

  • indices
  • storage format
  • shape
  • batch axes
  • duplicate semantics
  • sortedness
  • coalescing state

Two sparse tensors may represent the same values but differ operationally because their storage layouts differ.

Gradient kernels must preserve consistency.

Sparse Accumulation Semantics

Sparse systems often permit duplicate coordinates:

(i,j,val1)
(i,j,val2)

Semantics may define:

A_{ij} = \text{val1} + \text{val2}.

Backward accumulation must then preserve this additive interpretation.

Sparse differentiation therefore depends on precise storage semantics.
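SciPy's COO format follows this additive convention, as a quick check shows; a correct backward pass must then route the adjoint of the coalesced entry back to both stored values:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Two stored entries at the same coordinate (0, 1).
rows = np.array([0, 0, 1])
cols = np.array([1, 1, 0])
vals = np.array([2.0, 3.0, 5.0])

A = coo_matrix((vals, (rows, cols)), shape=(2, 2))
A.sum_duplicates()        # coalesce: A[0, 1] == 2.0 + 3.0
print(A.toarray())
# [[0. 5.]
#  [5. 0.]]
```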

Implementation Challenges

Sparse AD introduces difficult systems problems:

| Challenge | Explanation |
| --- | --- |
| Irregular memory access | Poor cache locality |
| Atomic gradient accumulation | Parallel conflicts |
| Fill-in | Growing intermediate density |
| Layout conversion | COO ↔ CSR ↔ CSC |
| Dynamic sparsity | Discrete structure changes |
| GPU utilization | Irregular workloads |

Sparse kernels are often memory-bound rather than compute-bound.

Practical Guidance

Prefer:

| Good Practice | Avoid |
| --- | --- |
| Fixed sparsity structure | Constant layout changes |
| Structured sparsity | Arbitrary irregular sparsity |
| Hessian-vector products | Dense Hessian materialization |
| Sparse solves | Dense inversion |
| Graph coloring | Full Jacobian assembly |
| Projector gradients | Materializing dense masks |

For AD implementers:

| Focus Area | Importance |
| --- | --- |
| Accumulation correctness | Critical |
| Sparse metadata tracking | Critical |
| Efficient scatter-add | Essential |
| Fill-in control | Major performance issue |
| Layout-aware kernels | Essential |

Summary

Sparse tensor differentiation extends ordinary AD with structural constraints. The mathematical derivative rules remain mostly unchanged, but the implementation must preserve sparsity patterns, avoid unnecessary fill-in, and manage irregular memory behavior efficiently.

The key distinction is between sparse values and sparse structure. Values are differentiable. Structural sparsity patterns are usually treated as fixed local topology. Efficient sparse AD depends on preserving that topology throughout forward and reverse computation.