# Sparse Tensor Derivatives

Many large-scale computational problems are sparse. Large matrices and tensors often contain mostly zeros, structured blocks, or purely local interactions. Sparse representations reduce memory usage and computational cost by storing only the nonzero entries and their locations.

Automatic differentiation must preserve sparsity whenever possible. If differentiation destroys sparsity structure, the resulting gradients can become prohibitively expensive.

Sparse differentiation therefore has two goals:

1. Compute correct derivatives.
2. Preserve structural sparsity through the derivative pipeline.

## Sparsity

A tensor is sparse if most entries are zero.

For a matrix

$$
A \in \mathbb{R}^{m \times n},
$$

define:

$$
\operatorname{nnz}(A) =
\text{number of nonzero entries}.
$$

If

$$
\operatorname{nnz}(A)
\ll
mn,
$$

then sparse representations are beneficial.

Examples:

| Domain | Sparse Structure |
|---|---|
| Graphs | Sparse adjacency matrices |
| PDE discretization | Local stencil matrices |
| Recommender systems | Sparse user-item interactions |
| NLP | Sparse bag-of-words vectors |
| Scientific simulation | Sparse Jacobians |
| Optimization | Sparse constraints |

AD systems that ignore sparsity often become memory-bound long before arithmetic becomes expensive.

## Sparse Storage Formats

Sparse tensors are stored structurally.

Common formats:

| Format | Structure |
|---|---|
| COO | Coordinate list |
| CSR | Compressed sparse row |
| CSC | Compressed sparse column |
| BSR | Block sparse row |
| ELLPACK | Fixed-width row storage |
| Hybrid sparse | Mixed sparse/dense layout |

For COO, three parallel arrays represent the tensor:

```text
row_indices
column_indices
values
```

Example:

$$
A =
\begin{bmatrix}
1 & 0 & 0 \\
0 & 0 & 2 \\
3 & 0 & 4
\end{bmatrix}
$$

COO storage:

```text
rows   = [0,1,2,2]
cols   = [0,2,0,2]
values = [1,2,3,4]
```
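
As a concrete sketch (using JAX here purely for illustration; the chapter itself is library-agnostic), the dense matrix can be reconstructed from these three arrays with a scatter:

```python
import jax.numpy as jnp

# COO arrays for the 3x3 example above.
rows = jnp.array([0, 1, 2, 2])
cols = jnp.array([0, 2, 0, 2])
vals = jnp.array([1.0, 2.0, 3.0, 4.0])

# Scatter the stored values into a dense matrix; .add() sums duplicate
# coordinates, matching the additive semantics discussed later.
A = jnp.zeros((3, 3)).at[rows, cols].add(vals)
```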

The derivative rules depend on whether sparsity structure is fixed or variable.

## Sparse Structure vs Sparse Values

This distinction is critical.

A sparse tensor has:

1. Structural sparsity pattern.
2. Numerical values.

Example:

```text
indices = fixed
values  = differentiable
```

Most AD systems differentiate the values but not the indices.

The indices are discrete objects. Ordinary differentiation does not apply to them.

Thus sparse tensor differentiation usually assumes:

$$
\text{sparsity pattern is fixed locally}.
$$

This is analogous to pivot choices in matrix factorizations.

## Sparse Matrix-Vector Product

Consider:

$$
y = Ax,
$$

where $A$ is sparse.

The forward differential is:

$$
dy = dA\,x + A\,dx.
$$

The reverse rules remain:

$$
\bar{A} \mathrel{+}= \bar{y}x^T,
$$

$$
\bar{x} \mathrel{+}= A^T\bar{y}.
$$

But sparse structure changes implementation significantly.

If $A$ has fixed sparsity pattern:

- Only stored entries receive gradients.
- Zero entries outside the pattern are not materialized.

Thus:

$$
\bar{A}_{ij} =
0
$$

for structurally absent entries.

The gradient inherits the sparsity mask.
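
A minimal sketch of this fixed-pattern behavior, assuming JAX and a COO layout (the helper name `coo_matvec` is illustrative): the only differentiable parameters are the stored values, so the adjoint $\bar{A}$ is naturally represented as a vector `vals_bar` over the same indices.

```python
import jax
import jax.numpy as jnp

def coo_matvec(vals, x, rows, cols, m):
    # y_i = sum of A_ij * x_j over stored entries (i,j), via scatter-add.
    return jnp.zeros(m).at[rows].add(vals * x[cols])

rows = jnp.array([0, 1, 2, 2])
cols = jnp.array([0, 2, 0, 2])
vals = jnp.array([1.0, 2.0, 3.0, 4.0])
x = jnp.array([1.0, -1.0, 2.0])

y, vjp = jax.vjp(lambda v, x: coo_matvec(v, x, rows, cols, 3), vals, x)
vals_bar, x_bar = vjp(jnp.ones(3))
# vals_bar == y_bar[rows] * x[cols]: the gradient wears the sparsity mask.
# x_bar == A^T y_bar, computed without ever materializing a dense A.
```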

## Sparse Gradient Projection

Suppose:

$$
A_{ij} =
0
$$

structurally.

Even if the dense gradient formula would produce a nonzero value there, the sparse parameterization prevents that entry from existing.

Thus sparse differentiation often applies a projection:

$$
\bar{A}
\leftarrow
P_{\mathcal{S}}(\bar{A}),
$$

where:

$$
\mathcal{S} =
\text{sparsity pattern}.
$$

This means:

$$
(P_{\mathcal{S}}(G))_{ij} =
\begin{cases}
G_{ij}, & (i,j)\in\mathcal{S}, \\
0, & \text{otherwise}.
\end{cases}
$$

The parameter space itself is sparse.
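
In code, the projection is just a mask restricted to the stored coordinates. A minimal sketch (JAX, with `rows`/`cols` holding the pattern $\mathcal{S}$):

```python
import jax.numpy as jnp

def project(G, rows, cols):
    # P_S(G): keep entries of a dense gradient G that lie in the pattern,
    # zero everywhere else.
    return jnp.zeros_like(G).at[rows, cols].set(G[rows, cols])
```

In practice one stores only `G[rows, cols]` directly and never forms the dense projected matrix.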

## Sparse Jacobians

Many functions have sparse Jacobians.

Suppose:

$$
f : \mathbb{R}^n \to \mathbb{R}^m.
$$

The Jacobian:

$$
J_{ij} =
\frac{\partial f_i}{\partial x_j}
$$

is sparse if each output $f_i$ depends on only a few inputs $x_j$.

Examples:

| System | Jacobian Structure |
|---|---|
| PDE stencil | Local neighborhood |
| ODE integrator | Banded structure |
| Graph computation | Edge-local dependence |
| Physical simulation | Sparse interactions |

Sparse Jacobians are central to scientific computing.

## Jacobian Sparsity Propagation

Sparsity propagates structurally through composition.

Suppose:

$$
f(x) = h(g(x)).
$$

Then:

$$
J_f = J_h J_g.
$$

Even if both factors are sparse, the product may become denser.

This is called fill-in.

Fill-in is one of the main computational problems in sparse AD.

Example:

| Matrix Type | Multiplication Result |
|---|---|
| Diagonal × diagonal | Diagonal |
| Banded × banded | Wider band |
| Sparse graph adjacency powers | Increasing density |
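
A quick numerical illustration of band widening (JAX; the size is chosen arbitrarily):

```python
import jax.numpy as jnp

# A tridiagonal matrix times itself is pentadiagonal: fill-in.
T = jnp.eye(5) + jnp.eye(5, k=1) + jnp.eye(5, k=-1)
P = T @ T
print(jnp.count_nonzero(T))  # 13 stored nonzeros
print(jnp.count_nonzero(P))  # 19: the band has widened
```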

AD systems must balance:

```text
sparsity preservation
reordering
factorization cost
memory growth
```

## Forward Mode and Sparse Jacobians

Forward mode computes:

$$
Jv.
$$

If the Jacobian is sparse, one can compute multiple directional derivatives simultaneously using graph coloring.

If a group of Jacobian columns has no structural nonzeros in any common row, those columns can share a single perturbation seed.

This reduces the number of forward passes.

Example:

```text
seed matrix S
```

produces:

$$
JS.
$$

Careful seed design reconstructs the sparse Jacobian from fewer evaluations.

This technique is widely used in scientific computing AD systems.
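
A sketch of the seed mechanism (JAX; `jac_times_seeds` is an illustrative helper, not a library function): each seed column costs one forward-mode pass.

```python
import jax
import jax.numpy as jnp

def jac_times_seeds(f, x, S):
    # Column c of the result is J @ S[:, c], one jax.jvp call per seed.
    cols = [jax.jvp(f, (x,), (S[:, c],))[1] for c in range(S.shape[1])]
    return jnp.stack(cols, axis=1)
```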

## Graph Coloring

Graph coloring reduces directional derivative evaluations.

Define a column-interference graph:

- Each Jacobian column is a node.
- Two columns are joined by an edge if they both have a structural nonzero in some common row.

Columns with different colors can be seeded together.

If the graph uses $k$ colors, only $k$ forward passes are needed instead of $n$.

Example:

| Problem | Variables | Colors Needed |
|---|---:|---:|
| Dense Jacobian | $n$ | $n$ |
| Tridiagonal Jacobian | $n$ | 3 |
| Grid stencil | Large $n$ | Small constant |

Sparse coloring can dramatically reduce AD cost.
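
A worked tridiagonal example (JAX; the function `f` and the modulo-3 coloring are illustrative assumptions): three forward passes recover all $3n - 2$ structural nonzeros.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Tridiagonal dependence: f_i = x_i^2 + x_{i-1} * x_{i+1} (zeros at ends).
    left = jnp.pad(x[:-1], (1, 0))    # x_{i-1}
    right = jnp.pad(x[1:], (0, 1))    # x_{i+1}
    return x**2 + left * right

n = 9
x = jnp.arange(1.0, n + 1.0)

# Columns j, j' of a tridiagonal Jacobian interfere only when |j - j'| <= 2,
# so coloring column j with color j % 3 separates all interfering pairs.
colors = jnp.arange(n) % 3
S = jnp.stack([(colors == c).astype(x.dtype) for c in range(3)], axis=1)

# Three forward passes instead of n.
compressed = jnp.stack(
    [jax.jvp(f, (x,), (S[:, c],))[1] for c in range(3)], axis=1)

# A structural nonzero J[i, j] sits uncompressed at compressed[i, colors[j]].
```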

## Reverse Mode and Sparse Accumulation

Reverse mode computes:

$$
J^T u.
$$

Sparse reverse mode faces different challenges:

| Problem | Cause |
|---|---|
| Sparse accumulation | Many paths contribute |
| Atomic updates | Parallel accumulation |
| Fill-in | Adjoint propagation |
| Memory locality | Irregular structure |

Sparse reverse kernels often require scatter-add operations:

```text
for edge (i,j):
    grad_x[j] += local_grad * grad_y[i]
```

Parallel sparse accumulation is difficult because many threads may update the same gradient location.
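
One way around the conflict is a bulk scatter-add primitive with deterministic duplicate handling. A vectorized sketch of the edge loop above (JAX; array names follow the pseudocode):

```python
import jax.numpy as jnp

# Edge list (i, j) split into two index arrays, as in the loop above.
i = jnp.array([0, 1, 1, 2])
j = jnp.array([1, 0, 2, 2])
local_grad = jnp.array([0.5, 1.0, 2.0, 3.0])
grad_y = jnp.array([1.0, -1.0, 2.0])

# .at[...].add() accumulates all duplicate targets deterministically,
# replacing per-edge atomic updates.
grad_x = jnp.zeros(3).at[j].add(local_grad * grad_y[i])
```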

## Sparse Hessians

Second derivatives are often sparse even when first derivatives are not.

Suppose:

$$
f : \mathbb{R}^n \to \mathbb{R}.
$$

The Hessian:

$$
H_{ij} =
\frac{\partial^2 f}
{\partial x_i \partial x_j}
$$

is sparse when variables interact locally.

Examples:

| System | Hessian Structure |
|---|---|
| Mechanical systems | Local coupling |
| PDE discretization | Sparse stencil |
| Graph optimization | Edge-local structure |

Sparse Hessians are important because dense Hessians scale as:

$$
O(n^2).
$$

Sparse methods can reduce complexity dramatically.

## Hessian-Vector Products

Most systems avoid forming Hessians explicitly.

Instead they compute:

$$
Hv.
$$

This preserves sparsity implicitly.

Methods include:

| Method | Technique |
|---|---|
| Forward-over-reverse | JVP through reverse pass |
| Reverse-over-forward | VJP through forward pass |
| Pearlmutter trick | Efficient Hessian-vector product |

Sparse Hessian-vector products are fundamental in large-scale optimization.
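
A standard sketch of forward-over-reverse (JAX; the function `f` is an arbitrary example with nearest-neighbor coupling, so its Hessian is tridiagonal but never materialized):

```python
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    # Forward-over-reverse: push direction v through the gradient of f.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

def f(x):
    # Neighboring variables couple, so H is tridiagonal.
    return jnp.sum(x**2) + jnp.sum(x[:-1] * x[1:])

x = jnp.ones(5)
v = jnp.arange(5.0)
print(hvp(f, x, v))   # H @ v without forming the n x n Hessian
```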

## Sparse Solves in Reverse Mode

Sparse linear solves appear frequently in implicit differentiation.

Suppose:

$$
Ax = b,
$$

with sparse $A$.

The reverse rule requires solving

$$
A^T g = \bar{x}
$$

for $g$, which then yields $\bar{b} = g$ and $\bar{A} = -g\,x^T$, projected onto the sparsity pattern of $A$.

Thus the backward pass itself requires a sparse solve, and the cost of reverse mode may approach the cost of another numerical solve.

Efficient sparse AD often depends on:

```text
symbolic factorization reuse
fill-reducing reorderings
cached sparsity structure
iterative solves
```
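
A sketch of this rule as a custom reverse pass (JAX; `jnp.linalg.solve` stands in for a real sparse solver, and a production version would reuse the factorization and project $\bar{A}$ onto the pattern):

```python
import jax
import jax.numpy as jnp

@jax.custom_vjp
def solve(A, b):
    return jnp.linalg.solve(A, b)

def solve_fwd(A, b):
    x = jnp.linalg.solve(A, b)
    return x, (A, x)

def solve_bwd(res, x_bar):
    A, x = res
    g = jnp.linalg.solve(A.T, x_bar)   # the backward pass is itself a solve
    return -jnp.outer(g, x), g         # (A_bar, b_bar)

solve.defvjp(solve_fwd, solve_bwd)
```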

## Sparse Tensor Contraction

Sparse tensor contraction generalizes sparse matrix multiplication.

Suppose:

$$
C_{ij} =
\sum_k A_{ik}B_{kj}.
$$

Only nonzero entries contribute.

Efficient sparse contraction requires:

| Concern | Importance |
|---|---|
| Index ordering | Memory locality |
| Duplicate accumulation | Correctness |
| Intermediate fill-in | Complexity |
| Parallel scheduling | Throughput |

The reverse rules remain algebraically identical to dense contraction. The challenge is structural efficiency.

## Dynamic Sparsity

Some systems change sparsity patterns dynamically.

Examples:

| System | Dynamic Structure |
|---|---|
| Attention masks | Input-dependent edges |
| Pruned neural networks | Learned sparsity |
| Adaptive meshes | Refinement/coarsening |
| Sparse routing | Conditional execution |

Dynamic sparsity creates discontinuities.

Changing whether an entry exists is a discrete operation.

AD usually treats the current sparsity pattern as fixed during a backward pass.

## Sparse Attention

Transformer sparsity patterns are increasingly important.

Examples:

| Pattern | Structure |
|---|---|
| Local attention | Banded |
| Block sparse attention | Block structure |
| Routing attention | Dynamic graph |
| Long-context attention | Hierarchical sparsity |

Sparse attention derivatives require efficient scatter/gather kernels and structured reductions.

Memory bandwidth often dominates arithmetic cost.
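
A banded-pattern sketch (JAX; all names are illustrative, and the dense mask here is exactly what production sparse-attention kernels avoid materializing, but it shows the structure):

```python
import jax
import jax.numpy as jnp

def local_attention(q, k, v, window):
    # Token i attends only to tokens j with |i - j| <= window: a banded pattern.
    n, d = q.shape
    scores = q @ k.T / jnp.sqrt(d)
    idx = jnp.arange(n)
    band = jnp.abs(idx[:, None] - idx[None, :]) <= window
    scores = jnp.where(band, scores, -jnp.inf)   # mask outside the band
    return jax.nn.softmax(scores, axis=-1) @ v
```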

## Sparse Graph Operations

Graph neural networks naturally produce sparse operators.

A graph adjacency matrix has

$$
A_{ij} = 1
$$

if edge $(i,j)$ exists and $A_{ij} = 0$ otherwise.

Message passing often has form:

$$
H' = AH W.
$$

Reverse mode propagates gradients through sparse neighborhood aggregation.
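
A gather/scatter sketch of this aggregation (JAX; `src`/`dst` hold the edge list, and all names are illustrative):

```python
import jax.numpy as jnp

def message_pass(H, W, src, dst, n):
    # H' = A H W for an unweighted edge list: edge (dst, src) means
    # node `dst` aggregates the features of node `src`.
    msgs = H[src]                                        # gather
    agg = jnp.zeros((n, H.shape[1])).at[dst].add(msgs)   # scatter-add
    return agg @ W
```

Reverse mode turns the gather into a scatter-add and the scatter-add into a gather, which is why scatter-add appears on both sides of graph AD.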

Sparse graph AD must handle:

```text
edge accumulation
neighbor reduction
scatter-add
batch graph packing
irregular memory access
```

Graph differentiation is fundamentally sparse differentiation.

## Structured Sparsity

Not all sparsity is random.

Structured patterns include:

| Structure | Example |
|---|---|
| Diagonal | Scaling operators |
| Banded | PDE discretization |
| Block sparse | Attention |
| Toeplitz | Signal processing |
| Kronecker | Tensor decompositions |

Structured sparsity enables specialized derivative kernels.

Example:

A banded matrix with bandwidth $k$ allows:

$$
O(kn)
$$

operations instead of:

$$
O(n^2).
$$
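
For instance, a tridiagonal matvec touches only $3n - 2$ entries. A sketch (JAX; the diagonal storage layout is an illustrative convention):

```python
import jax.numpy as jnp

def tridiag_matvec(lower, diag, upper, x):
    # lower[k] = A[k+1, k], diag[k] = A[k, k], upper[k] = A[k, k+1].
    # O(n) work and storage instead of O(n^2).
    y = diag * x
    y = y.at[1:].add(lower * x[:-1])
    y = y.at[:-1].add(upper * x[1:])
    return y
```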

AD systems that preserve structure can inherit these savings.

## Sparse Layout Metadata

A sparse tensor primitive must track:

```text
indices
storage format
shape
batch axes
duplicate semantics
sortedness
coalescing state
```

Two sparse tensors may represent the same values but differ operationally because their storage layouts differ.

Gradient kernels must preserve consistency.

## Sparse Accumulation Semantics

Sparse systems often permit duplicate coordinates:

```text
(i,j,val1)
(i,j,val2)
```

Semantics may define:

$$
A_{ij} = \text{val}_1 + \text{val}_2.
$$

Backward accumulation must then preserve this additive interpretation.
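
A two-line check of the additive convention (JAX; `.at[...].add()` implements exactly this semantics):

```python
import jax.numpy as jnp

# Two stored entries at the same coordinate (0, 1).
rows = jnp.array([0, 0])
cols = jnp.array([1, 1])
vals = jnp.array([2.0, 3.0])

A = jnp.zeros((2, 2)).at[rows, cols].add(vals)   # A[0, 1] == 5.0
# In reverse mode both duplicates receive the same adjoint A_bar[0, 1],
# preserving the additive interpretation.
```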

Sparse differentiation therefore depends on precise storage semantics.

## Implementation Challenges

Sparse AD introduces difficult systems problems:

| Challenge | Explanation |
|---|---|
| Irregular memory access | Poor cache locality |
| Atomic gradient accumulation | Parallel conflicts |
| Fill-in | Growing intermediate density |
| Layout conversion | COO ↔ CSR ↔ CSC |
| Dynamic sparsity | Discrete structure changes |
| GPU utilization | Irregular workloads |

Sparse kernels are often memory-bound rather than compute-bound.

## Practical Guidance

Prefer:

| Good Practice | Avoid |
|---|---|
| Fixed sparsity structure | Constant layout changes |
| Structured sparsity | Arbitrary irregular sparsity |
| Hessian-vector products | Dense Hessian materialization |
| Sparse solves | Dense inversion |
| Graph coloring | Full Jacobian assembly |
| Projector gradients | Materializing dense masks |

For AD implementers:

| Focus Area | Importance |
|---|---|
| Accumulation correctness | Critical |
| Sparse metadata tracking | Critical |
| Efficient scatter-add | Essential |
| Fill-in control | Major performance issue |
| Layout-aware kernels | Essential |

## Summary

Sparse tensor differentiation extends ordinary AD with structural constraints. The mathematical derivative rules remain mostly unchanged, but the implementation must preserve sparsity patterns, avoid unnecessary fill-in, and manage irregular memory behavior efficiently.

The key distinction is between sparse values and sparse structure. Values are differentiable. Structural sparsity patterns are usually treated as fixed local topology. Efficient sparse AD depends on preserving that topology throughout forward and reverse computation.

