Linear algebra primitives are tensor operations with algebraic structure: matrix multiplication, triangular solves, factorizations, inverses, determinants, norms, and spectral operations. They are central to automatic differentiation because many numerical programs are built from these operations rather than from scalar arithmetic alone.
An AD system can differentiate these primitives in two ways. It can decompose them into smaller operations and apply generic AD, or it can attach a custom derivative rule to the primitive. Production systems usually prefer custom rules. They are faster, more stable, and avoid materializing intermediate Jacobians.
Why Linear Algebra Needs Primitive Rules
Consider the matrix inverse:

$$
B = A^{-1}.
$$
A naive implementation might compute the inverse through Gaussian elimination and differentiate every scalar operation in the elimination algorithm. That approach is correct in principle, but it exposes implementation details that the user did not intend to differentiate.
A better rule comes from the identity

$$
A A^{-1} = I.
$$

Differentiate both sides:

$$
dA \, A^{-1} + A \, d(A^{-1}) = 0.
$$

Solve for $d(A^{-1})$:

$$
d(A^{-1}) = -A^{-1} \, dA \, A^{-1}.
$$
This rule is compact and independent of the exact inversion algorithm.
The same principle applies broadly: differentiate the mathematical relation satisfied by the output, not necessarily the program used to compute it.
Matrix Multiplication
Matrix multiplication is the basic dense linear algebra primitive.
Let

$$
C = A B,
$$

with

$$
A \in \mathbb{R}^{m \times k}, \quad B \in \mathbb{R}^{k \times n}, \quad C \in \mathbb{R}^{m \times n}.
$$

The differential is

$$
dC = dA \, B + A \, dB.
$$

Given output adjoint $\bar{C}$, reverse mode gives:

$$
\bar{A} = \bar{C} B^{\top}, \qquad \bar{B} = A^{\top} \bar{C}.
$$

These rules preserve shape: $\bar{A}$ has the shape of $A$ and $\bar{B}$ has the shape of $B$.
Matrix multiplication is also the primitive behind dense neural network layers, attention score computation, least-squares solvers, and many tensor contractions.
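As a concrete illustration, here is a minimal NumPy sketch of this reverse rule; the helper name `matmul_vjp` and the finite-difference check are illustrative, not taken from any particular library:

```python
import numpy as np

def matmul_vjp(A, B, C_bar):
    """Reverse-mode rule for C = A @ B given the output adjoint C_bar."""
    A_bar = C_bar @ B.T   # shape (m, k), same as A
    B_bar = A.T @ C_bar   # shape (k, n), same as B
    return A_bar, B_bar

# Quick finite-difference check on the scalar loss L = sum(C * C_bar).
rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
C_bar = rng.normal(size=(3, 2))
A_bar, _ = matmul_vjp(A, B, C_bar)

eps, E = 1e-6, np.zeros_like(A)
E[1, 2] = 1.0
fd = (np.sum((A + eps * E) @ B * C_bar) - np.sum((A - eps * E) @ B * C_bar)) / (2 * eps)
assert np.isclose(fd, A_bar[1, 2], atol=1e-5)
```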
Matrix-Vector Product
Let

$$
y = A x,
$$

where

$$
A \in \mathbb{R}^{m \times n}, \quad x \in \mathbb{R}^{n}, \quad y \in \mathbb{R}^{m}.
$$

The differential is

$$
dy = dA \, x + A \, dx.
$$

Given $\bar{y}$, the reverse rules are:

$$
\bar{A} = \bar{y} x^{\top}, \qquad \bar{x} = A^{\top} \bar{y}.
$$
This is a special case of matrix multiplication. Many systems still implement it separately because matrix-vector products have different performance characteristics from matrix-matrix products.
Dot Product
Let

$$
s = x^{\top} y = \sum_i x_i y_i.
$$

Then

$$
ds = y^{\top} dx + x^{\top} dy.
$$

Reverse rules:

$$
\bar{x} = \bar{s} \, y, \qquad \bar{y} = \bar{s} \, x.
$$
The dot product reduces two vectors to a scalar. Its reverse rule broadcasts the scalar adjoint back into both inputs.
Outer Product
Let

$$
C = x y^{\top},
$$

where

$$
x \in \mathbb{R}^{m}, \quad y \in \mathbb{R}^{n}, \quad C \in \mathbb{R}^{m \times n}.
$$

Then

$$
C_{ij} = x_i y_j.
$$

The differential is

$$
dC = dx \, y^{\top} + x \, dy^{\top}.
$$

Given $\bar{C}$, the reverse rules are:

$$
\bar{x} = \bar{C} y, \qquad \bar{y} = \bar{C}^{\top} x.
$$
Outer products appear in rank-one updates, covariance estimates, attention mechanisms, and low-rank models.
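A compact NumPy sketch covering both the dot-product and outer-product rules above (helper names are illustrative):

```python
import numpy as np

def dot_vjp(x, y, s_bar):
    """Reverse rule for s = x @ y: broadcast the scalar adjoint into both inputs."""
    return s_bar * y, s_bar * x

def outer_vjp(x, y, C_bar):
    """Reverse rule for C = np.outer(x, y)."""
    x_bar = C_bar @ y      # shape of x
    y_bar = C_bar.T @ x    # shape of y
    return x_bar, y_bar
```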
Transpose and Permutation
Transpose is a linear operation:

$$
B = A^{\top}.
$$

The differential is

$$
dB = dA^{\top}.
$$

The reverse rule is:

$$
\bar{A} = \bar{B}^{\top}.
$$

For a general axis permutation,

$$
B = \operatorname{permute}(A, \pi),
$$

the reverse rule applies the inverse permutation:

$$
\bar{A} = \operatorname{permute}(\bar{B}, \pi^{-1}).
$$
These rules are simple, but they matter for layout. A transpose may be a view with changed strides rather than a materialized copy. The reverse pass must respect the same indexing semantics.
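A minimal NumPy sketch of the permutation rule (the name `transpose_vjp` is illustrative); `np.argsort` recovers the inverse permutation:

```python
import numpy as np

def transpose_vjp(perm, B_bar):
    """Reverse rule for B = np.transpose(A, perm): permute the adjoint back."""
    inv_perm = np.argsort(perm)          # inverse of the axis permutation
    return np.transpose(B_bar, inv_perm)

A = np.arange(24.0).reshape(2, 3, 4)
perm = (2, 0, 1)
B_bar = np.ones_like(np.transpose(A, perm))
assert transpose_vjp(perm, B_bar).shape == A.shape   # (2, 3, 4)
```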
Matrix Inverse
Let

$$
B = A^{-1}.
$$

The differential is

$$
dB = -A^{-1} \, dA \, A^{-1} = -B \, dA \, B.
$$

Given output adjoint $\bar{B}$, derive the reverse rule from

$$
\langle \bar{B}, dB \rangle = \operatorname{tr}(\bar{B}^{\top} dB) = -\operatorname{tr}(\bar{B}^{\top} A^{-1} \, dA \, A^{-1}).
$$

Using cyclic trace:

$$
-\operatorname{tr}(\bar{B}^{\top} A^{-1} \, dA \, A^{-1}) = -\operatorname{tr}(A^{-1} \bar{B}^{\top} A^{-1} \, dA).
$$

Therefore

$$
\bar{A} = -A^{-\top} \bar{B} A^{-\top} = -B^{\top} \bar{B} B^{\top}.
$$

In practice, AD systems should avoid explicitly forming $A^{-1}$ when possible. Solves are usually more stable than inverses.
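A minimal NumPy sketch of this rule, reusing the output $B = A^{-1}$ saved from the forward pass (the name `inv_vjp` is illustrative):

```python
import numpy as np

def inv_vjp(B, B_bar):
    """Reverse rule for B = inv(A), written in terms of the saved output B."""
    return -B.T @ B_bar @ B.T   # equals -A^{-T} @ B_bar @ A^{-T}

A = np.array([[4.0, 1.0], [2.0, 3.0]])
B = np.linalg.inv(A)
A_bar = inv_vjp(B, np.ones_like(B))
```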
Linear Solve
A linear solve computes

$$
x = A^{-1} b
$$

without explicitly forming $A^{-1}$. Equivalently:

$$
A x = b.
$$

Differentiate:

$$
dA \, x + A \, dx = db.
$$

Solve for $dx$:

$$
dx = A^{-1}(db - dA \, x).
$$

Let $\bar{x}$ be the output adjoint. Define

$$
w = A^{-\top} \bar{x}.
$$

Then the reverse rules are:

$$
\bar{b} = w, \qquad \bar{A} = -w x^{\top}.
$$
This rule is central to differentiating least-squares methods, implicit layers, ODE solvers, Gaussian processes, and constrained optimization routines.
It is also a good example of an AD rule that uses another linear solve during the backward pass.
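A minimal NumPy sketch (the helper name `solve_vjp` is illustrative); note that the backward pass itself calls `np.linalg.solve` with $A^{\top}$:

```python
import numpy as np

def solve_vjp(A, x, x_bar):
    """Reverse rule for x = solve(A, b); uses another solve, against A^T."""
    w = np.linalg.solve(A.T, x_bar)   # w = A^{-T} x_bar
    b_bar = w
    A_bar = -np.outer(w, x)
    return A_bar, b_bar

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 4.0])
x = np.linalg.solve(A, b)
A_bar, b_bar = solve_vjp(A, x, np.ones_like(x))
```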
Triangular Solve
Triangular solve is a structured version of linear solve:

$$
L x = b,
$$

where $L$ is lower triangular.

The same differential relation holds:

$$
dL \, x + L \, dx = db.
$$

So

$$
dx = L^{-1}(db - dL \, x).
$$

Given $\bar{x}$, define

$$
w = L^{-\top} \bar{x}.
$$

Then

$$
\bar{b} = w, \qquad \bar{L} = -w x^{\top}.
$$

Because $L$ is triangular, only the lower-triangular part of $\bar{L}$ is valid:

$$
\bar{L} = \operatorname{tril}(-w x^{\top}).
$$
For upper triangular solves, use the corresponding upper-triangular projection.
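A sketch assuming SciPy's `solve_triangular` is available (helper name illustrative); the backward pass is a transposed triangular solve followed by a lower-triangular projection:

```python
import numpy as np
from scipy.linalg import solve_triangular

def triangular_solve_vjp(L, x, x_bar):
    """Reverse rule for x = solve_triangular(L, b, lower=True)."""
    w = solve_triangular(L, x_bar, lower=True, trans='T')  # w = L^{-T} x_bar
    b_bar = w
    L_bar = np.tril(-np.outer(w, x))   # only the lower triangle carries gradient
    return L_bar, b_bar
```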
Determinant
Let

$$
s = \det(A).
$$

For nonsingular $A$,

$$
d\det(A) = \det(A) \operatorname{tr}(A^{-1} dA).
$$

Therefore,

$$
\frac{\partial \det(A)}{\partial A} = \det(A) \, A^{-\top}.
$$

Given scalar adjoint $\bar{s}$:

$$
\bar{A} = \bar{s} \, \det(A) \, A^{-\top}.
$$
The determinant can overflow or underflow easily. Numerical systems often prefer the log-determinant.
Log-Determinant
Let

$$
\ell = \log \det(A)
$$

for nonsingular $A$ with positive determinant, or more commonly for symmetric positive definite $A$.

The differential is

$$
d\ell = \operatorname{tr}(A^{-1} dA).
$$

So

$$
\frac{\partial \ell}{\partial A} = A^{-\top}.
$$

Reverse rule:

$$
\bar{A} = \bar{\ell} \, A^{-\top}.
$$
For symmetric positive definite matrices, implementations usually compute log-determinants through Cholesky factorization. This is more stable than forming the determinant directly.
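A sketch of the SPD path using SciPy's `cho_factor`/`cho_solve` (helper names illustrative); the factorization saved in the forward pass is reused to form $\bar{A} = \bar{\ell}\, A^{-1}$, since $A^{-\top} = A^{-1}$ for symmetric $A$:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def logdet_spd(A):
    """Forward pass: log det of an SPD matrix via Cholesky, 2 * sum(log(diag(L)))."""
    factor = cho_factor(A, lower=True)
    logdet = 2.0 * np.sum(np.log(np.diag(factor[0])))
    return logdet, factor            # keep the factorization for the backward pass

def logdet_spd_vjp(factor, n, l_bar):
    """Backward pass: A_bar = l_bar * A^{-1}, computed by solving against the identity."""
    return l_bar * cho_solve(factor, np.eye(n))
```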
Trace
Trace is the sum of diagonal elements:

$$
\operatorname{tr}(A) = \sum_i A_{ii}.
$$

The differential is

$$
d\operatorname{tr}(A) = \operatorname{tr}(dA).
$$

The gradient is the identity matrix:

$$
\frac{\partial \operatorname{tr}(A)}{\partial A} = I.
$$

Reverse rule:

$$
\bar{A} = \bar{s} \, I.
$$

For

$$
s = \operatorname{tr}(A B),
$$

the differential is

$$
ds = \operatorname{tr}(dA \, B) + \operatorname{tr}(A \, dB).
$$

The gradients are:

$$
\bar{A} = \bar{s} \, B^{\top}, \qquad \bar{B} = \bar{s} \, A^{\top}.
$$
Matrix Norms
The Frobenius norm is

$$
\lVert A \rVert_F = \sqrt{\sum_{ij} A_{ij}^2}.
$$

For $A \neq 0$,

$$
d\lVert A \rVert_F = \frac{\operatorname{tr}(A^{\top} dA)}{\lVert A \rVert_F}.
$$

Reverse rule:

$$
\bar{A} = \bar{s} \, \frac{A}{\lVert A \rVert_F}.
$$

For the squared Frobenius norm,

$$
s = \lVert A \rVert_F^2 = \sum_{ij} A_{ij}^2,
$$

the reverse rule is:

$$
\bar{A} = 2 \bar{s} \, A.
$$

The squared norm is smoother and cheaper. It avoids the division by $\lVert A \rVert_F$, which is undefined at $A = 0$.
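A short NumPy sketch of both rules (illustrative names):

```python
import numpy as np

def frobenius_norm_vjp(A, s, s_bar):
    """Reverse rule for s = np.linalg.norm(A, 'fro'); undefined at A = 0."""
    return s_bar * A / s

def frobenius_norm_sq_vjp(A, s_bar):
    """Reverse rule for s = np.sum(A**2); defined everywhere."""
    return 2.0 * s_bar * A
```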
Cholesky Factorization
For a symmetric positive definite matrix $A$, Cholesky factorization computes

$$
A = L L^{\top},
$$

where $L$ is lower triangular with positive diagonal.

Differentiating this factorization is more involved because $L$ is constrained to be triangular. The differential satisfies

$$
dA = dL \, L^{\top} + L \, dL^{\top}.
$$

A custom AD rule must solve this relation for the lower-triangular $dL$, or in reverse mode, map $\bar{L}$ back to a symmetric $\bar{A}$.
Cholesky is important because it appears in:
| Use Case | Role |
|---|---|
| Gaussian processes | Covariance factorization |
| Probabilistic models | Multivariate normal densities |
| Optimization | Newton and quasi-Newton methods |
| Least squares | Stable normal-equation variants |
| Sampling | Reparameterized Gaussian samples |
Most production AD libraries implement a specialized Cholesky backward rule.
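A hedged NumPy sketch of one widely used reverse rule, $\bar{A} = \tfrac{1}{2}(B + B^{\top})$ with $B = L^{-\top}\,\Phi(L^{\top}\bar{L})\,L^{-1}$, where $\Phi$ keeps the lower triangle and halves the diagonal. Production implementations replace the explicit inverse with triangular solves and handle batching:

```python
import numpy as np

def phi(X):
    """Lower-triangular projection with the diagonal halved."""
    P = np.tril(X)
    P[np.diag_indices_from(P)] *= 0.5
    return P

def cholesky_vjp(L, L_bar):
    """Reverse rule for L = cholesky(A): map L_bar to a symmetric A_bar."""
    P = phi(L.T @ L_bar)
    L_inv = np.linalg.inv(L)          # sketch only; prefer triangular solves
    B = L_inv.T @ P @ L_inv           # B = L^{-T} P L^{-1}
    return 0.5 * (B + B.T)            # symmetrize, since A is symmetric
```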
QR Factorization
QR factorization writes

$$
A = Q R,
$$

where $Q$ has orthonormal columns and $R$ is upper triangular.
QR is used for least squares, orthogonalization, and numerically stable basis construction.
Differentiating QR requires respecting the constraints: $Q^{\top} Q = I$, which makes $Q^{\top} dQ$ skew-symmetric, and the requirement that $dR$ remain upper triangular.
The derivative has singularities when $A$ loses rank or when diagonal conventions for $R$ change. AD systems should document the assumptions under which QR gradients are valid.
Singular Value Decomposition
The singular value decomposition writes

$$
A = U \Sigma V^{\top}.
$$
SVD is powerful but delicate for AD.
Gradients through SVD may be undefined or unstable when singular values are repeated or nearly equal. This happens because singular vectors are not uniquely determined in degenerate subspaces.
For many applications, differentiating singular values alone is more stable than differentiating singular vectors.
AD systems should treat SVD as a high-risk primitive. It requires clear domain assumptions and careful numerical handling.
Eigenvalue Decomposition
For a symmetric matrix,

$$
A = Q \Lambda Q^{\top},
$$

with $Q$ orthogonal and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$.
Gradients through eigenvectors are unstable when eigenvalues are close. Terms of the form

$$
\frac{1}{\lambda_i - \lambda_j}
$$
appear in derivative formulas. When eigenvalues collide, the derivative of individual eigenvectors becomes undefined.
For this reason, differentiating spectral decompositions requires attention to degeneracy, ordering, and sign conventions.
Prefer Solves Over Inverses
A common numerical rule is: the expression

$$
x = A^{-1} b
$$

should usually be computed as a solve:

$$
x = \operatorname{solve}(A, b).
$$
The same applies in AD. The backward pass for solve uses triangular or linear solves. This preserves numerical structure and reduces error.
Code that explicitly forms inverses often creates:
| Problem | Cause |
|---|---|
| More floating-point error | Inverse materialization |
| More memory traffic | Dense intermediate matrix |
| Poor conditioning | Explicit inverse amplifies error |
| Slower backward pass | Extra matrix multiplications |
An AD-aware linear algebra API should expose solves as first-class primitives.
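A small illustration of the two code paths in NumPy; the example matrix is arbitrary, and the point is the structure of the computation rather than the exact error numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
A = Q @ np.diag(np.logspace(0, 8, 50)) @ Q.T   # SPD matrix with large condition number
x_true = np.ones(50)
b = A @ x_true

x_solve = np.linalg.solve(A, b)   # factor-and-solve path
x_inv = np.linalg.inv(A) @ b      # explicit-inverse path: extra work and memory

print(np.linalg.norm(x_solve - x_true))   # error of the solve path
print(np.linalg.norm(x_inv - x_true))     # error of the explicit-inverse path
```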
Primitive Rule Table
| Primitive | Forward Differential | Reverse Rule |
|---|---|---|
| $C = AB$ | $dC = dA\,B + A\,dB$ | $\bar{A} = \bar{C}B^{\top},\ \bar{B} = A^{\top}\bar{C}$ |
| $y = Ax$ | $dy = dA\,x + A\,dx$ | $\bar{A} = \bar{y}x^{\top},\ \bar{x} = A^{\top}\bar{y}$ |
| $B = A^{\top}$ | $dB = dA^{\top}$ | $\bar{A} = \bar{B}^{\top}$ |
| $B = A^{-1}$ | $dB = -B\,dA\,B$ | $\bar{A} = -B^{\top}\bar{B}B^{\top}$ |
| $x = A^{-1}b$ (solve) | $dx = A^{-1}(db - dA\,x)$ | $w = A^{-\top}\bar{x},\ \bar{b} = w,\ \bar{A} = -wx^{\top}$ |
| $s = \det(A)$ | $ds = \det(A)\operatorname{tr}(A^{-1}dA)$ | $\bar{A} = \bar{s}\det(A)A^{-\top}$ |
| $\ell = \log\det(A)$ | $d\ell = \operatorname{tr}(A^{-1}dA)$ | $\bar{A} = \bar{\ell}A^{-\top}$ |
| $s = \operatorname{tr}(A)$ | $ds = \operatorname{tr}(dA)$ | $\bar{A} = \bar{s}I$ |
| $s = \lVert A\rVert_F$ | $ds = \operatorname{tr}(A^{\top}dA)/s$ | $\bar{A} = \bar{s}A/s$ |
Implementation Notes
A production AD implementation should store enough forward-pass metadata for each primitive:
- primitive
- input shapes
- output shapes
- factorization metadata
- pivoting information
- triangular flags
- symmetry flags
- batch dimensions
- dtype
- device
- layout

For example, a Cholesky backward rule needs to know whether the lower or upper factor was returned. A solve backward rule needs to know whether the matrix was transposed, triangular, batched, or factored with pivoting.
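As a sketch of what such a record might look like, here is a hypothetical tape entry for a solve primitive (this `SolveRecord` dataclass is not any library's actual API):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SolveRecord:
    """Hypothetical forward-pass record saved for the backward rule of a solve."""
    input_shapes: Tuple[Tuple[int, ...], Tuple[int, ...]]
    output_shape: Tuple[int, ...]
    transposed: bool          # was the system A^T x = b?
    lower: Optional[bool]     # triangular flag; None if the matrix is dense
    batch_dims: Tuple[int, ...]
    dtype: str
    device: str
```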
The mathematical rule is only part of the implementation. The primitive contract must also specify:
- valid input domain
- behavior at singular points
- shape conventions
- batch semantics
- gradient convention
- numerical stability guarantees

Linear algebra AD is as much API design as calculus.
Practical Guidance
For AD systems and users:
| Prefer | Avoid |
|---|---|
| `solve(A, b)` | `inverse(A) @ b` |
| `slogdet(A)` | `log(det(A))` |
| Cholesky for SPD matrices | Generic factorization when SPD structure is known |
| Custom primitive gradients | Differentiating low-level factorization code blindly |
| Shape-checked batch rules | Implicit rank assumptions |
| Documented degeneracy behavior | Silent spectral gradient failures |
Linear algebra primitives expose the boundary between formal differentiation and numerical analysis. The derivative rule may be exact, while the computed gradient may still be unstable if the underlying problem is ill-conditioned.
Summary
Linear algebra primitives should be differentiated at the level of their mathematical specification. Matrix multiplication, solves, determinants, traces, and factorizations each have compact differential rules. Reverse-mode AD obtains efficient adjoint rules by transposing these local linear maps.
The main design principle is simple: preserve structure. Use solves instead of inverses, use factorizations appropriate to the matrix class, avoid materializing large Jacobians, and define custom backward rules for primitives whose numerical algorithms contain structure that generic AD would obscure.