Linear Algebra Primitives

Linear algebra primitives are tensor operations with algebraic structure: matrix multiplication, triangular solves, factorizations, inverses, determinants, norms, and spectral operations. They are central to automatic differentiation because many numerical programs are built from these operations rather than from scalar arithmetic alone.

An AD system can differentiate these primitives in two ways. It can decompose them into smaller operations and apply generic AD, or it can attach a custom derivative rule to the primitive. Production systems usually prefer custom rules. They are faster, more stable, and avoid materializing intermediate Jacobians.

Why Linear Algebra Needs Primitive Rules

Consider the matrix inverse:

$$Y = A^{-1}.$$

A naive implementation might compute the inverse through Gaussian elimination and differentiate every scalar operation in the elimination algorithm. That approach is correct in principle, but it exposes implementation details that the user did not intend to differentiate.

A better rule comes from the identity

$$AY = I.$$

Differentiate both sides:

$$dA\,Y + A\,dY = 0.$$

Solve for $dY$:

$$dY = -A^{-1}\,dA\,A^{-1}.$$

This rule is compact and independent of the exact inversion algorithm.

The same principle applies broadly: differentiate the mathematical relation satisfied by the output, not necessarily the program used to compute it.

Matrix Multiplication

Matrix multiplication is the basic dense linear algebra primitive.

Let

$$C = AB,$$

with

$$A \in \mathbb{R}^{m \times k}, \qquad B \in \mathbb{R}^{k \times n}, \qquad C \in \mathbb{R}^{m \times n}.$$

The differential is

$$dC = dA\,B + A\,dB.$$

Given output adjoint $\bar{C}$, reverse mode gives:

$$\bar{A} \mathrel{+}= \bar{C}B^T, \qquad \bar{B} \mathrel{+}= A^T\bar{C}.$$

These rules preserve shape:

$$\bar{C}B^T : (m \times n)(n \times k) = m \times k, \qquad A^T\bar{C} : (k \times m)(m \times n) = k \times n.$$

Matrix multiplication is also the primitive behind dense neural network layers, attention score computation, least-squares solvers, and many tensor contractions.
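
As a concrete sketch in plain NumPy (the function name `matmul_vjp` and the explicit return of adjoints are illustrative; a real system would accumulate into a tape), the rule can be written and checked against a finite-difference directional derivative:

```python
import numpy as np

def matmul_vjp(A, B, C_bar):
    """Reverse-mode rule for C = A @ B: map the output adjoint C_bar
    to the input adjoints (A_bar, B_bar)."""
    A_bar = C_bar @ B.T   # (m x n)(n x k) -> (m x k)
    B_bar = A.T @ C_bar   # (k x m)(m x n) -> (k x n)
    return A_bar, B_bar

# Check <A_bar, dA> + <B_bar, dB> against a finite-difference directional derivative.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
C_bar = rng.normal(size=(3, 2))
dA, dB = rng.normal(size=A.shape), rng.normal(size=B.shape)

A_bar, B_bar = matmul_vjp(A, B, C_bar)
lhs = np.sum(A_bar * dA) + np.sum(B_bar * dB)
eps = 1e-6
rhs = np.sum(C_bar * ((A + eps * dA) @ (B + eps * dB) - A @ B)) / eps
assert np.isclose(lhs, rhs, atol=1e-4)
```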

Matrix-Vector Product

Let

$$y = Ax,$$

where

$$A \in \mathbb{R}^{m \times n}, \qquad x \in \mathbb{R}^n, \qquad y \in \mathbb{R}^m.$$

The differential is

$$dy = dA\,x + A\,dx.$$

Given $\bar{y}$, the reverse rules are:

$$\bar{A} \mathrel{+}= \bar{y}x^T, \qquad \bar{x} \mathrel{+}= A^T\bar{y}.$$

This is a special case of matrix multiplication. Many systems still implement it separately because matrix-vector products have different performance characteristics from matrix-matrix products.
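
A minimal NumPy sketch of the matrix-vector case (illustrative name, adjoints returned rather than accumulated):

```python
import numpy as np

def matvec_vjp(A, x, y_bar):
    """Reverse-mode rule for y = A @ x."""
    A_bar = np.outer(y_bar, x)   # y_bar x^T, shape (m, n)
    x_bar = A.T @ y_bar          # A^T y_bar, shape (n,)
    return A_bar, x_bar
```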

Dot Product

Let

$$y = a^T b.$$

Then

$$dy = da^T\,b + a^T\,db.$$

Reverse rules:

$$\bar{a} \mathrel{+}= \bar{y}b, \qquad \bar{b} \mathrel{+}= \bar{y}a.$$

The dot product reduces two vectors to a scalar. Its reverse rule broadcasts the scalar adjoint back into both inputs.
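
A NumPy sketch of that broadcast, under the same illustrative conventions as above:

```python
import numpy as np

def dot_vjp(a, b, y_bar):
    """Reverse-mode rule for the scalar y = a @ b."""
    return y_bar * b, y_bar * a   # a_bar, b_bar
```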

Outer Product

Let

$$C = ab^T,$$

where

$$a \in \mathbb{R}^m, \qquad b \in \mathbb{R}^n.$$

Then

$$C_{ij} = a_i b_j.$$

The differential is

$$dC = da\,b^T + a\,db^T.$$

Given $\bar{C}$, the reverse rules are:

$$\bar{a} \mathrel{+}= \bar{C}b, \qquad \bar{b} \mathrel{+}= \bar{C}^T a.$$

Outer products appear in rank-one updates, covariance estimates, attention mechanisms, and low-rank models.
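
A NumPy sketch of the outer-product rule (illustrative name):

```python
import numpy as np

def outer_vjp(a, b, C_bar):
    """Reverse-mode rule for C = np.outer(a, b)."""
    a_bar = C_bar @ b     # (m x n)(n,) -> (m,)
    b_bar = C_bar.T @ a   # (n x m)(m,) -> (n,)
    return a_bar, b_bar
```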

Transpose and Permutation

Transpose is a linear operation:

$$Y = X^T.$$

The differential is

$$dY = dX^T.$$

The reverse rule is:

$$\bar{X} \mathrel{+}= \bar{Y}^T.$$

For a general axis permutation,

$$Y = \operatorname{permute}(X, \pi),$$

the reverse rule applies the inverse permutation:

$$\bar{X} \mathrel{+}= \operatorname{permute}(\bar{Y}, \pi^{-1}).$$

These rules are simple, but they matter for layout. A transpose may be a view with changed strides rather than a materialized copy. The reverse pass must respect the same indexing semantics.
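
A NumPy sketch of the permutation rule; `np.argsort` gives the inverse permutation, and `np.transpose` returns a strided view, matching the indexing-semantics point above (the wrapper name is illustrative):

```python
import numpy as np

def permute_vjp(perm, Y_bar):
    """Reverse-mode rule for Y = np.transpose(X, perm): apply the
    inverse axis permutation to the output adjoint."""
    inv_perm = np.argsort(perm)
    return np.transpose(Y_bar, inv_perm)

X = np.arange(24.0).reshape(2, 3, 4)
perm = (2, 0, 1)
Y_bar = np.ones(np.transpose(X, perm).shape)
assert permute_vjp(perm, Y_bar).shape == X.shape
```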

Matrix Inverse

Let

$$Y = A^{-1}.$$

The differential is

$$dY = -A^{-1}\,dA\,A^{-1}.$$

Given output adjoint $\bar{Y}$, derive the reverse rule from

$$\langle \bar{Y}, dY \rangle_F = -\operatorname{tr}(\bar{Y}^T A^{-1}\,dA\,A^{-1}).$$

Using the cyclic property of the trace:

$$\langle \bar{Y}, dY \rangle_F = -\operatorname{tr}\!\left((A^{-T}\bar{Y}A^{-T})^T dA\right).$$

Therefore

$$\bar{A} \mathrel{+}= -A^{-T}\bar{Y}A^{-T}.$$

In practice, AD systems should avoid explicitly forming $A^{-1}$ when possible. Solves are usually more stable than inverses.
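
A NumPy sketch of the reverse rule that follows this advice: the backward pass never forms $A^{-1}$, only two solves against $A^T$ and $A$ (the wrapper name is illustrative):

```python
import numpy as np

def inv_vjp(A, Y_bar):
    """Reverse-mode rule for Y = inv(A): A_bar = -A^{-T} Y_bar A^{-T},
    computed with two solves instead of an explicit inverse."""
    G = np.linalg.solve(A.T, Y_bar)    # A^{-T} Y_bar
    return -np.linalg.solve(A, G.T).T  # -(A^{-T} Y_bar) A^{-T}

# Agreement with the explicit-inverse formula on a small, well-conditioned example.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)
Y_bar = rng.normal(size=(4, 4))
Ainv = np.linalg.inv(A)
assert np.allclose(inv_vjp(A, Y_bar), -Ainv.T @ Y_bar @ Ainv.T)
```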

Linear Solve

A linear solve computes

$$X = A^{-1}B$$

without explicitly forming $A^{-1}$. Equivalently:

$$AX = B.$$

Differentiate:

$$dA\,X + A\,dX = dB.$$

Solve for $dX$:

$$dX = A^{-1}(dB - dA\,X).$$

Let $\bar{X}$ be the output adjoint. Define

$$G = A^{-T}\bar{X}.$$

Then the reverse rules are:

$$\bar{B} \mathrel{+}= G, \qquad \bar{A} \mathrel{+}= -G X^T.$$

This rule is central to differentiating least-squares methods, implicit layers, ODE solvers, Gaussian processes, and constrained optimization routines.

It is also a good example of an AD rule that uses another linear solve during the backward pass.
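
A NumPy sketch (illustrative name; `X` is the saved forward output). Note that the backward pass calls `solve` again, now with $A^T$:

```python
import numpy as np

def solve_vjp(A, X, X_bar):
    """Reverse-mode rule for X = solve(A, B), given the saved output X."""
    G = np.linalg.solve(A.T, X_bar)   # G = A^{-T} X_bar
    B_bar = G
    A_bar = -G @ X.T
    return A_bar, B_bar
```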

Triangular Solve

Triangular solve is a structured version of linear solve:

$$LX = B,$$

where $L$ is lower triangular.

The same differential relation holds:

$$dL\,X + L\,dX = dB.$$

So

$$dX = L^{-1}(dB - dL\,X).$$

Given $\bar{X}$, define

$$G = L^{-T}\bar{X}.$$

Then

$$\bar{B} \mathrel{+}= G, \qquad \bar{L} \mathrel{+}= -G X^T.$$

Because $L$ is triangular, only the triangular part of $\bar{L}$ is valid:

$$\bar{L} \leftarrow \operatorname{tril}(\bar{L}).$$

For upper triangular solves, use the corresponding upper-triangular projection.
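
A sketch using SciPy's triangular solver (the wrapper name is illustrative); the backward pass solves against $L^T$ and projects the matrix adjoint onto the lower triangle:

```python
import numpy as np
from scipy.linalg import solve_triangular

def triangular_solve_vjp(L, X, X_bar):
    """Reverse-mode rule for X = solve(L, B) with lower-triangular L."""
    G = solve_triangular(L, X_bar, lower=True, trans='T')  # L^{-T} X_bar
    B_bar = G
    L_bar = np.tril(-G @ X.T)   # only the lower triangle is structurally valid
    return L_bar, B_bar
```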

Determinant

Let

$$y = \det(A).$$

For nonsingular $A$,

$$dy = \det(A)\operatorname{tr}(A^{-1}\,dA).$$

Therefore,

$$\nabla_A \det(A) = \det(A)A^{-T}.$$

Given scalar adjoint $\bar{y}$:

$$\bar{A} \mathrel{+}= \bar{y}\det(A)A^{-T}.$$

The determinant can overflow or underflow easily. Numerical systems often prefer the log-determinant.

Log-Determinant

Let

$$y = \log \det(A),$$

for nonsingular $A$ with positive determinant, or more commonly for symmetric positive definite $A$.

The differential is

$$dy = \operatorname{tr}(A^{-1}\,dA).$$

So

$$\nabla_A y = A^{-T}.$$

Reverse rule:

$$\bar{A} \mathrel{+}= \bar{y}A^{-T}.$$

For symmetric positive definite matrices, implementations usually compute log-determinants through Cholesky factorization. This is more stable than forming the determinant directly.
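
A sketch of the Cholesky route for SPD matrices (illustrative names). The forward pass reads the log-determinant off the factor's diagonal; the reverse rule needs $A^{-1}$, which is applied here through triangular solves against the same factorization:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def logdet_spd(A):
    """log det(A) for SPD A via Cholesky: A = L L^T gives
    log det(A) = 2 * sum(log(diag(L)))."""
    L = np.linalg.cholesky(A)
    return 2.0 * np.sum(np.log(np.diag(L)))

def logdet_spd_vjp(A, y_bar):
    """Reverse rule: A_bar = y_bar * A^{-T} = y_bar * A^{-1} for SPD A,
    computed from the Cholesky factor rather than np.linalg.inv."""
    c, lower = cho_factor(A)
    return y_bar * cho_solve((c, lower), np.eye(A.shape[0]))
```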

Trace

Trace is the sum of diagonal elements:

$$y = \operatorname{tr}(A).$$

The differential is

$$dy = \operatorname{tr}(dA).$$

The gradient is the identity matrix:

$$\nabla_A y = I.$$

Reverse rule:

$$\bar{A} \mathrel{+}= \bar{y}I.$$

For

$$y = \operatorname{tr}(AB),$$

the differential is

$$dy = \operatorname{tr}(dA\,B) + \operatorname{tr}(A\,dB).$$

The gradients are:

$$\bar{A} \mathrel{+}= \bar{y}B^T, \qquad \bar{B} \mathrel{+}= \bar{y}A^T.$$
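
Both trace rules as a NumPy sketch (illustrative names). Note that forming $AB$ just to take its trace is wasteful; an `einsum` computes the same scalar directly:

```python
import numpy as np

def trace_vjp(n, y_bar):
    """Reverse rule for y = trace(A) with A of shape (n, n)."""
    return y_bar * np.eye(n)

def trace_matmul(A, B):
    """y = trace(A @ B) without materializing A @ B."""
    return np.einsum('ij,ji->', A, B)

def trace_matmul_vjp(A, B, y_bar):
    """Reverse rule for y = trace(A @ B)."""
    return y_bar * B.T, y_bar * A.T
```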

Matrix Norms

The Frobenius norm is

$$y = \|A\|_F = \sqrt{\sum_{ij} A_{ij}^2}.$$

For $A \ne 0$,

$$dy = \left\langle \frac{A}{\|A\|_F}, dA \right\rangle_F.$$

Reverse rule:

$$\bar{A} \mathrel{+}= \bar{y}\,\frac{A}{\|A\|_F}.$$

For the squared Frobenius norm,

$$y = \|A\|_F^2,$$

the reverse rule is:

$$\bar{A} \mathrel{+}= 2\bar{y}A.$$

The squared norm is smoother and cheaper: it is differentiable at $A = 0$ and avoids the division by $\|A\|_F$.
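
Both norm rules as a NumPy sketch (illustrative names; `y` is the saved forward output of the unsquared norm):

```python
import numpy as np

def frobenius_norm_vjp(A, y, y_bar):
    """Reverse rule for y = ||A||_F, valid only for A != 0."""
    return y_bar * A / y

def frobenius_norm_sq_vjp(A, y_bar):
    """Reverse rule for y = ||A||_F^2; defined everywhere, no division."""
    return 2.0 * y_bar * A
```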

Cholesky Factorization

For a symmetric positive definite matrix $A$, Cholesky factorization computes

$$A = LL^T,$$

where $L$ is lower triangular with positive diagonal.

Differentiating this factorization is more involved because $L$ is constrained to be triangular. The differential satisfies

$$dA = dL\,L^T + L\,dL^T.$$

A custom AD rule must solve for the lower-triangular part of $dL$, or in reverse mode, map $\bar{L}$ back to a symmetric $\bar{A}$.

Cholesky is important because it appears in:

| Use Case | Role |
| --- | --- |
| Gaussian processes | Covariance factorization |
| Probabilistic models | Multivariate normal densities |
| Optimization | Newton and quasi-Newton methods |
| Least squares | Stable normal-equation variants |
| Sampling | Reparameterized Gaussian samples |

Most production AD libraries implement a specialized Cholesky backward rule.
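
For forward mode, the constrained differential has a standard closed-form solution, $dL = L\,\Phi(L^{-1}\,dA\,L^{-T})$, where $\Phi$ keeps the lower triangle and halves the diagonal. A minimal NumPy/SciPy sketch of that rule, checked against finite differences (illustrative names, no batching or error handling):

```python
import numpy as np
from scipy.linalg import solve_triangular

def _phi(M):
    """Lower triangle of M with the diagonal halved."""
    P = np.tril(M)
    P[np.diag_indices_from(P)] *= 0.5
    return P

def cholesky_jvp(L, dA):
    """Forward-mode rule for A = L L^T: the lower-triangular solution of
    dA = dL L^T + L dL^T is dL = L @ phi(L^{-1} dA L^{-T})."""
    W = solve_triangular(L, dA, lower=True)        # L^{-1} dA
    W = solve_triangular(L, W.T, lower=True).T     # L^{-1} dA L^{-T}
    return L @ _phi(W)

# Finite-difference check on a small SPD matrix with a symmetric perturbation.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T + 4 * np.eye(4)
dA = rng.normal(size=(4, 4)); dA = dA + dA.T
L = np.linalg.cholesky(A)
eps = 1e-6
fd = (np.linalg.cholesky(A + eps * dA) - L) / eps
assert np.allclose(cholesky_jvp(L, dA), fd, atol=1e-4)
```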

QR Factorization

QR factorization writes

$$A = QR,$$

where $Q$ has orthonormal columns and $R$ is upper triangular.

QR is used for least squares, orthogonalization, and numerically stable basis construction.

Differentiating QR requires respecting the constraints:

$$Q^T Q = I, \qquad R \text{ is upper triangular}.$$

The derivative has singularities when $A$ loses rank or when diagonal conventions for $R$ change. AD systems should document the assumptions under which QR gradients are valid.

Singular Value Decomposition

The singular value decomposition writes

$$A = U\Sigma V^T.$$

SVD is powerful but delicate for AD.

Gradients through SVD may be undefined or unstable when singular values are repeated or nearly equal. This happens because singular vectors are not uniquely determined in degenerate subspaces.

For many applications, differentiating singular values alone is more stable than differentiating singular vectors.

AD systems should treat SVD as a high-risk primitive. It requires clear domain assumptions and careful numerical handling.
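
For simple (non-repeated) singular values, the forward-mode rule for the values alone is $d\sigma_k = u_k^T\,dA\,v_k$, with no difference-of-singular-values denominators. A small numerical check (illustrative, assuming distinct singular values):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
dA = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
ds = np.einsum('ik,ij,jk->k', U, dA, Vt.T)   # d(sigma_k) = u_k^T dA v_k

eps = 1e-6
fd = (np.linalg.svd(A + eps * dA, compute_uv=False) - s) / eps
assert np.allclose(ds, fd, atol=1e-4)
```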

Eigenvalue Decomposition

For a symmetric matrix,

$$A = Q\Lambda Q^T.$$

Gradients through eigenvectors are unstable when eigenvalues are close. Terms of the form

$$\frac{1}{\lambda_i - \lambda_j}$$

appear in derivative formulas. When eigenvalues collide, the derivative of individual eigenvectors becomes undefined.

For this reason, differentiating spectral decompositions requires attention to degeneracy, ordering, and sign conventions.

Prefer Solves Over Inverses

A common numerical rule is:

$$A^{-1}B$$

should usually be computed as a solve:

$$\operatorname{solve}(A, B).$$

The same applies in AD. The backward pass for solve uses triangular or linear solves. This preserves numerical structure and reduces error.

Code that explicitly forms inverses often creates:

| Problem | Cause |
| --- | --- |
| More floating-point error | Inverse materialization |
| More memory traffic | Dense intermediate matrix |
| Poor conditioning | Explicit inverse amplifies error |
| Slower backward pass | Extra matrix multiplications |

An AD-aware linear algebra API should expose solves as first-class primitives.
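
An illustrative comparison (exact numbers depend on the matrix, but the pattern is typical): on a deliberately ill-conditioned system, the residual of a direct solve is usually much smaller than the residual obtained by multiplying with an explicit inverse.

```python
import numpy as np

# Build a matrix with condition number ~1e10 from random orthogonal factors.
rng = np.random.default_rng(0)
n = 200
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = (U * np.logspace(0, -10, n)) @ V.T
b = rng.normal(size=n)

x_solve = np.linalg.solve(A, b)   # factor-and-solve
x_inv = np.linalg.inv(A) @ b      # explicit inverse, then multiply

print(np.linalg.norm(A @ x_solve - b))   # typically near machine precision
print(np.linalg.norm(A @ x_inv - b))     # typically several orders of magnitude larger
```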

Primitive Rule Table

| Primitive | Forward Differential | Reverse Rule |
| --- | --- | --- |
| $C = AB$ | $dC = dA\,B + A\,dB$ | $\bar{A} \mathrel{+}= \bar{C}B^T,\ \bar{B} \mathrel{+}= A^T\bar{C}$ |
| $y = a^T b$ | $dy = da^T b + a^T db$ | $\bar{a} \mathrel{+}= \bar{y}b,\ \bar{b} \mathrel{+}= \bar{y}a$ |
| $Y = A^{-1}$ | $dY = -A^{-1}\,dA\,A^{-1}$ | $\bar{A} \mathrel{+}= -A^{-T}\bar{Y}A^{-T}$ |
| $X = A^{-1}B$ | $dX = A^{-1}(dB - dA\,X)$ | $G = A^{-T}\bar{X};\ \bar{B} \mathrel{+}= G,\ \bar{A} \mathrel{+}= -GX^T$ |
| $y = \det A$ | $dy = \det(A)\operatorname{tr}(A^{-1}dA)$ | $\bar{A} \mathrel{+}= \bar{y}\det(A)A^{-T}$ |
| $y = \log\det A$ | $dy = \operatorname{tr}(A^{-1}dA)$ | $\bar{A} \mathrel{+}= \bar{y}A^{-T}$ |
| $y = \operatorname{tr} A$ | $dy = \operatorname{tr}(dA)$ | $\bar{A} \mathrel{+}= \bar{y}I$ |
| $y = \Vert A\Vert_F^2$ | $dy = 2\langle A, dA\rangle_F$ | $\bar{A} \mathrel{+}= 2\bar{y}A$ |

Implementation Notes

A production AD implementation should store enough forward-pass metadata for each primitive:

primitive
input shapes
output shapes
factorization metadata
pivoting information
triangular flags
symmetry flags
batch dimensions
dtype
device
layout

For example, a Cholesky backward rule needs to know whether the lower or upper factor was returned. A solve backward rule needs to know whether the matrix was transposed, triangular, batched, or factored with pivoting.
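
As a sketch of what such a record might look like for a solve primitive (the field names are illustrative, not any particular library's API):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SolveRecord:
    """Forward-pass metadata saved so the backward rule can be replayed."""
    input_shapes: Tuple[Tuple[int, ...], ...]
    output_shape: Tuple[int, ...]
    lower: bool                                # triangular flag: lower vs. upper factor
    transposed: bool                           # whether the matrix was applied transposed
    symmetric: bool                            # symmetry flag (e.g. SPD input)
    batch_dims: Tuple[int, ...]                # leading batch dimensions, if any
    dtype: str
    device: str
    pivots: Optional[Tuple[int, ...]] = None   # pivoting information from the factorization
```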

The mathematical rule is only part of the implementation. The primitive contract must also specify:

valid input domain
behavior at singular points
shape conventions
batch semantics
gradient convention
numerical stability guarantees

Linear algebra AD is as much API design as calculus.

Practical Guidance

For AD systems and users:

| Prefer | Avoid |
| --- | --- |
| `solve(A, b)` | `inverse(A) @ b` |
| `slogdet(A)` | `log(det(A))` |
| Cholesky for SPD matrices | Generic factorization when SPD structure is known |
| Custom primitive gradients | Differentiating low-level factorization code blindly |
| Shape-checked batch rules | Implicit rank assumptions |
| Documented degeneracy behavior | Silent spectral gradient failures |

Linear algebra primitives expose the boundary between formal differentiation and numerical analysis. The derivative rule may be exact, while the computed gradient may still be unstable if the underlying problem is ill-conditioned.

Summary

Linear algebra primitives should be differentiated at the level of their mathematical specification. Matrix multiplication, solves, determinants, traces, and factorizations each have compact differential rules. Reverse-mode AD obtains efficient adjoint rules by transposing these local linear maps.

The main design principle is simple: preserve structure. Use solves instead of inverses, use factorizations appropriate to the matrix class, avoid materializing large Jacobians, and define custom backward rules for primitives whose numerical algorithms contain structure that generic AD would obscure.