Linear Algebra Primitives

Linear algebra primitives are tensor operations with algebraic structure: matrix multiplication, triangular solves, factorizations, inverses, determinants, norms, and spectral operations. They are central to automatic differentiation because many numerical programs are built from these operations rather than from scalar arithmetic alone.

An AD system can differentiate these primitives in two ways. It can decompose them into smaller operations and apply generic AD, or it can attach a custom derivative rule to the primitive. Production systems usually prefer custom rules. They are faster, more stable, and avoid materializing intermediate Jacobians.

Why Linear Algebra Needs Primitive Rules

Consider the matrix inverse:

$$Y = A^{-1}.$$

A naive implementation might compute the inverse through Gaussian elimination and differentiate every scalar operation in the elimination algorithm. That approach is correct in principle, but it exposes implementation details that the user did not intend to differentiate.

A better rule comes from the identity

$$AY = I.$$

Differentiate both sides:

$$dA\,Y + A\,dY = 0.$$

Solve for $dY$:

$$dY = -A^{-1}\,dA\,A^{-1}.$$

This rule is compact and independent of the exact inversion algorithm.

The same principle applies broadly: differentiate the mathematical relation satisfied by the output, not necessarily the program used to compute it.

Matrix Multiplication

Matrix multiplication is the basic dense linear algebra primitive.

Let

$$C = AB,$$

with

$$A \in \mathbb{R}^{m \times k}, \qquad B \in \mathbb{R}^{k \times n}, \qquad C \in \mathbb{R}^{m \times n}.$$

The differential is

$$dC = dA\,B + A\,dB.$$

Given output adjoint $\bar{C}$, reverse mode gives:

$$\bar{A} \mathrel{+}= \bar{C}B^T, \qquad \bar{B} \mathrel{+}= A^T\bar{C}.$$

These rules preserve shape:

$$\bar{C}B^T : (m \times n)(n \times k) = m \times k, \qquad A^T\bar{C} : (k \times m)(m \times n) = k \times n.$$

Matrix multiplication is also the primitive behind dense neural network layers, attention score computation, least-squares solvers, and many tensor contractions.
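
As a concrete sketch in plain NumPy (the function name `matmul_vjp` and the explicit return of adjoints are illustrative; a real system would accumulate into a tape), the rule can be written and checked against a finite-difference directional derivative:

```python
import numpy as np

def matmul_vjp(A, B, C_bar):
    """Reverse-mode rule for C = A @ B: map the output adjoint C_bar
    to the input adjoints (A_bar, B_bar)."""
    A_bar = C_bar @ B.T   # (m x n)(n x k) -> (m x k)
    B_bar = A.T @ C_bar   # (k x m)(m x n) -> (k x n)
    return A_bar, B_bar

# Check <A_bar, dA> + <B_bar, dB> against a finite-difference directional derivative.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
C_bar = rng.normal(size=(3, 2))
dA, dB = rng.normal(size=A.shape), rng.normal(size=B.shape)

A_bar, B_bar = matmul_vjp(A, B, C_bar)
lhs = np.sum(A_bar * dA) + np.sum(B_bar * dB)
eps = 1e-6
rhs = np.sum(C_bar * ((A + eps * dA) @ (B + eps * dB) - A @ B)) / eps
assert np.isclose(lhs, rhs, atol=1e-4)
```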

Matrix-Vector Product

Let

$$y = Ax,$$

where

$$A \in \mathbb{R}^{m \times n}, \qquad x \in \mathbb{R}^n, \qquad y \in \mathbb{R}^m.$$

The differential is

$$dy = dA\,x + A\,dx.$$

Given $\bar{y}$, the reverse rules are:

$$\bar{A} \mathrel{+}= \bar{y}x^T, \qquad \bar{x} \mathrel{+}= A^T\bar{y}.$$

This is a special case of matrix multiplication. Many systems still implement it separately because matrix-vector products have different performance characteristics from matrix-matrix products.
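
A minimal NumPy sketch of the matrix-vector case (illustrative name, adjoints returned rather than accumulated):

```python
import numpy as np

def matvec_vjp(A, x, y_bar):
    """Reverse-mode rule for y = A @ x."""
    A_bar = np.outer(y_bar, x)   # y_bar x^T, shape (m, n)
    x_bar = A.T @ y_bar          # A^T y_bar, shape (n,)
    return A_bar, x_bar
```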

Dot Product

Let

$$y = a^T b.$$

Then

$$dy = da^T\,b + a^T\,db.$$

Reverse rules:

$$\bar{a} \mathrel{+}= \bar{y}b, \qquad \bar{b} \mathrel{+}= \bar{y}a.$$

The dot product reduces two vectors to a scalar. Its reverse rule broadcasts the scalar adjoint back into both inputs.
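
A NumPy sketch of that broadcast, under the same illustrative conventions as above:

```python
import numpy as np

def dot_vjp(a, b, y_bar):
    """Reverse-mode rule for the scalar y = a @ b."""
    return y_bar * b, y_bar * a   # a_bar, b_bar
```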

Outer Product

Let

$$C = ab^T,$$

where

$$a \in \mathbb{R}^m, \qquad b \in \mathbb{R}^n.$$

Then

$$C_{ij} = a_i b_j.$$

The differential is

$$dC = da\,b^T + a\,db^T.$$

Given $\bar{C}$, the reverse rules are:

$$\bar{a} \mathrel{+}= \bar{C}b, \qquad \bar{b} \mathrel{+}= \bar{C}^T a.$$

Outer products appear in rank-one updates, covariance estimates, attention mechanisms, and low-rank models.
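
A NumPy sketch of the outer-product rule (illustrative name):

```python
import numpy as np

def outer_vjp(a, b, C_bar):
    """Reverse-mode rule for C = np.outer(a, b)."""
    a_bar = C_bar @ b     # (m x n)(n,) -> (m,)
    b_bar = C_bar.T @ a   # (n x m)(m,) -> (n,)
    return a_bar, b_bar
```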

Transpose and Permutation

Transpose is a linear operation:

$$Y = X^T.$$

The differential is

$$dY = dX^T.$$

The reverse rule is:

$$\bar{X} \mathrel{+}= \bar{Y}^T.$$

For a general axis permutation,

$$Y = \operatorname{permute}(X, \pi),$$

the reverse rule applies the inverse permutation:

$$\bar{X} \mathrel{+}= \operatorname{permute}(\bar{Y}, \pi^{-1}).$$

These rules are simple, but they matter for layout. A transpose may be a view with changed strides rather than a materialized copy. The reverse pass must respect the same indexing semantics.
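
A NumPy sketch of the permutation rule; `np.argsort` gives the inverse permutation, and `np.transpose` returns a strided view, matching the indexing-semantics point above (the wrapper name is illustrative):

```python
import numpy as np

def permute_vjp(perm, Y_bar):
    """Reverse-mode rule for Y = np.transpose(X, perm): apply the
    inverse axis permutation to the output adjoint."""
    inv_perm = np.argsort(perm)
    return np.transpose(Y_bar, inv_perm)

X = np.arange(24.0).reshape(2, 3, 4)
perm = (2, 0, 1)
Y_bar = np.ones(np.transpose(X, perm).shape)
assert permute_vjp(perm, Y_bar).shape == X.shape
```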

Matrix Inverse

Let

$$Y = A^{-1}.$$

The differential is

$$dY = -A^{-1}\,dA\,A^{-1}.$$

Given output adjoint $\bar{Y}$, derive the reverse rule from

$$\langle \bar{Y}, dY \rangle_F = -\operatorname{tr}(\bar{Y}^T A^{-1}\,dA\,A^{-1}).$$

Using the cyclic property of the trace:

$$\langle \bar{Y}, dY \rangle_F = -\operatorname{tr}\!\left((A^{-T}\bar{Y}A^{-T})^T dA\right).$$

Therefore

$$\bar{A} \mathrel{+}= -A^{-T}\bar{Y}A^{-T}.$$

In practice, AD systems should avoid explicitly forming $A^{-1}$ when possible. Solves are usually more stable than inverses.
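
A NumPy sketch of the reverse rule that follows this advice: the backward pass never forms $A^{-1}$, only two solves against $A^T$ and $A$ (the wrapper name is illustrative):

```python
import numpy as np

def inv_vjp(A, Y_bar):
    """Reverse-mode rule for Y = inv(A): A_bar = -A^{-T} Y_bar A^{-T},
    computed with two solves instead of an explicit inverse."""
    G = np.linalg.solve(A.T, Y_bar)    # A^{-T} Y_bar
    return -np.linalg.solve(A, G.T).T  # -(A^{-T} Y_bar) A^{-T}

# Agreement with the explicit-inverse formula on a small, well-conditioned example.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)
Y_bar = rng.normal(size=(4, 4))
Ainv = np.linalg.inv(A)
assert np.allclose(inv_vjp(A, Y_bar), -Ainv.T @ Y_bar @ Ainv.T)
```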

Linear Solve

A linear solve computes

$$X = A^{-1}B$$

without explicitly forming $A^{-1}$. Equivalently:

$$AX = B.$$

Differentiate:

$$dA\,X + A\,dX = dB.$$

Solve for $dX$:

$$dX = A^{-1}(dB - dA\,X).$$

Let $\bar{X}$ be the output adjoint. Define

$$G = A^{-T}\bar{X}.$$

Then the reverse rules are:

$$\bar{B} \mathrel{+}= G, \qquad \bar{A} \mathrel{+}= -G X^T.$$

This rule is central to differentiating least-squares methods, implicit layers, ODE solvers, Gaussian processes, and constrained optimization routines.

It is also a good example of an AD rule that uses another linear solve during the backward pass.
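
A NumPy sketch (illustrative name; `X` is the saved forward output). Note that the backward pass calls `solve` again, now with $A^T$:

```python
import numpy as np

def solve_vjp(A, X, X_bar):
    """Reverse-mode rule for X = solve(A, B), given the saved output X."""
    G = np.linalg.solve(A.T, X_bar)   # G = A^{-T} X_bar
    B_bar = G
    A_bar = -G @ X.T
    return A_bar, B_bar
```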

Triangular Solve

Triangular solve is a structured version of linear solve:

$$LX = B,$$

where $L$ is lower triangular.

The same differential relation holds:

$$dL\,X + L\,dX = dB.$$

So

$$dX = L^{-1}(dB - dL\,X).$$

Given $\bar{X}$, define

$$G = L^{-T}\bar{X}.$$

Then

$$\bar{B} \mathrel{+}= G, \qquad \bar{L} \mathrel{+}= -G X^T.$$

Because $L$ is triangular, only the triangular part of $\bar{L}$ is valid:

$$\bar{L} \leftarrow \operatorname{tril}(\bar{L}).$$

For upper triangular solves, use the corresponding upper-triangular projection.
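
A sketch using SciPy's triangular solver (the wrapper name is illustrative); the backward pass solves against $L^T$ and projects the matrix adjoint onto the lower triangle:

```python
import numpy as np
from scipy.linalg import solve_triangular

def triangular_solve_vjp(L, X, X_bar):
    """Reverse-mode rule for X = solve(L, B) with lower-triangular L."""
    G = solve_triangular(L, X_bar, lower=True, trans='T')  # L^{-T} X_bar
    B_bar = G
    L_bar = np.tril(-G @ X.T)   # only the lower triangle is structurally valid
    return L_bar, B_bar
```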

Determinant

Let

$$y = \det(A).$$

For nonsingular $A$,

$$dy = \det(A)\operatorname{tr}(A^{-1}\,dA).$$

Therefore,

$$\nabla_A \det(A) = \det(A)A^{-T}.$$

Given scalar adjoint $\bar{y}$:

$$\bar{A} \mathrel{+}= \bar{y}\det(A)A^{-T}.$$

The determinant can overflow or underflow easily. Numerical systems often prefer the log-determinant.

Log-Determinant

Let

$$y = \log \det(A),$$

for nonsingular $A$ with positive determinant, or more commonly for symmetric positive definite $A$.

The differential is

$$dy = \operatorname{tr}(A^{-1}\,dA).$$

So

$$\nabla_A y = A^{-T}.$$

Reverse rule:

$$\bar{A} \mathrel{+}= \bar{y}A^{-T}.$$

For symmetric positive definite matrices, implementations usually compute log-determinants through Cholesky factorization. This is more stable than forming the determinant directly.
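
A sketch of the Cholesky route for SPD matrices (illustrative names). The forward pass reads the log-determinant off the factor's diagonal; the reverse rule needs $A^{-1}$, which is applied here through triangular solves against the same factorization:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def logdet_spd(A):
    """log det(A) for SPD A via Cholesky: A = L L^T gives
    log det(A) = 2 * sum(log(diag(L)))."""
    L = np.linalg.cholesky(A)
    return 2.0 * np.sum(np.log(np.diag(L)))

def logdet_spd_vjp(A, y_bar):
    """Reverse rule: A_bar = y_bar * A^{-T} = y_bar * A^{-1} for SPD A,
    computed from the Cholesky factor rather than np.linalg.inv."""
    c, lower = cho_factor(A)
    return y_bar * cho_solve((c, lower), np.eye(A.shape[0]))
```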

Trace

Trace is the sum of diagonal elements:

$$y = \operatorname{tr}(A).$$

The differential is

$$dy = \operatorname{tr}(dA).$$

The gradient is the identity matrix:

$$\nabla_A y = I.$$

Reverse rule:

$$\bar{A} \mathrel{+}= \bar{y}I.$$

For

$$y = \operatorname{tr}(AB),$$

the differential is

$$dy = \operatorname{tr}(dA\,B) + \operatorname{tr}(A\,dB).$$

The gradients are:

$$\bar{A} \mathrel{+}= \bar{y}B^T, \qquad \bar{B} \mathrel{+}= \bar{y}A^T.$$
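
Both trace rules as a NumPy sketch (illustrative names). Note that forming $AB$ just to take its trace is wasteful; an `einsum` computes the same scalar directly:

```python
import numpy as np

def trace_vjp(n, y_bar):
    """Reverse rule for y = trace(A) with A of shape (n, n)."""
    return y_bar * np.eye(n)

def trace_matmul(A, B):
    """y = trace(A @ B) without materializing A @ B."""
    return np.einsum('ij,ji->', A, B)

def trace_matmul_vjp(A, B, y_bar):
    """Reverse rule for y = trace(A @ B)."""
    return y_bar * B.T, y_bar * A.T
```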

Matrix Norms

The Frobenius norm is

$$y = \|A\|_F = \sqrt{\sum_{ij} A_{ij}^2}.$$

For $A \ne 0$,

$$dy = \left\langle \frac{A}{\|A\|_F}, dA \right\rangle_F.$$

Reverse rule:

$$\bar{A} \mathrel{+}= \bar{y}\,\frac{A}{\|A\|_F}.$$

For the squared Frobenius norm,

$$y = \|A\|_F^2,$$

the reverse rule is:

$$\bar{A} \mathrel{+}= 2\bar{y}A.$$

The squared norm is smoother and cheaper: it is differentiable at $A = 0$ and avoids the division by $\|A\|_F$.
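
Both norm rules as a NumPy sketch (illustrative names; `y` is the saved forward output of the unsquared norm):

```python
import numpy as np

def frobenius_norm_vjp(A, y, y_bar):
    """Reverse rule for y = ||A||_F, valid only for A != 0."""
    return y_bar * A / y

def frobenius_norm_sq_vjp(A, y_bar):
    """Reverse rule for y = ||A||_F^2; defined everywhere, no division."""
    return 2.0 * y_bar * A
```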

Cholesky Factorization

For a symmetric positive definite matrix $A$, Cholesky factorization computes

$$A = LL^T,$$

where $L$ is lower triangular with positive diagonal.

Differentiating this factorization is more involved because $L$ is constrained to be triangular. The differential satisfies

$$dA = dL\,L^T + L\,dL^T.$$

A custom AD rule must solve for the lower-triangular part of $dL$, or in reverse mode, map $\bar{L}$ back to a symmetric $\bar{A}$.

Cholesky is important because it appears in:

| Use Case | Role |
| --- | --- |
| Gaussian processes | Covariance factorization |
| Probabilistic models | Multivariate normal densities |
| Optimization | Newton and quasi-Newton methods |
| Least squares | Stable normal-equation variants |
| Sampling | Reparameterized Gaussian samples |

Most production AD libraries implement a specialized Cholesky backward rule.
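
For forward mode, the constrained differential has a standard closed-form solution, $dL = L\,\Phi(L^{-1}\,dA\,L^{-T})$, where $\Phi$ keeps the lower triangle and halves the diagonal. A minimal NumPy/SciPy sketch of that rule, checked against finite differences (illustrative names, no batching or error handling):

```python
import numpy as np
from scipy.linalg import solve_triangular

def _phi(M):
    """Lower triangle of M with the diagonal halved."""
    P = np.tril(M)
    P[np.diag_indices_from(P)] *= 0.5
    return P

def cholesky_jvp(L, dA):
    """Forward-mode rule for A = L L^T: the lower-triangular solution of
    dA = dL L^T + L dL^T is dL = L @ phi(L^{-1} dA L^{-T})."""
    W = solve_triangular(L, dA, lower=True)        # L^{-1} dA
    W = solve_triangular(L, W.T, lower=True).T     # L^{-1} dA L^{-T}
    return L @ _phi(W)

# Finite-difference check on a small SPD matrix with a symmetric perturbation.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T + 4 * np.eye(4)
dA = rng.normal(size=(4, 4)); dA = dA + dA.T
L = np.linalg.cholesky(A)
eps = 1e-6
fd = (np.linalg.cholesky(A + eps * dA) - L) / eps
assert np.allclose(cholesky_jvp(L, dA), fd, atol=1e-4)
```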

QR Factorization

QR factorization writes

$$A = QR,$$

where $Q$ has orthonormal columns and $R$ is upper triangular.

QR is used for least squares, orthogonalization, and numerically stable basis construction.

Differentiating QR requires respecting the constraints:

$$Q^T Q = I, \qquad R \text{ is upper triangular}.$$

The derivative has singularities when $A$ loses rank or when diagonal conventions for $R$ change. AD systems should document the assumptions under which QR gradients are valid.

Singular Value Decomposition

The singular value decomposition writes

$$A = U\Sigma V^T.$$

SVD is powerful but delicate for AD.

Gradients through SVD may be undefined or unstable when singular values are repeated or nearly equal. This happens because singular vectors are not uniquely determined in degenerate subspaces.

For many applications, differentiating singular values alone is more stable than differentiating singular vectors.

AD systems should treat SVD as a high-risk primitive. It requires clear domain assumptions and careful numerical handling.
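
For simple (non-repeated) singular values, the forward-mode rule for the values alone is $d\sigma_k = u_k^T\,dA\,v_k$, with no difference-of-singular-values denominators. A small numerical check (illustrative, assuming distinct singular values):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
dA = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
ds = np.einsum('ik,ij,jk->k', U, dA, Vt.T)   # d(sigma_k) = u_k^T dA v_k

eps = 1e-6
fd = (np.linalg.svd(A + eps * dA, compute_uv=False) - s) / eps
assert np.allclose(ds, fd, atol=1e-4)
```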

Eigenvalue Decomposition

For a symmetric matrix,

$$A = Q\Lambda Q^T.$$

Gradients through eigenvectors are unstable when eigenvalues are close. Terms of the form

$$\frac{1}{\lambda_i - \lambda_j}$$

appear in derivative formulas. When eigenvalues collide, the derivative of individual eigenvectors becomes undefined.

For this reason, differentiating spectral decompositions requires attention to degeneracy, ordering, and sign conventions.

Prefer Solves Over Inverses

A common numerical rule is:

$$A^{-1}B$$

should usually be computed as a solve:

$$\operatorname{solve}(A, B).$$

The same applies in AD. The backward pass for solve uses triangular or linear solves. This preserves numerical structure and reduces error.

Code that explicitly forms inverses often creates:

| Problem | Cause |
| --- | --- |
| More floating-point error | Inverse materialization |
| More memory traffic | Dense intermediate matrix |
| Poor conditioning | Explicit inverse amplifies error |
| Slower backward pass | Extra matrix multiplications |

An AD-aware linear algebra API should expose solves as first-class primitives.
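
An illustrative comparison (exact numbers depend on the matrix, but the pattern is typical): on a deliberately ill-conditioned system, the residual of a direct solve is usually much smaller than the residual obtained by multiplying with an explicit inverse.

```python
import numpy as np

# Build a matrix with condition number ~1e10 from random orthogonal factors.
rng = np.random.default_rng(0)
n = 200
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = (U * np.logspace(0, -10, n)) @ V.T
b = rng.normal(size=n)

x_solve = np.linalg.solve(A, b)   # factor-and-solve
x_inv = np.linalg.inv(A) @ b      # explicit inverse, then multiply

print(np.linalg.norm(A @ x_solve - b))   # typically near machine precision
print(np.linalg.norm(A @ x_inv - b))     # typically several orders of magnitude larger
```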

Primitive Rule Table

| Primitive | Forward Differential | Reverse Rule |
| --- | --- | --- |
| $C = AB$ | $dC = dA\,B + A\,dB$ | $\bar{A} \mathrel{+}= \bar{C}B^T,\ \bar{B} \mathrel{+}= A^T\bar{C}$ |
| $y = a^T b$ | $dy = da^T b + a^T db$ | $\bar{a} \mathrel{+}= \bar{y}b,\ \bar{b} \mathrel{+}= \bar{y}a$ |
| $Y = A^{-1}$ | $dY = -A^{-1}\,dA\,A^{-1}$ | $\bar{A} \mathrel{+}= -A^{-T}\bar{Y}A^{-T}$ |
| $X = A^{-1}B$ | $dX = A^{-1}(dB - dA\,X)$ | $G = A^{-T}\bar{X};\ \bar{B} \mathrel{+}= G,\ \bar{A} \mathrel{+}= -GX^T$ |
| $y = \det A$ | $dy = \det(A)\operatorname{tr}(A^{-1}dA)$ | $\bar{A} \mathrel{+}= \bar{y}\det(A)A^{-T}$ |
| $y = \log\det A$ | $dy = \operatorname{tr}(A^{-1}dA)$ | $\bar{A} \mathrel{+}= \bar{y}A^{-T}$ |
| $y = \operatorname{tr} A$ | $dy = \operatorname{tr}(dA)$ | $\bar{A} \mathrel{+}= \bar{y}I$ |
| $y = \Vert A\Vert_F^2$ | $dy = 2\langle A, dA\rangle_F$ | $\bar{A} \mathrel{+}= 2\bar{y}A$ |

Implementation Notes

A production AD implementation should store enough forward-pass metadata for each primitive:

primitive
input shapes
output shapes
factorization metadata
pivoting information
triangular flags
symmetry flags
batch dimensions
dtype
device
layout

For example, a Cholesky backward rule needs to know whether the lower or upper factor was returned. A solve backward rule needs to know whether the matrix was transposed, triangular, batched, or factored with pivoting.
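
As a sketch of what such a record might look like for a solve primitive (the field names are illustrative, not any particular library's API):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SolveRecord:
    """Forward-pass metadata saved so the backward rule can be replayed."""
    input_shapes: Tuple[Tuple[int, ...], ...]
    output_shape: Tuple[int, ...]
    lower: bool                                # triangular flag: lower vs. upper factor
    transposed: bool                           # whether the matrix was applied transposed
    symmetric: bool                            # symmetry flag (e.g. SPD input)
    batch_dims: Tuple[int, ...]                # leading batch dimensions, if any
    dtype: str
    device: str
    pivots: Optional[Tuple[int, ...]] = None   # pivoting information from the factorization
```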

The mathematical rule is only part of the implementation. The primitive contract must also specify:

valid input domain
behavior at singular points
shape conventions
batch semantics
gradient convention
numerical stability guarantees

Linear algebra AD is as much API design as calculus.

Practical Guidance

For AD systems and users:

| Prefer | Avoid |
| --- | --- |
| `solve(A, b)` | `inverse(A) @ b` |
| `slogdet(A)` | `log(det(A))` |
| Cholesky for SPD matrices | Generic factorization when SPD structure is known |
| Custom primitive gradients | Differentiating low-level factorization code blindly |
| Shape-checked batch rules | Implicit rank assumptions |
| Documented degeneracy behavior | Silent spectral gradient failures |

Linear algebra primitives expose the boundary between formal differentiation and numerical analysis. The derivative rule may be exact, while the computed gradient may still be unstable if the underlying problem is ill-conditioned.

Summary

Linear algebra primitives should be differentiated at the level of their mathematical specification. Matrix multiplication, solves, determinants, traces, and factorizations each have compact differential rules. Reverse-mode AD obtains efficient adjoint rules by transposing these local linear maps.

The main design principle is simple: preserve structure. Use solves instead of inverses, use factorizations appropriate to the matrix class, avoid materializing large Jacobians, and define custom backward rules for primitives whose numerical algorithms contain structure that generic AD would obscure.