
Differentiating Factorizations

Matrix factorizations rewrite a matrix into structured factors. They are used because the factors make later computations cheaper, more stable, or easier to interpret. In automatic differentiation, factorizations are delicate primitives because their outputs satisfy constraints.

A generic matrix has free entries. A factorization output usually does not. Orthogonal matrices, triangular matrices, diagonal singular-value matrices, and permutation matrices all live in constrained spaces. The derivative must respect those constraints.

Factorization as an Implicit Equation

Most factorizations are best differentiated through the equation they satisfy.

For example, Cholesky factorization computes

$$A = LL^T,$$

where $A$ is symmetric positive definite and $L$ is lower triangular with positive diagonal.

The differential is

$$dA = dL\,L^T + L\,dL^T.$$

This equation relates $dA$ and $dL$. Because $dL$ is lower triangular, the equation has a unique solution under the Cholesky constraints.

The same pattern applies to other factorizations:

| Factorization | Defining Equation | Constraints |
|---|---|---|
| LU | $PA = LU$ | $L$ unit lower triangular, $U$ upper triangular |
| QR | $A = QR$ | $Q^TQ = I$, $R$ upper triangular |
| Cholesky | $A = LL^T$ | $L$ lower triangular, positive diagonal |
| Eigendecomposition | $A = Q\Lambda Q^T$ | $Q^TQ = I$, $\Lambda$ diagonal |
| SVD | $A = U\Sigma V^T$ | $U,V$ orthogonal, $\Sigma$ diagonal nonnegative |

The factorization algorithm may use pivots, Householder reflectors, blocked kernels, or iterative methods. The derivative rule should usually be derived from the defining equation, not from the low-level implementation.

Cholesky Factorization

Cholesky is one of the cleanest factorizations for AD because it has a stable domain: symmetric positive definite matrices.

Let

$$A = LL^T, \qquad L \text{ lower triangular}.$$

The forward differential is

$$dA = dL\,L^T + L\,dL^T.$$

To solve for $dL$, multiply on the left by $L^{-1}$ and on the right by $L^{-T}$:

$$L^{-1}\,dA\,L^{-T} = L^{-1}dL + dL^T L^{-T}.$$

Define

$$S = L^{-1}\,dA\,L^{-T}, \qquad K = L^{-1}dL.$$

Because $dL$ is lower triangular and $L^{-1}$ is lower triangular, $K$ is lower triangular.

Then

$$S = K + K^T.$$

So $K$ is recovered by taking the lower triangular part of $S$, with half of the diagonal:

$$K_{ij} = \begin{cases} S_{ij}, & i > j, \\ \tfrac{1}{2}S_{ii}, & i = j, \\ 0, & i < j. \end{cases}$$

Then

$$dL = LK.$$

This gives a forward-mode rule for differentiating Cholesky.

Reverse mode derives the adjoint rule by transposing this linear map. Production systems usually implement a specialized backward pass using triangular solves and symmetric projections.
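As a concrete sketch of the forward rule, assuming NumPy/SciPy and a symmetric perturbation `dA` (the helper name `cholesky_jvp` is illustrative):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def cholesky_jvp(A, dA):
    """Forward-mode Cholesky: given SPD A and symmetric dA, return (L, dL)."""
    L = cholesky(A, lower=True)
    # S = L^{-1} dA L^{-T}, computed with two triangular solves
    S = solve_triangular(L, dA, lower=True)
    S = solve_triangular(L, S.T, lower=True).T
    # K: lower-triangular part of S, with half the diagonal
    K = np.tril(S)
    K[np.diag_indices_from(K)] *= 0.5
    return L, L @ K
```

A finite-difference comparison against `cholesky(A + h * dA)` for small `h` is a cheap sanity check for rules like this.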

Cholesky Domain

Cholesky requires AA to be symmetric positive definite. At the boundary of this domain, the derivative may become unstable or undefined.

If the smallest eigenvalue of $A$ approaches zero, then $L^{-1}$ becomes large. The backward pass can amplify adjoints sharply.

In practice, users often stabilize Cholesky with diagonal jitter:

$$A_\epsilon = A + \epsilon I.$$

This changes the differentiated function. The gradient is then the gradient of the stabilized factorization, not the original one.
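A minimal sketch of this pattern (`jittered_cholesky` is an illustrative name, and the default `eps` is an arbitrary choice):

```python
import numpy as np

def jittered_cholesky(A, eps=1e-6):
    """Cholesky of A + eps*I; gradients refer to the stabilized matrix."""
    return np.linalg.cholesky(A + eps * np.eye(A.shape[-1]))
```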

LU Factorization

LU factorization writes

$$PA = LU,$$

where $P$ is a permutation matrix, $L$ is unit lower triangular, and $U$ is upper triangular.

Ignoring pivot changes, the differential satisfies

$$P\,dA = dL\,U + L\,dU.$$

Multiplying by $L^{-1}$ on the left and $U^{-1}$ on the right gives

$$L^{-1}P\,dA\,U^{-1} = L^{-1}dL + dU\,U^{-1}.$$

The left term $L^{-1}dL$ is strictly lower triangular because $L$ has unit diagonal. The right term $dU\,U^{-1}$ is upper triangular. Thus the two parts can be separated by triangular projection.

LU is less smooth than Cholesky because pivoting is discrete. If the pivot pattern changes under small perturbations, the derivative of the factorization as implemented can change discontinuously.

For this reason, AD systems usually treat pivot choices as fixed during a local backward pass. Gradients with respect to pivot decisions are not defined in the ordinary real-valued sense.
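Under that fixed-pivot convention, the equations above yield a forward rule; a sketch with SciPy (`lu_jvp` is an illustrative helper, and SciPy's `lu` returns $A = P_m L U$, so the $P$ above corresponds to $P_m^T$):

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

def lu_jvp(A, dA):
    """Forward-mode LU with the pivot choice held fixed."""
    Pm, L, U = lu(A)  # SciPy convention: A = Pm @ L @ U
    # F = L^{-1} P dA U^{-1} with P = Pm.T, via two triangular solves
    F = solve_triangular(L, Pm.T @ dA, lower=True, unit_diagonal=True)
    F = solve_triangular(U, F.T, lower=False, trans='T').T
    dL = L @ np.tril(F, -1)   # strictly lower part: L^{-1} dL
    dU = np.triu(F) @ U       # upper part, diagonal included: dU U^{-1}
    return L, U, dL, dU
```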

QR Factorization

QR factorization writes

$$A = QR,$$

where $Q$ has orthonormal columns and $R$ is upper triangular.

For reduced QR:

$$A \in \mathbb{R}^{m \times n}, \quad m \ge n, \qquad Q \in \mathbb{R}^{m \times n}, \quad R \in \mathbb{R}^{n \times n}.$$

The constraints are

$$Q^TQ = I, \qquad R \text{ upper triangular}.$$

Differentiate:

$$dA = dQ\,R + Q\,dR.$$

The orthogonality constraint gives

$$dQ^T Q + Q^T dQ = 0.$$

Thus $Q^T dQ$ is skew-symmetric.

QR derivatives must preserve this skew-symmetric structure. They must also respect the convention used for $R$, commonly positive diagonal entries. If a diagonal entry of $R$ crosses zero, the convention can introduce a discontinuity.
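For square invertible $A$, these constraints pin down the differential; a sketch of the resulting forward rule (`qr_jvp` is an illustrative name, and the rectangular case $m > n$ needs an extra $(I - QQ^T)\,dA\,R^{-1}$ term in $dQ$):

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def qr_jvp(A, dA):
    """Forward-mode QR for square invertible A: returns (Q, R, dQ, dR)."""
    Q, R = qr(A)
    M = Q.T @ dA
    C = solve_triangular(R, M.T, lower=False, trans='T').T  # C = Q^T dA R^{-1}
    X = np.tril(C, -1)
    QtdQ = X - X.T            # skew-symmetric by construction
    dQ = Q @ QtdQ
    dR = M - QtdQ @ R         # upper triangular by construction
    return Q, R, dQ, dR
```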

QR and Least Squares

QR often appears inside least-squares solvers. Suppose

$$\min_x \|Ax - b\|_2^2.$$

One can solve it by QR, but differentiating the QR factors themselves may not be the cleanest method. It is often better to differentiate the optimality condition:

$$A^T A x = A^T b,$$

or a numerically stable solve formulation.

This is a recurring AD design rule: if a factorization is only an implementation detail of a higher-level operation, differentiate the higher-level operation when possible.
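For instance, differentiating the normal equations gives a forward rule for the solution map without touching the QR factors at all (a sketch; `lstsq_jvp` is an illustrative name, and the normal-equations solve at the end assumes $A$ is well conditioned):

```python
import numpy as np

def lstsq_jvp(A, b, dA, db):
    """Forward-mode least squares via the differentiated normal equations."""
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    r = b - A @ x  # residual
    # Differentiating A^T A x = A^T b gives:
    # A^T A dx = dA^T r + A^T db - A^T dA x
    rhs = dA.T @ r + A.T @ db - A.T @ (dA @ x)
    dx = np.linalg.solve(A.T @ A, rhs)
    return x, dx
```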

Eigenvalue Decomposition

For a symmetric matrix,

$$A = Q\Lambda Q^T,$$

where $Q$ is orthogonal and $\Lambda$ is diagonal.

The differential of the eigenvalues is comparatively simple:

$$d\lambda_i = q_i^T\, dA\, q_i.$$

So for a scalar loss $L$ depending only on eigenvalues,

$$\bar{A} = Q\,\operatorname{diag}(\bar{\lambda})\,Q^T.$$

The eigenvectors are more delicate. For distinct eigenvalues,

$$dq_i = \sum_{j \ne i} q_j\, \frac{q_j^T\, dA\, q_i}{\lambda_i - \lambda_j}.$$

The denominator shows the problem. If two eigenvalues are equal or nearly equal, the derivative of individual eigenvectors becomes undefined or ill-conditioned.

The invariant object is the eigenspace, not the arbitrary basis chosen inside that eigenspace.
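A direct translation of the two formulas above, valid only when the eigenvalues are distinct (`symeig_jvp` is an illustrative name):

```python
import numpy as np

def symeig_jvp(A, dA):
    """Forward-mode symmetric eigendecomposition; assumes distinct eigenvalues."""
    lam, Q = np.linalg.eigh(A)
    M = Q.T @ dA @ Q
    dlam = np.diag(M).copy()
    diff = lam[None, :] - lam[:, None]  # diff[j, i] = lam_i - lam_j
    np.fill_diagonal(diff, 1.0)         # placeholder: avoids divide-by-zero
    F = 1.0 / diff
    np.fill_diagonal(F, 0.0)            # drop the i = j term of the sum
    dQ = Q @ (F * M)
    return lam, Q, dlam, dQ
```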

Spectral Degeneracy

Spectral factorizations have unavoidable degeneracies.

For eigenvectors, $q_i$ and $-q_i$ represent the same eigendirection. The sign choice is arbitrary.

For repeated eigenvalues, any orthonormal basis of the repeated eigenspace is valid. The individual eigenvectors are not unique.

For SVD, singular vectors have similar sign and subspace ambiguities.

This means that functions depending on individual eigenvectors or singular vectors may have unstable gradients even if the forward computation appears well behaved.

Stable code should depend on invariant quantities when possible, such as $QQ^T$ for a subspace projection, or singular values rather than singular vectors.
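For example, an objective on the top-$k$ eigenspace can be phrased through the projector, which is invariant to sign flips and to rotations of the basis within the subspace (a sketch; `A`, `k`, and `P_target` are assumed inputs):

```python
import numpy as np

def subspace_loss(A, k, P_target):
    """Loss on the top-k eigenspace via its projector, not its basis."""
    _, Q = np.linalg.eigh(A)
    Qk = Q[:, -k:]  # any orthonormal basis of the eigenspace works
    P = Qk @ Qk.T   # the projector is basis- and sign-invariant
    return np.sum((P - P_target) ** 2)
```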

Singular Value Decomposition

The singular value decomposition writes

$$A = U\Sigma V^T.$$

For $A \in \mathbb{R}^{m \times n}$, the singular values are nonnegative, and the columns of $U$ and $V$ are singular vectors.

If a scalar loss depends only on singular values, the gradient has a simple form when singular values are distinct:

$$\bar{A} = U\,\operatorname{diag}(\bar{\sigma})\,V^T.$$
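A sketch of that adjoint, where `sbar` is the incoming gradient with respect to the singular values (`singular_values_vjp` is an illustrative name):

```python
import numpy as np

def singular_values_vjp(A, sbar):
    """Adjoint of A -> singular values; this simple form assumes distinct values."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(sbar) @ Vt
```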

When the loss depends on singular vectors, the backward rule contains terms involving differences of squared singular values:

$$\frac{1}{\sigma_i^2 - \sigma_j^2}.$$

Thus repeated or close singular values produce unstable gradients.

SVD differentiation is often mathematically well defined but numerically fragile in AD. A system should document its behavior near repeated singular values.

Polar Decomposition

The polar decomposition writes

$$A = UP,$$

where $U$ has orthonormal columns and $P$ is symmetric positive semidefinite.

It is useful in mechanics, geometry processing, and optimization on matrix manifolds.

Its differentiation has issues similar to SVD because it depends on spectral structure. If AA is rank deficient or nearly rank deficient, gradients can become unstable.
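One common route computes the polar factors from an SVD, which makes the spectral dependence explicit (a sketch):

```python
import numpy as np

def polar(A):
    """Polar decomposition A = U P via the SVD A = W diag(s) V^T."""
    W, s, Vt = np.linalg.svd(A, full_matrices=False)
    U = W @ Vt                    # orthonormal factor
    P = Vt.T @ (s[:, None] * Vt)  # symmetric positive semidefinite factor
    return U, P
```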

Factoring Batched Matrices

Modern tensor systems usually support batched factorizations.

For example, an array $A$ of shape $(B, n, n)$ may represent $B$ independent matrices. Cholesky then returns $L$ of shape $(B, n, n)$.

The derivative rule applies independently to each batch element. The batch axes are structural, not algebraic matrix axes.

A correct AD implementation must distinguish:

| Axis Type | Meaning |
|---|---|
| Batch axis | Independent repeated problem |
| Matrix row axis | Linear algebra dimension |
| Matrix column axis | Linear algebra dimension |

Accidentally reducing over batch axes or treating them as matrix axes gives wrong gradients.
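A small NumPy illustration of these semantics (the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
B, n = 4, 3
X = rng.standard_normal((B, n, n))
A = X @ np.swapaxes(X, -1, -2) + n * np.eye(n)  # B independent SPD matrices
L = np.linalg.cholesky(A)  # shape (B, n, n): one factor per batch element
# Axis 0 is purely structural; only the last two axes are matrix axes.
```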

Pivoting and Discrete Choices

Many numerical factorizations use discrete choices:

| Factorization | Discrete Choice |
|---|---|
| LU | Pivot rows |
| QR with pivoting | Pivot columns |
| SVD | Singular value ordering and sign conventions |
| Eigendecomposition | Eigenvalue ordering and eigenvector signs |

Ordinary AD handles real-valued smooth maps. Discrete choices do not have ordinary derivatives.

Most systems follow a local convention:

  1. Run the forward algorithm.
  2. Record the chosen pivots or ordering.
  3. Treat those choices as fixed during the backward pass.

This gives a useful local derivative away from discontinuities. It does not differentiate the decision rule that selected the pivot.
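The discontinuity is easy to exhibit: an arbitrarily small perturbation can flip a partial-pivoting comparison (a sketch using SciPy's LU):

```python
import numpy as np
from scipy.linalg import lu

for eps in (-1e-9, 1e-9):
    A = np.array([[1.0,       2.0],
                  [1.0 + eps, 3.0]])
    Pm, L, U = lu(A)  # partial pivoting compares |1| against |1 + eps|
    print(f"eps={eps:+.0e}\n{Pm}")  # the permutation flips with the sign of eps
```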

Custom Rules vs Tracing Algorithms

Tracing through a factorization algorithm can produce a derivative, but it often has poor properties.

| Approach | Advantage | Problem |
|---|---|---|
| Trace scalar algorithm | Simple for AD implementer | Large graph, poor stability, exposes implementation detail |
| Custom factorization rule | Compact and stable | Requires mathematical derivation |
| Differentiate higher-level solve | Best abstraction when factorization is internal | Needs suitable primitive API |

For high-performance AD, factorizations should generally be primitive operations with custom derivative rules.

Implementation Contract

A factorization primitive should specify:

- input domain
- output convention
- rank assumptions
- pivot behavior
- batch semantics
- gradient validity conditions
- behavior at degeneracy

For example, an SVD primitive should say whether it returns full or reduced matrices, how singular values are ordered, how signs are chosen, and what happens when singular values are repeated.

A Cholesky primitive should say whether the input must be symmetric, whether the upper or lower factor is returned, and whether the backward pass symmetrizes the gradient.
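As an illustration, such a contract can live directly in the primitive's documentation; a hypothetical sketch, not any particular library's API:

```python
import numpy as np

def svd(a):
    """Reduced SVD: a = U @ diag(s) @ Vt.

    Contract (illustrative):
      - Domain: real arrays; leading axes are batch axes.
      - Convention: singular values sorted in descending order; signs of
        the singular vectors are unspecified.
      - Gradient validity: derivative rules assume distinct singular values.
      - Degeneracy: the backward pass is ill-conditioned near repeated values.
    """
    return np.linalg.svd(a, full_matrices=False)
```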

Practical Guidance

For users:

| Use Case | Prefer |
|---|---|
| Symmetric positive definite solve | Cholesky solve |
| General linear solve | LU or QR solve |
| Least squares | QR or dedicated least-squares primitive |
| Log determinant of SPD matrix | Cholesky-based logdet |
| Spectral regularization | Singular values or eigenvalues only |
| Subspace objectives | Projectors, not arbitrary bases |

For AD implementers:

| Primitive | Main Concern |
|---|---|
| Cholesky | SPD domain and triangular adjoint |
| LU | Pivot discontinuities |
| QR | Orthogonality and triangular constraints |
| Eigendecomposition | Repeated eigenvalues |
| SVD | Repeated singular values and sign ambiguity |
| Polar decomposition | Rank deficiency |

Summary

Differentiating factorizations means differentiating constrained decompositions. The right starting point is usually the defining equation, together with the constraints on the factors. Cholesky and triangular solves have stable, useful derivative rules on their natural domains. QR, eigendecomposition, and SVD require more care because orthogonality, ordering, signs, and degeneracies affect the derivative.

A robust AD system treats factorizations as mathematical primitives with explicit contracts. It records the structural choices made in the forward pass and uses custom backward rules that preserve triangular, orthogonal, symmetric, and spectral structure.