
Differentiating Factorizations

Matrix factorizations rewrite a matrix into structured factors. They are used because the factors make later computations cheaper, more stable, or easier to interpret. In automatic differentiation, factorizations are delicate primitives because their outputs satisfy constraints.

A generic matrix has free entries. A factorization output usually does not. Orthogonal matrices, triangular matrices, diagonal singular-value matrices, and permutation matrices all live in constrained spaces. The derivative must respect those constraints.

Factorization as an Implicit Equation

Most factorizations are best differentiated through the equation they satisfy.

For example, Cholesky factorization computes

$$A = LL^T,$$

where $A$ is symmetric positive definite and $L$ is lower triangular with positive diagonal.

The differential is

$$dA = dL\,L^T + L\,dL^T.$$

This equation relates $dA$ and $dL$. Because $dL$ is lower triangular, the equation has a unique solution under the Cholesky constraints.

The same pattern applies to other factorizations:

| Factorization | Defining Equation | Constraints |
|---|---|---|
| LU | $PA = LU$ | $L$ unit lower triangular, $U$ upper triangular |
| QR | $A = QR$ | $Q^TQ = I$, $R$ upper triangular |
| Cholesky | $A = LL^T$ | $L$ lower triangular, positive diagonal |
| Eigendecomposition | $A = Q\Lambda Q^T$ | $Q^TQ = I$, $\Lambda$ diagonal |
| SVD | $A = U\Sigma V^T$ | $U,V$ orthogonal, $\Sigma$ diagonal nonnegative |

The factorization algorithm may use pivots, Householder reflectors, blocked kernels, or iterative methods. The derivative rule should usually be derived from the defining equation, not from the low-level implementation.

Cholesky Factorization

Cholesky is one of the cleanest factorizations for AD because it has a stable domain: symmetric positive definite matrices.

Let

$$A = LL^T, \qquad L \text{ lower triangular}.$$

The forward differential is

$$dA = dL\,L^T + L\,dL^T.$$

To solve for $dL$, multiply on the left by $L^{-1}$ and on the right by $L^{-T}$:

$$L^{-1}\,dA\,L^{-T} = L^{-1}dL + dL^T L^{-T}.$$

Define

$$S = L^{-1}\,dA\,L^{-T}, \qquad K = L^{-1}dL.$$

Because $dL$ is lower triangular and $L^{-1}$ is lower triangular, $K$ is lower triangular.

Then

$$S = K + K^T.$$

So $K$ is recovered by taking the lower triangular part of $S$, with half of the diagonal:

$$K_{ij} = \begin{cases} S_{ij}, & i > j, \\ \tfrac{1}{2}S_{ii}, & i = j, \\ 0, & i < j. \end{cases}$$

Then

$$dL = LK.$$

This gives a forward-mode rule for differentiating Cholesky.

Reverse mode derives the adjoint rule by transposing this linear map. Production systems usually implement a specialized backward pass using triangular solves and symmetric projections.
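As a concrete sketch of the forward rule, assuming NumPy/SciPy and a symmetric perturbation `dA` (the helper name `cholesky_jvp` is illustrative):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def cholesky_jvp(A, dA):
    """Forward-mode Cholesky: given SPD A and symmetric dA, return (L, dL)."""
    L = cholesky(A, lower=True)
    # S = L^{-1} dA L^{-T}, computed with two triangular solves
    S = solve_triangular(L, dA, lower=True)
    S = solve_triangular(L, S.T, lower=True).T
    # K: lower-triangular part of S, with half the diagonal
    K = np.tril(S)
    K[np.diag_indices_from(K)] *= 0.5
    return L, L @ K
```

A finite-difference comparison against `cholesky(A + h * dA)` for small `h` is a cheap sanity check for rules like this.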

Cholesky Domain

Cholesky requires AA to be symmetric positive definite. At the boundary of this domain, the derivative may become unstable or undefined.

If the smallest eigenvalue of $A$ approaches zero, then $L^{-1}$ becomes large. The backward pass can amplify adjoints sharply.

In practice, users often stabilize Cholesky with diagonal jitter:

$$A_\epsilon = A + \epsilon I.$$

This changes the differentiated function. The gradient is then the gradient of the stabilized factorization, not the original one.
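A minimal sketch of this pattern (`jittered_cholesky` is an illustrative name, and the default `eps` is an arbitrary choice):

```python
import numpy as np

def jittered_cholesky(A, eps=1e-6):
    """Cholesky of A + eps*I; gradients refer to the stabilized matrix."""
    return np.linalg.cholesky(A + eps * np.eye(A.shape[-1]))
```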

LU Factorization

LU factorization writes

$$PA = LU,$$

where $P$ is a permutation matrix, $L$ is unit lower triangular, and $U$ is upper triangular.

Ignoring pivot changes, the differential satisfies

$$P\,dA = dL\,U + L\,dU.$$

Multiplying by $L^{-1}$ on the left and $U^{-1}$ on the right gives

$$L^{-1}P\,dA\,U^{-1} = L^{-1}dL + dU\,U^{-1}.$$

The left term $L^{-1}dL$ is strictly lower triangular because $L$ has unit diagonal. The right term $dU\,U^{-1}$ is upper triangular. Thus the two parts can be separated by triangular projection.

LU is less smooth than Cholesky because pivoting is discrete. If the pivot pattern changes under small perturbations, the derivative of the factorization as implemented can change discontinuously.

For this reason, AD systems usually treat pivot choices as fixed during a local backward pass. Gradients with respect to pivot decisions are not defined in the ordinary real-valued sense.
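Under that fixed-pivot convention, the equations above yield a forward rule; a sketch with SciPy (`lu_jvp` is an illustrative helper, and SciPy's `lu` returns $A = P_m L U$, so the $P$ above corresponds to $P_m^T$):

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

def lu_jvp(A, dA):
    """Forward-mode LU with the pivot choice held fixed."""
    Pm, L, U = lu(A)  # SciPy convention: A = Pm @ L @ U
    # F = L^{-1} P dA U^{-1} with P = Pm.T, via two triangular solves
    F = solve_triangular(L, Pm.T @ dA, lower=True, unit_diagonal=True)
    F = solve_triangular(U, F.T, lower=False, trans='T').T
    dL = L @ np.tril(F, -1)   # strictly lower part: L^{-1} dL
    dU = np.triu(F) @ U       # upper part, diagonal included: dU U^{-1}
    return L, U, dL, dU
```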

QR Factorization

QR factorization writes

$$A = QR,$$

where $Q$ has orthonormal columns and $R$ is upper triangular.

For reduced QR:

$$A \in \mathbb{R}^{m \times n}, \quad m \ge n, \qquad Q \in \mathbb{R}^{m \times n}, \quad R \in \mathbb{R}^{n \times n}.$$

The constraints are

$$Q^TQ = I, \qquad R \text{ upper triangular}.$$

Differentiate:

$$dA = dQ\,R + Q\,dR.$$

The orthogonality constraint gives

$$dQ^T Q + Q^T dQ = 0.$$

Thus $Q^T dQ$ is skew-symmetric.

QR derivatives must preserve this skew-symmetric structure. They must also respect the convention used for $R$, commonly positive diagonal entries. If a diagonal entry of $R$ crosses zero, the convention can introduce a discontinuity.
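For square invertible $A$, these constraints pin down the differential; a sketch of the resulting forward rule (`qr_jvp` is an illustrative name, and the rectangular case $m > n$ needs an extra $(I - QQ^T)\,dA\,R^{-1}$ term in $dQ$):

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def qr_jvp(A, dA):
    """Forward-mode QR for square invertible A: returns (Q, R, dQ, dR)."""
    Q, R = qr(A)
    M = Q.T @ dA
    C = solve_triangular(R, M.T, lower=False, trans='T').T  # C = Q^T dA R^{-1}
    X = np.tril(C, -1)
    QtdQ = X - X.T            # skew-symmetric by construction
    dQ = Q @ QtdQ
    dR = M - QtdQ @ R         # upper triangular by construction
    return Q, R, dQ, dR
```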

QR and Least Squares

QR often appears inside least-squares solvers. Suppose

$$\min_x \|Ax - b\|_2^2.$$

One can solve it by QR, but differentiating the QR factors themselves may not be the cleanest method. It is often better to differentiate the optimality condition:

$$A^T A x = A^T b,$$

or a numerically stable solve formulation.

This is a recurring AD design rule: if a factorization is only an implementation detail of a higher-level operation, differentiate the higher-level operation when possible.
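For instance, differentiating the normal equations gives a forward rule for the solution map without touching the QR factors at all (a sketch; `lstsq_jvp` is an illustrative name, and the normal-equations solve at the end assumes $A$ is well conditioned):

```python
import numpy as np

def lstsq_jvp(A, b, dA, db):
    """Forward-mode least squares via the differentiated normal equations."""
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    r = b - A @ x  # residual
    # Differentiating A^T A x = A^T b gives:
    # A^T A dx = dA^T r + A^T db - A^T dA x
    rhs = dA.T @ r + A.T @ db - A.T @ (dA @ x)
    dx = np.linalg.solve(A.T @ A, rhs)
    return x, dx
```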

Eigenvalue Decomposition

For a symmetric matrix,

$$A = Q\Lambda Q^T,$$

where $Q$ is orthogonal and $\Lambda$ is diagonal.

The differential of the eigenvalues is comparatively simple:

$$d\lambda_i = q_i^T\, dA\, q_i.$$

So for a scalar loss $L$ depending only on eigenvalues,

$$\bar{A} = Q\,\operatorname{diag}(\bar{\lambda})\,Q^T.$$

The eigenvectors are more delicate. For distinct eigenvalues,

$$dq_i = \sum_{j \ne i} q_j\, \frac{q_j^T\, dA\, q_i}{\lambda_i - \lambda_j}.$$

The denominator shows the problem. If two eigenvalues are equal or nearly equal, the derivative of individual eigenvectors becomes undefined or ill-conditioned.

The invariant object is the eigenspace, not the arbitrary basis chosen inside that eigenspace.
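A direct translation of the two formulas above, valid only when the eigenvalues are distinct (`symeig_jvp` is an illustrative name):

```python
import numpy as np

def symeig_jvp(A, dA):
    """Forward-mode symmetric eigendecomposition; assumes distinct eigenvalues."""
    lam, Q = np.linalg.eigh(A)
    M = Q.T @ dA @ Q
    dlam = np.diag(M).copy()
    diff = lam[None, :] - lam[:, None]  # diff[j, i] = lam_i - lam_j
    np.fill_diagonal(diff, 1.0)         # placeholder: avoids divide-by-zero
    F = 1.0 / diff
    np.fill_diagonal(F, 0.0)            # drop the i = j term of the sum
    dQ = Q @ (F * M)
    return lam, Q, dlam, dQ
```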

Spectral Degeneracy

Spectral factorizations have unavoidable degeneracies.

For eigenvectors, $q_i$ and $-q_i$ represent the same eigendirection. The sign choice is arbitrary.

For repeated eigenvalues, any orthonormal basis of the repeated eigenspace is valid. The individual eigenvectors are not unique.

For SVD, singular vectors have similar sign and subspace ambiguities.

This means that functions depending on individual eigenvectors or singular vectors may have unstable gradients even if the forward computation appears well behaved.

Stable code should depend on invariant quantities when possible, such as $QQ^T$ for a subspace projection, or singular values rather than singular vectors.
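For example, an objective on the top-$k$ eigenspace can be phrased through the projector, which is invariant to sign flips and to rotations of the basis within the subspace (a sketch; `A`, `k`, and `P_target` are assumed inputs):

```python
import numpy as np

def subspace_loss(A, k, P_target):
    """Loss on the top-k eigenspace via its projector, not its basis."""
    _, Q = np.linalg.eigh(A)
    Qk = Q[:, -k:]  # any orthonormal basis of the eigenspace works
    P = Qk @ Qk.T   # the projector is basis- and sign-invariant
    return np.sum((P - P_target) ** 2)
```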

Singular Value Decomposition

The singular value decomposition writes

$$A = U\Sigma V^T.$$

For $A \in \mathbb{R}^{m \times n}$, the singular values are nonnegative, and the columns of $U$ and $V$ are singular vectors.

If a scalar loss depends only on singular values, the gradient has a simple form when singular values are distinct:

$$\bar{A} = U\,\operatorname{diag}(\bar{\sigma})\,V^T.$$
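A sketch of that adjoint, where `sbar` is the incoming gradient with respect to the singular values (`singular_values_vjp` is an illustrative name):

```python
import numpy as np

def singular_values_vjp(A, sbar):
    """Adjoint of A -> singular values; this simple form assumes distinct values."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(sbar) @ Vt
```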

When the loss depends on singular vectors, the backward rule contains terms involving differences of squared singular values:

$$\frac{1}{\sigma_i^2 - \sigma_j^2}.$$

Thus repeated or close singular values produce unstable gradients.

SVD differentiation is often mathematically well defined but numerically fragile in AD. A system should document its behavior near repeated singular values.

Polar Decomposition

The polar decomposition writes

$$A = UP,$$

where $U$ has orthonormal columns and $P$ is symmetric positive semidefinite.

It is useful in mechanics, geometry processing, and optimization on matrix manifolds.

Its differentiation has issues similar to SVD because it depends on spectral structure. If AA is rank deficient or nearly rank deficient, gradients can become unstable.
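One common route computes the polar factors from an SVD, which makes the spectral dependence explicit (a sketch):

```python
import numpy as np

def polar(A):
    """Polar decomposition A = U P via the SVD A = W diag(s) V^T."""
    W, s, Vt = np.linalg.svd(A, full_matrices=False)
    U = W @ Vt                    # orthonormal factor
    P = Vt.T @ (s[:, None] * Vt)  # symmetric positive semidefinite factor
    return U, P
```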

Factoring Batched Matrices

Modern tensor systems usually support batched factorizations.

For example, an array $A$ of shape $(B, n, n)$ may represent $B$ independent matrices. Cholesky then returns $L$ of shape $(B, n, n)$.

The derivative rule applies independently to each batch element. The batch axes are structural, not algebraic matrix axes.

A correct AD implementation must distinguish:

| Axis Type | Meaning |
|---|---|
| Batch axis | Independent repeated problem |
| Matrix row axis | Linear algebra dimension |
| Matrix column axis | Linear algebra dimension |

Accidentally reducing over batch axes or treating them as matrix axes gives wrong gradients.
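A small NumPy illustration of these semantics (the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
B, n = 4, 3
X = rng.standard_normal((B, n, n))
A = X @ np.swapaxes(X, -1, -2) + n * np.eye(n)  # B independent SPD matrices
L = np.linalg.cholesky(A)  # shape (B, n, n): one factor per batch element
# Axis 0 is purely structural; only the last two axes are matrix axes.
```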

Pivoting and Discrete Choices

Many numerical factorizations use discrete choices:

| Factorization | Discrete Choice |
|---|---|
| LU | Pivot rows |
| QR with pivoting | Pivot columns |
| SVD | Singular value ordering and sign conventions |
| Eigendecomposition | Eigenvalue ordering and eigenvector signs |

Ordinary AD handles real-valued smooth maps. Discrete choices do not have ordinary derivatives.

Most systems follow a local convention:

  1. Run the forward algorithm.
  2. Record the chosen pivots or ordering.
  3. Treat those choices as fixed during the backward pass.

This gives a useful local derivative away from discontinuities. It does not differentiate the decision rule that selected the pivot.
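The discontinuity is easy to exhibit: an arbitrarily small perturbation can flip a partial-pivoting comparison (a sketch using SciPy's LU):

```python
import numpy as np
from scipy.linalg import lu

for eps in (-1e-9, 1e-9):
    A = np.array([[1.0,       2.0],
                  [1.0 + eps, 3.0]])
    Pm, L, U = lu(A)  # partial pivoting compares |1| against |1 + eps|
    print(f"eps={eps:+.0e}\n{Pm}")  # the permutation flips with the sign of eps
```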

Custom Rules vs Tracing Algorithms

Tracing through a factorization algorithm can produce a derivative, but it often has poor properties.

| Approach | Advantage | Problem |
|---|---|---|
| Trace scalar algorithm | Simple for AD implementer | Large graph, poor stability, exposes implementation detail |
| Custom factorization rule | Compact and stable | Requires mathematical derivation |
| Differentiate higher-level solve | Best abstraction when factorization is internal | Needs suitable primitive API |

For high-performance AD, factorizations should generally be primitive operations with custom derivative rules.

Implementation Contract

A factorization primitive should specify:

- input domain
- output convention
- rank assumptions
- pivot behavior
- batch semantics
- gradient validity conditions
- behavior at degeneracy

For example, an SVD primitive should say whether it returns full or reduced matrices, how singular values are ordered, how signs are chosen, and what happens when singular values are repeated.

A Cholesky primitive should say whether the input must be symmetric, whether the upper or lower factor is returned, and whether the backward pass symmetrizes the gradient.
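As an illustration, such a contract can live directly in the primitive's documentation; a hypothetical sketch, not any particular library's API:

```python
import numpy as np

def svd(a):
    """Reduced SVD: a = U @ diag(s) @ Vt.

    Contract (illustrative):
      - Domain: real arrays; leading axes are batch axes.
      - Convention: singular values sorted in descending order; signs of
        the singular vectors are unspecified.
      - Gradient validity: derivative rules assume distinct singular values.
      - Degeneracy: the backward pass is ill-conditioned near repeated values.
    """
    return np.linalg.svd(a, full_matrices=False)
```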

Practical Guidance

For users:

| Use Case | Prefer |
|---|---|
| Symmetric positive definite solve | Cholesky solve |
| General linear solve | LU or QR solve |
| Least squares | QR or dedicated least-squares primitive |
| Log determinant of SPD matrix | Cholesky-based logdet |
| Spectral regularization | Singular values or eigenvalues only |
| Subspace objectives | Projectors, not arbitrary bases |

For AD implementers:

| Primitive | Main Concern |
|---|---|
| Cholesky | SPD domain and triangular adjoint |
| LU | Pivot discontinuities |
| QR | Orthogonality and triangular constraints |
| Eigendecomposition | Repeated eigenvalues |
| SVD | Repeated singular values and sign ambiguity |
| Polar decomposition | Rank deficiency |

Summary

Differentiating factorizations means differentiating constrained decompositions. The right starting point is usually the defining equation, together with the constraints on the factors. Cholesky and triangular solves have stable, useful derivative rules on their natural domains. QR, eigendecomposition, and SVD require more care because orthogonality, ordering, signs, and degeneracies affect the derivative.

A robust AD system treats factorizations as mathematical primitives with explicit contracts. It records the structural choices made in the forward pass and uses custom backward rules that preserve triangular, orthogonal, symmetric, and spectral structure.