# Differentiating Factorizations

Matrix factorizations rewrite a matrix into structured factors. They are used because the factors make later computations cheaper, more stable, or easier to interpret. In automatic differentiation, factorizations are delicate primitives because their outputs satisfy constraints.

A generic matrix has free entries. A factorization output usually does not. Orthogonal matrices, triangular matrices, diagonal singular-value matrices, and permutation matrices all live in constrained spaces. The derivative must respect those constraints.

## Factorization as an Implicit Equation

Most factorizations are best differentiated through the equation they satisfy.

For example, Cholesky factorization computes

$$
A = LL^T,
$$

where $A$ is symmetric positive definite and $L$ is lower triangular with positive diagonal.

The differential is

$$
dA = dL\,L^T + L\,dL^T.
$$

This equation relates $dA$ and $dL$. Because $dL$ is lower triangular, the equation has a unique solution under the Cholesky constraints.

The same pattern applies to other factorizations:

| Factorization | Defining Equation | Constraints |
|---|---|---|
| LU | $PA = LU$ | $L$ unit lower triangular, $U$ upper triangular |
| QR | $A = QR$ | $Q^TQ = I$, $R$ upper triangular |
| Cholesky | $A = LL^T$ | $L$ lower triangular, positive diagonal |
| Eigendecomposition | $A = Q\Lambda Q^T$ | $Q^TQ = I$, $\Lambda$ diagonal |
| SVD | $A = U\Sigma V^T$ | $U,V$ orthogonal, $\Sigma$ diagonal nonnegative |

The factorization algorithm may use pivots, Householder reflectors, blocked kernels, or iterative methods. The derivative rule should usually be derived from the defining equation, not from the low-level implementation.

## Cholesky Factorization

Cholesky is one of the cleanest factorizations for AD because its domain, the set of symmetric positive definite matrices, is open, and the factorization is unique and smooth on that domain.

Let

$$
A = LL^T,
\qquad
L \text{ lower triangular}.
$$

The forward differential is

$$
dA = dL\,L^T + L\,dL^T.
$$

To solve for $dL$, multiply on the left by $L^{-1}$ and on the right by $L^{-T}$:

$$
L^{-1} dA L^{-T} =
L^{-1}dL + dL^T L^{-T}.
$$

Define

$$
S = L^{-1} dA L^{-T},
$$

$$
K = L^{-1}dL.
$$

Because $dL$ is lower triangular and $L^{-1}$ is lower triangular, $K$ is lower triangular.

Then

$$
S = K + K^T.
$$

So $K$ is recovered by taking the strictly lower triangular part of $S$ together with half of its diagonal:

$$
K_{ij} =
\begin{cases}
S_{ij}, & i > j, \\
\frac{1}{2}S_{ii}, & i=j, \\
0, & i<j.
\end{cases}
$$

Then

$$
dL = LK.
$$

This gives a forward-mode rule for differentiating Cholesky.
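
Below is a minimal NumPy sketch of this forward-mode rule; `phi` and `cholesky_jvp` are hypothetical helper names, and the triangular solves are written with generic `np.linalg.solve` for brevity, where production code would use specialized triangular solves.

```python
import numpy as np

def phi(S):
    """Strictly lower triangular part of S plus half of its diagonal."""
    return np.tril(S, -1) + 0.5 * np.diag(np.diag(S))

def cholesky_jvp(A, dA):
    """Forward-mode Cholesky: given symmetric dA, return (L, dL)."""
    L = np.linalg.cholesky(A)
    S = np.linalg.solve(L, np.linalg.solve(L, dA).T).T  # S = L^{-1} dA L^{-T}
    return L, L @ phi(S)                                # dL = L K with K = phi(S)

# Finite-difference check on a random SPD matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
A = X @ X.T + 4 * np.eye(4)
dA = rng.standard_normal((4, 4)); dA = dA + dA.T
L, dL = cholesky_jvp(A, dA)
eps = 1e-6
dL_fd = (np.linalg.cholesky(A + eps * dA) - L) / eps
print(np.max(np.abs(dL - dL_fd)))  # small, on the order of eps
```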

Reverse mode derives the adjoint rule by transposing this linear map. Production systems usually implement a specialized backward pass using triangular solves and symmetric projections.
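
Transposing the forward map gives a standard form of the adjoint, $\bar{A} = \tfrac{1}{2}(W + W^T)$ with $W = L^{-T}\,\Phi(L^T\bar{L})\,L^{-1}$, where $\Phi$ is the masking map above. Continuing the previous sketch (reusing `phi`, `L`, `dL`, and `dA`; `cholesky_vjp` is a hypothetical name):

```python
def cholesky_vjp(L, Lbar):
    """Adjoint of Cholesky: map the factor adjoint Lbar to Abar."""
    M = phi(L.T @ Lbar)                                    # mask L^T Lbar
    W = np.linalg.solve(L.T, np.linalg.solve(L.T, M.T).T)  # L^{-T} M L^{-1}
    return 0.5 * (W + W.T)                                 # symmetrized output

# Adjoint identity: <Lbar, dL> == <Abar, dA> for symmetric dA.
Lbar = np.tril(rng.standard_normal((4, 4)))
Abar = cholesky_vjp(L, Lbar)
print(np.sum(Lbar * dL), np.sum(Abar * dA))  # the two numbers should agree
```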

## Cholesky Domain

Cholesky requires $A$ to be symmetric positive definite. At the boundary of this domain, the derivative may become unstable or undefined.

If the smallest eigenvalue of $A$ approaches zero, then $L^{-1}$ becomes large. The backward pass can amplify adjoints sharply.

In practice, users often stabilize Cholesky with diagonal jitter:

$$
A_\epsilon = A + \epsilon I.
$$

This changes the differentiated function. The gradient is then the gradient of the stabilized factorization, not the original one.
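
As a one-line illustration (with `stabilized_cholesky` a hypothetical wrapper), the differentiated function becomes $A \mapsto \mathrm{chol}(A + \epsilon I)$:

```python
import numpy as np

def stabilized_cholesky(A, eps=1e-6):
    # Gradients computed through this function are gradients of the
    # jittered factorization, not of cholesky(A) itself.
    return np.linalg.cholesky(A + eps * np.eye(A.shape[0]))
```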

## LU Factorization

LU factorization writes

$$
PA = LU,
$$

where $P$ is a permutation matrix, $L$ is unit lower triangular, and $U$ is upper triangular.

Ignoring pivot changes, the differential satisfies

$$
P\,dA = dL\,U + L\,dU.
$$

Multiplying by $L^{-1}$ on the left and $U^{-1}$ on the right gives

$$
L^{-1}P\,dA\,U^{-1} =
L^{-1}dL + dU\,U^{-1}.
$$

On the right-hand side, $L^{-1}dL$ is strictly lower triangular, because $L$ has unit diagonal and so $dL$ has zero diagonal, while $dU\,U^{-1}$ is upper triangular. The two parts can therefore be separated by triangular projection.

LU is less smooth than Cholesky because pivoting is discrete. If the pivot pattern changes under small perturbations, the derivative of the factorization as implemented can change discontinuously.

For this reason, AD systems usually treat pivot choices as fixed during a local backward pass. Gradients with respect to pivot decisions are not defined in the ordinary real-valued sense.
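
With the pivots held fixed, the separation above yields a concrete forward-mode rule. A sketch using SciPy, whose `lu` returns factors with $A = P_s L U$, so the $P$ in the equations above corresponds to $P_s^T$ (`lu_jvp` is a hypothetical name):

```python
import numpy as np
from scipy.linalg import lu

def lu_jvp(A, dA):
    """Forward-mode LU with the pivot pattern treated as fixed."""
    Ps, L, U = lu(A)                   # SciPy convention: A = Ps @ L @ U
    G = np.linalg.solve(L, Ps.T @ dA)  # L^{-1} P dA
    F = np.linalg.solve(U.T, G.T).T    # (L^{-1} P dA) U^{-1}
    dL = L @ np.tril(F, -1)            # strictly lower part -> dL
    dU = np.triu(F) @ U                # upper part (with diagonal) -> dU
    return L, U, dL, dU

# Finite-difference check; valid only while the pivot pattern is unchanged.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)); dA = rng.standard_normal((4, 4))
L, U, dL, dU = lu_jvp(A, dA)
eps = 1e-6
_, L2, U2 = lu(A + eps * dA)
print(np.max(np.abs((L2 - L) / eps - dL)), np.max(np.abs((U2 - U) / eps - dU)))
```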

## QR Factorization

QR factorization writes

$$
A = QR,
$$

where $Q$ has orthonormal columns and $R$ is upper triangular.

For reduced QR:

$$
A \in \mathbb{R}^{m \times n},
\qquad
m \ge n,
$$

$$
Q \in \mathbb{R}^{m \times n},
\qquad
R \in \mathbb{R}^{n \times n}.
$$

The constraints are

$$
Q^TQ = I,
\qquad
R \text{ upper triangular}.
$$

Differentiate:

$$
dA = dQ\,R + Q\,dR.
$$

The orthogonality constraint gives

$$
dQ^T Q + Q^T dQ = 0.
$$

Thus $Q^T dQ$ is skew-symmetric.

QR derivatives must preserve this skew-symmetric structure. They must also respect the convention used for $R$, commonly positive diagonal entries. If a diagonal entry of $R$ crosses zero, the convention can introduce a discontinuity.
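
These constraints pin down a forward-mode rule. Writing $F = Q^T\,dA\,R^{-1}$ and splitting it into a skew part $X = Q^T dQ$ and an upper triangular part $dR\,R^{-1}$ gives the sketch below (`qr_jvp` is a hypothetical name; the check assumes the factor signs do not flip under the perturbation):

```python
import numpy as np

def qr_jvp(A, dA):
    """Forward-mode reduced QR for m >= n."""
    Q, R = np.linalg.qr(A)
    F = np.linalg.solve(R.T, (Q.T @ dA).T).T  # F = Q^T dA R^{-1}
    X = np.tril(F, -1) - np.tril(F, -1).T     # skew part: X = Q^T dQ
    B = dA - Q @ (Q.T @ dA)                   # component outside range(Q)
    dQ = Q @ X + np.linalg.solve(R.T, B.T).T  # Q X + (I - Q Q^T) dA R^{-1}
    dR = (F - X) @ R                          # upper triangular by construction
    return Q, R, dQ, dR

# Finite-difference check.
rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3)); dA = rng.standard_normal((5, 3))
Q, R, dQ, dR = qr_jvp(A, dA)
eps = 1e-6
Q2, R2 = np.linalg.qr(A + eps * dA)
print(np.max(np.abs((Q2 - Q) / eps - dQ)), np.max(np.abs((R2 - R) / eps - dR)))
```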

## QR and Least Squares

QR often appears inside least-squares solvers. Suppose

$$
\min_x \|Ax-b\|_2^2.
$$

One can solve it by QR, but differentiating the QR factors themselves may not be the cleanest method. It is often better to differentiate the optimality condition:

$$
A^T A x = A^T b.
$$

or a numerically stable solve formulation.
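
A sketch of this approach: differentiate the normal equations implicitly instead of the QR factors (`lstsq_jvp` is a hypothetical name; a production version would avoid forming $A^TA$ explicitly):

```python
import numpy as np

def lstsq_jvp(A, b, dA, db):
    """Directional derivative of the least-squares solution via the
    differentiated normal equations: (A^T A) dx = dA^T r + A^T (db - dA x)."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    r = b - A @ x                    # residual at the solution
    dx = np.linalg.solve(A.T @ A, dA.T @ r + A.T @ (db - dA @ x))
    return x, dx

# Finite-difference check.
rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3)); b = rng.standard_normal(6)
dA = rng.standard_normal((6, 3)); db = rng.standard_normal(6)
x, dx = lstsq_jvp(A, b, dA, db)
eps = 1e-6
x2 = np.linalg.lstsq(A + eps * dA, b + eps * db, rcond=None)[0]
print(np.max(np.abs((x2 - x) / eps - dx)))
```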

This is a recurring AD design rule: if a factorization is only an implementation detail of a higher-level operation, differentiate the higher-level operation when possible.

## Eigenvalue Decomposition

For a symmetric matrix,

$$
A = Q\Lambda Q^T,
$$

where $Q$ is orthogonal and $\Lambda$ is diagonal.

The differential of the eigenvalues is comparatively simple:

$$
d\lambda_i = q_i^T dA q_i.
$$

So for a scalar loss that depends only on the eigenvalues, with $\bar{\lambda}$ the vector of eigenvalue adjoints,

$$
\bar{A} =
Q\,\operatorname{diag}(\bar{\lambda})\,Q^T.
$$
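
A small self-checking sketch: for the loss $\sum_i \lambda_i^2 = \|A\|_F^2$ on symmetric $A$, this rule must return $\bar{A} = 2A$.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5)); A = 0.5 * (A + A.T)
lam, Q = np.linalg.eigh(A)
lbar = 2 * lam                       # adjoint of sum(lam**2) w.r.t. lam
Abar = Q @ np.diag(lbar) @ Q.T       # Abar = Q diag(lbar) Q^T
print(np.max(np.abs(Abar - 2 * A)))  # ~1e-15: matches the gradient of ||A||_F^2
```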

The eigenvectors are more delicate. For distinct eigenvalues,

$$
dq_i =
\sum_{j \ne i}
q_j
\frac{q_j^T dA q_i}{\lambda_i - \lambda_j}.
$$

The denominator shows the problem. If two eigenvalues are equal or nearly equal, the derivative of individual eigenvectors becomes undefined or ill-conditioned.

The invariant object is the eigenspace, not the arbitrary basis chosen inside that eigenspace.
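
A sketch of the distinct-eigenvalue formula (`eigvec_jvp` is a hypothetical name); the $1/(\lambda_i - \lambda_j)$ factors are visible in the code and blow up as eigenvalues cluster:

```python
import numpy as np

def eigvec_jvp(A, dA):
    """Forward-mode eigenvectors of symmetric A, assuming distinct eigenvalues."""
    lam, Q = np.linalg.eigh(A)
    G = Q.T @ dA @ Q
    denom = lam[None, :] - lam[:, None]  # denom[j, i] = lam_i - lam_j
    np.fill_diagonal(denom, np.inf)      # zero out the j == i terms
    return lam, Q, Q @ (G / denom)

# Finite-difference check, aligning the arbitrary eigenvector signs.
rng = np.random.default_rng(5)
A = rng.standard_normal((5, 5)); A = 0.5 * (A + A.T)
dA = rng.standard_normal((5, 5)); dA = 0.5 * (dA + dA.T)
lam, Q, dQ = eigvec_jvp(A, dA)
eps = 1e-7
_, Q2 = np.linalg.eigh(A + eps * dA)
Q2 = Q2 * np.sign(np.sum(Q2 * Q, axis=0))  # fix the sign ambiguity per column
print(np.max(np.abs((Q2 - Q) / eps - dQ)))
```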

## Spectral Degeneracy

Spectral factorizations have unavoidable degeneracies.

For eigenvectors, $q_i$ and $-q_i$ represent the same eigendirection. The sign choice is arbitrary.

For repeated eigenvalues, any orthonormal basis of the repeated eigenspace is valid. The individual eigenvectors are not unique.

For SVD, singular vectors have similar sign and subspace ambiguities.

This means that functions depending on individual eigenvectors or singular vectors may have unstable gradients even if the forward computation appears well behaved.

Stable code should depend on invariant quantities when possible, such as the projector $QQ^T$ for a subspace objective, or singular values rather than singular vectors.
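
The following small experiment illustrates the point: with two nearly equal eigenvalues, a tiny perturbation rotates the individual eigenvectors by a large amount, while the projector onto their shared eigenspace barely moves.

```python
import numpy as np

rng = np.random.default_rng(6)
A = np.diag([1.0, 1.0 + 1e-9, 3.0])         # near-degenerate pair
E = rng.standard_normal((3, 3)); E = 0.5 * (E + E.T)
_, Q1 = np.linalg.eigh(A)
_, Q2 = np.linalg.eigh(A + 1e-7 * E)
P1 = Q1[:, :2] @ Q1[:, :2].T                # projector onto the pair's eigenspace
P2 = Q2[:, :2] @ Q2[:, :2].T
print(np.max(np.abs(Q2[:, 0] - Q1[:, 0])))  # O(1): the basis spun around
print(np.max(np.abs(P2 - P1)))              # O(1e-7): the subspace is stable
```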

## Singular Value Decomposition

The singular value decomposition writes

$$
A = U\Sigma V^T.
$$

For

$$
A \in \mathbb{R}^{m \times n},
$$

the singular values are nonnegative, and the columns of $U$ and $V$ are the left and right singular vectors, respectively.

If a scalar loss depends only on singular values, the gradient has a simple form when singular values are distinct:

$$
\bar{A} =
U\,\operatorname{diag}(\bar{\sigma})\,V^T.
$$
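
A self-checking sketch for the nuclear norm $\sum_i \sigma_i$, where $\bar{\sigma} = \mathbf{1}$ and hence $\bar{A} = UV^T$:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 3))
U, s, Vh = np.linalg.svd(A, full_matrices=False)
Abar = U @ Vh                # U diag(1) V^T for the nuclear norm

# Finite-difference check of the directional derivative.
eps = 1e-6
dA = rng.standard_normal((5, 3))
fd = (np.linalg.svd(A + eps * dA, compute_uv=False).sum() - s.sum()) / eps
print(fd, np.sum(Abar * dA))  # the two numbers should agree
```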

When the loss depends on singular vectors, the backward rule contains terms involving differences of squared singular values:

$$
\frac{1}{\sigma_i^2 - \sigma_j^2}.
$$

Thus repeated or close singular values produce unstable gradients.

The SVD derivative is mathematically well defined almost everywhere but numerically fragile in AD. A system should document its behavior near repeated singular values.

## Polar Decomposition

The polar decomposition writes

$$
A = UP,
$$

where $U$ has orthonormal columns and $P$ is symmetric positive semidefinite.

It is useful in mechanics, geometry processing, and optimization on matrix manifolds.

Its differentiation raises issues similar to those of the SVD because it depends on the same spectral structure. If $A$ is rank deficient or nearly rank deficient, gradients can become unstable.

## Factoring Batched Matrices

Modern tensor systems usually support batched factorizations.

For example:

$$
A : (B, n, n)
$$

may represent $B$ independent matrices. Cholesky then returns

$$
L : (B, n, n).
$$

The derivative rule applies independently to each batch element. The batch axes are structural, not algebraic matrix axes.
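
For instance, NumPy's `cholesky` already treats leading axes as batch axes; the factorization, and any derivative rule built on it, maps over slices independently:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((10, 4, 4))             # batch of 10 matrices
A = X @ np.swapaxes(X, -1, -2) + 4 * np.eye(4)  # SPD in each slice
L = np.linalg.cholesky(A)                       # shape (10, 4, 4)
print(L.shape, np.max(np.abs(L @ np.swapaxes(L, -1, -2) - A)))
```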

A correct AD implementation must distinguish:

| Axis Type | Meaning |
|---|---|
| Batch axis | Independent repeated problem |
| Matrix row axis | Linear algebra dimension |
| Matrix column axis | Linear algebra dimension |

Accidentally reducing over batch axes or treating them as matrix axes gives wrong gradients.

## Pivoting and Discrete Choices

Many numerical factorizations use discrete choices:

| Factorization | Discrete Choice |
|---|---|
| LU | Pivot rows |
| QR with pivoting | Pivot columns |
| SVD | Singular value ordering and sign conventions |
| Eigendecomposition | Eigenvalue ordering and eigenvector signs |

Ordinary AD handles real-valued smooth maps. Discrete choices do not have ordinary derivatives.

Most systems follow a local convention:

1. Run the forward algorithm.
2. Record the chosen pivots or ordering.
3. Treat those choices as fixed during the backward pass.

This gives a useful local derivative away from discontinuities. It does not differentiate the decision rule that selected the pivot.
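
As a concrete instance of this convention, SciPy's `lu_factor` returns the pivot indices alongside the factors; a backward pass can replay them as fixed data rather than re-deriving them:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[1e-10, 1.0],
              [1.0,   1.0]])
lu_and_piv = lu_factor(A)  # factors plus the recorded pivot indices
x = lu_solve(lu_and_piv, np.array([1.0, 2.0]))
# A backward pass reuses the same pivots; the pivot-selection rule itself is
# never differentiated, only the smooth triangular solves around it.
print(lu_and_piv[1], x)
```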

## Custom Rules vs Tracing Algorithms

Tracing through a factorization algorithm can produce a derivative, but it often has poor properties.

| Approach | Advantage | Problem |
|---|---|---|
| Trace scalar algorithm | Simple for AD implementer | Large graph, poor stability, exposes implementation detail |
| Custom factorization rule | Compact and stable | Requires mathematical derivation |
| Differentiate higher-level solve | Best abstraction when factorization is internal | Needs suitable primitive API |

For high-performance AD, factorizations should generally be primitive operations with custom derivative rules.

## Implementation Contract

A factorization primitive should specify:

```text
input domain
output convention
rank assumptions
pivot behavior
batch semantics
gradient validity conditions
behavior at degeneracy
```

For example, an SVD primitive should say whether it returns full or reduced matrices, how singular values are ordered, how signs are chosen, and what happens when singular values are repeated.

A Cholesky primitive should say whether the input must be symmetric, whether the upper or lower factor is returned, and whether the backward pass symmetrizes the gradient.

## Practical Guidance

For users:

| Use Case | Prefer |
|---|---|
| Symmetric positive definite solve | Cholesky solve |
| General linear solve | LU or QR solve |
| Least squares | QR or dedicated least-squares primitive |
| Log determinant of SPD matrix | Cholesky-based logdet |
| Spectral regularization | Singular values or eigenvalues only |
| Subspace objectives | Projectors, not arbitrary bases |

For AD implementers:

| Primitive | Main Concern |
|---|---|
| Cholesky | SPD domain and triangular adjoint |
| LU | Pivot discontinuities |
| QR | Orthogonality and triangular constraints |
| Eigendecomposition | Repeated eigenvalues |
| SVD | Repeated singular values and sign ambiguity |
| Polar decomposition | Rank deficiency |

## Summary

Differentiating factorizations means differentiating constrained decompositions. The right starting point is usually the defining equation, together with the constraints on the factors. Cholesky and triangular solves have stable, useful derivative rules on their natural domains. QR, eigendecomposition, and SVD require more care because orthogonality, ordering, signs, and degeneracies affect the derivative.

A robust AD system treats factorizations as mathematical primitives with explicit contracts. It records the structural choices made in the forward pass and uses custom backward rules that preserve triangular, orthogonal, symmetric, and spectral structure.

