The singular value decomposition (SVD) is one of the most important matrix factorizations in numerical linear algebra. It appears in dimensionality reduction, least squares, low-rank approximation, spectral regularization, recommendation systems, scientific computing, and modern machine learning.
For automatic differentiation, SVD is both powerful and difficult. Singular values are comparatively stable. Singular vectors are not. The derivative formulas expose this instability directly.
Definition
For a matrix

$$A \in \mathbb{R}^{m \times n},$$

the singular value decomposition is

$$A = U \Sigma V^\top,$$

where:

$$U \in \mathbb{R}^{m \times m}, \qquad V \in \mathbb{R}^{n \times n}$$

are orthogonal, and

$$\Sigma \in \mathbb{R}^{m \times n}$$

is diagonal, with

$$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)} \ge 0.$$

The singular values are the square roots of the eigenvalues of

$$A^\top A \quad \text{or} \quad A A^\top.$$

The columns of $U$ are left singular vectors. The columns of $V$ are right singular vectors.

Most numerical systems use the reduced SVD:

$$A = U_r \Sigma_r V_r^\top,$$

where

$$U_r \in \mathbb{R}^{m \times r}, \qquad \Sigma_r \in \mathbb{R}^{r \times r}, \qquad V_r \in \mathbb{R}^{n \times r}, \qquad r = \min(m, n).$$
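These definitions are easy to verify numerically. A minimal NumPy sketch (illustrative only) computes the reduced SVD, reconstructs the matrix, and checks the eigenvalue relationship:

```python
import numpy as np

# Minimal check of the reduced SVD using NumPy (illustrative sketch).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# full_matrices=False returns the reduced SVD: U is 5x3, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruction: A = U diag(s) V^T.
A_rec = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rec))            # True

# Columns of U are orthonormal.
print(np.allclose(U.T @ U, np.eye(3)))  # True

# Singular values are square roots of the eigenvalues of A^T A.
eig = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(s**2, eig))           # True
```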
Why SVD Matters
SVD provides optimal low-rank structure.
By the Eckart–Young theorem, the best rank-$k$ approximation of $A$ in Frobenius norm is

$$A_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^\top.$$
Applications include:
| Application | Role |
|---|---|
| PCA | Principal components |
| Compression | Low-rank approximation |
| Recommender systems | Matrix factorization |
| Scientific computing | Ill-conditioned systems |
| Optimization | Spectral regularization |
| Vision | Latent representations |
| NLP | Embedding reduction |
| Control theory | Balanced truncation |
Because SVD appears inside many differentiable systems, robust AD rules are essential.
Differential of Singular Values
The singular values are comparatively well behaved.
Suppose the singular values are distinct. Let

$$u_i \in \mathbb{R}^m$$

and

$$v_i \in \mathbb{R}^n$$

be corresponding singular vectors:

$$A v_i = \sigma_i u_i, \qquad A^\top u_i = \sigma_i v_i.$$

Differentiate:

$$dA\, v_i + A\, dv_i = d\sigma_i\, u_i + \sigma_i\, du_i.$$

Left-multiply by $u_i^\top$:

$$u_i^\top dA\, v_i + u_i^\top A\, dv_i = d\sigma_i + \sigma_i\, u_i^\top du_i.$$

Using

$$u_i^\top A = \sigma_i v_i^\top,$$

we obtain

$$u_i^\top dA\, v_i + \sigma_i\, v_i^\top dv_i = d\sigma_i + \sigma_i\, u_i^\top du_i.$$

Because singular vectors remain normalized,

$$u_i^\top du_i = 0, \qquad v_i^\top dv_i = 0.$$

Therefore:

$$d\sigma_i = u_i^\top\, dA\, v_i.$$

This gives the gradient:

$$\frac{\partial \sigma_i}{\partial A} = u_i v_i^\top.$$
This rule is stable even when singular vectors themselves are unstable.
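The perturbation formula can be checked against central differences; a small NumPy sketch (assuming distinct singular values, which holds generically for random matrices):

```python
import numpy as np

# Central-difference check of the formula d(sigma_i) = u_i^T dA v_i.
# Illustrative only; assumes distinct singular values.
rng = np.random.default_rng(2)
A = rng.standard_normal((5, 4))
dA = rng.standard_normal((5, 4))
eps = 1e-6

U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_plus = np.linalg.svd(A + eps * dA, compute_uv=False)
s_minus = np.linalg.svd(A - eps * dA, compute_uv=False)
fd = (s_plus - s_minus) / (2 * eps)          # finite-difference derivative

analytic = np.array([U[:, i] @ dA @ Vt[i, :] for i in range(len(s))])
print(np.allclose(fd, analytic, atol=1e-5))  # True
```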
Reverse Rule for Singular Values
Suppose a scalar loss depends only on singular values:

$$L = f(\sigma_1, \ldots, \sigma_r).$$

Let

$$\bar{\sigma}_i = \frac{\partial L}{\partial \sigma_i}.$$

Then:

$$\bar{A} = \sum_i \bar{\sigma}_i\, \frac{\partial \sigma_i}{\partial A}.$$

Substitute:

$$\frac{\partial \sigma_i}{\partial A} = u_i v_i^\top.$$

Rewriting:

$$\bar{A} = \sum_i \bar{\sigma}_i\, u_i v_i^\top.$$

Therefore:

$$\bar{A} = U_r\, \mathrm{diag}(\bar{\sigma})\, V_r^\top.$$
This is one of the most important spectral gradient formulas.
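A quick sanity check of this formula: for $L = \sum_i \sigma_i^2 = \|A\|_F^2$, the reverse rule must reproduce the known gradient $2A$.

```python
import numpy as np

# Sketch of the reverse rule Abar = U diag(sigma_bar) V^T, checked on
# L = sum(sigma_i^2) = ||A||_F^2, whose gradient is known to be 2A.
rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

sigma_bar = 2.0 * s                   # dL/dsigma_i for L = sum sigma_i^2
A_bar = U @ np.diag(sigma_bar) @ Vt   # reverse rule for singular values

print(np.allclose(A_bar, 2.0 * A))    # True
```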
Singular Vector Derivatives
Singular vectors are much more delicate.
Differentiating them produces terms involving:

$$\frac{1}{\sigma_i^2 - \sigma_j^2}.$$
If singular values are close, the derivatives become large. If singular values are equal, the derivatives become undefined.
This is analogous to eigenvector differentiation.
The instability arises because singular subspaces are intrinsic, but individual basis vectors are not.
If

$$\sigma_i = \sigma_j \quad (i \neq j),$$

then any orthonormal basis of the corresponding singular subspace is valid.
The decomposition is therefore non-unique.
Sign Ambiguity
If

$$(u_i, v_i)$$

is a singular vector pair, then

$$(-u_i, -v_i)$$

represents the same singular mode.
Thus singular vectors have unavoidable sign ambiguity.
Small numerical perturbations may flip signs unpredictably:

$$u_i \mapsto -u_i, \qquad v_i \mapsto -v_i.$$

The forward computation remains correct, but gradients involving singular vectors may change discontinuously.
Losses depending directly on singular vector coordinates are therefore fragile.
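The ambiguity is easy to see directly: flipping a matched $(u_i, v_i)$ pair leaves the reconstructed matrix unchanged.

```python
import numpy as np

# Sign-ambiguity sketch: flipping u_i and v_i together leaves A unchanged,
# so the factorization is unique only up to these signs.
rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
U, s, Vt = np.linalg.svd(A)

# Flip the sign of the first singular pair.
U2 = U.copy(); Vt2 = Vt.copy()
U2[:, 0] *= -1
Vt2[0, :] *= -1

print(np.allclose(U2 @ np.diag(s) @ Vt2, A))  # True: same matrix, different basis
```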
Full Reverse Rule
Suppose the loss depends on both singular values and singular vectors:

$$L = f(U, \Sigma, V).$$

Let:

$$\bar{U} = \frac{\partial L}{\partial U}, \qquad \bar{\Sigma} = \frac{\partial L}{\partial \Sigma}, \qquad \bar{V} = \frac{\partial L}{\partial V}.$$

Define the matrix $F$:

$$F_{ij} = \frac{1}{\sigma_j^2 - \sigma_i^2} \quad (i \neq j), \qquad F_{ii} = 0.$$

The backward rule contains terms of the form:

$$F \circ \left( U^\top \bar{U} - \bar{U}^\top U \right),$$

and similarly for $V$.
The exact expression is complicated, but the important structure is simple:
- Singular-value gradients are stable.
- Singular-vector gradients contain inverse spectral gaps.
- Repeated singular values create undefined derivatives.
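This structure can be sketched in code for the square case. The helper name `svd_backward` and its variable names are ours, not a library API; production implementations (e.g. in AD frameworks) also handle rectangular shapes and the full-vs-reduced distinction.

```python
import numpy as np

# Sketch of the square-matrix SVD backward structure described above:
# a diagonal singular-value term plus F-weighted antisymmetric terms.
def svd_backward(U, s, Vt, Ubar, sbar, Vbar):
    V = Vt.T
    diff = s[None, :] ** 2 - s[:, None] ** 2   # sigma_j^2 - sigma_i^2
    np.fill_diagonal(diff, 1.0)                # placeholder to avoid 0/0
    F = 1.0 / diff
    np.fill_diagonal(F, 0.0)                   # F_ii = 0
    S = np.diag(s)
    J = F * (U.T @ Ubar - Ubar.T @ U)          # antisymmetric U-term
    K = F * (V.T @ Vbar - Vbar.T @ V)          # antisymmetric V-term
    return U @ (np.diag(sbar) + J @ S + S @ K) @ V.T

# Consistency check: with Ubar = Vbar = 0 the rule must reduce to the
# singular-value-only formula U diag(sbar) V^T.
rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
U, s, Vt = np.linalg.svd(A)
sbar = rng.standard_normal(4)
Z = np.zeros((4, 4))
print(np.allclose(svd_backward(U, s, Vt, Z, sbar, Z),
                  U @ np.diag(sbar) @ Vt))     # True
```

The $F$ matrix is exactly where the inverse spectral gaps enter: as two singular values approach each other, the corresponding entries of $F$ blow up.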
Geometric Interpretation
SVD identifies orthogonal subspaces.
The stable object is the subspace projector:

$$P = V_S V_S^\top,$$

where the columns of $V_S$ span a singular subspace.
The basis vectors themselves are arbitrary coordinates inside the subspace.
This distinction is critical.
Good objectives depend on:
| Stable Quantity | Unstable Quantity |
|---|---|
| Singular values | Singular vector signs |
| Subspace projectors | Individual basis vectors |
| Low-rank reconstructions | Raw singular coordinates |
| Spectral norms | Ordered vector identities |
When possible, design losses around invariant geometry.
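The projector's invariance is easy to demonstrate: flipping the sign of a basis vector changes the basis but not the projector.

```python
import numpy as np

# Sketch: the subspace projector V_k V_k^T is invariant under sign flips
# of individual singular vectors, so projector-based losses are stable.
rng = np.random.default_rng(6)
A = rng.standard_normal((6, 4))
_, _, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Vk = Vt[:k, :].T                           # top-k right singular vectors
P = Vk @ Vk.T                              # projector onto their span

Vk_flipped = Vk * np.array([1.0, -1.0])    # flip the sign of one basis vector
P_flipped = Vk_flipped @ Vk_flipped.T
print(np.allclose(P, P_flipped))           # True: the projector is unchanged
```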
Spectral Norm
The spectral norm is the largest singular value:

$$\|A\|_2 = \sigma_1.$$

If the largest singular value is simple:

$$\frac{\partial \|A\|_2}{\partial A} = u_1 v_1^\top.$$
This appears in:
| Use Case | Purpose |
|---|---|
| Spectral normalization | Stabilize neural networks |
| Lipschitz constraints | Robust optimization |
| Control theory | Gain bounds |
| Numerical analysis | Condition estimation |
If the top singular value is repeated, the gradient becomes set-valued.
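When the top singular value is simple, the gradient $u_1 v_1^\top$ can be verified with central differences (an illustrative sketch):

```python
import numpy as np

# Central-difference check of the spectral-norm gradient u1 v1^T
# (valid when the top singular value is simple, as it is generically).
rng = np.random.default_rng(7)
A = rng.standard_normal((5, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
grad = np.outer(U[:, 0], Vt[0, :])    # u1 v1^T

dA = rng.standard_normal((5, 4))
eps = 1e-6
f = lambda M: np.linalg.norm(M, 2)    # spectral norm = largest singular value
fd = (f(A + eps * dA) - f(A - eps * dA)) / (2 * eps)
print(np.isclose(fd, np.sum(grad * dA), atol=1e-5))  # True
```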
Nuclear Norm
The nuclear norm is

$$\|A\|_* = \sum_i \sigma_i.$$

Its gradient for full-rank matrices with distinct singular values is

$$\frac{\partial \|A\|_*}{\partial A} = U_r V_r^\top.$$
The nuclear norm is widely used for low-rank regularization.
At rank-deficient points, the derivative becomes non-unique and requires subgradient theory.
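Away from rank-deficient points, the gradient $U_r V_r^\top$ checks out against central differences:

```python
import numpy as np

# Central-difference check of the nuclear-norm gradient U V^T
# (assumes full rank with distinct singular values, true generically).
rng = np.random.default_rng(8)
A = rng.standard_normal((5, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
grad = U @ Vt

dA = rng.standard_normal((5, 4))
eps = 1e-6
nuc = lambda M: np.linalg.norm(M, "nuc")   # sum of singular values
fd = (nuc(A + eps * dA) - nuc(A - eps * dA)) / (2 * eps)
print(np.isclose(fd, np.sum(grad * dA), atol=1e-5))  # True
```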
Low-Rank Approximation
Suppose

$$A_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^\top$$

keeps only the top $k$ singular values.

This operation is discontinuous when:

$$\sigma_k = \sigma_{k+1}.$$
Near rank transitions, gradients become unstable.
Hard truncation creates nondifferentiable boundaries.
Soft spectral weighting is often preferable:

$$A_g = \sum_i g(\sigma_i)\, u_i v_i^\top$$

for a smooth function $g$.
Examples:
| Function | Effect |
|---|---|
| Soft threshold | Shrink singular values |
| Exponential decay | Smooth filtering |
| Logistic gating | Continuous truncation |
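A soft spectral filter is straightforward to sketch. Here `spectral_filter` and `soft_threshold` are our own illustrative names, implementing the soft-threshold option from the table above:

```python
import numpy as np

# Sketch of soft spectral weighting: rebuild A with g(sigma_i) in place
# of a hard top-k cut.
def spectral_filter(A, g):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(g(s)) @ Vt

def soft_threshold(s, tau=0.5):
    return np.maximum(s - tau, 0.0)   # shrink singular values toward zero

rng = np.random.default_rng(9)
A = rng.standard_normal((5, 4))
B = spectral_filter(A, soft_threshold)

# Every singular value of the filtered matrix is shrunk by tau (or zeroed).
s_A = np.linalg.svd(A, compute_uv=False)
s_B = np.linalg.svd(B, compute_uv=False)
print(np.allclose(s_B, np.maximum(s_A - 0.5, 0.0)))  # True
```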
SVD and PCA
Principal component analysis computes the leading singular vectors of centered data.
Suppose:

$$X \in \mathbb{R}^{n \times d}$$

is a centered data matrix. PCA often uses:

$$X = U \Sigma V^\top.$$

Principal directions are the columns of $V$.
Differentiating through PCA inherits all singular-vector instability issues.
If principal components swap order under perturbation, gradients become discontinuous.
Using covariance projectors:

$$P = V_k V_k^\top$$

is more stable than depending on individual components.
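The SVD–PCA correspondence can be sketched directly: after centering, the squared singular values match the eigenvalues of the sample covariance.

```python
import numpy as np

# PCA-via-SVD sketch: center the data, take the SVD, and note that the
# squared singular values match the covariance eigenvalues.
rng = np.random.default_rng(12)
X = rng.standard_normal((100, 5))
Xc = X - X.mean(axis=0)           # centering is required for PCA

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

C = Xc.T @ Xc / (len(X) - 1)      # sample covariance
evals = np.sort(np.linalg.eigvalsh(C))[::-1]
print(np.allclose(evals, s**2 / (len(X) - 1)))  # True
```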
SVD and Matrix Completion
Recommendation systems often optimize low-rank factors:

$$A \approx W H^\top, \qquad W \in \mathbb{R}^{m \times k}, \quad H \in \mathbb{R}^{n \times k}.$$
Differentiating through an explicit SVD is usually avoided. Instead, systems optimize factors directly.
This avoids:
| Problem | Cause |
|---|---|
| Singular-vector instability | Spectral degeneracy |
| Expensive backward pass | Full decomposition |
| Rank-transition discontinuities | Hard truncation |
Direct factor parameterization is often more stable.
Differentiating Truncated SVD
Many applications compute only the top singular values and vectors.
This creates additional challenges:
| Issue | Explanation |
|---|---|
| Ordering discontinuity | Singular values may swap |
| Rank instability | Small modes appear/disappear |
| Iterative solver truncation | Backward graph incomplete |
| Approximation error | Forward and backward mismatch |
For iterative methods like Lanczos or power iteration, implicit differentiation may be preferable.
Iterative SVD Methods
Large systems rarely compute full dense SVD.
Instead they use:
| Method | Use |
|---|---|
| Power iteration | Largest singular value |
| Lanczos | Partial spectrum |
| Randomized SVD | Large sparse matrices |
| Block Krylov | Fast low-rank approximation |
AD can:
- Differentiate through all iterations.
- Use implicit differentiation.
Differentiating through iterations creates long computation graphs and memory costs. Implicit methods are often more scalable.
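Power iteration for the largest singular value is a representative example of the kind of loop an AD system can either unroll or treat implicitly (the function name here is ours, for illustration):

```python
import numpy as np

# Power-iteration sketch for the largest singular value: iterate
# v <- A^T A v with normalization until v aligns with the top
# right singular vector.
def top_singular_value(A, iters=500):
    rng = np.random.default_rng(0)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = A.T @ (A @ v)          # one step of power iteration on A^T A
        v /= np.linalg.norm(v)
    return np.linalg.norm(A @ v)   # ||A v|| -> sigma_1

rng = np.random.default_rng(10)
A = rng.standard_normal((6, 4))
print(np.isclose(top_singular_value(A),
                 np.linalg.svd(A, compute_uv=False)[0]))  # True
```

Differentiating through all 500 iterations stores every intermediate `v`; an implicit rule differentiates only the fixed-point condition.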
Complex-Valued SVD
For complex matrices:

$$A = U \Sigma V^H, \qquad A \in \mathbb{C}^{m \times n}.$$

The adjoint uses the conjugate transpose:

$$V^H = \overline{V}^\top.$$

Complex differentiation requires Wirtinger calculus or equivalent conventions.

Orthogonality becomes unitarity:

$$U^H U = I, \qquad V^H V = I.$$
Complex spectral differentiation is even more subtle because phase ambiguity replaces sign ambiguity.
Numerical Stability
SVD backward passes are numerically sensitive.
Common stabilization techniques:
| Technique | Purpose |
|---|---|
| Spectral gap regularization | Avoid division blowup |
| Truncated spectra | Remove unstable modes |
| Symmetrization | Reduce numerical drift |
| Soft spectral functions | Avoid discontinuous truncation |
| Projector losses | Avoid basis instability |
No implementation can fully eliminate instability near repeated singular values because the mathematical derivative itself becomes singular.
Batch SVD
Modern tensor systems support batched SVD:

$$A \in \mathbb{R}^{b \times m \times n}.$$

Each batch element is factorized independently:

$$A_i = U_i \Sigma_i V_i^\top, \qquad i = 1, \ldots, b.$$
The backward pass must preserve:
- batch axes
- matrix axes
- spectral ordering
- orthogonality constraints

Shape errors in batched spectral code are common and difficult to debug.
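NumPy's `svd` applies batching over leading axes, which makes the shape conventions concrete:

```python
import numpy as np

# Batched SVD sketch: np.linalg.svd factorizes each leading-axis slice
# independently; einsum keeps the batch axis explicit on reconstruction.
rng = np.random.default_rng(11)
A = rng.standard_normal((3, 5, 4))    # batch of three 5x4 matrices

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)     # (3, 5, 4) (3, 4) (3, 4, 4)

# Batched reconstruction A_i = U_i diag(s_i) Vt_i for every i.
A_rec = np.einsum("bij,bj,bjk->bik", U, s, Vt)
print(np.allclose(A, A_rec))          # True
```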
Implementation Metadata
An SVD primitive should record:
- full vs reduced decomposition
- singular value ordering
- sign conventions
- batch dimensions
- real vs complex arithmetic
- rank truncation
- iterative solver details

The backward rule depends on all of these choices.
Practical Guidance
Prefer objectives based on:
| Stable | Avoid |
|---|---|
| Singular values | Raw singular vectors |
| Spectral norms | Vector coordinates |
| Nuclear norms | Arbitrary basis orientation |
| Subspace projectors | Sign-sensitive losses |
| Smooth spectral filters | Hard truncation |
When singular vectors are unavoidable:
| Strategy | Benefit |
|---|---|
| Add spectral gap regularization | Improves conditioning |
| Use projector formulations | Removes basis ambiguity |
| Avoid repeated spectra | Reduces instability |
| Use implicit methods | Better scalability |
Summary
SVD is one of the most expressive and difficult primitives in automatic differentiation. Singular values have stable first-order perturbations. Singular vectors do not. Their derivatives contain inverse spectral gaps and become undefined at repeated singular values.
The stable geometric object is the singular subspace, not the arbitrary basis chosen to represent it. Robust differentiable systems should therefore formulate objectives in terms of invariant spectral quantities whenever possible.