# Singular Value Decomposition

The singular value decomposition (SVD) is one of the most important matrix factorizations in numerical linear algebra. It appears in dimensionality reduction, least squares, low-rank approximation, spectral regularization, recommendation systems, scientific computing, and modern machine learning.

For automatic differentiation, SVD is both powerful and difficult. Singular values are comparatively stable. Singular vectors are not. The derivative formulas expose this instability directly.

## Definition

For a matrix

$$
A \in \mathbb{R}^{m \times n},
$$

the singular value decomposition is

$$
A = U\Sigma V^T,
$$

where:

$$
U \in \mathbb{R}^{m \times m},
\qquad
U^TU = I,
$$

$$
V \in \mathbb{R}^{n \times n},
\qquad
V^TV = I,
$$

and

$$
\Sigma \in \mathbb{R}^{m \times n}
$$

is diagonal, with entries

$$
\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0,
\qquad
p = \min(m, n).
$$

The singular values are the square roots of the eigenvalues of $A^TA$ (equivalently of $AA^T$, which has the same nonzero eigenvalues).

The columns of $U$ are left singular vectors. The columns of $V$ are right singular vectors.

Most numerical systems use the reduced (thin) SVD:

$$
U \in \mathbb{R}^{m \times r},
\qquad
\Sigma \in \mathbb{R}^{r \times r},
\qquad
V \in \mathbb{R}^{n \times r},
$$

where $r = \min(m, n)$, or $r = \operatorname{rank}(A)$ in the compact form that also drops zero singular values.
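
For concreteness, a minimal NumPy sketch of the two forms (note that `np.linalg.svd` returns $V^T$ rather than $V$, and its reduced form uses $r = \min(m, n)$ rather than the exact rank):

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(5, 3))

# Full SVD: U is 5x5, Vt is 3x3, s holds min(m, n) = 3 singular values.
U_full, s, Vt_full = np.linalg.svd(A, full_matrices=True)

# Reduced (thin) SVD: U is 5x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruction from the thin factors recovers A.
A_rec = U @ np.diag(s) @ Vt
assert np.allclose(A, A_rec)
```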

## Why SVD Matters

SVD provides optimal low-rank structure.

By the Eckart–Young theorem, the best rank-$k$ approximation of $A$ in Frobenius norm (and in spectral norm) is

$$
A_k =
U_k \Sigma_k V_k^T.
$$
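
A minimal NumPy sketch of rank-$k$ truncation, with the Eckart–Young error identity as a sanity check (assumes the generic case $\sigma_k > \sigma_{k+1}$):

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation in Frobenius (and spectral) norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.default_rng(1).normal(size=(6, 4))
A2 = best_rank_k(A, 2)

# Eckart-Young: the Frobenius error equals the norm of the tail spectrum.
s = np.linalg.svd(A, compute_uv=False)
assert np.isclose(np.linalg.norm(A - A2, "fro"), np.sqrt(np.sum(s[2:] ** 2)))
```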

Applications include:

| Application | Role |
|---|---|
| PCA | Principal components |
| Compression | Low-rank approximation |
| Recommender systems | Matrix factorization |
| Scientific computing | Ill-conditioned systems |
| Optimization | Spectral regularization |
| Vision | Latent representations |
| NLP | Embedding reduction |
| Control theory | Balanced truncation |

Because SVD appears inside many differentiable systems, robust AD rules are essential.

## Differential of Singular Values

The singular values are comparatively well behaved.

Suppose the singular values are distinct, and let $u_i$ and $v_i$ be the corresponding singular vectors:

$$
Av_i = \sigma_i u_i,
$$

$$
A^T u_i = \sigma_i v_i.
$$

Differentiate:

$$
dA\,v_i + A\,dv_i =
d\sigma_i\,u_i + \sigma_i\,du_i.
$$

Left-multiply by $u_i^T$:

$$
u_i^T dA\,v_i
+
u_i^T A\,dv_i =
d\sigma_i + \sigma_i u_i^T du_i.
$$

Using

$$
u_i^T A = \sigma_i v_i^T,
$$

we obtain

$$
u_i^T A\,dv_i =
\sigma_i v_i^T dv_i.
$$

Because singular vectors remain normalized,

$$
u_i^T du_i = 0,
\qquad
v_i^T dv_i = 0.
$$

Therefore:

$$
d\sigma_i =
u_i^T dA\,v_i.
$$

This gives the gradient:

$$
\nabla_A \sigma_i =
u_i v_i^T.
$$

This rule is stable even when singular vectors themselves are unstable.
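
A finite-difference sanity check of this gradient (a NumPy sketch; it assumes distinct singular values, so $\sigma_i$ is differentiable and keeps its position in the sorted spectrum under small perturbations):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

i = 1                                   # which singular value to probe
grad = np.outer(U[:, i], Vt[i, :])      # analytic gradient u_i v_i^T

# Compare against central finite differences, entry by entry.
eps = 1e-6
num = np.zeros_like(A)
for r in range(A.shape[0]):
    for c in range(A.shape[1]):
        E = np.zeros_like(A)
        E[r, c] = eps
        sp = np.linalg.svd(A + E, compute_uv=False)[i]
        sm = np.linalg.svd(A - E, compute_uv=False)[i]
        num[r, c] = (sp - sm) / (2 * eps)

assert np.allclose(grad, num, atol=1e-5)
```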

## Reverse Rule for Singular Values

Suppose a scalar loss depends only on singular values:

$$
L = f(\sigma_1,\ldots,\sigma_r).
$$

Let

$$
\bar{\sigma}_i =
\frac{\partial L}{\partial \sigma_i}.
$$

Then:

$$
dL =
\sum_i
\bar{\sigma}_i d\sigma_i.
$$

Substitute:

$$
dL =
\sum_i
\bar{\sigma}_i
u_i^T dA\,v_i.
$$

Collecting the sum in matrix form:

$$
dL =
\operatorname{tr}
\left(
U\,\operatorname{diag}(\bar{\sigma})\,V^T dA^T
\right).
$$

Therefore:

$$
\bar{A} =
U\,\operatorname{diag}(\bar{\sigma})\,V^T.
$$

This is one of the most important spectral gradient formulas.
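
A quick consistency check (a sketch): for $L = \sum_i \sigma_i^2 = \|A\|_F^2$ the rule yields $\bar{A} = U\,\operatorname{diag}(2\sigma)\,V^T = 2A$, the familiar gradient of the squared Frobenius norm.

```python
import numpy as np

A = np.random.default_rng(3).normal(size=(5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

sigma_bar = 2 * s                     # dL/dsigma_i for L = sum(sigma_i^2)
A_bar = U @ np.diag(sigma_bar) @ Vt   # reverse rule for singular values

assert np.allclose(A_bar, 2 * A)      # matches the gradient of ||A||_F^2
```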

## Singular Vector Derivatives

Singular vectors are much more delicate.

Differentiating them produces terms involving:

$$
\frac{1}{\sigma_i^2 - \sigma_j^2}.
$$

If singular values are close, the derivatives become large. If singular values are equal, the derivatives become undefined.

This is analogous to eigenvector differentiation.

The instability arises because singular subspaces are intrinsic, but individual basis vectors are not.

If $\sigma_i = \sigma_j$, then any orthonormal basis of the corresponding singular subspace is valid, so the decomposition is non-unique.

## Sign Ambiguity

If $(u_i, v_i)$ is a singular vector pair, then $(-u_i, -v_i)$ represents the same singular mode, so singular vectors carry an unavoidable sign ambiguity.

Small numerical perturbations may flip signs unpredictably:

$$
u_i \to -u_i,
\qquad
v_i \to -v_i.
$$

The forward computation remains correct, but gradients involving singular vectors may change discontinuously.

Losses depending directly on singular vector coordinates are therefore fragile.

## Full Reverse Rule

Suppose the loss depends on both singular values and singular vectors:

$$
L(U,\Sigma,V).
$$

Let:

$$
\bar{U} =
\frac{\partial L}{\partial U},
\qquad
\bar{\Sigma} =
\frac{\partial L}{\partial \Sigma},
\qquad
\bar{V} =
\frac{\partial L}{\partial V}.
$$

Define the matrix $F$ of inverse spectral gaps:

$$
F_{ij} =
\begin{cases}
\frac{1}{\sigma_i^2 - \sigma_j^2}, & i \ne j, \\
0, & i=j.
\end{cases}
$$

The backward rule contains terms:

$$
F \odot
(U^T\bar{U} - \bar{U}^T U),
$$

and similarly for $V$.
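
A sketch of constructing $F$, with a mask so exact ties and the diagonal are zeroed rather than divided; near-ties still produce enormous entries, which is precisely the instability at issue:

```python
import numpy as np

def inverse_gap_matrix(s, tol=0.0):
    """F_ij = 1 / (s_i^2 - s_j^2) off the diagonal, 0 where undefined."""
    gaps = s[:, None] ** 2 - s[None, :] ** 2
    F = np.zeros_like(gaps)
    mask = np.abs(gaps) > tol          # skip the diagonal and exact ties
    F[mask] = 1.0 / gaps[mask]
    return F

s = np.array([3.0, 1.0, 0.999])        # near-degenerate pair
F = inverse_gap_matrix(s)
print(F)  # F[1, 2] and F[2, 1] are roughly +/- 500: the gradient blows up
```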

The exact expression is complicated, but the important structure is simple:

1. Singular-value gradients are stable.
2. Singular-vector gradients contain inverse spectral gaps.
3. Repeated singular values create undefined derivatives.

## Geometric Interpretation

SVD identifies orthogonal subspaces.

The stable object is the subspace projector:

$$
UU^T,
\qquad
VV^T.
$$

The basis vectors themselves are arbitrary coordinates inside the subspace.

This distinction is critical.

Good objectives depend on:

| Stable Quantity | Unstable Quantity |
|---|---|
| Singular values | Singular vector signs |
| Subspace projectors | Individual basis vectors |
| Low-rank reconstructions | Raw singular coordinates |
| Spectral norms | Ordered vector identities |

When possible, design losses around invariant geometry.

## Spectral Norm

The spectral norm is the largest singular value:

$$
\|A\|_2 = \sigma_1.
$$

If the largest singular value is simple:

$$
\nabla_A \|A\|_2 =
u_1 v_1^T.
$$

This appears in:

| Use Case | Purpose |
|---|---|
| Spectral normalization | Stabilize neural networks |
| Lipschitz constraints | Robust optimization |
| Control theory | Gain bounds |
| Numerical analysis | Condition estimation |

If the top singular value is repeated, the gradient becomes set-valued.
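
A sketch of this gradient with a directional finite-difference check (assumes $\sigma_1$ is simple, which holds almost surely for a random matrix):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4))
U, s, Vt = np.linalg.svd(A)
grad = np.outer(U[:, 0], Vt[0, :])    # gradient of ||A||_2 = sigma_1

# Directional derivative along a random perturbation E.
E = rng.normal(size=A.shape)
eps = 1e-6
fd = (np.linalg.norm(A + eps * E, 2) - np.linalg.norm(A - eps * E, 2)) / (2 * eps)
assert np.isclose(fd, np.sum(grad * E), atol=1e-5)
```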

## Nuclear Norm

The nuclear norm is

$$
\|A\|_* =
\sum_i \sigma_i.
$$

Its gradient for full-rank matrices with distinct singular values is

$$
\nabla_A \|A\|_* =
UV^T.
$$

The nuclear norm is widely used for low-rank regularization.

At rank-deficient points, the derivative becomes non-unique and requires subgradient theory.
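
A sketch of the same pattern (valid only for full-rank $A$ with distinct singular values); note that setting $\bar{\sigma}_i = 1$ in the singular-value reverse rule recovers exactly $\bar{A} = UV^T$:

```python
import numpy as np

A = np.random.default_rng(5).normal(size=(5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

nuclear = s.sum()                     # ||A||_* as a sum of singular values
grad = U @ Vt                         # gradient U V^T away from rank deficiency

assert np.isclose(nuclear, np.linalg.norm(A, "nuc"))
```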

## Low-Rank Approximation

Suppose

$$
A_k = U_k \Sigma_k V_k^T
$$

keeps only the top $k$ singular values.

This operation is discontinuous when:

$$
\sigma_k = \sigma_{k+1}.
$$

Near rank transitions, gradients become unstable.

Hard truncation creates nondifferentiable boundaries.

Soft spectral weighting is often preferable:

$$
\tilde{\Sigma}_{ii} =
f(\sigma_i),
$$

for a smooth function $f$.

Examples:

| Function | Effect |
|---|---|
| Soft threshold | Shrink singular values |
| Exponential decay | Smooth filtering |
| Logistic gating | Continuous truncation |
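
A sketch of one such filter, soft thresholding, with a hypothetical threshold parameter `tau`; the soft threshold is continuous but has a kink at $\sigma_i = \tau$, while the exponential and logistic choices above are fully smooth:

```python
import numpy as np

def soft_spectral(A, tau):
    """Replace each sigma_i with max(sigma_i - tau, 0): soft thresholding."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

A = np.random.default_rng(6).normal(size=(5, 4))
A_soft = soft_spectral(A, tau=0.5)    # shrinks all modes, no hard cutoff
```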

## SVD and PCA

Principal component analysis computes the leading singular vectors of centered data.

Suppose:

$$
X \in \mathbb{R}^{N \times d}.
$$

PCA computes its SVD:

$$
X = U\Sigma V^T.
$$

Principal directions are columns of $V$.

Differentiating through PCA inherits all singular-vector instability issues.

If principal components swap order under perturbation, gradients become discontinuous.

Using the projector onto the top-$k$ principal subspace,

$$
V_k V_k^T,
$$

is more stable than depending on individual components.
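
A sketch contrasting the sign-sensitive and projector formulations (the top-$k$ cutoff here is a hypothetical choice):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)               # center the data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Vk = Vt[:k, :].T                      # top-k principal directions (sign-ambiguous)

P = Vk @ Vk.T                         # subspace projector: sign/order invariant
assert np.allclose(P, (-Vk) @ (-Vk).T)   # flipping signs leaves P unchanged
```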

## SVD and Matrix Completion

Recommendation systems often optimize low-rank factors:

$$
A \approx UV^T.
$$

Differentiating through an explicit SVD is usually avoided. Instead, systems optimize factors directly.

This avoids:

| Problem | Cause |
|---|---|
| Singular-vector instability | Spectral degeneracy |
| Expensive backward pass | Full decomposition |
| Rank-transition discontinuities | Hard truncation |

Direct factor parameterization is often more stable.
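
A minimal sketch of direct factor optimization on observed entries (the mask density, step size, and iteration count are hypothetical); no SVD appears anywhere in the loop:

```python
import numpy as np

rng = np.random.default_rng(8)
m, n, k = 20, 15, 3
A = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))  # ground-truth low-rank matrix
M = rng.random(size=(m, n)) < 0.5                      # mask of observed entries

U = 0.1 * rng.normal(size=(m, k))                      # small random factors
V = 0.1 * rng.normal(size=(n, k))
lr = 0.01
for _ in range(2000):
    R = M * (U @ V.T - A)              # residual on observed entries only
    U, V = U - lr * (R @ V), V - lr * (R.T @ U)

print(np.linalg.norm(M * (U @ V.T - A)))   # observed-entry error after fitting
```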

## Differentiating Truncated SVD

Many applications compute only the top $k$ singular values and vectors.

This creates additional challenges:

| Issue | Explanation |
|---|---|
| Ordering discontinuity | Singular values may swap |
| Rank instability | Small modes appear/disappear |
| Iterative solver truncation | Backward graph incomplete |
| Approximation error | Forward and backward mismatch |

For iterative methods like Lanczos or power iteration, implicit differentiation may be preferable.

## Iterative SVD Methods

Large systems rarely compute full dense SVD.

Instead they use:

| Method | Use |
|---|---|
| Power iteration | Largest singular value |
| Lanczos | Partial spectrum |
| Randomized SVD | Large sparse matrices |
| Block Krylov | Fast low-rank approximation |

AD can:

1. Differentiate through all iterations.
2. Use implicit differentiation.

Differentiating through iterations creates long computation graphs and memory costs. Implicit methods are often more scalable.
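
A sketch of the implicit approach for the top singular pair: run power iteration as an opaque solver, then assemble $\nabla_A \sigma_1 = u_1 v_1^T$ from the converged vectors instead of backpropagating through the iterations:

```python
import numpy as np

def top_singular_grad(A, iters=200):
    """Gradient of sigma_1 via power iteration on A^T A (no unrolling)."""
    v = np.random.default_rng(9).normal(size=A.shape[1])
    for _ in range(iters):
        v = A.T @ (A @ v)              # power step on A^T A
        v /= np.linalg.norm(v)
    u = A @ v
    sigma = np.linalg.norm(u)
    u /= sigma
    return sigma, np.outer(u, v)       # d sigma_1 / dA = u_1 v_1^T

A = np.random.default_rng(10).normal(size=(6, 4))
sigma, grad = top_singular_grad(A)
assert np.isclose(sigma, np.linalg.svd(A, compute_uv=False)[0])
```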

## Complex-Valued SVD

For complex matrices:

$$
A = U\Sigma V^H.
$$

The adjoint side uses the conjugate transpose $V^H$ in place of $V^T$, and complex differentiation requires Wirtinger calculus or an equivalent convention.

Orthogonality becomes unitarity:

$$
U^H U = I.
$$

Complex spectral differentiation is even more subtle because phase ambiguity replaces sign ambiguity.
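
A small sketch of the phase ambiguity: rotating each pair $(u_i, v_i)$ by a common unit phase leaves $A$ unchanged:

```python
import numpy as np

rng = np.random.default_rng(12)
A = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
U, s, Vh = np.linalg.svd(A)
assert np.allclose(U.conj().T @ U, np.eye(3))   # unitarity: U^H U = I

# Multiply pair i by e^{i theta_i}: column of U and (conjugated) row of Vh.
phase = np.exp(1j * rng.uniform(0, 2 * np.pi, size=3))
U2 = U * phase
Vh2 = Vh * phase.conj()[:, None]
assert np.allclose(U2 @ np.diag(s) @ Vh2, A)    # same matrix, different vectors
```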

## Numerical Stability

SVD backward passes are numerically sensitive.

Common stabilization techniques:

| Technique | Purpose |
|---|---|
| Spectral gap regularization | Avoid division blowup |
| Truncated spectra | Remove unstable modes |
| Symmetrization | Reduce numerical drift |
| Soft spectral functions | Avoid discontinuous truncation |
| Projector losses | Avoid basis instability |

No implementation can fully eliminate instability near repeated singular values because the mathematical derivative itself becomes singular.
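
As one concrete example, spectral gap regularization can replace the raw inverse gap $1/g$ with a damped form $g/(g^2 + \varepsilon)$ (a sketch; $\varepsilon$ is a hypothetical tuning constant, and the damping trades bias for boundedness):

```python
import numpy as np

def damped_inverse_gap(s, eps=1e-6):
    """F_ij = g / (g^2 + eps) with g = s_i^2 - s_j^2: bounded everywhere."""
    g = s[:, None] ** 2 - s[None, :] ** 2
    F = g / (g ** 2 + eps)             # equals 0 at g = 0 instead of diverging
    np.fill_diagonal(F, 0.0)
    return F

print(damped_inverse_gap(np.array([1.0, 1.0 + 1e-8])))  # finite, not huge
```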

## Batch SVD

Modern tensor systems support batched SVD:

$$
A : (B,m,n).
$$

Each batch element is factorized independently:

$$
A_b = U_b \Sigma_b V_b^T.
$$

The backward pass must preserve:

```text
batch axes
matrix axes
spectral ordering
orthogonality constraints
```

Shape errors in batched spectral code are common and difficult to debug.
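
NumPy broadcasts SVD over leading batch axes, which makes the shape contract explicit (a sketch):

```python
import numpy as np

B, m, n = 8, 5, 3
A = np.random.default_rng(11).normal(size=(B, m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # batched over axis 0
k = min(m, n)
assert U.shape == (B, m, k) and s.shape == (B, k) and Vt.shape == (B, k, n)

# Batched reconstruction; einsum keeps the batch axis explicit.
A_rec = np.einsum("bij,bj,bjk->bik", U, s, Vt)
assert np.allclose(A, A_rec)
```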

## Implementation Metadata

An SVD primitive should record:

```text
full vs reduced decomposition
singular value ordering
sign conventions
batch dimensions
real vs complex arithmetic
rank truncation
iterative solver details
```

The backward rule depends on all of these choices.

## Practical Guidance

Prefer objectives based on:

| Stable | Avoid |
|---|---|
| Singular values | Raw singular vectors |
| Spectral norms | Vector coordinates |
| Nuclear norms | Arbitrary basis orientation |
| Subspace projectors | Sign-sensitive losses |
| Smooth spectral filters | Hard truncation |

When singular vectors are unavoidable:

| Strategy | Benefit |
|---|---|
| Add spectral gap regularization | Improves conditioning |
| Use projector formulations | Removes basis ambiguity |
| Avoid repeated spectra | Reduces instability |
| Use implicit methods | Better scalability |

## Summary

SVD is one of the most expressive and difficult primitives in automatic differentiation. Singular values have stable first-order perturbations. Singular vectors do not. Their derivatives contain inverse spectral gaps and become undefined at repeated singular values.

The stable geometric object is the singular subspace, not the arbitrary basis chosen to represent it. Robust differentiable systems should therefore formulate objectives in terms of invariant spectral quantities whenever possible.

