Singular Value Decomposition

The singular value decomposition (SVD) is one of the most important matrix factorizations in numerical linear algebra. It appears in dimensionality reduction, least squares, low-rank approximation, spectral regularization, recommendation systems, scientific computing, and modern machine learning.

For automatic differentiation, SVD is both powerful and difficult. Singular values are comparatively stable. Singular vectors are not. The derivative formulas expose this instability directly.

Definition

For a matrix

A \in \mathbb{R}^{m \times n},

the singular value decomposition is

A = U \Sigma V^T,

where:

U \in \mathbb{R}^{m \times m}, \qquad U^T U = I, \qquad V \in \mathbb{R}^{n \times n}, \qquad V^T V = I,

and

\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r),

with

\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0.

The singular values are the square roots of the eigenvalues of A^T A or, equivalently, of AA^T.

The columns of U are left singular vectors. The columns of V are right singular vectors.

Most numerical systems use a reduced (thin) SVD:

U \in \mathbb{R}^{m \times r}, \qquad \Sigma \in \mathbb{R}^{r \times r}, \qquad V \in \mathbb{R}^{n \times r},

with r = min(m, n) in practice; the compact form truncates further to r = rank(A).
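
As a concrete illustration, here is a minimal NumPy sketch (not tied to any particular AD framework) of the full versus thin decomposition; `numpy.linalg.svd` returns the thin factors when `full_matrices=False`:

```python
import numpy as np

A = np.random.default_rng(0).standard_normal((5, 3))

# Full SVD: U is (5, 5), Vt is (3, 3).
U_full, s, Vt_full = np.linalg.svd(A)

# Thin (reduced) SVD: U is (5, 3), Vt is (3, 3), s has min(m, n) = 3 entries.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The thin factors already reconstruct A exactly (up to rounding).
print(np.allclose(U @ np.diag(s) @ Vt, A))  # True
```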

Why SVD Matters

SVD provides optimal low-rank structure.

The best rank-k approximation of A in Frobenius norm is

A_k = U_k \Sigma_k V_k^T.
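
A minimal NumPy sketch of this truncation (the rank k below is an arbitrary illustrative choice):

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A in the Frobenius (and spectral) norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.default_rng(1).standard_normal((20, 10))
A2 = best_rank_k(A, 2)
print(np.linalg.matrix_rank(A2))  # 2
```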

Applications include:

| Application | Role |
| --- | --- |
| PCA | Principal components |
| Compression | Low-rank approximation |
| Recommender systems | Matrix factorization |
| Scientific computing | Ill-conditioned systems |
| Optimization | Spectral regularization |
| Vision | Latent representations |
| NLP | Embedding reduction |
| Control theory | Balanced truncation |

Because SVD appears inside many differentiable systems, robust AD rules are essential.

Differential of Singular Values

The singular values are comparatively well behaved.

Suppose the singular values are distinct, and let u_i and v_i be the corresponding singular vectors:

Av_i = \sigma_i u_i, \qquad A^T u_i = \sigma_i v_i.

Differentiate:

dA\,v_i + A\,dv_i = d\sigma_i\,u_i + \sigma_i\,du_i.

Left-multiply by u_i^T:

u_i^T dA\,v_i + u_i^T A\,dv_i = d\sigma_i + \sigma_i u_i^T du_i.

Using

u_i^T A = \sigma_i v_i^T,

we obtain

u_i^T A\,dv_i = \sigma_i v_i^T dv_i.

Because singular vectors remain normalized,

u_i^T du_i = 0, \qquad v_i^T dv_i = 0.

Therefore:

d\sigma_i = u_i^T dA\,v_i.

This gives the gradient:

\nabla_A \sigma_i = u_i v_i^T.

This rule is stable even when singular vectors themselves are unstable.
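
The formula is easy to sanity-check with finite differences; a minimal sketch (the test matrix and the direction dA are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
dA = rng.standard_normal((4, 3))
eps = 1e-6

U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_pert = np.linalg.svd(A + eps * dA, compute_uv=False)

# Finite-difference directional derivative of each singular value ...
fd = (s_pert - s) / eps

# ... versus the analytic rule d(sigma_i) = u_i^T dA v_i.
analytic = np.array([U[:, i] @ dA @ Vt[i, :] for i in range(len(s))])
print(np.allclose(fd, analytic, atol=1e-4))  # True when the spectrum is well separated
```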

Reverse Rule for Singular Values

Suppose a scalar loss depends only on singular values:

L = f(\sigma_1, \ldots, \sigma_r).

Let

\bar{\sigma}_i = \frac{\partial L}{\partial \sigma_i}.

Then:

dL = \sum_i \bar{\sigma}_i\, d\sigma_i.

Substitute:

dL = \sum_i \bar{\sigma}_i\, u_i^T dA\, v_i.

Rewriting:

dL = \operatorname{tr}\left( U\,\operatorname{diag}(\bar{\sigma})\,V^T dA^T \right).

Therefore:

\bar{A} = U\,\operatorname{diag}(\bar{\sigma})\,V^T.

This is one of the most important spectral gradient formulas.
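
In code, this reverse rule is a one-liner once the factors are available; a minimal sketch (the example loss, the sum of singular values, is just an illustrative choice):

```python
import numpy as np

def svd_values_vjp(A, sigma_bar):
    """A_bar = U diag(sigma_bar) V^T, for losses that depend only on singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(sigma_bar) @ Vt

A = np.random.default_rng(2).standard_normal((5, 4))
# Example loss L = sum_i sigma_i, so sigma_bar_i = 1 for every i.
A_bar = svd_values_vjp(A, np.ones(4))
```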

Singular Vector Derivatives

Singular vectors are much more delicate.

Differentiating them produces terms involving:

\frac{1}{\sigma_i^2 - \sigma_j^2}.

If singular values are close, the derivatives become large. If singular values are equal, the derivatives become undefined.

This is analogous to eigenvector differentiation.

The instability arises because singular subspaces are intrinsic, but individual basis vectors are not.

If

\sigma_i = \sigma_j,

then any orthonormal basis of the corresponding singular subspace is valid.

The decomposition is therefore non-unique.

Sign Ambiguity

If

(u_i, v_i)

is a singular vector pair, then

(-u_i, -v_i)

represents the same singular mode.

Thus singular vectors have unavoidable sign ambiguity.

Small numerical perturbations may flip signs unpredictably:

u_i \to -u_i, \qquad v_i \to -v_i.

The forward computation remains correct, but gradients involving singular vectors may change discontinuously.

Losses depending directly on singular vector coordinates are therefore fragile.
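
A quick check of the ambiguity (a minimal sketch): flipping the sign of a matched pair (u_i, v_i) leaves the factorization intact.

```python
import numpy as np

A = np.random.default_rng(3).standard_normal((4, 4))
U, s, Vt = np.linalg.svd(A)

# Flip the first left and right singular vectors together.
U2, Vt2 = U.copy(), Vt.copy()
U2[:, 0] *= -1.0
Vt2[0, :] *= -1.0

print(np.allclose(U2 @ np.diag(s) @ Vt2, A))  # True: same matrix, different signs
```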

Full Reverse Rule

Suppose the loss depends on both singular values and singular vectors:

L(U, \Sigma, V).

Let:

\bar{U} = \frac{\partial L}{\partial U}, \qquad \bar{\Sigma} = \frac{\partial L}{\partial \Sigma}, \qquad \bar{V} = \frac{\partial L}{\partial V}.

Define the matrix F:

F_{ij} = \begin{cases} \frac{1}{\sigma_i^2 - \sigma_j^2}, & i \ne j, \\ 0, & i = j. \end{cases}

The backward rule contains terms:

F \odot (U^T \bar{U} - \bar{U}^T U),

and similarly for V.

The exact expression is complicated, but the important structure is simple:

  1. Singular-value gradients are stable.
  2. Singular-vector gradients contain inverse spectral gaps.
  3. Repeated singular values create undefined derivatives.
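
A hedged sketch of how the gap-dependent factor is typically assembled (only the F matrix and the antisymmetric term above are shown; the full backward expression differs between frameworks):

```python
import numpy as np

def spectral_gap_factor(s):
    """F_ij = 1 / (sigma_i^2 - sigma_j^2) off the diagonal, 0 on the diagonal."""
    gap = s[:, None] ** 2 - s[None, :] ** 2
    F = np.zeros_like(gap)
    off = ~np.eye(len(s), dtype=bool)
    F[off] = 1.0 / gap[off]   # blows up for repeated or nearly repeated singular values
    return F

def u_term(U, U_bar, s):
    """The F ⊙ (U^T U_bar - U_bar^T U) building block; the V term is analogous."""
    return spectral_gap_factor(s) * (U.T @ U_bar - U_bar.T @ U)
```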

Geometric Interpretation

SVD identifies orthogonal subspaces.

The stable object is the subspace projector:

UU^T, \qquad VV^T.

The basis vectors themselves are arbitrary coordinates inside the subspace.

This distinction is critical.

Good objectives depend on stable quantities rather than unstable ones:

| Stable Quantity | Unstable Quantity |
| --- | --- |
| Singular values | Singular vector signs |
| Subspace projectors | Individual basis vectors |
| Low-rank reconstructions | Raw singular coordinates |
| Spectral norms | Ordered vector identities |

When possible, design losses around invariant geometry.
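
A small sketch of the difference (both losses below are illustrative choices, not canonical objectives):

```python
import numpy as np

A = np.random.default_rng(4).standard_normal((6, 4))
_, _, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]                  # leading right singular vector
w = np.ones(4) / 2.0        # arbitrary reference direction

# Sign-sensitive: changes sign if the solver happens to return -v1.
fragile = w @ v1

# Sign-invariant: depends only on the projector v1 v1^T.
stable = w @ np.outer(v1, v1) @ w   # equals (w @ v1) ** 2
```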

Spectral Norm

The spectral norm is the largest singular value:

\|A\|_2 = \sigma_1.

If the largest singular value is simple:

\nabla_A \|A\|_2 = u_1 v_1^T.

This appears in:

| Use Case | Purpose |
| --- | --- |
| Spectral normalization | Stabilize neural networks |
| Lipschitz constraints | Robust optimization |
| Control theory | Gain bounds |
| Numerical analysis | Condition estimation |

If the top singular value is repeated, the gradient becomes set-valued.
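
A minimal sketch of this gradient (assuming the top singular value is simple):

```python
import numpy as np

def spectral_norm_and_grad(A):
    """Return ||A||_2 and its gradient u_1 v_1^T (simple top singular value assumed)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return s[0], np.outer(U[:, 0], Vt[0])

A = np.random.default_rng(5).standard_normal((5, 5))
norm, grad = spectral_norm_and_grad(A)
```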

Nuclear Norm

The nuclear norm is

\|A\|_* = \sum_i \sigma_i.

Its gradient for full-rank matrices with distinct singular values is

\nabla_A \|A\|_* = UV^T.

The nuclear norm is widely used for low-rank regularization.

At rank-deficient points, the derivative becomes non-unique and requires subgradient theory.
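
A minimal sketch of the nuclear norm and its gradient under those assumptions:

```python
import numpy as np

def nuclear_norm_and_grad(A):
    """Return ||A||_* and the gradient U V^T (full rank, distinct singular values assumed)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return s.sum(), U @ Vt

A = np.random.default_rng(6).standard_normal((6, 3))
norm, grad = nuclear_norm_and_grad(A)
```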

Low-Rank Approximation

Suppose

A_k = U_k \Sigma_k V_k^T

keeps only the top k singular values.

This operation is discontinuous when:

\sigma_k = \sigma_{k+1}.

Near rank transitions, gradients become unstable.

Hard truncation creates nondifferentiable boundaries.

Soft spectral weighting is often preferable:

\tilde{\Sigma}_{ii} = f(\sigma_i),

for a smooth function f.

Examples:

| Function | Effect |
| --- | --- |
| Soft threshold | Shrink singular values |
| Exponential decay | Smooth filtering |
| Logistic gating | Continuous truncation |
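
A hedged sketch of soft thresholding in place of hard truncation (the threshold value is an arbitrary illustrative choice):

```python
import numpy as np

def soft_threshold_svd(A, tau):
    """Shrink every singular value by tau instead of hard-truncating the rank."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_soft = np.maximum(s - tau, 0.0)   # continuous in A, unlike a hard rank cut
    return U @ np.diag(s_soft) @ Vt

A = np.random.default_rng(7).standard_normal((8, 6))
A_soft = soft_threshold_svd(A, tau=0.5)
```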

SVD and PCA

Principal component analysis computes the leading singular vectors of centered data.

Suppose:

X \in \mathbb{R}^{N \times d}.

PCA then factorizes the centered data matrix:

X = U\Sigma V^T.

The principal directions are the columns of V.

Differentiating through PCA inherits all singular-vector instability issues.

If principal components swap order under perturbation, gradients become discontinuous.

Using projectors onto the principal subspace,

VV^T,

is more stable than depending on individual components.
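
A small sketch of PCA via SVD that returns the subspace projector rather than raw components (the data and the number of components are placeholders):

```python
import numpy as np

def pca_projector(X, k):
    """Projector V_k V_k^T onto the top-k principal subspace of centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vk = Vt[:k].T
    return Vk @ Vk.T   # invariant to sign flips and re-basing within the subspace

X = np.random.default_rng(8).standard_normal((100, 10))
P = pca_projector(X, k=3)
```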

SVD and Matrix Completion

Recommendation systems often optimize low-rank factors:

A \approx UV^T.

Differentiating through an explicit SVD is usually avoided. Instead, systems optimize factors directly.

This avoids:

| Problem | Cause |
| --- | --- |
| Singular-vector instability | Spectral degeneracy |
| Expensive backward pass | Full decomposition |
| Rank-transition discontinuities | Hard truncation |

Direct factor parameterization is often more stable.

Differentiating Truncated SVD

Many applications compute only the top k singular values and vectors.

This creates additional challenges:

| Issue | Explanation |
| --- | --- |
| Ordering discontinuity | Singular values may swap |
| Rank instability | Small modes appear/disappear |
| Iterative solver truncation | Backward graph incomplete |
| Approximation error | Forward and backward mismatch |

For iterative methods like Lanczos or power iteration, implicit differentiation may be preferable.

Iterative SVD Methods

Large systems rarely compute full dense SVD.

Instead they use:

| Method | Use |
| --- | --- |
| Power iteration | Largest singular value |
| Lanczos | Partial spectrum |
| Randomized SVD | Large sparse matrices |
| Block Krylov | Fast low-rank approximation |

AD can:

  1. Differentiate through all iterations.
  2. Use implicit differentiation.

Differentiating through iterations creates long computation graphs and memory costs. Implicit methods are often more scalable.
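
A hedged sketch of the implicit route for the largest singular value: run power iteration on A^T A, then take the gradient from the closed form u_1 v_1^T instead of differentiating through the loop.

```python
import numpy as np

def top_singular_pair(A, iters=200):
    """Power iteration on A^T A; returns (sigma_1, u_1, v_1)."""
    v = np.random.default_rng(0).standard_normal(A.shape[1])
    for _ in range(iters):
        v = A.T @ (A @ v)
        v /= np.linalg.norm(v)
    sigma = np.linalg.norm(A @ v)
    return sigma, A @ v / sigma, v

A = np.random.default_rng(9).standard_normal((50, 20))
sigma1, u1, v1 = top_singular_pair(A)
grad_sigma1 = np.outer(u1, v1)   # implicit gradient; no unrolled computation graph
```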

Complex-Valued SVD

For complex matrices:

A = U\Sigma V^H.

The adjoint uses the conjugate transpose:

V^H.

Complex differentiation requires Wirtinger calculus or equivalent conventions.

Orthogonality becomes unitarity:

U^H U = I.

Complex spectral differentiation is even more subtle because phase ambiguity replaces sign ambiguity.

Numerical Stability

SVD backward passes are numerically sensitive.

Common stabilization techniques:

| Technique | Purpose |
| --- | --- |
| Spectral gap regularization | Avoid division blowup |
| Truncated spectra | Remove unstable modes |
| Symmetrization | Reduce numerical drift |
| Soft spectral functions | Avoid discontinuous truncation |
| Projector losses | Avoid basis instability |

No implementation can fully eliminate instability near repeated singular values because the mathematical derivative itself becomes singular.
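
One common guard is to regularize the reciprocal spectral gap before dividing; a minimal sketch (the clipping form and the value of eps are illustrative assumptions, not a standard recipe):

```python
import numpy as np

def safe_gap_factor(s, eps=1e-10):
    """Regularized F_ij = 1 / (sigma_i^2 - sigma_j^2): tiny gaps are clipped away from zero."""
    gap = s[:, None] ** 2 - s[None, :] ** 2
    sign = np.where(gap >= 0.0, 1.0, -1.0)
    F = sign / np.maximum(np.abs(gap), eps)   # bounded even for (near-)repeated singular values
    np.fill_diagonal(F, 0.0)
    return F
```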

Batch SVD

Modern tensor systems support batched SVD:

A : (B, m, n).

Each batch element is factorized independently:

A_b = U_b \Sigma_b V_b^T.

The backward pass must preserve:

- batch axes
- matrix axes
- spectral ordering
- orthogonality constraints

Shape errors in batched spectral code are common and difficult to debug.
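
For reference, `numpy.linalg.svd` already broadcasts over leading batch axes; a minimal sketch of a batched reconstruction check:

```python
import numpy as np

A = np.random.default_rng(10).standard_normal((4, 5, 3))   # batch of 4 matrices

# Batched thin SVD: U is (4, 5, 3), s is (4, 3), Vt is (4, 3, 3).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild each batch element; einsum keeps the batch axis explicit.
A_rec = np.einsum('bij,bj,bjk->bik', U, s, Vt)
print(np.allclose(A, A_rec))  # True
```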

Implementation Metadata

An SVD primitive should record:

- full vs reduced decomposition
- singular value ordering
- sign conventions
- batch dimensions
- real vs complex arithmetic
- rank truncation
- iterative solver details

The backward rule depends on all of these choices.

Practical Guidance

Prefer objectives based on:

| Stable | Avoid |
| --- | --- |
| Singular values | Raw singular vectors |
| Spectral norms | Vector coordinates |
| Nuclear norms | Arbitrary basis orientation |
| Subspace projectors | Sign-sensitive losses |
| Smooth spectral filters | Hard truncation |

When singular vectors are unavoidable:

| Strategy | Benefit |
| --- | --- |
| Add spectral gap regularization | Improves conditioning |
| Use projector formulations | Removes basis ambiguity |
| Avoid repeated spectra | Reduces instability |
| Use implicit methods | Better scalability |

Summary

SVD is one of the most expressive and difficult primitives in automatic differentiation. Singular values have stable first-order perturbations. Singular vectors do not. Their derivatives contain inverse spectral gaps and become undefined at repeated singular values.

The stable geometric object is the singular subspace, not the arbitrary basis chosen to represent it. Robust differentiable systems should therefore formulate objectives in terms of invariant spectral quantities whenever possible.