Chapter 121. Principal Component Analysis

Principal component analysis, or PCA, is a method for finding the main directions of variation in a dataset.

It converts correlated variables into new orthogonal variables called principal components. The first principal component points in the direction of greatest variance. The second principal component points in the direction of greatest remaining variance subject to being orthogonal to the first. The process continues until all directions have been found or until a chosen lower-dimensional representation is obtained.

PCA is built entirely out of linear algebra: centering, covariance matrices, eigenvalues, eigenvectors, orthogonal projection, and singular value decomposition. In practice, PCA is computed either from the covariance matrix or directly from the singular value decomposition of the centered data matrix.

121.1 Data Matrix

Let a dataset contain $m$ observations and $n$ features.

Write the data matrix as

$$ X = \begin{bmatrix} - & x_1^T & - \\ - & x_2^T & - \\ & \vdots & \\ - & x_m^T & - \end{bmatrix} \in \mathbb{R}^{m\times n}. $$

Each row $x_i^T$ is one observation. Each column is one feature.

For example, if each observation records height, weight, and age, then $n = 3$. If there are $m = 1000$ people, then $X$ is a $1000 \times 3$ matrix.

PCA studies the geometry of the rows of $X$ as points in $\mathbb{R}^n$.
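As a small illustration, a hypothetical toy dataset in NumPy (the numbers are invented for this sketch):

```python
import numpy as np

# Hypothetical dataset: m = 5 people, n = 3 features (height cm, weight kg, age yr).
X = np.array([
    [170.0, 65.0, 30.0],
    [160.0, 55.0, 25.0],
    [180.0, 80.0, 40.0],
    [175.0, 72.0, 35.0],
    [165.0, 60.0, 28.0],
])

m, n = X.shape  # rows are observations, columns are features
```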

121.2 Centering the Data

PCA is usually applied to centered data.

The mean vector is

$$ \mu = \frac{1}{m} \sum_{i=1}^m x_i. $$

The centered observations are

$$ \tilde{x}_i = x_i - \mu. $$

The centered data matrix is

$$ \tilde{X} = \begin{bmatrix} - & \tilde{x}_1^T & - \\ - & \tilde{x}_2^T & - \\ & \vdots & \\ - & \tilde{x}_m^T & - \end{bmatrix}. $$

Centering moves the cloud of data points so that its mean is at the origin.

This matters because PCA measures variation around the mean. Without centering, the first component may point toward the mean offset rather than toward the main direction of variation.
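A minimal sketch of the centering step in NumPy (random data with an artificial mean offset, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# data cloud with a deliberate offset away from the origin
X = rng.normal(size=(100, 3)) + np.array([10.0, -5.0, 2.0])

mu = X.mean(axis=0)   # feature-wise mean vector
X_tilde = X - mu      # centered data: the cloud's mean is now at the origin
```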

121.3 Variance in a Direction

Let $w \in \mathbb{R}^n$ be a unit vector:

$$ \|w\| = 1. $$

The projection of a centered observation $\tilde{x}_i$ onto $w$ is

$$ \tilde{x}_i^T w. $$

The variance of the data along direction $w$ is proportional to

$$ \sum_{i=1}^m (\tilde{x}_i^T w)^2. $$

In matrix form,

$$ \sum_{i=1}^m (\tilde{x}_i^T w)^2 = \|\tilde{X} w\|^2 = w^T \tilde{X}^T \tilde{X} w. $$

Thus the direction of maximum variance solves

$$ \max_{\|w\|=1} w^T \tilde{X}^T \tilde{X} w. $$

This is a Rayleigh quotient problem.
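The identity between the sum of squared projections and the quadratic form can be checked numerically; a sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])  # more spread along axis 0
X_tilde = X - X.mean(axis=0)

w = np.array([1.0, 0.0])                  # a unit direction
proj = X_tilde @ w                        # projections x~_i^T w
sum_sq = np.sum(proj ** 2)                # sum of squared projections
quad = w @ (X_tilde.T @ X_tilde) @ w      # quadratic form w^T X~^T X~ w
```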

121.4 Covariance Matrix

The sample covariance matrix is

$$ S = \frac{1}{m-1} \tilde{X}^T \tilde{X}. $$

The entry $S_{ij}$ measures how feature $i$ and feature $j$ vary together.

The matrix $S$ is symmetric and positive semidefinite.

For any vector $w$,

$$ w^T S w \geq 0. $$

The variance of the data along a unit direction $w$ is

$$ w^T S w. $$

Therefore PCA is the problem of finding orthonormal directions that diagonalize the covariance matrix. The eigenvector associated with the largest eigenvalue of the covariance or correlation matrix gives the first principal component, and later components are ordered by decreasing eigenvalue.
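Forming the sample covariance matrix and checking symmetry and positive semidefiniteness, as a sketch with random data standing in for a real dataset:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X_tilde = X - X.mean(axis=0)
m = X_tilde.shape[0]

S = X_tilde.T @ X_tilde / (m - 1)    # sample covariance matrix
eigvals = np.linalg.eigvalsh(S)      # real eigenvalues of a symmetric matrix
```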

121.5 Principal Components

A principal component is a unit direction $w$ in feature space.

The first principal component is

$$ w_1 = \arg\max_{\|w\|=1} w^T S w. $$

The solution is an eigenvector of $S$ with the largest eigenvalue.

If

$$ S w_1 = \lambda_1 w_1, $$

where

$$ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n, $$

then $w_1$ is the first principal direction.

The second principal component is the unit vector orthogonal to $w_1$ that maximizes variance. It is an eigenvector corresponding to $\lambda_2$.

Continuing in this way gives an orthonormal eigenbasis

$$ w_1, w_2, \ldots, w_n. $$
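The full set of principal directions can be obtained from a symmetric eigendecomposition. A sketch (note that `np.linalg.eigh` returns eigenvalues in ascending order, so they are re-sorted descending):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3)) * np.array([5.0, 2.0, 0.5])
X_tilde = X - X.mean(axis=0)
S = X_tilde.T @ X_tilde / (X_tilde.shape[0] - 1)

lam, W = np.linalg.eigh(S)           # ascending eigenvalues
order = np.argsort(lam)[::-1]        # re-sort to lambda_1 >= lambda_2 >= ...
lam, W = lam[order], W[:, order]

w1 = W[:, 0]                         # first principal direction (unit vector)
```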

121.6 Eigenvalue Decomposition

Since $S$ is symmetric, the spectral theorem gives

$$ S = W \Lambda W^T. $$

Here

$$ W = \begin{bmatrix} | & | & & | \\ w_1 & w_2 & \cdots & w_n \\ | & | & & | \end{bmatrix} $$

is an orthogonal matrix whose columns are eigenvectors, and

$$ \Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} $$

contains the eigenvalues.

The eigenvalues are nonnegative because $S$ is positive semidefinite.

The eigenvectors define the new coordinate axes. The eigenvalues measure variance along those axes.

121.7 Scores

The principal component scores are the coordinates of the centered data in the principal component basis.

For one observation,

$$ z_i = W^T \tilde{x}_i. $$

For all observations,

$$ Z = \tilde{X} W. $$

The first column of $Z$ contains the coordinates along the first principal component. The second column contains the coordinates along the second principal component, and so on.

The covariance matrix of the transformed data is diagonal:

$$ \frac{1}{m-1} Z^T Z = \Lambda. $$

Thus PCA finds a coordinate system in which the features are uncorrelated.
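That the scores are uncorrelated can be verified directly; a sketch with deliberately correlated synthetic features (the mixing matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
A = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0], [1.0, 0.0, 1.0]])
X = rng.normal(size=(500, 3)) @ A     # mix features so they correlate
X_tilde = X - X.mean(axis=0)
m = X_tilde.shape[0]
S = X_tilde.T @ X_tilde / (m - 1)

lam, W = np.linalg.eigh(S)
lam, W = lam[::-1], W[:, ::-1]        # descending eigenvalue order

Z = X_tilde @ W                       # scores
cov_Z = Z.T @ Z / (m - 1)             # diagonal: equals diag(lam)
```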

121.8 Dimensionality Reduction

PCA is often used to reduce dimension.

Choose $r < n$ principal directions and form

$$ W_r = \begin{bmatrix} | & | & & | \\ w_1 & w_2 & \cdots & w_r \\ | & | & & | \end{bmatrix}. $$

The reduced representation of an observation is

$$ z_i = W_r^T \tilde{x}_i \in \mathbb{R}^r. $$

For all observations,

$$ Z_r = \tilde{X} W_r. $$

This replaces the $n$ original features by $r$ principal component coordinates.

The new coordinates preserve as much variance as possible among all $r$-dimensional orthogonal projections.
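A sketch of the reduction step (synthetic data; `r = 2` chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
X_tilde = X - X.mean(axis=0)
S = X_tilde.T @ X_tilde / (X_tilde.shape[0] - 1)

lam, W = np.linalg.eigh(S)
lam, W = lam[::-1], W[:, ::-1]   # descending eigenvalue order

r = 2
W_r = W[:, :r]                   # top-r principal directions
Z_r = X_tilde @ W_r              # reduced coordinates, one r-vector per row
```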

121.9 Reconstruction

The reduced coordinates can be mapped back to the original feature space.

For one observation,

$$ \hat{x}_i = \mu + W_r z_i. $$

Since

$$ z_i = W_r^T (x_i - \mu), $$

the reconstruction is

$$ \hat{x}_i = \mu + W_r W_r^T (x_i - \mu). $$

The matrix

$$ P_r = W_r W_r^T $$

is the orthogonal projection onto the subspace spanned by the first $r$ principal components.

Thus PCA approximation is projection onto a data-adapted subspace.

121.10 Reconstruction Error

The reconstruction error for one centered observation is

$$ \|\tilde{x}_i - W_r W_r^T \tilde{x}_i\|^2. $$

For the whole dataset, the total squared reconstruction error is

$$ \|\tilde{X} - \tilde{X} W_r W_r^T\|_F^2. $$

PCA chooses the $r$-dimensional subspace that minimizes this error among all $r$-dimensional linear subspaces.

The minimum error equals the sum of the discarded eigenvalues, up to the same scaling used in the covariance matrix:

$$ \sum_{j=r+1}^n \lambda_j. $$

Thus small discarded eigenvalues mean little information is lost.
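The error identity can be checked numerically. In the unnormalized form used here, the total squared error equals $(m-1)$ times the sum of the discarded covariance eigenvalues (a sketch):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 4)) * np.array([4.0, 2.0, 1.0, 0.3])
X_tilde = X - X.mean(axis=0)
m = X_tilde.shape[0]
S = X_tilde.T @ X_tilde / (m - 1)

lam, W = np.linalg.eigh(S)
lam, W = lam[::-1], W[:, ::-1]        # descending eigenvalue order

r = 2
W_r = W[:, :r]
X_hat = X_tilde @ W_r @ W_r.T                        # rank-r reconstruction
err = np.linalg.norm(X_tilde - X_hat, 'fro') ** 2    # total squared error
```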

121.11 Explained Variance

The eigenvalue $\lambda_j$ is the variance explained by principal component $j$.

The total variance is

$$ \lambda_1 + \lambda_2 + \cdots + \lambda_n. $$

The explained variance ratio for component $j$ is

$$ \frac{\lambda_j}{\lambda_1 + \lambda_2 + \cdots + \lambda_n}. $$

The cumulative explained variance for the first $r$ components is

$$ \frac{\lambda_1 + \cdots + \lambda_r}{\lambda_1 + \cdots + \lambda_n}. $$

This ratio helps choose the number of components.

For example, one may choose the smallest $r$ such that the cumulative explained variance is at least $0.95$.
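A sketch of this selection rule (the 0.95 threshold and the synthetic feature variances are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6)) * np.array([10.0, 5.0, 2.0, 0.5, 0.2, 0.1])
X_tilde = X - X.mean(axis=0)
S = X_tilde.T @ X_tilde / (X_tilde.shape[0] - 1)

lam = np.linalg.eigvalsh(S)[::-1]                # descending eigenvalues
ratio = lam / lam.sum()                          # explained variance ratios
cumulative = np.cumsum(ratio)
r = int(np.searchsorted(cumulative, 0.95)) + 1   # smallest r reaching 95%
```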

121.12 Singular Value Decomposition

PCA can also be computed from the singular value decomposition of the centered data matrix:

$$ \tilde{X} = U \Sigma V^T. $$

Here the columns of $V$ are the right singular vectors.

Since

$$ \tilde{X}^T \tilde{X} = V \Sigma^T \Sigma V^T, $$

the columns of $V$ are eigenvectors of $\tilde{X}^T \tilde{X}$, hence principal directions.

The eigenvalues of the covariance matrix satisfy

$$ \lambda_j = \frac{\sigma_j^2}{m-1}, $$

where $\sigma_j$ is the $j$-th singular value.

This gives a direct connection between PCA and SVD. Many implementations compute PCA through SVD, especially for numerical stability and efficiency.
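The eigenvalue / singular value relationship can be confirmed numerically by computing the spectrum both ways (a sketch):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(150, 4))
X_tilde = X - X.mean(axis=0)
m = X_tilde.shape[0]

# route 1: singular values of the centered data matrix
U, sigma, Vt = np.linalg.svd(X_tilde, full_matrices=False)
lam_svd = sigma ** 2 / (m - 1)

# route 2: eigenvalues of the covariance matrix, sorted descending
S = X_tilde.T @ X_tilde / (m - 1)
lam_eig = np.linalg.eigvalsh(S)[::-1]
```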

121.13 Low-Rank Approximation

The truncated SVD gives

$$ \tilde{X}_r = U_r \Sigma_r V_r^T. $$

This is the best rank-$r$ approximation to $\tilde{X}$ in the Frobenius norm.

PCA and truncated SVD are therefore two views of the same operation:

| PCA language | SVD language |
| --- | --- |
| Principal directions | Right singular vectors |
| Scores | $U_r \Sigma_r$ |
| Explained variance | Squared singular values |
| Projection subspace | Span of $V_r$ |
| Low-dimensional data | $\tilde{X} V_r$ |
| Reconstruction | $U_r \Sigma_r V_r^T$ |

This is one of the cleanest links between statistics and matrix factorization.

121.14 Correlation PCA

Sometimes features have very different units or scales.

For example, one feature may be measured in dollars and another in millimeters. A feature with large numerical scale may dominate the covariance matrix.

To avoid this, one may standardize each feature:

$$ x_{ij}^{\text{std}} = \frac{x_{ij} - \mu_j}{s_j}, $$

where $\mu_j$ is the feature mean and $s_j$ is the feature standard deviation.

PCA on standardized data is equivalent to PCA using the correlation matrix.

This is useful when relative variation matters more than raw units. Library documentation such as the oneDAL notes distinguishes covariance-based and correlation-based PCA, with the choice depending on whether feature scaling is important.
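A sketch of standardization, confirming that the covariance of standardized data equals the correlation matrix (the feature scales are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(9)
# features on wildly different scales, e.g. dollars vs millimeters
X = rng.normal(size=(200, 3)) * np.array([1000.0, 1.0, 0.01])

mu = X.mean(axis=0)
s = X.std(axis=0, ddof=1)          # sample standard deviations
X_std = (X - mu) / s               # standardized data

S_std = X_std.T @ X_std / (X.shape[0] - 1)   # covariance of standardized data
R = np.corrcoef(X, rowvar=False)             # correlation matrix of raw data
```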

121.15 Whitening

PCA can be used for whitening.

Let

$$ S = W \Lambda W^T. $$

The PCA coordinates are

$$ z = W^T \tilde{x}. $$

Their covariance matrix is $\Lambda$.

To make the covariance the identity, define

$$ u = \Lambda^{-1/2} W^T \tilde{x}. $$

Then

$$ \operatorname{Cov}(u) = I, $$

assuming positive eigenvalues.

Whitening removes correlations and rescales each principal direction to unit variance.

It is useful in preprocessing, signal processing, independent component analysis, and some machine learning workflows.
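A sketch of PCA whitening on correlated synthetic data (the mixing matrix is arbitrary; all covariance eigenvalues are assumed positive, as required):

```python
import numpy as np

rng = np.random.default_rng(10)
A = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0], [1.0, 0.0, 1.0]])
X = rng.normal(size=(500, 3)) @ A       # correlated features
X_tilde = X - X.mean(axis=0)
m = X_tilde.shape[0]
S = X_tilde.T @ X_tilde / (m - 1)

lam, W = np.linalg.eigh(S)              # eigenvalues all positive here
U = X_tilde @ W / np.sqrt(lam)          # u = Lambda^{-1/2} W^T x~, row-wise
cov_U = U.T @ U / (m - 1)               # identity, up to floating point
```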

121.16 Biplots and Scree Plots

PCA is often visualized with plots.

A scree plot shows eigenvalues or explained variance ratios in decreasing order.

A sharp drop suggests that the first few components capture most of the structure.

A biplot shows observations and feature loadings together in a low-dimensional principal component plane.

These plots help interpret which variables contribute to each component and how observations are arranged in the reduced space.

Scree plots and biplots are standard PCA interpretation tools.

121.17 Loadings

The entries of a principal direction are called loadings.

If

$$ w_j = \begin{bmatrix} w_{1j} \\ w_{2j} \\ \vdots \\ w_{nj} \end{bmatrix}, $$

then $w_{kj}$ is the loading of feature $k$ on component $j$.

Large positive or negative loadings indicate that a feature contributes strongly to the component.

Loadings help interpret components.

For example, in a dataset of body measurements, one component might have positive loadings on height, weight, and limb length. This component may represent overall body size.

Another component may contrast height against weight and represent shape.

121.18 Sign Ambiguity

Principal components have arbitrary sign.

If $w$ is an eigenvector, then

$$ -w $$

is also an eigenvector with the same eigenvalue.

Therefore PCA results may differ by signs across software packages or runs.

This does not change the subspace, explained variance, reconstruction error, or geometry.

If a principal component score changes sign, the corresponding loading vector also changes sign. The interpretation should account for this convention.

121.19 Rank and Degeneracy

If the centered data matrix has rank $r$, then only $r$ eigenvalues of the covariance matrix are positive.

Since centering subtracts the mean, the rank satisfies

$$ \operatorname{rank}(\tilde{X}) \leq \min(m-1, n). $$

Thus if $n$ is large and $m$ is small, many covariance eigenvalues are zero.

When eigenvalues are repeated, the individual eigenvectors inside the repeated eigenspace are not unique. Only the subspace is determined.

This matters when interpreting components with nearly equal eigenvalues.

121.20 PCA as Rotation

PCA can be viewed as rotating the coordinate axes.

The centered data are transformed by

$$ Z = \tilde{X} W. $$

Since $W$ is orthogonal, this transformation preserves distances and angles.

It does not distort the data. It merely changes the coordinate system.

After rotation, the covariance matrix becomes diagonal.

Thus PCA first rotates the data into uncorrelated coordinates, then optionally drops low-variance coordinates.
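The distance-preserving property of the rotation can be checked directly (a sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(12)
X = rng.normal(size=(50, 3))
X_tilde = X - X.mean(axis=0)
S = X_tilde.T @ X_tilde / (X_tilde.shape[0] - 1)

_, W = np.linalg.eigh(S)
Z = X_tilde @ W                  # full rotation: no coordinates dropped

# pairwise distances are unchanged because W is orthogonal
d_before = np.linalg.norm(X_tilde[0] - X_tilde[1])
d_after = np.linalg.norm(Z[0] - Z[1])
```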

121.21 PCA as Projection

PCA can also be viewed as projection.

The reduced reconstruction is

$$ \hat{x} = \mu + W_r W_r^T (x - \mu). $$

Here $W_r W_r^T$ projects onto the principal subspace.

This viewpoint emphasizes approximation.

The original point is replaced by its closest point in the selected principal subspace.

The error vector is orthogonal to that subspace.

This is the same geometry as least squares.

121.22 PCA and Noise

PCA is often used for denoising.

Suppose the signal lies mostly in a low-dimensional subspace, while noise spreads across many directions.

Then the leading principal components capture the signal, and the trailing components capture noise.

Keeping only the first $r$ components gives

$$ \hat{X} = \tilde{X} W_r W_r^T. $$

This may suppress noise.

However, PCA does not know which variance is meaningful. If noise has large variance, PCA may preserve noise. If signal has small variance, PCA may discard signal.

Thus PCA is a variance-based method, not a meaning-based method.

121.23 PCA and Compression

PCA compresses data by storing reduced coordinates and a basis.

Instead of storing each centered observation in $\mathbb{R}^n$, store

$$ z_i \in \mathbb{R}^r $$

and the matrix

$$ W_r \in \mathbb{R}^{n \times r}. $$

Approximate reconstruction uses

$$ \hat{x}_i = \mu + W_r z_i. $$

This can reduce storage when $r \ll n$.

For images, PCA may represent many similar images using a small number of basis images, sometimes called eigenfaces in face analysis.

121.24 Limitations

PCA is powerful but limited.

| Limitation | Explanation |
| --- | --- |
| Linear only | PCA finds linear subspaces |
| Variance-based | High variance may not mean useful signal |
| Sensitive to scaling | Feature units affect covariance PCA |
| Sensitive to outliers | Extreme points can dominate directions |
| Components may be hard to interpret | Orthogonal directions may mix features |
| Sign is arbitrary | Component signs have no intrinsic meaning |
| Repeated eigenvalues reduce uniqueness | Only eigenspaces are stable |

These limitations are not defects in the mathematics. They describe the assumptions PCA makes.

PCA is appropriate when linear structure and variance capture the main information.

121.25 Kernel PCA

Kernel PCA extends PCA to nonlinear feature spaces.

A kernel function has the form

$$ K(x, z) = \langle \phi(x), \phi(z) \rangle, $$

where $\phi$ maps data into a possibly high-dimensional feature space.

Kernel PCA performs PCA using the Gram matrix

$$ G_{ij} = K(x_i, x_j) $$

rather than explicitly forming $\phi(x_i)$.

This can reveal nonlinear structure in the original space.

The method still relies on eigenvalue decomposition, but the matrix being diagonalized is a kernel matrix rather than an ordinary covariance matrix.
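A minimal kernel PCA sketch with a Gaussian (RBF) kernel. The `gamma` value and the data are illustrative; the centering step $G_c = JGJ$ with $J = I - \frac{1}{m}\mathbf{1}\mathbf{1}^T$ centers the implicit feature vectors without computing them:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Gram matrix G_ij = exp(-gamma * ||x_i - x_j||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(11)
X = rng.normal(size=(40, 2))

G = rbf_kernel(X)
m = G.shape[0]
J = np.eye(m) - np.ones((m, m)) / m
G_c = J @ G @ J                       # center the kernel matrix in feature space

lam, A = np.linalg.eigh(G_c)
lam, A = lam[::-1], A[:, ::-1]        # leading kernel principal components first
```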

121.26 PCA and Linear Algebra

The dictionary is direct.

| PCA concept | Linear algebra object |
| --- | --- |
| Dataset | Matrix |
| Feature mean | Vector |
| Centering | Translation to the origin |
| Covariance | Symmetric positive semidefinite matrix |
| Principal direction | Eigenvector |
| Explained variance | Eigenvalue |
| Scores | Coordinates in the eigenbasis |
| Dimension reduction | Projection onto a subspace |
| Reconstruction | Orthogonal projection |
| Compression | Low-rank approximation |
| Whitening | Diagonal rescaling |
| SVD computation | Matrix factorization |

PCA is one of the clearest examples of linear algebra used as a data analysis method.

121.27 Summary

Principal component analysis finds orthogonal directions of maximum variance in centered data.

The covariance matrix

$$ S = \frac{1}{m-1} \tilde{X}^T \tilde{X} $$

is diagonalized as

$$ S = W \Lambda W^T. $$

The columns of $W$ are principal directions. The eigenvalues in $\Lambda$ are the variances explained by those directions.

Reducing dimension means keeping the first $r$ eigenvectors and projecting the data onto their span. Reconstruction is orthogonal projection back into the original space. The discarded eigenvalues measure the lost variance.

PCA can also be computed through the singular value decomposition

$$ \tilde{X} = U \Sigma V^T. $$

The right singular vectors are the principal directions, and the squared singular values determine explained variance.

The central principle is that PCA chooses the coordinate system in which the covariance matrix is diagonal and the largest variation appears first.