Chapter 121. Principal Component Analysis

Principal component analysis, or PCA, is a method for finding the main directions of variation in a dataset.

It converts correlated variables into new orthogonal variables called principal components. The first principal component points in the direction of greatest variance. The second principal component points in the direction of greatest remaining variance subject to being orthogonal to the first. The process continues until all directions have been found or until a chosen lower-dimensional representation is obtained.

PCA is built entirely out of linear algebra: centering, covariance matrices, eigenvalues, eigenvectors, orthogonal projection, and singular value decomposition. In practice, PCA is computed either from the covariance matrix or directly from the singular value decomposition of the centered data matrix.

121.1 Data Matrix

Let a dataset contain $m$ observations and $n$ features.

Write the data matrix as

$$ X = \begin{bmatrix} - & x_1^T & - \\ - & x_2^T & - \\ & \vdots & \\ - & x_m^T & - \end{bmatrix} \in \mathbb{R}^{m\times n}. $$

Each row $x_i^T$ is one observation. Each column is one feature.

For example, if each observation records height, weight, and age, then $n = 3$. If there are $m = 1000$ people, then $X$ is a $1000 \times 3$ matrix.

PCA studies the geometry of the rows of $X$ as points in $\mathbb{R}^n$.
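As a small illustration, a hypothetical toy dataset in NumPy (the numbers are invented for this sketch):

```python
import numpy as np

# Hypothetical dataset: m = 5 people, n = 3 features (height cm, weight kg, age yr).
X = np.array([
    [170.0, 65.0, 30.0],
    [160.0, 55.0, 25.0],
    [180.0, 80.0, 40.0],
    [175.0, 72.0, 35.0],
    [165.0, 60.0, 28.0],
])

m, n = X.shape  # rows are observations, columns are features
```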

121.2 Centering the Data

PCA is usually applied to centered data.

The mean vector is

$$ \mu = \frac{1}{m} \sum_{i=1}^m x_i. $$

The centered observations are

$$ \tilde{x}_i = x_i - \mu. $$

The centered data matrix is

$$ \tilde{X} = \begin{bmatrix} - & \tilde{x}_1^T & - \\ - & \tilde{x}_2^T & - \\ & \vdots & \\ - & \tilde{x}_m^T & - \end{bmatrix}. $$

Centering moves the cloud of data points so that its mean is at the origin.

This matters because PCA measures variation around the mean. Without centering, the first component may point toward the mean offset rather than toward the main direction of variation.
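A minimal sketch of the centering step in NumPy (random data with an artificial mean offset, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# data cloud with a deliberate offset away from the origin
X = rng.normal(size=(100, 3)) + np.array([10.0, -5.0, 2.0])

mu = X.mean(axis=0)   # feature-wise mean vector
X_tilde = X - mu      # centered data: the cloud's mean is now at the origin
```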

121.3 Variance in a Direction

Let $w \in \mathbb{R}^n$ be a unit vector:

$$ \|w\| = 1. $$

The projection of a centered observation $\tilde{x}_i$ onto $w$ is

$$ \tilde{x}_i^T w. $$

The variance of the data along direction $w$ is proportional to

$$ \sum_{i=1}^m (\tilde{x}_i^T w)^2. $$

In matrix form,

$$ \sum_{i=1}^m (\tilde{x}_i^T w)^2 = \|\tilde{X} w\|^2 = w^T \tilde{X}^T \tilde{X} w. $$

Thus the direction of maximum variance solves

$$ \max_{\|w\|=1} w^T \tilde{X}^T \tilde{X} w. $$

This is a Rayleigh quotient problem.
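The identity between the sum of squared projections and the quadratic form can be checked numerically; a sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])  # more spread along axis 0
X_tilde = X - X.mean(axis=0)

w = np.array([1.0, 0.0])                  # a unit direction
proj = X_tilde @ w                        # projections x~_i^T w
sum_sq = np.sum(proj ** 2)                # sum of squared projections
quad = w @ (X_tilde.T @ X_tilde) @ w      # quadratic form w^T X~^T X~ w
```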

121.4 Covariance Matrix

The sample covariance matrix is

$$ S = \frac{1}{m-1} \tilde{X}^T \tilde{X}. $$

The entry $S_{ij}$ measures how feature $i$ and feature $j$ vary together.

The matrix $S$ is symmetric and positive semidefinite.

For any vector $w$,

$$ w^T S w \geq 0. $$

The variance of the data along a unit direction $w$ is

$$ w^T S w. $$

Therefore PCA is the problem of finding orthonormal directions that diagonalize the covariance matrix. The eigenvector associated with the largest eigenvalue of the covariance or correlation matrix gives the first principal component, and later components are ordered by decreasing eigenvalue.
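Forming the sample covariance matrix and checking symmetry and positive semidefiniteness, as a sketch with random data standing in for a real dataset:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X_tilde = X - X.mean(axis=0)
m = X_tilde.shape[0]

S = X_tilde.T @ X_tilde / (m - 1)    # sample covariance matrix
eigvals = np.linalg.eigvalsh(S)      # real eigenvalues of a symmetric matrix
```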

121.5 Principal Components

A principal component is a unit direction $w$ in feature space.

The first principal component is

$$ w_1 = \arg\max_{\|w\|=1} w^T S w. $$

The solution is an eigenvector of $S$ with the largest eigenvalue.

If

$$ S w_1 = \lambda_1 w_1, $$

where

$$ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n, $$

then $w_1$ is the first principal direction.

The second principal component is the unit vector orthogonal to $w_1$ that maximizes variance. It is an eigenvector corresponding to $\lambda_2$.

Continuing in this way gives an orthonormal eigenbasis

$$ w_1, w_2, \ldots, w_n. $$
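The full set of principal directions can be obtained from a symmetric eigendecomposition. A sketch (note that `np.linalg.eigh` returns eigenvalues in ascending order, so they are re-sorted descending):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3)) * np.array([5.0, 2.0, 0.5])
X_tilde = X - X.mean(axis=0)
S = X_tilde.T @ X_tilde / (X_tilde.shape[0] - 1)

lam, W = np.linalg.eigh(S)           # ascending eigenvalues
order = np.argsort(lam)[::-1]        # re-sort to lambda_1 >= lambda_2 >= ...
lam, W = lam[order], W[:, order]

w1 = W[:, 0]                         # first principal direction (unit vector)
```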

121.6 Eigenvalue Decomposition

Since $S$ is symmetric, the spectral theorem gives

$$ S = W \Lambda W^T. $$

Here

$$ W = \begin{bmatrix} | & | & & | \\ w_1 & w_2 & \cdots & w_n \\ | & | & & | \end{bmatrix} $$

is an orthogonal matrix whose columns are eigenvectors, and

$$ \Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} $$

contains the eigenvalues.

The eigenvalues are nonnegative because $S$ is positive semidefinite.

The eigenvectors define the new coordinate axes. The eigenvalues measure variance along those axes.

121.7 Scores

The principal component scores are the coordinates of the centered data in the principal component basis.

For one observation,

$$ z_i = W^T \tilde{x}_i. $$

For all observations,

$$ Z = \tilde{X} W. $$

The first column of $Z$ contains the coordinates along the first principal component. The second column contains the coordinates along the second principal component, and so on.

The covariance matrix of the transformed data is diagonal:

$$ \frac{1}{m-1} Z^T Z = \Lambda. $$

Thus PCA finds a coordinate system in which the features are uncorrelated.
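That the scores are uncorrelated can be verified directly; a sketch with deliberately correlated synthetic features (the mixing matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
A = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0], [1.0, 0.0, 1.0]])
X = rng.normal(size=(500, 3)) @ A     # mix features so they correlate
X_tilde = X - X.mean(axis=0)
m = X_tilde.shape[0]
S = X_tilde.T @ X_tilde / (m - 1)

lam, W = np.linalg.eigh(S)
lam, W = lam[::-1], W[:, ::-1]        # descending eigenvalue order

Z = X_tilde @ W                       # scores
cov_Z = Z.T @ Z / (m - 1)             # diagonal: equals diag(lam)
```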

121.8 Dimensionality Reduction

PCA is often used to reduce dimension.

Choose $r < n$ principal directions and form

$$ W_r = \begin{bmatrix} | & | & & | \\ w_1 & w_2 & \cdots & w_r \\ | & | & & | \end{bmatrix}. $$

The reduced representation of an observation is

$$ z_i = W_r^T \tilde{x}_i \in \mathbb{R}^r. $$

For all observations,

$$ Z_r = \tilde{X} W_r. $$

This replaces the $n$ original features by $r$ principal component coordinates.

The new coordinates preserve as much variance as possible among all $r$-dimensional orthogonal projections.
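A sketch of the reduction step (synthetic data; `r = 2` chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
X_tilde = X - X.mean(axis=0)
S = X_tilde.T @ X_tilde / (X_tilde.shape[0] - 1)

lam, W = np.linalg.eigh(S)
lam, W = lam[::-1], W[:, ::-1]   # descending eigenvalue order

r = 2
W_r = W[:, :r]                   # top-r principal directions
Z_r = X_tilde @ W_r              # reduced coordinates, one r-vector per row
```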

121.9 Reconstruction

The reduced coordinates can be mapped back to the original feature space.

For one observation,

$$ \hat{x}_i = \mu + W_r z_i. $$

Since

$$ z_i = W_r^T (x_i - \mu), $$

the reconstruction is

$$ \hat{x}_i = \mu + W_r W_r^T (x_i - \mu). $$

The matrix

$$ P_r = W_r W_r^T $$

is the orthogonal projection onto the subspace spanned by the first $r$ principal components.

Thus PCA approximation is projection onto a data-adapted subspace.

121.10 Reconstruction Error

The reconstruction error for one centered observation is

$$ \|\tilde{x}_i - W_r W_r^T \tilde{x}_i\|^2. $$

For the whole dataset, the total squared reconstruction error is

$$ \|\tilde{X} - \tilde{X} W_r W_r^T\|_F^2. $$

PCA chooses the $r$-dimensional subspace that minimizes this error among all $r$-dimensional linear subspaces.

The minimum error equals the sum of the discarded eigenvalues, up to the same scaling used in the covariance matrix:

$$ \sum_{j=r+1}^n \lambda_j. $$

Thus small discarded eigenvalues mean little information is lost.
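The error identity can be checked numerically. In the unnormalized form used here, the total squared error equals $(m-1)$ times the sum of the discarded covariance eigenvalues (a sketch):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 4)) * np.array([4.0, 2.0, 1.0, 0.3])
X_tilde = X - X.mean(axis=0)
m = X_tilde.shape[0]
S = X_tilde.T @ X_tilde / (m - 1)

lam, W = np.linalg.eigh(S)
lam, W = lam[::-1], W[:, ::-1]        # descending eigenvalue order

r = 2
W_r = W[:, :r]
X_hat = X_tilde @ W_r @ W_r.T                        # rank-r reconstruction
err = np.linalg.norm(X_tilde - X_hat, 'fro') ** 2    # total squared error
```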

121.11 Explained Variance

The eigenvalue $\lambda_j$ is the variance explained by principal component $j$.

The total variance is

$$ \lambda_1 + \lambda_2 + \cdots + \lambda_n. $$

The explained variance ratio for component $j$ is

$$ \frac{\lambda_j}{\lambda_1 + \lambda_2 + \cdots + \lambda_n}. $$

The cumulative explained variance for the first $r$ components is

$$ \frac{\lambda_1 + \cdots + \lambda_r}{\lambda_1 + \cdots + \lambda_n}. $$

This ratio helps choose the number of components.

For example, one may choose the smallest $r$ such that the cumulative explained variance is at least $0.95$.
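A sketch of this selection rule (the 0.95 threshold and the synthetic feature variances are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6)) * np.array([10.0, 5.0, 2.0, 0.5, 0.2, 0.1])
X_tilde = X - X.mean(axis=0)
S = X_tilde.T @ X_tilde / (X_tilde.shape[0] - 1)

lam = np.linalg.eigvalsh(S)[::-1]                # descending eigenvalues
ratio = lam / lam.sum()                          # explained variance ratios
cumulative = np.cumsum(ratio)
r = int(np.searchsorted(cumulative, 0.95)) + 1   # smallest r reaching 95%
```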

121.12 Singular Value Decomposition

PCA can also be computed from the singular value decomposition of the centered data matrix:

$$ \tilde{X} = U \Sigma V^T. $$

Here the columns of $V$ are the right singular vectors.

Since

$$ \tilde{X}^T \tilde{X} = V \Sigma^T \Sigma V^T, $$

the columns of $V$ are eigenvectors of $\tilde{X}^T \tilde{X}$, hence principal directions.

The eigenvalues of the covariance matrix satisfy

$$ \lambda_j = \frac{\sigma_j^2}{m-1}, $$

where $\sigma_j$ is the $j$-th singular value.

This gives a direct connection between PCA and SVD. Many implementations compute PCA through SVD, especially for numerical stability and efficiency.
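The eigenvalue / singular value relationship can be confirmed numerically by computing the spectrum both ways (a sketch):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(150, 4))
X_tilde = X - X.mean(axis=0)
m = X_tilde.shape[0]

# route 1: singular values of the centered data matrix
U, sigma, Vt = np.linalg.svd(X_tilde, full_matrices=False)
lam_svd = sigma ** 2 / (m - 1)

# route 2: eigenvalues of the covariance matrix, sorted descending
S = X_tilde.T @ X_tilde / (m - 1)
lam_eig = np.linalg.eigvalsh(S)[::-1]
```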

121.13 Low-Rank Approximation

The truncated SVD gives

$$ \tilde{X}_r = U_r \Sigma_r V_r^T. $$

This is the best rank-$r$ approximation to $\tilde{X}$ in the Frobenius norm.

PCA and truncated SVD are therefore two views of the same operation:

| PCA language | SVD language |
| --- | --- |
| Principal directions | Right singular vectors |
| Scores | $U_r \Sigma_r$ |
| Explained variance | Squared singular values |
| Projection subspace | Span of $V_r$ |
| Low-dimensional data | $\tilde{X} V_r$ |
| Reconstruction | $U_r \Sigma_r V_r^T$ |

This is one of the cleanest links between statistics and matrix factorization.

121.14 Correlation PCA

Sometimes features have very different units or scales.

For example, one feature may be measured in dollars and another in millimeters. A feature with large numerical scale may dominate the covariance matrix.

To avoid this, one may standardize each feature:

$$ x_{ij}^{\text{std}} = \frac{x_{ij} - \mu_j}{s_j}, $$

where $\mu_j$ is the feature mean and $s_j$ is the feature standard deviation.

PCA on standardized data is equivalent to PCA using the correlation matrix.

This is useful when relative variation matters more than raw units. Library documentation such as the oneDAL notes distinguishes covariance-based and correlation-based PCA, with the choice depending on whether feature scaling is important.
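A sketch of standardization, confirming that the covariance of standardized data equals the correlation matrix (the feature scales are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(9)
# features on wildly different scales, e.g. dollars vs millimeters
X = rng.normal(size=(200, 3)) * np.array([1000.0, 1.0, 0.01])

mu = X.mean(axis=0)
s = X.std(axis=0, ddof=1)          # sample standard deviations
X_std = (X - mu) / s               # standardized data

S_std = X_std.T @ X_std / (X.shape[0] - 1)   # covariance of standardized data
R = np.corrcoef(X, rowvar=False)             # correlation matrix of raw data
```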

121.15 Whitening

PCA can be used for whitening.

Let

$$ S = W \Lambda W^T. $$

The PCA coordinates are

$$ z = W^T \tilde{x}. $$

Their covariance matrix is $\Lambda$.

To make the covariance the identity, define

$$ u = \Lambda^{-1/2} W^T \tilde{x}. $$

Then

$$ \operatorname{Cov}(u) = I, $$

assuming positive eigenvalues.

Whitening removes correlations and rescales each principal direction to unit variance.

It is useful in preprocessing, signal processing, independent component analysis, and some machine learning workflows.
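A sketch of PCA whitening on correlated synthetic data (the mixing matrix is arbitrary; all covariance eigenvalues are assumed positive, as required):

```python
import numpy as np

rng = np.random.default_rng(10)
A = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0], [1.0, 0.0, 1.0]])
X = rng.normal(size=(500, 3)) @ A       # correlated features
X_tilde = X - X.mean(axis=0)
m = X_tilde.shape[0]
S = X_tilde.T @ X_tilde / (m - 1)

lam, W = np.linalg.eigh(S)              # eigenvalues all positive here
U = X_tilde @ W / np.sqrt(lam)          # u = Lambda^{-1/2} W^T x~, row-wise
cov_U = U.T @ U / (m - 1)               # identity, up to floating point
```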

121.16 Biplots and Scree Plots

PCA is often visualized with plots.

A scree plot shows eigenvalues or explained variance ratios in decreasing order.

A sharp drop suggests that the first few components capture most of the structure.

A biplot shows observations and feature loadings together in a low-dimensional principal component plane.

These plots help interpret which variables contribute to each component and how observations are arranged in the reduced space.

Scree plots and biplots are standard PCA interpretation tools.

121.17 Loadings

The entries of a principal direction are called loadings.

If

$$ w_j = \begin{bmatrix} w_{1j} \\ w_{2j} \\ \vdots \\ w_{nj} \end{bmatrix}, $$

then $w_{kj}$ is the loading of feature $k$ on component $j$.

Large positive or negative loadings indicate that a feature contributes strongly to the component.

Loadings help interpret components.

For example, in a dataset of body measurements, one component might have positive loadings on height, weight, and limb length. This component may represent overall body size.

Another component may contrast height against weight and represent shape.

121.18 Sign Ambiguity

Principal components have arbitrary sign.

If $w$ is an eigenvector, then

$$ -w $$

is also an eigenvector with the same eigenvalue.

Therefore PCA results may differ by signs across software packages or runs.

This does not change the subspace, explained variance, reconstruction error, or geometry.

If a principal component score changes sign, the corresponding loading vector also changes sign. The interpretation should account for this convention.

121.19 Rank and Degeneracy

If the centered data matrix has rank $r$, then only $r$ eigenvalues of the covariance matrix are positive.

Since centering subtracts the mean, the rank satisfies

$$ \operatorname{rank}(\tilde{X}) \leq \min(m-1, n). $$

Thus if $n$ is large and $m$ is small, many covariance eigenvalues are zero.

When eigenvalues are repeated, the individual eigenvectors inside the repeated eigenspace are not unique. Only the subspace is determined.

This matters when interpreting components with nearly equal eigenvalues.

121.20 PCA as Rotation

PCA can be viewed as rotating the coordinate axes.

The centered data are transformed by

$$ Z = \tilde{X} W. $$

Since $W$ is orthogonal, this transformation preserves distances and angles.

It does not distort the data. It merely changes the coordinate system.

After rotation, the covariance matrix becomes diagonal.

Thus PCA first rotates the data into uncorrelated coordinates, then optionally drops low-variance coordinates.
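The distance-preserving property of the rotation can be checked directly (a sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(12)
X = rng.normal(size=(50, 3))
X_tilde = X - X.mean(axis=0)
S = X_tilde.T @ X_tilde / (X_tilde.shape[0] - 1)

_, W = np.linalg.eigh(S)
Z = X_tilde @ W                  # full rotation: no coordinates dropped

# pairwise distances are unchanged because W is orthogonal
d_before = np.linalg.norm(X_tilde[0] - X_tilde[1])
d_after = np.linalg.norm(Z[0] - Z[1])
```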

121.21 PCA as Projection

PCA can also be viewed as projection.

The reduced reconstruction is

$$ \hat{x} = \mu + W_r W_r^T (x - \mu). $$

Here $W_r W_r^T$ projects onto the principal subspace.

This viewpoint emphasizes approximation.

The original point is replaced by its closest point in the selected principal subspace.

The error vector is orthogonal to that subspace.

This is the same geometry as least squares.

121.22 PCA and Noise

PCA is often used for denoising.

Suppose the signal lies mostly in a low-dimensional subspace, while noise spreads across many directions.

Then the leading principal components capture the signal, and the trailing components capture noise.

Keeping only the first $r$ components gives

$$ \hat{X} = \tilde{X} W_r W_r^T. $$

This may suppress noise.

However, PCA does not know which variance is meaningful. If noise has large variance, PCA may preserve noise. If signal has small variance, PCA may discard signal.

Thus PCA is a variance-based method, not a meaning-based method.

121.23 PCA and Compression

PCA compresses data by storing reduced coordinates and a basis.

Instead of storing each centered observation in $\mathbb{R}^n$, store

$$ z_i \in \mathbb{R}^r $$

and the matrix

$$ W_r \in \mathbb{R}^{n \times r}. $$

Approximate reconstruction uses

$$ \hat{x}_i = \mu + W_r z_i. $$

This can reduce storage when $r \ll n$.

For images, PCA may represent many similar images using a small number of basis images, sometimes called eigenfaces in face analysis.

121.24 Limitations

PCA is powerful but limited.

| Limitation | Explanation |
| --- | --- |
| Linear only | PCA finds linear subspaces |
| Variance-based | High variance may not mean useful signal |
| Sensitive to scaling | Feature units affect covariance PCA |
| Sensitive to outliers | Extreme points can dominate directions |
| Components may be hard to interpret | Orthogonal directions may mix features |
| Sign is arbitrary | Component signs have no intrinsic meaning |
| Repeated eigenvalues reduce uniqueness | Only eigenspaces are stable |

These limitations are not defects in the mathematics. They describe the assumptions PCA makes.

PCA is appropriate when linear structure and variance capture the main information.

121.25 Kernel PCA

Kernel PCA extends PCA to nonlinear feature spaces.

A kernel function has the form

$$ K(x, z) = \langle \phi(x), \phi(z) \rangle, $$

where $\phi$ maps data into a possibly high-dimensional feature space.

Kernel PCA performs PCA using the Gram matrix

$$ G_{ij} = K(x_i, x_j) $$

rather than explicitly forming $\phi(x_i)$.

This can reveal nonlinear structure in the original space.

The method still relies on eigenvalue decomposition, but the matrix being diagonalized is a kernel matrix rather than an ordinary covariance matrix.
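A minimal kernel PCA sketch with a Gaussian (RBF) kernel. The `gamma` value and the data are illustrative; the centering step $G_c = JGJ$ with $J = I - \frac{1}{m}\mathbf{1}\mathbf{1}^T$ centers the implicit feature vectors without computing them:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Gram matrix G_ij = exp(-gamma * ||x_i - x_j||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(11)
X = rng.normal(size=(40, 2))

G = rbf_kernel(X)
m = G.shape[0]
J = np.eye(m) - np.ones((m, m)) / m
G_c = J @ G @ J                       # center the kernel matrix in feature space

lam, A = np.linalg.eigh(G_c)
lam, A = lam[::-1], A[:, ::-1]        # leading kernel principal components first
```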

121.26 PCA and Linear Algebra

The dictionary is direct.

| PCA concept | Linear algebra object |
| --- | --- |
| Dataset | Matrix |
| Feature mean | Vector |
| Centering | Translation to the origin |
| Covariance | Symmetric positive semidefinite matrix |
| Principal direction | Eigenvector |
| Explained variance | Eigenvalue |
| Scores | Coordinates in the eigenbasis |
| Dimension reduction | Projection onto a subspace |
| Reconstruction | Orthogonal projection |
| Compression | Low-rank approximation |
| Whitening | Diagonal rescaling |
| SVD computation | Matrix factorization |

PCA is one of the clearest examples of linear algebra used as a data analysis method.

121.27 Summary

Principal component analysis finds orthogonal directions of maximum variance in centered data.

The covariance matrix

$$ S = \frac{1}{m-1} \tilde{X}^T \tilde{X} $$

is diagonalized as

$$ S = W \Lambda W^T. $$

The columns of $W$ are principal directions. The eigenvalues in $\Lambda$ are the variances explained by those directions.

Reducing dimension means keeping the first $r$ eigenvectors and projecting the data onto their span. Reconstruction is orthogonal projection back into the original space. The discarded eigenvalues measure the lost variance.

PCA can also be computed through the singular value decomposition

$$ \tilde{X} = U \Sigma V^T. $$

The right singular vectors are the principal directions, and the squared singular values determine explained variance.

The central principle is that PCA chooses the coordinate system in which the covariance matrix is diagonal and the largest variation appears first.