Principal component analysis, or PCA, is a method for finding the main directions of variation in a dataset.
It converts correlated variables into new orthogonal variables called principal components. The first principal component points in the direction of greatest variance. The second principal component points in the direction of greatest remaining variance subject to being orthogonal to the first. The process continues until all directions have been found or until a chosen lower-dimensional representation is obtained.
Linear algebra provides the entire structure of PCA. The method uses centering, covariance matrices, eigenvalues, eigenvectors, orthogonal projection, and singular value decomposition. In practice, PCA is often computed either from the covariance matrix or directly from the singular value decomposition of the centered data matrix.
121.1 Data Matrix
Let a dataset contain $m$ observations and $n$ features.
Write the data matrix as
$$ X = \begin{bmatrix}
- & x_1^T & - \\
- & x_2^T & - \\
 & \vdots & \\
- & x_m^T & -
\end{bmatrix} \in \mathbb{R}^{m\times n}. $$
Each row is one observation. Each column is one feature.
For example, if each observation records height, weight, and age, then $n = 3$. If there are $m$ people, then $X$ is an $m \times 3$ matrix.
PCA studies the geometry of the rows of $X$ as points in $\mathbb{R}^n$.
121.2 Centering the Data
PCA is usually applied to centered data.
The mean vector is
$$ \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i \in \mathbb{R}^n. $$
The centered observations are
$$ \tilde{x}_i = x_i - \bar{x}. $$
The centered data matrix is
$$ \tilde{X} = \begin{bmatrix}
- & \tilde{x}_1^T & - \\
- & \tilde{x}_2^T & - \\
 & \vdots & \\
- & \tilde{x}_m^T & -
\end{bmatrix}. $$
Centering moves the cloud of data points so that its mean is at the origin.
This matters because PCA measures variation around the mean. Without centering, the first component may point toward the mean offset rather than toward the main direction of variation.
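The centering step is a one-liner in NumPy. The following sketch uses synthetic data (the array shapes and random values are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(100, 3))  # 100 observations, 3 features, nonzero mean

x_bar = X.mean(axis=0)                  # feature-wise mean vector
X_tilde = X - x_bar                     # centered data matrix

# After centering, the cloud's mean sits (numerically) at the origin.
print(np.allclose(X_tilde.mean(axis=0), 0.0))
```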
121.3 Variance in a Direction
Let $v \in \mathbb{R}^n$ be a unit vector:
$$ \|v\| = 1. $$
The projection of a centered observation $\tilde{x}_i$ onto $v$ has coordinate
$$ \tilde{x}_i^T v. $$
The variance of the data along direction $v$ is proportional to
$$ \sum_{i=1}^{m} (\tilde{x}_i^T v)^2. $$
In matrix form,
$$ \sum_{i=1}^{m} (\tilde{x}_i^T v)^2 = \|\tilde{X} v\|^2 = v^T \tilde{X}^T \tilde{X} v. $$
Thus the direction of maximum variance solves
$$ \max_{\|v\|=1} v^T \tilde{X}^T \tilde{X} v. $$
This is a Rayleigh quotient problem.
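The variance-along-a-direction quantity is easy to compute directly. In this sketch (synthetic data; the `variance_along` helper is an illustrative name) the first coordinate is deliberately stretched, so the x-axis direction carries more variance:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stretch the first coordinate so most variation lies along the x-axis.
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])
X_tilde = X - X.mean(axis=0)

def variance_along(v):
    """Sum of squared projections, proportional to the variance along v."""
    v = v / np.linalg.norm(v)           # enforce the unit-vector constraint
    return float(np.sum((X_tilde @ v) ** 2))

# The stretched direction carries far more variance than the other axis.
print(variance_along(np.array([1.0, 0.0])) > variance_along(np.array([0.0, 1.0])))
```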
121.4 Covariance Matrix
The sample covariance matrix is
$$ S = \frac{1}{m-1} \tilde{X}^T \tilde{X} \in \mathbb{R}^{n\times n}. $$
The entry $S_{jk}$ measures how feature $j$ and feature $k$ vary together.
The matrix $S$ is symmetric and positive semidefinite.
For any vector $v$,
$$ v^T S v = \frac{1}{m-1} \|\tilde{X} v\|^2 \ge 0. $$
The variance of the data along unit direction $v$ is
$$ v^T S v. $$
Therefore PCA is the problem of finding orthonormal directions that diagonalize the covariance matrix. The eigenvector associated with the largest eigenvalue of the covariance or correlation matrix gives the first principal component, and later components are ordered by decreasing eigenvalue.
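A short check of these covariance properties, assuming the $1/(m-1)$ sample normalization used above (the data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X_tilde = X - X.mean(axis=0)
m = X.shape[0]

S = X_tilde.T @ X_tilde / (m - 1)       # sample covariance matrix

print(np.allclose(S, S.T))                            # symmetric
print(bool(np.all(np.linalg.eigvalsh(S) >= -1e-10)))  # positive semidefinite
# np.cov treats rows as variables by default, so pass rowvar=False.
print(np.allclose(S, np.cov(X, rowvar=False)))        # matches NumPy's estimator
```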
121.5 Principal Components
A principal component is a unit direction in feature space.
The first principal component is
$$ v_1 = \arg\max_{\|v\|=1} v^T S v. $$
The solution is an eigenvector of $S$ with the largest eigenvalue.
If
$$ S v_1 = \lambda_1 v_1, $$
where
$$ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0 $$
are the eigenvalues of $S$, then $v_1$ is the first principal direction.
The second principal component is the unit vector orthogonal to $v_1$ that maximizes variance. It is an eigenvector corresponding to $\lambda_2$.
Continuing in this way gives an orthonormal eigenbasis
$$ v_1, v_2, \dots, v_n. $$
121.6 Eigenvalue Decomposition
Since $S$ is symmetric, the spectral theorem gives
$$ S = V \Lambda V^T. $$
Here
$$ V = [\, v_1 \;\; v_2 \;\; \cdots \;\; v_n \,] $$
is an orthogonal matrix whose columns are eigenvectors, and
$$ \Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n) $$
contains the eigenvalues.
The eigenvalues are nonnegative because $S$ is positive semidefinite.
The eigenvectors define the new coordinate axes. The eigenvalues measure variance along those axes.
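The spectral decomposition can be computed with `np.linalg.eigh`, which is designed for symmetric matrices. One wrinkle worth a comment: `eigh` returns eigenvalues in ascending order, so both outputs are reversed to match the PCA convention (synthetic data again):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
X_tilde = X - X.mean(axis=0)
S = X_tilde.T @ X_tilde / (X.shape[0] - 1)

# eigh returns eigenvalues in ascending order for symmetric matrices,
# so reverse both outputs to get largest-variance-first ordering.
evals, evecs = np.linalg.eigh(S)
lam = evals[::-1]                       # descending eigenvalues
V = evecs[:, ::-1]                      # matching eigenvector columns

print(np.allclose(V @ np.diag(lam) @ V.T, S))  # S = V Lambda V^T
print(np.allclose(V.T @ V, np.eye(3)))         # columns are orthonormal
```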
121.7 Scores
The principal component scores are the coordinates of the centered data in the principal component basis.
For one observation,
$$ z_i = V^T \tilde{x}_i. $$
For all observations,
$$ Z = \tilde{X} V. $$
The first column of $Z$ contains the coordinates along the first principal component. The second column contains the coordinates along the second principal component, and so on.
The covariance matrix of the transformed data is diagonal:
$$ \frac{1}{m-1} Z^T Z = V^T S V = \Lambda. $$
Thus PCA finds a coordinate system in which the features are uncorrelated.
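The decorrelation property can be verified numerically. This sketch deliberately mixes the features to create correlations, then checks that the score covariance is diagonal (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 3)) @ rng.normal(size=(3, 3))  # correlated features
X_tilde = X - X.mean(axis=0)
m = X.shape[0]
S = X_tilde.T @ X_tilde / (m - 1)

evals, V = np.linalg.eigh(S)
V = V[:, ::-1]                          # largest-variance direction first

Z = X_tilde @ V                         # principal component scores

# In the new coordinates the covariance is diagonal: scores are uncorrelated.
C = Z.T @ Z / (m - 1)
off_diag = C - np.diag(np.diag(C))
print(np.allclose(off_diag, 0.0))
```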
121.8 Dimensionality Reduction
PCA is often used to reduce dimension.
Choose $k < n$ principal directions and form
$$ V_k = [\, v_1 \;\; \cdots \;\; v_k \,] \in \mathbb{R}^{n\times k}. $$
The reduced representation of an observation is
$$ z_i^{(k)} = V_k^T \tilde{x}_i \in \mathbb{R}^k. $$
For all observations,
$$ Z_k = \tilde{X} V_k \in \mathbb{R}^{m\times k}. $$
This replaces $n$ original features by $k$ principal component coordinates.
The new coordinates preserve as much variance as possible among all $k$-dimensional orthogonal projections.
121.9 Reconstruction
The reduced coordinates can be mapped back to the original feature space.
For one observation,
$$ \hat{x}_i = V_k z_i^{(k)}. $$
Since
$$ z_i^{(k)} = V_k^T \tilde{x}_i, $$
the reconstruction is
$$ \hat{x}_i = V_k V_k^T \tilde{x}_i. $$
The matrix
$$ P_k = V_k V_k^T $$
is the orthogonal projection onto the subspace spanned by the first $k$ principal components.
Thus PCA approximation is projection onto a data-adapted subspace.
121.10 Reconstruction Error
The reconstruction error for one centered observation is
$$ \|\tilde{x}_i - V_k V_k^T \tilde{x}_i\|. $$
For the whole dataset, the total squared reconstruction error is
$$ E_k = \sum_{i=1}^{m} \|\tilde{x}_i - V_k V_k^T \tilde{x}_i\|^2. $$
PCA chooses the subspace that minimizes this error among all $k$-dimensional linear subspaces.
The minimum error equals the sum of the discarded eigenvalues, up to the same scaling used in the covariance matrix:
$$ \frac{1}{m-1} E_k = \sum_{j=k+1}^{n} \lambda_j. $$
Thus small discarded eigenvalues mean little information is lost.
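The error identity can be checked directly, assuming the $1/(m-1)$ covariance scaling used above (synthetic data, illustrative choice $k = 2$):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 5))
X_tilde = X - X.mean(axis=0)
m = X.shape[0]
S = X_tilde.T @ X_tilde / (m - 1)

evals, evecs = np.linalg.eigh(S)
lam = evals[::-1]                       # descending eigenvalues
V = evecs[:, ::-1]

k = 2
V_k = V[:, :k]
X_hat = X_tilde @ V_k @ V_k.T           # project onto the top-k subspace

# Total squared reconstruction error ...
E_k = np.sum((X_tilde - X_hat) ** 2)
# ... equals (m - 1) times the sum of the discarded eigenvalues.
print(np.allclose(E_k, (m - 1) * np.sum(lam[k:])))
```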
121.11 Explained Variance
The eigenvalue $\lambda_j$ is the variance explained by principal component $j$.
The total variance is
$$ \sum_{j=1}^{n} \lambda_j = \operatorname{tr}(S). $$
The explained variance ratio for component $j$ is
$$ \frac{\lambda_j}{\sum_{l=1}^{n} \lambda_l}. $$
The cumulative explained variance for the first $k$ components is
$$ \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{n} \lambda_j}. $$
This ratio helps choose the number of components.
For example, one may choose the smallest $k$ such that the cumulative explained variance is at least a chosen threshold, such as $0.95$.
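The selection rule is a few lines of NumPy. This sketch gives the features very different scales so a couple of directions dominate; the $0.95$ threshold is an illustrative choice, not a universal rule:

```python
import numpy as np

rng = np.random.default_rng(6)
# Very different feature scales, so a few directions dominate the variance.
X = rng.normal(size=(100, 4)) * np.array([10.0, 5.0, 1.0, 0.1])
X_tilde = X - X.mean(axis=0)
S = X_tilde.T @ X_tilde / (X.shape[0] - 1)

lam = np.linalg.eigvalsh(S)[::-1]       # descending eigenvalues

ratio = lam / lam.sum()                 # explained variance ratio per component
cumulative = np.cumsum(ratio)           # cumulative explained variance

# Smallest k whose cumulative explained variance reaches the threshold.
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold) + 1)
print(1 <= k <= 4)
```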
121.12 Singular Value Decomposition
PCA can also be computed from the singular value decomposition of the centered data matrix:
$$ \tilde{X} = U \Sigma V^T. $$
Here the columns of $V$ are the right singular vectors.
Since
$$ \tilde{X}^T \tilde{X} = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T \Sigma V^T, $$
the columns of $V$ are eigenvectors of $\tilde{X}^T \tilde{X}$, hence principal directions.
The eigenvalues of the covariance matrix satisfy
$$ \lambda_j = \frac{\sigma_j^2}{m-1}, $$
where $\sigma_j$ is the $j$-th singular value.
This gives a direct connection between PCA and SVD. Many implementations compute PCA through SVD, especially for numerical stability and efficiency.
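The eigenvalue–singular value relation is easy to confirm by running both routes on the same centered matrix (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
X_tilde = X - X.mean(axis=0)
m = X.shape[0]
S = X_tilde.T @ X_tilde / (m - 1)

# Eigendecomposition route: descending eigenvalues of the covariance matrix.
lam = np.linalg.eigvalsh(S)[::-1]

# SVD route: singular values of the centered data matrix.
U, sigma, Vt = np.linalg.svd(X_tilde, full_matrices=False)

# lambda_j = sigma_j^2 / (m - 1)
print(np.allclose(lam, sigma ** 2 / (m - 1)))
```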
121.13 Low-Rank Approximation
The truncated SVD gives
$$ \tilde{X}_k = U_k \Sigma_k V_k^T. $$
This is the best rank-$k$ approximation to $\tilde{X}$ in Frobenius norm.
PCA and truncated SVD are therefore two views of the same operation:
| PCA language | SVD language |
|---|---|
| Principal directions | Right singular vectors (columns of $V$) |
| Scores $Z = \tilde{X} V$ | $U \Sigma$ |
| Explained variance | Squared singular values, scaled by $1/(m-1)$ |
| Projection subspace | Span of the first $k$ columns of $V$ |
| Low-dimensional data $Z_k$ | $U_k \Sigma_k$ |
| Reconstruction | $U_k \Sigma_k V_k^T$ |
This is one of the cleanest links between statistics and matrix factorization.
121.14 Correlation PCA
Sometimes features have very different units or scales.
For example, one feature may be measured in dollars and another in millimeters. A feature with large numerical scale may dominate the covariance matrix.
To avoid this, one may standardize each feature:
$$ \tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, $$
where $\bar{x}_j$ is the mean of feature $j$ and $s_j$ is its standard deviation.
PCA on standardized data is equivalent to PCA using the correlation matrix.
This is useful when relative variation matters more than raw units. OneDAL notes distinguish covariance-based and correlation-based PCA, with the choice depending on whether feature scaling is important.
121.15 Whitening
PCA can be used for whitening.
Let
$$ S = V \Lambda V^T. $$
The PCA coordinates are
$$ Z = \tilde{X} V. $$
Their covariance is
$$ \frac{1}{m-1} Z^T Z = \Lambda. $$
To make the covariance identity, define
$$ Z_w = \tilde{X} V \Lambda^{-1/2}. $$
Then
$$ \frac{1}{m-1} Z_w^T Z_w = \Lambda^{-1/2} \Lambda \Lambda^{-1/2} = I, $$
assuming positive eigenvalues.
Whitening removes correlations and rescales each principal direction to unit variance.
It is useful in preprocessing, signal processing, independent component analysis, and some machine learning workflows.
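The whitening computation, sketched under the assumption that all eigenvalues are strictly positive (synthetic full-rank data):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # correlated data
X_tilde = X - X.mean(axis=0)
m = X.shape[0]
S = X_tilde.T @ X_tilde / (m - 1)

evals, evecs = np.linalg.eigh(S)
lam = evals[::-1]                       # descending, assumed all positive here
V = evecs[:, ::-1]

# Whitened coordinates: rotate into the eigenbasis, then rescale each
# axis by 1/sqrt(lambda_j) so every direction has unit variance.
Z_w = X_tilde @ V / np.sqrt(lam)

print(np.allclose(Z_w.T @ Z_w / (m - 1), np.eye(3)))
```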
121.16 Biplots and Scree Plots
PCA is often visualized with plots.
A scree plot shows eigenvalues or explained variance ratios in decreasing order.
A sharp drop suggests that the first few components capture most of the structure.
A biplot shows observations and feature loadings together in a low-dimensional principal component plane.
These plots help interpret which variables contribute to each component and how observations are arranged in the reduced space.
Scree plots and biplots are standard PCA interpretation tools.
121.17 Loadings
The entries of a principal direction are called loadings.
If
$$ v_j = (v_{1j}, v_{2j}, \dots, v_{nj})^T, $$
then $v_{ij}$ is the loading of feature $i$ on component $j$.
Large positive or negative loadings indicate that a feature contributes strongly to the component.
Loadings help interpret components.
For example, in a dataset of body measurements, one component might have positive loadings on height, weight, and limb length. This component may represent overall body size.
Another component may contrast height against weight and represent shape.
121.18 Sign Ambiguity
Principal components have arbitrary sign.
If $v$ is an eigenvector, then
$$ -v $$
is also an eigenvector with the same eigenvalue.
Therefore PCA results may differ by signs across software packages or runs.
This does not change the subspace, explained variance, reconstruction error, or geometry.
If a principal component score changes sign, the corresponding loading vector also changes sign. The interpretation should account for this convention.
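One common convention for making signs reproducible, sketched below, is to flip each eigenvector so that its largest-magnitude loading is positive; this is a convention, not a mathematical requirement (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 4))
X_tilde = X - X.mean(axis=0)
S = X_tilde.T @ X_tilde / (X.shape[0] - 1)

_, V = np.linalg.eigh(S)
V = V[:, ::-1]

# Convention: flip each column so its largest-magnitude loading is positive.
# The spanned subspace and all variances are unchanged by the flip.
idx = np.argmax(np.abs(V), axis=0)              # row of max |loading| per column
signs = np.sign(V[idx, np.arange(V.shape[1])])  # sign of that loading
V_fixed = V * signs

print(bool(np.all(V_fixed[idx, np.arange(V.shape[1])] > 0)))
```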
121.19 Rank and Degeneracy
If the centered data matrix has rank $r$, then only $r$ eigenvalues of the covariance matrix are positive.
Since centering subtracts the mean, the rank satisfies
$$ r \le \min(m-1, n). $$
Thus if $n$ is large and $m$ is small, many covariance eigenvalues are zero.
When eigenvalues are repeated, the individual eigenvectors inside the repeated eigenspace are not unique. Only the subspace is determined.
This matters when interpreting components with nearly equal eigenvalues.
121.20 PCA as Rotation
PCA can be viewed as rotating the coordinate axes.
The centered data are transformed by
$$ Z = \tilde{X} V. $$
Since $V$ is orthogonal, this transformation preserves distances and angles.
It does not distort the data. It merely changes the coordinate system.
After rotation, the covariance matrix becomes diagonal.
Thus PCA first rotates the data into uncorrelated coordinates, then optionally drops low-variance coordinates.
121.21 PCA as Projection
PCA can also be viewed as projection.
The reduced reconstruction is
$$ \hat{x}_i = V_k V_k^T \tilde{x}_i. $$
Here $V_k V_k^T$ projects onto the principal subspace.
This viewpoint emphasizes approximation.
The original point is replaced by its closest point in the selected principal subspace.
The error vector is orthogonal to that subspace.
This is the same geometry as least squares.
121.22 PCA and Noise
PCA is often used for denoising.
Suppose the signal lies mostly in a low-dimensional subspace, while noise spreads across many directions.
Then the leading principal components capture the signal, and the trailing components capture noise.
Keeping only the first $k$ components gives
$$ \hat{X} = \tilde{X} V_k V_k^T. $$
This may suppress noise.
However, PCA does not know which variance is meaningful. If noise has large variance, PCA may preserve noise. If signal has small variance, PCA may discard signal.
Thus PCA is a variance-based method, not a meaning-based method.
121.23 PCA and Compression
PCA compresses data by storing reduced coordinates and a basis.
Instead of storing each centered observation in $\mathbb{R}^n$, store
$$ z_i^{(k)} = V_k^T \tilde{x}_i \in \mathbb{R}^k $$
and the matrix
$$ V_k \in \mathbb{R}^{n\times k}. $$
Approximate reconstruction uses
$$ \hat{x}_i = V_k z_i^{(k)}. $$
This can reduce storage when $k(m + n) \ll mn$.
For images, PCA may represent many similar images using a small number of basis images, sometimes called eigenfaces in face analysis.
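A quick count of the stored numbers makes the saving concrete; the sizes here ($m = 1000$, $n = 50$, $k = 5$) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(10)
m, n, k = 1000, 50, 5
X = rng.normal(size=(m, n))
x_bar = X.mean(axis=0)
X_tilde = X - x_bar

_, _, Vt = np.linalg.svd(X_tilde, full_matrices=False)
V_k = Vt[:k].T                          # basis matrix, n x k

Z_k = X_tilde @ V_k                     # reduced coordinates, m x k

# Numbers stored: reduced coordinates + basis + mean, versus the full matrix.
compressed = Z_k.size + V_k.size + x_bar.size   # 5000 + 250 + 50
print(compressed < X.size)                      # versus 50000
```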
121.24 Limitations
PCA is powerful but limited.
| Limitation | Explanation |
|---|---|
| Linear only | PCA finds linear subspaces |
| Variance-based | High variance may not mean useful signal |
| Sensitive to scaling | Feature units affect covariance PCA |
| Sensitive to outliers | Extreme points can dominate directions |
| Components may be hard to interpret | Orthogonal directions may mix features |
| Sign is arbitrary | Component signs have no intrinsic meaning |
| Repeated eigenvalues reduce uniqueness | Only eigenspaces are stable |
These limitations are not defects in the mathematics. They describe the assumptions PCA makes.
PCA is appropriate when linear structure and variance capture the main information.
121.25 Kernel PCA
Kernel PCA extends PCA to nonlinear feature spaces.
A kernel function has the form
$$ k(x, y) = \phi(x)^T \phi(y), $$
where $\phi$ maps data into a possibly high-dimensional feature space.
Kernel PCA performs PCA using the Gram matrix
$$ K_{ij} = k(x_i, x_j) $$
rather than explicitly forming $\phi(x_i)$.
This can reveal nonlinear structure in the original space.
The method still relies on eigenvalue decomposition, but the matrix being diagonalized is a kernel matrix rather than an ordinary covariance matrix.
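A sanity check on the Gram-matrix route: with a *linear* kernel on already-centered data, diagonalizing $K = \tilde{X}\tilde{X}^T$ reproduces ordinary PCA scores up to sign. (A general kernel PCA also requires centering the kernel matrix in feature space; that step is avoided here by centering the data first.)

```python
import numpy as np

rng = np.random.default_rng(11)
X = rng.normal(size=(40, 3))
X_tilde = X - X.mean(axis=0)

# Ordinary PCA scores via SVD of the centered data.
U, sigma, Vt = np.linalg.svd(X_tilde, full_matrices=False)
Z = X_tilde @ Vt.T                      # 40 x 3 score matrix

# Linear-kernel Gram matrix on centered data: K = X~ X~^T.
K = X_tilde @ X_tilde.T
evals, evecs = np.linalg.eigh(K)
evals, evecs = evals[::-1], evecs[:, ::-1]

# Scores along component j are sqrt(eigenvalue_j) * eigenvector_j of K;
# with a linear kernel these match ordinary PCA scores up to sign.
Z_kernel = evecs[:, :3] * np.sqrt(np.maximum(evals[:3], 0.0))
print(np.allclose(np.abs(Z_kernel), np.abs(Z)))
```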
121.26 PCA and Linear Algebra
The dictionary is direct.
| PCA concept | Linear algebra object |
|---|---|
| Dataset | Matrix |
| Feature mean | Vector |
| Centering | Translation to origin |
| Covariance | Symmetric positive semidefinite matrix |
| Principal direction | Eigenvector |
| Explained variance | Eigenvalue |
| Scores | Coordinates in eigenbasis |
| Dimension reduction | Projection onto subspace |
| Reconstruction | Orthogonal projection |
| Compression | Low-rank approximation |
| Whitening | Diagonal rescaling |
| SVD computation | Matrix factorization |
PCA is one of the clearest examples of linear algebra used as a data analysis method.
121.27 Summary
Principal component analysis finds orthogonal directions of maximum variance in centered data.
The covariance matrix
$$ S = \frac{1}{m-1} \tilde{X}^T \tilde{X} $$
is diagonalized as
$$ S = V \Lambda V^T. $$
The columns of $V$ are principal directions. The eigenvalues in $\Lambda$ are variances explained by those directions.
Reducing dimension means keeping the first $k$ eigenvectors and projecting data onto their span. Reconstruction is orthogonal projection back into the original space. The discarded eigenvalues measure lost variance.
PCA can also be computed through the singular value decomposition
$$ \tilde{X} = U \Sigma V^T. $$
The right singular vectors are the principal directions, and the squared singular values determine explained variance.
The central principle is that PCA chooses the coordinate system in which the covariance matrix is diagonal and the largest variation appears first.