
Chapter 120. Machine Learning

Machine learning studies algorithms that improve their behavior from data.

A machine learning system receives examples, extracts structure, and produces predictions, classifications, rankings, embeddings, or decisions. Linear algebra is central because data are represented as vectors and matrices, models are often linear or locally linear, and training is usually an optimization problem over parameters.

Many standard learning methods can be written in the form

\text{data matrix} \to \text{linear map} \to \text{loss function} \to \text{optimization}.

This is why linear algebra appears throughout machine learning: in regression, classification, kernels, embeddings, neural networks, dimensionality reduction, and numerical optimization. Stanford CS229 notes, for example, develop regularization, kernels, SVMs, and optimization using vector and matrix notation.

120.1 Data as Vectors

A single data point is often represented as a vector

x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \in \mathbb{R}^d.

Each component is a feature.

For example, a house may be represented by features such as size, number of rooms, distance to city center, and age. An image may be represented by pixel values. A document may be represented by word counts or embedding coordinates.

Object | Vector representation
Image | Pixel vector or feature vector
Text | Token vector, count vector, or embedding
Audio | Sample vector or spectral feature vector
User | Preference vector
Product | Attribute or embedding vector
Sensor reading | Measurement vector

Once data are vectors, distances, inner products, projections, norms, and linear transformations become available.

120.2 Data Matrices

A dataset with m examples and d features is stored as a matrix

$$ X = \begin{bmatrix} - & x_1^T & - \\ - & x_2^T & - \\ & \vdots & \\ - & x_m^T & - \end{bmatrix} \in \mathbb{R}^{m\times d}. $$

Each row is one example. Each column is one feature.

The target values are often stored as a vector

y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}.

The pair (X, y) is the basic supervised learning dataset.

This design matrix is the same as in linear regression. Machine learning extends the framework to broader models, losses, constraints, and data types.
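
As a concrete sketch, the data matrix and target vector can be assembled in NumPy. The feature values below are invented for illustration:

```python
import numpy as np

# Hypothetical housing dataset: each row is one example (size, rooms, age),
# each column is one feature, matching X in R^{m x d} with m = 4, d = 3.
X = np.array([
    [120.0, 3.0, 10.0],
    [ 80.0, 2.0, 25.0],
    [200.0, 5.0,  2.0],
    [ 95.0, 3.0, 15.0],
])

# Target vector y holds one value per example (e.g., price).
y = np.array([300.0, 180.0, 520.0, 240.0])

m, d = X.shape  # number of examples, number of features
```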

120.3 Features

A feature is a numerical quantity used by a model.

A feature map converts raw input into a vector:

\phi : \mathcal{X} \to \mathbb{R}^d.

For an input object a, the model uses

x = \phi(a).

Feature maps may be simple or complex.

Raw object | Possible features
Text | Word counts, TF-IDF, embeddings
Image | Pixels, patches, learned features
Graph | Degrees, neighborhoods, spectral features
User behavior | Counts, recency, preferences
Time series | Lags, moving averages, Fourier coefficients

Feature design changes the geometry of the learning problem. A model that is linear in one feature space may represent nonlinear behavior in the original input space.

120.4 Linear Prediction

The simplest prediction model is linear:

\hat{y} = w^T x + b.

Here w \in \mathbb{R}^d is the weight vector and b \in \mathbb{R} is the bias.

The value w_j measures how feature j contributes to the prediction.

For the full dataset,

\hat{y} = Xw + b\mathbf{1}.

If the bias is included as an extra feature equal to 1, then the model becomes

\hat{y} = Xw.

This is the same matrix form used in least squares.
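
A minimal NumPy check of the bias trick, sketched on random data: folding b into the weight vector through a constant feature of 1 leaves the predictions unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 examples, 3 features
w = np.array([0.5, -1.0, 2.0])
b = 0.3

# Prediction with an explicit bias term: y_hat = X w + b * 1
y_hat = X @ w + b * np.ones(5)

# Same prediction with the bias folded in as a constant feature of 1
X_aug = np.hstack([X, np.ones((5, 1))])
w_aug = np.append(w, b)
y_hat_aug = X_aug @ w_aug

assert np.allclose(y_hat, y_hat_aug)
```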

120.5 Loss Functions

A loss function measures prediction error.

For regression, a common loss is squared error:

\ell(\hat{y}, y) = (\hat{y} - y)^2.

For a dataset, the empirical risk is

L(w) = \frac{1}{m}\sum_{i=1}^m (w^T x_i - y_i)^2.

In matrix form,

L(w) = \frac{1}{m}\|Xw - y\|^2.

Thus linear regression is a least squares problem.

Other learning problems use different losses.

Problem | Common loss
Regression | Squared error
Binary classification | Logistic loss
Margin classification | Hinge loss
Multiclass classification | Cross-entropy
Ranking | Pairwise ranking loss
Embedding learning | Contrastive loss

The choice of loss changes the optimization problem.

120.6 Training

Training means choosing model parameters to minimize loss.

For a linear model, training solves

\min_w L(w).

For squared loss,

\min_w \|Xw - y\|^2.

This is ordinary least squares.

If X^T X is invertible, the solution is

\hat{w} = (X^T X)^{-1} X^T y.

In larger or more complex models, training is usually done by iterative optimization, such as gradient descent.
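
The closed form can be sketched in NumPy. The synthetic data below are noiseless, so both approaches recover the true weights; in practice np.linalg.lstsq is preferred over forming X^T X explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true  # noiseless targets, so least squares recovers w_true exactly

# Normal equations: solve (X^T X) w = X^T y (assumes X^T X is invertible)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# lstsq is the numerically preferred route to the same solution
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(w_normal, w_true)
assert np.allclose(w_lstsq, w_true)
```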

120.7 Gradient Descent

Gradient descent updates parameters by moving opposite the gradient:

w_{k+1} = w_k - \eta \nabla L(w_k).

Here \eta > 0 is the learning rate.

For squared loss,

L(w) = \frac{1}{m}\|Xw - y\|^2,

the gradient is

\nabla L(w) = \frac{2}{m} X^T (Xw - y).

Thus each update uses matrix-vector products with X and X^T.

This is one reason large-scale machine learning depends heavily on efficient linear algebra kernels.
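
A minimal gradient descent loop for the squared loss, using the gradient formula above; the step size and iteration count are chosen for this synthetic example, not tuned for general use.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
w_true = np.array([1.5, -0.5])
y = X @ w_true

w = np.zeros(2)
eta = 0.1          # learning rate; must be small enough for convergence
m = X.shape[0]

for _ in range(500):
    grad = (2.0 / m) * X.T @ (X @ w - y)  # gradient of (1/m)||Xw - y||^2
    w = w - eta * grad

assert np.allclose(w, w_true, atol=1e-6)
```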

120.8 Regularization

Regularization adds a penalty to the training objective.

A common example is ridge regularization:

\min_w \|Xw - y\|^2 + \lambda\|w\|^2.

Here \lambda > 0 controls the penalty strength.

The solution satisfies

(X^T X + \lambda I) w = X^T y.

Regularization discourages overly large parameter vectors and improves stability when features are correlated or data are limited. CS229 notes describe \ell_2 regularization as encouraging small parameter norm and, in gradient descent form, as weight decay.
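
A small NumPy sketch of the ridge system, illustrating that the penalty shrinks the solution norm relative to ordinary least squares; the data and the value of \lambda are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lam = 5.0  # illustrative penalty strength

# Ridge solution solves (X^T X + lam * I) w = X^T y
d = X.shape[1]
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Unregularized least squares for comparison
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# The penalty shrinks the solution toward the origin
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_ols)
```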

120.9 Classification

In binary classification, the target is usually

y \in \{-1, +1\}

or

y \in \{0, 1\}.

A linear classifier computes a score

s = w^T x + b.

The predicted class may be

\hat{y} = \operatorname{sign}(s).

Geometrically, the equation

w^T x + b = 0

defines a hyperplane.

This hyperplane separates feature space into two half-spaces. The vector w is normal to the separating hyperplane.

Classification is therefore a geometric problem in vector space.

120.10 Logistic Regression

Logistic regression converts a linear score into a probability.

The score is

s = w^T x + b.

The probability of class 1 is

p = \sigma(s) = \frac{1}{1 + e^{-s}}.

The sigmoid function maps real numbers to values between 0 and 1.

Training usually minimizes cross-entropy loss:

L(w) = -\sum_{i=1}^m \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right].

Although the model is linear in its score, the probability is nonlinear in the parameters.

The training objective is convex for ordinary logistic regression.
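
The score-to-probability pipeline can be sketched directly; the labels here are generated from a hypothetical true boundary rather than learned.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 2))
w = np.array([1.0, -2.0])
b = 0.5
y = (X @ w + b > 0).astype(float)  # labels in {0, 1}

# Probability of class 1 from the linear score
p = sigmoid(X @ w + b)

# Cross-entropy loss; the small eps avoids log(0) at extreme scores
eps = 1e-12
loss = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

assert np.all((p >= 0) & (p <= 1))
assert loss > 0.0
```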

120.11 Support Vector Machines

A support vector machine seeks a separating hyperplane with a large margin.

For binary labels

y_i \in \{-1, +1\},

a linear classifier uses

f(x) = w^T x + b.

The margin condition is

y_i (w^T x_i + b) \geq 1.

The hard-margin SVM solves

\min_{w,b} \frac{1}{2}\|w\|^2

subject to the margin constraints.

The support vectors are training points that lie on or inside the margin boundary. SVMs are commonly developed through optimization, kernels, and inner products, and the kernel trick replaces explicit high-dimensional feature maps by kernel evaluations.

120.12 Kernels

A kernel is a function

K(x, z)

that behaves like an inner product in some feature space:

K(x, z) = \langle \phi(x), \phi(z) \rangle.

The kernel trick allows algorithms to use inner products in a high-dimensional feature space without explicitly constructing the feature vectors.

Common kernels include:

Kernel | Formula
Linear | K(x, z) = x^T z
Polynomial | K(x, z) = (x^T z + c)^p
Gaussian RBF | K(x, z) = \exp(-\gamma\|x - z\|^2)

Kernel methods show that linear algebra can operate implicitly through Gram matrices.

The Gram matrix has entries

G_{ij} = K(x_i, x_j).
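
A sketch of building the Gram matrix for the Gaussian RBF kernel and checking its expected properties: symmetry, a unit diagonal, and positive semidefiniteness.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # Gaussian RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(5)
pts = rng.normal(size=(6, 3))

# Gram matrix with entries G_ij = K(x_i, x_j)
G = np.array([[rbf_kernel(a, b) for b in pts] for a in pts])

assert np.allclose(G, G.T)           # symmetric
assert np.allclose(np.diag(G), 1.0)  # K(x, x) = 1 for the RBF kernel
eigvals = np.linalg.eigvalsh(G)
assert np.all(eigvals > -1e-10)      # positive semidefinite (up to roundoff)
```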

120.13 Embeddings

An embedding maps an object into a vector space.

For example, a word, document, image, user, or item may be mapped to

z \in \mathbb{R}^d.

The goal is that geometry in the embedding space reflects semantic or task-relevant structure.

Similar objects should have nearby vectors. Dissimilar objects should have distant or differently oriented vectors.

Common similarity measures include:

Measure | Formula
Dot product | x^T z
Euclidean distance | \|x - z\|
Cosine similarity | \frac{x^T z}{\|x\|\,\|z\|}

Embeddings turn complex objects into points in a learned vector space.
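
The similarity measures above differ in what they ignore. A short sketch contrasts cosine similarity, which depends only on direction, with the dot product, which changes with magnitude.

```python
import numpy as np

def cosine_similarity(x, z):
    # Cosine of the angle between x and z
    return (x @ z) / (np.linalg.norm(x) * np.linalg.norm(z))

x = np.array([1.0, 2.0, 3.0])

# Cosine similarity is invariant to rescaling and flips sign with direction
assert np.isclose(cosine_similarity(x, 2.0 * x), 1.0)
assert np.isclose(cosine_similarity(x, -x), -1.0)

# The dot product, by contrast, grows with magnitude
assert not np.isclose(x @ x, x @ (2.0 * x))
```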

120.14 Neural Networks

A neural network is a composition of linear maps and nonlinear activation functions.

A single layer has the form

h = \sigma(Wx + b).

Here W is a weight matrix, b is a bias vector, and \sigma is applied componentwise.

A multilayer network has the form

f(x) = W_L \sigma(W_{L-1} \sigma(\cdots \sigma(W_1 x + b_1))) + b_L.

The matrices W_1, \ldots, W_L contain the learned parameters.

Without nonlinear activation functions, the composition would collapse to one linear map. Nonlinearity is what allows neural networks to represent complex functions.

Still, each layer is built from matrix multiplication.
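
A two-layer forward pass written as matrix products and a componentwise nonlinearity; the weights here are random rather than trained, and ReLU stands in for \sigma.

```python
import numpy as np

def relu(v):
    # Componentwise nonlinearity; without it the two layers would
    # collapse into a single linear map W2 @ W1.
    return np.maximum(v, 0.0)

rng = np.random.default_rng(6)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # layer 1: R^3 -> R^4
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)  # layer 2: R^4 -> R^2

x = rng.normal(size=3)

# f(x) = W2 sigma(W1 x + b1) + b2
h = relu(W1 @ x + b1)
out = W2 @ h + b2

assert h.shape == (4,)
assert out.shape == (2,)
```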

120.15 Backpropagation

Backpropagation computes gradients of a loss with respect to all parameters in a neural network.

It applies the chain rule through the computational graph.

For a layer

h = \sigma(Wx + b),

the gradient with respect to W has an outer-product structure:

\frac{\partial L}{\partial W} = \delta x^T,

where \delta is the backpropagated error vector for the layer.

Thus training neural networks repeatedly uses matrix products, transposes, Jacobian-vector products, and outer products.

MIT notes on matrix calculus for machine learning emphasize this matrix-based calculus viewpoint for differentiating vector and matrix expressions used in learning systems.
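
The outer-product structure can be checked numerically. The sketch below uses a simplified linear layer with no activation and the squared loss L = ½‖Wx − t‖², for which the backpropagated error is δ = Wx − t; a finite difference confirms one gradient entry.

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
t = rng.normal(size=3)

def loss(W):
    r = W @ x - t
    return 0.5 * float(r @ r)

# Backprop for L = 0.5 ||Wx - t||^2: delta = Wx - t, dL/dW = delta x^T
delta = W @ x - t
grad = np.outer(delta, x)

# Check one entry against a finite difference
i, j, h = 1, 2, 1e-6
Wp = W.copy()
Wp[i, j] += h
fd = (loss(Wp) - loss(W)) / h
assert abs(fd - grad[i, j]) < 1e-4
```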

120.16 Dimensionality Reduction

High-dimensional data are often compressed into lower-dimensional representations.

A linear dimensionality reduction map has the form

z = W^T x,

where

W \in \mathbb{R}^{d \times r}, \qquad r < d.

The vector z \in \mathbb{R}^r is a lower-dimensional representation of x.

Good dimensionality reduction preserves important structure: variance, distance, neighborhood relations, class separation, or task information.

Linear algebra supplies the central methods, especially eigenvalue decompositions and singular value decompositions.

120.17 Principal Component Analysis

Principal component analysis, or PCA, finds orthogonal directions of maximum variance.

Given a centered data matrix X, the sample covariance matrix is proportional to

X^T X.

The principal components are eigenvectors of X^T X.

Equivalently, PCA can be computed by the singular value decomposition

X = U \Sigma V^T.

The columns of V give the principal directions.

Keeping the first r directions gives a rank-r approximation of the data.

PCA is therefore a direct application of eigenvalues, orthogonality, projection, and SVD.
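
PCA via the SVD in a NumPy sketch: the eigenvalues of X^T X equal the squared singular values, and the top-r directions give a rank-r reconstruction.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(100, 5))
X = A - A.mean(axis=0)  # PCA assumes centered data

# SVD: columns of V are the principal directions
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# The same directions diagonalize X^T X, with eigenvalues S^2
eigvals, _ = np.linalg.eigh(X.T @ X)  # eigh returns ascending order
assert np.allclose(np.sort(eigvals)[::-1], S ** 2)

# Rank-2 approximation: project onto the top two directions, map back
r = 2
X_r = (X @ V[:, :r]) @ V[:, :r].T
assert X_r.shape == X.shape
```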

120.18 Matrix Factorization

Many machine learning problems use matrix factorization.

Suppose R \in \mathbb{R}^{m \times n} stores user-item ratings.

A low-rank model approximates

R \approx U V^T,

where

U \in \mathbb{R}^{m \times r}, \qquad V \in \mathbb{R}^{n \times r}.

The row u_i represents user i. The row v_j represents item j.

The predicted rating is

\hat{R}_{ij} = u_i^T v_j.

This is the basis of many recommender systems.

The model assumes that observed preferences are governed by a smaller number of latent factors.
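
A sketch of the low-rank rating model with invented random factors, showing that every predicted rating is an inner product of latent vectors.

```python
import numpy as np

rng = np.random.default_rng(9)
m, n, r = 6, 5, 2

# Latent factor matrices: row u_i describes user i, row v_j describes item j
U = rng.normal(size=(m, r))
V = rng.normal(size=(n, r))

# The full rating matrix R = U V^T has rank at most r
R = U @ V.T
assert np.linalg.matrix_rank(R) <= r

# Predicted rating for user i and item j is the inner product u_i^T v_j
i, j = 2, 3
assert np.isclose(R[i, j], U[i] @ V[j])
```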

120.19 Attention

Attention mechanisms compare query, key, and value vectors.

For matrices

Q, \quad K, \quad V,

scaled dot-product attention computes

\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left( \frac{QK^T}{\sqrt{d}} \right) V.

The matrix

QK^T

contains pairwise similarities between queries and keys.

The softmax converts similarities into weights.

Multiplication by V forms weighted sums of value vectors.

Attention is therefore a structured sequence of matrix multiplication, normalization, and weighted averaging.
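
Scaled dot-product attention in a few lines of NumPy; the dimensions below are arbitrary.

```python
import numpy as np

def softmax_rows(M):
    # Row-wise softmax, shifted by the row max for numerical stability
    Z = np.exp(M - M.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

rng = np.random.default_rng(10)
n_q, n_k, d = 3, 4, 8
Q = rng.normal(size=(n_q, d))
K = rng.normal(size=(n_k, d))
V = rng.normal(size=(n_k, d))

# softmax(QK^T / sqrt(d)) V
weights = softmax_rows(Q @ K.T / np.sqrt(d))
out = weights @ V

assert weights.shape == (n_q, n_k)
assert np.allclose(weights.sum(axis=1), 1.0)  # each row is a weight distribution
assert out.shape == (n_q, d)
```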

120.20 Covariance

Covariance measures how features vary together.

For a centered data matrix X, the covariance matrix is

\Sigma = \frac{1}{m} X^T X.

The entry \Sigma_{ij} measures the relationship between feature i and feature j.

The diagonal entries are variances. The off-diagonal entries are covariances.

Covariance matrices are symmetric positive semidefinite.

They appear in PCA, Gaussian models, whitening, metric learning, Kalman filters, and uncertainty estimation.

120.21 Whitening

Whitening transforms data so that its covariance becomes the identity matrix.

If

\Sigma = V \Lambda V^T

is an eigendecomposition of the covariance matrix, then a whitening transform is

W = \Lambda^{-1/2} V^T.

For centered data x, define

z = Wx.

Then, ideally,

\operatorname{Cov}(z) = I.

Whitening removes scale and correlation from the data.

It is used as preprocessing, in statistical models, and in representation learning.
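
A whitening sketch: after the eigendecomposition-based transform, the sample covariance of the transformed data is the identity up to roundoff. The mixing matrix below is invented to create correlated features.

```python
import numpy as np

rng = np.random.default_rng(11)
mix = np.array([[2.0, 0.0, 0.0],
                [1.0, 1.0, 0.0],
                [0.0, 0.5, 3.0]])
A = rng.normal(size=(500, 3)) @ mix   # correlated, unequal-scale features
X = A - A.mean(axis=0)                # center the data

# Covariance and its eigendecomposition Sigma = V Lambda V^T
m = X.shape[0]
Sigma = (X.T @ X) / m
lam, V = np.linalg.eigh(Sigma)

# Whitening transform W = Lambda^{-1/2} V^T, applied to each row of X
W = np.diag(lam ** -0.5) @ V.T
Z = X @ W.T

# The whitened covariance is (numerically) the identity
Sigma_z = (Z.T @ Z) / m
assert np.allclose(Sigma_z, np.eye(3), atol=1e-8)
```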

120.22 Similarity Search

Many machine learning systems retrieve nearest vectors.

Given a query vector q, find database vectors x_i that maximize

q^T x_i

or minimize

\|q - x_i\|.

This is nearest neighbor search in a vector space.

Applications include semantic search, image retrieval, recommendation, clustering, duplicate detection, and memory retrieval.

At small scale, exact search can compute all distances. At large scale, approximate nearest neighbor methods use indexing structures, quantization, hashing, or graph search.

The underlying computations remain vector operations.
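
Brute-force nearest neighbor search is a single vectorized distance computation. The database and query below are synthetic, with index 42 planted as the true neighbor.

```python
import numpy as np

rng = np.random.default_rng(12)
database = rng.normal(size=(1000, 16))          # hypothetical vector database
q = database[42] + 0.01 * rng.normal(size=16)   # query near item 42

# Exact search: compute all distances, take the argmin
dists = np.linalg.norm(database - q, axis=1)
nearest = int(np.argmin(dists))
assert nearest == 42

# For unit-normalized vectors, maximizing the inner product q^T x
# agrees with minimizing the Euclidean distance ||q - x||
unit = database / np.linalg.norm(database, axis=1, keepdims=True)
qu = q / np.linalg.norm(q)
best_ip = int(np.argmax(unit @ qu))
assert best_ip == 42
```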

120.23 Generalization

Training error measures performance on the training data. Generalization concerns performance on new data.

Linear algebra enters generalization through model capacity, norm constraints, rank, margins, and regularization.

For example, a classifier with a larger margin is often more robust to small perturbations. A low-rank representation may discard noise. A small norm solution may avoid overly sensitive predictions.

These are geometric ideas.

Machine learning theory studies how these geometric constraints affect prediction on unseen examples.

120.24 Overfitting

Overfitting occurs when a model fits noise or accidental structure in the training set.

A model with too much flexibility may achieve low training loss but poor test performance.

Linear algebraic symptoms include:

Symptom | Meaning
Ill-conditioned design matrix | Unstable parameters
Very large weights | Sensitive predictions
High rank with noisy directions | Noise fitted as signal
Small singular values used heavily | Poor numerical stability
Near-duplicate features | Multicollinearity

Regularization, dimensionality reduction, early stopping, and more data can reduce overfitting.

120.25 Machine Learning and Linear Algebra

The main dictionary is direct.

Machine learning object | Linear algebra object
Example | Vector
Dataset | Matrix
Feature map | Vector-valued function
Linear model | Dot product
Neural network layer | Matrix plus nonlinearity
Training | Optimization over vectors and matrices
Loss gradient | Vector or matrix derivative
Regularization | Norm penalty
Kernel method | Gram matrix
Embedding | Learned coordinate vector
PCA | Eigenvalue problem
Matrix factorization | Low-rank approximation
Attention | Matrix products and weighted sums
Similarity search | Inner products and distances

This table explains why linear algebra is not only a prerequisite for machine learning. It is part of the structure of the field.

120.26 Summary

Machine learning represents data as vectors and collections of data as matrices.

Linear models use dot products. Regression solves least squares problems. Classification separates vector spaces by hyperplanes. Kernels replace explicit feature vectors by inner products. Neural networks compose matrix transformations with nonlinearities. PCA, embeddings, attention, recommender systems, and similarity search all rely on matrix operations.

The central principle is representation plus optimization. Data are placed in vector spaces, models transform those vectors, losses measure error, and algorithms adjust parameters by linear algebraic computation.