# Chapter 120. Machine Learning

Machine learning studies algorithms that improve their behavior from data.

A machine learning system receives examples, extracts structure, and produces predictions, classifications, rankings, embeddings, or decisions. Linear algebra is central because data are represented as vectors and matrices, models are often linear or locally linear, and training is usually an optimization problem over parameters.

Many standard learning methods can be written in the form

$$
\text{data matrix} \to \text{linear map} \to \text{loss function} \to \text{optimization}.
$$

This is why linear algebra appears throughout machine learning: in regression, classification, kernels, embeddings, neural networks, dimensionality reduction, and numerical optimization. Stanford CS229 notes, for example, develop regularization, kernels, SVMs, and optimization using vector and matrix notation.

## 120.1 Data as Vectors

A single data point is often represented as a vector

$$
x =
\begin{bmatrix}
x_1 \\
x_2 \\
\vdots \\
x_d
\end{bmatrix}
\in \mathbb{R}^d.
$$

Each component is a feature.

For example, a house may be represented by features such as size, number of rooms, distance to city center, and age. An image may be represented by pixel values. A document may be represented by word counts or embedding coordinates.

| Object | Vector representation |
|---|---|
| Image | Pixel vector or feature vector |
| Text | Token vector, count vector, or embedding |
| Audio | Sample vector or spectral feature vector |
| User | Preference vector |
| Product | Attribute or embedding vector |
| Sensor reading | Measurement vector |

Once data are vectors, distances, inner products, projections, norms, and linear transformations become available.

## 120.2 Data Matrices

A dataset with \(m\) examples and \(d\) features is stored as a matrix

$$
X =
\begin{bmatrix}
- & x_1^T & - \\
- & x_2^T & - \\
& \vdots & \\
- & x_m^T & -
\end{bmatrix}
\in \mathbb{R}^{m\times d}.
$$

Each row is one example. Each column is one feature.

The target values are often stored as a vector

$$
y =
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_m
\end{bmatrix}.
$$

The pair \((X,y)\) is the basic supervised learning dataset.

This is the same design-matrix setup used in linear regression. Machine learning extends the framework to broader models, losses, constraints, and data types.
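As a minimal illustration (NumPy, with made-up feature values), a small supervised dataset can be stored in exactly this layout:

```python
import numpy as np

# Hypothetical dataset: m = 4 houses, d = 3 features
# (size in square meters, number of rooms, age in years).
X = np.array([
    [120.0, 4.0, 10.0],
    [ 65.0, 2.0, 30.0],
    [ 90.0, 3.0,  5.0],
    [200.0, 6.0,  1.0],
])                                           # shape (m, d): one row per example
y = np.array([450.0, 220.0, 380.0, 900.0])   # target values, one per example

print(X.shape, y.shape)   # (4, 3) (4,)
```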

## 120.3 Features

A feature is a numerical quantity used by a model.

A feature map converts raw input into a vector:

$$
\phi : \mathcal{X} \to \mathbb{R}^d.
$$

For an input object \(a\), the model uses

$$
x=\phi(a).
$$

Feature maps may be simple or complex.

| Raw object | Possible features |
|---|---|
| Text | Word counts, TF-IDF, embeddings |
| Image | Pixels, patches, learned features |
| Graph | Degrees, neighborhoods, spectral features |
| User behavior | Counts, recency, preferences |
| Time series | Lags, moving averages, Fourier coefficients |

Feature design changes the geometry of the learning problem. A model that is linear in one feature space may represent nonlinear behavior in the original input space.

## 120.4 Linear Prediction

The simplest prediction model is linear:

$$
\hat{y}=w^Tx+b.
$$

Here \(w\in\mathbb{R}^d\) is the weight vector and \(b\in\mathbb{R}\) is the bias.

The value \(w_j\) measures how feature \(j\) contributes to the prediction.

For the full dataset,

$$
\hat{y}=Xw+b\mathbf{1}.
$$

If the bias is included as an extra feature equal to \(1\), then the model becomes

$$
\hat{y}=Xw.
$$

This is the same matrix form used in least squares.
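A short sketch (NumPy, illustrative weights only) of both forms of the prediction, with the bias kept explicit and then absorbed as an extra constant feature:

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # m = 3 examples, d = 2 features
w = np.array([0.5, -0.2])                            # weight vector
b = 1.0                                              # bias

y_hat = X @ w + b                                    # prediction with explicit bias

# Equivalent form: append a constant-1 feature and fold b into w.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.append(w, b)

assert np.allclose(y_hat, X_aug @ w_aug)
```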

## 120.5 Loss Functions

A loss function measures prediction error.

For regression, a common loss is squared error:

$$
\ell(\hat{y},y)=(\hat{y}-y)^2.
$$

For a dataset, the empirical risk is

$$
L(w)=\frac{1}{m}\sum_{i=1}^m (w^Tx_i-y_i)^2.
$$

In matrix form,

$$
L(w)=\frac{1}{m}\|Xw-y\|^2.
$$

Thus linear regression is a least squares problem.

Other learning problems use different losses.

| Problem | Common loss |
|---|---|
| Regression | Squared error |
| Binary classification | Logistic loss |
| Margin classification | Hinge loss |
| Multiclass classification | Cross-entropy |
| Ranking | Pairwise ranking loss |
| Embedding learning | Contrastive loss |

The choice of loss changes the optimization problem.

## 120.6 Training

Training means choosing model parameters to minimize loss.

For a linear model, training solves

$$
\min_w L(w).
$$

For squared loss,

$$
\min_w \|Xw-y\|^2.
$$

This is ordinary least squares.

If \(X^TX\) is invertible, the solution is

$$
\hat{w}=(X^TX)^{-1}X^Ty.
$$
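A sketch of this closed-form solution on synthetic data (NumPy). The explicit normal-equations solve is shown alongside `np.linalg.lstsq`, which is the numerically safer choice when \(X^TX\) is ill-conditioned:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 3
X = rng.normal(size=(m, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=m)   # noisy linear targets

# Normal equations (valid when X^T X is invertible).
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferred least squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal, w_lstsq)   # both close to w_true
```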

In larger or more complex models, training is usually done by iterative optimization, such as gradient descent.

## 120.7 Gradient Descent

Gradient descent updates parameters by moving opposite the gradient:

$$
w_{k+1}=w_k-\eta \nabla L(w_k).
$$

Here \(\eta>0\) is the learning rate.

For squared loss,

$$
L(w)=\frac{1}{m}\|Xw-y\|^2,
$$

the gradient is

$$
\nabla L(w)=\frac{2}{m}X^T(Xw-y).
$$

Thus each update uses matrix-vector products with \(X\) and \(X^T\).
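A minimal gradient descent loop for this loss (NumPy, synthetic data; the learning rate and iteration count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 200, 5
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d)        # noiseless linear targets

w = np.zeros(d)
eta = 0.1                         # learning rate
for _ in range(500):
    grad = (2.0 / m) * X.T @ (X @ w - y)   # gradient of (1/m) ||Xw - y||^2
    w -= eta * grad

print(np.mean((X @ w - y) ** 2))  # training loss, near zero here
```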

This is one reason large-scale machine learning depends heavily on efficient linear algebra kernels.

## 120.8 Regularization

Regularization adds a penalty to the training objective.

A common example is ridge regularization:

$$
\min_w \|Xw-y\|^2+\lambda\|w\|^2.
$$

Here \(\lambda>0\) controls the penalty strength.

The solution satisfies

$$
(X^TX+\lambda I)w=X^Ty.
$$

Regularization discourages overly large parameter vectors and improves stability when features are correlated or data are limited. CS229 notes describe \(\ell_2\) regularization as encouraging small parameter norm and, in gradient descent form, as weight decay.
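A sketch of the ridge solution (NumPy, arbitrary values of \(\lambda\)); the penalized normal equations are solved directly:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Solve (X^T X + lam * I) w = X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(lam, np.linalg.norm(w))   # larger lambda shrinks the weight norm
```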

## 120.9 Classification

In binary classification, the target is usually

$$
y\in\{-1,+1\}
$$

or

$$
y\in\{0,1\}.
$$

A linear classifier computes a score

$$
s=w^Tx+b.
$$

The predicted class may be

$$
\hat{y}=\operatorname{sign}(s).
$$

Geometrically, the equation

$$
w^Tx+b=0
$$

defines a hyperplane.

This hyperplane separates feature space into two half-spaces. The vector \(w\) is normal to the separating hyperplane.

Classification is therefore a geometric problem in vector space.

## 120.10 Logistic Regression

Logistic regression converts a linear score into a probability.

The score is

$$
s=w^Tx+b.
$$

The probability of class \(1\) is

$$
p=\sigma(s)=\frac{1}{1+e^{-s}}.
$$

The sigmoid function maps real numbers to values between \(0\) and \(1\).

Training usually minimizes cross-entropy loss:

$$
L(w) =
-\sum_{i=1}^m
\left[
y_i\log p_i+(1-y_i)\log(1-p_i)
\right].
$$

Although the model is linear in its score, the probability is nonlinear in the parameters.

The training objective is convex for ordinary logistic regression.
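A sketch of logistic regression trained by gradient descent on toy data (NumPy; labels in \(\{0,1\}\), bias omitted for brevity). The gradient of the mean cross-entropy with respect to \(w\) is \(\frac{1}{m}X^T(p-y)\):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(3)
m, d = 200, 2
X = rng.normal(size=(m, d))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # linearly separable toy labels

w = np.zeros(d)
eta = 0.1
for _ in range(1000):
    p = sigmoid(X @ w)                      # predicted probabilities
    w -= eta * X.T @ (p - y) / m            # gradient step on cross-entropy

accuracy = np.mean((sigmoid(X @ w) > 0.5) == y)
print(accuracy)                             # near 1.0 on this toy problem
```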

## 120.11 Support Vector Machines

A support vector machine seeks a separating hyperplane with a large margin.

For binary labels

$$
y_i\in\{-1,+1\},
$$

a linear classifier uses

$$
f(x)=w^Tx+b.
$$

The margin condition is

$$
y_i(w^Tx_i+b)\geq 1.
$$

The hard-margin SVM solves

$$
\min_{w,b}\frac{1}{2}\|w\|^2
$$

subject to the margin constraints.

The support vectors are the training points whose margin constraints are active: in the hard-margin case they lie exactly on the margin boundary, while soft-margin variants also count points inside the margin or misclassified. SVMs are commonly developed through optimization, kernels, and inner products, and the kernel trick replaces explicit high-dimensional feature maps by kernel evaluations.

## 120.12 Kernels

A kernel is a function

$$
K(x,z)
$$

that behaves like an inner product in some feature space:

$$
K(x,z)=\langle \phi(x),\phi(z)\rangle.
$$

The kernel trick allows algorithms to use inner products in a high-dimensional feature space without explicitly constructing the feature vectors.

Common kernels include:

| Kernel | Formula |
|---|---|
| Linear | \(K(x,z)=x^Tz\) |
| Polynomial | \(K(x,z)=(x^Tz+c)^p\) |
| Gaussian RBF | \(K(x,z)=\exp(-\gamma\|x-z\|^2)\) |

Kernel methods show that linear algebra can operate implicitly through Gram matrices.

The Gram matrix has entries

$$
G_{ij}=K(x_i,x_j).
$$
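A sketch of building the Gram matrix for the Gaussian RBF kernel (NumPy; the value of \(\gamma\) is an arbitrary choice):

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    # G[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.clip(sq_dists, 0.0, None))

X = np.random.default_rng(4).normal(size=(5, 3))
G = rbf_gram(X, gamma=0.5)
print(G.shape)               # (5, 5)
print(np.allclose(G, G.T))   # symmetric, with ones on the diagonal
```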

## 120.13 Embeddings

An embedding maps an object into a vector space.

For example, a word, document, image, user, or item may be mapped to

$$
z\in\mathbb{R}^d.
$$

The goal is that geometry in the embedding space reflects semantic or task-relevant structure.

Similar objects should have nearby vectors. Dissimilar objects should have distant or differently oriented vectors.

Common similarity measures include:

| Measure | Formula |
|---|---|
| Dot product | \(x^Tz\) |
| Euclidean distance | \(\|x-z\|\) |
| Cosine similarity | \(\frac{x^Tz}{\|x\|\|z\|}\) |

Embeddings turn complex objects into points in a learned vector space.

## 120.14 Neural Networks

A neural network is a composition of linear maps and nonlinear activation functions.

A single layer has the form

$$
h=\sigma(Wx+b).
$$

Here \(W\) is a weight matrix, \(b\) is a bias vector, and \(\sigma\) is applied componentwise.

A multilayer network has the form

$$
f(x) =
W_L\,\sigma\big(W_{L-1}\,\sigma(\cdots\sigma(W_1x+b_1)\cdots)+b_{L-1}\big)+b_L.
$$

The matrices \(W_1,\ldots,W_L\) contain the learned parameters.

Without nonlinear activation functions, the composition would collapse to one linear map. Nonlinearity is what allows neural networks to represent complex functions.

Still, each layer is built from matrix multiplication.
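A sketch of a two-layer forward pass (NumPy, random weights, ReLU as the activation), showing that each layer is a matrix-vector product followed by a componentwise nonlinearity:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

rng = np.random.default_rng(5)
d_in, d_hidden, d_out = 4, 8, 2

W1 = 0.1 * rng.normal(size=(d_hidden, d_in))
b1 = np.zeros(d_hidden)
W2 = 0.1 * rng.normal(size=(d_out, d_hidden))
b2 = np.zeros(d_out)

x = rng.normal(size=d_in)
h = relu(W1 @ x + b1)     # hidden layer: linear map plus nonlinearity
f = W2 @ h + b2           # output layer: linear map
print(f.shape)            # (2,)
```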

## 120.15 Backpropagation

Backpropagation computes gradients of a loss with respect to all parameters in a neural network.

It applies the chain rule through the computational graph.

For a layer

$$
h=\sigma(Wx+b),
$$

the gradient with respect to \(W\) has an outer-product structure:

$$
\frac{\partial L}{\partial W} =
\delta x^T,
$$

where \(\delta\) is the backpropagated error vector for the layer.

Thus training neural networks repeatedly uses matrix products, transposes, Jacobian-vector products, and outer products.

MIT notes on matrix calculus for machine learning emphasize this matrix-based calculus viewpoint for differentiating vector and matrix expressions used in learning systems.
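A small illustration of the outer-product structure (NumPy, one hidden layer, squared-error loss on a single example; a hand-written sketch, not a general autodiff implementation):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(6)
d_in, d_hid = 3, 4
W = rng.normal(size=(d_hid, d_in))
b = np.zeros(d_hid)
v = rng.normal(size=d_hid)          # output weights: f(x) = v^T h
x = rng.normal(size=d_in)
y = 1.0

# Forward pass.
z = W @ x + b
h = sigmoid(z)
f = v @ h
loss = 0.5 * (f - y) ** 2

# Backward pass by the chain rule.
df = f - y                          # dL/df
dh = df * v                         # dL/dh
delta = dh * h * (1.0 - h)          # dL/dz (sigmoid derivative)
dW = np.outer(delta, x)             # dL/dW = delta x^T, an outer product
db = delta
print(dW.shape)                     # (4, 3)
```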

## 120.16 Dimensionality Reduction

High-dimensional data are often compressed into lower-dimensional representations.

A linear dimensionality reduction map has the form

$$
z=W^Tx,
$$

where

$$
W\in\mathbb{R}^{d\times r},
\qquad r<d.
$$

The vector \(z\in\mathbb{R}^r\) is a lower-dimensional representation of \(x\).

Good dimensionality reduction preserves important structure: variance, distance, neighborhood relations, class separation, or task information.

Linear algebra supplies the central methods, especially eigenvalue decompositions and singular value decompositions.

## 120.17 Principal Component Analysis

Principal component analysis, or PCA, finds orthogonal directions of maximum variance.

Given centered data matrix \(X\), the sample covariance matrix is proportional to

$$
X^TX.
$$

The principal components are eigenvectors of \(X^TX\).

Equivalently, PCA can be computed by the singular value decomposition

$$
X=U\Sigma V^T.
$$

The columns of \(V\) give principal directions.

Keeping the first \(r\) directions gives a rank-\(r\) approximation of the data.

PCA is therefore a direct application of eigenvalues, orthogonality, projection, and SVD.
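A sketch of PCA by SVD on centered synthetic data (NumPy); the top \(r\) right singular vectors give the principal directions:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)               # center the data

U, S, Vt = np.linalg.svd(X, full_matrices=False)

r = 2
components = Vt[:r]                  # top r principal directions (as rows)
Z = X @ components.T                 # projected data, shape (100, 2)

explained = S[:r] ** 2 / np.sum(S ** 2)   # fraction of variance kept
print(Z.shape, explained)
```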

## 120.18 Matrix Factorization

Many machine learning problems use matrix factorization.

Suppose \(R\in\mathbb{R}^{m\times n}\) stores user-item ratings.

A low-rank model approximates

$$
R\approx UV^T,
$$

where

$$
U\in\mathbb{R}^{m\times r},
\qquad
V\in\mathbb{R}^{n\times r}.
$$

The row \(u_i\) represents user \(i\). The row \(v_j\) represents item \(j\).

The predicted rating is

$$
\hat{R}_{ij}=u_i^Tv_j.
$$

This is the basis of many recommender systems.

The model assumes that observed preferences are governed by a smaller number of latent factors.
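A sketch of the low-rank prediction (NumPy, with random factors standing in for learned ones):

```python
import numpy as np

rng = np.random.default_rng(8)
m_users, n_items, r = 6, 10, 3

U = rng.normal(size=(m_users, r))    # row u_i: latent factors for user i
V = rng.normal(size=(n_items, r))    # row v_j: latent factors for item j

R_hat = U @ V.T                      # all predicted ratings at once
i, j = 2, 7
print(np.isclose(R_hat[i, j], U[i] @ V[j]))   # entry (i, j) equals u_i^T v_j
```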

## 120.19 Attention

Attention mechanisms compare query, key, and value vectors.

For matrices

$$
Q,\quad K,\quad V,
$$

scaled dot-product attention computes

$$
\operatorname{Attention}(Q,K,V) =
\operatorname{softmax}
\left(
\frac{QK^T}{\sqrt{d}}
\right)V.
$$

Here \(d\) denotes the dimension of the key vectors.

The matrix

$$
QK^T
$$

contains pairwise similarities between queries and keys.

The softmax converts similarities into weights.

Multiplication by \(V\) forms weighted sums of value vectors.

Attention is therefore a structured sequence of matrix multiplication, normalization, and weighted averaging.
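A sketch of scaled dot-product attention (NumPy, a single head, no masking):

```python
import numpy as np

def softmax(A, axis=-1):
    A = A - A.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(A)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)             # pairwise query-key similarities
    weights = softmax(scores, axis=-1)        # each row sums to one
    return weights @ V                        # weighted sums of value vectors

rng = np.random.default_rng(9)
Q = rng.normal(size=(4, 8))      # 4 queries
K = rng.normal(size=(6, 8))      # 6 keys
V = rng.normal(size=(6, 8))      # 6 values
print(attention(Q, K, V).shape)  # (4, 8)
```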

## 120.20 Covariance

Covariance measures how features vary together.

For centered data matrix \(X\), the covariance matrix is

$$
\Sigma=\frac{1}{m}X^TX.
$$

The entry \(\Sigma_{ij}\) measures the relationship between feature \(i\) and feature \(j\).

The diagonal entries are variances. The off-diagonal entries are covariances.

Covariance matrices are symmetric positive semidefinite.

They appear in PCA, Gaussian models, whitening, metric learning, Kalman filters, and uncertainty estimation.

## 120.21 Whitening

Whitening transforms data so that its covariance becomes the identity matrix.

If

$$
\Sigma=V\Lambda V^T
$$

is an eigendecomposition of the covariance matrix, then a whitening transform is

$$
W=\Lambda^{-1/2}V^T.
$$

For centered data \(x\), define

$$
z=Wx.
$$

Then, ideally,

$$
\operatorname{Cov}(z)=I.
$$

Whitening removes scale and correlation from the data.

It is used as preprocessing, in statistical models, and in representation learning.
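A sketch of whitening via the eigendecomposition of the sample covariance (NumPy; assumes the covariance is positive definite so that \(\Lambda^{-1/2}\) exists):

```python
import numpy as np

rng = np.random.default_rng(10)
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ A            # correlated synthetic data
X = X - X.mean(axis=0)                       # center

Sigma = X.T @ X / X.shape[0]                 # sample covariance
eigvals, V = np.linalg.eigh(Sigma)           # Sigma = V diag(eigvals) V^T
W = np.diag(eigvals ** -0.5) @ V.T           # whitening transform

Z = X @ W.T                                  # whitened data (rows z_i = W x_i)
print(np.round(Z.T @ Z / Z.shape[0], 3))     # approximately the identity
```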

## 120.22 Similarity Search

Many machine learning systems retrieve nearest vectors.

Given a query vector \(q\), find database vectors \(x_i\) that maximize

$$
q^Tx_i
$$

or minimize

$$
\|q-x_i\|.
$$

This is nearest neighbor search in a vector space.

Applications include semantic search, image retrieval, recommendation, clustering, duplicate detection, and memory retrieval.

At small scale, exact search can compute all distances. At large scale, approximate nearest neighbor methods use indexing structures, quantization, hashing, or graph search.

The underlying computations remain vector operations.
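A sketch of exact brute-force search (NumPy); at larger scale an approximate index would replace the exhaustive scan, but the core operation is the same matrix-vector product:

```python
import numpy as np

rng = np.random.default_rng(11)
database = rng.normal(size=(1000, 64))    # 1000 stored vectors
q = rng.normal(size=64)                   # query vector

# Inner-product search: one matrix-vector product scores every candidate.
scores = database @ q
top5 = np.argsort(-scores)[:5]

# Euclidean search over the same database.
dists = np.linalg.norm(database - q, axis=1)
nearest = np.argmin(dists)

print(top5, nearest)
```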

## 120.23 Generalization

Training error measures performance on the training data. Generalization concerns performance on new data.

Linear algebra enters generalization through model capacity, norm constraints, rank, margins, and regularization.

For example, a classifier with a larger margin is often more robust to small perturbations. A low-rank representation may discard noise. A small norm solution may avoid overly sensitive predictions.

These are geometric ideas.

Machine learning theory studies how these geometric constraints affect prediction on unseen examples.

## 120.24 Overfitting

Overfitting occurs when a model fits noise or accidental structure in the training set.

A model with too much flexibility may achieve low training loss but poor test performance.

Linear algebraic symptoms include:

| Symptom | Meaning |
|---|---|
| Ill-conditioned design matrix | Unstable parameters |
| Very large weights | Sensitive predictions |
| High rank with noisy directions | Noise fitted as signal |
| Small singular values used heavily | Poor numerical stability |
| Near-duplicate features | Multicollinearity |

Regularization, dimensionality reduction, early stopping, and more data can reduce overfitting.

## 120.25 Machine Learning and Linear Algebra

The main dictionary is direct.

| Machine learning object | Linear algebra object |
|---|---|
| Example | Vector |
| Dataset | Matrix |
| Feature map | Vector-valued function |
| Linear model | Dot product |
| Neural network layer | Matrix plus nonlinearity |
| Training | Optimization over vectors and matrices |
| Loss gradient | Vector or matrix derivative |
| Regularization | Norm penalty |
| Kernel method | Gram matrix |
| Embedding | Learned coordinate vector |
| PCA | Eigenvalue problem |
| Matrix factorization | Low-rank approximation |
| Attention | Matrix products and weighted sums |
| Similarity search | Inner products and distances |

This table explains why linear algebra is not only a prerequisite for machine learning. It is part of the structure of the field.

## 120.26 Summary

Machine learning represents data as vectors and collections of data as matrices.

Linear models use dot products. Regression solves least squares problems. Classification separates vector spaces by hyperplanes. Kernels replace explicit feature vectors by inner products. Neural networks compose matrix transformations with nonlinearities. PCA, embeddings, attention, recommender systems, and similarity search all rely on matrix operations.

The central principle is representation plus optimization. Data are placed in vector spaces, models transform those vectors, losses measure error, and algorithms adjust parameters by linear algebraic computation.
