Machine learning studies algorithms that improve their behavior from data.
A machine learning system receives examples, extracts structure, and produces predictions, classifications, rankings, embeddings, or decisions. Linear algebra is central because data are represented as vectors and matrices, models are often linear or locally linear, and training is usually an optimization problem over parameters.
Many standard learning methods can be written in the form
$$ \min_{\theta} \ \frac{1}{m}\sum_{i=1}^{m} \ell\big(f(x_i; \theta),\, y_i\big), $$
a loss averaged over examples and minimized over model parameters.
This is why linear algebra appears throughout machine learning: in regression, classification, kernels, embeddings, neural networks, dimensionality reduction, and numerical optimization. Stanford CS229 notes, for example, develop regularization, kernels, SVMs, and optimization using vector and matrix notation.
120.1 Data as Vectors
A single data point is often represented as a vector
$$ x = (x_1, x_2, \dots, x_d)^T \in \mathbb{R}^d. $$
Each component $x_j$ is a feature.
For example, a house may be represented by features such as size, number of rooms, distance to city center, and age. An image may be represented by pixel values. A document may be represented by word counts or embedding coordinates.
| Object | Vector representation |
|---|---|
| Image | Pixel vector or feature vector |
| Text | Token vector, count vector, or embedding |
| Audio | Sample vector or spectral feature vector |
| User | Preference vector |
| Product | Attribute or embedding vector |
| Sensor reading | Measurement vector |
Once data are vectors, distances, inner products, projections, norms, and linear transformations become available.
120.2 Data Matrices
A dataset with $m$ examples and $d$ features is stored as a matrix
$$ X = \begin{bmatrix}
- & x_1^T & - \\
- & x_2^T & - \\
  & \vdots &   \\
- & x_m^T & -
\end{bmatrix} \in \mathbb{R}^{m\times d}. $$
Each row is one example. Each column is one feature.
The target values are often stored as a vector
$$ y = (y_1, \dots, y_m)^T \in \mathbb{R}^m. $$
The pair $(X, y)$ is the basic supervised learning dataset.
The design matrix is the same as in linear regression. Machine learning extends this framework to broader models, losses, constraints, and data types.
120.3 Features
A feature is a numerical quantity used by a model.
A feature map $\phi$ converts a raw input into a vector.
For an input object $o$, the model uses $x = \phi(o)$.
Feature maps may be simple or complex.
| Raw object | Possible features |
|---|---|
| Text | Word counts, TF-IDF, embeddings |
| Image | Pixels, patches, learned features |
| Graph | Degrees, neighborhoods, spectral features |
| User behavior | Counts, recency, preferences |
| Time series | Lags, moving averages, Fourier coefficients |
Feature design changes the geometry of the learning problem. A model that is linear in one feature space may represent nonlinear behavior in the original input space.
120.4 Linear Prediction
The simplest prediction model is linear:
$$ \hat{y} = w^T x + b. $$
Here $w \in \mathbb{R}^d$ is the weight vector and $b \in \mathbb{R}$ is the bias.
The value $w_j$ measures how feature $j$ contributes to the prediction.
For the full dataset,
$$ \hat{y} = Xw + b\mathbf{1}. $$
If the bias is included as an extra feature equal to $1$, then the model becomes
$$ \hat{y} = Xw. $$
This is the same matrix form used in least squares.
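As a concrete illustration, here is a minimal NumPy sketch of linear prediction; the data values are hypothetical, chosen only to show the two equivalent forms of the model:

```python
import numpy as np

# Hypothetical data: 4 examples, 3 features
X = np.array([[1.0, 2.0, 0.5],
              [0.0, 1.0, 1.5],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
w = np.array([0.5, -1.0, 2.0])   # weight vector
b = 0.1                          # bias

# Per-example prediction: w^T x + b, computed for all rows at once
y_hat = X @ w + b

# Equivalent form with the bias folded in as an extra all-ones feature
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.append(w, b)
assert np.allclose(y_hat, X_aug @ w_aug)
```

The assertion checks that absorbing the bias into an all-ones column leaves every prediction unchanged.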
120.5 Loss Functions
A loss function measures prediction error.
For regression, a common loss is squared error:
$$ \ell(\hat{y}, y) = (\hat{y} - y)^2. $$
For a dataset, the empirical risk is
$$ L(w, b) = \frac{1}{m}\sum_{i=1}^{m} \big(w^T x_i + b - y_i\big)^2. $$
In matrix form,
$$ L(w) = \frac{1}{m}\,\|Xw - y\|^2. $$
Thus linear regression is a least squares problem.
Other learning problems use different losses.
| Problem | Common loss |
|---|---|
| Regression | Squared error |
| Binary classification | Logistic loss |
| Margin classification | Hinge loss |
| Multiclass classification | Cross-entropy |
| Ranking | Pairwise ranking loss |
| Embedding learning | Contrastive loss |
The choice of loss changes the optimization problem.
120.6 Training
Training means choosing model parameters to minimize loss.
For a linear model, training solves
$$ \min_{w} \ \frac{1}{m}\sum_{i=1}^{m} \ell\big(w^T x_i,\, y_i\big). $$
For squared loss,
$$ \min_{w} \ \|Xw - y\|^2. $$
This is ordinary least squares.
If $X^T X$ is invertible, the solution is
$$ w = (X^T X)^{-1} X^T y. $$
In larger or more complex models, training is usually done by iterative optimization, such as gradient descent.
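A short NumPy sketch of ordinary least squares, on synthetic noise-free data so the recovered weights can be checked against the generating weights (both routes shown are standard; `lstsq` is the numerically preferred one in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                        # noise-free targets for clarity

# Normal equations: solve (X^T X) w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The numerically safer route: a least-squares solver based on SVD/QR
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both solutions recover `w_true` here because the data are noise-free and the design matrix has full column rank.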
120.7 Gradient Descent
Gradient descent updates parameters by moving opposite the gradient:
$$ w \leftarrow w - \eta\, \nabla L(w). $$
Here $\eta > 0$ is the learning rate.
For squared loss,
$$ L(w) = \frac{1}{m}\,\|Xw - y\|^2, $$
the gradient is
$$ \nabla L(w) = \frac{2}{m}\, X^T (Xw - y). $$
Thus each update uses matrix-vector products with $X$ and $X^T$.
This is one reason large-scale machine learning depends heavily on efficient linear algebra kernels.
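The update rule above can be sketched in a few lines of NumPy; the step size and iteration count are illustrative choices, not tuned values:

```python
import numpy as np

def gd_least_squares(X, y, eta=0.1, steps=500):
    """Minimize (1/m)||Xw - y||^2 by gradient descent."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = (2.0 / m) * X.T @ (X @ w - y)  # gradient of the squared loss
        w -= eta * grad                        # step opposite the gradient
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.0])   # noise-free synthetic targets
w = gd_least_squares(X, y)
```

Each iteration costs one product with $X$ and one with $X^T$, which is exactly the linear-algebra workload the text describes.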
120.8 Regularization
Regularization adds a penalty to the training objective.
A common example is ridge regularization:
$$ \min_{w} \ \|Xw - y\|^2 + \lambda \|w\|^2. $$
Here $\lambda > 0$ controls the penalty strength.
The solution satisfies
$$ (X^T X + \lambda I)\, w = X^T y. $$
Regularization discourages overly large parameter vectors and improves stability when features are correlated or data are limited. CS229 notes describe regularization as encouraging small parameter norm and, in gradient descent form, as weight decay.
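A minimal sketch of the ridge solution via the regularized normal equations; the data are random and the two values of `lam` are arbitrary, chosen only to show the shrinkage effect:

```python
import numpy as np

def ridge(X, y, lam):
    """Solve the regularized normal equations (X^T X + lam I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)

w_small = ridge(X, y, lam=0.01)
w_large = ridge(X, y, lam=100.0)   # stronger penalty shrinks the weights
```

Increasing `lam` shrinks the solution norm, which is the stabilizing effect regularization provides.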
120.9 Classification
In binary classification, the target is usually $y \in \{-1, +1\}$ or $y \in \{0, 1\}$.
A linear classifier computes a score
$$ s = w^T x + b. $$
The predicted class may be
$$ \hat{y} = \operatorname{sign}(w^T x + b). $$
Geometrically, the equation
$$ w^T x + b = 0 $$
defines a hyperplane.
This hyperplane separates feature space into two half-spaces. The vector $w$ is normal to the separating hyperplane.
Classification is therefore a geometric problem in vector space.
120.10 Logistic Regression
Logistic regression converts a linear score into a probability.
The score is
$$ s = w^T x + b. $$
The probability of class $1$ is
$$ P(y = 1 \mid x) = \sigma(s) = \frac{1}{1 + e^{-s}}. $$
The sigmoid function maps real numbers to values between $0$ and $1$.
Training usually minimizes cross-entropy loss:
$$ L = -\frac{1}{m}\sum_{i=1}^{m} \big[\, y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i) \,\big], \qquad \hat{p}_i = \sigma(w^T x_i + b). $$
Although the model is linear in its score, the probability is nonlinear in the parameters.
The training objective is convex for ordinary logistic regression.
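A minimal sketch of logistic regression trained by gradient descent, on a tiny hand-made one-dimensional dataset (the gradient of the average cross-entropy with respect to $w$ is $\frac{1}{m} X^T(\hat{p} - y)$, which the loop uses directly; no bias term is included, for brevity):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_gd(X, y, eta=0.5, steps=2000):
    """Minimize average cross-entropy; the gradient is (1/m) X^T (p - y)."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ w)                  # predicted probabilities
        w -= eta * (1.0 / m) * X.T @ (p - y)
    return w

# Toy 1-D problem, labels in {0, 1}, separable by the sign of the feature
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = logistic_gd(X, y)
p = sigmoid(X @ w)   # probabilities for the training points
```

After training, negative inputs receive probability below one half and positive inputs above it, as the separable labels demand.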
120.11 Support Vector Machines
A support vector machine seeks a separating hyperplane with a large margin.
For binary labels $y_i \in \{-1, +1\}$,
a linear classifier uses
$$ \hat{y} = \operatorname{sign}(w^T x + b). $$
The margin condition is
$$ y_i\,(w^T x_i + b) \ge 1 \quad \text{for all } i. $$
The hard-margin SVM solves
$$ \min_{w,\, b} \ \tfrac{1}{2}\|w\|^2 $$
subject to the margin constraints.
The support vectors are the training points that lie on the margin boundary (or, in the soft-margin case, inside it). SVMs are commonly developed through optimization, kernels, and inner products, and the kernel trick replaces explicit high-dimensional feature maps by kernel evaluations.
120.12 Kernels
A kernel is a function
$$ k : \mathcal{X} \times \mathcal{X} \to \mathbb{R} $$
that behaves like an inner product in some feature space:
$$ k(x, x') = \langle \phi(x), \phi(x') \rangle. $$
The kernel trick allows algorithms to use inner products in a high-dimensional feature space without explicitly constructing the feature vectors.
Common kernels include:
| Kernel | Formula |
|---|---|
| Linear | $k(x, x') = x^T x'$ |
| Polynomial | $k(x, x') = (x^T x' + c)^p$ |
| Gaussian RBF | $k(x, x') = \exp\!\big(-\lVert x - x' \rVert^2 / (2\sigma^2)\big)$ |
The Gram matrix $G$ has entries
$$ G_{ij} = k(x_i, x_j). $$
Kernel methods show that linear algebra can operate implicitly through Gram matrices.
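A short sketch of building a Gram matrix with the Gaussian RBF kernel; the three data points are arbitrary, and the nested comprehension is written for clarity rather than speed:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0]])
m = X.shape[0]

# Gram matrix G_ij = k(x_i, x_j); symmetric positive semidefinite
G = np.array([[rbf_kernel(X[i], X[j]) for j in range(m)]
              for i in range(m)])
```

The diagonal entries equal one because $k(x, x) = \exp(0)$, and symmetry follows from the symmetry of the kernel itself.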
120.13 Embeddings
An embedding maps an object into a vector space.
For example, a word, document, image, user, or item may be mapped to a vector $v \in \mathbb{R}^k$.
The goal is that geometry in the embedding space reflects semantic or task-relevant structure.
Similar objects should have nearby vectors. Dissimilar objects should have distant or differently oriented vectors.
Common similarity measures include:
| Measure | Formula |
|---|---|
| Dot product | $u^T v$ |
| Euclidean distance | $\lVert u - v \rVert$ |
| Cosine similarity | $u^T v \,/\, \big(\lVert u \rVert\, \lVert v \rVert\big)$ |
Embeddings turn complex objects into points in a learned vector space.
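A minimal sketch of cosine similarity, with three hand-picked vectors showing that it measures direction, not length:

```python
import numpy as np

def cosine_similarity(u, v):
    """u^T v / (||u|| ||v||): similarity of directions, independent of length."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 3.0])   # orthogonal to a
```

Here `a` and `b` have cosine similarity one despite different norms, while `a` and `c` have similarity zero.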
120.14 Neural Networks
A neural network is a composition of linear maps and nonlinear activation functions.
A single layer has the form
$$ h = \sigma(Wx + b). $$
Here $W$ is a weight matrix, $b$ is a bias vector, and $\sigma$ is applied componentwise.
A multilayer network has the form
$$ f(x) = W_L\, \sigma\big(W_{L-1} \cdots\, \sigma(W_1 x + b_1)\, \cdots + b_{L-1}\big) + b_L. $$
The matrices $W_1, \dots, W_L$ contain the learned parameters.
Without nonlinear activation functions, the composition would collapse to one linear map. Nonlinearity is what allows neural networks to represent complex functions.
Still, each layer is built from matrix multiplication.
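A minimal forward pass for a two-layer network in NumPy; the layer sizes and random parameters are arbitrary, and ReLU stands in for the generic nonlinearity $\sigma$:

```python
import numpy as np

def relu(z):
    """Componentwise nonlinearity sigma(z) = max(z, 0)."""
    return np.maximum(z, 0.0)

def forward(x, params):
    """Two-layer network: W2 @ relu(W1 @ x + b1) + b2."""
    W1, b1, W2, b2 = params
    h = relu(W1 @ x + b1)   # hidden layer: linear map plus nonlinearity
    return W2 @ h + b2      # output layer: linear map

rng = np.random.default_rng(0)
params = (rng.normal(size=(4, 3)), np.zeros(4),   # W1: R^3 -> R^4
          rng.normal(size=(2, 4)), np.zeros(2))   # W2: R^4 -> R^2
y = forward(rng.normal(size=3), params)           # output lives in R^2
```

Removing `relu` would collapse the composition to the single linear map `W2 @ W1`, illustrating why the nonlinearity is essential.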
120.15 Backpropagation
Backpropagation computes gradients of a loss with respect to all parameters in a neural network.
It applies the chain rule through the computational graph.
For a layer
$$ h = \sigma(Wx + b), $$
the gradient with respect to $W$ has an outer-product structure:
$$ \frac{\partial L}{\partial W} = \delta\, x^T, $$
where $\delta$ is the backpropagated error vector for the layer.
Thus training neural networks repeatedly uses matrix products, transposes, Jacobian-vector products, and outer products.
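The outer-product structure can be verified numerically for one layer. This sketch uses $\sigma = \tanh$ and the toy loss $L = \sum_j h_j$ so that the upstream gradient is a vector of ones; all values are hypothetical:

```python
import numpy as np

# One layer h = tanh(W x + b), with the toy scalar loss L = sum(h)
x = np.array([1.0, -2.0, 0.5])
W = np.array([[0.2, -0.1, 0.3],
              [0.0,  0.4, -0.2]])
b = np.array([0.1, -0.3])

z = W @ x + b
h = np.tanh(z)

dL_dh = np.ones_like(h)                   # dL/dh for L = sum(h)
delta = dL_dh * (1.0 - np.tanh(z) ** 2)   # error vector: chain rule through tanh
dL_dW = np.outer(delta, x)                # gradient has outer-product structure
```

The test below checks one entry of `dL_dW` against a finite-difference approximation.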
MIT notes on matrix calculus for machine learning emphasize this matrix-based calculus viewpoint for differentiating vector and matrix expressions used in learning systems.
120.16 Dimensionality Reduction
High-dimensional data are often compressed into lower-dimensional representations.
A linear dimensionality reduction map has the form
$$ z = W^T x, $$
where
$$ W \in \mathbb{R}^{d \times k}, \qquad k < d. $$
The vector $z \in \mathbb{R}^k$ is a lower-dimensional representation of $x$.
Good dimensionality reduction preserves important structure: variance, distance, neighborhood relations, class separation, or task information.
Linear algebra supplies the central methods, especially eigenvalue decompositions and singular value decompositions.
120.17 Principal Component Analysis
Principal component analysis, or PCA, finds orthogonal directions of maximum variance.
Given a centered data matrix $X$, the sample covariance matrix is proportional to
$$ X^T X. $$
The principal components are eigenvectors of $X^T X$.
Equivalently, PCA can be computed by the singular value decomposition
$$ X = U \Sigma V^T. $$
The columns of $V$ give the principal directions.
Keeping the first $k$ directions gives a rank-$k$ approximation of the data.
PCA is therefore a direct application of eigenvalues, orthogonality, projection, and SVD.
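A minimal sketch of PCA via the SVD, on synthetic data constructed so that almost all variance lies along the first coordinate axis:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: variance ~9 along the first axis, ~0.25 along the second
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0],
                                          [0.0, 0.5]])
X = X - X.mean(axis=0)            # center the data

U, S, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = Vt[0]                       # first principal direction (unit vector)
Z = X @ Vt[:1].T                  # 1-D PCA projection of every example
```

Because the first coordinate carries most of the variance, the leading principal direction aligns closely with the first axis.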
120.18 Matrix Factorization
Many machine learning problems use matrix factorization.
Suppose $R \in \mathbb{R}^{m \times n}$ stores user-item ratings.
A low-rank model approximates
$$ R \approx U V^T, $$
where
$$ U \in \mathbb{R}^{m \times k}, \qquad V \in \mathbb{R}^{n \times k}, \qquad k \ll \min(m, n). $$
The row $u_i^T$ of $U$ represents user $i$. The row $v_j^T$ of $V$ represents item $j$.
The predicted rating is
$$ \hat{r}_{ij} = u_i^T v_j. $$
This is the basis of many recommender systems.
The model assumes that observed preferences are governed by a smaller number of latent factors.
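A minimal sketch of the low-rank idea on a tiny ratings matrix that is rank one by construction, so a rank-1 truncated SVD recovers it exactly (real ratings data would only be approximately low rank, and the factors would be learned, not given):

```python
import numpy as np

# Hypothetical latent factors (k = 1): one number per user and per item
u = np.array([1.0, 2.0, 3.0])        # user factors
v = np.array([2.0, 1.0, 0.5, 4.0])   # item factors
R = np.outer(u, v)                   # ratings matrix: r_ij = u_i * v_j

# Truncated SVD gives the best rank-1 approximation; here it is exact
U, S, Vt = np.linalg.svd(R, full_matrices=False)
R1 = S[0] * np.outer(U[:, 0], Vt[0])
```

The second and later singular values vanish, confirming the rank-one structure.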
120.19 Attention
Attention mechanisms compare query, key, and value vectors.
For matrices $Q$, $K$, and $V$,
scaled dot-product attention computes
$$ \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V. $$
The matrix
$$ Q K^T $$
contains pairwise similarities between queries and keys.
The softmax converts similarities into weights.
Multiplication by forms weighted sums of value vectors.
Attention is therefore a structured sequence of matrix multiplication, normalization, and weighted averaging.
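The whole mechanism fits in a few lines of NumPy; the shapes below (two queries, three key-value pairs) are arbitrary illustrative choices:

```python
import numpy as np

def softmax(A, axis=-1):
    """Row-wise softmax with max-subtraction for numerical stability."""
    A = A - A.max(axis=axis, keepdims=True)
    e = np.exp(A)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # similarity -> weights
    return weights @ V, weights                         # weighted sums of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries,  d_k = 4
K = rng.normal(size=(3, 4))   # 3 keys,     d_k = 4
V = rng.normal(size=(3, 5))   # 3 values,   d_v = 5
out, weights = attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the keys, and each row of `out` is the corresponding weighted average of value vectors.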
120.20 Covariance
Covariance measures how features vary together.
For a centered data matrix $X \in \mathbb{R}^{m \times d}$, the covariance matrix is
$$ C = \frac{1}{m}\, X^T X. $$
The entry $C_{ij}$ measures the relationship between feature $i$ and feature $j$.
The diagonal entries are variances. The off-diagonal entries are covariances.
Covariance matrices are symmetric positive semidefinite.
They appear in PCA, Gaussian models, whitening, metric learning, Kalman filters, and uncertainty estimation.
120.21 Whitening
Whitening transforms data so that its covariance becomes the identity matrix.
If
$$ C = Q \Lambda Q^T $$
is an eigendecomposition of the covariance matrix, then a whitening transform is
$$ W = \Lambda^{-1/2} Q^T. $$
For centered data $x$, define
$$ z = Wx = \Lambda^{-1/2} Q^T x. $$
Then, ideally,
$$ \operatorname{Cov}(z) = I. $$
Whitening removes scale and correlation from the data.
It is used as preprocessing, in statistical models, and in representation learning.
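A minimal NumPy sketch of whitening; the data are synthetic and deliberately correlated, and because the transform is built from the sample covariance of the same data, the whitened covariance comes out as the identity up to numerical error:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated synthetic data (a linear mix of independent Gaussians)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 1.0],
                                           [0.0, 1.0]])
X = X - X.mean(axis=0)                  # center

C = X.T @ X / X.shape[0]                # sample covariance
lam, Q = np.linalg.eigh(C)              # eigendecomposition C = Q diag(lam) Q^T
W = np.diag(lam ** -0.5) @ Q.T          # whitening transform Lambda^{-1/2} Q^T
Z = X @ W.T                             # whitened data

C_white = Z.T @ Z / Z.shape[0]          # should be (numerically) the identity
```

The covariance of `Z` is the identity, confirming that whitening removes both scale and correlation.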
120.22 Similarity Search
Many machine learning systems retrieve nearest vectors.
Given a query vector $q$, find database vectors $v$ that maximize
$$ q^T v $$
or minimize
$$ \|q - v\|. $$
This is nearest neighbor search in a vector space.
Applications include semantic search, image retrieval, recommendation, clustering, duplicate detection, and memory retrieval.
At small scale, exact search can compute all distances. At large scale, approximate nearest neighbor methods use indexing structures, quantization, hashing, or graph search.
The underlying computations remain vector operations.
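Exact small-scale search can be sketched directly; the database vectors below are hypothetical, and the distance computation is a single vectorized operation over all of them:

```python
import numpy as np

def nearest(query, database, k=2):
    """Exact k-nearest-neighbor search by Euclidean distance."""
    dists = np.linalg.norm(database - query, axis=1)  # all distances at once
    return np.argsort(dists)[:k]                      # indices of the k closest

db = np.array([[0.0, 0.0],
               [1.0, 1.0],
               [5.0, 5.0],
               [0.9, 1.1]])
q = np.array([1.0, 1.0])
idx = nearest(q, db)   # indices of the two closest database vectors
```

The query matches index 1 exactly (distance zero) and index 3 next, while the far point at index 2 is never returned.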
120.23 Generalization
Training error measures performance on the training data. Generalization concerns performance on new data.
Linear algebra enters generalization through model capacity, norm constraints, rank, margins, and regularization.
For example, a classifier with a larger margin is often more robust to small perturbations. A low-rank representation may discard noise. A small norm solution may avoid overly sensitive predictions.
These are geometric ideas.
Machine learning theory studies how these geometric constraints affect prediction on unseen examples.
120.24 Overfitting
Overfitting occurs when a model fits noise or accidental structure in the training set.
A model with too much flexibility may achieve low training loss but poor test performance.
Linear algebraic symptoms include:
| Symptom | Meaning |
|---|---|
| Ill-conditioned design matrix | Unstable parameters |
| Very large weights | Sensitive predictions |
| High rank with noisy directions | Noise fitted as signal |
| Small singular values used heavily | Poor numerical stability |
| Near-duplicate features | Multicollinearity |
Regularization, dimensionality reduction, early stopping, and more data can reduce overfitting.
120.25 Machine Learning and Linear Algebra
The main dictionary is direct.
| Machine learning object | Linear algebra object |
|---|---|
| Example | Vector |
| Dataset | Matrix |
| Feature map | Vector-valued function |
| Linear model | Dot product |
| Neural network layer | Matrix plus nonlinearity |
| Training | Optimization over vectors and matrices |
| Loss gradient | Vector or matrix derivative |
| Regularization | Norm penalty |
| Kernel method | Gram matrix |
| Embedding | Learned coordinate vector |
| PCA | Eigenvalue problem |
| Matrix factorization | Low-rank approximation |
| Attention | Matrix products and weighted sums |
| Similarity search | Inner products and distances |
This table explains why linear algebra is not only a prerequisite for machine learning. It is part of the structure of the field.
120.26 Summary
Machine learning represents data as vectors and collections of data as matrices.
Linear models use dot products. Regression solves least squares problems. Classification separates vector spaces by hyperplanes. Kernels replace explicit feature vectors by inner products. Neural networks compose matrix transformations with nonlinearities. PCA, embeddings, attention, recommender systems, and similarity search all rely on matrix operations.
The central principle is representation plus optimization. Data are placed in vector spaces, models transform those vectors, losses measure error, and algorithms adjust parameters by linear algebraic computation.