140.1 Introduction
Modern artificial intelligence is built on linear algebra.
Data is represented as vectors. Batches of data are matrices or tensors. Neural network layers are affine transformations followed by nonlinear functions. Training uses gradients, Jacobians, Hessians, and large-scale matrix operations. In transformer models, attention is expressed through matrix products involving query, key, and value matrices.
The central pattern is: represent data as vectors, transform those vectors with matrices, and adjust the matrices using gradients.
This chapter describes how the ideas of linear algebra appear in current AI systems.
140.2 Data as Vectors
AI systems begin by converting objects into vectors.
A word, image, audio segment, document, user profile, protein sequence, or graph node may be represented by a vector $x \in \mathbb{R}^d$.
The dimension $d$ depends on the model.
Examples:
| Object | Vector representation |
|---|---|
| Word | Embedding vector |
| Image patch | Pixel or feature vector |
| Document | Dense semantic vector |
| User | Preference vector |
| Graph node | Node embedding |
| Audio frame | Feature vector |
Once data is represented as vectors, linear algebra can be used to compare, transform, combine, and optimize it.
140.3 Embeddings
An embedding maps a discrete object into a vector space.
For example, a vocabulary item $w_i$ may be mapped to an embedding vector $e_i \in \mathbb{R}^d$.
Words with related meanings often have embeddings that are close under cosine similarity or inner product.
The embedding matrix has the form

$$E \in \mathbb{R}^{V \times d},$$

where $V$ is the vocabulary size and $d$ is the embedding dimension.
A token index $i$ selects row $i$ of $E$. Thus embedding lookup is a structured linear-algebra operation.
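As a minimal sketch (with an invented vocabulary size and embedding dimension), embedding lookup is simply row indexing into the embedding matrix:

```python
import numpy as np

V, d = 10_000, 64                  # assumed vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))        # embedding matrix, one row per vocabulary item

token_ids = np.array([3, 17, 42])  # a short sequence of token indices
embeddings = E[token_ids]          # lookup = selecting rows of E
print(embeddings.shape)            # (3, 64)
```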
Embeddings are used in language models, recommender systems, image-text models, graph neural networks, and retrieval systems.
140.4 Similarity and Inner Products
Many AI systems compare vectors using inner products.
Given two vectors $x, y \in \mathbb{R}^d$, their dot product is

$$x^T y = \sum_{i=1}^{d} x_i y_i.$$

Cosine similarity normalizes by vector lengths:

$$\cos(x, y) = \frac{x^T y}{\lVert x \rVert \, \lVert y \rVert}.$$
Large cosine similarity means that the two vectors point in similar directions.
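A small sketch of these two comparisons, using synthetic vectors in place of real embeddings:

```python
import numpy as np

def cosine_similarity(x, y):
    """Dot product normalized by the two vector lengths."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

rng = np.random.default_rng(1)
x, y = rng.normal(size=64), rng.normal(size=64)
print(x @ y)                    # raw dot product
print(cosine_similarity(x, y))  # value in [-1, 1]
```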
This is used in:
| Task | Use |
|---|---|
| Search | Find nearby document vectors |
| Recommendation | Compare user and item vectors |
| Classification | Compare feature and class vectors |
| Clustering | Group similar embeddings |
| Retrieval-augmented generation | Retrieve relevant context |
Vector similarity is one of the most common uses of linear algebra in AI.
140.5 Neural Network Layers
A basic neural network layer has the form

$$y = \sigma(Wx + b),$$
where:
| Symbol | Meaning |
|---|---|
| $x$ | Input vector |
| $W$ | Weight matrix |
| $b$ | Bias vector |
| $\sigma$ | Nonlinear activation |
| $y$ | Output vector |
The affine part $Wx + b$ is linear algebra. The activation $\sigma$ introduces nonlinearity.
A deep neural network composes many such layers:

$$f(x) = f_L\bigl(f_{L-1}(\cdots f_1(x)\cdots)\bigr).$$
Thus deep learning alternates linear transformations with nonlinear coordinatewise operations.
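A minimal sketch of this composition, with arbitrary layer sizes and ReLU as the assumed activation:

```python
import numpy as np

def layer(x, W, b):
    """One layer: affine map followed by a ReLU nonlinearity."""
    return np.maximum(0.0, W @ x + b)

rng = np.random.default_rng(2)
x = rng.normal(size=8)                           # input vector
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)  # first layer parameters
W2, b2 = rng.normal(size=(4, 16)), np.zeros(4)   # second layer parameters

y = layer(layer(x, W1, b1), W2, b2)              # compose two layers
print(y.shape)                                   # (4,)
```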
140.6 Batches and Matrix Multiplication
Training uses batches of examples.
If a batch contains $n$ input vectors of dimension $d$, they are stored as a matrix

$$X \in \mathbb{R}^{n \times d}.$$

A linear layer applied to the whole batch is

$$Y = XW^T + \mathbf{1}b^T,$$

where $W$ is a weight matrix and $\mathbf{1}b^T$ broadcasts the bias $b$ across the rows of the batch.
This turns many vector operations into one matrix multiplication.
Matrix multiplication is the computational core of neural network training and inference. Modern hardware accelerators are designed around fast dense matrix and tensor operations.
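The sketch below (with assumed shapes and random data) checks that applying the layer to the whole batch with one matrix product matches applying it to each vector in a loop:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 32, 8, 16
X = rng.normal(size=(n, d))                # batch of n input vectors (one per row)
W = rng.normal(size=(m, d))                # weight matrix
b = rng.normal(size=m)                     # bias vector

Y_batch = X @ W.T + b                      # one matrix multiplication, bias broadcast
Y_loop = np.stack([W @ x + b for x in X])  # the same computation, one vector at a time
print(np.allclose(Y_batch, Y_loop))        # True
```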
140.7 Loss Functions and Gradients
Training adjusts weights to minimize a loss function.
Let $\theta$ denote all model parameters. Training solves approximately:

$$\min_{\theta} L(\theta).$$
The gradient $\nabla_{\theta} L(\theta)$
points in the direction of greatest local increase. Optimization algorithms move in the opposite direction.
A typical update is

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(\theta),$$

where $\eta$ is the learning rate.
This is gradient descent in parameter space.
In modern AI, $\theta$ may contain millions or billions of parameters, but the principle remains ordinary vector calculus and linear algebra.
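A tiny sketch of the update rule on a least-squares loss, where the gradient is available in closed form (the matrix, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

# Minimize L(theta) = ||A theta - y||^2 with plain gradient descent.
rng = np.random.default_rng(4)
A = rng.normal(size=(20, 5))
y = rng.normal(size=20)

theta = np.zeros(5)
eta = 0.005                               # learning rate, small enough for this problem
for _ in range(1000):
    grad = 2 * A.T @ (A @ theta - y)      # gradient of the squared error
    theta = theta - eta * grad            # gradient descent step
print(np.linalg.norm(A @ theta - y))      # residual shrinks toward the least-squares optimum
```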
140.8 Backpropagation
Backpropagation computes gradients through a composed function.
If a model is a composition

$$f = f_L \circ f_{L-1} \circ \cdots \circ f_1,$$
then the chain rule says that derivatives multiply in reverse order.
For Jacobians,

$$J_f(x) = J_{f_L} \, J_{f_{L-1}} \cdots J_{f_1},$$

with each Jacobian evaluated at the corresponding intermediate value.
Backpropagation applies this rule efficiently without explicitly forming every large Jacobian.
Instead, it propagates vector-Jacobian products backward through the computation graph.
This is why matrix calculus is essential for deep learning.
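A minimal sketch of the reverse pass for a small two-layer network with a scalar output, propagating vector-Jacobian products instead of forming full Jacobians (shapes and the ReLU activation are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=8)
W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=(1, 16))

# Forward pass, keeping the intermediates needed for the backward pass.
z1 = W1 @ x
a1 = np.maximum(0.0, z1)       # ReLU
out = W2 @ a1                  # scalar output (shape (1,))

# Backward pass: multiply by Jacobians in reverse order.
g_out = np.ones(1)             # d(out)/d(out)
g_a1 = W2.T @ g_out            # vector-Jacobian product through the second layer
g_z1 = g_a1 * (z1 > 0)         # through the ReLU (elementwise Jacobian)
g_W1 = np.outer(g_z1, x)       # gradient of the output with respect to W1
print(g_W1.shape)              # (16, 8)
```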
140.9 Attention
Attention is one of the most important linear-algebraic mechanisms in modern AI.
Given input vectors collected in a matrix $X \in \mathbb{R}^{n \times d}$, a transformer forms query, key, and value matrices:

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V.$$

The attention score matrix is

$$S = \frac{QK^T}{\sqrt{d_k}}.$$
Scaled dot-product attention is
$$\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V.$$
The query, key, and value matrices are produced by learned linear projections, and the attention scores are computed by matrix multiplication.
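A compact sketch of scaled dot-product attention with small, arbitrary dimensions and random projection matrices:

```python
import numpy as np

def softmax(S, axis=-1):
    S = S - S.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(S)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # n x n score matrix
    return softmax(scores) @ V                # weighted combination of value rows

rng = np.random.default_rng(6)
n, d_model, d_k = 5, 32, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)                              # (5, 8)
```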
140.10 Multi-Head Attention
Multi-head attention uses several attention operations in parallel.
Each head has its own projection matrices:

$$W_Q^{(i)}, \quad W_K^{(i)}, \quad W_V^{(i)}, \qquad i = 1, \dots, h.$$
Each head computes attention in a different learned subspace.
The outputs are concatenated and multiplied by another matrix:

$$\operatorname{MultiHead}(X) = \bigl[\operatorname{head}_1; \dots; \operatorname{head}_h\bigr] W_O.$$
This allows the model to represent multiple kinds of relationships at once.
Some heads may track local syntax. Others may track long-range dependencies, positional structure, or semantic relations.
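A sketch of the multi-head pattern, assuming the head outputs are simply concatenated and then projected by $W_O$ (all matrices here are random stand-ins for learned parameters):

```python
import numpy as np

def attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    W = np.exp(S - S.max(axis=-1, keepdims=True))
    return (W / W.sum(axis=-1, keepdims=True)) @ V

def multi_head(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) tuples, one per head."""
    outs = [attention(X @ W_Q, X @ W_K, X @ W_V) for (W_Q, W_K, W_V) in heads]
    return np.concatenate(outs, axis=-1) @ W_O   # concatenate heads, then project

rng = np.random.default_rng(7)
n, d_model, d_k, h = 5, 32, 8, 4
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
print(multi_head(X, heads, W_O).shape)           # (5, 32)
```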
140.11 Low-Rank Structure in Attention
The attention matrix has size roughly

$$n \times n,$$

where $n$ is the sequence length.
This can be expensive for long sequences.
Many efficient transformer methods use linear algebraic structure to reduce cost. One common idea is low-rank approximation. Linformer, for example, proposed approximating self-attention by a low-rank matrix to reduce the sequence-length cost from quadratic to linear under its approximation assumptions.
The general principle is: when a large matrix is approximately low rank, it can be replaced by a product of much smaller matrices.
This connects transformer efficiency with matrix approximation.
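A schematic sketch of the low-rank idea (not any specific paper's exact construction): project the keys and values along the sequence dimension with a smaller matrix, so the score matrix becomes $n \times k$ rather than $n \times n$. The projection matrix and dimensions below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(8)
n, d_k, k = 1024, 64, 128                 # sequence length, head dimension, projected length
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

E = rng.normal(size=(k, n)) / np.sqrt(n)  # projection along the sequence dimension
K_proj, V_proj = E @ K, E @ V             # shape (k, d_k) instead of (n, d_k)

scores = Q @ K_proj.T / np.sqrt(d_k)      # shape (n, k) instead of (n, n)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V_proj                    # shape (n, d_k)
print(scores.shape, out.shape)
```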
140.12 State Space Models
Some modern sequence models use state space equations instead of full attention.
A linear state space model has the form

$$h_{t+1} = A h_t + B x_t, \qquad y_t = C h_t + D x_t.$$
Here:
| Symbol | Meaning |
|---|---|
| $h_t$ | Hidden state |
| $x_t$ | Input |
| $y_t$ | Output |
| $A, B, C, D$ | Learned matrices |
State space models use recurrence, convolution, and structured matrices to process long sequences.
Mamba is one modern architecture based on selective state spaces and designed for efficient sequence modeling.
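A minimal sketch of the linear recurrence above, with small arbitrary dimensions and random matrices standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(9)
state_dim, in_dim, out_dim, T = 4, 2, 3, 10
A = rng.normal(size=(state_dim, state_dim)) * 0.5   # learned matrices (here random)
B = rng.normal(size=(state_dim, in_dim))
C = rng.normal(size=(out_dim, state_dim))
D = rng.normal(size=(out_dim, in_dim))

h = np.zeros(state_dim)
outputs = []
for t in range(T):
    x_t = rng.normal(size=in_dim)       # input at time t
    y_t = C @ h + D @ x_t               # output from the current state and input
    h = A @ h + B @ x_t                 # state update
    outputs.append(y_t)
print(np.stack(outputs).shape)          # (10, 3)
```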
140.13 Singular Value Decomposition in AI
The singular value decomposition writes a matrix as

$$A = U \Sigma V^T,$$

where $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal with nonnegative singular values.
SVD appears in AI through:
| Use | Role |
|---|---|
| Dimensionality reduction | Keep leading singular vectors |
| Compression | Approximate weight matrices |
| Denoising | Remove small singular components |
| Latent semantic analysis | Factor term-document matrices |
| Model analysis | Study learned representations |
Low-rank approximation is especially important when models are large.
If a weight matrix $W$ has effective low rank, it may be approximated by

$$W \approx AB,$$

where $A$ and $B$ are smaller matrices.
This reduces storage and computation.
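A sketch of low-rank compression by truncated SVD, with an arbitrary weight matrix size and target rank:

```python
import numpy as np

rng = np.random.default_rng(10)
W = rng.normal(size=(256, 512))              # a weight matrix
r = 32                                       # target rank (an assumed choice)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]                         # 256 x 32 factor
B = Vt[:r, :]                                # 32 x 512 factor

print(A.size + B.size, "vs", W.size)                  # stored parameters: 24576 vs 131072
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))  # relative approximation error
```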
140.14 Principal Component Analysis
Principal component analysis, or PCA, finds directions of maximal variance.
Given a centered data matrix

$$X \in \mathbb{R}^{n \times d},$$

the covariance matrix is

$$C = \frac{1}{n} X^T X.$$
The principal components are eigenvectors of this covariance matrix.
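A small sketch computing the leading principal components from the covariance eigendecomposition, using correlated synthetic data:

```python
import numpy as np

rng = np.random.default_rng(11)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # correlated synthetic data
X = X - X.mean(axis=0)                       # center the data

C = X.T @ X / X.shape[0]                     # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)         # symmetric eigendecomposition (ascending order)
order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
components = eigvecs[:, order[:2]]           # top two principal components

X_reduced = X @ components                   # project onto 2 dimensions
print(X_reduced.shape)                       # (500, 2)
```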
PCA is used for:
| Task | Purpose |
|---|---|
| Visualization | Reduce to 2 or 3 dimensions |
| Preprocessing | Remove redundant dimensions |
| Denoising | Keep dominant components |
| Representation analysis | Inspect embedding geometry |
PCA is one of the classical bridges between linear algebra and data analysis.
140.15 Matrix Factorization for Recommendation
Recommender systems often use matrix factorization.
Let $R \in \mathbb{R}^{m \times n}$ be a user-item rating matrix for $m$ users and $n$ items.
The goal is to approximate

$$R \approx UV^T,$$
where:
| Matrix | Meaning |
|---|---|
| $U \in \mathbb{R}^{m \times k}$ | User factors |
| $V \in \mathbb{R}^{n \times k}$ | Item factors |
The predicted rating for user $u$ and item $i$ is

$$\hat{r}_{ui} = u_u^T v_i,$$

where $u_u$ and $v_i$ are the corresponding rows of $U$ and $V$.
This model says that users and items live in the same latent vector space.
Recommendation becomes inner-product prediction.
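A minimal sketch of prediction with given factor matrices; the factors here are random rather than learned, and the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(12)
m, n, k = 100, 50, 8                    # users, items, latent dimension
U = rng.normal(size=(m, k))             # user factors (would normally be learned)
V = rng.normal(size=(n, k))             # item factors

R_hat = U @ V.T                         # all predicted ratings at once
user, item = 7, 21
print(np.isclose(R_hat[user, item], U[user] @ V[item]))  # single prediction = inner product
```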
140.16 Graph Neural Networks
Graphs are common in AI: social networks, molecules, knowledge graphs, citation networks, and recommendation systems.
A graph neural network updates node features using neighboring nodes.
A simple linear message-passing layer has the form

$$H' = \sigma(\hat{A} H W),$$
where:
| Symbol | Meaning |
|---|---|
| $\hat{A}$ | Normalized adjacency matrix |
| $H$ | Node feature matrix |
| $W$ | Weight matrix |
| $\sigma$ | Activation |
The adjacency matrix determines how information flows across the graph.
This is spectral graph theory and neural networks combined.
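A sketch of one message-passing layer on a tiny graph, using symmetric normalization with self-loops (one common choice) and ReLU as the assumed activation:

```python
import numpy as np

# A tiny undirected graph on 4 nodes.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                          # add self-loops
deg = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt       # normalized adjacency matrix

rng = np.random.default_rng(13)
H = rng.normal(size=(4, 8))                    # node feature matrix
W = rng.normal(size=(8, 16))                   # weight matrix

H_next = np.maximum(0.0, A_norm @ H @ W)       # one message-passing layer with ReLU
print(H_next.shape)                            # (4, 16)
```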
140.17 Generative Models
Generative AI models produce new samples.
Linear algebra appears in several forms:
| Model type | Linear algebra role |
|---|---|
| Language models | Token embeddings and attention |
| Diffusion models | Noise vectors and denoising networks |
| Image generators | Latent spaces and convolutional layers |
| Autoencoders | Encoder and decoder maps |
| GANs | Generator and discriminator matrices |
Latent vector spaces are central. A model often maps a vector $z \in \mathbb{R}^k$
to a generated object.
Interpolating between latent vectors can produce smooth changes in generated outputs.
140.18 Retrieval-Augmented Generation
Retrieval-augmented generation combines search with generation.
Documents are embedded as vectors $d_1, d_2, \dots, d_N \in \mathbb{R}^k$.
A query is embedded as a vector $q \in \mathbb{R}^k$.
Retrieval selects documents with large similarity scores, for example

$$s_i = q^T d_i.$$
The selected documents are then passed to a language model as context.
Thus RAG systems depend heavily on:
| Component | Linear algebra operation |
|---|---|
| Embedding model | Vector representation |
| Vector database | Nearest-neighbor search |
| Similarity scoring | Dot products or cosine similarity |
| Reranking | Matrix and vector scoring |
| Generation | Transformer inference |
The retrieval step is essentially large-scale vector search.
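A sketch of the retrieval step, with random unit-norm embeddings standing in for a real embedding model (so the dot product equals cosine similarity):

```python
import numpy as np

rng = np.random.default_rng(14)
N, k = 1000, 64
doc_vectors = rng.normal(size=(N, k))      # embedded document collection
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

q = rng.normal(size=k)
q /= np.linalg.norm(q)                     # unit-norm query

scores = doc_vectors @ q                   # similarity of the query to every document
top = np.argsort(scores)[::-1][:5]         # indices of the 5 most similar documents
print(top, scores[top])
```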
140.19 Model Compression
Large AI models are expensive to store and run.
Linear algebra supports compression through:
| Method | Linear algebra idea |
|---|---|
| Low-rank factorization | Replace $W$ by $AB$ |
| Pruning | Remove small or unimportant weights |
| Quantization | Store lower-precision values |
| Sparse matrices | Exploit zeros |
| Distillation | Approximate one function by another |
Low-rank methods explicitly use matrix factorization. Quantization changes the scalar representation. Sparse methods change the matrix storage pattern.
These techniques reduce memory bandwidth and computational cost.
140.20 Hardware and Tensor Algebra
AI hardware is optimized for tensor operations.
A tensor program is usually a sequence of operations such as matrix multiplications, elementwise operations, transposes, and reshapes, together with reductions such as sums, norms, and softmax.
Performance depends on:
| Factor | Linear algebra issue |
|---|---|
| Matrix shape | Arithmetic intensity |
| Memory layout | Data movement |
| Precision | Numerical error |
| Blocking | Cache and accelerator use |
| Sparsity | Irregular computation |
Even when the model is described statistically, the execution is numerical linear algebra.
140.21 AI for Linear Algebra
AI is also being used to discover or accelerate linear algebra algorithms.
One example is AlphaTensor, which used reinforcement learning to search for matrix multiplication algorithms with fewer scalar multiplications. Matrix multiplication is a core operation in linear algebra and machine learning, so algorithmic improvements can matter at large scale.
This reverses the usual relationship.
Linear algebra supports AI, and AI can search for better linear algebra procedures.
140.22 Numerical Stability
Modern AI uses finite precision arithmetic.
Common formats include 32-bit floating point, 16-bit floating point, bfloat16, and lower-precision quantized formats.
Numerical issues include:
| Issue | Effect |
|---|---|
| Overflow | Values exceed representable range |
| Underflow | Values become too small |
| Roundoff | Accumulated arithmetic error |
| Ill-conditioning | Small perturbations become large |
| Instability in softmax | Large exponentials |
Stable implementations often subtract the maximum before applying softmax. This keeps exponentials in a safe numerical range and is standard in attention implementations.
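A sketch of the standard trick: subtracting the maximum leaves the softmax unchanged but keeps every exponential bounded by 1:

```python
import numpy as np

def softmax_stable(s):
    z = s - s.max()              # shift so the largest exponent is 0
    e = np.exp(z)
    return e / e.sum()

s = np.array([1000.0, 1001.0, 1002.0])   # naive exp(s) would overflow in float64
print(softmax_stable(s))                 # well-defined probabilities
```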
140.23 Summary
Modern AI is applied linear algebra at large scale.
The central ideas are:
| Concept | AI role |
|---|---|
| Vectors | Represent data and parameters |
| Matrices | Represent learned transformations |
| Tensors | Store batches, activations, and weights |
| Inner products | Similarity and attention scores |
| Matrix multiplication | Core computation |
| Gradients | Training signal |
| Jacobians | Chain rule and backpropagation |
| SVD | Compression and dimensionality reduction |
| Eigenvectors | PCA, graph learning, spectral methods |
| Low-rank approximation | Efficient models |
| Sparse matrices | Efficient storage and computation |
| State space matrices | Long-sequence modeling |
| Vector search | Retrieval and recommendation |
AI systems may appear complex at the application level, but their computational core is a small set of linear-algebraic operations repeated at very large scale.