Chapter 140. Modern Applications in AI

140.1 Introduction

Modern artificial intelligence is built on linear algebra.

Data is represented as vectors. Batches of data are matrices or tensors. Neural network layers are affine transformations followed by nonlinear functions. Training uses gradients, Jacobians, Hessians, and large-scale matrix operations. In transformer models, attention is expressed through matrix products involving query, key, and value matrices.

The central pattern is:

\text{data} \longrightarrow \text{vectors} \longrightarrow \text{linear maps} \longrightarrow \text{optimization}.

This chapter describes how the ideas of linear algebra appear in current AI systems.

140.2 Data as Vectors

AI systems begin by converting objects into vectors.

A word, image, audio segment, document, user profile, protein sequence, or graph node may be represented by a vector

x \in \mathbb{R}^d.

The dimension d depends on the model.

Examples:

| Object | Vector representation |
| --- | --- |
| Word | Embedding vector |
| Image patch | Pixel or feature vector |
| Document | Dense semantic vector |
| User | Preference vector |
| Graph node | Node embedding |
| Audio frame | Feature vector |

Once data is represented as vectors, linear algebra can be used to compare, transform, combine, and optimize it.

140.3 Embeddings

An embedding maps a discrete object into a vector space.

For example, a vocabulary item w may be mapped to

e_w \in \mathbb{R}^d.

Words with related meanings often have embeddings that are close under cosine similarity or inner product.

The embedding matrix has the form

E \in \mathbb{R}^{V \times d},

where V is the vocabulary size and d is the embedding dimension.

A token index selects one row of E. Thus embedding lookup is a structured linear-algebra operation.
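As a small illustration, here is embedding lookup written with NumPy; the vocabulary size, dimension, and token indices below are arbitrary choices, not values from any particular model.

```python
import numpy as np

V, d = 10_000, 64                     # arbitrary vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))           # embedding matrix, one row per token

token_ids = np.array([3, 117, 42])    # a short sequence of token indices
X = E[token_ids]                      # lookup = selecting rows of E
print(X.shape)                        # (3, 64)

# Equivalently, lookup is multiplication of E by one-hot row vectors.
one_hot = np.zeros((len(token_ids), V))
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
assert np.allclose(one_hot @ E, X)
```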

Embeddings are used in language models, recommender systems, image-text models, graph neural networks, and retrieval systems.

140.4 Similarity and Inner Products

Many AI systems compare vectors using inner products.

Given two vectors

x, y \in \mathbb{R}^d,

their dot product is

x^T y.

Cosine similarity normalizes by vector lengths:

\cos(x,y) = \frac{x^T y}{\|x\|\,\|y\|}.

Large cosine similarity means that the two vectors point in similar directions.
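A minimal NumPy sketch of these two comparisons, assuming nothing beyond the definitions above:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])         # same direction as x
z = np.array([-3.0, 0.0, 1.0])        # orthogonal to x

print(x @ y)                          # dot product: 28.0
print(cosine_similarity(x, y))        # 1.0, identical directions
print(cosine_similarity(x, z))        # 0.0, orthogonal directions
```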

This is used in:

| Task | Use |
| --- | --- |
| Search | Find nearby document vectors |
| Recommendation | Compare user and item vectors |
| Classification | Compare feature and class vectors |
| Clustering | Group similar embeddings |
| Retrieval-augmented generation | Retrieve relevant context |

Vector similarity is one of the most common uses of linear algebra in AI.

140.5 Neural Network Layers

A basic neural network layer has the form

y = \sigma(Wx + b),

where:

| Symbol | Meaning |
| --- | --- |
| x | Input vector |
| W | Weight matrix |
| b | Bias vector |
| σ | Nonlinear activation |
| y | Output vector |

The affine part

Wx + b

is linear algebra. The activation introduces nonlinearity.

A deep neural network composes many such layers:

x \mapsto h_1 = \sigma(W_1 x + b_1) \mapsto h_2 = \sigma(W_2 h_1 + b_2) \mapsto \cdots.

Thus deep learning alternates linear transformations with nonlinear coordinatewise operations.
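A minimal two-layer sketch in NumPy; the layer sizes, random weights, and ReLU activation are illustrative choices.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4          # illustrative layer sizes

W1 = rng.normal(size=(d_hidden, d_in));  b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_out, d_hidden)); b2 = np.zeros(d_out)

x = rng.normal(size=d_in)

h1 = relu(W1 @ x + b1)    # first layer: affine map, then nonlinearity
y  = relu(W2 @ h1 + b2)   # second layer applied to the hidden vector
print(y.shape)            # (4,)
```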

140.6 Batches and Matrix Multiplication

Training uses batches of examples.

If a batch contains B input vectors of dimension d, they are stored as a matrix

X \in \mathbb{R}^{B \times d}.

A linear layer applied to the whole batch is

Y = XW + B_0,

where W is a weight matrix and B_0 is the bias vector broadcast across the rows of the batch.

This turns many vector operations into one matrix multiplication.
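A small NumPy sketch of the batched form, with illustrative sizes; the loop at the end only checks that the single matrix product matches the per-example computation.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out = 32, 64, 128            # illustrative batch and layer sizes

X = rng.normal(size=(B, d_in))          # one example per row
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)

Y = X @ W + b                           # the bias broadcasts across the B rows
print(Y.shape)                          # (32, 128)

# Same result computed one vector at a time (much slower in practice).
Y_loop = np.stack([X[i] @ W + b for i in range(B)])
assert np.allclose(Y, Y_loop)
```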

Matrix multiplication is the computational core of neural network training and inference. Modern hardware accelerators are designed around fast dense matrix and tensor operations.

140.7 Loss Functions and Gradients

Training adjusts weights to minimize a loss function.

Let

\theta

denote all model parameters. Training approximately solves

\min_\theta L(\theta).

The gradient

\nabla_\theta L

points in the direction of greatest local increase. Optimization algorithms move in the opposite direction.

A typical update is:

\theta_{k+1} = \theta_k - \alpha_k \nabla_\theta L(\theta_k).

This is gradient descent in parameter space; the step size \alpha_k is the learning rate.
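A minimal sketch of the update rule on a least-squares loss, where the gradient has the closed form A^T(A\theta - y); the matrix, noise level, and learning rate are illustrative.

```python
import numpy as np

# Gradient descent on the least-squares loss L(theta) = 0.5 * ||A theta - y||^2,
# whose gradient is A^T (A theta - y). All sizes are illustrative.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
theta_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = A @ theta_true + 0.01 * rng.normal(size=100)

theta = np.zeros(5)
alpha = 0.005                           # learning rate (step size)
for _ in range(1000):
    grad = A.T @ (A @ theta - y)        # gradient of the loss at theta
    theta = theta - alpha * grad        # move against the gradient
print(theta.round(2))                   # close to theta_true
```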

In modern AI, \theta may contain millions or billions of parameters, but the principle remains ordinary vector calculus and linear algebra.

140.8 Backpropagation

Backpropagation computes gradients through a composed function.

If a model is a composition

f = f_n \circ f_{n-1} \circ \cdots \circ f_1,

then the chain rule says that derivatives multiply in reverse order.

For Jacobians,

J_f = J_{f_n} J_{f_{n-1}} \cdots J_{f_1}.

Backpropagation applies this rule efficiently without explicitly forming every large Jacobian.

Instead, it propagates vector-Jacobian products backward through the computation graph.
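A hand-written backward pass for a tiny two-layer function, to show the vector-Jacobian products explicitly; the sizes and the squared-norm loss are illustrative choices, not a general autodiff implementation.

```python
import numpy as np

# Hand-written backward pass for loss = 0.5 * ||W2 @ tanh(W1 @ x)||^2,
# showing that backpropagation is a chain of vector-Jacobian products.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Forward pass, keeping the intermediate values.
z = W1 @ x
h = np.tanh(z)
y = W2 @ h
loss = 0.5 * np.sum(y ** 2)

# Backward pass: propagate the loss gradient through each layer in reverse.
g_y = y                           # dL/dy
g_h = W2.T @ g_y                  # vector-Jacobian product through W2
g_z = (1.0 - h ** 2) * g_h        # through tanh; its Jacobian is diagonal
g_x = W1.T @ g_z                  # through W1

grad_W2 = np.outer(g_y, h)        # parameter gradients fall out along the way
grad_W1 = np.outer(g_z, x)
print(g_x.shape, grad_W1.shape, grad_W2.shape)
```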

This is why matrix calculus is essential for deep learning.

140.9 Attention

Attention is one of the most important linear-algebraic mechanisms in modern AI.

Given input vectors collected in a matrix

X,

a transformer forms query, key, and value matrices:

Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V.

The attention score matrix is

QK^T.

Scaled dot-product attention is

\operatorname{Attention}(Q,K,V) = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V.

The query, key, and value matrices are produced by learned linear projections, the attention scores are computed by matrix multiplication, and d_k is the key dimension used to scale the scores.
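A compact NumPy sketch of scaled dot-product attention for a single head; the sequence length and dimensions are illustrative, and the softmax uses the numerically stabilized form discussed in Section 140.22.

```python
import numpy as np

def softmax(S, axis=-1):
    S = S - S.max(axis=axis, keepdims=True)     # stabilized softmax
    e = np.exp(S)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # n x n score matrix
    return softmax(scores, axis=-1) @ V         # weighted combination of values

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 16, 8                      # illustrative sizes
X = rng.normal(size=(n, d_model))
WQ, WK, WV = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out = attention(X @ WQ, X @ WK, X @ WV)
print(out.shape)                                # (6, 8)
```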

140.10 Multi-Head Attention

Multi-head attention uses several attention operations in parallel.

Each head has its own projection matrices:

W_i^Q, \qquad W_i^K, \qquad W_i^V.

Each head computes attention in a different learned subspace.

The outputs are concatenated and multiplied by another matrix:

\operatorname{MultiHead}(X) = \operatorname{Concat}(H_1,\ldots,H_h)\,W^O,

where H_i is the output of head i.

This allows the model to represent multiple kinds of relationships at once.

Some heads may track local syntax. Others may track long-range dependencies, positional structure, or semantic relations.
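A sketch of multi-head attention with one projection triple per head; the head count, dimensions, and random weights are illustrative, and real implementations batch the heads into single tensors rather than Python lists.

```python
import numpy as np

def softmax(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(X, WQ, WK, WV, WO):
    # WQ, WK, WV hold one projection matrix per head; WO mixes the concatenation.
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO

rng = np.random.default_rng(0)
n, d_model, h = 6, 32, 4
d_head = d_model // h
WQ = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
WK = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
WV = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
WO = rng.normal(size=(h * d_head, d_model))

X = rng.normal(size=(n, d_model))
print(multi_head(X, WQ, WK, WV, WO).shape)      # (6, 32)
```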

140.11 Low-Rank Structure in Attention

The attention score matrix QK^T has size

n \times n,

where n is the sequence length.

This can be expensive for long sequences.

Many efficient transformer methods use linear algebraic structure to reduce cost. One common idea is low-rank approximation. Linformer, for example, proposed approximating self-attention by a low-rank matrix to reduce the sequence-length cost from quadratic to linear under its approximation assumptions.

The general principle is:

\text{large dense matrix} \approx \text{smaller structured factors}.

This connects transformer efficiency with matrix approximation.
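A rough sketch of the low-rank idea in the spirit of Linformer, with many details simplified: project the keys and values along the sequence dimension so the score matrix is n × k rather than n × n. The projection here is a fixed random matrix used only to show the shapes and cost, not a learned one.

```python
import numpy as np

def softmax(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, k = 1024, 64, 32                 # sequence length, head dim, projected length
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

# Full attention: the score matrix is n x n.
full = softmax(Q @ K.T / np.sqrt(d_k)) @ V

# Low-rank variant: project K and V along the sequence dimension (n -> k),
# so the score matrix is only n x k.
P = rng.normal(size=(k, n)) / np.sqrt(n)
approx = softmax(Q @ (P @ K).T / np.sqrt(d_k)) @ (P @ V)

print(full.shape, approx.shape)          # both (1024, 64)
```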

140.12 State Space Models

Some modern sequence models use state space equations instead of full attention.

A linear state space model has the form

h_{t+1} = A h_t + B x_t, \qquad y_t = C h_t + D x_t.

Here:

| Symbol | Meaning |
| --- | --- |
| h_t | Hidden state |
| x_t | Input |
| y_t | Output |
| A, B, C, D | Learned matrices |

State space models use recurrence, convolution, and structured matrices to process long sequences.
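A minimal sketch of the recurrence with illustrative sizes; practical state space models learn these matrices and evaluate the recurrence with structured or convolutional algorithms rather than a Python loop.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in, d_out, T = 8, 3, 2, 50

A = 0.9 * np.eye(d_state) + 0.01 * rng.normal(size=(d_state, d_state))  # roughly stable
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
D = rng.normal(size=(d_out, d_in))

x = rng.normal(size=(T, d_in))           # input sequence
h = np.zeros(d_state)
ys = []
for t in range(T):
    y = C @ h + D @ x[t]                 # output from current state and input
    h = A @ h + B @ x[t]                 # state update
    ys.append(y)
print(np.stack(ys).shape)                # (50, 2)
```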

Mamba is one modern architecture based on selective state spaces and designed for efficient sequence modeling.

140.13 Singular Value Decomposition in AI

The singular value decomposition writes a matrix as

A=UΣVT. A=U\Sigma V^T.

SVD appears in AI through:

| Use | Role |
| --- | --- |
| Dimensionality reduction | Keep leading singular vectors |
| Compression | Approximate weight matrices |
| Denoising | Remove small singular components |
| Latent semantic analysis | Factor term-document matrices |
| Model analysis | Study learned representations |

Low-rank approximation is especially important when models are large.

If a weight matrix has effective low rank, it may be approximated by

W \approx UV^T,

where U and V are smaller matrices.

This reduces storage and computation.
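A short NumPy sketch of truncated SVD as low-rank approximation; the matrix size and target rank are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))

U, s, Vt = np.linalg.svd(W, full_matrices=False)

r = 16
W_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]            # best rank-r approximation
print(np.linalg.norm(W - W_r) / np.linalg.norm(W))  # relative error

# Stored as two factors, the rank-r form needs (256 + 128) * 16 numbers
# instead of 256 * 128.
```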

140.14 Principal Component Analysis

Principal component analysis, or PCA, finds directions of maximal variance.

Given a centered data matrix

X,

the covariance matrix is

\frac{1}{n} X^T X.

The principal components are eigenvectors of this covariance matrix.
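A minimal PCA sketch via the eigendecomposition of the covariance matrix; the synthetic data and the choice of two components are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated features

X = X - X.mean(axis=0)                 # center the data
cov = X.T @ X / n                      # covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov) # symmetric eigendecomposition
order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
components = eigvecs[:, order[:2]]     # top two principal components

Z = X @ components                     # project onto the leading directions
print(Z.shape)                         # (500, 2)
```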

PCA is used for:

| Task | Purpose |
| --- | --- |
| Visualization | Reduce to 2 or 3 dimensions |
| Preprocessing | Remove redundant dimensions |
| Denoising | Keep dominant components |
| Representation analysis | Inspect embedding geometry |

PCA is one of the classical bridges between linear algebra and data analysis.

140.15 Matrix Factorization for Recommendation

Recommender systems often use matrix factorization.

Let

R \in \mathbb{R}^{m \times n}

be a user-item rating matrix.

The goal is to approximate

R \approx UV^T,

where:

| Matrix | Meaning |
| --- | --- |
| U | User factors |
| V | Item factors |

The predicted rating for user i and item j is

u_i^T v_j.

This model says that users and items live in the same latent vector space.

Recommendation becomes inner-product prediction.
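A toy sketch of factorization fit by stochastic gradient descent on a handful of observed ratings; the sizes, learning rate, and regularization are illustrative, and production systems use alternating least squares or more elaborate models.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 20, 30, 5                         # users, items, latent dimension
U = 0.1 * rng.normal(size=(m, k))
V = 0.1 * rng.normal(size=(n, k))

# Observed (user, item, rating) triples.
obs = [(0, 2, 5.0), (0, 7, 3.0), (3, 2, 4.0), (5, 9, 1.0), (3, 7, 2.0)]

lr, lam = 0.05, 0.01                        # learning rate and L2 regularization
for _ in range(200):
    for i, j, r in obs:
        err = U[i] @ V[j] - r               # prediction error u_i^T v_j - r_ij
        U[i] -= lr * (err * V[j] + lam * U[i])
        V[j] -= lr * (err * U[i] + lam * V[j])

print(round(U[0] @ V[2], 2))                # close to the observed rating 5.0
```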

140.16 Graph Neural Networks

Graphs are common in AI: social networks, molecules, knowledge graphs, citation networks, and recommendation systems.

A graph neural network updates node features using neighboring nodes.

A simple linear message-passing layer has the form

H_{k+1} = \sigma(\widetilde{A} H_k W_k),

where:

| Symbol | Meaning |
| --- | --- |
| Ã | Normalized adjacency matrix |
| H_k | Node feature matrix |
| W_k | Weight matrix |
| σ | Activation |

The adjacency matrix determines how information flows across the graph.

This is spectral graph theory and neural networks combined.
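A minimal sketch of one layer on a four-node graph, using the common recipe of adding self-loops and symmetrically normalizing the adjacency matrix (as in graph convolutional networks); all sizes and weights are illustrative.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

A_hat = A + np.eye(4)                         # add self-loops
deg = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt      # normalized adjacency matrix

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))                   # node features
W = rng.normal(size=(8, 16))                  # layer weights

H_next = np.maximum(A_norm @ H @ W, 0.0)      # aggregate neighbors, transform, activate
print(H_next.shape)                           # (4, 16)
```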

140.17 Generative Models

Generative AI models produce new samples.

Linear algebra appears in several forms:

| Model type | Linear algebra role |
| --- | --- |
| Language models | Token embeddings and attention |
| Diffusion models | Noise vectors and denoising networks |
| Image generators | Latent spaces and convolutional layers |
| Autoencoders | Encoder and decoder maps |
| GANs | Generator and discriminator matrices |

Latent vector spaces are central. A model often maps a vector

z \in \mathbb{R}^d

to a generated object.

Interpolating between latent vectors can produce smooth changes in generated outputs.
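A small sketch of linear interpolation between two latent vectors; the generator that would map each interpolated vector to an output is not shown, and the dimension is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
z0, z1 = rng.normal(size=d), rng.normal(size=d)

alphas = np.linspace(0.0, 1.0, 5)
path = np.stack([(1 - a) * z0 + a * z1 for a in alphas])   # 5 points along the segment
print(path.shape)                                          # (5, 16)
```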

140.18 Retrieval-Augmented Generation

Retrieval-augmented generation combines search with generation.

Documents are embedded as vectors:

d_1, \ldots, d_N.

A query is embedded as

q.

Retrieval selects documents with large similarity scores:

q^T d_i.

The selected documents are then passed to a language model as context.

Thus RAG systems depend heavily on:

| Component | Linear algebra operation |
| --- | --- |
| Embedding model | Vector representation |
| Vector database | Nearest-neighbor search |
| Similarity scoring | Dot products or cosine similarity |
| Reranking | Matrix and vector scoring |
| Generation | Transformer inference |

The retrieval step is essentially large-scale vector search.
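A toy sketch of that retrieval step: score every document vector against the query and keep the top k. The corpus here is random, and real systems replace the full scan with approximate nearest-neighbor indexes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 10_000, 128, 3

D = rng.normal(size=(N, d))
D = D / np.linalg.norm(D, axis=1, keepdims=True)   # normalize so dot = cosine
q = rng.normal(size=d)
q = q / np.linalg.norm(q)

scores = D @ q                                     # one dot product per document
top_k = np.argsort(scores)[-k:][::-1]              # indices of best-scoring documents
print(top_k, scores[top_k])
```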

140.19 Model Compression

Large AI models are expensive to store and run.

Linear algebra supports compression through:

| Method | Linear algebra idea |
| --- | --- |
| Low-rank factorization | Replace W by UV^T |
| Pruning | Remove small or unimportant weights |
| Quantization | Store lower-precision values |
| Sparse matrices | Exploit zeros |
| Distillation | Approximate one function by another |

Low-rank methods explicitly use matrix factorization. Quantization changes the scalar representation. Sparse methods change the matrix storage pattern.

These techniques reduce memory bandwidth and computational cost.
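A rough sketch of symmetric 8-bit quantization of a weight matrix, using a single global scale; real schemes use per-channel scales, calibration data, and more careful rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)

scale = np.abs(W).max() / 127.0
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)   # 1 byte per entry
W_deq = W_int8.astype(np.float32) * scale                          # approximate W

rel_err = np.linalg.norm(W - W_deq) / np.linalg.norm(W)
print(W_int8.nbytes, W.nbytes, round(float(rel_err), 4))           # 4x smaller storage
```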

140.20 Hardware and Tensor Algebra

AI hardware is optimized for tensor operations.

A tensor program is usually a sequence of operations such as:

C = AB, \qquad Y = XW + b, \qquad QK^T,

and reductions such as sums, norms, and softmax.

Performance depends on:

| Factor | Linear algebra issue |
| --- | --- |
| Matrix shape | Arithmetic intensity |
| Memory layout | Data movement |
| Precision | Numerical error |
| Blocking | Cache and accelerator use |
| Sparsity | Irregular computation |

Even when the model is described statistically, the execution is numerical linear algebra.

140.21 AI for Linear Algebra

AI is also being used to discover or accelerate linear algebra algorithms.

One example is AlphaTensor, which used reinforcement learning to search for matrix multiplication algorithms with fewer scalar multiplications. Matrix multiplication is a core operation in linear algebra and machine learning, so algorithmic improvements can matter at large scale.

This reverses the usual relationship.

Linear algebra supports AI, and AI can search for better linear algebra procedures.

140.22 Numerical Stability

Modern AI uses finite precision arithmetic.

Common formats include 32-bit floating point, 16-bit floating point, bfloat16, and lower-precision quantized formats.

Numerical issues include:

| Issue | Effect |
| --- | --- |
| Overflow | Values exceed representable range |
| Underflow | Values become too small |
| Roundoff | Accumulated arithmetic error |
| Ill-conditioning | Small perturbations become large |
| Instability in softmax | Large exponentials |

Stable implementations often subtract the maximum before applying softmax. This keeps exponentials in a safe numerical range and is standard in attention implementations.
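A minimal sketch comparing a naive softmax with the max-subtracted form:

```python
import numpy as np

def naive_softmax(s):
    e = np.exp(s)
    return e / e.sum()

def stable_softmax(s):
    e = np.exp(s - s.max())       # subtracting the max leaves the result unchanged
    return e / e.sum()            # but keeps every exponential <= 1

s = np.array([1000.0, 1001.0, 1002.0])
print(naive_softmax(s))           # overflow: nan (exp(1000) is not representable)
print(stable_softmax(s))          # well-defined probabilities
```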

140.23 Summary

Modern AI is applied linear algebra at large scale.

The central ideas are:

| Concept | AI role |
| --- | --- |
| Vectors | Represent data and parameters |
| Matrices | Represent learned transformations |
| Tensors | Store batches, activations, and weights |
| Inner products | Similarity and attention scores |
| Matrix multiplication | Core computation |
| Gradients | Training signal |
| Jacobians | Chain rule and backpropagation |
| SVD | Compression and dimensionality reduction |
| Eigenvectors | PCA, graph learning, spectral methods |
| Low-rank approximation | Efficient models |
| Sparse matrices | Efficient storage and computation |
| State space matrices | Long-sequence modeling |
| Vector search | Retrieval and recommendation |

AI systems may appear complex at the application level, but their computational core is a small set of linear-algebraic operations repeated at very large scale.