Chapter 140. Modern Applications in AI

140.1 Introduction

Modern artificial intelligence is built on linear algebra.

Data is represented as vectors. Batches of data are matrices or tensors. Neural network layers are affine transformations followed by nonlinear functions. Training uses gradients, Jacobians, Hessians, and large-scale matrix operations. In transformer models, attention is expressed through matrix products involving query, key, and value matrices.

The central pattern is:

\text{data} \longrightarrow \text{vectors} \longrightarrow \text{linear maps} \longrightarrow \text{optimization}.

This chapter describes how the ideas of linear algebra appear in current AI systems.

140.2 Data as Vectors

AI systems begin by converting objects into vectors.

A word, image, audio segment, document, user profile, protein sequence, or graph node may be represented by a vector

x \in \mathbb{R}^d.

The dimension d depends on the model.

Examples:

| Object | Vector representation |
| --- | --- |
| Word | Embedding vector |
| Image patch | Pixel or feature vector |
| Document | Dense semantic vector |
| User | Preference vector |
| Graph node | Node embedding |
| Audio frame | Feature vector |

Once data is represented as vectors, linear algebra can be used to compare, transform, combine, and optimize it.

140.3 Embeddings

An embedding maps a discrete object into a vector space.

For example, a vocabulary item w may be mapped to

e_w \in \mathbb{R}^d.

Words with related meanings often have embeddings that are close under cosine similarity or inner product.

The embedding matrix has the form

E \in \mathbb{R}^{V \times d},

where V is the vocabulary size and d is the embedding dimension.

A token index selects one row of E. Thus embedding lookup is a structured linear-algebra operation.
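As a small illustration, here is embedding lookup written with NumPy; the vocabulary size, dimension, and token indices below are arbitrary choices, not values from any particular model.

```python
import numpy as np

V, d = 10_000, 64                     # arbitrary vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))           # embedding matrix, one row per token

token_ids = np.array([3, 117, 42])    # a short sequence of token indices
X = E[token_ids]                      # lookup = selecting rows of E
print(X.shape)                        # (3, 64)

# Equivalently, lookup is multiplication of E by one-hot row vectors.
one_hot = np.zeros((len(token_ids), V))
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
assert np.allclose(one_hot @ E, X)
```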

Embeddings are used in language models, recommender systems, image-text models, graph neural networks, and retrieval systems.

140.4 Similarity and Inner Products

Many AI systems compare vectors using inner products.

Given two vectors

x, y \in \mathbb{R}^d,

their dot product is

x^T y.

Cosine similarity normalizes by vector lengths:

\cos(x,y) = \frac{x^T y}{\|x\|\,\|y\|}.

Large cosine similarity means that the two vectors point in similar directions.
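A minimal NumPy sketch of these two comparisons, assuming nothing beyond the definitions above:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])         # same direction as x
z = np.array([-3.0, 0.0, 1.0])        # orthogonal to x

print(x @ y)                          # dot product: 28.0
print(cosine_similarity(x, y))        # 1.0, identical directions
print(cosine_similarity(x, z))        # 0.0, orthogonal directions
```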

This is used in:

| Task | Use |
| --- | --- |
| Search | Find nearby document vectors |
| Recommendation | Compare user and item vectors |
| Classification | Compare feature and class vectors |
| Clustering | Group similar embeddings |
| Retrieval-augmented generation | Retrieve relevant context |

Vector similarity is one of the most common uses of linear algebra in AI.

140.5 Neural Network Layers

A basic neural network layer has the form

y = \sigma(Wx + b),

where:

| Symbol | Meaning |
| --- | --- |
| x | Input vector |
| W | Weight matrix |
| b | Bias vector |
| σ | Nonlinear activation |
| y | Output vector |

The affine part

Wx + b

is linear algebra. The activation introduces nonlinearity.

A deep neural network composes many such layers:

x \mapsto h_1 = \sigma(W_1 x + b_1) \mapsto h_2 = \sigma(W_2 h_1 + b_2) \mapsto \cdots.

Thus deep learning alternates linear transformations with nonlinear coordinatewise operations.
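A minimal two-layer sketch in NumPy; the layer sizes, random weights, and ReLU activation are illustrative choices.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4          # illustrative layer sizes

W1 = rng.normal(size=(d_hidden, d_in));  b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_out, d_hidden)); b2 = np.zeros(d_out)

x = rng.normal(size=d_in)

h1 = relu(W1 @ x + b1)    # first layer: affine map, then nonlinearity
y  = relu(W2 @ h1 + b2)   # second layer applied to the hidden vector
print(y.shape)            # (4,)
```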

140.6 Batches and Matrix Multiplication

Training uses batches of examples.

If a batch contains B input vectors of dimension d, they are stored as a matrix

X \in \mathbb{R}^{B \times d}.

A linear layer applied to the whole batch is

Y = XW + B_0,

where W is a weight matrix and B_0 is the bias vector broadcast across the rows of the batch.

This turns many vector operations into one matrix multiplication.
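A small NumPy sketch of the batched form, with illustrative sizes; the loop at the end only checks that the single matrix product matches the per-example computation.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out = 32, 64, 128            # illustrative batch and layer sizes

X = rng.normal(size=(B, d_in))          # one example per row
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)

Y = X @ W + b                           # the bias broadcasts across the B rows
print(Y.shape)                          # (32, 128)

# Same result computed one vector at a time (much slower in practice).
Y_loop = np.stack([X[i] @ W + b for i in range(B)])
assert np.allclose(Y, Y_loop)
```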

Matrix multiplication is the computational core of neural network training and inference. Modern hardware accelerators are designed around fast dense matrix and tensor operations.

140.7 Loss Functions and Gradients

Training adjusts weights to minimize a loss function.

Let

\theta

denote all model parameters. Training approximately solves

\min_\theta L(\theta).

The gradient

\nabla_\theta L

points in the direction of greatest local increase. Optimization algorithms move in the opposite direction.

A typical update is:

\theta_{k+1} = \theta_k - \alpha_k \nabla_\theta L(\theta_k).

This is gradient descent in parameter space; the step size \alpha_k is the learning rate.
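A minimal sketch of the update rule on a least-squares loss, where the gradient has the closed form A^T(A\theta - y); the matrix, noise level, and learning rate are illustrative.

```python
import numpy as np

# Gradient descent on the least-squares loss L(theta) = 0.5 * ||A theta - y||^2,
# whose gradient is A^T (A theta - y). All sizes are illustrative.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
theta_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = A @ theta_true + 0.01 * rng.normal(size=100)

theta = np.zeros(5)
alpha = 0.005                           # learning rate (step size)
for _ in range(1000):
    grad = A.T @ (A @ theta - y)        # gradient of the loss at theta
    theta = theta - alpha * grad        # move against the gradient
print(theta.round(2))                   # close to theta_true
```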

In modern AI, \theta may contain millions or billions of parameters, but the principle remains ordinary vector calculus and linear algebra.

140.8 Backpropagation

Backpropagation computes gradients through a composed function.

If a model is a composition

f = f_n \circ f_{n-1} \circ \cdots \circ f_1,

then the chain rule says that derivatives multiply in reverse order.

For Jacobians,

J_f = J_{f_n} J_{f_{n-1}} \cdots J_{f_1}.

Backpropagation applies this rule efficiently without explicitly forming every large Jacobian.

Instead, it propagates vector-Jacobian products backward through the computation graph.
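A hand-written backward pass for a tiny two-layer function, to show the vector-Jacobian products explicitly; the sizes and the squared-norm loss are illustrative choices, not a general autodiff implementation.

```python
import numpy as np

# Hand-written backward pass for loss = 0.5 * ||W2 @ tanh(W1 @ x)||^2,
# showing that backpropagation is a chain of vector-Jacobian products.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Forward pass, keeping the intermediate values.
z = W1 @ x
h = np.tanh(z)
y = W2 @ h
loss = 0.5 * np.sum(y ** 2)

# Backward pass: propagate the loss gradient through each layer in reverse.
g_y = y                           # dL/dy
g_h = W2.T @ g_y                  # vector-Jacobian product through W2
g_z = (1.0 - h ** 2) * g_h        # through tanh; its Jacobian is diagonal
g_x = W1.T @ g_z                  # through W1

grad_W2 = np.outer(g_y, h)        # parameter gradients fall out along the way
grad_W1 = np.outer(g_z, x)
print(g_x.shape, grad_W1.shape, grad_W2.shape)
```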

This is why matrix calculus is essential for deep learning.

140.9 Attention

Attention is one of the most important linear-algebraic mechanisms in modern AI.

Given input vectors collected in a matrix

X,

a transformer forms query, key, and value matrices:

Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V.

The attention score matrix is

QK^T.

Scaled dot-product attention is

\operatorname{Attention}(Q,K,V) = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V.

The query, key, and value matrices are produced by learned linear projections, the attention scores are computed by matrix multiplication, and d_k is the key dimension used to scale the scores.
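A compact NumPy sketch of scaled dot-product attention for a single head; the sequence length and dimensions are illustrative, and the softmax uses the numerically stabilized form discussed in Section 140.22.

```python
import numpy as np

def softmax(S, axis=-1):
    S = S - S.max(axis=axis, keepdims=True)     # stabilized softmax
    e = np.exp(S)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # n x n score matrix
    return softmax(scores, axis=-1) @ V         # weighted combination of values

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 16, 8                      # illustrative sizes
X = rng.normal(size=(n, d_model))
WQ, WK, WV = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out = attention(X @ WQ, X @ WK, X @ WV)
print(out.shape)                                # (6, 8)
```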

140.10 Multi-Head Attention

Multi-head attention uses several attention operations in parallel.

Each head has its own projection matrices:

W_i^Q, \qquad W_i^K, \qquad W_i^V.

Each head computes attention in a different learned subspace.

The outputs are concatenated and multiplied by another matrix:

\operatorname{MultiHead}(X) = \operatorname{Concat}(H_1,\ldots,H_h)\,W^O,

where H_i is the output of head i.

This allows the model to represent multiple kinds of relationships at once.

Some heads may track local syntax. Others may track long-range dependencies, positional structure, or semantic relations.
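A sketch of multi-head attention with one projection triple per head; the head count, dimensions, and random weights are illustrative, and real implementations batch the heads into single tensors rather than Python lists.

```python
import numpy as np

def softmax(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(X, WQ, WK, WV, WO):
    # WQ, WK, WV hold one projection matrix per head; WO mixes the concatenation.
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO

rng = np.random.default_rng(0)
n, d_model, h = 6, 32, 4
d_head = d_model // h
WQ = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
WK = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
WV = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
WO = rng.normal(size=(h * d_head, d_model))

X = rng.normal(size=(n, d_model))
print(multi_head(X, WQ, WK, WV, WO).shape)      # (6, 32)
```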

140.11 Low-Rank Structure in Attention

The attention score matrix QK^T has size

n \times n,

where n is the sequence length.

This can be expensive for long sequences.

Many efficient transformer methods use linear algebraic structure to reduce cost. One common idea is low-rank approximation. Linformer, for example, proposed approximating self-attention by a low-rank matrix to reduce the sequence-length cost from quadratic to linear under its approximation assumptions.

The general principle is:

\text{large dense matrix} \approx \text{smaller structured factors}.

This connects transformer efficiency with matrix approximation.
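A rough sketch of the low-rank idea in the spirit of Linformer, with many details simplified: project the keys and values along the sequence dimension so the score matrix is n × k rather than n × n. The projection here is a fixed random matrix used only to show the shapes and cost, not a learned one.

```python
import numpy as np

def softmax(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, k = 1024, 64, 32                 # sequence length, head dim, projected length
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

# Full attention: the score matrix is n x n.
full = softmax(Q @ K.T / np.sqrt(d_k)) @ V

# Low-rank variant: project K and V along the sequence dimension (n -> k),
# so the score matrix is only n x k.
P = rng.normal(size=(k, n)) / np.sqrt(n)
approx = softmax(Q @ (P @ K).T / np.sqrt(d_k)) @ (P @ V)

print(full.shape, approx.shape)          # both (1024, 64)
```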

140.12 State Space Models

Some modern sequence models use state space equations instead of full attention.

A linear state space model has the form

h_{t+1} = A h_t + B x_t, \qquad y_t = C h_t + D x_t.

Here:

| Symbol | Meaning |
| --- | --- |
| h_t | Hidden state |
| x_t | Input |
| y_t | Output |
| A, B, C, D | Learned matrices |

State space models use recurrence, convolution, and structured matrices to process long sequences.
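A minimal sketch of the recurrence with illustrative sizes; practical state space models learn these matrices and evaluate the recurrence with structured or convolutional algorithms rather than a Python loop.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in, d_out, T = 8, 3, 2, 50

A = 0.9 * np.eye(d_state) + 0.01 * rng.normal(size=(d_state, d_state))  # roughly stable
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
D = rng.normal(size=(d_out, d_in))

x = rng.normal(size=(T, d_in))           # input sequence
h = np.zeros(d_state)
ys = []
for t in range(T):
    y = C @ h + D @ x[t]                 # output from current state and input
    h = A @ h + B @ x[t]                 # state update
    ys.append(y)
print(np.stack(ys).shape)                # (50, 2)
```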

Mamba is one modern architecture based on selective state spaces and designed for efficient sequence modeling.

140.13 Singular Value Decomposition in AI

The singular value decomposition writes a matrix as

A=UΣVT. A=U\Sigma V^T.

SVD appears in AI through:

| Use | Role |
| --- | --- |
| Dimensionality reduction | Keep leading singular vectors |
| Compression | Approximate weight matrices |
| Denoising | Remove small singular components |
| Latent semantic analysis | Factor term-document matrices |
| Model analysis | Study learned representations |

Low-rank approximation is especially important when models are large.

If a weight matrix has effective low rank, it may be approximated by

W \approx UV^T,

where U and V are smaller matrices.

This reduces storage and computation.
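A short NumPy sketch of truncated SVD as low-rank approximation; the matrix size and target rank are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))

U, s, Vt = np.linalg.svd(W, full_matrices=False)

r = 16
W_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]            # best rank-r approximation
print(np.linalg.norm(W - W_r) / np.linalg.norm(W))  # relative error

# Stored as two factors, the rank-r form needs (256 + 128) * 16 numbers
# instead of 256 * 128.
```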

140.14 Principal Component Analysis

Principal component analysis, or PCA, finds directions of maximal variance.

Given a centered data matrix

X,

the covariance matrix is

\frac{1}{n} X^T X.

The principal components are eigenvectors of this covariance matrix.
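A minimal PCA sketch via the eigendecomposition of the covariance matrix; the synthetic data and the choice of two components are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated features

X = X - X.mean(axis=0)                 # center the data
cov = X.T @ X / n                      # covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov) # symmetric eigendecomposition
order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
components = eigvecs[:, order[:2]]     # top two principal components

Z = X @ components                     # project onto the leading directions
print(Z.shape)                         # (500, 2)
```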

PCA is used for:

| Task | Purpose |
| --- | --- |
| Visualization | Reduce to 2 or 3 dimensions |
| Preprocessing | Remove redundant dimensions |
| Denoising | Keep dominant components |
| Representation analysis | Inspect embedding geometry |

PCA is one of the classical bridges between linear algebra and data analysis.

140.15 Matrix Factorization for Recommendation

Recommender systems often use matrix factorization.

Let

R \in \mathbb{R}^{m \times n}

be a user-item rating matrix.

The goal is to approximate

R \approx UV^T,

where:

| Matrix | Meaning |
| --- | --- |
| U | User factors |
| V | Item factors |

The predicted rating for user i and item j is

u_i^T v_j.

This model says that users and items live in the same latent vector space.

Recommendation becomes inner-product prediction.
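A toy sketch of factorization fit by stochastic gradient descent on a handful of observed ratings; the sizes, learning rate, and regularization are illustrative, and production systems use alternating least squares or more elaborate models.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 20, 30, 5                         # users, items, latent dimension
U = 0.1 * rng.normal(size=(m, k))
V = 0.1 * rng.normal(size=(n, k))

# Observed (user, item, rating) triples.
obs = [(0, 2, 5.0), (0, 7, 3.0), (3, 2, 4.0), (5, 9, 1.0), (3, 7, 2.0)]

lr, lam = 0.05, 0.01                        # learning rate and L2 regularization
for _ in range(200):
    for i, j, r in obs:
        err = U[i] @ V[j] - r               # prediction error u_i^T v_j - r_ij
        U[i] -= lr * (err * V[j] + lam * U[i])
        V[j] -= lr * (err * U[i] + lam * V[j])

print(round(U[0] @ V[2], 2))                # close to the observed rating 5.0
```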

140.16 Graph Neural Networks

Graphs are common in AI: social networks, molecules, knowledge graphs, citation networks, and recommendation systems.

A graph neural network updates node features using neighboring nodes.

A simple linear message-passing layer has the form

H_{k+1} = \sigma(\widetilde{A} H_k W_k),

where:

| Symbol | Meaning |
| --- | --- |
| Ã | Normalized adjacency matrix |
| H_k | Node feature matrix |
| W_k | Weight matrix |
| σ | Activation |

The adjacency matrix determines how information flows across the graph.

This is spectral graph theory and neural networks combined.
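A minimal sketch of one layer on a four-node graph, using the common recipe of adding self-loops and symmetrically normalizing the adjacency matrix (as in graph convolutional networks); all sizes and weights are illustrative.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

A_hat = A + np.eye(4)                         # add self-loops
deg = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt      # normalized adjacency matrix

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))                   # node features
W = rng.normal(size=(8, 16))                  # layer weights

H_next = np.maximum(A_norm @ H @ W, 0.0)      # aggregate neighbors, transform, activate
print(H_next.shape)                           # (4, 16)
```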

140.17 Generative Models

Generative AI models produce new samples.

Linear algebra appears in several forms:

| Model type | Linear algebra role |
| --- | --- |
| Language models | Token embeddings and attention |
| Diffusion models | Noise vectors and denoising networks |
| Image generators | Latent spaces and convolutional layers |
| Autoencoders | Encoder and decoder maps |
| GANs | Generator and discriminator matrices |

Latent vector spaces are central. A model often maps a vector

z \in \mathbb{R}^d

to a generated object.

Interpolating between latent vectors can produce smooth changes in generated outputs.
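A small sketch of linear interpolation between two latent vectors; the generator that would map each interpolated vector to an output is not shown, and the dimension is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
z0, z1 = rng.normal(size=d), rng.normal(size=d)

alphas = np.linspace(0.0, 1.0, 5)
path = np.stack([(1 - a) * z0 + a * z1 for a in alphas])   # 5 points along the segment
print(path.shape)                                          # (5, 16)
```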

140.18 Retrieval-Augmented Generation

Retrieval-augmented generation combines search with generation.

Documents are embedded as vectors:

d_1, \ldots, d_N.

A query is embedded as

q.

Retrieval selects documents with large similarity scores:

q^T d_i.

The selected documents are then passed to a language model as context.

Thus RAG systems depend heavily on:

| Component | Linear algebra operation |
| --- | --- |
| Embedding model | Vector representation |
| Vector database | Nearest-neighbor search |
| Similarity scoring | Dot products or cosine similarity |
| Reranking | Matrix and vector scoring |
| Generation | Transformer inference |

The retrieval step is essentially large-scale vector search.
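A toy sketch of that retrieval step: score every document vector against the query and keep the top k. The corpus here is random, and real systems replace the full scan with approximate nearest-neighbor indexes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 10_000, 128, 3

D = rng.normal(size=(N, d))
D = D / np.linalg.norm(D, axis=1, keepdims=True)   # normalize so dot = cosine
q = rng.normal(size=d)
q = q / np.linalg.norm(q)

scores = D @ q                                     # one dot product per document
top_k = np.argsort(scores)[-k:][::-1]              # indices of best-scoring documents
print(top_k, scores[top_k])
```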

140.19 Model Compression

Large AI models are expensive to store and run.

Linear algebra supports compression through:

| Method | Linear algebra idea |
| --- | --- |
| Low-rank factorization | Replace W by UV^T |
| Pruning | Remove small or unimportant weights |
| Quantization | Store lower-precision values |
| Sparse matrices | Exploit zeros |
| Distillation | Approximate one function by another |

Low-rank methods explicitly use matrix factorization. Quantization changes the scalar representation. Sparse methods change the matrix storage pattern.

These techniques reduce memory bandwidth and computational cost.
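A rough sketch of symmetric 8-bit quantization of a weight matrix, using a single global scale; real schemes use per-channel scales, calibration data, and more careful rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)

scale = np.abs(W).max() / 127.0
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)   # 1 byte per entry
W_deq = W_int8.astype(np.float32) * scale                          # approximate W

rel_err = np.linalg.norm(W - W_deq) / np.linalg.norm(W)
print(W_int8.nbytes, W.nbytes, round(float(rel_err), 4))           # 4x smaller storage
```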

140.20 Hardware and Tensor Algebra

AI hardware is optimized for tensor operations.

A tensor program is usually a sequence of operations such as:

C = AB, \qquad Y = XW + b, \qquad QK^T,

and reductions such as sums, norms, and softmax.

Performance depends on:

| Factor | Linear algebra issue |
| --- | --- |
| Matrix shape | Arithmetic intensity |
| Memory layout | Data movement |
| Precision | Numerical error |
| Blocking | Cache and accelerator use |
| Sparsity | Irregular computation |

Even when the model is described statistically, the execution is numerical linear algebra.

140.21 AI for Linear Algebra

AI is also being used to discover or accelerate linear algebra algorithms.

One example is AlphaTensor, which used reinforcement learning to search for matrix multiplication algorithms with fewer scalar multiplications. Matrix multiplication is a core operation in linear algebra and machine learning, so algorithmic improvements can matter at large scale.

This reverses the usual relationship.

Linear algebra supports AI, and AI can search for better linear algebra procedures.

140.22 Numerical Stability

Modern AI uses finite precision arithmetic.

Common formats include 32-bit floating point, 16-bit floating point, bfloat16, and lower-precision quantized formats.

Numerical issues include:

| Issue | Effect |
| --- | --- |
| Overflow | Values exceed representable range |
| Underflow | Values become too small |
| Roundoff | Accumulated arithmetic error |
| Ill-conditioning | Small perturbations become large |
| Instability in softmax | Large exponentials |

Stable implementations often subtract the maximum before applying softmax. This keeps exponentials in a safe numerical range and is standard in attention implementations.
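A minimal sketch comparing a naive softmax with the max-subtracted form:

```python
import numpy as np

def naive_softmax(s):
    e = np.exp(s)
    return e / e.sum()

def stable_softmax(s):
    e = np.exp(s - s.max())       # subtracting the max leaves the result unchanged
    return e / e.sum()            # but keeps every exponential <= 1

s = np.array([1000.0, 1001.0, 1002.0])
print(naive_softmax(s))           # overflow: nan (exp(1000) is not representable)
print(stable_softmax(s))          # well-defined probabilities
```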

140.23 Summary

Modern AI is applied linear algebra at large scale.

The central ideas are:

| Concept | AI role |
| --- | --- |
| Vectors | Represent data and parameters |
| Matrices | Represent learned transformations |
| Tensors | Store batches, activations, and weights |
| Inner products | Similarity and attention scores |
| Matrix multiplication | Core computation |
| Gradients | Training signal |
| Jacobians | Chain rule and backpropagation |
| SVD | Compression and dimensionality reduction |
| Eigenvectors | PCA, graph learning, spectral methods |
| Low-rank approximation | Efficient models |
| Sparse matrices | Efficient storage and computation |
| State space matrices | Long-sequence modeling |
| Vector search | Retrieval and recommendation |

AI systems may appear complex at the application level, but their computational core is a small set of linear-algebraic operations repeated at very large scale.