# Chapter 140. Modern Applications in AI

## 140.1 Introduction

Modern artificial intelligence is built on linear algebra.

Data is represented as vectors. Batches of data are matrices or tensors. Neural network layers are affine transformations followed by nonlinear functions. Training uses gradients, Jacobians, Hessians, and large-scale matrix operations. In transformer models, attention is expressed through matrix products involving query, key, and value matrices.

The central pattern is:

$$
\text{data}
\longrightarrow
\text{vectors}
\longrightarrow
\text{linear maps}
\longrightarrow
\text{optimization}.
$$

This chapter describes how the ideas of linear algebra appear in current AI systems.

## 140.2 Data as Vectors

AI systems begin by converting objects into vectors.

A word, image, audio segment, document, user profile, protein sequence, or graph node may be represented by a vector

$$
x\in\mathbb{R}^d.
$$

The dimension \(d\) depends on the model.

Examples:

| Object | Vector representation |
|---|---|
| Word | Embedding vector |
| Image patch | Pixel or feature vector |
| Document | Dense semantic vector |
| User | Preference vector |
| Graph node | Node embedding |
| Audio frame | Feature vector |

Once data is represented as vectors, linear algebra can be used to compare, transform, combine, and optimize it.

## 140.3 Embeddings

An embedding maps a discrete object into a vector space.

For example, a vocabulary item \(w\) may be mapped to

$$
e_w\in\mathbb{R}^d.
$$

Words with related meanings often have embeddings that are close under cosine similarity or inner product.

The embedding matrix has the form

$$
E\in\mathbb{R}^{V\times d},
$$

where \(V\) is the vocabulary size and \(d\) is the embedding dimension.

A token index selects one row of \(E\); equivalently, lookup is multiplication of \(E\) by a one-hot indicator vector. Thus embedding lookup is a structured linear-algebra operation.

Embeddings are used in language models, recommender systems, image-text models, graph neural networks, and retrieval systems.
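As a minimal NumPy sketch with toy dimensions, the equivalence between row selection and one-hot multiplication can be checked directly:

```python
import numpy as np

# Toy embedding matrix: V = 5 vocabulary items, d = 3 dimensions.
rng = np.random.default_rng(0)
E = rng.standard_normal((5, 3))

token = 2                      # a token index
e_lookup = E[token]            # lookup: select one row of E

# Equivalently: multiply a one-hot indicator vector by E.
one_hot = np.zeros(5)
one_hot[token] = 1.0
e_matmul = one_hot @ E

assert np.allclose(e_lookup, e_matmul)
```

In practice the one-hot product is never formed explicitly; frameworks implement lookup as indexing, which is the same linear map computed without the multiplication.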

## 140.4 Similarity and Inner Products

Many AI systems compare vectors using inner products.

Given two vectors

$$
x,y\in\mathbb{R}^d,
$$

their dot product is

$$
x^Ty.
$$

Cosine similarity normalizes by vector lengths:

$$
\cos(x,y) =
\frac{x^Ty}{\|x\|\|y\|}.
$$

Large cosine similarity means that the two vectors point in similar directions.

This is used in:

| Task | Use |
|---|---|
| Search | Find nearby document vectors |
| Recommendation | Compare user and item vectors |
| Classification | Compare feature and class vectors |
| Clustering | Group similar embeddings |
| Retrieval-augmented generation | Retrieve relevant context |

Vector similarity is one of the most common uses of linear algebra in AI.
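The two formulas above can be sketched in a few lines of NumPy (toy vectors chosen for illustration):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between x and y: (x^T y) / (||x|| ||y||)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])

print(cosine_similarity(x, x))              # same direction -> 1.0
print(round(cosine_similarity(x, y), 4))    # 45 degrees -> 0.7071
```

Unlike the raw dot product \(x^Ty\), cosine similarity is unchanged when either vector is rescaled, which is why it is common for comparing embeddings of different magnitudes.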

## 140.5 Neural Network Layers

A basic neural network layer has the form

$$
y=\sigma(Wx+b),
$$

where:

| Symbol | Meaning |
|---|---|
| \(x\) | Input vector |
| \(W\) | Weight matrix |
| \(b\) | Bias vector |
| \(\sigma\) | Nonlinear activation |
| \(y\) | Output vector |

The affine part

$$
Wx+b
$$

is linear algebra. The activation introduces nonlinearity.

A deep neural network composes many such layers:

$$
x
\mapsto
h_1=\sigma(W_1x+b_1)
\mapsto
h_2=\sigma(W_2h_1+b_2)
\mapsto
\cdots.
$$

Thus deep learning alternates linear transformations with nonlinear coordinatewise operations.
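A minimal sketch of this pattern in NumPy, using ReLU as the activation and randomly initialized toy weights:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def layer(x, W, b):
    """One layer: the affine map Wx + b followed by a nonlinearity."""
    return relu(W @ x + b)

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)

x = rng.standard_normal(3)
h1 = layer(x, W1, b1)          # first layer: R^3 -> R^4
y = layer(h1, W2, b2)          # second layer: R^4 -> R^2
assert y.shape == (2,)
```

Without the nonlinearity, the composition would collapse to the single linear map \(W_2W_1\); the activation between layers is what makes depth expressive.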

## 140.6 Batches and Matrix Multiplication

Training uses batches of examples.

If a batch contains \(B\) input vectors of dimension \(d\), they are stored as a matrix

$$
X\in\mathbb{R}^{B\times d}.
$$

A linear layer applied to the whole batch is

$$
Y=XW+B_0,
$$

where \(W\) is a weight matrix and \(B_0\) is the bias vector repeated (broadcast) across all \(B\) rows.

This turns many vector operations into one matrix multiplication.

Matrix multiplication is the computational core of neural network training and inference. Modern hardware accelerators are designed around fast dense matrix and tensor operations.
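A quick NumPy check, with toy sizes, that the batched form agrees with applying the layer to each vector separately:

```python
import numpy as np

rng = np.random.default_rng(2)
B, d, m = 8, 4, 5              # batch size, input dim, output dim
X = rng.standard_normal((B, d))
W = rng.standard_normal((d, m))
b = rng.standard_normal(m)

Y = X @ W + b                  # one matmul; b broadcasts across the B rows

# Same result as processing each row vector on its own.
rows = np.stack([x @ W + b for x in X])
assert np.allclose(Y, rows)
```

The batched version does the same arithmetic, but as one large matrix multiplication, which is far friendlier to accelerator hardware than many small vector operations.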

## 140.7 Loss Functions and Gradients

Training adjusts weights to minimize a loss function.

Let

$$
\theta
$$

denote all model parameters. Training solves approximately:

$$
\min_\theta L(\theta).
$$

The gradient

$$
\nabla_\theta L
$$

points in the direction of greatest local increase. Optimization algorithms move in the opposite direction.

A typical update is:

$$
\theta_{k+1} =
\theta_k-\alpha_k\nabla_\theta L(\theta_k).
$$

This is gradient descent in parameter space.

In modern AI, \(\theta\) may contain millions or billions of parameters, but the principle remains ordinary vector calculus and linear algebra.
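A small worked example of the update rule, using a quadratic loss \(L(\theta)=\|A\theta-c\|^2\) whose gradient \(2A^T(A\theta-c)\) can be written down by hand (the matrices here are arbitrary toy data, not a real model):

```python
import numpy as np

# Minimize L(theta) = ||A theta - c||^2 by gradient descent.
rng = np.random.default_rng(3)
A = rng.standard_normal((10, 3))
c = rng.standard_normal(10)

def grad(theta):
    return 2.0 * A.T @ (A @ theta - c)

theta = np.zeros(3)
alpha = 0.01                   # fixed step size
for _ in range(5000):
    theta = theta - alpha * grad(theta)

# The iterates approach the least-squares solution.
theta_star, *_ = np.linalg.lstsq(A, c, rcond=None)
assert np.allclose(theta, theta_star, atol=1e-4)
```

Real training loops replace the hand-derived gradient with one computed by automatic differentiation, and the plain update with variants such as momentum or Adam, but the parameter-space picture is the same.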

## 140.8 Backpropagation

Backpropagation computes gradients through a composed function.

If a model is a composition

$$
f=f_n\circ f_{n-1}\circ\cdots\circ f_1,
$$

then the chain rule says that derivatives multiply in reverse order.

For Jacobians,

$$
J_f =
J_{f_n}J_{f_{n-1}}\cdots J_{f_1}.
$$

Backpropagation applies this rule efficiently without explicitly forming every large Jacobian.

Instead, it propagates vector-Jacobian products backward through the computation graph.

This is why matrix calculus is essential for deep learning.
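For linear layers the Jacobians are just the weight matrices themselves, so the vector-Jacobian idea can be illustrated directly (toy sizes; real frameworks track these products through arbitrary computation graphs):

```python
import numpy as np

# For f = f2 ∘ f1 with linear layers, an upstream gradient v is pulled
# back as (v J_{f2}) J_{f1} — two cheap matrix-vector products —
# instead of forming the full Jacobian product J_{f2} J_{f1} first.

rng = np.random.default_rng(4)
W1 = rng.standard_normal((4, 3))   # Jacobian of f1
W2 = rng.standard_normal((2, 4))   # Jacobian of f2

x = rng.standard_normal(3)
v = rng.standard_normal(2)         # upstream gradient (row-vector view)

vjp = (v @ W2) @ W1                # backward order: cheap
full = v @ (W2 @ W1)               # explicit Jacobian product: expensive in general
assert np.allclose(vjp, full)
```

With many layers of width \(n\), the left-hand ordering costs a sequence of \(O(n^2)\) matrix-vector products, while forming the Jacobian product explicitly would cost \(O(n^3)\) per layer.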

## 140.9 Attention

Attention is one of the most important linear-algebraic mechanisms in modern AI.

Given input vectors collected in a matrix

$$
X,
$$

a transformer forms query, key, and value matrices:

$$
Q=XW^Q,\qquad K=XW^K,\qquad V=XW^V.
$$

The attention score matrix is

$$
QK^T.
$$

Scaled dot-product attention is

$$
\operatorname{Attention}(Q,K,V) =
\operatorname{softmax}
\left(
\frac{QK^T}{\sqrt{d_k}}
\right)V.
$$


The query, key, and value matrices are produced by learned linear projections, and the attention scores are computed by matrix multiplication.
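The whole mechanism fits in a short NumPy sketch (toy sequence length and dimensions, random projection matrices standing in for learned ones):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability shift
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # n x n score matrix
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(5)
n, d_model, d_k = 6, 8, 4
X = rng.standard_normal((n, d_model))
WQ, WK, WV = (rng.standard_normal((d_model, d_k)) for _ in range(3))

out = attention(X @ WQ, X @ WK, X @ WV)       # Q = XW^Q, K = XW^K, V = XW^V
assert out.shape == (n, d_k)
```

Each output row is a convex combination of the rows of \(V\): the softmax turns each row of scores into nonnegative weights summing to one.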

## 140.10 Multi-Head Attention

Multi-head attention uses several attention operations in parallel.

Each head has its own projection matrices:

$$
W_i^Q,\qquad W_i^K,\qquad W_i^V.
$$

Each head computes attention in a different learned subspace.

The outputs are concatenated and multiplied by another matrix:

$$
\operatorname{MultiHead}(X) =
\operatorname{Concat}(H_1,\ldots,H_h)W^O.
$$

This allows the model to represent multiple kinds of relationships at once.

Some heads may track local syntax. Others may track long-range dependencies, positional structure, or semantic relations.

## 140.11 Low-Rank Structure in Attention

The attention matrix has size roughly

$$
n\times n,
$$

where \(n\) is the sequence length.

This can be expensive for long sequences.

Many efficient transformer methods use linear algebraic structure to reduce cost. One common idea is low-rank approximation. Linformer, for example, proposed approximating self-attention by a low-rank matrix to reduce the sequence-length cost from quadratic to linear under its approximation assumptions.

The general principle is:

$$
\text{large dense matrix}
\approx
\text{smaller structured factors}.
$$

This connects transformer efficiency with matrix approximation.

## 140.12 State Space Models

Some modern sequence models use state space equations instead of full attention.

A linear state space model has the form

$$
h_{t+1}=Ah_t+Bx_t,
$$

$$
y_t=Ch_t+Dx_t.
$$

Here:

| Symbol | Meaning |
|---|---|
| \(h_t\) | Hidden state |
| \(x_t\) | Input |
| \(y_t\) | Output |
| \(A,B,C,D\) | Learned matrices |

State space models use recurrence, convolution, and structured matrices to process long sequences.

Mamba is one modern architecture based on selective state spaces and designed for efficient sequence modeling.
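The plain linear recurrence above (before any selectivity or structured parameterization is added) can be sketched with toy matrices:

```python
import numpy as np

def ssm_step(h, x, A, B, C, D):
    """One step of h_{t+1} = A h_t + B x_t,  y_t = C h_t + D x_t."""
    y = C @ h + D @ x
    h_next = A @ h + B @ x
    return h_next, y

rng = np.random.default_rng(6)
d_h, d_x, d_y = 4, 2, 2
A = 0.9 * np.eye(d_h)              # stable toy dynamics (spectral radius < 1)
B = rng.standard_normal((d_h, d_x))
C = rng.standard_normal((d_y, d_h))
D = rng.standard_normal((d_y, d_x))

h = np.zeros(d_h)
outputs = []
for t in range(5):
    x_t = rng.standard_normal(d_x)
    h, y_t = ssm_step(h, x_t, A, B, C, D)
    outputs.append(y_t)
assert len(outputs) == 5 and outputs[0].shape == (d_y,)
```

Because the recurrence is linear, the same input-output map can also be unrolled into a convolution with kernel entries \(CA^kB\), which is one source of the efficiency of these models.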

## 140.13 Singular Value Decomposition in AI

The singular value decomposition writes a matrix as

$$
A=U\Sigma V^T.
$$

SVD appears in AI through:

| Use | Role |
|---|---|
| Dimensionality reduction | Keep leading singular vectors |
| Compression | Approximate weight matrices |
| Denoising | Remove small singular components |
| Latent semantic analysis | Factor term-document matrices |
| Model analysis | Study learned representations |

Low-rank approximation is especially important when models are large.

If a weight matrix has effective low rank, it may be approximated by

$$
W\approx UV^T,
$$

where \(U\) and \(V\) are smaller matrices.

This reduces storage and computation.
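A NumPy sketch of truncated SVD on a random toy matrix, including the Eckart-Young fact that the spectral-norm error of the best rank-\(r\) approximation equals the \((r{+}1)\)-th singular value:

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.standard_normal((50, 40))

# Truncated SVD: keep the r leading singular triples.
r = 10
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_r = U[:, :r] * s[:r] @ Vt[:r]    # rank-r approximation U_r Σ_r V_r^T

# Spectral-norm error equals the next singular value (Eckart–Young).
err = np.linalg.norm(W - W_r, ord=2)
assert np.isclose(err, s[r])
```

Storing the factors costs \(r(m+n)\) numbers instead of \(mn\), which is the accounting behind low-rank compression of weight matrices.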

## 140.14 Principal Component Analysis

Principal component analysis, or PCA, finds directions of maximal variance.

Given centered data matrix

$$
X,
$$

the covariance matrix is

$$
\frac{1}{n}X^TX.
$$

The principal components are eigenvectors of this covariance matrix.

PCA is used for:

| Task | Purpose |
|---|---|
| Visualization | Reduce to 2 or 3 dimensions |
| Preprocessing | Remove redundant dimensions |
| Denoising | Keep dominant components |
| Representation analysis | Inspect embedding geometry |

PCA is one of the classical bridges between linear algebra and data analysis.
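The eigenvector recipe above, sketched on random toy data (using `eigh` since the covariance matrix is symmetric):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
X = X - X.mean(axis=0)             # center the data

cov = X.T @ X / X.shape[0]         # covariance matrix (1/n) X^T X
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; keep the top 2 components.
components = eigvecs[:, ::-1][:, :2]
X_reduced = X @ components         # project onto the top-2 subspace
assert X_reduced.shape == (200, 2)
```

In practice the same components are usually computed from the SVD of \(X\) directly, which is numerically better than forming \(X^TX\) explicitly.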

## 140.15 Matrix Factorization for Recommendation

Recommender systems often use matrix factorization.

Let

$$
R\in\mathbb{R}^{m\times n}
$$

be a user-item rating matrix.

The goal is to approximate

$$
R\approx UV^T,
$$

where:

| Matrix | Meaning |
|---|---|
| \(U\) | User factors |
| \(V\) | Item factors |

The predicted rating for user \(i\) and item \(j\) is

$$
u_i^Tv_j.
$$

This model says that users and items live in the same latent vector space.

Recommendation becomes inner-product prediction.
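A minimal sketch of the prediction side, with random toy factor matrices standing in for learned ones (fitting \(U\) and \(V\) from observed ratings is the separate optimization step):

```python
import numpy as np

rng = np.random.default_rng(9)
m, n, k = 6, 8, 3                  # users, items, latent dimension
U = rng.standard_normal((m, k))    # user factors: row i is u_i
V = rng.standard_normal((n, k))    # item factors: row j is v_j

R_hat = U @ V.T                    # all m*n predicted ratings at once

i, j = 2, 5
assert np.isclose(R_hat[i, j], U[i] @ V[j])   # entry (i, j) is u_i^T v_j
```

Scoring every item for one user is then a single matrix-vector product \(Vu_i\), which is why latent-factor recommenders scale well.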

## 140.16 Graph Neural Networks

Graphs are common in AI: social networks, molecules, knowledge graphs, citation networks, and recommendation systems.

A graph neural network updates node features using neighboring nodes.

A simple linear message-passing layer has the form

$$
H_{k+1} =
\sigma(\widetilde{A}H_kW_k),
$$

where:

| Symbol | Meaning |
|---|---|
| \(\widetilde{A}\) | Normalized adjacency matrix |
| \(H_k\) | Node feature matrix |
| \(W_k\) | Weight matrix |
| \(\sigma\) | Activation |

The adjacency matrix determines how information flows across the graph.

This is spectral graph theory and neural networks combined.
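A toy sketch of one such layer on a 4-node path graph, using simple row normalization of the adjacency matrix with self-loops (a common normalization; graph convolutional networks often use the symmetric variant \(D^{-1/2}\widehat{A}D^{-1/2}\) instead):

```python
import numpy as np

def gcn_layer(A_norm, H, W):
    """One message-passing layer: ReLU(A_norm H W)."""
    return np.maximum(A_norm @ H @ W, 0.0)

# Path graph 0-1-2-3, with self-loops so nodes keep their own features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)
deg = A_hat.sum(axis=1)
A_norm = A_hat / deg[:, None]      # row-normalized adjacency

rng = np.random.default_rng(10)
H0 = rng.standard_normal((4, 3))   # initial node features
W0 = rng.standard_normal((3, 2))   # learned weights (random stand-in here)
H1 = gcn_layer(A_norm, H0, W0)
assert H1.shape == (4, 2)
```

Multiplying by \(A_\text{norm}\) averages each node's features with its neighbors', so stacking \(k\) layers lets information travel \(k\) hops across the graph.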

## 140.17 Generative Models

Generative AI models produce new samples.

Linear algebra appears in several forms:

| Model type | Linear algebra role |
|---|---|
| Language models | Token embeddings and attention |
| Diffusion models | Noise vectors and denoising networks |
| Image generators | Latent spaces and convolutional layers |
| Autoencoders | Encoder and decoder maps |
| GANs | Generator and discriminator matrices |

Latent vector spaces are central. A model often maps a vector

$$
z\in\mathbb{R}^d
$$

to a generated object.

Interpolating between latent vectors can produce smooth changes in generated outputs.

## 140.18 Retrieval-Augmented Generation

Retrieval-augmented generation combines search with generation.

Documents are embedded as vectors:

$$
d_1,\ldots,d_N.
$$

A query is embedded as

$$
q.
$$

Retrieval selects documents with large similarity scores:

$$
q^Td_i.
$$

The selected documents are then passed to a language model as context.

Thus RAG systems depend heavily on:

| Component | Linear algebra operation |
|---|---|
| Embedding model | Vector representation |
| Vector database | Nearest-neighbor search |
| Similarity scoring | Dot products or cosine similarity |
| Reranking | Matrix and vector scoring |
| Generation | Transformer inference |

The retrieval step is essentially large-scale vector search.
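The scoring step can be sketched as one matrix-vector product followed by a top-\(k\) selection (random toy embeddings; production systems use approximate nearest-neighbor indexes rather than a full scan):

```python
import numpy as np

rng = np.random.default_rng(11)
N, d = 100, 16
docs = rng.standard_normal((N, d))     # document embeddings d_1, ..., d_N
q = rng.standard_normal(d)             # query embedding

scores = docs @ q                      # all N similarity scores q^T d_i at once
top_k = np.argsort(scores)[::-1][:5]   # indices of the 5 highest-scoring documents

assert scores[top_k[0]] == scores.max()
```

For cosine similarity instead of raw dot products, both the document rows and the query are normalized to unit length first; the scoring step is otherwise identical.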

## 140.19 Model Compression

Large AI models are expensive to store and run.

Linear algebra supports compression through:

| Method | Linear algebra idea |
|---|---|
| Low-rank factorization | Replace \(W\) by \(UV^T\) |
| Pruning | Remove small or unimportant weights |
| Quantization | Store lower-precision values |
| Sparse matrices | Exploit zeros |
| Distillation | Approximate one function by another |

Low-rank methods explicitly use matrix factorization. Quantization changes the scalar representation. Sparse methods change the matrix storage pattern.

These techniques reduce memory bandwidth and computational cost.

## 140.20 Hardware and Tensor Algebra

AI hardware is optimized for tensor operations.

A tensor program is usually a sequence of operations such as:

$$
C = AB,
$$

$$
Y = XW+b,
$$

$$
QK^T,
$$

and reductions such as sums, norms, and softmax.

Performance depends on:

| Factor | Linear algebra issue |
|---|---|
| Matrix shape | Arithmetic intensity |
| Memory layout | Data movement |
| Precision | Numerical error |
| Blocking | Cache and accelerator use |
| Sparsity | Irregular computation |

Even when the model is described statistically, the execution is numerical linear algebra.

## 140.21 AI for Linear Algebra

AI is also being used to discover or accelerate linear algebra algorithms.

One example is AlphaTensor, which used reinforcement learning to search for matrix multiplication algorithms with fewer scalar multiplications. Matrix multiplication is a core operation in linear algebra and machine learning, so algorithmic improvements can matter at large scale.

This reverses the usual relationship.

Linear algebra supports AI, and AI can search for better linear algebra procedures.

## 140.22 Numerical Stability

Modern AI uses finite precision arithmetic.

Common formats include 32-bit floating point, 16-bit floating point, bfloat16, and lower-precision quantized formats.

Numerical issues include:

| Issue | Effect |
|---|---|
| Overflow | Values exceed representable range |
| Underflow | Values become too small |
| Roundoff | Accumulated arithmetic error |
| Ill-conditioning | Small perturbations become large |
| Instability in softmax | Large exponentials |

Stable implementations often subtract the maximum before applying softmax. This keeps exponentials in a safe numerical range and is standard in attention implementations.
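The max-subtraction trick in a few lines; subtracting a constant from every entry leaves the softmax unchanged because the factor \(e^{-\max z}\) cancels between numerator and denominator:

```python
import numpy as np

def softmax_stable(z):
    """Subtract the max before exponentiating to keep exp in range."""
    shifted = z - z.max()              # largest shifted entry is 0
    e = np.exp(shifted)
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])  # naive exp(z) would overflow to inf
p = softmax_stable(z)
assert np.isclose(p.sum(), 1.0)
assert not np.any(np.isnan(p))
```

With the shift, the largest exponent is exactly \(e^0=1\), so overflow is impossible regardless of the input scale.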

## 140.23 Summary

Modern AI is applied linear algebra at large scale.

The central ideas are:

| Concept | AI role |
|---|---|
| Vectors | Represent data and parameters |
| Matrices | Represent learned transformations |
| Tensors | Store batches, activations, and weights |
| Inner products | Similarity and attention scores |
| Matrix multiplication | Core computation |
| Gradients | Training signal |
| Jacobians | Chain rule and backpropagation |
| SVD | Compression and dimensionality reduction |
| Eigenvectors | PCA, graph learning, spectral methods |
| Low-rank approximation | Efficient models |
| Sparse matrices | Efficient storage and computation |
| State space matrices | Long-sequence modeling |
| Vector search | Retrieval and recommendation |

AI systems may appear complex at the application level, but their computational core is a small set of linear-algebraic operations repeated at very large scale.
