Matrix Operations

Matrix operations are the main arithmetic language of deep learning. A linear layer, an attention head, an embedding projection, and many normalization steps can all be written as matrix expressions. PyTorch supports these operations directly through the @ operator, torch.matmul, torch.mm, torch.bmm, and the functions in torch.linalg.

This section introduces the matrix operations needed for neural network implementation and shape reasoning.

Matrix Shape

A matrix is a two-dimensional tensor:

A \in \mathbb{R}^{m \times n}.

The first dimension is the number of rows. The second dimension is the number of columns.

import torch

A = torch.randn(3, 4)

print(A.shape)  # torch.Size([3, 4])

The matrix has 3 rows and 4 columns. It contains 3 × 4 = 12 entries.

For deep learning, a batch of feature vectors is often stored as a matrix:

X \in \mathbb{R}^{B \times D}.

Here B is the batch size and D is the feature dimension. Each row is one example.
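
For example, a batch of 32 examples with 128 features each (arbitrary sizes) is stored as a 32 × 128 matrix:

X = torch.randn(32, 128)

print(X.shape)     # torch.Size([32, 128])
print(X[0].shape)  # torch.Size([128]), one example (one row)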

Matrix Addition

Two matrices can be added when they have the same shape.

If

A, B \in \mathbb{R}^{m \times n},

then

C = A + B

also has shape

C \in \mathbb{R}^{m \times n}.

Each entry is added independently:

C_{ij} = A_{ij} + B_{ij}.

In PyTorch:

A = torch.randn(3, 4)
B = torch.randn(3, 4)

C = A + B

print(C.shape)  # torch.Size([3, 4])

Matrix addition is elementwise. It does not mix rows or columns.

Scalar Multiplication

A matrix can be multiplied by a scalar.

C = \alpha A.

Each entry is scaled:

C_{ij} = \alpha A_{ij}.

In PyTorch:

A = torch.randn(3, 4)

C = 0.1 * A

print(C.shape)  # torch.Size([3, 4])

Scalar multiplication appears in learning rates, weight decay, residual scaling, temperature scaling, and normalization.

Elementwise Matrix Multiplication

The expression A * B performs elementwise multiplication.

If

A, B \in \mathbb{R}^{m \times n},

then

C = A \odot B

has entries

C_{ij} = A_{ij} B_{ij}.

In PyTorch:

A = torch.tensor([
    [1.0, 2.0],
    [3.0, 4.0],
])

B = torch.tensor([
    [10.0, 20.0],
    [30.0, 40.0],
])

C = A * B

print(C)

Output:

tensor([[ 10.,  40.],
        [ 90., 160.]])

This operation is also called the Hadamard product. It is used in gates, masks, dropout, attention masks, and elementwise feature interactions.
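
As a small illustration of masking (the values below are arbitrary), multiplying by a 0/1 matrix zeroes out selected entries while leaving the rest unchanged:

scores = torch.randn(2, 3)
mask = torch.tensor([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
])

masked = scores * mask  # entries where mask is 0 become 0

print(masked)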

Matrix Multiplication

Matrix multiplication combines rows of one matrix with columns of another.

If

A \in \mathbb{R}^{m \times n}, \quad B \in \mathbb{R}^{n \times p},

then

C = AB

has shape

C \in \mathbb{R}^{m \times p}.

Each entry is a dot product:

C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}.

In PyTorch:

A = torch.randn(5, 3)
B = torch.randn(3, 2)

C = A @ B

print(C.shape)  # torch.Size([5, 2])

The inner dimensions must match:

(m \times n)(n \times p) \rightarrow (m \times p).

If the inner dimensions differ, matrix multiplication is undefined.

A = torch.randn(5, 3)
B = torch.randn(4, 2)

# C = A @ B  # error: 3 does not match 4

Matrix Multiplication as Linear Transformation

A matrix can represent a linear transformation.

If

x \in \mathbb{R}^{n}

and

W \in \mathbb{R}^{m \times n},

then

y = Wx

produces

y \in \mathbb{R}^{m}.

The matrix W maps an input vector in \mathbb{R}^{n} to an output vector in \mathbb{R}^{m}.

In PyTorch:

W = torch.randn(4, 3)
x = torch.randn(3)

y = W @ x

print(y.shape)  # torch.Size([4])

A neural network layer usually adds a bias:

y = Wx + b.

b = torch.randn(4)

y = W @ x + b

This is the mathematical form of a fully connected layer.

Batch Matrix Form of a Linear Layer

PyTorch usually stores a batch of inputs as rows:

X \in \mathbb{R}^{B \times D_{\text{in}}}.

A linear layer maps each input to an output vector:

Y \in \mathbb{R}^{B \times D_{\text{out}}}.

Using a weight matrix

W \in \mathbb{R}^{D_{\text{out}} \times D_{\text{in}}},

the operation is commonly written as

Y = XW^\top + b.

Shapes:

(B \times D_{\text{in}}) (D_{\text{in}} \times D_{\text{out}}) = (B \times D_{\text{out}}).

In PyTorch:

B = 32
din = 128
dout = 64

X = torch.randn(B, din)
W = torch.randn(dout, din)
b = torch.randn(dout)

Y = X @ W.T + b

print(Y.shape)  # torch.Size([32, 64])

This matches the behavior of torch.nn.Linear.

layer = torch.nn.Linear(din, dout)

Y = layer(X)

print(Y.shape)  # torch.Size([32, 64])
print(layer.weight.shape)  # torch.Size([64, 128])

PyTorch stores linear weights as [out_features, in_features].
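
As a sanity check (a minimal sketch reusing X and layer from above), the layer output should match the explicit matrix expression:

Y_manual = X @ layer.weight.T + layer.bias

print(torch.allclose(layer(X), Y_manual, atol=1e-6))  # True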

Transpose

The transpose of a matrix swaps rows and columns.

If

A \in \mathbb{R}^{m \times n},

then

A^\top \in \mathbb{R}^{n \times m}.

If

A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix},

then

A^\top = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}.

In PyTorch:

A = torch.tensor([
    [1, 2, 3],
    [4, 5, 6],
])

print(A.T)

For tensors with more than two axes, use transpose(dim0, dim1) or permute.

X = torch.randn(32, 10, 64)

Y = X.transpose(1, 2)

print(Y.shape)  # torch.Size([32, 64, 10])

Transpose is common in attention, convolution lowering, and shape alignment for matrix multiplication.

Matrix Multiplication and Transpose

Transpose often appears because different conventions store features along different axes.

Suppose a batch has shape:

X \in \mathbb{R}^{B \times D}.

A linear layer weight is stored as:

W \in \mathbb{R}^{K \times D}.

To compute output logits:

Z = XW^\top.

B, D, K = 16, 128, 10

X = torch.randn(B, D)
W = torch.randn(K, D)

Z = X @ W.T

print(Z.shape)  # torch.Size([16, 10])

Each row of Z contains K class scores for one input example.

Batched Matrix Multiplication

A 3D tensor can represent a batch of matrices.

If

A \in \mathbb{R}^{B \times m \times n}, \quad B \in \mathbb{R}^{B \times n \times p},

then

C \in \mathbb{R}^{B \times m \times p}.

Each batch item performs one independent matrix multiplication: the b-th slice of C is the product of the b-th slices of A and B.

In PyTorch:

A = torch.randn(8, 5, 3)
B = torch.randn(8, 3, 2)

C = torch.bmm(A, B)

print(C.shape)  # torch.Size([8, 5, 2])

torch.bmm requires both tensors to be 3D and have the same batch size.

torch.matmul is more general and supports broadcasting over leading dimensions.

A = torch.randn(8, 5, 3)
B = torch.randn(3, 2)

C = torch.matmul(A, B)

print(C.shape)  # torch.Size([8, 5, 2])

The matrix B is broadcast across the batch dimension.
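
A quick way to confirm the broadcasting behavior (reusing A, B, and C from above) is to compare against an explicit loop over the batch:

C_loop = torch.stack([A[i] @ B for i in range(A.shape[0])])

print(torch.allclose(C, C_loop))  # True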

Matrix Multiplication in Attention

Attention uses matrix multiplication to compare queries and keys.

Suppose:

Q, K, V \in \mathbb{R}^{B \times H \times T \times D}.

Here B is the batch size, H is the number of heads, T is the sequence length, and D is the head dimension.

Attention scores are computed as:

S = QK^\top.

For the last two dimensions:

(T \times D)(D \times T) = (T \times T).

In PyTorch:

B, H, T, D = 2, 4, 8, 16

Q = torch.randn(B, H, T, D)
K = torch.randn(B, H, T, D)

scores = Q @ K.transpose(-2, -1)

print(scores.shape)  # torch.Size([2, 4, 8, 8])

The result contains one attention score for every query-token and key-token pair.
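
The remaining steps of attention are also matrix operations. A minimal sketch, continuing from the tensors above and following the usual scaled dot-product form (divide the scores by the square root of D, apply softmax, then weight the values; masking is omitted):

import math

V = torch.randn(B, H, T, D)

weights = torch.softmax(scores / math.sqrt(D), dim=-1)  # [B, H, T, T]
out = weights @ V                                       # [B, H, T, D]

print(out.shape)  # torch.Size([2, 4, 8, 16])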

Identity Matrix

The identity matrix has ones on the diagonal and zeros elsewhere.

I = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.

For compatible shapes,

AI = A

and

IA = A.

In PyTorch:

I = torch.eye(3)

A = torch.randn(3, 3)

print(A @ I)

Identity matrices appear in residual connections, covariance matrices, regularization, and numerical linear algebra.

Diagonal Matrices

A diagonal matrix has nonzero entries only on the diagonal.

Given a vector

d \in \mathbb{R}^{n},

PyTorch can create a diagonal matrix:

d = torch.tensor([1.0, 2.0, 3.0])

D = torch.diag(d)

print(D)

Output:

tensor([[1., 0., 0.],
        [0., 2., 0.],
        [0., 0., 3.]])

Multiplying by a diagonal matrix scales each coordinate.

x = torch.tensor([10.0, 10.0, 10.0])

print(D @ x)

Output:

tensor([10., 20., 30.])

Normalization layers can be viewed partly as coordinate-wise scaling and shifting, though practical implementations avoid explicitly constructing large diagonal matrices.
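
In practice the same scaling is done directly with elementwise multiplication by the vector d, without building the diagonal matrix. A quick check, reusing d, D, and x from above:

print(torch.allclose(D @ x, d * x))  # True: same coordinate-wise scaling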

Matrix Inverse

For a square matrix A, the inverse A^{-1} satisfies:

A^{-1}A = I.

In PyTorch:

A = torch.randn(3, 3)
A_inv = torch.linalg.inv(A)

I = A_inv @ A

print(I)  # approximately the identity matrix, up to floating-point error

Matrix inverse is important in linear algebra, but direct inversion is usually avoided in deep learning code. Solving a linear system is often more stable and efficient.

Instead of computing

x = A^{-1}b,

use:

A = torch.randn(3, 3)
b = torch.randn(3)

x = torch.linalg.solve(A, b)

This computes the solution of

Ax = b.

Determinant and Trace

The determinant is a scalar value associated with a square matrix.

A = torch.randn(3, 3)

det = torch.linalg.det(A)

The trace is the sum of diagonal entries:

\operatorname{tr}(A) = \sum_i A_{ii}.

In PyTorch:

tr = torch.trace(A)

Determinants and traces appear in probabilistic models, Gaussian distributions, normalizing flows, covariance analysis, and optimization theory.

Rank

The rank of a matrix is the number of linearly independent rows or columns.

A = torch.randn(5, 3)

rank = torch.linalg.matrix_rank(A)

print(rank)

Low-rank structure is important in modern deep learning. Low-rank adaptation, matrix factorization, compression, and efficient fine-tuning all rely on the idea that a large matrix can sometimes be approximated by a product of smaller matrices.

For example, instead of learning

W \in \mathbb{R}^{d \times d},

we may learn:

W \approx AB,

where

A \in \mathbb{R}^{d \times r}, \quad B \in \mathbb{R}^{r \times d}, \quad r \ll d.

This reduces the parameter count from d^2 to 2dr.
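
A minimal sketch of this factorization (the sizes d and r below are arbitrary):

d, r = 512, 8

A = torch.randn(d, r)
B = torch.randn(r, d)

W_approx = A @ B  # rank at most r

print(W_approx.shape)                      # torch.Size([512, 512])
print(torch.linalg.matrix_rank(W_approx))  # tensor(8)
print(d * d, 2 * d * r)                    # 262144 vs. 8192 parameters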

Matrix Norms

A matrix norm measures the size of a matrix.

The Frobenius norm is:

\|A\|_F = \sqrt{ \sum_i \sum_j A_{ij}^2 }.

In PyTorch:

A = torch.randn(3, 4)

norm = torch.linalg.norm(A, ord="fro")

print(norm)

Matrix norms are used in regularization, stability analysis, gradient clipping, and spectral analysis.
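
For example, a squared Frobenius norm penalty on a weight matrix can be added to a loss (a minimal sketch; the placeholder data loss and the coefficient 1e-4 are arbitrary):

W = torch.randn(64, 128, requires_grad=True)

data_loss = torch.tensor(0.0)  # placeholder for the task loss
penalty = 1e-4 * torch.linalg.norm(W, ord="fro") ** 2

loss = data_loss + penalty
loss.backward()

print(W.grad.shape)  # torch.Size([64, 128])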

Practical Shape Rules

The most important matrix shape rules are:

Operation                    Shape rule
Addition                     Same shape, or broadcast-compatible
Elementwise multiplication   Same shape, or broadcast-compatible
Matrix multiplication        Inner dimensions must match
Transpose                    Swaps selected axes
Linear layer                 [B, Din] -> [B, Dout]
Attention scores             [B, H, T, D] @ [B, H, D, T] -> [B, H, T, T]

When debugging matrix code, write the shapes beside each expression.

Example:

# X: [B, Din]
# W: [Dout, Din]
# b: [Dout]
# Y: [B, Dout]

Y = X @ W.T + b

This habit prevents most shape errors.

Summary

Matrix operations are the computational core of neural networks. Elementwise operations act independently on entries. Matrix multiplication mixes coordinates through dot products. Transpose changes axis order. Batched matrix multiplication applies many matrix products in parallel.

A PyTorch linear layer is matrix multiplication plus bias addition. Attention is batched matrix multiplication between queries, keys, and values. Low-rank methods, normalization, probabilistic models, and optimization theory all rely on matrix operations.

Correct matrix programming requires tracking shapes. Efficient matrix programming also requires awareness of layout, contiguity, and broadcasting.