Matrix operations are the main arithmetic language of deep learning. A linear layer, an attention head, an embedding projection, and many normalization steps can all be written as matrix expressions. PyTorch supports these operations directly through the @ operator, torch.matmul, torch.mm, torch.bmm, and the functions in torch.linalg.
This section introduces the matrix operations needed for neural network implementation and shape reasoning.
Matrix Shape
A matrix is a two-dimensional tensor:
The first dimension is the number of rows. The second dimension is the number of columns.
```python
import torch

A = torch.randn(3, 4)
print(A.shape)  # torch.Size([3, 4])
```

The matrix has 3 rows and 4 columns. It contains $3 \times 4 = 12$ entries.
For deep learning, a batch of feature vectors is often stored as a matrix:

$$X \in \mathbb{R}^{B \times d}.$$

Here $B$ is the batch size and $d$ is the feature dimension. Each row is one example.
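As a small sketch (the sizes here are illustrative), a batch of four 3-dimensional feature vectors stored as rows:

```python
import torch

X = torch.randn(4, 3)  # 4 examples, 3 features each

# Row i is the feature vector of example i.
example0 = X[0]
print(example0.shape)  # torch.Size([3])
```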
Matrix Addition
Two matrices can be added when they have the same shape.
If

$$A \in \mathbb{R}^{m \times n}, \qquad B \in \mathbb{R}^{m \times n},$$

then

$$C = A + B$$

also has shape $m \times n$. Each entry is added independently:

$$C_{ij} = A_{ij} + B_{ij}.$$
In PyTorch:
```python
A = torch.randn(3, 4)
B = torch.randn(3, 4)
C = A + B
print(C.shape)  # torch.Size([3, 4])
```

Matrix addition is elementwise. It does not mix rows or columns.
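Addition also works between broadcast-compatible shapes; a common pattern is adding a length-$n$ vector to every row of a matrix, as bias terms do. A minimal sketch:

```python
import torch

A = torch.randn(3, 4)
row = torch.randn(4)  # broadcast against every row of A

C = A + row
print(C.shape)  # torch.Size([3, 4])
# Each row of C is the corresponding row of A plus `row`.
```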
Scalar Multiplication
A matrix can be multiplied by a scalar.
Each entry is scaled:

$$(\alpha A)_{ij} = \alpha A_{ij}.$$
In PyTorch:
```python
A = torch.randn(3, 4)
C = 0.1 * A
print(C.shape)  # torch.Size([3, 4])
```

Scalar multiplication appears in learning rates, weight decay, residual scaling, temperature scaling, and normalization.
Elementwise Matrix Multiplication
The expression A * B performs elementwise multiplication.
If

$$A, B \in \mathbb{R}^{m \times n},$$

then

$$C = A * B$$

has entries

$$C_{ij} = A_{ij} B_{ij}.$$
In PyTorch:
```python
A = torch.tensor([
    [1.0, 2.0],
    [3.0, 4.0],
])
B = torch.tensor([
    [10.0, 20.0],
    [30.0, 40.0],
])
C = A * B
print(C)
```

Output:

```
tensor([[ 10.,  40.],
        [ 90., 160.]])
```

This operation is also called the Hadamard product. It is used in gates, masks, dropout, attention masks, and elementwise feature interactions.
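One of those uses, masking, is worth a quick sketch: multiplying by a 0/1 mask zeroes out selected entries while leaving the rest unchanged.

```python
import torch

scores = torch.tensor([
    [1.0, 2.0],
    [3.0, 4.0],
])
mask = torch.tensor([
    [1.0, 0.0],
    [1.0, 1.0],
])

masked = scores * mask  # entries where mask is 0 become 0
print(masked)
```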
Matrix Multiplication
Matrix multiplication combines rows of one matrix with columns of another.
If

$$A \in \mathbb{R}^{m \times k}, \qquad B \in \mathbb{R}^{k \times n},$$

then

$$C = AB$$

has shape $m \times n$. Each entry is a dot product:

$$C_{ij} = \sum_{l=1}^{k} A_{il} B_{lj}.$$
In PyTorch:
```python
A = torch.randn(5, 3)
B = torch.randn(3, 2)
C = A @ B
print(C.shape)  # torch.Size([5, 2])
```

The inner dimensions must match: $[m, k] \,@\, [k, n] \to [m, n]$.
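The entry formula can be checked numerically: each entry of A @ B is the dot product of a row of A with a column of B.

```python
import torch

A = torch.randn(5, 3)
B = torch.randn(3, 2)
C = A @ B

# Entry (i, j) is the dot product of row i of A and column j of B.
i, j = 2, 1
entry = torch.dot(A[i], B[:, j])
assert torch.allclose(C[i, j], entry, atol=1e-5)
```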
If the inner dimensions differ, matrix multiplication is undefined.
```python
A = torch.randn(5, 3)
B = torch.randn(4, 2)
# C = A @ B  # error: 3 does not match 4
```

Matrix Multiplication as Linear Transformation
A matrix can represent a linear transformation.
If

$$W \in \mathbb{R}^{m \times n}$$

and

$$x \in \mathbb{R}^{n},$$

then

$$y = Wx$$

produces

$$y \in \mathbb{R}^{m}.$$

The matrix maps an input vector from $\mathbb{R}^{n}$ to an output vector in $\mathbb{R}^{m}$.
In PyTorch:
```python
W = torch.randn(4, 3)
x = torch.randn(3)
y = W @ x
print(y.shape)  # torch.Size([4])
```

A neural network layer usually adds a bias:

```python
b = torch.randn(4)
y = W @ x + b
```

This is the mathematical form of a fully connected layer.
Batch Matrix Form of a Linear Layer
PyTorch usually stores a batch of inputs as rows:

$$X \in \mathbb{R}^{B \times d_{\text{in}}}.$$

A linear layer maps each input to an output vector:

$$x \mapsto Wx + b.$$

Using a weight matrix

$$W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}},$$

the operation is commonly written as

$$Y = XW^{\top} + b.$$

Shapes: $[B, d_{\text{in}}] \times [d_{\text{in}}, d_{\text{out}}] \to [B, d_{\text{out}}]$, with $b \in \mathbb{R}^{d_{\text{out}}}$ broadcast across rows.
In PyTorch:
```python
B = 32
din = 128
dout = 64
X = torch.randn(B, din)
W = torch.randn(dout, din)
b = torch.randn(dout)
Y = X @ W.T + b
print(Y.shape)  # torch.Size([32, 64])
```

This matches the behavior of torch.nn.Linear.

```python
layer = torch.nn.Linear(din, dout)
Y = layer(X)
print(Y.shape)             # torch.Size([32, 64])
print(layer.weight.shape)  # torch.Size([64, 128])
```

PyTorch stores linear weights as [out_features, in_features].
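The equivalence between the manual expression and the module can be checked directly (a small sketch with illustrative sizes):

```python
import torch

layer = torch.nn.Linear(128, 64)
X = torch.randn(32, 128)

# Manual form: X @ W^T + b, using the module's own parameters.
manual = X @ layer.weight.T + layer.bias
module = layer(X)
assert torch.allclose(manual, module, atol=1e-5)
```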
Transpose
The transpose of a matrix swaps rows and columns.
If

$$A \in \mathbb{R}^{m \times n},$$

then

$$A^{\top} \in \mathbb{R}^{n \times m}.$$

If

$$B = A^{\top},$$

then

$$B_{ij} = A_{ji}.$$
In PyTorch:
```python
A = torch.tensor([
    [1, 2, 3],
    [4, 5, 6],
])
print(A.T)
```

For tensors with more than two axes, use transpose(dim0, dim1) or permute.

```python
X = torch.randn(32, 10, 64)
Y = X.transpose(1, 2)
print(Y.shape)  # torch.Size([32, 64, 10])
```

Transpose is common in attention, convolution lowering, and shape alignment for matrix multiplication.
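One practical detail: transpose returns a view with rearranged strides rather than a copy, so the result is usually non-contiguous. A short sketch:

```python
import torch

X = torch.randn(32, 10, 64)
Y = X.transpose(1, 2)

# Same underlying storage, different strides.
print(Y.is_contiguous())  # False
Z = Y.contiguous()        # materializes the transposed layout as a copy
print(Z.is_contiguous())  # True
```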
Matrix Multiplication and Transpose
Transpose often appears because different conventions store features along different axes.
Suppose a batch has shape:

$$X \in \mathbb{R}^{B \times D}.$$

A linear layer weight is stored as:

$$W \in \mathbb{R}^{K \times D},$$

where $K$ is the number of classes. To compute output logits:

$$Z = XW^{\top} \in \mathbb{R}^{B \times K}.$$
```python
B, D, K = 16, 128, 10
X = torch.randn(B, D)
W = torch.randn(K, D)
Z = X @ W.T
print(Z.shape)  # torch.Size([16, 10])
```

Each row of Z contains class scores for one input example.
Batched Matrix Multiplication
A 3D tensor can represent a batch of matrices.
If

$$A \in \mathbb{R}^{B \times m \times k}, \qquad B \in \mathbb{R}^{B \times k \times n},$$

then

$$C \in \mathbb{R}^{B \times m \times n}.$$

Each batch item performs one matrix multiplication.
In PyTorch:
```python
A = torch.randn(8, 5, 3)
B = torch.randn(8, 3, 2)
C = torch.bmm(A, B)
print(C.shape)  # torch.Size([8, 5, 2])
```

torch.bmm requires both tensors to be 3D and have the same batch size.

torch.matmul is more general and supports broadcasting over leading dimensions.

```python
A = torch.randn(8, 5, 3)
B = torch.randn(3, 2)
C = torch.matmul(A, B)
print(C.shape)  # torch.Size([8, 5, 2])
```

The matrix B is broadcast across the batch dimension.
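torch.matmul also has its own conventions for 1D operands: two vectors give a scalar dot product, and a matrix times a vector gives a vector. A short sketch:

```python
import torch

v = torch.randn(3)
w = torch.randn(3)
M = torch.randn(4, 3)

dot = torch.matmul(v, w)   # 0-dim tensor: the dot product
mv = torch.matmul(M, v)    # 1D tensor of length 4

print(dot.shape)  # torch.Size([])
print(mv.shape)   # torch.Size([4])
```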
Matrix Multiplication in Attention
Attention uses matrix multiplication to compare queries and keys.
Suppose:

$$Q, K \in \mathbb{R}^{B \times H \times T \times D}.$$

Here $B$ is batch size, $H$ is the number of heads, $T$ is sequence length, and $D$ is head dimension. Attention scores are computed as:

$$S = QK^{\top},$$

where the transpose swaps the last two dimensions. For the last two dimensions:

$$[T, D] \,@\, [D, T] \to [T, T].$$
In PyTorch:
```python
B, H, T, D = 2, 4, 8, 16
Q = torch.randn(B, H, T, D)
K = torch.randn(B, H, T, D)
scores = Q @ K.transpose(-2, -1)
print(scores.shape)  # torch.Size([2, 4, 8, 8])
```

The result contains one attention score for every query-token and key-token pair.
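Continuing the example, a minimal sketch of scaled dot-product attention: scale the scores, normalize each query's row with softmax, then take a weighted sum of values.

```python
import math

import torch

B, H, T, D = 2, 4, 8, 16
Q = torch.randn(B, H, T, D)
K = torch.randn(B, H, T, D)
V = torch.randn(B, H, T, D)

scores = Q @ K.transpose(-2, -1) / math.sqrt(D)  # [B, H, T, T]
weights = torch.softmax(scores, dim=-1)          # each row sums to 1
out = weights @ V                                # [B, H, T, D]
print(out.shape)  # torch.Size([2, 4, 8, 16])
```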
Identity Matrix
The identity matrix has ones on the diagonal and zeros elsewhere.
For compatible shapes,

$$AI = A$$

and

$$IA = A.$$
In PyTorch:
```python
I = torch.eye(3)
A = torch.randn(3, 3)
print(A @ I)
```

Identity matrices appear in residual connections, covariance matrices, regularization, and numerical linear algebra.
Diagonal Matrices
A diagonal matrix has nonzero entries only on the diagonal.
Given a vector

$$d = (d_1, d_2, \dots, d_n),$$

PyTorch can create a diagonal matrix $D = \mathrm{diag}(d)$:
```python
d = torch.tensor([1.0, 2.0, 3.0])
D = torch.diag(d)
print(D)
```

Output:

```
tensor([[1., 0., 0.],
        [0., 2., 0.],
        [0., 0., 3.]])
```

Multiplying by a diagonal matrix scales each coordinate.

```python
x = torch.tensor([10.0, 10.0, 10.0])
print(D @ x)
```

Output:

```
tensor([10., 20., 30.])
```

Normalization layers can be viewed partly as coordinate-wise scaling and shifting, though practical implementations avoid explicitly constructing large diagonal matrices.
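The reason the explicit matrix is avoided: multiplying by a diagonal matrix is exactly elementwise multiplication by its diagonal, so the matrix never needs to be built.

```python
import torch

d = torch.tensor([1.0, 2.0, 3.0])
x = torch.tensor([10.0, 10.0, 10.0])

dense = torch.diag(d) @ x  # materializes a 3x3 matrix
cheap = d * x              # same result, no matrix constructed
assert torch.allclose(dense, cheap)
```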
Matrix Inverse
For a square matrix $A$, the inverse $A^{-1}$ satisfies:

$$A^{-1}A = AA^{-1} = I.$$
In PyTorch:
```python
A = torch.randn(3, 3)
A_inv = torch.linalg.inv(A)
I = A_inv @ A
print(I)
```

Matrix inversion is important in linear algebra, but direct inversion is usually avoided in deep learning code. Solving a linear system is often more stable and efficient. Instead of computing

$$x = A^{-1}b,$$

use:

```python
A = torch.randn(3, 3)
b = torch.randn(3)
x = torch.linalg.solve(A, b)
```

This computes the solution of

$$Ax = b.$$
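The solution can be verified by substituting it back into the system (a sketch; a small random matrix is almost surely invertible and reasonably conditioned):

```python
import torch

torch.manual_seed(0)
A = torch.randn(3, 3)
b = torch.randn(3)

x = torch.linalg.solve(A, b)
# Substituting back should recover b up to floating-point error.
assert torch.allclose(A @ x, b, atol=1e-4)
```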
Determinant and Trace
The determinant is a scalar value associated with a square matrix.

```python
A = torch.randn(3, 3)
det = torch.linalg.det(A)
```

The trace is the sum of diagonal entries:

$$\mathrm{tr}(A) = \sum_{i} A_{ii}.$$

In PyTorch:

```python
tr = torch.trace(A)
```

Determinants and traces appear in probabilistic models, Gaussian distributions, normalizing flows, covariance analysis, and optimization theory.
Rank
The rank of a matrix is the number of linearly independent rows or columns.
```python
A = torch.randn(5, 3)
rank = torch.linalg.matrix_rank(A)
print(rank)
```

Low-rank structure is important in modern deep learning. Low-rank adaptation, matrix factorization, compression, and efficient fine-tuning all rely on the idea that a large matrix can sometimes be approximated by a product of smaller matrices. For example, instead of learning

$$W \in \mathbb{R}^{m \times n},$$

we may learn:

$$W = AB,$$

where

$$A \in \mathbb{R}^{m \times r}, \qquad B \in \mathbb{R}^{r \times n}, \qquad r \ll \min(m, n).$$

This reduces the parameter count from $mn$ to $r(m + n)$.
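A minimal sketch of the saving, with illustrative sizes $m = n = 256$ and rank $r = 8$:

```python
import torch

m, n, r = 256, 256, 8

W_full = torch.randn(m, n)  # m*n parameters
A = torch.randn(m, r)
B = torch.randn(r, n)
W_lowrank = A @ B           # stored as r*(m+n) parameters

print(W_full.numel())           # 65536
print(A.numel() + B.numel())    # 4096
# The product has rank at most r.
assert torch.linalg.matrix_rank(W_lowrank) <= r
```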
Matrix Norms
A matrix norm measures the size of a matrix. The Frobenius norm is:

$$\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}.$$

In PyTorch:

```python
A = torch.randn(3, 4)
norm = torch.linalg.norm(A, ord="fro")
print(norm)
```

Matrix norms are used in regularization, stability analysis, gradient clipping, and spectral analysis.
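The Frobenius norm can be checked against its definition directly:

```python
import torch

A = torch.randn(3, 4)
fro = torch.linalg.norm(A, ord="fro")

# Definition: square every entry, sum, take the square root.
manual = torch.sqrt((A ** 2).sum())
assert torch.allclose(fro, manual, atol=1e-5)
```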
Practical Shape Rules
The most important matrix shape rules are:
| Operation | Shape rule |
|---|---|
| Addition | Same shape, or broadcast-compatible |
| Elementwise multiplication | Same shape, or broadcast-compatible |
| Matrix multiplication | Inner dimensions must match |
| Transpose | Swaps selected axes |
| Linear layer | [B, Din] -> [B, Dout] |
| Attention scores | [B, H, T, D] @ [B, H, D, T] -> [B, H, T, T] |
When debugging matrix code, write the shapes beside each expression.
Example:
```python
# X: [B, Din]
# W: [Dout, Din]
# b: [Dout]
# Y: [B, Dout]
Y = X @ W.T + b
```

This habit prevents most shape errors.
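The shape comments can also be turned into runtime checks with plain assertions, a lightweight debugging habit (the sizes here are illustrative):

```python
import torch

B, Din, Dout = 32, 128, 64
X = torch.randn(B, Din)
W = torch.randn(Dout, Din)
b = torch.randn(Dout)

Y = X @ W.T + b
assert Y.shape == (B, Dout), f"unexpected shape: {Y.shape}"
```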
Summary
Matrix operations are the computational core of neural networks. Elementwise operations act independently on entries. Matrix multiplication mixes coordinates through dot products. Transpose changes axis order. Batched matrix multiplication applies many matrix products in parallel.
A PyTorch linear layer is matrix multiplication plus bias addition. Attention is batched matrix multiplication between queries, keys, and values. Low-rank methods, normalization, probabilistic models, and optimization theory all rely on matrix operations.
Correct matrix programming requires tracking shapes. Efficient matrix programming also requires awareness of layout, contiguity, and broadcasting.