Scalars, Vectors, Matrices, and Tensors

Deep learning represents data and computation using arrays of numbers. These arrays may have different numbers of axes. A single number is a scalar. A one-dimensional array is a vector. A two-dimensional array is a matrix. An array with any number of axes is a tensor.

This language is used throughout deep learning because neural networks operate on numerical data. Images, text, audio, graphs, and actions must all be encoded as tensors before a model can process them. The model itself is also made of tensors: weights, biases, activations, gradients, optimizer states, and losses.

Scalars

A scalar is a single number. Examples include

$$ 3, \quad -1.7, \quad \pi, \quad 0.001. $$

In deep learning, scalars appear as losses, learning rates, regularization constants, probabilities, and individual tensor entries.

For example, the learning rate in gradient descent is usually a scalar:

$$ \eta = 0.001. $$

The loss value for one batch is also a scalar:

$$ L = 2.43. $$

A scalar has no axis. In PyTorch, a scalar tensor has shape

torch.Size([])

For example:

import torch

x = torch.tensor(3.0)
print(x.shape)  # torch.Size([])

Although this object is stored as a tensor, mathematically it represents a scalar.
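To recover the plain Python number from a scalar tensor, call .item():

print(x.item())  # 3.0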

Vectors

A vector is an ordered list of numbers. A vector is usually written as

$$ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}. $$

The vector $x$ has $n$ components. We say that $x \in \mathbb{R}^n$, meaning that $x$ is a vector of length $n$ with real-valued entries.

For example,

$$ x = \begin{bmatrix} 1.2 \\ 0.7 \\ -3.1 \end{bmatrix} \in \mathbb{R}^3. $$

Vectors are used to represent feature lists, embeddings, model parameters, gradients, and hidden states.

A data point with three features can be represented as

$$ x = \begin{bmatrix} \text{height} \\ \text{weight} \\ \text{age} \end{bmatrix}. $$

A word embedding may be a vector in $\mathbb{R}^{768}$. A hidden state in a transformer may be a vector in $\mathbb{R}^{4096}$.

In PyTorch:

x = torch.tensor([1.2, 0.7, -3.1])
print(x.shape)  # torch.Size([3])

The shape [3] means that the tensor has one axis of length 3.

Matrices

A matrix is a rectangular array of numbers with rows and columns. A matrix with $m$ rows and $n$ columns is written as

$$ A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}. $$

We write

$$ A \in \mathbb{R}^{m\times n}. $$

The entry in row $i$ and column $j$ is denoted by $a_{ij}$ or $A_{ij}$.

Matrices are central to neural networks. A linear layer is defined by a weight matrix and a bias vector. If $x \in \mathbb{R}^n$, $W \in \mathbb{R}^{m\times n}$, and $b \in \mathbb{R}^m$, then a linear layer computes

$$ y = Wx + b. $$

Here $y \in \mathbb{R}^m$. The matrix $W$ transforms the input vector into another vector, and the bias $b$ shifts the result.

In PyTorch:

W = torch.randn(4, 3)
x = torch.randn(3)
b = torch.randn(4)

y = W @ x + b
print(y.shape)  # torch.Size([4])

The expression W @ x performs matrix-vector multiplication.

Batches as Matrices

Deep learning models usually process many examples at once. This collection of examples is called a batch.

Suppose each input example is a vector in $\mathbb{R}^d$. A batch of $B$ examples can be stored as a matrix

$$ X \in \mathbb{R}^{B\times d}. $$

Each row is one example:

$$ X = \begin{bmatrix} - & x_1^\top & - \\ - & x_2^\top & - \\ & \vdots & \\ - & x_B^\top & - \end{bmatrix}. $$

If a linear layer has weight matrix $W \in \mathbb{R}^{d\times h}$ and bias $b \in \mathbb{R}^h$, then the whole batch can be transformed as

$$ Y = XW + b. $$

Here

$$ Y \in \mathbb{R}^{B\times h}. $$

In PyTorch:

B = 32
d = 128
h = 64

X = torch.randn(B, d)
W = torch.randn(d, h)
b = torch.randn(h)

Y = X @ W + b
print(Y.shape)  # torch.Size([32, 64])

The bias vector b is automatically broadcast across the batch dimension.

Tensors

A tensor is a multidimensional array. Scalars, vectors, and matrices are special cases:

| Object | Number of axes | Example shape |
| --- | --- | --- |
| Scalar | 0 | [] |
| Vector | 1 | [d] |
| Matrix | 2 | [m, n] |
| Tensor | 3 or more | [B, C, H, W] |

In deep learning, the word tensor often means any array, regardless of the number of axes. Thus a scalar tensor, vector tensor, matrix tensor, and four-dimensional tensor are all tensors in PyTorch.
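A short check of the table above, using ndim to count axes (the specific shapes here are arbitrary examples):

import torch

s = torch.tensor(3.0)            # scalar: 0 axes
v = torch.randn(4)               # vector: 1 axis
M = torch.randn(2, 3)            # matrix: 2 axes
T = torch.randn(8, 3, 32, 32)    # 4-dimensional tensor: 4 axes

print(s.ndim, v.ndim, M.ndim, T.ndim)  # 0 1 2 4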

A color image is often represented as a 3-dimensional tensor:

$$ X \in \mathbb{R}^{C\times H\times W}, $$

where $C$ is the number of channels, $H$ is height, and $W$ is width.

A batch of images is represented as

$$ X \in \mathbb{R}^{B\times C\times H\times W}. $$

For example, a batch of 32 RGB images of size $224\times224$ has shape

[32, 3, 224, 224]

In PyTorch:

images = torch.randn(32, 3, 224, 224)
print(images.shape)  # torch.Size([32, 3, 224, 224])

For text models, a batch of token embeddings may have shape

[batch_size, sequence_length, embedding_dim]

For example:

tokens = torch.randn(16, 128, 768)

This tensor may represent 16 sequences, each with 128 tokens, where each token has a 768-dimensional embedding.

Axes and Shape

The shape of a tensor gives the length of each axis. If

$$ X \in \mathbb{R}^{B\times T\times D}, $$

then $X$ has three axes:

| Axis | Meaning | Size |
| --- | --- | --- |
| 0 | Batch axis | $B$ |
| 1 | Sequence axis | $T$ |
| 2 | Feature axis | $D$ |

In PyTorch:

X = torch.randn(8, 10, 64)

print(X.shape)     # torch.Size([8, 10, 64])
print(X.ndim)      # 3
print(X.shape[0])  # 8
print(X.shape[1])  # 10
print(X.shape[2])  # 64

Shape discipline is essential. Many of the most common bugs in neural network code are shape errors: a model may be mathematically correct but still fail because two tensors have incompatible shapes.

For example, matrix multiplication requires matching inner dimensions:

$$ A \in \mathbb{R}^{m\times n}, \quad B \in \mathbb{R}^{n\times p}. $$

Then

$$ AB \in \mathbb{R}^{m\times p}. $$

But if the second matrix instead has shape $q\times p$ with $q\neq n$, the product is undefined.

In PyTorch:

A = torch.randn(5, 3)
B = torch.randn(3, 2)

C = A @ B
print(C.shape)  # torch.Size([5, 2])
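If the inner dimensions do not match, PyTorch raises a runtime error instead. A minimal sketch (the exact error message varies across versions):

A = torch.randn(5, 3)
B = torch.randn(4, 2)

try:
    C = A @ B   # inner dimensions 3 and 4 do not match
except RuntimeError as e:
    print("shape error:", e)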

Indexing Tensor Entries

A vector entry is indexed by one integer:

$$ x_i. $$

A matrix entry is indexed by two integers:

$$ A_{ij}. $$

A 3-dimensional tensor entry is indexed by three integers:

$$ X_{ijk}. $$

A 4-dimensional image batch entry may be written as

$$ X_{bchw}, $$

where $b$ is the batch index, $c$ is the channel index, $h$ is the vertical pixel coordinate, and $w$ is the horizontal pixel coordinate.

In PyTorch:

X = torch.randn(32, 3, 224, 224)

pixel = X[0, 1, 20, 30]

This selects one scalar value: image 0, channel 1, row 20, column 30.

Slicing selects a subtensor:

first_image = X[0]
print(first_image.shape)  # torch.Size([3, 224, 224])

first_channel = X[:, 0, :, :]
print(first_channel.shape)  # torch.Size([32, 224, 224])

Indexing reduces or selects axes. This is one of the most common operations in model implementation.

Data Types

Tensors also have data types. The shape describes the arrangement of entries. The data type describes how each entry is stored.

Common PyTorch data types include:

| PyTorch dtype | Meaning | Common use |
| --- | --- | --- |
| torch.float32 | 32-bit floating point | Standard neural network training |
| torch.float16 | 16-bit floating point | Mixed precision training |
| torch.bfloat16 | 16-bit brain floating point | Large model training |
| torch.float64 | 64-bit floating point | Numerical analysis, scientific computing |
| torch.int64 | 64-bit integer | Token IDs, class labels |
| torch.bool | Boolean | Masks |

Example:

x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
labels = torch.tensor([0, 1, 4], dtype=torch.int64)
mask = torch.tensor([True, False, True])

A model’s parameters are usually floating-point tensors. Class labels are usually integer tensors. Attention masks are often Boolean tensors.
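You can inspect a tensor's dtype attribute and convert between types with .to(); a minimal sketch:

x = torch.tensor([1.0, 2.0, 3.0])
print(x.dtype)                 # torch.float32, the default for Python floats

x_half = x.to(torch.float16)   # cast to 16-bit floating point
print(x_half.dtype)            # torch.float16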

Tensors as Data, Parameters, and Gradients

In deep learning, tensors appear in three main roles.

First, tensors represent data. Images, tokens, audio frames, tabular records, and graph features are converted into tensors.

Second, tensors represent parameters. The weights and biases of a model are tensors learned from data.

Third, tensors represent gradients. During training, automatic differentiation computes derivatives of the loss with respect to parameters. These derivatives are also stored as tensors.

For a parameter tensor $W$, the corresponding gradient tensor has the same shape:

$$ W \in \mathbb{R}^{m\times n}, \quad \nabla_W L \in \mathbb{R}^{m\times n}. $$

In PyTorch:

linear = torch.nn.Linear(3, 4)

print(linear.weight.shape)       # torch.Size([4, 3])
print(linear.bias.shape)         # torch.Size([4])

x = torch.randn(5, 3)
y = linear(x)

loss = y.sum()
loss.backward()

print(linear.weight.grad.shape)  # torch.Size([4, 3])
print(linear.bias.grad.shape)    # torch.Size([4])

The gradient of each parameter matches the parameter’s shape because each parameter entry receives its own derivative.

Broadcasting

Broadcasting allows operations between tensors of different but compatible shapes.

For example, suppose

$$ X \in \mathbb{R}^{B\times d} $$

and

$$ b \in \mathbb{R}^d. $$

Then $X + b$ adds the same vector $b$ to every row of $X$. PyTorch performs this automatically:

X = torch.randn(32, 64)
b = torch.randn(64)

Y = X + b
print(Y.shape)  # torch.Size([32, 64])

Broadcasting follows shape rules. Shapes are compared from the right, axis by axis, and a tensor with fewer axes is treated as if it had extra leading axes of size 1. Two dimensions are compatible when they are equal or one of them is 1.

Examples:

| Shape A | Shape B | Result |
| --- | --- | --- |
| [32, 64] | [64] | [32, 64] |
| [32, 64] | [1, 64] | [32, 64] |
| [32, 10, 64] | [64] | [32, 10, 64] |
| [32, 10, 64] | [10, 64] | [32, 10, 64] |
| [32, 10, 64] | [32] | Invalid |
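A minimal sketch checking two rows of this table, including the invalid case:

X = torch.randn(32, 10, 64)
b = torch.randn(10, 64)

print((X + b).shape)   # torch.Size([32, 10, 64])

bad = torch.randn(32)
try:
    X + bad            # compared from the right: 64 vs. 32, incompatible
except RuntimeError as e:
    print("broadcast error:", e)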

Broadcasting is convenient, but it can hide mistakes. When a model behaves incorrectly, checking the shapes involved in broadcasted operations is often a good first step.

Reshaping Tensors

Reshaping changes how tensor entries are viewed without changing their values. In PyTorch, common reshaping operations include reshape, view, flatten, unsqueeze, squeeze, and permute.

Example:

X = torch.randn(32, 3, 224, 224)

flat = X.reshape(32, -1)
print(flat.shape)  # torch.Size([32, 150528])

The value -1 asks PyTorch to infer the missing dimension.

Adding an axis:

x = torch.randn(64)
x = x.unsqueeze(0)

print(x.shape)  # torch.Size([1, 64])

Removing an axis of length 1:

x = torch.randn(1, 64)
x = x.squeeze(0)

print(x.shape)  # torch.Size([64])

Changing axis order:

X = torch.randn(32, 224, 224, 3)  # NHWC
Y = X.permute(0, 3, 1, 2)         # NCHW

print(Y.shape)  # torch.Size([32, 3, 224, 224])

This operation is common when moving image data between libraries. Many image libraries use height-width-channel layout, while PyTorch convolution layers usually expect channel-first layout.
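For example, an image array produced by a NumPy-based library in height-width-channel order can be converted like this (a sketch; the random array stands in for a loaded image):

import numpy as np

img = np.random.rand(224, 224, 3).astype(np.float32)  # HWC image stand-in

x = torch.from_numpy(img).permute(2, 0, 1)  # HWC -> CHW
print(x.shape)  # torch.Size([3, 224, 224])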

Tensor Conventions in Deep Learning

Different fields use different tensor shape conventions.

For computer vision in PyTorch, image batches usually use

[B, C, H, W].

For language models, token batches often use

[B, T]

for token IDs and

[B, T, D]

for embeddings.

For audio, one may use

[B, C, T]

for waveforms or

[B, F, T]

for spectrograms.

For graph neural networks, node features are often stored as

[N, D],

where $N$ is the number of nodes and $D$ is the feature dimension. Edge information may be stored separately as an edge index tensor.
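As a sketch of this storage scheme, following the two-row edge-index convention used by libraries such as PyTorch Geometric (the graph below is an arbitrary example):

N, D = 5, 16
node_features = torch.randn(N, D)   # one D-dimensional feature vector per node

# each column is one directed edge: (source node, target node)
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 4]], dtype=torch.int64)

print(node_features.shape)  # torch.Size([5, 16])
print(edge_index.shape)     # torch.Size([2, 4])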

A clear naming convention prevents errors. Throughout this book, we will commonly use:

| Symbol | Meaning |
| --- | --- |
| $B$ | Batch size |
| $C$ | Number of channels |
| $H$ | Height |
| $W$ | Width |
| $T$ | Sequence length |
| $D$ | Feature or embedding dimension |
| $N$ | Number of samples or nodes |
| $V$ | Vocabulary size |

Why Tensor Thinking Matters

Deep learning is easiest to understand when we track both the mathematical operation and the tensor shape.

For example, a classifier may transform an input batch

$$ X \in \mathbb{R}^{B\times d} $$

into logits

$$ Z \in \mathbb{R}^{B\times K}, $$

where $K$ is the number of classes.

Each row of $Z$ contains the class scores for one example. A softmax function converts these scores into probabilities. A loss function compares the probabilities with the true labels. Backpropagation computes gradients with respect to every parameter tensor.
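A minimal sketch of this pipeline, assuming a single linear classifier and random data (all sizes below are arbitrary):

B, d, K = 32, 128, 10

X = torch.randn(B, d)               # input batch
labels = torch.randint(0, K, (B,))  # true class labels, dtype int64

classifier = torch.nn.Linear(d, K)
Z = classifier(X)                   # logits, shape [B, K]

# cross_entropy applies log-softmax to the logits internally
loss = torch.nn.functional.cross_entropy(Z, labels)
loss.backward()                     # fills .grad for every parameter tensor

print(Z.shape)     # torch.Size([32, 10])
print(loss.shape)  # torch.Size([])  -- a scalar tensor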

The full training step can be summarized as a flow of tensors:

$$ X \longrightarrow Z \longrightarrow L \longrightarrow \nabla_\theta L. $$

The input $X$ is a batch tensor. The logits $Z$ are an output tensor. The loss $L$ is a scalar tensor. The gradients $\nabla_\theta L$ are tensors with the same shapes as the model parameters.

A neural network implementation is therefore a program that moves tensors through differentiable operations.

Summary

A scalar is a single number. A vector is a one-dimensional array. A matrix is a two-dimensional array. A tensor is a multidimensional array.

Deep learning systems use tensors to represent data, model parameters, activations, losses, and gradients. Tensor shape controls which operations are valid. Tensor data type controls numerical representation. Tensor layout affects performance and compatibility with model layers.

A good PyTorch programmer learns to read tensor code by asking three questions: what is the shape, what does each axis mean, and how does each operation change the shape?