# Scalars, Vectors, Matrices, and Tensors

Deep learning represents data and computation using arrays of numbers. These arrays may have different numbers of axes. A single number is a scalar. A one-dimensional array is a vector. A two-dimensional array is a matrix. An array with any number of axes is a tensor.

This language is used throughout deep learning because neural networks operate on numerical data. Images, text, audio, graphs, and actions must all be encoded as tensors before a model can process them. The model itself is also made of tensors: weights, biases, activations, gradients, optimizer states, and losses.

### Scalars

A scalar is a single number. Examples include

$$
3,\quad -1.7,\quad \pi,\quad 0.001.
$$

In deep learning, scalars appear as losses, learning rates, regularization constants, probabilities, and individual tensor entries.

For example, the learning rate in gradient descent is usually a scalar:

$$
\eta = 0.001.
$$

The loss value for one batch is also a scalar:

$$
L = 2.43.
$$

A scalar has no axis. In PyTorch, a scalar tensor has shape

```python
torch.Size([])
```

For example:

```python
import torch

x = torch.tensor(3.0)
print(x.shape)  # torch.Size([])
```

Although this object is stored as a tensor, mathematically it represents a scalar.
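
When a plain Python number is needed, for example when logging a loss value, a scalar tensor can be converted with `.item()`:

```python
x = torch.tensor(3.0)

value = x.item()  # convert the zero-axis tensor to a plain Python float
print(value)      # 3.0
```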

### Vectors

A vector is an ordered list of numbers. A vector is usually written as

$$
x =
\begin{bmatrix}
x_1 \\
x_2 \\
\vdots \\
x_n
\end{bmatrix}.
$$

The vector \(x\) has \(n\) components. We say that \(x\in\mathbb{R}^n\), meaning that \(x\) is a vector of length \(n\) with real-valued entries.

For example,

$$
x =
\begin{bmatrix}
1.2 \\
0.7 \\
-3.1
\end{bmatrix}
\in \mathbb{R}^3.
$$

Vectors are used to represent feature lists, embeddings, model parameters, gradients, and hidden states.

A data point with three features can be represented as

$$
x =
\begin{bmatrix}
\text{height} \\
\text{weight} \\
\text{age}
\end{bmatrix}.
$$

A word embedding may be a vector in \(\mathbb{R}^{768}\). A hidden state in a transformer may be a vector in \(\mathbb{R}^{4096}\).

In PyTorch:

```python
x = torch.tensor([1.2, 0.7, -3.1])
print(x.shape)  # torch.Size([3])
```

The shape `[3]` means that the tensor has one axis of length 3.

### Matrices

A matrix is a rectangular array of numbers with rows and columns. A matrix with \(m\) rows and \(n\) columns is written as

$$
A =
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}.
$$

We write

$$
A\in\mathbb{R}^{m\times n}.
$$

The entry in row \(i\) and column \(j\) is denoted by \(a_{ij}\) or \(A_{ij}\).

Matrices are central to neural networks. A linear layer is defined by a weight matrix and a bias vector. If \(x\in\mathbb{R}^n\), \(W\in\mathbb{R}^{m\times n}\), and \(b\in\mathbb{R}^m\), then a linear layer computes

$$
y = Wx + b.
$$

Here \(y\in\mathbb{R}^m\). The matrix \(W\) transforms the input vector into another vector, and the bias \(b\) shifts the result.

In PyTorch:

```python
W = torch.randn(4, 3)
x = torch.randn(3)
b = torch.randn(4)

y = W @ x + b
print(y.shape)  # torch.Size([4])
```

The expression `W @ x` performs matrix-vector multiplication.
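
As a sanity check, an entry of `y` can be recomputed directly from the definition \(y_i = \sum_j W_{ij} x_j + b_i\). A minimal sketch, reusing `W`, `x`, and `b` from the block above:

```python
# Recompute y[0] entrywise: y_0 = sum_j W[0, j] * x[j] + b[0]
y0 = sum(W[0, j] * x[j] for j in range(3)) + b[0]

print(torch.allclose(y0, y[0]))  # True
```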

### Batches as Matrices

Deep learning models usually process many examples at once. This collection of examples is called a batch.

Suppose each input example is a vector in \(\mathbb{R}^d\). A batch of \(B\) examples can be stored as a matrix

$$
X\in\mathbb{R}^{B\times d}.
$$

Each row is one example:

$$
X =
\begin{bmatrix}
- & x_1^\top & - \\
- & x_2^\top & - \\
& \vdots & \\
- & x_B^\top & -
\end{bmatrix}.
$$

If a linear layer has weight matrix \(W\in\mathbb{R}^{d\times h}\) and bias \(b\in\mathbb{R}^h\), then the whole batch can be transformed as

$$
Y = XW + b.
$$

Here

$$
Y\in\mathbb{R}^{B\times h}.
$$

In PyTorch:

```python
B = 32
d = 128
h = 64

X = torch.randn(B, d)
W = torch.randn(d, h)
b = torch.randn(h)

Y = X @ W + b
print(Y.shape)  # torch.Size([32, 64])
```

The bias vector `b` is automatically broadcast across the batch dimension.
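
In practice this transformation is usually wrapped in `torch.nn.Linear`. Note that PyTorch stores the weight with shape `[h, d]`, the transpose of the \(W\) above, and internally computes `X @ weight.T + bias`. A short sketch, reusing `X`, `d`, and `h` from the previous block:

```python
layer = torch.nn.Linear(d, h)
print(layer.weight.shape)  # torch.Size([64, 128]), i.e. [h, d]

Y = layer(X)               # equivalent to X @ layer.weight.T + layer.bias
print(Y.shape)             # torch.Size([32, 64])
```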

### Tensors

A tensor is a multidimensional array. Scalars, vectors, and matrices are special cases:

| Object | Number of axes | Example shape |
|---|---:|---|
| Scalar | 0 | `[]` |
| Vector | 1 | `[d]` |
| Matrix | 2 | `[m, n]` |
| Tensor | 3 or more | `[B, C, H, W]` |

In deep learning, the word tensor often means any such array, regardless of the number of axes. A scalar, a vector, a matrix, and a four-dimensional array are all represented by the same `torch.Tensor` type in PyTorch.

A color image is often represented as a 3-dimensional tensor:

$$
X\in\mathbb{R}^{C\times H\times W},
$$

where \(C\) is the number of channels, \(H\) is height, and \(W\) is width.

A batch of images is represented as

$$
X\in\mathbb{R}^{B\times C\times H\times W}.
$$

For example, a batch of 32 RGB images of size \(224\times224\) has shape

```python
[32, 3, 224, 224]
```

In PyTorch:

```python
images = torch.randn(32, 3, 224, 224)
print(images.shape)  # torch.Size([32, 3, 224, 224])
```

For text models, a batch of token embeddings may have shape

```python
[batch_size, sequence_length, embedding_dim]
```

For example:

```python
tokens = torch.randn(16, 128, 768)
```

This tensor may represent 16 sequences, each with 128 tokens, where each token has a 768-dimensional embedding.
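
Such a tensor is typically produced by an embedding layer, which maps integer token IDs of shape `[batch_size, sequence_length]` to embedding vectors. A minimal sketch, with an assumed vocabulary size of 50,000 for illustration:

```python
vocab_size = 50000  # assumed vocabulary size, for illustration only
embedding = torch.nn.Embedding(vocab_size, 768)

token_ids = torch.randint(0, vocab_size, (16, 128))  # [B, T], dtype int64
tokens = embedding(token_ids)
print(tokens.shape)  # torch.Size([16, 128, 768])
```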

### Axes and Shape

The shape of a tensor gives the length of each axis. If

$$
X\in\mathbb{R}^{B\times T\times D},
$$

then \(X\) has three axes:

| Axis | Meaning | Size |
|---|---|---:|
| 0 | Batch axis | \(B\) |
| 1 | Sequence axis | \(T\) |
| 2 | Feature axis | \(D\) |

In PyTorch:

```python
X = torch.randn(8, 10, 64)

print(X.shape)     # torch.Size([8, 10, 64])
print(X.ndim)      # 3
print(X.shape[0])  # 8
print(X.shape[1])  # 10
print(X.shape[2])  # 64
```

Shape discipline is essential. Many bugs in neural network code are shape errors: a model may be mathematically correct but still fail at runtime because two tensors have incompatible shapes.

For example, matrix multiplication requires matching inner dimensions:

$$
A\in\mathbb{R}^{m\times n},\quad B\in\mathbb{R}^{n\times p}.
$$

Then

$$
AB\in\mathbb{R}^{m\times p}.
$$

But if the second matrix has shape \(q\times p\) with \(q\neq n\), the product is undefined.

In PyTorch:

```python
A = torch.randn(5, 3)
B = torch.randn(3, 2)

C = A @ B
print(C.shape)  # torch.Size([5, 2])
```
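
When the inner dimensions do not match, PyTorch raises an error rather than silently producing a wrong result:

```python
A = torch.randn(5, 3)
B = torch.randn(4, 2)  # inner dimension 4 does not match 3

try:
    C = A @ B
except RuntimeError as e:
    print(e)  # the error message reports the incompatible shapes
```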

### Indexing Tensor Entries

A vector entry is indexed by one integer:

$$
x_i.
$$

A matrix entry is indexed by two integers:

$$
A_{ij}.
$$

A 3-dimensional tensor entry is indexed by three integers:

$$
X_{ijk}.
$$

A 4-dimensional image batch entry may be written as

$$
X_{bchw},
$$

where \(b\) is the batch index, \(c\) is the channel index, \(h\) is the vertical pixel coordinate, and \(w\) is the horizontal pixel coordinate.

In PyTorch:

```python
X = torch.randn(32, 3, 224, 224)

pixel = X[0, 1, 20, 30]
```

This selects one scalar value: image 0, channel 1, row 20, column 30.

Slicing selects a subtensor:

```python
first_image = X[0]
print(first_image.shape)  # torch.Size([3, 224, 224])

first_channel = X[:, 0, :, :]
print(first_channel.shape)  # torch.Size([32, 224, 224])
```

Indexing reduces or selects axes. This is one of the most common operations in model implementation.
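
A useful distinction: indexing with an integer removes an axis, while slicing with a range keeps it. A minimal sketch:

```python
X = torch.randn(32, 3, 224, 224)

print(X[0].shape)    # torch.Size([3, 224, 224])    axis removed
print(X[0:1].shape)  # torch.Size([1, 3, 224, 224]) axis kept, length 1
```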

### Data Types

Tensors also have data types. The shape describes the arrangement of entries. The data type describes how each entry is stored.

Common PyTorch data types include:

| PyTorch dtype | Meaning | Common use |
|---|---|---|
| `torch.float32` | 32-bit floating point | Standard neural network training |
| `torch.float16` | 16-bit floating point | Mixed precision training |
| `torch.bfloat16` | 16-bit brain floating point | Large model training |
| `torch.float64` | 64-bit floating point | Numerical analysis, scientific computing |
| `torch.int64` | 64-bit integer | Token IDs, class labels |
| `torch.bool` | Boolean | Masks |

Example:

```python
x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
labels = torch.tensor([0, 1, 4], dtype=torch.int64)
mask = torch.tensor([True, False, True])
```

A model’s parameters are usually floating-point tensors. Class labels are usually integer tensors. Attention masks are often Boolean tensors.
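
A tensor's data type can be inspected with `.dtype` and converted with `.to`. A short sketch:

```python
x = torch.tensor([1.0, 2.0, 3.0])
print(x.dtype)  # torch.float32, the default floating-point dtype

half = x.to(torch.float16)  # cast, e.g. for mixed precision
print(half.dtype)           # torch.float16
```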

### Tensors as Data, Parameters, and Gradients

In deep learning, tensors appear in three main roles.

First, tensors represent data. Images, tokens, audio frames, tabular records, and graph features are converted into tensors.

Second, tensors represent parameters. The weights and biases of a model are tensors learned from data.

Third, tensors represent gradients. During training, automatic differentiation computes derivatives of the loss with respect to parameters. These derivatives are also stored as tensors.

For a parameter tensor \(W\), the corresponding gradient tensor has the same shape:

$$
W\in\mathbb{R}^{m\times n},
\quad
\nabla_W L\in\mathbb{R}^{m\times n}.
$$

In PyTorch:

```python
linear = torch.nn.Linear(3, 4)

print(linear.weight.shape)       # torch.Size([4, 3])
print(linear.bias.shape)         # torch.Size([4])

x = torch.randn(5, 3)
y = linear(x)

loss = y.sum()
loss.backward()

print(linear.weight.grad.shape)  # torch.Size([4, 3])
print(linear.bias.grad.shape)    # torch.Size([4])
```

The gradient of each parameter matches the parameter’s shape because each parameter entry receives its own derivative.

### Broadcasting

Broadcasting allows operations between tensors of different but compatible shapes.

For example, suppose

$$
X\in\mathbb{R}^{B\times d}
$$

and

$$
b\in\mathbb{R}^d.
$$

Then \(X+b\) adds the same vector \(b\) to every row of \(X\). PyTorch performs this automatically:

```python
X = torch.randn(32, 64)
b = torch.randn(64)

Y = X + b
print(Y.shape)  # torch.Size([32, 64])
```

Broadcasting follows simple shape rules. Dimensions are compared from the right, and two dimensions are compatible when they are equal or one of them is 1. A tensor with fewer axes is treated as if it had extra leading axes of length 1.

Examples:

| Shape A | Shape B | Result |
|---|---|---|
| `[32, 64]` | `[64]` | `[32, 64]` |
| `[32, 64]` | `[1, 64]` | `[32, 64]` |
| `[32, 10, 64]` | `[64]` | `[32, 10, 64]` |
| `[32, 10, 64]` | `[10, 64]` | `[32, 10, 64]` |
| `[32, 10, 64]` | `[32]` | Invalid |

Broadcasting is convenient, but it can hide mistakes. When a model behaves incorrectly, broadcasted operations are often the first place to check.
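
A common pitfall: adding a column vector of shape `[B, 1]` to a vector of shape `[B]` broadcasts both to `[B, B]` instead of raising an error. A minimal sketch of the mistake:

```python
a = torch.randn(32, 1)
b = torch.randn(32)

c = a + b       # [32, 1] and [32] broadcast to [32, 32]
print(c.shape)  # torch.Size([32, 32]), probably not what was intended
```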

### Reshaping Tensors

Reshaping changes how tensor entries are viewed without changing their values. In PyTorch, common reshaping operations include `reshape`, `view`, `flatten`, `unsqueeze`, `squeeze`, and `permute`.

Example:

```python
X = torch.randn(32, 3, 224, 224)

flat = X.reshape(32, -1)
print(flat.shape)  # torch.Size([32, 150528])
```

The value `-1` asks PyTorch to infer the missing dimension.

Adding an axis:

```python
x = torch.randn(64)
x = x.unsqueeze(0)

print(x.shape)  # torch.Size([1, 64])
```

Removing an axis of length 1:

```python
x = torch.randn(1, 64)
x = x.squeeze(0)

print(x.shape)  # torch.Size([64])
```

Changing axis order:

```python
X = torch.randn(32, 224, 224, 3)  # NHWC
Y = X.permute(0, 3, 1, 2)         # NCHW

print(Y.shape)  # torch.Size([32, 3, 224, 224])
```

This operation is common when moving image data between libraries. Many image libraries use height-width-channel layout, while PyTorch convolution layers usually expect channel-first layout.
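
Note that `permute` returns a view with rearranged strides rather than a copy, so the result is not contiguous in memory. `view` requires contiguous memory, while `reshape` copies when necessary. A short sketch, reusing `Y` from the block above:

```python
print(Y.is_contiguous())  # False: permute only rearranged the strides

# view would raise an error here; make the memory contiguous first
flat = Y.contiguous().view(32, -1)
print(flat.shape)  # torch.Size([32, 150528])
```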

### Tensor Conventions in Deep Learning

Different fields use different tensor shape conventions.

For computer vision in PyTorch, image batches usually use

$$
[B, C, H, W].
$$

For language models, token batches often use

$$
[B, T]
$$

for token IDs and

$$
[B, T, D]
$$

for embeddings.

For audio, one may use

$$
[B, C, T]
$$

for waveforms or

$$
[B, F, T]
$$

for spectrograms.

For graph neural networks, node features are often stored as

$$
[N, D],
$$

where \(N\) is the number of nodes and \(D\) is the feature dimension. Edge information may be stored separately as an edge index tensor.
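
One common convention, used for example by PyTorch Geometric, stores the edge index as an integer tensor of shape `[2, E]`, where each column holds the source and target node of one edge. A minimal sketch in plain PyTorch:

```python
N, D = 5, 16
node_features = torch.randn(N, D)  # [N, D]

# three directed edges: 0 -> 1, 1 -> 2, 2 -> 0
edge_index = torch.tensor([[0, 1, 2],
                           [1, 2, 0]])  # [2, E], dtype int64
print(edge_index.shape)  # torch.Size([2, 3])
```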

A clear naming convention prevents errors. Throughout this book, we will commonly use:

| Symbol | Meaning |
|---|---|
| \(B\) | Batch size |
| \(C\) | Number of channels |
| \(H\) | Height |
| \(W\) | Width |
| \(T\) | Sequence length |
| \(D\) | Feature or embedding dimension |
| \(N\) | Number of samples or nodes |
| \(V\) | Vocabulary size |

### Why Tensor Thinking Matters

Deep learning is easiest to understand when we track both the mathematical operation and the tensor shape.

For example, a classifier may transform an input batch

$$
X\in\mathbb{R}^{B\times d}
$$

into logits

$$
Z\in\mathbb{R}^{B\times K},
$$

where \(K\) is the number of classes.

Each row of \(Z\) contains the class scores for one example. A softmax function converts these scores into probabilities. A loss function compares the probabilities with the true labels. Backpropagation computes gradients with respect to every parameter tensor.

The full training step can be summarized as a flow of tensors:

$$
X
\longrightarrow
Z
\longrightarrow
L
\longrightarrow
\nabla_\theta L.
$$

The input \(X\) is a batch tensor. The logits \(Z\) are an output tensor. The loss \(L\) is a scalar tensor. The gradients \(\nabla_\theta L\) are tensors with the same shapes as the model parameters.

A neural network implementation is therefore a program that moves tensors through differentiable operations.
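
Putting the pieces together, here is a minimal sketch of one training step for a linear classifier, assuming \(d = 128\) input features and \(K = 10\) classes:

```python
import torch
import torch.nn.functional as F

B, d, K = 32, 128, 10

model = torch.nn.Linear(d, K)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

X = torch.randn(B, d)                   # input batch, [B, d]
labels = torch.randint(0, K, (B,))      # class labels, [B], int64

logits = model(X)                       # Z, [B, K]
loss = F.cross_entropy(logits, labels)  # scalar tensor L

optimizer.zero_grad()
loss.backward()                         # gradient tensors for every parameter
optimizer.step()
```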

### Summary

A scalar is a single number. A vector is a one-dimensional array. A matrix is a two-dimensional array. A tensor is a multidimensional array.

Deep learning systems use tensors to represent data, model parameters, activations, losses, and gradients. Tensor shape controls which operations are valid. Tensor data type controls numerical representation. Tensor layout affects performance and compatibility with model layers.

A good PyTorch programmer learns to read tensor code by asking three questions: what is the shape, what does each axis mean, and how does each operation change it?

