# Tensor Shapes, Dimensions, and Memory Layout

Deep learning systems manipulate tensors with millions or billions of numerical entries. Understanding the shape, dimensional structure, and memory organization of tensors is essential for building efficient neural networks in PyTorch.

Many deep learning errors arise from incorrect tensor shapes rather than incorrect mathematics. Likewise, many performance problems arise from inefficient memory layouts or unnecessary tensor copies. A strong understanding of tensor structure therefore affects both correctness and computational efficiency.

### Tensor Shape

The shape of a tensor describes the size of each axis.

If a tensor has shape

$$
(3, 4),
$$

then it has two axes. The first axis has size 3 and the second axis has size 4.

For example:

```python id="6kz4c1"
import torch

X = torch.randn(3, 4)

print(X.shape)
```

Output:

```python id="5f1nnr"
torch.Size([3, 4])
```

This tensor contains

$$
3 \times 4 = 12
$$

entries.

A tensor with shape

$$
(2, 3, 4)
$$

contains

$$
2 \times 3 \times 4 = 24
$$

entries.

In general, if a tensor has shape

$$
(d_1, d_2, \dots, d_n),
$$

then the total number of elements is

$$
\prod_{i=1}^{n} d_i.
$$

PyTorch provides the total number of elements through `numel()`:

```python id="n0e2rh"
X = torch.randn(2, 3, 4)

print(X.numel())
```

Output:

```python id="m1l2j4"
24
```

### Tensor Dimensions

The number of axes in a tensor is called its dimension, rank, or order.

| Tensor | Shape example | Number of dimensions |
|---|---|---:|
| Scalar | `[]` | 0 |
| Vector | `[5]` | 1 |
| Matrix | `[3, 4]` | 2 |
| 3D tensor | `[2, 3, 4]` | 3 |
| 4D tensor | `[32, 3, 224, 224]` | 4 |

In PyTorch:

```python id="2pbmbf"
X = torch.randn(32, 3, 224, 224)

print(X.ndim)
```

Output:

```python id="svu2eh"
4
```

The term “4D tensor” means that the tensor has four axes, not that it represents physical four-dimensional space.

### Semantic Meaning of Axes

The axes of a tensor usually carry semantic meaning.

A batch of RGB images commonly uses shape

$$
[B, C, H, W],
$$

where:

| Symbol | Meaning |
|---|---|
| \(B\) | Batch size |
| \(C\) | Number of channels |
| \(H\) | Image height |
| \(W\) | Image width |

For example:

```python id="xv4x3t"
images = torch.randn(32, 3, 224, 224)
```

This tensor represents:

- 32 images
- 3 color channels
- height 224
- width 224

Similarly, transformer models often use

$$
[B, T, D],
$$

where:

| Symbol | Meaning |
|---|---|
| \(B\) | Batch size |
| \(T\) | Sequence length |
| \(D\) | Embedding dimension |

Example:

```python id="0x3l6w"
tokens = torch.randn(16, 128, 768)
```

This may represent:

- 16 sequences
- 128 tokens per sequence
- 768-dimensional embeddings

Tensor programming requires tracking both the numerical shape and the meaning of each axis.

### Reshaping Tensors

Reshaping changes how a tensor's entries are interpreted without changing the entries themselves. When the memory layout allows it, `reshape()` returns a view rather than a copy.

Suppose a tensor has shape

$$
(2, 3, 4).
$$

Since it contains 24 entries, it can be reshaped into any compatible shape whose dimensions multiply to 24.

Example:

```python id="gr1f1f"
X = torch.randn(2, 3, 4)

Y = X.reshape(6, 4)

print(Y.shape)
```

Output:

```python id="b4jrzh"
torch.Size([6, 4])
```

The entries remain the same. Only the interpretation changes.

PyTorch allows automatic dimension inference with `-1`:

```python id="3oxtj5"
X = torch.randn(2, 3, 4)

Y = X.reshape(2, -1)

print(Y.shape)
```

Output:

```python id="0bp9vv"
torch.Size([2, 12])
```

PyTorch inferred the missing dimension automatically.
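
The inferred dimension must divide the total number of elements evenly; an incompatible request raises an error. A minimal sketch (the shapes here are arbitrary):

```python id="q8w2hd"
X = torch.randn(2, 3, 4)  # 24 elements in total

try:
    X.reshape(5, -1)  # fails: 24 is not divisible by 5
except RuntimeError as e:
    print(e)  # the requested shape is invalid for an input of size 24
```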

### Flattening

Flattening converts multiple axes into one axis.

This operation is common before fully connected layers.

Example:

```python id="m9k77s"
X = torch.randn(32, 3, 224, 224)

Y = X.flatten(start_dim=1)

print(Y.shape)
```

Output:

```python id="0a9txm"
torch.Size([32, 150528])
```

The batch axis is preserved while the remaining dimensions are collapsed.

Mathematically,

$$
3 \times 224 \times 224 = 150528.
$$

### Adding and Removing Dimensions

PyTorch provides operations for inserting or removing singleton dimensions.

A singleton dimension has size 1.

#### Unsqueeze

`unsqueeze()` inserts a new axis.

```python id="b4v7lv"
x = torch.randn(64)

print(x.shape)

x = x.unsqueeze(0)

print(x.shape)
```

Output:

```python id="g0q5t7"
torch.Size([64])
torch.Size([1, 64])
```

This operation is common when converting a single example into a batch.

#### Squeeze

`squeeze()` removes axes of size 1.

```python id="3h8h8m"
x = torch.randn(1, 64, 1)

print(x.shape)

x = x.squeeze()

print(x.shape)
```

Output:

```python id="jwd2ru"
torch.Size([1, 64, 1])
torch.Size([64])
```
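
Calling `squeeze()` with no arguments removes every size-1 axis, which can accidentally drop a batch axis when the batch happens to contain a single example. Passing an explicit dimension removes only that axis; a minimal sketch:

```python id="t6p1qa"
x = torch.randn(1, 64, 1)

print(x.squeeze(2).shape)  # torch.Size([1, 64]): only the trailing axis is removed
print(x.squeeze(0).shape)  # torch.Size([64, 1]): only the leading axis is removed
```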

### Permuting Axes

Sometimes tensor axes must be reordered.

PyTorch uses `permute()` for this purpose.

Example:

```python id="c1s1ij"
X = torch.randn(32, 224, 224, 3)

Y = X.permute(0, 3, 1, 2)

print(Y.shape)
```

Output:

```python id="0l4h5f"
torch.Size([32, 3, 224, 224])
```

The tensor originally used NHWC layout:

$$
[N, H, W, C].
$$

After permutation it uses NCHW layout:

$$
[N, C, H, W].
$$

PyTorch convolution layers typically expect channel-first tensors.
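
As a minimal sketch (the layer sizes are illustrative, not prescribed by PyTorch), the permuted channel-first tensor can be passed directly to a convolution layer:

```python id="c7n2vw"
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

out = conv(Y)  # Y has shape [32, 3, 224, 224] after the permutation above

print(out.shape)  # torch.Size([32, 8, 224, 224])
```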

### Tensor Memory Layout

Tensor shape describes logical structure. Memory layout describes physical storage in memory.

A tensor may appear multidimensional while its entries are stored linearly in memory.

Consider:

```python id="7rlm0l"
X = torch.tensor([
    [1, 2, 3],
    [4, 5, 6]
])
```

Logically:

$$
X =
\begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6
\end{bmatrix}.
$$

Physically, memory stores:

```python id="i8n0hx"
[1, 2, 3, 4, 5, 6]
```

By default, PyTorch stores tensors in row-major order: the entries of each row sit next to each other in memory, and the last axis varies fastest.
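
This ordering can be observed directly: flattening the tensor returns the entries exactly as they sit in memory.

```python id="w3m8kr"
print(X.flatten())
```

Output:

```python id="p9d4su"
tensor([1, 2, 3, 4, 5, 6])
```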

Memory layout affects performance because modern hardware reads contiguous memory more efficiently.

### Strides

The stride of an axis is the number of elements that must be skipped in memory to move by one position along that axis.

Example:

```python id="z4d8gl"
X = torch.randn(3, 4)

print(X.stride())
```

Possible output:

```python id="xjdb3w"
(4, 1)
```

Interpretation:

- moving one step along axis 0 advances 4 positions in memory
- moving one step along axis 1 advances 1 position in memory

Strides allow PyTorch to create tensor views without copying memory.
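
For example, transposing the matrix above swaps its strides while leaving the underlying storage untouched, so no data is copied:

```python id="h5x2bn"
Y = X.t()

print(Y.stride())                    # (1, 4): the strides are swapped
print(X.data_ptr() == Y.data_ptr())  # True: both tensors share the same storage
```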

### Contiguous and Noncontiguous Tensors

Some operations produce noncontiguous tensors.

Example:

```python id="sy1cnm"
X = torch.randn(2, 3)

Y = X.t()

print(Y.is_contiguous())
```

Output:

```python id="rmlyw8"
False
```

The transpose returns a view with swapped strides: it changes how the data is indexed without rearranging memory.

Some PyTorch operations require contiguous tensors. In such cases:

```python id="vpk2gf"
Y = Y.contiguous()
```

This copies the data into a new contiguous block of memory. (If the tensor is already contiguous, the same tensor is returned unchanged.)
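
One place this matters is `view()`, which requires a contiguous layout and raises an error otherwise; calling `contiguous()` first (or using `reshape()`, which copies when necessary) resolves it. A minimal sketch on a fresh tensor:

```python id="f2k9lp"
Z = torch.randn(2, 3).t()  # a non-contiguous transpose

try:
    Z.view(6)  # view() cannot reinterpret this non-contiguous layout
except RuntimeError:
    print("view failed on a non-contiguous tensor")

print(Z.contiguous().view(6).shape)  # torch.Size([6])
```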

Understanding contiguity becomes important in high-performance systems and custom CUDA kernels.

### Views Versus Copies

Many tensor operations create views rather than copies.

A view shares the same underlying storage.

Example:

```python id="lqjlwm"
X = torch.arange(12)

Y = X.reshape(3, 4)

Y[0, 0] = -1

print(X)
```

Output:

```python id="nx4n52"
tensor([-1,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
```

Changing `Y` also changed `X` because both tensors share storage.

This behavior improves efficiency but can produce subtle bugs.
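
When an independent copy is needed, `clone()` allocates new storage so later modifications do not propagate back:

```python id="r7v3ad"
X = torch.arange(12)

Y = X.reshape(3, 4).clone()  # clone() copies the data into new storage

Y[0, 0] = -1

print(X[0])  # tensor(0): the original tensor is unchanged
```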

### Broadcasting and Shape Expansion

Broadcasting allows operations between tensors of compatible shapes.

Example:

```python id="u3c1e2"
X = torch.randn(32, 64)
b = torch.randn(64)

Y = X + b
```

PyTorch conceptually expands `b` from shape

$$
(64,)
$$

to

$$
(32, 64).
$$

This expansion usually occurs without allocating new memory.

Broadcasting rules compare dimensions from right to left.

Dimensions are compatible if:

- they are equal, or
- one of them is 1

Examples:

| Shape A | Shape B | Result |
|---|---|---|
| `[32, 64]` | `[64]` | `[32, 64]` |
| `[32, 10, 64]` | `[64]` | `[32, 10, 64]` |
| `[32, 10, 64]` | `[1, 64]` | `[32, 10, 64]` |
| `[32, 10, 64]` | `[32]` | Invalid |
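
These compatibility checks can be reproduced with `torch.broadcast_shapes` (available in recent PyTorch versions), which returns the broadcast shape or raises an error when the shapes are incompatible:

```python id="y4t8ce"
print(torch.broadcast_shapes((32, 64), (64,)))        # torch.Size([32, 64])
print(torch.broadcast_shapes((32, 10, 64), (1, 64)))  # torch.Size([32, 10, 64])

try:
    torch.broadcast_shapes((32, 10, 64), (32,))
except RuntimeError:
    print("shapes are not broadcastable")
```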

Broadcasting is one of the most important tensor operations in deep learning.

### Tensor Layouts in Deep Learning

Different applications use different tensor layouts.

#### Images

Computer vision commonly uses:

$$
[B, C, H, W].
$$

#### Text

Transformers often use:

$$
[B, T, D].
$$

#### Audio

Waveforms may use:

$$
[B, C, T].
$$

Spectrograms may use:

$$
[B, F, T].
$$

#### Video

Video models often use:

$$
[B, T, C, H, W].
$$

#### Graphs

Node features often use:

$$
[N, D].
$$

Edge indices may use:

$$
[2, E].
$$

Understanding these conventions is necessary for reading modern research papers and implementing architectures correctly.

### Shape Transformations in Neural Networks

Neural networks continuously transform tensor shapes.

A convolutional network might perform:

| Layer | Input shape | Output shape |
|---|---|---|
| Input | `[32, 3, 224, 224]` | `[32, 3, 224, 224]` |
| Conv2D | `[32, 3, 224, 224]` | `[32, 64, 112, 112]` |
| Pooling | `[32, 64, 112, 112]` | `[32, 64, 56, 56]` |
| Flatten | `[32, 64, 56, 56]` | `[32, 200704]` |
| Linear | `[32, 200704]` | `[32, 1000]` |
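
A minimal sketch reproducing this sequence layer by layer, where the specific kernel sizes, strides, and padding are illustrative choices that happen to produce these shapes:

```python id="d8s5mq"
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),  # 224 -> 112
    nn.MaxPool2d(kernel_size=2),                           # 112 -> 56
    nn.Flatten(),                                           # 64 * 56 * 56 = 200704
    nn.Linear(64 * 56 * 56, 1000),
)

x = torch.randn(32, 3, 224, 224)
for layer in model:
    x = layer(x)
    print(layer.__class__.__name__, tuple(x.shape))
```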

A transformer may perform:

| Layer | Input shape | Output shape |
|---|---|---|
| Token IDs | `[16, 128]` | `[16, 128]` |
| Embedding | `[16, 128]` | `[16, 128, 768]` |
| Attention | `[16, 128, 768]` | `[16, 128, 768]` |
| Output logits | `[16, 128, 768]` | `[16, 128, 50000]` |

Much of practical deep learning consists of reasoning about these transformations correctly.

### Shape Errors and Debugging

Shape mismatches are among the most common PyTorch errors.

Example:

```python id="3jv6ll"
A = torch.randn(4, 3)
B = torch.randn(5, 2)

C = A @ B
```

This produces an error because matrix multiplication requires matching inner dimensions.

Valid multiplication requires:

$$
A\in\mathbb{R}^{m\times n},
\quad
B\in\mathbb{R}^{n\times p}.
$$

Then:

$$
AB\in\mathbb{R}^{m\times p}.
$$
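
With matching inner dimensions the product is defined:

```python id="v1g6xo"
A = torch.randn(4, 3)
B = torch.randn(3, 2)

C = A @ B  # inner dimensions match: (4, 3) @ (3, 2)

print(C.shape)  # torch.Size([4, 2])
```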

A good debugging practice is to print tensor shapes at each stage:

```python id="7r4j94"
print(X.shape)
```
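
Lightweight shape assertions go a step further: they document the expected shape at each stage and fail immediately when an upstream change violates it. A minimal sketch (the function and dimensions are illustrative):

```python id="a2z7fk"
def check_token_batch(x, d_model=768):
    # Expect a [batch, sequence, embedding] tensor before the attention block.
    assert x.ndim == 3 and x.shape[-1] == d_model, f"unexpected shape {tuple(x.shape)}"
    return x

tokens = torch.randn(16, 128, 768)
check_token_batch(tokens)
```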

Experienced PyTorch programmers mentally track tensor shapes throughout the model.

### Summary

Tensor shape describes the size of each axis. Tensor dimension describes the number of axes. Tensor layout describes how entries are stored in memory.

PyTorch provides operations for reshaping, flattening, permuting, broadcasting, and slicing tensors. These operations often create views rather than copies, allowing efficient computation.

Modern deep learning systems are fundamentally tensor transformation systems. Understanding how shapes propagate through a model is therefore one of the core skills in neural network engineering.

