Deep learning systems manipulate tensors with millions or billions of numerical entries. Understanding the shape, dimensional structure, and memory organization of tensors is essential for building efficient neural networks in PyTorch.
Many deep learning errors arise from incorrect tensor shapes rather than incorrect mathematics. Likewise, many performance problems arise from inefficient memory layouts or unnecessary tensor copies. A strong understanding of tensor structure therefore affects both correctness and computational efficiency.
Tensor Shape
The shape of a tensor describes the size of each axis.
If a tensor has shape (3, 4), then it has two axes. The first axis has size 3 and the second axis has size 4.
For example:
```python
import torch

X = torch.randn(3, 4)
print(X.shape)
```

Output:

```
torch.Size([3, 4])
```

This tensor contains 3 × 4 = 12 entries.
A tensor with shape (2, 3, 4) contains 2 × 3 × 4 = 24 entries.

In general, if a tensor has shape (d₁, d₂, …, dₙ), then the total number of elements is the product d₁ × d₂ × ⋯ × dₙ.
PyTorch provides the total number of elements through numel():
```python
X = torch.randn(2, 3, 4)
print(X.numel())
```

Output:

```
24
```

Tensor Dimensions
The number of axes in a tensor is called its dimension, rank, or order.
| Tensor | Shape example | Number of dimensions |
|---|---|---|
| Scalar | [] | 0 |
| Vector | [5] | 1 |
| Matrix | [3, 4] | 2 |
| 3D tensor | [2, 3, 4] | 3 |
| 4D tensor | [32, 3, 224, 224] | 4 |
In PyTorch:
```python
X = torch.randn(32, 3, 224, 224)
print(X.ndim)
```

Output:

```
4
```

The term “4D tensor” means that the tensor has four axes, not that it represents physical four-dimensional space.
Semantic Meaning of Axes
The axes of a tensor usually carry semantic meaning.
A batch of RGB images commonly uses shape (N, C, H, W), where:

| Symbol | Meaning |
|---|---|
| N | Batch size |
| C | Number of channels |
| H | Image height |
| W | Image width |
For example:
```python
images = torch.randn(32, 3, 224, 224)
```

This tensor represents:
- 32 images
- 3 color channels
- height 224
- width 224
Similarly, transformer models often use shape (B, L, D), where:

| Symbol | Meaning |
|---|---|
| B | Batch size |
| L | Sequence length |
| D | Embedding dimension |
Example:
```python
tokens = torch.randn(16, 128, 768)
```

This may represent:
- 16 sequences
- 128 tokens per sequence
- 768-dimensional embeddings
Tensor programming requires tracking both the numerical shape and the meaning of each axis.
Reshaping Tensors
Reshaping changes the view of tensor data without changing the underlying entries.
Suppose a tensor has shape (2, 3, 4). Since it contains 24 entries, it can be reshaped into any compatible shape whose dimensions multiply to 24.
Example:
```python
X = torch.randn(2, 3, 4)
Y = X.reshape(6, 4)
print(Y.shape)
```

Output:

```
torch.Size([6, 4])
```

The entries remain the same. Only the interpretation changes.
PyTorch allows automatic dimension inference with -1:
```python
X = torch.randn(2, 3, 4)
Y = X.reshape(2, -1)
print(Y.shape)
```

Output:

```
torch.Size([2, 12])
```

PyTorch inferred the missing dimension automatically.
Flattening
Flattening converts multiple axes into one axis.
This operation is common before fully connected layers.
Example:
```python
X = torch.randn(32, 3, 224, 224)
Y = X.flatten(start_dim=1)
print(Y.shape)
```

Output:

```
torch.Size([32, 150528])
```

The batch axis is preserved while the remaining dimensions are collapsed.

Mathematically, 3 × 224 × 224 = 150528.
Adding and Removing Dimensions
PyTorch provides operations for inserting or removing singleton dimensions.
A singleton dimension has size 1.
Unsqueeze
unsqueeze() inserts a new axis.
```python
x = torch.randn(64)
print(x.shape)
x = x.unsqueeze(0)
print(x.shape)
```

Output:

```
torch.Size([64])
torch.Size([1, 64])
```

This operation is common when converting a single example into a batch.
Squeeze
squeeze() removes axes of size 1.
```python
x = torch.randn(1, 64, 1)
print(x.shape)
x = x.squeeze()
print(x.shape)
```

Output:

```
torch.Size([1, 64, 1])
torch.Size([64])
```

Permuting Axes
Sometimes tensor axes must be reordered.
PyTorch uses permute() for this purpose.
Example:
```python
X = torch.randn(32, 224, 224, 3)
Y = X.permute(0, 3, 1, 2)
print(Y.shape)
```

Output:

```
torch.Size([32, 3, 224, 224])
```

The tensor originally used NHWC layout: (N, H, W, C). After permutation it uses NCHW layout: (N, C, H, W).
PyTorch convolution layers typically expect channel-first tensors.
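As a quick check, a channel-first tensor passes straight into a convolution layer. This is a minimal sketch; the layer sizes (8 output channels, kernel size 3, padding 1) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# nn.Conv2d expects input of shape [N, C_in, H, W].
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

X = torch.randn(32, 224, 224, 3)  # NHWC: would be misinterpreted by Conv2d
Y = X.permute(0, 3, 1, 2)         # NCHW: the layout Conv2d expects
out = conv(Y)
print(out.shape)                  # torch.Size([32, 8, 224, 224])
```

With kernel size 3 and padding 1, the spatial dimensions are preserved; only the channel axis changes.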
Tensor Memory Layout
Tensor shape describes logical structure. Memory layout describes physical storage in memory.
A tensor may appear multidimensional while its entries are stored linearly in memory.
Consider:
```python
X = torch.tensor([
    [1, 2, 3],
    [4, 5, 6]
])
```

Logically, this is a 2 × 3 matrix. Physically, memory stores:

```
[1, 2, 3, 4, 5, 6]
```

Most PyTorch tensors use row-major layout, meaning rows are stored contiguously.
Memory layout affects performance because modern hardware reads contiguous memory more efficiently.
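Row-major storage can be observed directly. In this small sketch, flattening a contiguous tensor returns its entries in the order they sit in memory, row by row.

```python
import torch

X = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])

# For a contiguous (row-major) tensor, flatten() walks memory in order.
print(X.flatten())  # tensor([1, 2, 3, 4, 5, 6])
```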
Strides
A stride tells PyTorch how many memory positions must be skipped to move along each axis.
Example:
```python
X = torch.randn(3, 4)
print(X.stride())
```

Possible output:

```
(4, 1)
```

Interpretation:
- moving along axis 0 skips 4 entries
- moving along axis 1 skips 1 entry
Strides allow PyTorch to create tensor views without copying memory.
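This claim can be checked with `data_ptr()`, which reports where a tensor's storage begins: a transpose swaps the strides but reuses the same memory.

```python
import torch

X = torch.randn(3, 4)
Y = X.t()  # a view with swapped strides, no copy

print(X.stride())                    # (4, 1)
print(Y.stride())                    # (1, 4)
print(X.data_ptr() == Y.data_ptr())  # True: same underlying memory
```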
Contiguous and Noncontiguous Tensors
Some operations produce noncontiguous tensors.
Example:
```python
X = torch.randn(2, 3)
Y = X.t()
print(Y.is_contiguous())
```

Output:

```
False
```

The transpose changes the tensor's interpretation without rearranging memory.
Some PyTorch operations require contiguous tensors. In such cases:
```python
Y = Y.contiguous()
```

This creates a contiguous copy in memory.
Understanding contiguity becomes important in high-performance systems and custom CUDA kernels.
Views Versus Copies
Many tensor operations create views rather than copies.
A view shares the same underlying storage.
Example:
```python
X = torch.arange(12)
Y = X.reshape(3, 4)
Y[0, 0] = -1
print(X)
```

Output:

```
tensor([-1,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
```

Changing Y also changed X because both tensors share storage.
This behavior improves efficiency but can produce subtle bugs.
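When aliasing is unwanted, `clone()` produces an independent copy. A minimal sketch:

```python
import torch

X = torch.arange(12)
Y = X.reshape(3, 4).clone()  # clone() copies storage, breaking the alias

Y[0, 0] = -1
print(X[0])  # tensor(0): X is unaffected
```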
Broadcasting and Shape Expansion
Broadcasting allows operations between tensors of compatible shapes.
Example:
```python
X = torch.randn(32, 64)
b = torch.randn(64)
Y = X + b
```

PyTorch conceptually expands b from shape [64] to [32, 64].
This expansion usually occurs without allocating new memory.
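This can be seen with `expand()`, which creates a broadcast view: the expanded tensor has stride 0 along the new axis, so all 32 "rows" read the same memory.

```python
import torch

b = torch.randn(64)
e = b.expand(32, 64)  # broadcast view, no copy

print(e.shape)    # torch.Size([32, 64])
print(e.stride()) # (0, 1): stride 0 repeats the same row
```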
Broadcasting rules compare dimensions from right to left.
Dimensions are compatible if:
- they are equal, or
- one of them is 1
Examples:
| Shape A | Shape B | Result |
|---|---|---|
| [32, 64] | [64] | [32, 64] |
| [32, 10, 64] | [64] | [32, 10, 64] |
| [32, 10, 64] | [1, 64] | [32, 10, 64] |
| [32, 10, 64] | [32] | Invalid |
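The rows of this table can be verified with `torch.broadcast_shapes`, a sketch of which follows.

```python
import torch

print(torch.broadcast_shapes((32, 64), (64,)))        # torch.Size([32, 64])
print(torch.broadcast_shapes((32, 10, 64), (64,)))    # torch.Size([32, 10, 64])
print(torch.broadcast_shapes((32, 10, 64), (1, 64)))  # torch.Size([32, 10, 64])

# Comparing right to left, 64 vs 32 is neither equal nor 1: invalid.
try:
    torch.broadcast_shapes((32, 10, 64), (32,))
except RuntimeError as err:
    print("incompatible:", err)
```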
Broadcasting is one of the most important tensor operations in deep learning.
Tensor Layouts in Deep Learning
Different applications use different tensor layouts.
Images

Computer vision commonly uses:

[N, C, H, W]

Text

Transformers often use:

[B, L, D]

Audio

Waveforms may use:

[B, T]

Spectrograms may use:

[B, F, T]

Video

Video models often use:

[B, C, T, H, W]

Graphs

Node features often use:

[N_nodes, F]

Edge indices may use:

[2, N_edges]
Understanding these conventions is necessary for reading modern research papers and implementing architectures correctly.
Shape Transformations in Neural Networks
Neural networks continuously transform tensor shapes.
A convolutional network might perform:
| Layer | Input shape | Output shape |
|---|---|---|
| Input | [32, 3, 224, 224] | [32, 3, 224, 224] |
| Conv2D | [32, 3, 224, 224] | [32, 64, 112, 112] |
| Pooling | [32, 64, 112, 112] | [32, 64, 56, 56] |
| Flatten | [32, 64, 56, 56] | [32, 200704] |
| Linear | [32, 200704] | [32, 1000] |
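The convolutional rows of this table can be reproduced with standard layers. This is a sketch; the kernel size, stride, and padding below are illustrative assumptions chosen to yield the tabulated shapes, and the final Linear layer is noted in a comment rather than run, since its 200704 × 1000 weight matrix is large.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)  # halves H and W
pool = nn.MaxPool2d(2)                                       # halves H and W again

X = torch.randn(32, 3, 224, 224)
X = conv(X)                  # [32, 64, 112, 112]
X = pool(X)                  # [32, 64, 56, 56]
X = X.flatten(start_dim=1)   # [32, 200704]
print(X.shape)
# A final nn.Linear(200704, 1000) would map this to [32, 1000].
```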
A transformer may perform:
| Layer | Input shape | Output shape |
|---|---|---|
| Token IDs | [16, 128] | [16, 128] |
| Embedding | [16, 128] | [16, 128, 768] |
| Attention | [16, 128, 768] | [16, 128, 768] |
| Output logits | [16, 128, 768] | [16, 128, 50000] |
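These rows can also be reproduced with standard modules. This is a sketch under stated assumptions: a vocabulary of 50000, embedding dimension 768, and 12 attention heads (any divisor of 768 would do), with the output projection implemented as a plain Linear layer.

```python
import torch
import torch.nn as nn

embed = nn.Embedding(50000, 768)
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
to_logits = nn.Linear(768, 50000)

with torch.no_grad():  # shapes only; skip building the autograd graph
    ids = torch.randint(0, 50000, (16, 128))  # [16, 128] token IDs
    x = embed(ids)                            # [16, 128, 768]
    x, _ = attn(x, x, x)                      # [16, 128, 768] self-attention
    logits = to_logits(x)                     # [16, 128, 50000]

print(logits.shape)
```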
Much of practical deep learning consists of reasoning about these transformations correctly.
Shape Errors and Debugging
Shape mismatches are among the most common PyTorch errors.
Example:
```python
A = torch.randn(4, 3)
B = torch.randn(5, 2)
C = A @ B
```

This produces an error because matrix multiplication requires matching inner dimensions: A has shape (4, 3), B has shape (5, 2), and 3 ≠ 5.

Valid multiplication requires shapes of the form (n, k) and (k, m). Then the product has shape (n, m).
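A shape-compatible version can be checked directly; a minimal sketch:

```python
import torch

A = torch.randn(4, 3)
B = torch.randn(3, 2)  # inner dimensions now match: (4, 3) @ (3, 2)
C = A @ B
print(C.shape)         # torch.Size([4, 2])
```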
A good debugging practice is to print tensor shapes at each stage:
```python
print(X.shape)
```

Experienced PyTorch programmers mentally track tensor shapes throughout the model.
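Shape assertions are a lightweight complement to print statements: they fail loudly at the exact stage where an unexpected shape appears. A sketch, with a hypothetical `forward` function standing in for a model:

```python
import torch

def forward(X):
    # Document and enforce the expected shape at each stage.
    assert X.ndim == 2, f"expected a 2-D batch, got shape {tuple(X.shape)}"
    batch = X.shape[0]
    Y = X @ torch.randn(X.shape[1], 10)  # project features down to 10
    assert Y.shape == (batch, 10)
    return Y

print(forward(torch.randn(32, 64)).shape)  # torch.Size([32, 10])
```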
Summary
Tensor shape describes the size of each axis. Tensor dimension describes the number of axes. Tensor layout describes how entries are stored in memory.
PyTorch provides operations for reshaping, flattening, permuting, broadcasting, and slicing tensors. These operations often create views rather than copies, allowing efficient computation.
Modern deep learning systems are fundamentally tensor transformation systems. Understanding how shapes propagate through a model is therefore one of the core skills in neural network engineering.