# Tensor Memory Layout and Performance

A tensor has a logical shape and a physical memory layout. The shape tells us how to interpret the tensor as an array. The memory layout tells us how the entries are stored in memory.

Most PyTorch code can be written without thinking about memory layout. However, layout becomes important when code is unexpectedly slow, when `view()` fails, when a tensor is noncontiguous, or when we write performance-critical training and inference code.

### Logical Shape Versus Physical Storage

Consider a matrix:

$$
X =
\begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6
\end{bmatrix}.
$$

Its logical shape is:

```python id="a72fev"
[2, 3]
```

It has 2 rows and 3 columns. Internally, the values are usually stored in one linear block of memory:

```python id="q349z1"
[1, 2, 3, 4, 5, 6]
```

The tensor object stores metadata that tells PyTorch how to interpret that storage:

| Metadata | Meaning |
|---|---|
| Shape | Size of each axis |
| Stride | Step size in storage for each axis |
| Storage offset | Where the tensor begins inside storage |
| dtype | Bytes and numeric format of each element |
| device | CPU, CUDA, or other accelerator |

A tensor is therefore more than its values. It is a view over storage with shape and stride metadata.
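
We can inspect this metadata directly on the example matrix:

```python id="mx41qa"
import torch

X = torch.tensor([
    [1, 2, 3],
    [4, 5, 6],
])

print(X.shape)             # torch.Size([2, 3])
print(X.stride())          # (3, 1)
print(X.storage_offset())  # 0
print(X.dtype)             # torch.int64
print(X.device)            # cpu
```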

### Row-Major Layout

PyTorch uses row-major (C-order) layout for contiguous tensors: in a 2D matrix, entries within a row are adjacent in memory.

For a tensor with shape `[2, 3]`:

```python id="ydx8sq"
X = torch.tensor([
    [1, 2, 3],
    [4, 5, 6],
])
```

The memory order is:

```python id="h0ehwq"
X[0, 0], X[0, 1], X[0, 2], X[1, 0], X[1, 1], X[1, 2]
```

The last axis changes fastest.

This convention matters because contiguous memory access is usually faster than strided memory access. CPUs and GPUs are designed to read blocks of nearby memory efficiently.

### Strides

A stride tells us how many storage positions to move when increasing an index along an axis by one.

```python id="4yx4q1"
import torch

X = torch.randn(2, 3)

print(X.shape)
print(X.stride())
```

Typical output:

```python id="v9oq8z"
torch.Size([2, 3])
(3, 1)
```

The stride `(3, 1)` means:

| Axis | Meaning | Stride |
|---|---|---:|
| 0 | Row axis | 3 |
| 1 | Column axis | 1 |

Moving forward one row advances 3 positions in storage; moving forward one column advances 1 position.
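
For a contiguous tensor, this means element `X[i, j]` lives at storage position `i * stride[0] + j * stride[1]`, which we can check against the flattened storage order:

```python id="stchk9"
X = torch.randn(2, 3)

flat = X.flatten()  # storage order for a contiguous tensor
s0, s1 = X.stride()

for i in range(2):
    for j in range(3):
        assert flat[i * s0 + j * s1] == X[i, j]
```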

For a contiguous tensor with shape `[B, C, H, W]`, typical strides are:

```python id="bl0wmr"
[C * H * W, H * W, W, 1]
```

Example:

```python id="mjr2p5"
X = torch.randn(32, 3, 224, 224)

print(X.stride())
```

Typical output:

```python id="rh96ks"
(150528, 50176, 224, 1)
```

The width axis is contiguous because its stride is 1.

### Storage Offset

A tensor view may begin inside another tensor’s storage.

```python id="i4fg6y"
x = torch.arange(10)
y = x[3:8]

print(y)
print(y.storage_offset())
```

Output:

```python id="p3d07d"
tensor([3, 4, 5, 6, 7])
3
```

The tensor `y` starts at storage position 3 of `x`. It does not need separate storage.

This is why slicing is cheap. It usually creates new metadata over the same storage.
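
Because the slice is a view, writing through it also writes into the original tensor:

```python id="shr2st"
x = torch.arange(10)
y = x[3:8]

y[0] = 100  # mutates the shared storage

print(x)  # tensor([  0,   1,   2, 100,   4,   5,   6,   7,   8,   9])
```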

### Contiguous Tensors

A tensor is contiguous when its layout matches the standard memory order for its shape.

```python id="18jmg1"
X = torch.randn(2, 3)

print(X.is_contiguous())
```

Output:

```python id="qxxmxh"
True
```

Contiguous tensors are often faster to operate on because their memory access pattern is simple.

Some operations require contiguous tensors, especially operations that reinterpret storage directly, such as `view()`.

```python id="12gx68"
X = torch.arange(12)
Y = X.view(3, 4)

print(Y.shape)
```

Here `view()` works because the underlying memory layout supports the requested shape.

### Noncontiguous Tensors

Transpose commonly produces a noncontiguous tensor.

```python id="lvkmx0"
X = torch.randn(2, 3)
Y = X.T

print(Y.shape)
print(Y.stride())
print(Y.is_contiguous())
```

Typical output:

```python id="h2z3ae"
torch.Size([3, 2])
(1, 3)
False
```

The tensor `Y` has shape `[3, 2]`, but it shares the same storage as `X`. PyTorch changed the strides instead of copying the data.

This is efficient, but the resulting tensor has nonstandard memory order.

### `contiguous()`

The method `contiguous()` returns a contiguous tensor. If the tensor is already contiguous, it may return the same tensor. If not, it creates a copy with standard layout.

```python id="flroby"
X = torch.randn(2, 3)
Y = X.T

Z = Y.contiguous()

print(Y.is_contiguous())
print(Z.is_contiguous())
```

Output:

```python id="z4gbor"
False
True
```

Use `contiguous()` when a later operation needs standard layout.

A common pattern is:

```python id="1s9wif"
# e.g. X: [B, T, C] -> Y: [B, C * T]
Y = X.permute(0, 2, 1).contiguous()
Y = Y.view(Y.shape[0], -1)
```

### `view()` Versus `reshape()`

`view()` requires a compatible memory layout. It only changes metadata.

```python id="hlnmnd"
X = torch.arange(12)
Y = X.view(3, 4)
```

After a transpose, `view()` fails, because the transposed strides are not compatible with a flat view:

```python id="bgzfde"
X = torch.arange(12).view(3, 4)
Y = X.T

# Z = Y.view(12)  # raises RuntimeError
```

Use `reshape()` when you want PyTorch to return a view if possible and a copy if necessary.

```python id="r8w90r"
Z = Y.reshape(12)
```

Practical rule:

| Operation | Behavior |
|---|---|
| `view()` | Requires compatible strides |
| `reshape()` | View if possible, copy if needed |
| `contiguous()` | Copy if tensor is noncontiguous |
| `clone()` | Always creates separate storage |

For model code, `reshape()` is often safer. For performance-sensitive code, understand whether a copy happens.
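
One way to check whether a copy happened is to compare data pointers:

```python id="cpychk"
X = torch.arange(12).view(3, 4)

W = X.reshape(12)    # no copy needed: X is contiguous
Z = X.T.reshape(12)  # must copy: the transposed strides do not permit a flat view

print(W.data_ptr() == X.data_ptr())  # True
print(Z.data_ptr() == X.data_ptr())  # False
```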

### `permute()` and Axis Order

`permute()` changes the order of axes.

```python id="lgzttr"
X = torch.randn(32, 224, 224, 3)  # NHWC
Y = X.permute(0, 3, 1, 2)         # NCHW

print(Y.shape)
print(Y.stride())
print(Y.is_contiguous())
```

The result often has a noncontiguous layout because PyTorch changes stride metadata rather than physically moving data.

If the next layer expects contiguous NCHW layout, use:

```python id="ohk0sd"
Y = Y.contiguous()
```

However, avoid unnecessary calls to `contiguous()`. Each copy costs memory bandwidth and time.

### Channels-First and Channels-Last

Image tensors often use one of two layouts:

| Layout | Shape | Meaning |
|---|---|---|
| NCHW | `[N, C, H, W]` | Channels first |
| NHWC | `[N, H, W, C]` | Channels last |

PyTorch convolution layers traditionally use NCHW. Some hardware and kernels may perform better with channels-last memory format.

PyTorch supports channels-last memory format:

```python id="kpy2pa"
X = torch.randn(32, 3, 224, 224)

X = X.to(memory_format=torch.channels_last)

print(X.shape)
print(X.stride())
print(X.is_contiguous(memory_format=torch.channels_last))
```
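
Typical output:

```python id="cl0out"
torch.Size([32, 3, 224, 224])
(150528, 1, 672, 3)
True
```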

The logical shape remains `[N, C, H, W]`, but the physical memory format changes.

This distinction is important. Channels-last memory format does not mean the tensor shape becomes `[N, H, W, C]`. It means the tensor is still indexed as `[N, C, H, W]`, but stored in memory in a channels-last-friendly layout.

### Memory Format and Convolution Performance

Convolution performance depends on device, dtype, input size, and backend kernel. On some GPUs, channels-last with mixed precision can improve throughput.

A typical pattern:

```python id="4jae5k"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)
model = model.to(device)
model = model.to(memory_format=torch.channels_last)

X = torch.randn(32, 3, 224, 224, device=device)
X = X.to(memory_format=torch.channels_last)

Y = model(X)
```

This should be benchmarked for the actual workload. Memory format choices can help, but they are not universal speedups.
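
As a minimal sketch of such a benchmark, assuming a CUDA device and using wall-clock timing with explicit synchronization (the `bench_forward` helper is illustrative, not a library function):

```python id="bchmk1"
import time

def bench_forward(model, X, iters=50, warmup=5):
    with torch.no_grad():
        # Warm-up iterations let kernel selection and caching settle.
        for _ in range(warmup):
            model(X)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(X)
        torch.cuda.synchronize()
    return (time.time() - start) / iters

if torch.cuda.is_available():
    model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
    X = torch.randn(32, 3, 224, 224, device="cuda")

    t_nchw = bench_forward(model, X)

    model = model.to(memory_format=torch.channels_last)
    X_cl = X.to(memory_format=torch.channels_last)
    t_cl = bench_forward(model, X_cl)

    print(f"contiguous: {t_nchw * 1e3:.2f} ms, channels_last: {t_cl * 1e3:.2f} ms")
```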

### Tensor Copies

Copies occur more often than beginners expect.

Common copy-producing operations include:

| Operation | Copy behavior |
|---|---|
| `clone()` | Always copies |
| `contiguous()` | Copies if needed |
| `to(device)` | Copies if device changes |
| `to(dtype)` | Copies if dtype changes |
| Advanced indexing | Usually copies |
| `reshape()` | May copy |
| `permute()` | Usually does not copy |
| Basic slicing | Usually does not copy |

Copies can be expensive for large tensors. For example, copying a `[32, 3, 224, 224]` float32 tensor moves about 18.4 MiB of data. Copying activations repeatedly inside a training loop can become a bottleneck.

### Measuring Memory Use

For a tensor:

```python id="wjx33x"
X = torch.randn(32, 3, 224, 224)

num_bytes = X.numel() * X.element_size()

print(num_bytes)
```
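
Output:

```python id="nbyt3s"
19267584
```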

For CUDA memory:

```python id="1jrp42"
if torch.cuda.is_available():
    print(torch.cuda.memory_allocated())
    print(torch.cuda.memory_reserved())
```

`memory_allocated()` reports memory occupied by tensors. `memory_reserved()` reports memory held by PyTorch’s caching allocator.

The reserved memory may be larger than allocated memory because PyTorch keeps memory blocks for reuse.

### The CUDA Caching Allocator

On CUDA, PyTorch uses a caching allocator. When a tensor is freed, PyTorch may keep the memory block reserved so that future allocations are faster.

This means GPU memory shown by system tools may remain high even after tensors are deleted.

```python id="h1b78i"
del X

if torch.cuda.is_available():
    torch.cuda.empty_cache()
```

`empty_cache()` releases unused cached memory back to the CUDA driver. It does not free memory still used by live tensors. It also should not be called routinely inside training loops because it can hurt performance.

### Avoiding Unnecessary Allocations

Repeated allocation can slow training. Prefer reusing tensors when practical, and avoid creating large temporary tensors inside inner loops.

Less efficient:

```python id="9o4xca"
for step in range(1000):
    mask = torch.ones(1024, 1024, device=device)
    # use mask
```

Better when the mask is constant:

```python id="1xialm"
mask = torch.ones(1024, 1024, device=device)

for step in range(1000):
    # use mask
    pass
```

For model constants, register buffers so they move with the model:

```python id="icswpm"
class MaskedModel(torch.nn.Module):
    def __init__(self, T):
        super().__init__()
        # Causal mask: True above the diagonal.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, x):
        # self.mask is available here, e.g. scores.masked_fill(self.mask, float("-inf"))
        return x
```
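
Because `mask` is registered as a buffer rather than stored as a plain attribute, `model.to(device)` moves it together with the parameters, and it is included in `model.state_dict()`.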

### In-Place Operations and Memory

In-place operations can reduce memory use:

```python id="u4ca9u"
x = torch.randn(1024, 1024)

x.relu_()
```

The trailing underscore indicates an in-place operation: `relu_()` mutates `x` instead of allocating a new tensor.

However, in-place operations can break autograd if they overwrite values needed for backward computation.
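
For example, `exp()` saves its output for the backward pass, so mutating that output in place raises an error when `backward()` runs:

```python id="inplc1"
x = torch.randn(4, requires_grad=True)

y = x.exp()  # autograd saves y to compute the gradient of exp
y.add_(1)    # overwrites the saved tensor in place

# y.sum().backward()  # raises RuntimeError: a tensor needed for backward was modified
```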

Safer default:

```python id="4pmy72"
y = torch.relu(x)
```

Use in-place operations only when memory pressure is real and the operation is known to be safe.

### Memory and Autograd

During training, PyTorch stores intermediate activations needed for backpropagation. These saved tensors often consume more memory than the parameters.

Example:

```python id="8183ai"
y = model(x)
loss = criterion(y, target)
loss.backward()
```

Before `backward()`, PyTorch keeps the computation graph and saved activations. After `backward()`, many saved tensors can be released unless the graph is retained.
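
A rough way to observe activation memory on CUDA (a sketch; exact numbers depend on the model and the caching allocator):

```python id="actm3m"
if torch.cuda.is_available():
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024),
        torch.nn.ReLU(),
        torch.nn.Linear(1024, 1024),
    ).cuda()
    x = torch.randn(512, 1024, device="cuda")

    before = torch.cuda.memory_allocated()
    y = model(x)  # forward keeps saved activations for backward
    after = torch.cuda.memory_allocated()

    print(f"forward held about {(after - before) / 2**20:.1f} MiB")
```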

Avoid this pattern unless needed:

```python id="k6aj06"
loss.backward(retain_graph=True)
```

`retain_graph=True` keeps the graph and increases memory use.

For inference, disable gradient tracking:

```python id="0nhbf2"
with torch.no_grad():
    y = model(x)
```

Or use inference mode:

```python id="9f7o9f"
with torch.inference_mode():
    y = model(x)
```

Inference mode can reduce overhead further when gradients are not needed.

### Gradient Checkpointing

Gradient checkpointing trades compute for memory. Instead of storing all intermediate activations, PyTorch recomputes some of them during backward.

```python id="8p1zqs"
from torch.utils.checkpoint import checkpoint

# A block whose activations we would rather recompute than store.
block = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
)
x = torch.randn(8, 512, requires_grad=True)

# use_reentrant=False selects the recommended non-reentrant implementation.
y = checkpoint(block, x, use_reentrant=False)
```

This can make larger models fit in memory, at the cost of extra computation.

Gradient checkpointing is common in transformer training, diffusion models, and other deep networks with large activation memory.

### Performance Principles

Practical tensor performance depends on several factors:

| Principle | Reason |
|---|---|
| Prefer large batched operations | Better hardware utilization |
| Avoid Python loops over tensor elements | Python overhead is high |
| Keep tensors on the same device | Device transfers are expensive |
| Avoid unnecessary copies | Copies consume memory bandwidth |
| Use contiguous layout when kernels need it | Strided access can be slower |
| Use mixed precision where safe | Lower memory and higher throughput |
| Benchmark actual workloads | Performance depends on hardware and shapes |

For example, prefer:

```python id="hzbhbl"
Y = X @ W
```

over:

```python id="frz4t9"
rows = []
for i in range(X.shape[0]):
    rows.append(X[i] @ W)
Y = torch.stack(rows)
```

The first version uses optimized matrix multiplication kernels. The second version spends time in Python and launches many smaller operations.

### Profiling Tensor Code

PyTorch includes profiling tools for measuring runtime and memory.

A simple timing pattern for CUDA requires synchronization:

```python id="g0mcj2"
import time

# Assumes X and W are CUDA tensors.
torch.cuda.synchronize()  # drain pending kernels before starting the clock
start = time.time()

Y = X @ W

torch.cuda.synchronize()  # wait for the matmul to finish before stopping the clock
end = time.time()

print(end - start)
```

CUDA operations are asynchronous, so timing without synchronization can be misleading.

For deeper profiling, use:

```python id="ck90bk"
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    Y = X @ W

print(prof.key_averages().table(sort_by="cuda_time_total"))
```

Profiling should guide optimization. Guessing often leads to unnecessary complexity.

### Summary

Tensor memory layout describes how a tensor’s logical indices map to physical storage. Shape gives the size of each axis. Stride gives the memory step along each axis. Storage offset gives the starting position inside storage.

Basic slicing, transpose, and permute often create views by changing metadata. Advanced indexing, clone, device transfer, dtype conversion, and some reshapes create copies. Contiguous tensors usually have simpler and faster access patterns, but copying to make a tensor contiguous also costs time.

Efficient PyTorch code uses large batched operations, avoids unnecessary transfers and copies, keeps dtype and device choices consistent, and relies on profiling rather than assumptions.

