Tensor Memory Layout and Performance

A tensor has a logical shape and a physical memory layout. The shape tells us how to interpret the tensor as an array. The memory layout tells us how the entries are stored in memory.

Most PyTorch code can be written without thinking about memory layout. However, layout becomes important when code becomes slow, when view() fails, when a tensor is noncontiguous, or when we write performance-critical training and inference code.

Logical Shape Versus Physical Storage

Consider a matrix:

X = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}.

Its logical shape is:

[2, 3]

It has 2 rows and 3 columns. Internally, the values are usually stored in one linear block of memory:

[1, 2, 3, 4, 5, 6]

The tensor object stores metadata that tells PyTorch how to interpret that storage:

| Metadata | Meaning |
| --- | --- |
| Shape | Size of each axis |
| Stride | Step size in storage for each axis |
| Storage offset | Where the tensor begins inside storage |
| dtype | Bytes and numeric format of each element |
| device | CPU, CUDA, or other accelerator |

A tensor is therefore more than its values. It is a view over storage with shape and stride metadata.
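
Each piece of metadata can be inspected directly on a tensor; a quick check using the matrix above:

import torch

X = torch.tensor([
    [1, 2, 3],
    [4, 5, 6],
])

print(X.shape)             # torch.Size([2, 3])
print(X.stride())          # (3, 1)
print(X.storage_offset())  # 0
print(X.dtype)             # torch.int64
print(X.device)            # cpu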

Row-Major Layout

PyTorch commonly uses row-major layout for contiguous tensors. In a 2D matrix, entries in the last dimension are adjacent in memory.

For a tensor with shape [2, 3]:

X = torch.tensor([
    [1, 2, 3],
    [4, 5, 6],
])

The memory order is:

X[0, 0], X[0, 1], X[0, 2], X[1, 0], X[1, 1], X[1, 2]

The last axis changes fastest.

This convention matters because contiguous memory access is usually faster than strided memory access. CPUs and GPUs are designed to read blocks of nearby memory efficiently.
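
One way to see the storage order is to flatten a contiguous tensor, which reads the storage from front to back:

X = torch.tensor([
    [1, 2, 3],
    [4, 5, 6],
])

print(X.flatten())  # tensor([1, 2, 3, 4, 5, 6]), the row-major storage order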

Strides

A stride tells us how many storage positions to move when increasing an index along an axis by one.

import torch

X = torch.randn(2, 3)

print(X.shape)
print(X.stride())

Typical output:

torch.Size([2, 3])
(3, 1)

The stride (3, 1) means:

| Axis | Meaning | Stride |
| --- | --- | --- |
| 0 | Row axis | 3 |
| 1 | Column axis | 1 |

Moving one row forward skips 3 elements. Moving one column forward skips 1 element.
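
In other words, for a contiguous 2D tensor the element at index [i, j] sits at storage position i * stride[0] + j * stride[1]:

X = torch.randn(2, 3)
flat = X.flatten()
s0, s1 = X.stride()

i, j = 1, 2
print(X[i, j] == flat[i * s0 + j * s1])  # tensor(True)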

For a contiguous tensor with shape [B, C, H, W], typical strides are:

[C * H * W, H * W, W, 1]

Example:

X = torch.randn(32, 3, 224, 224)

print(X.stride())

Typical output:

(150528, 50176, 224, 1)

The width axis is contiguous because its stride is 1.

Storage Offset

A tensor view may begin inside another tensor’s storage.

x = torch.arange(10)
y = x[3:8]

print(y)
print(y.storage_offset())

Output:

tensor([3, 4, 5, 6, 7])
3

The tensor y starts at storage position 3 of x. It does not need separate storage.

This is why slicing is cheap. It usually creates new metadata over the same storage.
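
Because the storage is shared, writing through the slice is visible in the original tensor:

x = torch.arange(10)
y = x[3:8]

y[0] = 100
print(x[3])  # tensor(100): the slice wrote into x's storage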

Contiguous Tensors

A tensor is contiguous when its layout matches the standard memory order for its shape.

X = torch.randn(2, 3)

print(X.is_contiguous())

Output:

True

Many operations are faster on contiguous tensors because the memory access pattern is simple.

Some operations require contiguous tensors, especially operations that reinterpret storage directly, such as view().

X = torch.arange(12)
Y = X.view(3, 4)

print(Y.shape)

Here view() works because the underlying memory layout supports the requested shape.

Noncontiguous Tensors

Transpose commonly produces a noncontiguous tensor.

X = torch.randn(2, 3)
Y = X.T

print(Y.shape)
print(Y.stride())
print(Y.is_contiguous())

Typical output:

torch.Size([3, 2])
(1, 3)
False

The tensor Y has shape [3, 2], but its memory is still the same storage as X. PyTorch changed the strides instead of copying the data.

This is efficient, but the resulting tensor has nonstandard memory order.
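
One way to confirm that no data moved is to compare the data pointers of the two tensors:

X = torch.randn(2, 3)
Y = X.T

print(X.data_ptr() == Y.data_ptr())  # True: the transpose is a view over the same memory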

contiguous()

The method contiguous() returns a contiguous tensor. If the tensor is already contiguous, it may return the same tensor. If not, it creates a copy with standard layout.

X = torch.randn(2, 3)
Y = X.T

Z = Y.contiguous()

print(Y.is_contiguous())
print(Z.is_contiguous())

Output:

False
True

Use contiguous() when a later operation needs standard layout.

A common pattern is:

Y = X.permute(0, 2, 1).contiguous()
Y = Y.view(Y.shape[0], -1)
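
With concrete shapes (the [8, 4, 16] input here is only an illustration), the pattern looks like this:

X = torch.randn(8, 4, 16)

Y = X.permute(0, 2, 1).contiguous()  # shape [8, 16, 4], standard layout
Y = Y.view(Y.shape[0], -1)           # shape [8, 64]

print(Y.shape)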

view() Versus reshape()

view() requires a compatible memory layout. It only changes metadata.

X = torch.arange(12)
Y = X.view(3, 4)

After a transpose, view() can fail because the strides no longer describe a standard contiguous layout:

X = torch.arange(12).view(3, 4)
Y = X.T

# Z = Y.view(12)  # raises a RuntimeError: Y's strides are not compatible with this shape

Use reshape() when you want PyTorch to return a view if possible and a copy if necessary.

Z = Y.reshape(12)

Practical rule:

| Operation | Behavior |
| --- | --- |
| view() | Requires compatible strides |
| reshape() | View if possible, copy if needed |
| contiguous() | Copy if tensor is noncontiguous |
| clone() | Always creates separate storage |

For model code, reshape() is often safer. For performance-sensitive code, understand whether a copy happens.
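
One way to tell whether reshape() copied is to compare data pointers:

X = torch.arange(12).view(3, 4)
Y = X.T

Z1 = X.reshape(12)  # X is contiguous, so this returns a view
Z2 = Y.reshape(12)  # Y is not, so this returns a copy

print(Z1.data_ptr() == X.data_ptr())  # True
print(Z2.data_ptr() == Y.data_ptr())  # False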

permute() and Axis Order

permute() changes the order of axes.

X = torch.randn(32, 224, 224, 3)  # NHWC
Y = X.permute(0, 3, 1, 2)         # NCHW

print(Y.shape)
print(Y.stride())
print(Y.is_contiguous())

The result often has a noncontiguous layout because PyTorch changes stride metadata rather than physically moving data.

If the next layer expects contiguous NCHW layout, use:

Y = Y.contiguous()

However, avoid unnecessary calls to contiguous(). Each copy costs memory bandwidth and time.

Channels-First and Channels-Last

Image tensors often use one of two layouts:

| Layout | Shape | Meaning |
| --- | --- | --- |
| NCHW | [N, C, H, W] | Channels first |
| NHWC | [N, H, W, C] | Channels last |

PyTorch convolution layers traditionally use NCHW. Some hardware and kernels may perform better with channels-last memory format.

PyTorch supports channels-last memory format:

X = torch.randn(32, 3, 224, 224)

X = X.to(memory_format=torch.channels_last)

print(X.shape)
print(X.stride())
print(X.is_contiguous(memory_format=torch.channels_last))

The logical shape remains [N, C, H, W], but the physical memory format changes.

This distinction is important. Channels-last memory format does not mean the tensor shape becomes [N, H, W, C]. It means the tensor is still indexed as [N, C, H, W], but stored in memory in a channels-last-friendly layout.
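
The strides make this visible. For a small [2, 3, 4, 5] example:

X = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)

print(X.shape)     # torch.Size([2, 3, 4, 5]): still indexed as [N, C, H, W]
print(X.stride())  # (60, 1, 15, 3): the channel axis now has stride 1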

Memory Format and Convolution Performance

Convolution performance depends on device, dtype, input size, and backend kernel. On some GPUs, channels-last with mixed precision can improve throughput.

A typical pattern:

device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)
model = model.to(device)
model = model.to(memory_format=torch.channels_last)

X = torch.randn(32, 3, 224, 224, device=device)
X = X.to(memory_format=torch.channels_last)

Y = model(X)

This should be benchmarked for the actual workload. Memory format choices can help, but they are not universal speedups.
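
A minimal benchmarking sketch, assuming the CUDA model and input from the block above (the bench helper is only an illustration):

import time

def bench(fn, iters=50):
    for _ in range(5):  # warm-up: exclude one-time costs such as kernel selection
        fn()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / iters

print(bench(lambda: model(X)))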

Tensor Copies

Copies occur more often than beginners expect.

Common copy-producing operations include:

| Operation | Copy behavior |
| --- | --- |
| clone() | Always copies |
| contiguous() | Copies if needed |
| to(device) | Copies if the device changes |
| to(dtype) | Copies if the dtype changes |
| Advanced indexing | Usually copies |
| reshape() | May copy |
| permute() | Usually does not copy |
| Basic slicing | Usually does not copy |

Copies can be expensive for large tensors. For example, copying a [32, 3, 224, 224] float32 tensor moves about 18.4 MiB of data. Copying activations repeatedly inside a training loop can become a bottleneck.

Measuring Memory Use

For a tensor:

X = torch.randn(32, 3, 224, 224)

num_bytes = X.numel() * X.element_size()

print(num_bytes)

For CUDA memory:

if torch.cuda.is_available():
    print(torch.cuda.memory_allocated())
    print(torch.cuda.memory_reserved())

memory_allocated() reports memory occupied by tensors. memory_reserved() reports memory held by PyTorch’s caching allocator.

The reserved memory may be larger than allocated memory because PyTorch keeps memory blocks for reuse.

The CUDA Caching Allocator

On CUDA, PyTorch uses a caching allocator. When a tensor is freed, PyTorch may keep the memory block reserved so that future allocations are faster.

This means GPU memory shown by system tools may remain high even after tensors are deleted.

del X

if torch.cuda.is_available():
    torch.cuda.empty_cache()

empty_cache() releases unused cached memory back to the CUDA driver. It does not free memory still used by live tensors. It also should not be called routinely inside training loops because it can hurt performance.

Avoiding Unnecessary Allocations

Repeated allocation can slow training. Prefer reusing tensors when practical, and avoid creating large temporary tensors inside inner loops.

Less efficient:

for step in range(1000):
    mask = torch.ones(1024, 1024, device=device)
    # use mask

Better when the mask is constant:

mask = torch.ones(1024, 1024, device=device)

for step in range(1000):
    # use mask
    pass
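
Many operations also accept an out= argument, which writes the result into a preallocated tensor instead of allocating a new one each iteration. A small sketch:

A = torch.randn(1024, 1024, device=device)
B = torch.randn(1024, 1024, device=device)
C = torch.empty(1024, 1024, device=device)  # preallocated output buffer

for step in range(1000):
    torch.matmul(A, B, out=C)  # result written into C, no new allocation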

For model constants, register buffers so they move with the model:

class MaskedModel(torch.nn.Module):
    def __init__(self, T):
        super().__init__()
        # Build the causal mask once; register_buffer makes it move with .to(device)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, x):
        # self.mask is available here without reallocating it each step
        return x

In-Place Operations and Memory

In-place operations can reduce memory use:

x = torch.randn(1024, 1024)

x.relu_()

The trailing underscore indicates an in-place operation that mutates the tensor.

However, in-place operations can break autograd if they overwrite values needed for backward computation.
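
For example, sigmoid saves its output for the backward pass, so mutating that output in place breaks the gradient computation:

x = torch.randn(3, requires_grad=True)
y = torch.sigmoid(x)  # autograd saves y for backward

y.mul_(2)             # in-place change to a saved tensor
# y.sum().backward()  # raises a RuntimeError about an in-place modification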

Safer default:

y = torch.relu(x)

Use in-place operations only when memory pressure is real and the operation is known to be safe.

Memory and Autograd

During training, PyTorch stores intermediate activations needed for backpropagation. These saved tensors often consume more memory than the parameters.

Example:

y = model(x)
loss = criterion(y, target)
loss.backward()

Before backward(), PyTorch keeps the computation graph and saved activations. After backward(), many saved tensors can be released unless the graph is retained.

Avoid this pattern unless needed:

loss.backward(retain_graph=True)

retain_graph=True keeps the graph and increases memory use.

For inference, disable gradient tracking:

with torch.no_grad():
    y = model(x)

Or use inference mode:

with torch.inference_mode():
    y = model(x)

Inference mode can reduce overhead further when gradients are not needed.

Gradient Checkpointing

Gradient checkpointing trades compute for memory. Instead of storing all intermediate activations, PyTorch recomputes some of them during backward.

from torch.utils.checkpoint import checkpoint

# block can be any nn.Module (or callable) whose activations are expensive to store
def block_forward(x):
    return block(x)

# use_reentrant=False is the recommended mode in recent PyTorch versions
y = checkpoint(block_forward, x, use_reentrant=False)

This can make larger models fit in memory, at the cost of extra computation.

Gradient checkpointing is common in transformer training, diffusion models, and other deep networks with large activation memory.

Performance Principles

Practical tensor performance depends on several factors:

| Principle | Reason |
| --- | --- |
| Prefer large batched operations | Better hardware utilization |
| Avoid Python loops over tensor elements | Python overhead is high |
| Keep tensors on the same device | Device transfers are expensive |
| Avoid unnecessary copies | Copies consume memory bandwidth |
| Use contiguous layout when kernels need it | Strided access can be slower |
| Use mixed precision where safe | Lower memory and higher throughput |
| Benchmark actual workloads | Performance depends on hardware and shapes |

For example, prefer:

Y = X @ W

over:

rows = []
for i in range(X.shape[0]):
    rows.append(X[i] @ W)
Y = torch.stack(rows)

The first version uses optimized matrix multiplication kernels. The second version spends time in Python and launches many smaller operations.
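
The mixed-precision row in the table above usually means autocast. A minimal sketch, assuming a CUDA device and a model and input defined as in earlier examples:

with torch.autocast(device_type="cuda", dtype=torch.float16):
    Y = model(X)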

Profiling Tensor Code

PyTorch includes profiling tools for measuring runtime and memory.

A simple timing pattern for CUDA requires synchronization:

import time

torch.cuda.synchronize()
start = time.time()

Y = X @ W

torch.cuda.synchronize()
end = time.time()

print(end - start)

CUDA operations are asynchronous, so timing without synchronization can be misleading.
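
An alternative is CUDA events, which measure elapsed GPU time directly in milliseconds:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
Y = X @ W
end.record()

torch.cuda.synchronize()
print(start.elapsed_time(end))  # milliseconds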

For deeper profiling, use:

from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    Y = X @ W

print(prof.key_averages().table(sort_by="cuda_time_total"))

Profiling should guide optimization. Guessing often leads to unnecessary complexity.

Summary

Tensor memory layout describes how a tensor’s logical indices map to physical storage. Shape gives the size of each axis. Stride gives the memory step along each axis. Storage offset gives the starting position inside storage.

Basic slicing, transpose, and permute often create views by changing metadata. Advanced indexing, clone, device transfer, dtype conversion, and some reshapes create copies. Contiguous tensors usually have simpler and faster access patterns, but copying to make a tensor contiguous also costs time.

Efficient PyTorch code uses large batched operations, avoids unnecessary transfers and copies, keeps dtype and device choices consistent, and relies on profiling rather than assumptions.