A tensor has a logical shape and a physical memory layout. The shape tells us how to interpret the tensor as an array. The memory layout tells us how the entries are stored in memory.
Most PyTorch code can be written without thinking about memory layout. However, layout becomes important when code becomes slow, when view() fails, when a tensor is noncontiguous, or when we write performance-critical training and inference code.
Logical Shape Versus Physical Storage
Consider a matrix:

```
[[1, 2, 3],
 [4, 5, 6]]
```

Its logical shape is:

```
[2, 3]
```

It has 2 rows and 3 columns. Internally, the values are usually stored in one linear block of memory:

```
[1, 2, 3, 4, 5, 6]
```

The tensor object stores metadata that tells PyTorch how to interpret that storage:
| Metadata | Meaning |
|---|---|
| Shape | Size of each axis |
| Stride | Step size in storage for each axis |
| Storage offset | Where the tensor begins inside storage |
| dtype | Numeric format and size of each element |
| device | CPU, CUDA, or other accelerator |
A tensor is therefore more than its values. It is a view over storage with shape and stride metadata.
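This idea can be illustrated in plain Python (a sketch of the concept, not PyTorch's actual implementation): shape, stride, and offset are enough to map a logical index to a position in flat storage.

```python
# Minimal sketch: a "tensor" as metadata over flat storage.
storage = [1, 2, 3, 4, 5, 6]
shape = (2, 3)
stride = (3, 1)   # row-major strides for shape (2, 3)
offset = 0

def element(i, j):
    # Storage position = offset + i * stride[0] + j * stride[1]
    return storage[offset + i * stride[0] + j * stride[1]]

print(element(1, 2))  # last entry of row 1 -> 6
```

Changing the metadata changes which values the "tensor" sees, without touching the storage itself.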
Row-Major Layout
PyTorch commonly uses row-major layout for contiguous tensors. In a 2D matrix, entries in the last dimension are adjacent in memory.
For a tensor with shape [2, 3]:
```python
X = torch.tensor([
    [1, 2, 3],
    [4, 5, 6],
])
```

The memory order is:

```
X[0, 0], X[0, 1], X[0, 2], X[1, 0], X[1, 1], X[1, 2]
```

The last axis changes fastest.
This convention matters because contiguous memory access is usually faster than strided memory access. CPUs and GPUs are designed to read blocks of nearby memory efficiently.
Strides
A stride tells us how many storage positions to move when increasing an index along an axis by one.
```python
import torch

X = torch.randn(2, 3)
print(X.shape)
print(X.stride())
```

Typical output:

```
torch.Size([2, 3])
(3, 1)
```

The stride (3, 1) means:
| Axis | Meaning | Stride |
|---|---|---|
| 0 | Row axis | 3 |
| 1 | Column axis | 1 |
Moving one row forward skips 3 elements. Moving one column forward skips 1 element.
For a contiguous tensor with shape [B, C, H, W], typical strides are:
```
[C * H * W, H * W, W, 1]
```

Example:

```python
X = torch.randn(32, 3, 224, 224)
print(X.stride())
```

Typical output:

```
(150528, 50176, 224, 1)
```

The width axis is contiguous because its stride is 1.
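Row-major strides can be derived from the shape alone: the stride of each axis is the product of the sizes of all axes to its right. A small helper (an illustrative sketch, not a PyTorch API) reproduces the values above:

```python
def row_major_strides(shape):
    # Each stride is the product of the sizes of all later axes.
    strides = []
    acc = 1
    for size in reversed(shape):
        strides.append(acc)
        acc *= size
    return tuple(reversed(strides))

print(row_major_strides((2, 3)))             # (3, 1)
print(row_major_strides((32, 3, 224, 224)))  # (150528, 50176, 224, 1)
```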
Storage Offset
A tensor view may begin inside another tensor’s storage.
```python
x = torch.arange(10)
y = x[3:8]
print(y)
print(y.storage_offset())
```

Output:

```
tensor([3, 4, 5, 6, 7])
3
```

The tensor y starts at storage position 3 of x. It does not need separate storage.
This is why slicing is cheap. It usually creates new metadata over the same storage.
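One way to confirm the shared storage (on CPU, by comparing raw data pointers) is:

```python
import torch

x = torch.arange(10)
y = x[3:8]

# The slice's data pointer is offset from x's by
# storage_offset * element_size bytes.
print(y.data_ptr() - x.data_ptr())            # 24 for int64 (3 * 8 bytes)
print(y.storage_offset() * y.element_size())  # 24

# Because storage is shared, writing through the view is visible in x.
y[0] = 100
print(x[3])  # tensor(100)
```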
Contiguous Tensors
A tensor is contiguous when its layout matches the standard memory order for its shape.
```python
X = torch.randn(2, 3)
print(X.is_contiguous())
```

Output:

```
True
```

Contiguous tensors are often faster for many operations because their memory access pattern is simple.
Some operations require contiguous tensors, especially operations that reinterpret storage directly, such as view().
```python
X = torch.arange(12)
Y = X.view(3, 4)
print(Y.shape)
```

Here view() works because the underlying memory layout supports the requested shape.
Noncontiguous Tensors
Transpose commonly produces a noncontiguous tensor.
```python
X = torch.randn(2, 3)
Y = X.T
print(Y.shape)
print(Y.stride())
print(Y.is_contiguous())
```

Typical output:

```
torch.Size([3, 2])
(1, 3)
False
```

The tensor Y has shape [3, 2], but its memory is still the same storage as X. PyTorch changed the strides instead of copying the data.
This is efficient, but the resulting tensor has nonstandard memory order.
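To see why no copy is needed, the stride arithmetic can be written out in plain Python (an illustration of the mechanism, not PyTorch internals): transposing swaps the shape and stride entries, and indexing with the swapped strides reads the same storage in a different order.

```python
# Storage for a 2x3 row-major matrix [[1, 2, 3], [4, 5, 6]].
storage = [1, 2, 3, 4, 5, 6]
stride_x = (3, 1)  # original shape (2, 3)
stride_y = (1, 3)  # transposed shape (3, 2): strides swapped, no copy

def at(strides, i, j):
    return storage[i * strides[0] + j * strides[1]]

# Y[i, j] == X[j, i] for all valid indices.
print(at(stride_y, 2, 1))  # Y[2, 1] -> X[1, 2] -> 6
```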
contiguous()
The method contiguous() returns a contiguous tensor. If the tensor is already contiguous, it may return the same tensor. If not, it creates a copy with standard layout.
```python
X = torch.randn(2, 3)
Y = X.T
Z = Y.contiguous()
print(Y.is_contiguous())
print(Z.is_contiguous())
```

Output:

```
False
True
```

Use contiguous() when a later operation needs standard layout.
A common pattern is:
```python
Y = X.permute(0, 2, 1).contiguous()
Y = Y.view(Y.shape[0], -1)
```

view() Versus reshape()
view() requires a compatible memory layout. It only changes metadata.
```python
X = torch.arange(12)
Y = X.view(3, 4)
```

After a transpose, view() may fail:

```python
X = torch.arange(12).view(3, 4)
Y = X.T
# Z = Y.view(12)  # may fail
```

Use reshape() when you want PyTorch to return a view if possible and a copy if necessary.

```python
Z = Y.reshape(12)
```

Practical rule:
| Operation | Behavior |
|---|---|
| view() | Requires compatible strides |
| reshape() | View if possible, copy if needed |
| contiguous() | Copy if tensor is noncontiguous |
| clone() | Always creates separate storage |
For model code, reshape() is often safer. For performance-sensitive code, understand whether a copy happens.
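A quick way to tell whether reshape() returned a view or a copy is to compare data pointers (a CPU example):

```python
import torch

X = torch.arange(12).view(3, 4)

# Contiguous case: reshape() returns a view (same data pointer).
A = X.reshape(12)
print(A.data_ptr() == X.data_ptr())  # True

# Noncontiguous case: reshape() must copy (different data pointer).
Y = X.T
B = Y.reshape(12)
print(B.data_ptr() == Y.data_ptr())  # False
```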
permute() and Axis Order
permute() changes the order of axes.
```python
X = torch.randn(32, 224, 224, 3)  # NHWC
Y = X.permute(0, 3, 1, 2)         # NCHW
print(Y.shape)
print(Y.stride())
print(Y.is_contiguous())
```

The result often has a noncontiguous layout because PyTorch changes stride metadata rather than physically moving data.
If the next layer expects contiguous NCHW layout, use:

```python
Y = Y.contiguous()
```

However, avoid unnecessary calls to contiguous(). Each copy costs memory bandwidth and time.
Channels-First and Channels-Last
Image tensors often use one of two layouts:
| Layout | Shape | Meaning |
|---|---|---|
| NCHW | [N, C, H, W] | Channels first |
| NHWC | [N, H, W, C] | Channels last |
PyTorch convolution layers traditionally use NCHW. Some hardware and kernels may perform better with channels-last memory format.
PyTorch supports channels-last memory format:
```python
X = torch.randn(32, 3, 224, 224)
X = X.to(memory_format=torch.channels_last)
print(X.shape)
print(X.stride())
print(X.is_contiguous(memory_format=torch.channels_last))
```

The logical shape remains [N, C, H, W], but the physical memory format changes.
This distinction is important. Channels-last memory format does not mean the tensor shape becomes [N, H, W, C]. It means the tensor is still indexed as [N, C, H, W], but stored in memory in a channels-last-friendly layout.
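For the [32, 3, 224, 224] example above, the expected channels-last strides can be worked out by hand: memory orders the axes as N, H, W, C, while strides are still reported in logical [N, C, H, W] order.

```python
N, C, H, W = 32, 3, 224, 224

# Channels-last stores axes in memory order N, H, W, C,
# but strides are reported in logical [N, C, H, W] order.
channels_last_strides = (H * W * C, 1, W * C, C)
print(channels_last_strides)  # (150528, 1, 672, 3)
```

The channel stride of 1 is what makes this layout "channels last": all channels of one pixel sit next to each other in memory.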
Memory Format and Convolution Performance
Convolution performance depends on device, dtype, input size, and backend kernel. On some GPUs, channels-last with mixed precision can improve throughput.
A typical pattern:
```python
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)
model = model.to(device)
model = model.to(memory_format=torch.channels_last)

X = torch.randn(32, 3, 224, 224, device=device)
X = X.to(memory_format=torch.channels_last)

Y = model(X)
```

This should be benchmarked for the actual workload. Memory format choices can help, but they are not universal speedups.
Tensor Copies
Copies occur more often than beginners expect.
Common copy-producing operations include:
| Operation | Copy behavior |
|---|---|
| clone() | Always copies |
| contiguous() | Copies if needed |
| to(device) | Copies if device changes |
| to(dtype) | Copies if dtype changes |
| Advanced indexing | Usually copies |
| reshape() | May copy |
| permute() | Usually does not copy |
| Basic slicing | Usually does not copy |
Copies can be expensive for large tensors. For example, copying a [32, 3, 224, 224] float32 tensor moves about 18.4 MiB of data. Copying activations repeatedly inside a training loop can become a bottleneck.
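The 18.4 MiB figure follows directly from the element count and the element size:

```python
# Size of a [32, 3, 224, 224] float32 tensor.
numel = 32 * 3 * 224 * 224   # 4,816,896 elements
size_bytes = numel * 4       # float32 = 4 bytes per element
print(size_bytes)            # 19267584
print(size_bytes / 2**20)    # 18.375 MiB
```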
Measuring Memory Use
For a tensor:
```python
X = torch.randn(32, 3, 224, 224)
num_bytes = X.numel() * X.element_size()
print(num_bytes)
```

For CUDA memory:

```python
if torch.cuda.is_available():
    print(torch.cuda.memory_allocated())
    print(torch.cuda.memory_reserved())
```

memory_allocated() reports memory occupied by tensors. memory_reserved() reports memory held by PyTorch’s caching allocator.
The reserved memory may be larger than allocated memory because PyTorch keeps memory blocks for reuse.
The CUDA Caching Allocator
On CUDA, PyTorch uses a caching allocator. When a tensor is freed, PyTorch may keep the memory block reserved so that future allocations are faster.
This means GPU memory shown by system tools may remain high even after tensors are deleted.
```python
del X
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```

empty_cache() releases unused cached memory back to the CUDA driver. It does not free memory still used by live tensors. It also should not be called routinely inside training loops because it can hurt performance.
Avoiding Unnecessary Allocations
Repeated allocation can slow training. Prefer reusing tensors when practical, and avoid creating large temporary tensors inside inner loops.
Less efficient:
```python
for step in range(1000):
    mask = torch.ones(1024, 1024, device=device)
    # use mask
```

Better when the mask is constant:

```python
mask = torch.ones(1024, 1024, device=device)
for step in range(1000):
    # use mask
    pass
```

For model constants, register buffers so they move with the model:
```python
class MaskedModel(torch.nn.Module):
    def __init__(self, T):
        super().__init__()
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, x):
        # self.mask is available here and moves with the model
        # across devices and dtypes.
        return x
```

In-Place Operations and Memory
In-place operations can reduce memory use:
```python
x = torch.randn(1024, 1024)
x.relu_()
```

The underscore indicates mutation.
However, in-place operations can break autograd if they overwrite values needed for backward computation.
Safer default:
```python
y = torch.relu(x)
```

Use in-place operations only when memory pressure is real and the operation is known to be safe.
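As a small illustration of how an in-place write can break backward: sigmoid saves its output for the backward pass, so mutating that output in place invalidates the saved value.

```python
import torch

a = torch.randn(4, requires_grad=True)
b = a.sigmoid()  # autograd saves b: sigmoid's backward uses b * (1 - b)
b.add_(1)        # in-place write to a tensor saved for backward

try:
    b.sum().backward()
except RuntimeError as err:
    # Autograd detects the mutation through a version counter.
    print("backward failed:", err)
```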
Memory and Autograd
During training, PyTorch stores intermediate activations needed for backpropagation. These saved tensors often consume more memory than the parameters.
Example:
```python
y = model(x)
loss = criterion(y, target)
loss.backward()
```

Before backward(), PyTorch keeps the computation graph and saved activations. After backward(), many saved tensors can be released unless the graph is retained.

Avoid this pattern unless needed:

```python
loss.backward(retain_graph=True)
```

retain_graph=True keeps the graph and increases memory use.
For inference, disable gradient tracking:
```python
with torch.no_grad():
    y = model(x)
```

Or use inference mode:

```python
with torch.inference_mode():
    y = model(x)
```

Inference mode can reduce overhead further when gradients are not needed.
Gradient Checkpointing
Gradient checkpointing trades compute for memory. Instead of storing all intermediate activations, PyTorch recomputes some of them during backward.
```python
from torch.utils.checkpoint import checkpoint

def block_forward(x):
    return block(x)

# use_reentrant=False is the recommended mode in recent PyTorch versions.
y = checkpoint(block_forward, x, use_reentrant=False)
```

This can make larger models fit in memory, at the cost of extra computation.
Gradient checkpointing is common in transformer training, diffusion models, and other deep networks with large activation memory.
Performance Principles
Practical tensor performance depends on several factors:
| Principle | Reason |
|---|---|
| Prefer large batched operations | Better hardware utilization |
| Avoid Python loops over tensor elements | Python overhead is high |
| Keep tensors on the same device | Device transfers are expensive |
| Avoid unnecessary copies | Copies consume memory bandwidth |
| Use contiguous layout when kernels need it | Strided access can be slower |
| Use mixed precision where safe | Lower memory and higher throughput |
| Benchmark actual workloads | Performance depends on hardware and shapes |
For example, prefer:
```python
Y = X @ W
```

over:

```python
rows = []
for i in range(X.shape[0]):
    rows.append(X[i] @ W)
Y = torch.stack(rows)
```

The first version uses optimized matrix multiplication kernels. The second version spends time in Python and launches many smaller operations.
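Both versions compute the same values; only the execution cost differs. A quick sanity check (small shapes chosen arbitrarily for illustration):

```python
import torch

X = torch.randn(8, 4)
W = torch.randn(4, 5)

Y_fast = X @ W
Y_slow = torch.stack([X[i] @ W for i in range(X.shape[0])])

# Same result, but the loop launches one kernel per row.
print(torch.allclose(Y_fast, Y_slow))
```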
Profiling Tensor Code
PyTorch includes profiling tools for measuring runtime and memory.
A simple timing pattern for CUDA requires synchronization:
```python
import time

torch.cuda.synchronize()
start = time.time()
Y = X @ W
torch.cuda.synchronize()
end = time.time()
print(end - start)
```

CUDA operations are asynchronous, so timing without synchronization can be misleading.
For deeper profiling, use:
```python
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    Y = X @ W

print(prof.key_averages().table(sort_by="cuda_time_total"))
```

Profiling should guide optimization. Guessing often leads to unnecessary complexity.
Summary
Tensor memory layout describes how a tensor’s logical indices map to physical storage. Shape gives the size of each axis. Stride gives the memory step along each axis. Storage offset gives the starting position inside storage.
Basic slicing, transpose, and permute often create views by changing metadata. Advanced indexing, clone, device transfer, dtype conversion, and some reshapes create copies. Contiguous tensors usually have simpler and faster access patterns, but copying to make a tensor contiguous also costs time.
Efficient PyTorch code uses large batched operations, avoids unnecessary transfers and copies, keeps dtype and device choices consistent, and relies on profiling rather than assumptions.