PyTorch tensors live on devices. A device is the hardware location where tensor storage exists and where tensor operations execute. The most common devices are CPU and CUDA GPU, but modern PyTorch can also target other accelerators depending on the installation and hardware.
Device placement is part of tensor correctness. A tensor with the right shape and dtype can still fail if it lives on the wrong device. Training performance also depends heavily on moving data to the right device at the right time.
CPU Tensors
By default, PyTorch creates tensors on the CPU.
```python
import torch

x = torch.randn(3, 4)
print(x.device)
```

Output:

```
cpu
```

CPU tensors are stored in system memory. CPU computation is useful for data preprocessing, small models, debugging, and operations that have no accelerator implementation.
Most PyTorch programs begin with CPU tensors because datasets are usually loaded from disk into CPU memory first. These tensors are then moved to the GPU before model computation.
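A typical flow, sketched with NumPy as the on-disk format (the filename data.npy is hypothetical):

```python
import numpy as np
import torch

arr = np.load("data.npy")   # loaded from disk into CPU memory
x = torch.from_numpy(arr)   # CPU tensor sharing the array's memory
if torch.cuda.is_available():
    x = x.to("cuda")        # moved to the GPU before model computation
```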
CUDA Tensors
A CUDA tensor lives on an NVIDIA GPU.
x = torch.randn(3, 4, device="cuda")
print(x.device)Output:
cuda:0The string "cuda" usually means the default CUDA device, commonly cuda:0.
A specific GPU can be selected:
x = torch.randn(3, 4, device="cuda:1")This creates the tensor on the second CUDA GPU.
A safe pattern is:
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(32, 128, device=device)This lets the same code run on machines with or without a CUDA GPU.
Checking Available Devices
Check whether CUDA is available:
```python
print(torch.cuda.is_available())
```

Check the number of visible CUDA devices:

```python
print(torch.cuda.device_count())
```

Print device names:

```python
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
```

The availability of CUDA depends on hardware, drivers, the CUDA runtime, and the PyTorch build.
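The installed build can also be inspected directly; a quick check (torch.version.cuda is None in CPU-only builds):

```python
print(torch.__version__)   # PyTorch version string
print(torch.version.cuda)  # CUDA version the build targets, or None
```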
Moving Tensors Between Devices
Use .to() to move a tensor.
```python
x = torch.randn(32, 128)
x = x.to("cuda")
```

Move it back to CPU:

```python
x_cpu = x.to("cpu")
```

Convenience methods also exist:

```python
x_cuda = x.cuda()
x_cpu = x_cuda.cpu()
```

The .to() form is usually preferable because it also works for dtype conversion and supports device-agnostic code.

```python
x = x.to(device=device, dtype=torch.float32)
```

Device transfer creates a new tensor on the target device when the device changes. The original tensor remains on its original device unless overwritten by assignment.
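A short sketch of this behavior, assuming a CUDA device is available:

```python
x = torch.randn(3)           # CPU tensor
y = x.to("cuda")             # new tensor on the GPU
print(x.device, y.device)    # cpu cuda:0; the original is unchanged
```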
Device Mismatch Errors
Most operations require all input tensors to be on the same device.
x = torch.randn(3, device="cuda")
y = torch.randn(3, device="cpu")
# z = x + y # errorThe fix is to move one tensor:
y = y.to(x.device)
z = x + yDevice mismatch errors often happen with labels, masks, positional encodings, or tensors created inside the forward pass.
Incorrect pattern:

```python
def forward(self, x):
    mask = torch.ones(x.shape[0], x.shape[1])
    return x * mask
```

If x is on GPU, mask is still on CPU.

Correct pattern:

```python
def forward(self, x):
    mask = torch.ones(x.shape[0], x.shape[1], device=x.device, dtype=x.dtype)
    return x * mask
```

Moving Models to Devices
A model’s parameters must live on the same device as the input tensors.
```python
import torch.nn as nn

model = nn.Linear(128, 10)
model = model.to(device)
```

Then inputs should be moved to the same device:

```python
x = torch.randn(32, 128).to(device)
logits = model(x)
```

A useful check:

```python
print(next(model.parameters()).device)
```

Calling model.to(device) moves parameters and registered buffers. Ordinary tensor attributes that are not registered as parameters or buffers are not moved automatically.
Registered Buffers and Device Movement
Some tensors belong to a model but are not trainable parameters. Examples include masks, running means, running variances, fixed positional encodings, and lookup constants.
Use register_buffer() for these tensors.
```python
class CausalMask(nn.Module):
    def __init__(self, max_length):
        super().__init__()
        mask = torch.triu(
            torch.ones(max_length, max_length, dtype=torch.bool),
            diagonal=1,
        )
        self.register_buffer("mask", mask)

    def forward(self, x):
        T = x.shape[1]
        return self.mask[:T, :T]
```

Now:

```python
model = CausalMask(1024).to(device)
```

The buffer moves with the model.
Without register_buffer, the tensor may remain on CPU and later cause a device mismatch.
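For contrast, a sketch of the failure mode (the class name is hypothetical):

```python
class BrokenCausalMask(nn.Module):
    def __init__(self, max_length):
        super().__init__()
        # Plain attribute: .to(device) will not move this tensor.
        self.mask = torch.triu(
            torch.ones(max_length, max_length, dtype=torch.bool),
            diagonal=1,
        )

model = BrokenCausalMask(1024).to("cuda")
print(model.mask.device)  # cpu, even though the module was moved
```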
DataLoader and Device Transfer
A DataLoader usually returns CPU tensors.
```python
for x, y in loader:
    x = x.to(device)
    y = y.to(device)
    logits = model(x)
```

This pattern is standard. The dataset and loader remain CPU-side, while each minibatch is moved to the accelerator before computation.
For CUDA training, pin_memory=True can improve transfer performance:
```python
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,
)
```

Then use non-blocking copies:

```python
for x, y in loader:
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    logits = model(x)
```

This can help overlap CPU-to-GPU transfer with computation in some pipelines.
Avoiding Repeated Device Transfers
Device transfers are expensive relative to many tensor operations. Avoid moving tensors back and forth inside inner loops.
Poor pattern:

```python
for x, y in loader:
    x = x.to(device)
    output = model(x)
    output_cpu = output.cpu()
    output = output_cpu.to(device)
```

Better pattern:

```python
for x, y in loader:
    x = x.to(device)
    output = model(x)
```

Move tensors to CPU only when needed for logging, NumPy conversion, serialization, or interaction with CPU-only libraries.
For scalar logging:

```python
loss_value = loss.item()
```

The .item() call transfers a scalar value to Python. Calling it too often can synchronize GPU execution and slow training. Use it for logging, not inside performance-critical arithmetic.
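A sketch of the logging pattern, with a hypothetical compute_loss standing in for a real training step:

```python
for step in range(1000):
    loss = compute_loss()  # hypothetical: returns a 0-dim tensor on the GPU
    if step % 100 == 0:
        # .item() may synchronize, so call it only at logging intervals.
        print(f"step {step}: loss {loss.item():.4f}")
```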
NumPy Conversion
A CUDA tensor cannot be converted directly to a NumPy array. NumPy uses CPU memory.
x = torch.randn(3, device="cuda")
# arr = x.numpy() # errorMove to CPU first:
arr = x.detach().cpu().numpy()The detach() call removes the tensor from the autograd graph. This is usually appropriate for logging, plotting, or metric computation outside PyTorch.
For CPU tensors:

```python
x = torch.randn(3)
arr = x.numpy()
```

The NumPy array shares memory with the tensor, so mutating one affects the other.
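A quick sketch demonstrating the shared storage:

```python
x = torch.zeros(3)
arr = x.numpy()
arr[0] = 5.0
print(x)  # tensor([5., 0., 0.]); the tensor sees the change
```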
CUDA Asynchrony
CUDA operations are often asynchronous. When you launch an operation, Python may continue before the GPU has finished.
```python
y = x @ w
```

This line may return control to Python before the matrix multiplication has completed on the GPU.
This matters for timing. Incorrect timing:

```python
import time

start = time.time()
y = x @ w
end = time.time()
print(end - start)
```

Correct CUDA timing requires synchronization:

```python
torch.cuda.synchronize()
start = time.time()
y = x @ w
torch.cuda.synchronize()
end = time.time()
print(end - start)
```

Without synchronization, timing may measure launch overhead rather than actual computation.
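CUDA events are an alternative that records timestamps on the GPU itself; a sketch, assuming x and w are CUDA tensors:

```python
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = x @ w
end.record()

torch.cuda.synchronize()        # wait until both events have been recorded
print(start.elapsed_time(end))  # elapsed time in milliseconds
```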
GPU Memory Allocation
CUDA tensors use GPU memory.
```python
if torch.cuda.is_available():
    print(torch.cuda.memory_allocated())
    print(torch.cuda.memory_reserved())
```

memory_allocated() reports memory used by live tensors. memory_reserved() reports memory held by PyTorch’s caching allocator.
PyTorch may reserve more memory than currently allocated to tensors. This improves future allocation speed but can make GPU memory appear occupied in system tools.
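A sketch that makes the distinction visible (exact numbers depend on the allocator's state):

```python
if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    print(torch.cuda.memory_allocated())  # includes the live tensor x
    del x
    print(torch.cuda.memory_allocated())  # drops once x is freed
    print(torch.cuda.memory_reserved())   # often still nonzero: blocks stay cached
```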
Clearing CUDA Cache
Unused cached memory can be released with:
```python
torch.cuda.empty_cache()
```

This does not free memory used by live tensors. It only releases unused cached blocks back to the CUDA driver.
Do not call empty_cache() inside normal training loops. It can slow training by forcing PyTorch to request memory from the driver repeatedly.
Use it mainly during debugging, notebook experimentation, or after deleting large tensors.
```python
del large_tensor
torch.cuda.empty_cache()
```

Multiple GPUs
A tensor belongs to one device at a time.
x0 = torch.randn(3, device="cuda:0")
x1 = torch.randn(3, device="cuda:1")Operations across GPUs usually require explicit movement or distributed communication.
x1_on_0 = x1.to("cuda:0")
z = x0 + x1_on_0For serious multi-GPU training, use PyTorch distributed tools such as DistributedDataParallel, which will be discussed later in the book.
Avoid manually scattering tensors across GPUs unless the communication pattern is clear.
Apple Silicon and Other Accelerators
On Apple Silicon, PyTorch may support the Metal Performance Shaders backend through the "mps" device.
device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(32, 128, device=device)Some operations may have different performance characteristics or incomplete support compared with CUDA. Device-agnostic code should use .to(device) and create helper tensors from existing tensor properties.
A general helper:
if torch.cuda.is_available():
device = "cuda"
elif torch.backends.mps.is_available():
device = "mps"
else:
device = "cpu"Device-Agnostic Training Step
A training step should not hardcode CPU or GPU inside the model logic.
```python
def train_step(model, batch, optimizer, device):
    model.train()
    x, y = batch
    x = x.to(device)
    y = y.to(device)
    logits = model(x)
    loss = torch.nn.functional.cross_entropy(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```

This function works as long as the model and data are moved consistently.
Example setup:

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
```
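A usage sketch, assuming a DataLoader named loader that yields (x, y) batches:

```python
for epoch in range(3):
    for batch in loader:
        loss = train_step(model, batch, optimizer, device)
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```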
Creating Helper Tensors Correctly

Inside model code, helper tensors should usually inherit device and dtype from input tensors.
```python
def normalize_with_epsilon(x):
    eps = torch.tensor(1e-5, device=x.device, dtype=x.dtype)
    return x / (x.norm(dim=-1, keepdim=True) + eps)
```

Often a Python scalar is enough:

```python
def normalize_with_epsilon(x):
    return x / (x.norm(dim=-1, keepdim=True) + 1e-5)
```

For tensors with shape, use x.new_* methods or *_like functions:

```python
zeros = x.new_zeros(x.shape[0], 1)
noise = torch.randn_like(x)
mask = torch.ones_like(x, dtype=torch.bool)
```

These patterns reduce device and dtype bugs.
Debugging Device Problems
Print shape, dtype, and device together:
print(x.shape, x.dtype, x.device)Inspect model parameter devices:
for name, param in model.named_parameters():
print(name, param.device)Inspect buffers:
for name, buffer in model.named_buffers():
print(name, buffer.device)When a device mismatch error occurs, check these objects first:
| Object | Common problem |
|---|---|
| Input batch | Still on CPU |
| Labels | Still on CPU |
| Mask | Created inside forward on CPU |
| Positional encoding | Stored as unregistered tensor |
| Loss weight tensor | Created on CPU |
| Cached hidden state | From previous device |
Most device errors come from tensors created after the model was moved.
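A small helper that audits every parameter and buffer at once can catch these early; a hypothetical utility, not a PyTorch built-in:

```python
def check_devices(model, expected):
    # Report any parameter or buffer not on the expected device.
    # Pass an explicit index such as "cuda:0", since torch.device("cuda")
    # compares unequal to torch.device("cuda:0").
    expected = torch.device(expected)
    tensors = list(model.named_parameters()) + list(model.named_buffers())
    for name, t in tensors:
        if t.device != expected:
            print(f"{name} is on {t.device}, expected {expected}")
```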
CPU Versus GPU Performance
A GPU is usually faster for large dense tensor operations. A CPU may be faster or simpler for small operations, preprocessing, control-heavy code, or operations with limited GPU support.
GPU acceleration works best when:
| Condition | Reason |
|---|---|
| Operations are large and batched | Better parallel utilization |
| Data stays on GPU | Avoids transfer overhead |
| Computation uses optimized kernels | Matrix and convolution kernels are highly tuned |
| Python loops are minimized | Kernel launch overhead is reduced |
| Memory access is efficient | Bandwidth and cache behavior improve |
A small operation may run slower on GPU because transfer and launch overhead dominate. Performance depends on shape, dtype, hardware, and operation type.
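A rough benchmark sketch that illustrates the crossover; results vary widely by hardware:

```python
import time

import torch

def time_matmul(n, device, iters=10):
    # Average the wall-clock time of a square matmul on the given device.
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # respect CUDA asynchrony when timing
    start = time.time()
    for _ in range(iters):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / iters

for n in (64, 4096):
    print(n, "cpu", time_matmul(n, "cpu"))
    if torch.cuda.is_available():
        print(n, "cuda", time_matmul(n, "cuda"))
```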
Summary
CPU tensors live in system memory and run on the processor. CUDA tensors live in GPU memory and run on NVIDIA GPUs. Other accelerators, such as MPS, may also be available depending on hardware and PyTorch support.
Correct PyTorch programs keep inputs, labels, parameters, buffers, and helper tensors on compatible devices. Efficient programs avoid unnecessary CPU-GPU transfers, use pinned memory where appropriate, and respect CUDA asynchrony when timing code.
Device placement should be handled deliberately at the boundaries of the training loop and inside tensor-creation code. A reliable pattern is to move the model once, move each batch once, and create all helper tensors from the device and dtype of existing tensors.