PyTorch tensors live on devices. A device is the hardware location where tensor storage exists and where tensor operations execute. The most common devices are CPU and CUDA GPU, but modern PyTorch can also target other accelerators depending on the installation and hardware.
Device placement is part of tensor correctness. A tensor with the right shape and dtype can still fail if it lives on the wrong device. Training performance also depends heavily on moving data to the right device at the right time.
CPU Tensors
By default, PyTorch creates tensors on the CPU.
```python
import torch

x = torch.randn(3, 4)
print(x.device)
```

Output:

```
cpu
```

CPU tensors are stored in system memory. CPU computation is useful for data preprocessing, small models, debugging, and operations that have no accelerator implementation.
Most PyTorch programs begin with CPU tensors because datasets are usually loaded from disk into CPU memory first. These tensors are then moved to the GPU before model computation.
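A typical flow, sketched with NumPy as the on-disk format (the filename data.npy is hypothetical):

```python
import numpy as np
import torch

arr = np.load("data.npy")   # loaded from disk into CPU memory
x = torch.from_numpy(arr)   # CPU tensor sharing the array's memory
if torch.cuda.is_available():
    x = x.to("cuda")        # moved to the GPU before model computation
```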
CUDA Tensors
A CUDA tensor lives on an NVIDIA GPU.
x = torch.randn(3, 4, device="cuda")
print(x.device)Output:
cuda:0The string "cuda" usually means the default CUDA device, commonly cuda:0.
A specific GPU can be selected:
x = torch.randn(3, 4, device="cuda:1")This creates the tensor on the second CUDA GPU.
A safe pattern is:
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(32, 128, device=device)This lets the same code run on machines with or without a CUDA GPU.
Checking Available Devices
Check whether CUDA is available:
```python
print(torch.cuda.is_available())
```

Check the number of visible CUDA devices:

```python
print(torch.cuda.device_count())
```

Print device names:

```python
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
```

The availability of CUDA depends on hardware, drivers, the CUDA runtime, and the PyTorch build.
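The installed build can also be inspected directly; a quick check (torch.version.cuda is None in CPU-only builds):

```python
print(torch.__version__)   # PyTorch version string
print(torch.version.cuda)  # CUDA version the build targets, or None
```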
Moving Tensors Between Devices
Use .to() to move a tensor.
```python
x = torch.randn(32, 128)
x = x.to("cuda")
```

Move it back to CPU:

```python
x_cpu = x.to("cpu")
```

Convenience methods also exist:

```python
x_cuda = x.cuda()
x_cpu = x_cuda.cpu()
```

The .to() form is usually preferable because it also works for dtype conversion and supports device-agnostic code.

```python
x = x.to(device=device, dtype=torch.float32)
```

Device transfer creates a new tensor on the target device when the device changes. The original tensor remains on its original device unless overwritten by assignment.
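A short sketch of this behavior, assuming a CUDA device is available:

```python
x = torch.randn(3)           # CPU tensor
y = x.to("cuda")             # new tensor on the GPU
print(x.device, y.device)    # cpu cuda:0; the original is unchanged
```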
Device Mismatch Errors
Most operations require all input tensors to be on the same device.
x = torch.randn(3, device="cuda")
y = torch.randn(3, device="cpu")
# z = x + y # errorThe fix is to move one tensor:
y = y.to(x.device)
z = x + yDevice mismatch errors often happen with labels, masks, positional encodings, or tensors created inside the forward pass.
Incorrect pattern:

```python
def forward(self, x):
    mask = torch.ones(x.shape[0], x.shape[1])
    return x * mask
```

If x is on GPU, mask is still on CPU.

Correct pattern:

```python
def forward(self, x):
    mask = torch.ones(x.shape[0], x.shape[1], device=x.device, dtype=x.dtype)
    return x * mask
```

Moving Models to Devices
A model’s parameters must live on the same device as the input tensors.
```python
import torch.nn as nn

model = nn.Linear(128, 10)
model = model.to(device)
```

Then inputs should be moved to the same device:

```python
x = torch.randn(32, 128).to(device)
logits = model(x)
```

A useful check:

```python
print(next(model.parameters()).device)
```

Calling model.to(device) moves parameters and registered buffers. Ordinary tensor attributes that are not registered as parameters or buffers are not moved automatically.
Registered Buffers and Device Movement
Some tensors belong to a model but are not trainable parameters. Examples include masks, running means, running variances, fixed positional encodings, and lookup constants.
Use register_buffer() for these tensors.
```python
class CausalMask(nn.Module):
    def __init__(self, max_length):
        super().__init__()
        mask = torch.triu(
            torch.ones(max_length, max_length, dtype=torch.bool),
            diagonal=1,
        )
        self.register_buffer("mask", mask)

    def forward(self, x):
        T = x.shape[1]
        return self.mask[:T, :T]
```

Now:

```python
model = CausalMask(1024).to(device)
```

The buffer moves with the model.
Without register_buffer, the tensor may remain on CPU and later cause a device mismatch.
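For contrast, a sketch of the failure mode (the class name is hypothetical):

```python
class BrokenCausalMask(nn.Module):
    def __init__(self, max_length):
        super().__init__()
        # Plain attribute: .to(device) will not move this tensor.
        self.mask = torch.triu(
            torch.ones(max_length, max_length, dtype=torch.bool),
            diagonal=1,
        )

model = BrokenCausalMask(1024).to("cuda")
print(model.mask.device)  # cpu, even though the module was moved
```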
DataLoader and Device Transfer
A DataLoader usually returns CPU tensors.
```python
for x, y in loader:
    x = x.to(device)
    y = y.to(device)
    logits = model(x)
```

This pattern is standard. The dataset and loader remain CPU-side, while each minibatch is moved to the accelerator before computation.
For CUDA training, pin_memory=True can improve transfer performance:
```python
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,
)
```

Then use non-blocking copies:

```python
for x, y in loader:
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    logits = model(x)
```

This can help overlap CPU-to-GPU transfer with computation in some pipelines.
Avoiding Repeated Device Transfers
Device transfers are expensive relative to many tensor operations. Avoid moving tensors back and forth inside inner loops.
Poor pattern:

```python
for x, y in loader:
    x = x.to(device)
    output = model(x)
    output_cpu = output.cpu()
    output = output_cpu.to(device)
```

Better pattern:

```python
for x, y in loader:
    x = x.to(device)
    output = model(x)
```

Move tensors to CPU only when needed for logging, NumPy conversion, serialization, or interaction with CPU-only libraries.
For scalar logging:

```python
loss_value = loss.item()
```

The .item() call transfers a scalar value to Python. Calling it too often can synchronize GPU execution and slow training. Use it for logging, not inside performance-critical arithmetic.
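A sketch of the logging pattern, with a hypothetical compute_loss standing in for a real training step:

```python
for step in range(1000):
    loss = compute_loss()  # hypothetical: returns a 0-dim tensor on the GPU
    if step % 100 == 0:
        # .item() may synchronize, so call it only at logging intervals.
        print(f"step {step}: loss {loss.item():.4f}")
```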
NumPy Conversion
A CUDA tensor cannot be converted directly to a NumPy array. NumPy uses CPU memory.
x = torch.randn(3, device="cuda")
# arr = x.numpy() # errorMove to CPU first:
arr = x.detach().cpu().numpy()The detach() call removes the tensor from the autograd graph. This is usually appropriate for logging, plotting, or metric computation outside PyTorch.
For CPU tensors:

```python
x = torch.randn(3)
arr = x.numpy()
```

The NumPy array shares memory with the tensor, so mutating one affects the other.
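A quick sketch demonstrating the shared storage:

```python
x = torch.zeros(3)
arr = x.numpy()
arr[0] = 5.0
print(x)  # tensor([5., 0., 0.]); the tensor sees the change
```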
CUDA Asynchrony
CUDA operations are often asynchronous. When you launch an operation, Python may continue before the GPU has finished.
```python
y = x @ w
```

This line may return control to Python before the matrix multiplication has completed on the GPU.
This matters for timing. Incorrect timing:

```python
import time

start = time.time()
y = x @ w
end = time.time()
print(end - start)
```

Correct CUDA timing requires synchronization:

```python
torch.cuda.synchronize()
start = time.time()
y = x @ w
torch.cuda.synchronize()
end = time.time()
print(end - start)
```

Without synchronization, timing may measure launch overhead rather than actual computation.
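CUDA events are an alternative that records timestamps on the GPU itself; a sketch, assuming x and w are CUDA tensors:

```python
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = x @ w
end.record()

torch.cuda.synchronize()        # wait until both events have been recorded
print(start.elapsed_time(end))  # elapsed time in milliseconds
```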
GPU Memory Allocation
CUDA tensors use GPU memory.
```python
if torch.cuda.is_available():
    print(torch.cuda.memory_allocated())
    print(torch.cuda.memory_reserved())
```

memory_allocated() reports memory used by live tensors. memory_reserved() reports memory held by PyTorch’s caching allocator.
PyTorch may reserve more memory than currently allocated to tensors. This improves future allocation speed but can make GPU memory appear occupied in system tools.
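A sketch that makes the distinction visible (exact numbers depend on the allocator's state):

```python
if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    print(torch.cuda.memory_allocated())  # includes the live tensor x
    del x
    print(torch.cuda.memory_allocated())  # drops once x is freed
    print(torch.cuda.memory_reserved())   # often still nonzero: blocks stay cached
```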
Clearing CUDA Cache
Unused cached memory can be released with:
```python
torch.cuda.empty_cache()
```

This does not free memory used by live tensors. It only releases unused cached blocks back to the CUDA driver.
Do not call empty_cache() inside normal training loops. It can slow training by forcing PyTorch to request memory from the driver repeatedly.
Use it mainly during debugging, notebook experimentation, or after deleting large tensors.
```python
del large_tensor
torch.cuda.empty_cache()
```

Multiple GPUs
A tensor belongs to one device at a time.
x0 = torch.randn(3, device="cuda:0")
x1 = torch.randn(3, device="cuda:1")Operations across GPUs usually require explicit movement or distributed communication.
x1_on_0 = x1.to("cuda:0")
z = x0 + x1_on_0For serious multi-GPU training, use PyTorch distributed tools such as DistributedDataParallel, which will be discussed later in the book.
Avoid manually scattering tensors across GPUs unless the communication pattern is clear.
Apple Silicon and Other Accelerators
On Apple Silicon, PyTorch may support the Metal Performance Shaders backend through the "mps" device.
device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(32, 128, device=device)Some operations may have different performance characteristics or incomplete support compared with CUDA. Device-agnostic code should use .to(device) and create helper tensors from existing tensor properties.
A general helper:
if torch.cuda.is_available():
device = "cuda"
elif torch.backends.mps.is_available():
device = "mps"
else:
device = "cpu"Device-Agnostic Training Step
A training step should not hardcode CPU or GPU inside the model logic.
```python
def train_step(model, batch, optimizer, device):
    model.train()
    x, y = batch
    x = x.to(device)
    y = y.to(device)
    logits = model(x)
    loss = torch.nn.functional.cross_entropy(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```

This function works as long as the model and data are moved consistently.
Example setup:

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
```
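A usage sketch, assuming a DataLoader named loader that yields (x, y) batches:

```python
for epoch in range(3):
    for batch in loader:
        loss = train_step(model, batch, optimizer, device)
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```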
Creating Helper Tensors Correctly

Inside model code, helper tensors should usually inherit device and dtype from input tensors.
```python
def normalize_with_epsilon(x):
    eps = torch.tensor(1e-5, device=x.device, dtype=x.dtype)
    return x / (x.norm(dim=-1, keepdim=True) + eps)
```

Often a Python scalar is enough:

```python
def normalize_with_epsilon(x):
    return x / (x.norm(dim=-1, keepdim=True) + 1e-5)
```

For tensors with shape, use x.new_* methods or *_like functions:

```python
zeros = x.new_zeros(x.shape[0], 1)
noise = torch.randn_like(x)
mask = torch.ones_like(x, dtype=torch.bool)
```

These patterns reduce device and dtype bugs.
Debugging Device Problems
Print shape, dtype, and device together:
print(x.shape, x.dtype, x.device)Inspect model parameter devices:
for name, param in model.named_parameters():
print(name, param.device)Inspect buffers:
for name, buffer in model.named_buffers():
print(name, buffer.device)When a device mismatch error occurs, check these objects first:
| Object | Common problem |
|---|---|
| Input batch | Still on CPU |
| Labels | Still on CPU |
| Mask | Created inside forward on CPU |
| Positional encoding | Stored as unregistered tensor |
| Loss weight tensor | Created on CPU |
| Cached hidden state | From previous device |
Most device errors come from tensors created after the model was moved.
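A small helper that audits every parameter and buffer at once can catch these early; a hypothetical utility, not a PyTorch built-in:

```python
def check_devices(model, expected):
    # Report any parameter or buffer not on the expected device.
    # Pass an explicit index such as "cuda:0", since torch.device("cuda")
    # compares unequal to torch.device("cuda:0").
    expected = torch.device(expected)
    tensors = list(model.named_parameters()) + list(model.named_buffers())
    for name, t in tensors:
        if t.device != expected:
            print(f"{name} is on {t.device}, expected {expected}")
```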
CPU Versus GPU Performance
A GPU is usually faster for large dense tensor operations. A CPU may be faster or simpler for small operations, preprocessing, control-heavy code, or operations with limited GPU support.
GPU acceleration works best when:
| Condition | Reason |
|---|---|
| Operations are large and batched | Better parallel utilization |
| Data stays on GPU | Avoids transfer overhead |
| Computation uses optimized kernels | Matrix and convolution kernels are highly tuned |
| Python loops are minimized | Kernel launch overhead is reduced |
| Memory access is efficient | Bandwidth and cache behavior improve |
A small operation may run slower on GPU because transfer and launch overhead dominate. Performance depends on shape, dtype, hardware, and operation type.
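A rough benchmark sketch that illustrates the crossover; results vary widely by hardware:

```python
import time

import torch

def time_matmul(n, device, iters=10):
    # Average the wall-clock time of a square matmul on the given device.
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # respect CUDA asynchrony when timing
    start = time.time()
    for _ in range(iters):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / iters

for n in (64, 4096):
    print(n, "cpu", time_matmul(n, "cpu"))
    if torch.cuda.is_available():
        print(n, "cuda", time_matmul(n, "cuda"))
```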
Summary
CPU tensors live in system memory and run on the processor. CUDA tensors live in GPU memory and run on NVIDIA GPUs. Other accelerators, such as MPS, may also be available depending on hardware and PyTorch support.
Correct PyTorch programs keep inputs, labels, parameters, buffers, and helper tensors on compatible devices. Efficient programs avoid unnecessary CPU-GPU transfers, use pinned memory where appropriate, and respect CUDA asynchrony when timing code.
Device placement should be handled deliberately at the boundaries of the training loop and inside tensor-creation code. A reliable pattern is to move the model once, move each batch once, and create all helper tensors from the device and dtype of existing tensors.