A tensor has values, shape, data type, and device placement. Shape tells us how values are arranged. Data type tells us how each value is represented. Device placement tells us where the tensor lives: CPU, GPU, or another accelerator.
These properties affect correctness, memory use, speed, and numerical stability. A model can have the right equations and still fail because a tensor has the wrong dtype or lives on the wrong device.
Data Type
The data type, or dtype, defines how tensor entries are stored.
import torch
x = torch.tensor([1.0, 2.0, 3.0])
print(x.dtype)

Output:
torch.float32

PyTorch inferred torch.float32 because the input values were floating-point numbers.
Integer inputs produce integer tensors:
y = torch.tensor([1, 2, 3])
print(y.dtype)

Output:
torch.int64

The dtype matters because neural networks usually use floating-point tensors for parameters and activations, integer tensors for labels and token IDs, and Boolean tensors for masks.
Common PyTorch Data Types
The most common dtypes are:
| dtype | Meaning | Common use |
|---|---|---|
| torch.float32 | 32-bit floating point | Standard training and inference |
| torch.float16 | 16-bit floating point | Mixed precision on GPUs |
| torch.bfloat16 | 16-bit brain floating point | Large model training on supported hardware |
| torch.int64 or torch.long | 64-bit integer | Class labels and token IDs |
| torch.float64 | 64-bit floating point | Scientific computing and numerical checks |
| torch.int32 | 32-bit integer | Some indexing and system interfaces |
| torch.bool | Boolean | Masks and conditions |
Example:
features = torch.randn(32, 128, dtype=torch.float32)
labels = torch.randint(0, 10, (32,), dtype=torch.long)
mask = torch.ones(32, 128, dtype=torch.bool)

A common convention is:

- inputs and model parameters: floating point
- class labels: torch.long
- token IDs: torch.long
- masks: torch.bool
Floating-Point Precision
Floating-point dtypes trade off precision, range, memory, and speed.
float32 is the default for most training. It uses 32 bits per number and offers a useful balance between numerical precision and performance.
float16 uses 16 bits per number. It is smaller and often faster on GPUs, but it has a narrower numerical range, so values underflow or overflow more easily.
bfloat16 also uses 16 bits per number, but keeps a larger exponent range than float16. This often makes it more stable for large model training, though it depends on hardware support.
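As a small illustration of the range difference, here is a minimal sketch (the value 70000 is chosen only because it exceeds the float16 maximum of about 65504):

print(torch.tensor(70000.0, dtype=torch.float16))   # overflows to inf
print(torch.tensor(70000.0, dtype=torch.bfloat16))  # stays finite, though with fewer mantissa bits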
float64 uses 64 bits per number. It is useful for numerical analysis and gradient checking but is usually slower and uses more memory.
Memory Cost of Data Types
The dtype determines memory use.
| dtype | Bytes per element |
|---|---|
| float64 | 8 |
| float32 | 4 |
| float16 | 2 |
| bfloat16 | 2 |
| int64 | 8 |
| int32 | 4 |
| bool | usually 1 |
The memory required by a tensor is approximately:

memory in bytes ≈ number of elements × bytes per element

For example, a tensor with shape [32, 3, 224, 224] has:

32 × 3 × 224 × 224 = 4,816,896

elements.

In float32, this requires approximately:

4,816,896 × 4 = 19,267,584

bytes, or about 18.4 MiB.
In float16, it requires about half that memory.
In PyTorch:
x = torch.randn(32, 3, 224, 224, dtype=torch.float32)
bytes_used = x.numel() * x.element_size()
print(bytes_used)

Explicit dtype Creation
A dtype can be specified when creating a tensor:
x = torch.tensor([1, 2, 3], dtype=torch.float32)
y = torch.zeros(10, dtype=torch.long)
z = torch.ones(4, 4, dtype=torch.bool)

Random constructors also accept dtype:
x = torch.randn(3, 4, dtype=torch.float16)

Use explicit dtype when the role of the tensor matters.
For example, class labels for cross_entropy should be integer class indices:
logits = torch.randn(32, 10)
labels = torch.randint(0, 10, (32,), dtype=torch.long)
loss = torch.nn.functional.cross_entropy(logits, labels)

If labels are floating-point values in this case, the loss call may fail or mean something different.
Type Conversion
Use .to(dtype) or dtype-specific methods to convert tensors.
x = torch.tensor([1, 2, 3])
x_float = x.to(torch.float32)
x_long = x_float.to(torch.long)

Convenience methods:
x.float()
x.long()
x.bool()
x.half()

Example:
labels = torch.tensor([0.0, 1.0, 2.0])
labels = labels.long()

Be careful when converting from floating point to integer. Values are truncated toward zero:
x = torch.tensor([1.2, 1.9, -2.7])
print(x.long())

Output:
tensor([ 1, 1, -2])

Type Promotion
When two tensors with different dtypes are used together, PyTorch may promote them to a common dtype.
x = torch.tensor([1, 2, 3], dtype=torch.int64)
y = torch.tensor([0.5, 1.5, 2.5], dtype=torch.float32)
z = x + y
print(z.dtype)

The result is floating point because the operation must represent fractional values.
Type promotion is convenient, but it can hide mistakes. For model code, it is better to make dtype choices explicit for inputs, labels, masks, and parameters.
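Reusing x and y from the example above, one way to make the choice explicit is to cast before the operation:

z = x.to(torch.float32) + y   # the result dtype is now chosen deliberately
print(z.dtype)                # torch.float32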
Default dtype
PyTorch uses float32 as the default floating-point dtype.
print(torch.get_default_dtype())

Output:
torch.float32

You can change the default:
torch.set_default_dtype(torch.float64)

This affects newly created floating-point tensors. It does not automatically change existing tensors.
Changing the global default dtype can make code harder to reason about. In most projects, prefer explicit dtype arguments where needed.
Device Placement
A tensor lives on a device. The device controls where computation happens.
x = torch.randn(3, 4)
print(x.device)

Output:
cpu

If a CUDA GPU is available:
x = torch.randn(3, 4, device="cuda")
print(x.device)

A common setup is:
device = "cuda" if torch.cuda.is_available() else "cpu"Then create or move tensors to that device:
x = torch.randn(32, 128, device=device)

Moving Tensors Between Devices
Use .to(device) to move tensors.
x = torch.randn(32, 128)
x = x.to(device)

A model can also be moved:
model = torch.nn.Linear(128, 10)
model = model.to(device)

Inputs and model parameters must be on the same device:
x = torch.randn(32, 128).to(device)
logits = model(x)

A common error occurs when the model is on GPU but the input remains on CPU, or when labels are left on CPU during loss computation.
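A minimal sketch of the mismatch, assuming a CUDA device is available:

model = torch.nn.Linear(128, 10).to("cuda")
x = torch.randn(32, 128)        # still on the CPU
# model(x) here would typically raise a RuntimeError about tensors on different devices
logits = model(x.to("cuda"))    # fix: move the input to the model's device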
Creating Tensors on the Right Device
Instead of creating a CPU tensor and moving it, create it directly on the target device.
x = torch.randn(32, 128, device=device)
labels = torch.randint(0, 10, (32,), device=device)

When creating helper tensors inside a function, use the input tensor as a reference:
def add_noise(x, std):
    noise = torch.randn_like(x) * std
    return x + noise

This ensures noise has the same shape, dtype, and device as x.
Another useful pattern:
def causal_mask(T, x):
    return torch.triu(
        torch.ones(T, T, device=x.device, dtype=torch.bool),
        diagonal=1,
    )

The mask is created on the same device as the tensor that will use it.
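As a hypothetical usage, the mask can hide future positions in an attention score matrix before a softmax:

T = 4
scores = torch.randn(T, T)
scores = scores.masked_fill(causal_mask(T, scores), float("-inf"))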
CPU and GPU Transfer Cost
Moving data between CPU and GPU costs time. GPU computation is fast when data is already on the GPU. Frequent small transfers can dominate runtime.
Poor pattern:
for batch in loader:
    x, y = batch
    x = x.to(device)
    y = y.to(device)
    # many small extra CPU to GPU transfers inside the loop

Better pattern:
for x, y in loader:
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    logits = model(x)

When using a DataLoader, pinned memory can improve CPU-to-GPU transfer performance:
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,
)

Pinned memory is useful when training on CUDA devices, and it pairs naturally with the non_blocking=True transfers shown above.
Mixed Precision
Mixed precision uses lower-precision arithmetic where safe and higher precision where needed.
In PyTorch, automatic mixed precision is commonly written as:
scaler = torch.cuda.amp.GradScaler()
for x, y in loader:
    x = x.to(device)
    y = y.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        logits = model(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

On modern PyTorch versions, torch.amp.autocast may be preferred:
with torch.amp.autocast(device_type="cuda"):
    logits = model(x)

Mixed precision can reduce memory use and increase throughput. It can also introduce numerical issues if unstable operations are forced into low precision. PyTorch autocast keeps many sensitive operations in safer precision automatically.
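A minimal sketch of that behavior, assuming a CUDA device: under autocast, matmul-heavy layers such as nn.Linear commonly produce float16 outputs, while the parameters themselves stay in float32.

model = torch.nn.Linear(128, 10).to("cuda")
x = torch.randn(32, 128, device="cuda")
with torch.amp.autocast(device_type="cuda"):
    logits = model(x)
print(logits.dtype)                    # commonly torch.float16 inside autocast regions
print(next(model.parameters()).dtype)  # torch.float32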
Device-Agnostic Code
Good PyTorch code avoids hardcoding "cuda" inside model logic.
Less flexible:
mask = torch.ones(T, T, device="cuda")More flexible:
mask = torch.ones(T, T, device=x.device)

This works on CPU, CUDA, and other supported accelerators.
For dtype:
scale = torch.tensor(0.5, device=x.device, dtype=x.dtype)

Or use Python scalars when possible:
y = x * 0.5

Python scalars do not carry a device of their own, so they usually do not create device mismatch problems.
Model Parameters and Buffers
A PyTorch module has parameters and buffers.
Parameters are learned by optimization:
model = torch.nn.Linear(128, 10)
for name, param in model.named_parameters():
    print(name, param.shape, param.device, param.dtype)

Buffers are tensors stored in the model but not optimized. Batch normalization running statistics are common examples.
for name, buffer in model.named_buffers():
    print(name, buffer.shape)

When calling:
model.to(device)

PyTorch moves both parameters and buffers. This is one reason model state should be registered properly rather than stored as ordinary unregistered tensors.
Registering Buffers
Suppose a model needs a fixed tensor, such as a positional encoding or mask. If it should move with the model but should not be optimized, register it as a buffer.
import torch
import torch.nn as nn
class ScaleModel(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        scale = torch.ones(dim)
        self.register_buffer("scale", scale)

    def forward(self, x):
        return self.linear(x) * self.scale

Now:
model = ScaleModel(128).to(device)

Both the linear layer's parameters and the scale buffer move to the target device.
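A quick check of this behavior: registered buffers follow the module and are also saved in its state_dict.

print(model.scale.device)             # same device as the model
print("scale" in model.state_dict())  # True: buffers are included in the state dict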
Common dtype and Device Errors
Common errors include:
| Error type | Cause | Fix |
|---|---|---|
| Device mismatch | CPU tensor used with GPU tensor | Move tensors to same device |
| Wrong label dtype | Labels are float for class indices | Use labels.long() |
| Wrong mask dtype | Mask stored as float or int unexpectedly | Use mask.bool() when needed |
| Unregistered tensor | Tensor attribute does not move with model | Use register_buffer |
| Silent dtype promotion | Mixed int and float tensors | Make dtype explicit |
| Low-precision instability | float16 overflow or underflow | Use autocast, scaling, or bfloat16 |
Most errors can be diagnosed by printing:
print(x.shape, x.dtype, x.device)

For model parameters:
next(model.parameters()).device

Small Training Example
The following code keeps dtype and device explicit.
import torch
import torch.nn as nn
import torch.nn.functional as F
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
X = torch.randn(32, 128, dtype=torch.float32, device=device)
y = torch.randint(0, 10, (32,), dtype=torch.long, device=device)
logits = model(X)
loss = F.cross_entropy(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())

Shape, dtype, and device roles:
| Tensor | Shape | dtype | device | Role |
|---|---|---|---|---|
| X | [32, 128] | float32 | device | Input features |
| y | [32] | long | device | Class labels |
| weight | [10, 128] | float32 | device | Learned parameter |
| bias | [10] | float32 | device | Learned parameter |
| logits | [32, 10] | float32 | device | Class scores |
| loss | [] | float32 | device | Scalar objective |
Summary
A tensor’s dtype controls numerical representation. A tensor’s device controls where computation happens. These properties are as important as shape in practical PyTorch programming.
Floating-point tensors are used for inputs, activations, and parameters. Integer tensors are used for labels and token IDs. Boolean tensors are used for masks. CPU tensors and GPU tensors cannot usually participate in the same operation unless moved to a common device.
Reliable PyTorch code makes dtype and device choices explicit, creates helper tensors on the correct device, registers persistent non-parameter tensors as buffers, and uses mixed precision through supported PyTorch mechanisms rather than manual low-precision conversion.