A tensor has values, shape, data type, and device placement. Shape tells us how values are arranged. Data type tells us how each value is represented. Device placement tells us where the tensor lives: CPU, GPU, or another accelerator.
These properties affect correctness, memory use, speed, and numerical stability. A model can have the right equations and still fail because a tensor has the wrong dtype or lives on the wrong device.
Data Type
The data type, or dtype, defines how tensor entries are stored.
import torch
x = torch.tensor([1.0, 2.0, 3.0])
print(x.dtype)

Output:
torch.float32

PyTorch inferred torch.float32 because the input values were floating-point numbers.
Integer inputs produce integer tensors:
y = torch.tensor([1, 2, 3])
print(y.dtype)

Output:
torch.int64

The dtype matters because neural networks usually use floating-point tensors for parameters and activations, integer tensors for labels and token IDs, and Boolean tensors for masks.
Common PyTorch Data Types
The most common dtypes are:
| dtype | Meaning | Common use |
|---|---|---|
| torch.float32 | 32-bit floating point | Standard training and inference |
| torch.float16 | 16-bit floating point | Mixed precision on GPUs |
| torch.bfloat16 | 16-bit brain floating point | Large model training on supported hardware |
| torch.int64 or torch.long | 64-bit integer | Class labels and token IDs |
| torch.float64 | 64-bit floating point | Scientific computing and numerical checks |
| torch.int32 | 32-bit integer | Some indexing and system interfaces |
| torch.bool | Boolean | Masks and conditions |
Example:
features = torch.randn(32, 128, dtype=torch.float32)
labels = torch.randint(0, 10, (32,), dtype=torch.long)
mask = torch.ones(32, 128, dtype=torch.bool)

A common convention is:

- inputs and model parameters: floating point
- class labels: torch.long
- token IDs: torch.long
- masks: torch.bool
Floating-Point Precision
Floating-point dtypes trade off precision, range, memory, and speed.
float32 is the default for most training. It uses 32 bits per number and offers a useful balance between numerical precision and performance.
float16 uses 16 bits per number. It is smaller and often faster on GPUs, but it has a narrower numerical range, so values underflow or overflow more easily.
bfloat16 also uses 16 bits per number, but keeps a larger exponent range than float16. This often makes it more stable for large model training, though it depends on hardware support.
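As a small illustration of the range difference, here is a minimal sketch (the value 70000 is chosen only because it exceeds the float16 maximum of about 65504):

print(torch.tensor(70000.0, dtype=torch.float16))   # overflows to inf
print(torch.tensor(70000.0, dtype=torch.bfloat16))  # stays finite, though with fewer mantissa bits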
float64 uses 64 bits per number. It is useful for numerical analysis and gradient checking but is usually slower and uses more memory.
Memory Cost of Data Types
The dtype determines memory use.
| dtype | Bytes per element |
|---|---|
| float64 | 8 |
| float32 | 4 |
| float16 | 2 |
| bfloat16 | 2 |
| int64 | 8 |
| int32 | 4 |
| bool | usually 1 |
The memory required by a tensor is approximately:

memory in bytes ≈ number of elements × bytes per element

For example, a tensor with shape [32, 3, 224, 224] has:

32 × 3 × 224 × 224 = 4,816,896

elements.

In float32, this requires approximately:

4,816,896 × 4 = 19,267,584

bytes, or about 18.4 MiB.
In float16, it requires about half that memory.
In PyTorch:
x = torch.randn(32, 3, 224, 224, dtype=torch.float32)
bytes_used = x.numel() * x.element_size()
print(bytes_used)

Explicit dtype Creation
A dtype can be specified when creating a tensor:
x = torch.tensor([1, 2, 3], dtype=torch.float32)
y = torch.zeros(10, dtype=torch.long)
z = torch.ones(4, 4, dtype=torch.bool)

Random constructors also accept dtype:
x = torch.randn(3, 4, dtype=torch.float16)

Use explicit dtype when the role of the tensor matters.
For example, class labels for cross_entropy should be integer class indices:
logits = torch.randn(32, 10)
labels = torch.randint(0, 10, (32,), dtype=torch.long)
loss = torch.nn.functional.cross_entropy(logits, labels)

If labels are floating-point values in this case, the loss call may fail or mean something different.
Type Conversion
Use .to(dtype) or dtype-specific methods to convert tensors.
x = torch.tensor([1, 2, 3])
x_float = x.to(torch.float32)
x_long = x_float.to(torch.long)

Convenience methods:
x.float()
x.long()
x.bool()
x.half()

Example:
labels = torch.tensor([0.0, 1.0, 2.0])
labels = labels.long()

Be careful when converting from floating point to integer. Values are truncated toward zero:
x = torch.tensor([1.2, 1.9, -2.7])
print(x.long())

Output:
tensor([ 1, 1, -2])

Type Promotion
When two tensors with different dtypes are used together, PyTorch may promote them to a common dtype.
x = torch.tensor([1, 2, 3], dtype=torch.int64)
y = torch.tensor([0.5, 1.5, 2.5], dtype=torch.float32)
z = x + y
print(z.dtype)

The result is floating point because the operation must represent fractional values.
Type promotion is convenient, but it can hide mistakes. For model code, it is better to make dtype choices explicit for inputs, labels, masks, and parameters.
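Reusing x and y from the example above, one way to make the choice explicit is to cast before the operation:

z = x.to(torch.float32) + y   # the result dtype is now chosen deliberately
print(z.dtype)                # torch.float32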
Default dtype
PyTorch uses float32 as the default floating-point dtype.
print(torch.get_default_dtype())

Output:
torch.float32

You can change the default:
torch.set_default_dtype(torch.float64)

This affects newly created floating-point tensors. It does not automatically change existing tensors.
Changing the global default dtype can make code harder to reason about. In most projects, prefer explicit dtype arguments where needed.
Device Placement
A tensor lives on a device. The device controls where computation happens.
x = torch.randn(3, 4)
print(x.device)

Output:
cpu

If a CUDA GPU is available:
x = torch.randn(3, 4, device="cuda")
print(x.device)

A common setup is:
device = "cuda" if torch.cuda.is_available() else "cpu"Then create or move tensors to that device:
x = torch.randn(32, 128, device=device)

Moving Tensors Between Devices
Use .to(device) to move tensors.
x = torch.randn(32, 128)
x = x.to(device)

A model can also be moved:
model = torch.nn.Linear(128, 10)
model = model.to(device)

Inputs and model parameters must be on the same device:
x = torch.randn(32, 128).to(device)
logits = model(x)

A common error occurs when the model is on GPU but the input remains on CPU, or when labels are left on CPU during loss computation.
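A minimal sketch of the mismatch, assuming a CUDA device is available:

model = torch.nn.Linear(128, 10).to("cuda")
x = torch.randn(32, 128)        # still on the CPU
# model(x) here would typically raise a RuntimeError about tensors on different devices
logits = model(x.to("cuda"))    # fix: move the input to the model's device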
Creating Tensors on the Right Device
Instead of creating a CPU tensor and moving it, create it directly on the target device.
x = torch.randn(32, 128, device=device)
labels = torch.randint(0, 10, (32,), device=device)

When creating helper tensors inside a function, use the input tensor as a reference:
def add_noise(x, std):
    noise = torch.randn_like(x) * std
    return x + noise

This ensures noise has the same shape, dtype, and device as x.
Another useful pattern:
def causal_mask(T, x):
    return torch.triu(
        torch.ones(T, T, device=x.device, dtype=torch.bool),
        diagonal=1,
    )

The mask is created on the same device as the tensor that will use it.
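As a hypothetical usage, the mask can hide future positions in an attention score matrix before a softmax:

T = 4
scores = torch.randn(T, T)
scores = scores.masked_fill(causal_mask(T, scores), float("-inf"))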
CPU and GPU Transfer Cost
Moving data between CPU and GPU costs time. GPU computation is fast when data is already on the GPU. Frequent small transfers can dominate runtime.
Poor pattern:
for batch in loader:
    x, y = batch
    x = x.to(device)
    y = y.to(device)
    # many small extra CPU to GPU transfers inside the loop

Better pattern:
for x, y in loader:
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    logits = model(x)

When using a DataLoader, pinned memory can improve CPU-to-GPU transfer performance:
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,
)

Pinned memory is useful when training on CUDA devices, and it pairs naturally with the non_blocking=True transfers shown above.
Mixed Precision
Mixed precision uses lower-precision arithmetic where safe and higher precision where needed.
In PyTorch, automatic mixed precision is commonly written as:
scaler = torch.cuda.amp.GradScaler()
for x, y in loader:
    x = x.to(device)
    y = y.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        logits = model(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

On modern PyTorch versions, torch.amp.autocast may be preferred:
with torch.amp.autocast(device_type="cuda"):
    logits = model(x)

Mixed precision can reduce memory use and increase throughput. It can also introduce numerical issues if unstable operations are forced into low precision. PyTorch autocast keeps many sensitive operations in safer precision automatically.
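A minimal sketch of that behavior, assuming a CUDA device: under autocast, matmul-heavy layers such as nn.Linear commonly produce float16 outputs, while the parameters themselves stay in float32.

model = torch.nn.Linear(128, 10).to("cuda")
x = torch.randn(32, 128, device="cuda")
with torch.amp.autocast(device_type="cuda"):
    logits = model(x)
print(logits.dtype)                    # commonly torch.float16 inside autocast regions
print(next(model.parameters()).dtype)  # torch.float32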
Device-Agnostic Code
Good PyTorch code avoids hardcoding "cuda" inside model logic.
Less flexible:
mask = torch.ones(T, T, device="cuda")More flexible:
mask = torch.ones(T, T, device=x.device)

This works on CPU, CUDA, and other supported accelerators.
For dtype:
scale = torch.tensor(0.5, device=x.device, dtype=x.dtype)

Or use Python scalars when possible:
y = x * 0.5

Python scalars do not carry a device of their own, so they usually do not create device mismatch problems.
Model Parameters and Buffers
A PyTorch module has parameters and buffers.
Parameters are learned by optimization:
model = torch.nn.Linear(128, 10)
for name, param in model.named_parameters():
    print(name, param.shape, param.device, param.dtype)

Buffers are tensors stored in the model but not optimized. Batch normalization running statistics are common examples.
for name, buffer in model.named_buffers():
    print(name, buffer.shape)

When calling:
model.to(device)

PyTorch moves both parameters and buffers. This is one reason model state should be registered properly rather than stored as ordinary unregistered tensors.
Registering Buffers
Suppose a model needs a fixed tensor, such as a positional encoding or mask. If it should move with the model but should not be optimized, register it as a buffer.
import torch
import torch.nn as nn
class ScaleModel(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        scale = torch.ones(dim)
        self.register_buffer("scale", scale)

    def forward(self, x):
        return self.linear(x) * self.scale

Now:
model = ScaleModel(128).to(device)

Both the linear layer's parameters and the scale buffer move to the target device.
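A quick check of this behavior: registered buffers follow the module and are also saved in its state_dict.

print(model.scale.device)             # same device as the model
print("scale" in model.state_dict())  # True: buffers are included in the state dict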
Common dtype and Device Errors
Common errors include:
| Error type | Cause | Fix |
|---|---|---|
| Device mismatch | CPU tensor used with GPU tensor | Move tensors to same device |
| Wrong label dtype | Labels are float for class indices | Use labels.long() |
| Wrong mask dtype | Mask stored as float or int unexpectedly | Use mask.bool() when needed |
| Unregistered tensor | Tensor attribute does not move with model | Use register_buffer |
| Silent dtype promotion | Mixed int and float tensors | Make dtype explicit |
| Low-precision instability | float16 overflow or underflow | Use autocast, scaling, or bfloat16 |
Most errors can be diagnosed by printing:
print(x.shape, x.dtype, x.device)

For model parameters:
next(model.parameters()).device

Small Training Example
The following code keeps dtype and device explicit.
import torch
import torch.nn as nn
import torch.nn.functional as F
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
X = torch.randn(32, 128, dtype=torch.float32, device=device)
y = torch.randint(0, 10, (32,), dtype=torch.long, device=device)
logits = model(X)
loss = F.cross_entropy(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())

Shape, dtype, and device roles:
| Tensor | Shape | dtype | device | Role |
|---|---|---|---|---|
| X | [32, 128] | float32 | device | Input features |
| y | [32] | long | device | Class labels |
| weight | [10, 128] | float32 | device | Learned parameter |
| bias | [10] | float32 | device | Learned parameter |
| logits | [32, 10] | float32 | device | Class scores |
| loss | [] | float32 | device | Scalar objective |
Summary
A tensor’s dtype controls numerical representation. A tensor’s device controls where computation happens. These properties are as important as shape in practical PyTorch programming.
Floating-point tensors are used for inputs, activations, and parameters. Integer tensors are used for labels and token IDs. Boolean tensors are used for masks. CPU tensors and GPU tensors cannot usually participate in the same operation unless moved to a common device.
Reliable PyTorch code makes dtype and device choices explicit, creates helper tensors on the correct device, registers persistent non-parameter tensors as buffers, and uses mixed precision through supported PyTorch mechanisms rather than manual low-precision conversion.