# Tensor Data Types and Devices

A tensor has values, shape, data type, and device placement. Shape tells us how values are arranged. Data type tells us how each value is represented. Device placement tells us where the tensor lives: CPU, GPU, or another accelerator.

These properties affect correctness, memory use, speed, and numerical stability. A model can have the right equations and still fail because a tensor has the wrong dtype or lives on the wrong device.

### Data Type

The data type, or dtype, defines how tensor entries are stored.

```python id="860edi"
import torch

x = torch.tensor([1.0, 2.0, 3.0])

print(x.dtype)
```

Output:

```python id="69xjbj"
torch.float32
```

PyTorch inferred `torch.float32` because the input values were floating-point numbers.

Integer inputs produce integer tensors:

```python id="nl2wts"
y = torch.tensor([1, 2, 3])

print(y.dtype)
```

Output:

```python id="aqbtde"
torch.int64
```

The dtype matters because neural networks usually use floating-point tensors for parameters and activations, integer tensors for labels and token IDs, and Boolean tensors for masks.

### Common PyTorch Data Types

The most common dtypes are:

| dtype | Meaning | Common use |
|---|---|---|
| `torch.float32` | 32-bit floating point | Standard training and inference |
| `torch.float16` | 16-bit floating point | Mixed precision on GPUs |
| `torch.bfloat16` | 16-bit brain floating point | Large model training on supported hardware |
| `torch.float64` | 64-bit floating point | Scientific computing and numerical checks |
| `torch.int64` or `torch.long` | 64-bit integer | Class labels and token IDs |
| `torch.int32` | 32-bit integer | Some indexing and system interfaces |
| `torch.bool` | Boolean | Masks and conditions |

Example:

```python id="fssy5n"
features = torch.randn(32, 128, dtype=torch.float32)
labels = torch.randint(0, 10, (32,), dtype=torch.long)
mask = torch.ones(32, 128, dtype=torch.bool)
```

A common convention is:

- inputs and model parameters: floating point
- class labels: `torch.long`
- token IDs: `torch.long`
- masks: `torch.bool`

### Floating-Point Precision

Floating-point dtypes trade off precision, range, memory, and speed.

`float32` is the default for most training. It uses 32 bits per number. It gives a useful balance between numerical precision and performance.

`float16` uses 16 bits per number. It is smaller and often faster on GPUs, but it has less numerical range. Values may underflow or overflow more easily.

`bfloat16` also uses 16 bits per number, but keeps a larger exponent range than `float16`. This often makes it more stable for large model training, though it depends on hardware support.

`float64` uses 64 bits per number. It is useful for numerical analysis and gradient checking but is usually slower and uses more memory.
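
The trade-offs can be inspected directly with `torch.finfo`, which reports the numeric limits of each floating-point dtype:

```python
import torch

for dtype in [torch.float64, torch.float32, torch.float16, torch.bfloat16]:
    info = torch.finfo(dtype)
    # max: largest representable value; eps: gap between 1.0 and the next value
    print(dtype, info.max, info.eps)
```

`bfloat16` keeps roughly the same exponent range as `float32`, while `float16` trades range for extra mantissa precision, which is why `float16` overflows more easily.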

### Memory Cost of Data Types

The dtype determines memory use.

| dtype | Bytes per element |
|---|---:|
| `float64` | 8 |
| `float32` | 4 |
| `float16` | 2 |
| `bfloat16` | 2 |
| `int64` | 8 |
| `int32` | 4 |
| `bool` | usually 1 |

The memory required by a tensor is approximately:

$$
\text{memory bytes} =
\text{number of elements}
\times
\text{bytes per element}.
$$

For example, a tensor with shape `[32, 3, 224, 224]` has:

$$
32 \times 3 \times 224 \times 224 = 4{,}816{,}896
$$

elements.

In `float32`, this requires approximately:

$$
4{,}816{,}896 \times 4 =
19{,}267{,}584
$$

bytes, or about 18.4 MiB.

In `float16`, it requires about half that memory.

In PyTorch:

```python id="c922mb"
x = torch.randn(32, 3, 224, 224, dtype=torch.float32)

bytes_used = x.numel() * x.element_size()

print(bytes_used)
```
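
To confirm the halving, the same tensor can be cast to `float16` and measured again:

```python
x16 = x.to(torch.float16)

print(x.element_size(), x16.element_size())   # 4 vs 2 bytes per element
print(x16.numel() * x16.element_size())       # 9,633,792: half the float32 figure
```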

### Explicit dtype Creation

A dtype can be specified when creating a tensor:

```python id="15god7"
x = torch.tensor([1, 2, 3], dtype=torch.float32)
y = torch.zeros(10, dtype=torch.long)
z = torch.ones(4, 4, dtype=torch.bool)
```

Random constructors also accept dtype:

```python id="5zrz89"
x = torch.randn(3, 4, dtype=torch.float16)
```

Use explicit dtype when the role of the tensor matters.

For example, class labels for `cross_entropy` should be integer class indices:

```python id="ufxjby"
logits = torch.randn(32, 10)
labels = torch.randint(0, 10, (32,), dtype=torch.long)

loss = torch.nn.functional.cross_entropy(logits, labels)
```

If the labels are floating point here, the call either raises an error or, when their shape matches the logits, is silently interpreted as per-class probabilities rather than class indices.

### Type Conversion

Use `.to(dtype)` or dtype-specific methods to convert tensors.

```python id="qksp6z"
x = torch.tensor([1, 2, 3])

x_float = x.to(torch.float32)
x_long = x_float.to(torch.long)
```

Convenience methods:

```python id="y5m72w"
x.float()
x.long()
x.bool()
x.half()
```

Example:

```python id="iwfpdq"
labels = torch.tensor([0.0, 1.0, 2.0])
labels = labels.long()
```

Be careful when converting from floating point to integer. Values are truncated:

```python id="y18ih5"
x = torch.tensor([1.2, 1.9, -2.7])
print(x.long())
```

Output:

```python id="to55vl"
tensor([ 1,  1, -2])
```

### Type Promotion

When two tensors with different dtypes are used together, PyTorch may promote them to a common dtype.

```python id="rlt84k"
x = torch.tensor([1, 2, 3], dtype=torch.int64)
y = torch.tensor([0.5, 1.5, 2.5], dtype=torch.float32)

z = x + y

print(z.dtype)
```
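
Output:

```python
torch.float32
```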

The result is floating point because the operation must represent fractional values.

Type promotion is convenient, but it can hide mistakes. For model code, it is better to make dtype choices explicit for inputs, labels, masks, and parameters.

### Default dtype

PyTorch uses `float32` as the default floating-point dtype.

```python id="m20q86"
print(torch.get_default_dtype())
```

Output:

```python id="tv0ctt"
torch.float32
```

You can change the default:

```python id="p87l8l"
torch.set_default_dtype(torch.float64)
```

This affects newly created floating-point tensors. It does not automatically change existing tensors.
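
For example, restoring the default afterwards so later code is unaffected:

```python
torch.set_default_dtype(torch.float64)
print(torch.randn(2).dtype)   # torch.float64

torch.set_default_dtype(torch.float32)
print(torch.randn(2).dtype)   # torch.float32
```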

Changing the global default dtype can make code harder to reason about. In most projects, prefer explicit dtype arguments where needed.

### Device Placement

A tensor lives on a device. The device controls where computation happens.

```python id="dat96c"
x = torch.randn(3, 4)

print(x.device)
```

Output:

```python id="s57q1l"
cpu
```

If a CUDA GPU is available:

```python id="x1slbm"
x = torch.randn(3, 4, device="cuda")

print(x.device)
```

A common setup is:

```python id="jqkez0"
device = "cuda" if torch.cuda.is_available() else "cpu"
```

Then create or move tensors to that device:

```python id="cuj3h1"
x = torch.randn(32, 128, device=device)
```

### Moving Tensors Between Devices

Use `.to(device)` to move tensors.

```python id="g57d8u"
x = torch.randn(32, 128)

x = x.to(device)
```

A model can also be moved:

```python id="k0w8f9"
model = torch.nn.Linear(128, 10)
model = model.to(device)
```

Inputs and model parameters must be on the same device:

```python id="8x2wz3"
x = torch.randn(32, 128).to(device)

logits = model(x)
```

A common error occurs when the model is on GPU but the input remains on CPU, or when labels are left on CPU during loss computation.
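
A minimal sketch of the mismatch, assuming `model` has already been moved to a CUDA device:

```python
x_cpu = torch.randn(32, 128)   # created on the CPU by default

print(next(model.parameters()).device, x_cpu.device)   # e.g. cuda:0 vs cpu

# model(x_cpu) raises a device-mismatch RuntimeError in this situation;
# x_cpu = x_cpu.to(device) fixes it
```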

### Creating Tensors on the Right Device

Instead of creating a CPU tensor and moving it, create it directly on the target device.

```python id="6ddmdq"
x = torch.randn(32, 128, device=device)
labels = torch.randint(0, 10, (32,), device=device)
```

When creating helper tensors inside a function, use the input tensor as reference:

```python id="4gjh87"
def add_noise(x, std):
    noise = torch.randn_like(x) * std
    return x + noise
```

This ensures `noise` has the same shape, dtype, and device as `x`.
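
A short usage sketch, assuming `device` is set as above:

```python
x = torch.randn(32, 128, device=device)
noisy = add_noise(x, std=0.1)

print(noisy.shape, noisy.dtype, noisy.device)   # all match x
```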

Another useful pattern:

```python id="2llgb3"
def causal_mask(T, x):
    return torch.triu(
        torch.ones(T, T, device=x.device, dtype=torch.bool),
        diagonal=1,
    )
```

The mask is created on the same device as the tensor that will use it.
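
A usage sketch with illustrative shapes:

```python
x = torch.randn(2, 8, 16, device=device)      # (batch, time, features), illustrative
mask = causal_mask(x.shape[1], x)

print(mask.shape, mask.dtype, mask.device)    # [8, 8], torch.bool, same device as x
```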

### CPU and GPU Transfer Cost

Moving data between CPU and GPU costs time. GPU computation is fast when data is already on the GPU. Frequent small transfers can dominate runtime.

Poor pattern:

```python id="9tpbsz"
for batch in loader:
    x, y = batch
    x = x.to(device)
    y = y.to(device)

    # plus many small helper tensors built on the CPU and copied to the GPU one at a time inside the loop
```

Better pattern:

```python id="cbdv23"
for x, y in loader:
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)

    logits = model(x)
```

When using a `DataLoader`, pinned memory can improve CPU-to-GPU transfer performance:

```python id="a65bse"
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,
)
```

Pinned memory is mainly useful when training on CUDA devices; combined with `non_blocking=True`, it lets host-to-GPU copies overlap with computation instead of blocking on every transfer.

### Mixed Precision

Mixed precision uses lower-precision arithmetic where safe and higher precision where needed.

In PyTorch, automatic mixed precision is commonly written as:

```python id="4j9zyf"
scaler = torch.cuda.amp.GradScaler()

for x, y in loader:
    x = x.to(device)
    y = y.to(device)

    optimizer.zero_grad()

    with torch.cuda.amp.autocast():
        logits = model(x)
        loss = torch.nn.functional.cross_entropy(logits, y)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

On modern PyTorch versions, `torch.amp.autocast` may be preferred:

```python id="qn2cfx"
with torch.amp.autocast(device_type="cuda"):
    logits = model(x)
```

Mixed precision can reduce memory use and increase throughput. It can also introduce numerical issues if unstable operations are forced into low precision. PyTorch autocast keeps many sensitive operations in safer precision automatically.
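
The effect can be observed by checking output dtypes inside and outside the autocast region. A sketch, assuming a CUDA device and the `model` and `x` from the loop above:

```python
with torch.amp.autocast(device_type="cuda"):
    h = model(x)
    print(h.dtype)        # typically torch.float16: linear layers run in reduced precision

print(model(x).dtype)     # outside autocast the output stays torch.float32
```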

### Device-Agnostic Code

Good PyTorch code avoids hardcoding `"cuda"` inside model logic.

Less flexible:

```python id="ly0osr"
mask = torch.ones(T, T, device="cuda")
```

More flexible:

```python id="b43037"
mask = torch.ones(T, T, device=x.device)
```

This works on CPU, CUDA, and other supported accelerators.

For dtype:

```python id="iqqb6p"
scale = torch.tensor(0.5, device=x.device, dtype=x.dtype)
```

Or use Python scalars when possible:

```python id="5ezv1n"
y = x * 0.5
```

Python scalars are promoted inside the operation and carry no device of their own, so they do not introduce device mismatches.

### Model Parameters and Buffers

A PyTorch module has parameters and buffers.

Parameters are learned by optimization:

```python id="yvizzw"
model = torch.nn.Linear(128, 10)

for name, param in model.named_parameters():
    print(name, param.shape, param.device, param.dtype)
```

Buffers are tensors stored in the model but not optimized. Batch normalization running statistics are common examples.

```python id="n6hj43"
for name, buffer in model.named_buffers():
    print(name, buffer.shape)
```

When calling:

```python id="1aprie"
model.to(device)
```

PyTorch moves both parameters and buffers. This is one reason model state should be registered properly rather than stored as ordinary unregistered tensors.

### Registering Buffers

Suppose a model needs a fixed tensor, such as a positional encoding or mask. If it should move with the model but should not be optimized, register it as a buffer.

```python id="y5sa1s"
import torch
import torch.nn as nn

class ScaleModel(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

        scale = torch.ones(dim)
        self.register_buffer("scale", scale)

    def forward(self, x):
        return self.linear(x) * self.scale
```

Now:

```python id="q1yn14"
model = ScaleModel(128).to(device)
```

Both `linear` parameters and `scale` buffer move to the target device.
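
The buffer shows up in `named_buffers` and in the `state_dict`, but not among the parameters:

```python
print(model.scale.device)                          # same device as the rest of the model
print("scale" in dict(model.named_buffers()))      # True
print("scale" in dict(model.named_parameters()))   # False: it is not optimized
print("scale" in model.state_dict())               # True: buffers are saved with the model
```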

### Common dtype and Device Errors

Common errors include:

| Error type | Cause | Fix |
|---|---|---|
| Device mismatch | CPU tensor used with GPU tensor | Move tensors to same device |
| Wrong label dtype | Labels are float for class indices | Use `labels.long()` |
| Wrong mask dtype | Mask stored as float or int unexpectedly | Use `mask.bool()` when needed |
| Unregistered tensor | Tensor attribute does not move with model | Use `register_buffer` |
| Silent dtype promotion | Mixed int and float tensors | Make dtype explicit |
| Low-precision instability | `float16` overflow or underflow | Use autocast, scaling, or `bfloat16` |

Most errors can be diagnosed by printing:

```python id="p2bmv0"
print(x.shape, x.dtype, x.device)
```

For model parameters:

```python id="zfvsgz"
next(model.parameters()).device
```
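
A small helper along these lines (illustrative, not part of PyTorch) keeps the checks consistent:

```python
def describe(name, t):
    # summarize the three properties behind most dtype/device bugs
    print(f"{name}: shape={tuple(t.shape)} dtype={t.dtype} device={t.device}")

describe("x", x)
describe("first parameter", next(model.parameters()))
```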

### Small Training Example

The following code keeps dtype and device explicit.

```python id="921sk9"
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

X = torch.randn(32, 128, dtype=torch.float32, device=device)
y = torch.randint(0, 10, (32,), dtype=torch.long, device=device)

logits = model(X)
loss = F.cross_entropy(logits, y)

optimizer.zero_grad()
loss.backward()
optimizer.step()

print(loss.item())
```

Shape, dtype, and device roles:

| Tensor | Shape | dtype | device | Role |
|---|---:|---|---|---|
| `X` | `[32, 128]` | `float32` | `device` | Input features |
| `y` | `[32]` | `long` | `device` | Class labels |
| `weight` | `[10, 128]` | `float32` | `device` | Learned parameter |
| `bias` | `[10]` | `float32` | `device` | Learned parameter |
| `logits` | `[32, 10]` | `float32` | `device` | Class scores |
| `loss` | `[]` | `float32` | `device` | Scalar objective |

### Summary

A tensor’s dtype controls numerical representation. A tensor’s device controls where computation happens. These properties are as important as shape in practical PyTorch programming.

Floating-point tensors are used for inputs, activations, and parameters. Integer tensors are used for labels and token IDs. Boolean tensors are used for masks. CPU tensors and GPU tensors cannot usually participate in the same operation unless moved to a common device.

Reliable PyTorch code makes dtype and device choices explicit, creates helper tensors on the correct device, registers persistent non-parameter tensors as buffers, and uses mixed precision through supported PyTorch mechanisms rather than manual low-precision conversion.

