# Tensor Creation and Initialization

Neural networks start with tensors. Some tensors come from data. Others are created by the program: weights, biases, masks, counters, labels, random noise, and temporary buffers. PyTorch provides several ways to create tensors, and each choice controls shape, data type, device placement, and initialization.

Tensor creation looks simple on the surface, but the choices made at creation time affect numerical behavior and training stability. Poor initialization can make a network train slowly, diverge, or produce useless gradients.

### Creating Tensors from Python Data

The most direct way to create a tensor is `torch.tensor()`.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
print(x)
print(x.shape)
print(x.dtype)
```

Output:

```python
tensor([1., 2., 3.])
torch.Size([3])
torch.float32
```

A nested Python list creates a higher-rank tensor:

```python
A = torch.tensor([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
])

print(A.shape)  # torch.Size([2, 3])
```

PyTorch infers the data type from the input unless a dtype is given explicitly:

```python
labels = torch.tensor([0, 1, 2], dtype=torch.long)
scores = torch.tensor([0.2, 0.7, 0.1], dtype=torch.float32)
```

Class labels for classification are commonly stored as `torch.long`, an alias for 64-bit signed integers (`torch.int64`).
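
A quick check of both points: integer inputs are inferred as 64-bit integers by default, and `torch.long` compares equal to `torch.int64`.

```python
inferred = torch.tensor([0, 1, 2])

print(inferred.dtype)             # torch.int64, inferred from the integer inputs
print(torch.long == torch.int64)  # True: torch.long is an alias for torch.int64
```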

### Creating Tensors with a Fixed Shape

Most tensors are created by specifying a shape.

```python
x = torch.zeros(3, 4)
y = torch.ones(3, 4)
z = torch.empty(3, 4)
```

`torch.zeros()` fills the tensor with zeros. `torch.ones()` fills it with ones. `torch.empty()` allocates memory without initializing the values.

```python
print(torch.empty(2, 3))
```

The result may contain arbitrary values already present in memory. Use `empty()` only when you will overwrite every element before reading it.
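
A minimal sketch of the overwrite-before-read pattern: allocate an output buffer with `empty()`, then write every element before the buffer is used.

```python
# Pre-allocate a buffer, then fill every row before reading it.
out = torch.empty(4, 3)
for i in range(4):
    out[i] = torch.full((3,), float(i))

print(out)
```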

Common constructors:

| Function | Meaning |
|---|---|
| `torch.zeros(shape)` | Tensor filled with zeros |
| `torch.ones(shape)` | Tensor filled with ones |
| `torch.empty(shape)` | Uninitialized tensor |
| `torch.full(shape, value)` | Tensor filled with a constant |
| `torch.arange(start, end)` | Sequence from `start` up to, but not including, `end`, with a fixed step |
| `torch.linspace(start, end, steps)` | `steps` evenly spaced values from `start` to `end`, inclusive |
| `torch.eye(n)` | Identity matrix |

Examples:

```python
bias = torch.zeros(64)
scale = torch.ones(64)
mask = torch.full((8, 8), float("-inf"))
positions = torch.arange(0, 128)
grid = torch.linspace(-1.0, 1.0, steps=100)
I = torch.eye(4)
```

### Creating Tensors Like an Existing Tensor

Often we want to create a tensor with the same shape, dtype, and device as another tensor.

PyTorch provides `*_like` functions:

```python
x = torch.randn(32, 64, device="cuda", dtype=torch.float16)

a = torch.zeros_like(x)
b = torch.ones_like(x)
c = torch.empty_like(x)
r = torch.randn_like(x)
```

These functions reduce bugs because they preserve the shape, dtype, and device of the reference tensor.

This is especially useful when writing code that should work on both CPU and GPU:

```python
def add_noise(x, std):
    noise = torch.randn_like(x) * std
    return x + noise
```

The noise tensor is created on the same device as `x`.
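
A small usage check (the input tensor and noise scale here are illustrative): the output keeps the shape, dtype, and device of the input.

```python
x = torch.randn(4, 8)          # a CPU float32 tensor in this example
noisy = add_noise(x, std=0.1)

print(noisy.shape)   # torch.Size([4, 8])
print(noisy.dtype)   # torch.float32, same as x
print(noisy.device)  # cpu, same as x
```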

### Random Tensor Creation

Deep learning uses random tensors for initialization, dropout, augmentation, sampling, and generative modeling.

Common random constructors:

| Function | Distribution |
|---|---|
| `torch.rand(shape)` | Uniform on \([0, 1)\) |
| `torch.randn(shape)` | Standard normal |
| `torch.randint(low, high, shape)` | Random integers in \([low, high)\) |
| `torch.bernoulli(p)` | Bernoulli samples from a tensor of probabilities |
| `torch.multinomial(probs, n)` | `n` samples from the categorical distribution defined by `probs` |

Examples:

```python
u = torch.rand(3, 4)
n = torch.randn(3, 4)
ids = torch.randint(0, 1000, (32,))
keep = torch.bernoulli(torch.full((10,), 0.8))
```
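
The table also lists `torch.multinomial()`, which does not appear above. A small sketch of categorical sampling, with illustrative probabilities:

```python
probs = torch.tensor([0.1, 0.2, 0.7])

# Draw five category indices, sampling with replacement according to probs.
samples = torch.multinomial(probs, num_samples=5, replacement=True)
print(samples)  # e.g. tensor([2, 2, 1, 2, 0])
```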

`torch.randn()` samples from a normal distribution with mean 0 and variance 1:

$$
x_i \sim \mathcal{N}(0, 1).
$$
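
A quick empirical check on a large sample (the sample size is arbitrary):

```python
n = torch.randn(1_000_000)

print(n.mean())  # close to 0
print(n.std())   # close to 1
```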

Random tensors are not just implementation details. Initialization scale, noise scale, and sampling distribution directly affect model behavior.

### Reproducibility and Random Seeds

To make random results repeatable, set a random seed:

```python
torch.manual_seed(42)

x1 = torch.randn(3)
x2 = torch.randn(3)
```

Running the same code again with the same seed produces the same random sequence.
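
A small sketch that makes the repeatability explicit by reseeding and comparing the draws:

```python
torch.manual_seed(42)
a = torch.randn(3)

torch.manual_seed(42)
b = torch.randn(3)

print(torch.equal(a, b))  # True: same seed, same random sequence
```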

For full reproducibility, especially on GPUs, more settings may be needed:

```python
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```

Deterministic settings can reduce performance. In research, reproducibility may matter more than speed. In production, speed may matter more than exact repeatability.

### Devices: CPU, GPU, and Accelerators

A tensor lives on a device. By default, tensors are created on the CPU.

```python
x = torch.randn(3, 4)
print(x.device)  # cpu
```

To create a tensor on a GPU:

```python
x = torch.randn(3, 4, device="cuda")
```

To move a tensor:

```python
x = x.to("cuda")
```

A common pattern is to choose the device once:

```python
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(32, 128, device=device)
```

All tensors used in the same operation must usually be on the same device.

```python
x = torch.randn(3, device="cuda")
y = torch.randn(3, device="cpu")

# x + y fails because the tensors are on different devices.
```

Device mismatches are common when manually creating masks, labels, or temporary tensors inside a model.
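
A common fix is to create or move such tensors using the device of an existing tensor. A small sketch, assuming a CUDA device is available as in the snippet above:

```python
x = torch.randn(3, device="cuda")
y = torch.randn(3)        # created on the CPU by default

y = y.to(x.device)        # move y to whatever device x lives on
print(x + y)              # both tensors are now on the same device
```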

### Data Types During Creation

The dtype controls numerical representation.

```python
x = torch.randn(3, 4, dtype=torch.float32)
y = torch.randn(3, 4, dtype=torch.float16)
z = torch.tensor([1, 2, 3], dtype=torch.long)
```

Common choices:

| dtype | Use |
|---|---|
| `torch.float32` | Default training and inference |
| `torch.float16` | Mixed precision on GPUs |
| `torch.bfloat16` | Large model training on supported hardware |
| `torch.float64` | High-precision scientific computation |
| `torch.long` | Class labels and token IDs |
| `torch.bool` | Masks |

Model weights and activations are usually floating-point tensors. Token IDs and labels are integer tensors. Masks are often Boolean tensors.
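
A small sketch that combines these dtypes (the token IDs and padding convention are illustrative): integer IDs, a Boolean padding mask derived from them, and floating-point scores masked before a softmax.

```python
token_ids = torch.tensor([5, 17, 42, 0, 0], dtype=torch.long)
pad_mask = token_ids == 0               # torch.bool mask of padding positions
scores = torch.randn(5)                 # float32 scores

# Exclude padded positions, e.g. before applying a softmax.
masked = scores.masked_fill(pad_mask, float("-inf"))
print(pad_mask.dtype)  # torch.bool
print(masked)
```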

### Parameter Initialization

A neural network parameter starts with an initial value. Training modifies that value using gradients.

For a linear layer,

$$
y = xW^\top + b,
$$

the learnable parameters are \(W\) and \(b\).

In PyTorch:

```python
layer = torch.nn.Linear(128, 64)

print(layer.weight.shape)  # torch.Size([64, 128])
print(layer.bias.shape)    # torch.Size([64])
```

PyTorch initializes built-in layers automatically. However, custom models often require explicit initialization.

```python
import torch.nn as nn

layer = nn.Linear(128, 64)

nn.init.zeros_(layer.bias)
nn.init.normal_(layer.weight, mean=0.0, std=0.02)
```

The trailing underscore means the operation modifies the tensor in place.

### Why Initialization Matters

If weights are too small, activations and gradients may shrink as they pass through layers. If weights are too large, activations and gradients may explode.

Consider a deep network with many layers. Each layer multiplies the signal by a weight matrix. Poor scale choices compound across depth.

A useful initialization keeps the variance of activations approximately stable from layer to layer. It also keeps the variance of gradients stable during backpropagation.

The practical goal is simple: start training in a numerical regime where information and gradients can flow.
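
A small numerical sketch of the compounding effect, using a stack of plain matrix multiplications with ReLU and three weight scales (the width, depth, and scales are arbitrary). A scale near \(\sqrt{2}\) keeps the activation magnitude roughly stable, while smaller or larger scales shrink it toward zero or blow it up.

```python
import torch

torch.manual_seed(0)

width, depth = 256, 30
x0 = torch.randn(1024, width)

for scale in (0.5, 2 ** 0.5, 2.0):
    x = x0
    for _ in range(depth):
        W = torch.randn(width, width) * scale / width ** 0.5
        x = torch.relu(x @ W)
    print(f"scale {scale:.2f}: activation std after {depth} layers ~ {x.std().item():.2e}")
```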

### Xavier Initialization

Xavier initialization, also called Glorot initialization, is commonly used with sigmoid or tanh-style activations.

For a layer with fan-in \(n_\text{in}\) and fan-out \(n_\text{out}\), Xavier uniform initialization samples

$$
W_{ij} \sim U\left(-a, a\right)
$$

where

$$
a = \sqrt{\frac{6}{n_\text{in} + n_\text{out}}}.
$$

In PyTorch:

```python
layer = torch.nn.Linear(128, 64)

torch.nn.init.xavier_uniform_(layer.weight)
torch.nn.init.zeros_(layer.bias)
```

Xavier normal initialization samples from a normal distribution with variance

$$
\frac{2}{n_\text{in} + n_\text{out}}.
$$

```python
torch.nn.init.xavier_normal_(layer.weight)
```
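
A quick empirical check that the sampled weights roughly match the target standard deviation \(\sqrt{2/(n_\text{in} + n_\text{out})}\):

```python
layer = torch.nn.Linear(128, 64)
torch.nn.init.xavier_normal_(layer.weight)

target_std = (2.0 / (128 + 64)) ** 0.5
print(target_std)                 # ~ 0.102
print(layer.weight.std().item())  # close to the target value
```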

### Kaiming Initialization

Kaiming initialization, also called He initialization, is commonly used with ReLU and ReLU-like activations.

The main idea is that ReLU zeroes out roughly half of its inputs, which halves the variance of the activations; the factor of 2 in the variance compensates for that loss. For a layer with fan-in \(n_\text{in}\), Kaiming normal initialization uses approximately

$$
W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_\text{in}}\right).
$$

In PyTorch:

```python
layer = torch.nn.Linear(128, 64)

torch.nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
torch.nn.init.zeros_(layer.bias)
```

For convolutional layers:

```python
conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)

torch.nn.init.kaiming_normal_(conv.weight, nonlinearity="relu")
torch.nn.init.zeros_(conv.bias)
```

Kaiming initialization is a strong default for many ReLU-based networks.
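
As a quick check on the convolution above: for a `Conv2d`, the fan-in is `in_channels * kernel_height * kernel_width`, so the target standard deviation here is \(\sqrt{2/(3 \cdot 3 \cdot 3)} \approx 0.27\).

```python
conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)
torch.nn.init.kaiming_normal_(conv.weight, nonlinearity="relu")

fan_in = 3 * 3 * 3               # in_channels * kernel_height * kernel_width
print((2.0 / fan_in) ** 0.5)     # ~ 0.272
print(conv.weight.std().item())  # close to the target value
```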

### Initializing Embeddings

Embedding layers map integer IDs to vectors.

```python
embedding = torch.nn.Embedding(num_embeddings=50_000, embedding_dim=768)
```

The weight matrix has shape:

```python
print(embedding.weight.shape)
# torch.Size([50000, 768])
```

A common initialization is a small normal distribution:

```python
torch.nn.init.normal_(embedding.weight, mean=0.0, std=0.02)
```

Large language models often use this kind of small-scale normal initialization, although exact details vary by architecture.

### Initializing Normalization Layers

Normalization layers usually have scale and shift parameters.

For layer normalization:

```python
norm = torch.nn.LayerNorm(768)

print(norm.weight.shape)
print(norm.bias.shape)
```

A common initialization is:

```python
torch.nn.init.ones_(norm.weight)
torch.nn.init.zeros_(norm.bias)
```

With scale one and shift zero, the affine part of the layer initially acts as the identity, so at the start of training the layer only normalizes its input. This matches PyTorch's default initialization for `LayerNorm`.

### Custom Initialization Functions

For larger models, it is useful to define an initialization function and apply it recursively.

```python
import torch
import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

    elif isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

model.apply(init_weights)
```

The `apply()` method visits every submodule in the model.
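
A quick check, continuing the example above, that the nested layers were actually visited: after `apply()`, both `Linear` biases are zero.

```python
print(torch.all(model[0].bias == 0).item())  # True
print(torch.all(model[2].bias == 0).item())  # True
```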

### Initialization and Autograd

Parameter initialization should usually happen without tracking gradients.

The functions in `torch.nn.init` already run under `torch.no_grad()` internally. When assigning or modifying parameter values manually, wrap the operations in `torch.no_grad()`:

```python
layer = torch.nn.Linear(128, 64)

with torch.no_grad():
    layer.weight.normal_(0.0, 0.02)
    layer.bias.zero_()
```

This prevents initialization operations from becoming part of the computation graph.
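
Without `torch.no_grad()`, the same in-place calls fail, because a parameter is a leaf tensor that requires gradients. A small sketch of that failure mode; the exact error message may vary across PyTorch versions.

```python
layer = torch.nn.Linear(128, 64)

try:
    layer.weight.normal_(0.0, 0.02)  # in-place op on a leaf tensor that requires grad
except RuntimeError as err:
    print(err)  # e.g. "a leaf Variable that requires grad is being used in an in-place operation"
```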

### Practical Defaults

Good defaults depend on the architecture.

| Model component | Common initialization |
|---|---|
| Linear with ReLU | Kaiming normal or uniform |
| Linear with tanh | Xavier normal or uniform |
| Convolution with ReLU | Kaiming normal |
| Bias terms | Zeros |
| LayerNorm scale | Ones |
| LayerNorm bias | Zeros |
| Embeddings | Small normal distribution |
| Final classifier | Often default or smaller scale |

Initialization should be treated as part of the model design. It interacts with activation functions, normalization, residual connections, depth, optimizer choice, and learning rate.

### Summary

Tensor creation controls shape, dtype, device, and initial values. PyTorch provides constructors for constants, ranges, random samples, and tensors based on existing tensors.

Initialization controls the starting point of learning. Good initialization keeps activations and gradients in a stable range. Xavier initialization is commonly used for sigmoid or tanh networks. Kaiming initialization is commonly used for ReLU networks. Biases are often initialized to zero, while normalization scale parameters are often initialized to one.

A reliable PyTorch model begins with tensors that have the intended shape, dtype, device, and numerical scale.

