Tensor Creation and Initialization

Neural networks start with tensors. Some tensors come from data. Others are created by the program: weights, biases, masks, counters, labels, random noise, and temporary buffers. PyTorch provides several ways to create tensors, and each choice controls shape, data type, device placement, and initialization.

Tensor creation looks simple on the surface, but it affects numerical behavior and training stability. Poor initialization can make a network train slowly, diverge, or produce useless gradients.

Creating Tensors from Python Data

The most direct way to create a tensor is torch.tensor().

import torch

x = torch.tensor([1.0, 2.0, 3.0])
print(x)
print(x.shape)
print(x.dtype)

Output:

tensor([1., 2., 3.])
torch.Size([3])
torch.float32

A nested Python list creates a higher-rank tensor:

A = torch.tensor([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
])

print(A.shape)  # torch.Size([2, 3])

PyTorch infers the data type from the input unless a dtype is given explicitly:

labels = torch.tensor([0, 1, 2], dtype=torch.long)
scores = torch.tensor([0.2, 0.7, 0.1], dtype=torch.float32)

Class labels for classification are commonly stored as torch.long, an alias for torch.int64 (64-bit signed integers).
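
A quick check confirms the alias, and shows that PyTorch infers int64 for Python integer lists on its own:

print(torch.tensor([0, 1, 2]).dtype)                    # torch.int64 (inferred)
print(torch.tensor([0, 1, 2], dtype=torch.long).dtype)  # torch.int64
print(torch.long == torch.int64)                        # True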

Creating Tensors with a Fixed Shape

Most tensors are created by specifying a shape.

x = torch.zeros(3, 4)
y = torch.ones(3, 4)
z = torch.empty(3, 4)

torch.zeros() fills the tensor with zeros. torch.ones() fills it with ones. torch.empty() allocates memory without initializing the values.

print(torch.empty(2, 3))

The result may contain arbitrary values already present in memory. Use empty() only when you will overwrite every element before reading it.
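
One safe pattern, as a minimal sketch: allocate the buffer, then write every element before anything reads it.

buf = torch.empty(3, 4)
buf.fill_(0.5)   # overwrite every element before any read
print(buf)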

Common constructors:

Function                            Meaning
torch.zeros(shape)                  Tensor filled with zeros
torch.ones(shape)                   Tensor filled with ones
torch.empty(shape)                  Uninitialized tensor
torch.full(shape, value)            Tensor filled with a constant
torch.arange(start, end)            Evenly spaced integer-like sequence
torch.linspace(start, end, steps)   Evenly spaced floating-point sequence
torch.eye(n)                        Identity matrix

Examples:

bias = torch.zeros(64)
scale = torch.ones(64)
mask = torch.full((8, 8), float("-inf"))
positions = torch.arange(0, 128)
grid = torch.linspace(-1.0, 1.0, steps=100)
I = torch.eye(4)

Shape-Like Tensor Creation

Often we want to create a tensor with the same shape, dtype, and device as another tensor.

PyTorch provides *_like functions:

x = torch.randn(32, 64, device="cuda", dtype=torch.float16)

a = torch.zeros_like(x)
b = torch.ones_like(x)
c = torch.empty_like(x)
r = torch.randn_like(x)

These functions reduce bugs because they preserve the properties of the reference tensor.

This is especially useful when writing code that should work on both CPU and GPU:

def add_noise(x, std):
    noise = torch.randn_like(x) * std
    return x + noise

The noise tensor is created on the same device as x.
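
A usage sketch: the same function works unchanged whether the input lives on the CPU or a GPU, because randn_like inherits the device.

x_cpu = torch.zeros(4)
print(add_noise(x_cpu, std=0.1).device)  # cpu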

Random Tensor Creation

Deep learning uses random tensors for initialization, dropout, augmentation, sampling, and generative modeling.

Common random constructors:

Function                          Distribution
torch.rand(shape)                 Uniform on [0, 1)
torch.randn(shape)                Standard normal
torch.randint(low, high, shape)   Random integers
torch.bernoulli(p)                Bernoulli samples
torch.multinomial(probs, n)       Samples from categorical probabilities

Examples:

u = torch.rand(3, 4)
n = torch.randn(3, 4)
ids = torch.randint(0, 1000, (32,))
keep = torch.bernoulli(torch.full((10,), 0.8))
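
The table above also lists torch.multinomial(), which draws sample indices from categorical probabilities; a minimal sketch:

probs = torch.tensor([0.1, 0.2, 0.7])
draws = torch.multinomial(probs, num_samples=5, replacement=True)
print(draws)  # indices 0, 1, or 2, with 2 the most likely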

torch.randn() samples from a normal distribution with mean 0 and variance 1:

x_i \sim \mathcal{N}(0, 1).

Random tensors are not just implementation details. Initialization scale, noise scale, and sampling distribution directly affect model behavior.

Reproducibility and Random Seeds

To make random results repeatable, set a random seed:

torch.manual_seed(42)

x1 = torch.randn(3)
x2 = torch.randn(3)

Running the same code again with the same seed produces the same random sequence.
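
For randomness that is reproducible without touching global state, PyTorch also accepts an explicit torch.Generator:

g = torch.Generator().manual_seed(42)
x = torch.randn(3, generator=g)  # reproducible, independent of the global seed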

For full reproducibility, especially on GPUs, more settings may be needed:

torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Deterministic settings can reduce performance. In research, reproducibility may matter more than speed. In production, speed may matter more than exact repeatability.
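
Newer PyTorch versions also provide a single switch that makes operations raise an error when no deterministic implementation exists:

torch.use_deterministic_algorithms(True)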

Devices: CPU, GPU, and Accelerators

A tensor lives on a device. By default, tensors are created on the CPU.

x = torch.randn(3, 4)
print(x.device)  # cpu

To create a tensor on a GPU:

x = torch.randn(3, 4, device="cuda")

To move a tensor:

x = x.to("cuda")

A common pattern is to choose the device once:

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(32, 128, device=device)

All tensors used in the same operation must usually be on the same device.

x = torch.randn(3, device="cuda")
y = torch.randn(3, device="cpu")

# x + y fails because the tensors are on different devices.

Device mismatches are common when manually creating masks, labels, or temporary tensors inside a model.
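
One defensive pattern, sketched here with a hypothetical make_position_ids() helper, is to pass device=x.device whenever a fresh tensor is created inside a model:

def make_position_ids(x):
    # torch.arange would default to the CPU; inherit the device from x instead.
    return torch.arange(x.shape[1], device=x.device)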

Data Types During Creation

The dtype controls numerical representation.

x = torch.randn(3, 4, dtype=torch.float32)
y = torch.randn(3, 4, dtype=torch.float16)
z = torch.tensor([1, 2, 3], dtype=torch.long)

Common choices:

dtype            Use
torch.float32    Default training and inference
torch.float16    Mixed precision on GPUs
torch.bfloat16   Large model training on supported hardware
torch.float64    High-precision scientific computation
torch.long       Class labels and token IDs
torch.bool       Masks

Model weights and activations are usually floating-point tensors. Token IDs and labels are integer tensors. Masks are often Boolean tensors.
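
A brief sketch of a Boolean mask in use, first selecting elements and then blanking the rest:

scores = torch.randn(5)
mask = scores > 0
print(scores[mask])                              # keep only positive entries
print(scores.masked_fill(~mask, float("-inf")))  # replace the rest with -inf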

Parameter Initialization

A neural network parameter starts with an initial value. Training modifies that value using gradients.

For a linear layer,

y = x W^\top + b,

the learnable parameters are W and b.

In PyTorch:

layer = torch.nn.Linear(128, 64)

print(layer.weight.shape)  # torch.Size([64, 128])
print(layer.bias.shape)    # torch.Size([64])

PyTorch initializes built-in layers automatically. However, custom models often require explicit initialization.

import torch.nn as nn

layer = nn.Linear(128, 64)

nn.init.zeros_(layer.bias)
nn.init.normal_(layer.weight, mean=0.0, std=0.02)

The trailing underscore means the operation modifies the tensor in place.
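
A quick sanity check that the in-place initialization took effect:

print(layer.bias.abs().max().item())  # 0.0
print(layer.weight.std().item())      # approximately 0.02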

Why Initialization Matters

If weights are too small, activations and gradients may shrink as they pass through layers. If weights are too large, activations and gradients may explode.

Consider a deep network with many layers. Each layer multiplies the signal by a weight matrix. Poor scale choices compound across depth.

A useful initialization keeps the variance of activations approximately stable from layer to layer. It also keeps the variance of gradients stable during backpropagation.

The practical goal is simple: start training in a numerical regime where information and gradients can flow.
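
A small experiment makes the compounding concrete. The width of 512 and depth of 20 below are arbitrary choices for illustration: a scale near \sqrt{2 / n_\text{in}} keeps the activations alive, while scales that are too small or too large shrink or blow them up.

torch.manual_seed(0)
width, depth = 512, 20
x = torch.randn(width)
for scale in (0.01, (2.0 / width) ** 0.5, 0.2):
    h = x
    for _ in range(depth):
        # Each layer multiplies by a random matrix and applies ReLU.
        W = torch.randn(width, width) * scale
        h = torch.relu(W @ h)
    print(f"scale={scale:.4f}  activation std after {depth} layers: {h.std().item():.3e}")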

Xavier Initialization

Xavier initialization, also called Glorot initialization, is commonly used with sigmoid or tanh-style activations.

For a layer with fan-in n_\text{in} and fan-out n_\text{out}, Xavier uniform initialization samples

W_{ij} \sim U(-a, a)

where

a = \sqrt{\frac{6}{n_\text{in} + n_\text{out}}}.

In PyTorch:

layer = torch.nn.Linear(128, 64)

torch.nn.init.xavier_uniform_(layer.weight)
torch.nn.init.zeros_(layer.bias)

Xavier normal initialization samples from a normal distribution with variance

\frac{2}{n_\text{in} + n_\text{out}}.

torch.nn.init.xavier_normal_(layer.weight)
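
As a sanity check, the empirical standard deviation of the weight should be close to \sqrt{2 / (n_\text{in} + n_\text{out})}; a small sketch for the layer above:

import math

expected = math.sqrt(2.0 / (128 + 64))
print(expected)                   # approximately 0.102
print(layer.weight.std().item())  # should be close to the value above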

Kaiming Initialization

Kaiming initialization, also called He initialization, is commonly used with ReLU and ReLU-like activations.

The main idea is to account for the fact that ReLU sets all negative inputs to zero, which roughly halves the variance of the activations; the factor of 2 compensates for this. For a layer with fan-in n_\text{in}, Kaiming normal initialization uses approximately

W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_\text{in}}\right).

In PyTorch:

layer = torch.nn.Linear(128, 64)

torch.nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
torch.nn.init.zeros_(layer.bias)

For convolutional layers:

conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)

torch.nn.init.kaiming_normal_(conv.weight, nonlinearity="relu")
torch.nn.init.zeros_(conv.bias)

Kaiming initialization is a strong default for many ReLU-based networks.

Initializing Embeddings

Embedding layers map integer IDs to vectors.

embedding = torch.nn.Embedding(num_embeddings=50_000, embedding_dim=768)

The weight matrix has shape:

print(embedding.weight.shape)
# torch.Size([50000, 768])

A common initialization is a small normal distribution:

torch.nn.init.normal_(embedding.weight, mean=0.0, std=0.02)

Large language models often use this kind of small-scale normal initialization, although exact details vary by architecture.
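
A lookup sketch: passing a tensor of IDs returns one row of the weight matrix per ID.

ids = torch.tensor([3, 14, 159])  # a small batch of token IDs
vectors = embedding(ids)
print(vectors.shape)              # torch.Size([3, 768])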

Initializing Normalization Layers

Normalization layers usually have scale and shift parameters.

For layer normalization:

norm = torch.nn.LayerNorm(768)

print(norm.weight.shape)
print(norm.bias.shape)

A common initialization is:

torch.nn.init.ones_(norm.weight)
torch.nn.init.zeros_(norm.bias)

With the scale set to one and the shift set to zero, the layer starts out as pure normalization: it neither rescales nor shifts its output beyond the normalization itself.
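
A brief check of that neutrality: with this initialization the output is just the normalized input, with per-row mean near 0 and standard deviation near 1.

x = torch.randn(2, 768) * 5.0 + 3.0  # arbitrary scale and shift
out = norm(x)
print(out.mean(dim=-1))              # approximately 0
print(out.std(dim=-1))               # approximately 1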

Custom Initialization Functions

For larger models, it is useful to define an initialization function and apply it recursively.

import torch
import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

    elif isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

model.apply(init_weights)

The apply() method visits every submodule in the model.

Initialization and Autograd

Parameter initialization should usually happen without tracking gradients.

PyTorch initialization functions already operate safely in place. When manually assigning values, use torch.no_grad():

layer = torch.nn.Linear(128, 64)

with torch.no_grad():
    layer.weight.normal_(0.0, 0.02)
    layer.bias.zero_()

This prevents initialization operations from becoming part of the computation graph.
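
The same guard applies when copying externally computed values into a parameter; here the pretrained tensor is only a stand-in for real weights:

with torch.no_grad():
    pretrained = torch.randn(64, 128) * 0.02  # stand-in for real pretrained weights
    layer.weight.copy_(pretrained)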

Practical Defaults

Good defaults depend on the architecture.

Model component         Common initialization
Linear with ReLU        Kaiming normal or uniform
Linear with tanh        Xavier normal or uniform
Convolution with ReLU   Kaiming normal
Bias terms              Zeros
LayerNorm scale         Ones
LayerNorm bias          Zeros
Embeddings              Small normal distribution
Final classifier        Often default or smaller scale

Initialization should be treated as part of the model design. It interacts with activation functions, normalization, residual connections, depth, optimizer choice, and learning rate.

Summary

Tensor creation controls shape, dtype, device, and initial values. PyTorch provides constructors for constants, ranges, random samples, and tensors based on existing tensors.

Initialization controls the starting point of learning. Good initialization keeps activations and gradients in a stable range. Xavier initialization is commonly used for sigmoid or tanh networks. Kaiming initialization is commonly used for ReLU networks. Biases are often initialized to zero, while normalization scale parameters are often initialized to one.

A reliable PyTorch model begins with tensors that have the intended shape, dtype, device, and numerical scale.