Tensor Creation and Initialization

Neural networks start with tensors. Some tensors come from data. Others are created by the program: weights, biases, masks, counters, labels, random noise, and temporary buffers. PyTorch provides several ways to create tensors, and each choice controls shape, data type, device placement, and initialization.

Tensor creation looks simple on the surface, but it affects numerical behavior and training stability. Poor initialization can make a network train slowly, diverge, or produce useless gradients.

Creating Tensors from Python Data

The most direct way to create a tensor is torch.tensor().

import torch

x = torch.tensor([1.0, 2.0, 3.0])
print(x)
print(x.shape)
print(x.dtype)

Output:

tensor([1., 2., 3.])
torch.Size([3])
torch.float32

A nested Python list creates a higher-rank tensor:

A = torch.tensor([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
])

print(A.shape)  # torch.Size([2, 3])

PyTorch infers the data type from the input unless a dtype is given explicitly:

labels = torch.tensor([0, 1, 2], dtype=torch.long)
scores = torch.tensor([0.2, 0.7, 0.1], dtype=torch.float32)

Class labels for classification are commonly stored as torch.long, an alias for torch.int64 (64-bit signed integers).
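
A quick check confirms the alias, and shows that PyTorch infers int64 for Python integer lists on its own:

print(torch.tensor([0, 1, 2]).dtype)                    # torch.int64 (inferred)
print(torch.tensor([0, 1, 2], dtype=torch.long).dtype)  # torch.int64
print(torch.long == torch.int64)                        # True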

Creating Tensors with a Fixed Shape

Most tensors are created by specifying a shape.

x = torch.zeros(3, 4)
y = torch.ones(3, 4)
z = torch.empty(3, 4)

torch.zeros() fills the tensor with zeros. torch.ones() fills it with ones. torch.empty() allocates memory without initializing the values.

print(torch.empty(2, 3))

The result may contain arbitrary values already present in memory. Use empty() only when you will overwrite every element before reading it.
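
One safe pattern, as a minimal sketch: allocate the buffer, then write every element before anything reads it.

buf = torch.empty(3, 4)
buf.fill_(0.5)   # overwrite every element before any read
print(buf)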

Common constructors:

Function                            Meaning
torch.zeros(shape)                  Tensor filled with zeros
torch.ones(shape)                   Tensor filled with ones
torch.empty(shape)                  Uninitialized tensor
torch.full(shape, value)            Tensor filled with a constant
torch.arange(start, end)            Evenly spaced integer-like sequence
torch.linspace(start, end, steps)   Evenly spaced floating-point sequence
torch.eye(n)                        Identity matrix

Examples:

bias = torch.zeros(64)
scale = torch.ones(64)
mask = torch.full((8, 8), float("-inf"))
positions = torch.arange(0, 128)
grid = torch.linspace(-1.0, 1.0, steps=100)
I = torch.eye(4)

Shape-Like Tensor Creation

Often we want to create a tensor with the same shape, dtype, and device as another tensor.

PyTorch provides *_like functions:

x = torch.randn(32, 64, device="cuda", dtype=torch.float16)

a = torch.zeros_like(x)
b = torch.ones_like(x)
c = torch.empty_like(x)
r = torch.randn_like(x)

These functions reduce bugs because they preserve the properties of the reference tensor.

This is especially useful when writing code that should work on both CPU and GPU:

def add_noise(x, std):
    noise = torch.randn_like(x) * std
    return x + noise

The noise tensor is created on the same device as x.
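
A usage sketch: the same function works unchanged whether the input lives on the CPU or a GPU, because randn_like inherits the device.

x_cpu = torch.zeros(4)
print(add_noise(x_cpu, std=0.1).device)  # cpu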

Random Tensor Creation

Deep learning uses random tensors for initialization, dropout, augmentation, sampling, and generative modeling.

Common random constructors:

Function                          Distribution
torch.rand(shape)                 Uniform on [0, 1)
torch.randn(shape)                Standard normal
torch.randint(low, high, shape)   Random integers
torch.bernoulli(p)                Bernoulli samples
torch.multinomial(probs, n)       Samples from categorical probabilities

Examples:

u = torch.rand(3, 4)
n = torch.randn(3, 4)
ids = torch.randint(0, 1000, (32,))
keep = torch.bernoulli(torch.full((10,), 0.8))
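
The table above also lists torch.multinomial(), which draws sample indices from categorical probabilities; a minimal sketch:

probs = torch.tensor([0.1, 0.2, 0.7])
draws = torch.multinomial(probs, num_samples=5, replacement=True)
print(draws)  # indices 0, 1, or 2, with 2 the most likely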

torch.randn() samples from a normal distribution with mean 0 and variance 1:

x_i \sim \mathcal{N}(0, 1).

Random tensors are not just implementation details. Initialization scale, noise scale, and sampling distribution directly affect model behavior.

Reproducibility and Random Seeds

To make random results repeatable, set a random seed:

torch.manual_seed(42)

x1 = torch.randn(3)
x2 = torch.randn(3)

Running the same code again with the same seed produces the same random sequence.
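
For randomness that is reproducible without touching global state, PyTorch also accepts an explicit torch.Generator:

g = torch.Generator().manual_seed(42)
x = torch.randn(3, generator=g)  # reproducible, independent of the global seed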

For full reproducibility, especially on GPUs, more settings may be needed:

torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Deterministic settings can reduce performance. In research, reproducibility may matter more than speed. In production, speed may matter more than exact repeatability.
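
Newer PyTorch versions also provide a single switch that makes operations raise an error when no deterministic implementation exists:

torch.use_deterministic_algorithms(True)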

Devices: CPU, GPU, and Accelerators

A tensor lives on a device. By default, tensors are created on the CPU.

x = torch.randn(3, 4)
print(x.device)  # cpu

To create a tensor on a GPU:

x = torch.randn(3, 4, device="cuda")

To move a tensor:

x = x.to("cuda")

A common pattern is to choose the device once:

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(32, 128, device=device)

All tensors used in the same operation must usually be on the same device.

x = torch.randn(3, device="cuda")
y = torch.randn(3, device="cpu")

# x + y fails because the tensors are on different devices.

Device mismatches are common when manually creating masks, labels, or temporary tensors inside a model.
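
One defensive pattern, sketched here with a hypothetical make_position_ids() helper, is to pass device=x.device whenever a fresh tensor is created inside a model:

def make_position_ids(x):
    # torch.arange would default to the CPU; inherit the device from x instead.
    return torch.arange(x.shape[1], device=x.device)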

Data Types During Creation

The dtype controls numerical representation.

x = torch.randn(3, 4, dtype=torch.float32)
y = torch.randn(3, 4, dtype=torch.float16)
z = torch.tensor([1, 2, 3], dtype=torch.long)

Common choices:

dtype            Use
torch.float32    Default training and inference
torch.float16    Mixed precision on GPUs
torch.bfloat16   Large model training on supported hardware
torch.float64    High-precision scientific computation
torch.long       Class labels and token IDs
torch.bool       Masks

Model weights and activations are usually floating-point tensors. Token IDs and labels are integer tensors. Masks are often Boolean tensors.
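
A brief sketch of a Boolean mask in use, first selecting elements and then blanking the rest:

scores = torch.randn(5)
mask = scores > 0
print(scores[mask])                              # keep only positive entries
print(scores.masked_fill(~mask, float("-inf")))  # replace the rest with -inf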

Parameter Initialization

A neural network parameter starts with an initial value. Training modifies that value using gradients.

For a linear layer,

y = x W^\top + b,

the learnable parameters are W and b.

In PyTorch:

layer = torch.nn.Linear(128, 64)

print(layer.weight.shape)  # torch.Size([64, 128])
print(layer.bias.shape)    # torch.Size([64])

PyTorch initializes built-in layers automatically. However, custom models often require explicit initialization.

import torch.nn as nn

layer = nn.Linear(128, 64)

nn.init.zeros_(layer.bias)
nn.init.normal_(layer.weight, mean=0.0, std=0.02)

The trailing underscore means the operation modifies the tensor in place.
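
A quick sanity check that the in-place initialization took effect:

print(layer.bias.abs().max().item())  # 0.0
print(layer.weight.std().item())      # approximately 0.02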

Why Initialization Matters

If weights are too small, activations and gradients may shrink as they pass through layers. If weights are too large, activations and gradients may explode.

Consider a deep network with many layers. Each layer multiplies the signal by a weight matrix. Poor scale choices compound across depth.

A useful initialization keeps the variance of activations approximately stable from layer to layer. It also keeps the variance of gradients stable during backpropagation.

The practical goal is simple: start training in a numerical regime where information and gradients can flow.
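
A small experiment makes the compounding concrete. The width of 512 and depth of 20 below are arbitrary choices for illustration: a scale near \sqrt{2 / n_\text{in}} keeps the activations alive, while scales that are too small or too large shrink or blow them up.

torch.manual_seed(0)
width, depth = 512, 20
x = torch.randn(width)
for scale in (0.01, (2.0 / width) ** 0.5, 0.2):
    h = x
    for _ in range(depth):
        # Each layer multiplies by a random matrix and applies ReLU.
        W = torch.randn(width, width) * scale
        h = torch.relu(W @ h)
    print(f"scale={scale:.4f}  activation std after {depth} layers: {h.std().item():.3e}")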

Xavier Initialization

Xavier initialization, also called Glorot initialization, is commonly used with sigmoid or tanh-style activations.

For a layer with fan-in n_\text{in} and fan-out n_\text{out}, Xavier uniform initialization samples

W_{ij} \sim U(-a, a)

where

a = \sqrt{\frac{6}{n_\text{in} + n_\text{out}}}.

In PyTorch:

layer = torch.nn.Linear(128, 64)

torch.nn.init.xavier_uniform_(layer.weight)
torch.nn.init.zeros_(layer.bias)

Xavier normal initialization samples from a normal distribution with variance

\frac{2}{n_\text{in} + n_\text{out}}.

torch.nn.init.xavier_normal_(layer.weight)
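
As a sanity check, the empirical standard deviation of the weight should be close to \sqrt{2 / (n_\text{in} + n_\text{out})}; a small sketch for the layer above:

import math

expected = math.sqrt(2.0 / (128 + 64))
print(expected)                   # approximately 0.102
print(layer.weight.std().item())  # should be close to the value above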

Kaiming Initialization

Kaiming initialization, also called He initialization, is commonly used with ReLU and ReLU-like activations.

The main idea is to account for the fact that ReLU sets all negative inputs to zero, which roughly halves the variance of the activations; the factor of 2 compensates for this. For a layer with fan-in n_\text{in}, Kaiming normal initialization uses approximately

W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_\text{in}}\right).

In PyTorch:

layer = torch.nn.Linear(128, 64)

torch.nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
torch.nn.init.zeros_(layer.bias)

For convolutional layers:

conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)

torch.nn.init.kaiming_normal_(conv.weight, nonlinearity="relu")
torch.nn.init.zeros_(conv.bias)

Kaiming initialization is a strong default for many ReLU-based networks.

Initializing Embeddings

Embedding layers map integer IDs to vectors.

embedding = torch.nn.Embedding(num_embeddings=50_000, embedding_dim=768)

The weight matrix has shape:

print(embedding.weight.shape)
# torch.Size([50000, 768])

A common initialization is a small normal distribution:

torch.nn.init.normal_(embedding.weight, mean=0.0, std=0.02)

Large language models often use this kind of small-scale normal initialization, although exact details vary by architecture.
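
A lookup sketch: passing a tensor of IDs returns one row of the weight matrix per ID.

ids = torch.tensor([3, 14, 159])  # a small batch of token IDs
vectors = embedding(ids)
print(vectors.shape)              # torch.Size([3, 768])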

Initializing Normalization Layers

Normalization layers usually have scale and shift parameters.

For layer normalization:

norm = torch.nn.LayerNorm(768)

print(norm.weight.shape)
print(norm.bias.shape)

A common initialization is:

torch.nn.init.ones_(norm.weight)
torch.nn.init.zeros_(norm.bias)

With the scale set to one and the shift set to zero, the layer starts out as pure normalization: it neither rescales nor shifts its output beyond the normalization itself.
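
A brief check of that neutrality: with this initialization the output is just the normalized input, with per-row mean near 0 and standard deviation near 1.

x = torch.randn(2, 768) * 5.0 + 3.0  # arbitrary scale and shift
out = norm(x)
print(out.mean(dim=-1))              # approximately 0
print(out.std(dim=-1))               # approximately 1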

Custom Initialization Functions

For larger models, it is useful to define an initialization function and apply it recursively.

import torch
import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

    elif isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

model.apply(init_weights)

The apply() method visits every submodule in the model.

Initialization and Autograd

Parameter initialization should usually happen without tracking gradients.

PyTorch initialization functions already operate safely in place. When manually assigning values, use torch.no_grad():

layer = torch.nn.Linear(128, 64)

with torch.no_grad():
    layer.weight.normal_(0.0, 0.02)
    layer.bias.zero_()

This prevents initialization operations from becoming part of the computation graph.
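
The same guard applies when copying externally computed values into a parameter; here the pretrained tensor is only a stand-in for real weights:

with torch.no_grad():
    pretrained = torch.randn(64, 128) * 0.02  # stand-in for real pretrained weights
    layer.weight.copy_(pretrained)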

Practical Defaults

Good defaults depend on the architecture.

Model component         Common initialization
Linear with ReLU        Kaiming normal or uniform
Linear with tanh        Xavier normal or uniform
Convolution with ReLU   Kaiming normal
Bias terms              Zeros
LayerNorm scale         Ones
LayerNorm bias          Zeros
Embeddings              Small normal distribution
Final classifier        Often default or smaller scale

Initialization should be treated as part of the model design. It interacts with activation functions, normalization, residual connections, depth, optimizer choice, and learning rate.

Summary

Tensor creation controls shape, dtype, device, and initial values. PyTorch provides constructors for constants, ranges, random samples, and tensors based on existing tensors.

Initialization controls the starting point of learning. Good initialization keeps activations and gradients in a stable range. Xavier initialization is commonly used for sigmoid or tanh networks. Kaiming initialization is commonly used for ReLU networks. Biases are often initialized to zero, while normalization scale parameters are often initialized to one.

A reliable PyTorch model begins with tensors that have the intended shape, dtype, device, and numerical scale.