Parameter Initialization

A neural network begins training with parameters that have not yet been learned from data. These initial values matter. They determine the scale of activations in the forward pass and the scale of gradients in the backward pass. Poor initialization can make training slow, unstable, or impossible.

Parameter initialization is the rule used to choose the starting values of weights and biases before optimization begins.

A layer usually computes

y = Wx + b,

where x is the input vector, W is the weight matrix, and b is the bias vector. During training, gradient descent updates W and b. Before training, however, the optimizer needs starting values.

In PyTorch, most layers initialize their parameters automatically. For example:

import torch
from torch import nn

layer = nn.Linear(128, 64)

print(layer.weight.shape)  # torch.Size([64, 128])
print(layer.bias.shape)    # torch.Size([64])

The tensor layer.weight already contains random values, and layer.bias already contains initial values as well. PyTorch chooses them according to the layer type.
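
You can check the defaults directly. In recent PyTorch releases, nn.Linear draws its weights from a Kaiming-uniform variant whose bound is roughly 1/sqrt(fan_in), so for this layer the values should lie in about (-0.088, 0.088):

print(layer.weight.min().item(), layer.weight.max().item())
print(layer.weight.std().item())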

Why Initialization Matters

Consider a deep feedforward network:

h_1 = f(W_1 x + b_1), \quad h_2 = f(W_2 h_1 + b_2), \quad \cdots, \quad h_L = f(W_L h_{L-1} + b_L).

Each layer transforms the scale of its input. If the weights are too small, activations shrink as they pass through the network. After many layers, the signal may become close to zero. If the weights are too large, activations grow as they pass through the network. After many layers, the signal may become numerically unstable.

The same issue appears in the backward pass. Gradients are repeatedly multiplied by weight matrices and activation derivatives. If the scale decreases layer after layer, gradients vanish. If the scale increases layer after layer, gradients explode.

Good initialization tries to preserve signal scale across layers.
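
A minimal sketch makes this concrete (the depth, width, and batch size below are arbitrary illustrative choices): push a batch through a stack of random linear maps and watch the activation scale.

import torch

torch.manual_seed(0)

def signal_scale(weight_std, depth=50, width=256):
    # Push a standard-normal batch through `depth` random linear maps
    # whose entries all have standard deviation `weight_std`.
    h = torch.randn(32, width)
    for _ in range(depth):
        W = torch.randn(width, width) * weight_std
        h = h @ W
    return h.std().item()

print(signal_scale(0.01))            # too small: activations collapse toward 0
print(signal_scale(0.10))            # too large: activations blow up
print(signal_scale(1 / 256 ** 0.5))  # ~1/sqrt(fan_in): scale is preserved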

Symmetry Breaking

A common mistake is to initialize all weights to zero.

Suppose every neuron in a hidden layer has the same weight vector and the same bias. Then every neuron receives the same input, produces the same activation, and receives the same gradient. Gradient descent updates them in the same way. The neurons remain identical throughout training.

This defeats the purpose of having multiple hidden units. Different neurons should be able to learn different features.

Random initialization breaks this symmetry. Each neuron begins with slightly different weights, so different neurons can follow different gradient paths.
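
A small sketch shows the lockstep behavior (the constant value and layer sizes are arbitrary): when every parameter starts at the same constant, all hidden units receive identical gradients.

import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
for p in net.parameters():
    nn.init.constant_(p, 0.5)  # deliberately symmetric start

x = torch.randn(8, 4)
net(x).mean().backward()

# All three rows are identical: the hidden units get the same update
# and remain copies of each other after every step.
print(net[0].weight.grad)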

Biases are often initialized to zero because weights already break symmetry:

layer = nn.Linear(128, 64)

nn.init.zeros_(layer.bias)

Zero bias initialization is usually safe. Zero weight initialization is usually harmful for hidden layers.

Variance Preservation

Initialization is usually designed by asking a scale question: if the input to a layer has variance near 1, what weight variance keeps the output variance near 1?

Let

z = Wx.

For one output unit,

z_i = \sum_{j=1}^{n} W_{ij} x_j.

Assume the inputs x_j have mean 0 and variance Var(x_j). Assume the weights W_{ij} are independent of the inputs and of each other, with mean 0 and variance Var(W_{ij}). Then, approximately,

\operatorname{Var}(z_i) = n \operatorname{Var}(W_{ij}) \operatorname{Var}(x_j).

The number n is the number of input connections to the unit; it is called the fan-in.

To keep the output variance close to the input variance, we want

n \operatorname{Var}(W_{ij}) \approx 1.

Thus a natural choice is

\operatorname{Var}(W_{ij}) \approx \frac{1}{n}.

This idea leads to standard initialization schemes.
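
The algebra is easy to verify numerically. A quick sketch with arbitrary sizes, sampling weights with variance 1/n:

import torch

torch.manual_seed(0)
n = 512                              # fan-in
x = torch.randn(10_000, n)           # inputs with variance ~1
W = torch.randn(256, n) / n ** 0.5   # Var(W_ij) = 1/n

z = x @ W.T
print(x.var().item(), z.var().item())  # both close to 1.0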

Fan-In and Fan-Out

For a linear layer with weight shape

[\text{out\_features}, \text{in\_features}],

the fan-in is the number of input units, and the fan-out is the number of output units:

\text{fan\_in} = \text{in\_features}, \quad \text{fan\_out} = \text{out\_features}.

For example:

layer = nn.Linear(128, 64)

The weight matrix has shape [64, 128]. Therefore,

\text{fan\_in} = 128, \quad \text{fan\_out} = 64.

For convolutional layers, fan-in and fan-out also include kernel size. A convolution with

nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

has

\text{fan\_in} = 3 \cdot 3 \cdot 3 = 27,

because each output location depends on 3 input channels and a 3 × 3 kernel.

The fan-out is

\text{fan\_out} = 64 \cdot 3 \cdot 3 = 576.

PyTorch can compute these quantities internally for initialization.
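
One way to confirm the arithmetic is PyTorch's own helper. It is a private function (note the leading underscore), so it may change between releases, but it is convenient for a quick check:

from torch import nn
from torch.nn.init import _calculate_fan_in_and_fan_out

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
print(_calculate_fan_in_and_fan_out(conv.weight))  # (27, 576)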

Xavier Initialization

Xavier initialization, also called Glorot initialization, is designed to keep the scale of activations and gradients roughly stable in networks with symmetric activation functions such as tanh.

It uses both fan-in and fan-out. For a uniform distribution,

W_{ij} \sim U(-a, a),

where

a = \sqrt{\frac{6}{\text{fan\_in} + \text{fan\_out}}}.

In PyTorch:

layer = nn.Linear(128, 64)
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)
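
For this layer, the bound works out to a = sqrt(6 / (128 + 64)) ≈ 0.177, and a uniform distribution on (-a, a) has standard deviation a / sqrt(3) ≈ 0.102. A quick sanity check:

import math

a = math.sqrt(6 / (128 + 64))
print(a)                                # ≈ 0.177
print(layer.weight.abs().max().item())  # just below a
print(layer.weight.std().item())        # ≈ 0.102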

There is also a normal version:

nn.init.xavier_normal_(layer.weight)

Xavier initialization is commonly used with tanh, sigmoid-like networks, and some attention projections when no more specific rule is preferred.

Kaiming Initialization

Kaiming initialization, also called He initialization, is designed for ReLU-like activation functions.

ReLU discards negative pre-activations:

\operatorname{ReLU}(z) = \max(0, z).

If z is roughly symmetric around zero, ReLU sets about half of the values to zero, which roughly halves the variance of the signal. Kaiming initialization compensates for this effect.

For ReLU networks, a common choice is

\operatorname{Var}(W_{ij}) = \frac{2}{\text{fan\_in}}.

In PyTorch:

layer = nn.Linear(128, 64)

nn.init.kaiming_uniform_(layer.weight, nonlinearity="relu")
nn.init.zeros_(layer.bias)

Or with a normal distribution:

nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")

Kaiming initialization is usually the default choice for deep networks with ReLU, Leaky ReLU, and related activations.
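
A sketch of why the factor of 2 matters (depth, width, and batch size are arbitrary here): a deep ReLU stack holds its activation scale under Kaiming initialization, while Xavier-style scaling lets it decay by roughly 1/sqrt(2) per layer.

import torch
from torch import nn

torch.manual_seed(0)

def relu_stack_scale(init_fn, depth=30, width=256):
    h = torch.randn(64, width)
    with torch.no_grad():
        for _ in range(depth):
            layer = nn.Linear(width, width, bias=False)
            init_fn(layer.weight)
            h = torch.relu(layer(h))
    return h.std().item()

print(relu_stack_scale(lambda w: nn.init.kaiming_normal_(w, nonlinearity="relu")))
print(relu_stack_scale(nn.init.xavier_normal_))  # shrinks toward zero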

Bias Initialization

Biases control the initial offset of each unit. Many layers initialize biases to zero:

nn.init.zeros_(layer.bias)

This is usually adequate because the weights already differ across units.

Sometimes nonzero bias initialization is useful. For example, in recurrent networks, forget-gate biases in LSTMs are often initialized to positive values. This encourages the model to preserve memory early in training.
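
A sketch of that trick for nn.LSTM. PyTorch concatenates the four gate biases in the order input, forget, cell, output, so the forget-gate entries are the second quarter of each bias vector (the sizes below are illustrative):

import torch
from torch import nn

lstm = nn.LSTM(input_size=32, hidden_size=64)
hidden = lstm.hidden_size

for name, bias in lstm.named_parameters():
    if "bias" in name:
        nn.init.zeros_(bias)
        with torch.no_grad():
            # Positive forget-gate bias: the cell starts out retaining state.
            bias[hidden:2 * hidden].fill_(1.0)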

For ordinary feedforward networks, zero bias initialization is a good default.

Initialization in PyTorch Modules

PyTorch modules contain parameters that can be initialized manually. A common pattern is to define an initialization function and apply it recursively to a model.

import torch
from torch import nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

model.apply(init_weights)

The method model.apply(fn) visits every submodule (and the model itself) and calls fn on it.

For convolutional networks:

def init_cnn(module):
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

    elif isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

Different layer types may use different initialization rules.
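
Applying it looks the same as before. A usage sketch (the 32 × 32 input size implied by the linear layer below is an assumption for illustration):

cnn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 30 * 30, 10),  # a 3x3 conv without padding maps 32x32 to 30x30
)
cnn.apply(init_cnn)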

Initialization and Activation Functions

Initialization should match the activation function.

Activation        Common initialization
Linear            Xavier or small normal
Sigmoid           Xavier
Tanh              Xavier
ReLU              Kaiming
Leaky ReLU        Kaiming with correct negative slope
GELU              Xavier or small normal, depending on architecture
SELU              LeCun normal

For Leaky ReLU, PyTorch allows the negative slope to be specified:

nn.init.kaiming_normal_(
    layer.weight,
    a=0.01,
    nonlinearity="leaky_relu",
)

The parameter a is the negative slope of the Leaky ReLU.
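
The slope enters through the gain that scales the Kaiming bound. PyTorch's nn.init.calculate_gain returns sqrt(2 / (1 + a^2)) for Leaky ReLU, which for a = 0.01 is nearly the ReLU gain of sqrt(2):

import math
from torch import nn

a = 0.01
print(nn.init.calculate_gain("leaky_relu", a))  # ≈ 1.4141
print(math.sqrt(2 / (1 + a ** 2)))              # same formula, same value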

Transformer models often use small normal initialization, especially for embeddings and projection matrices:

nn.init.normal_(layer.weight, mean=0.0, std=0.02)

This style appears in many language model architectures.
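
For instance, GPT-2-style models initialize token embeddings this way (the vocabulary size and width below are illustrative, not tied to any specific model):

from torch import nn

embed = nn.Embedding(50_000, 768)  # illustrative vocabulary size and width
nn.init.normal_(embed.weight, mean=0.0, std=0.02)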

Initialization and Normalization Layers

Normalization layers reduce the sensitivity of networks to initialization. Batch normalization, layer normalization, and residual connections help stabilize signal propagation.

However, initialization still matters. Normalization does not remove all scale problems. It also introduces its own parameters.

For BatchNorm and LayerNorm, scale parameters are often initialized to 1 and shift parameters to 0:

norm = nn.LayerNorm(128)

nn.init.ones_(norm.weight)
nn.init.zeros_(norm.bias)

This makes the normalization layer initially pass its normalized input through unchanged, without an extra learned shift or rescaling. These values are also PyTorch's defaults for both layers.

Inspecting Initial Parameters

It is often useful to inspect parameter statistics before training.

for name, param in model.named_parameters():
    print(
        name,
        tuple(param.shape),
        param.mean().item(),
        param.std().item(),
    )

This gives a quick check for obvious errors. For example, a weight tensor whose standard deviation is zero was never randomly initialized, and a very large standard deviation may lead to unstable activations.

A simple diagnostic is to pass random input through the model and inspect activation scales:

x = torch.randn(32, 784)

with torch.no_grad():
    h = x
    for layer in model:
        h = layer(h)
        print(layer.__class__.__name__, h.mean().item(), h.std().item())

Large growth or rapid collapse of activation standard deviation indicates a possible initialization problem.

Practical Rules

For most PyTorch work, the following rules are adequate:

Model type             Recommended initialization
MLP with ReLU          Kaiming initialization
CNN with ReLU          Kaiming initialization
Tanh network           Xavier initialization
Transformer            Architecture-specific small normal initialization
Normalization scale    Ones
Normalization bias     Zeros
Ordinary bias terms    Zeros

When using standard PyTorch layers, the defaults are usually reasonable. Manual initialization becomes more important when building unusual architectures, very deep networks, custom layers, or models that train unstably.

Summary

Parameter initialization sets the starting point for optimization. It controls the initial scale of activations and gradients. Random initialization breaks symmetry between hidden units. Variance-preserving initialization keeps signals from shrinking or exploding across layers.

Xavier initialization is suitable for many tanh-like networks. Kaiming initialization is suitable for many ReLU-like networks. Biases are often initialized to zero. Normalization layers usually start with scale 1 and bias 0.

A reliable PyTorch workflow uses sensible defaults, matches initialization to activation functions, and checks parameter and activation statistics when training behaves poorly.