# Parameter Initialization

A neural network begins training with parameters that have not yet been learned from data. These initial values matter. They determine the scale of activations in the forward pass and the scale of gradients in the backward pass. Poor initialization can make training slow, unstable, or impossible.

Parameter initialization is the rule used to choose the starting values of weights and biases before optimization begins.

A layer usually computes

$$
y = Wx + b,
$$

where $x$ is the input vector, $W$ is the weight matrix, and $b$ is the bias vector. During training, gradient descent updates $W$ and $b$. Before training, however, the optimizer needs starting values.

In PyTorch, most layers initialize their parameters automatically. For example:

```python
import torch
from torch import nn

layer = nn.Linear(128, 64)

print(layer.weight.shape)  # torch.Size([64, 128])
print(layer.bias.shape)    # torch.Size([64])
```

The tensor `layer.weight` already contains random values, and `layer.bias` also holds initial values; both are chosen by PyTorch according to the layer type.
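
Continuing the example, a quick way to inspect these defaults is to print summary statistics of the fresh parameters (the exact scheme varies by layer type and PyTorch version, so treat the numbers as illustrative):

```python
# inspect the default initial values chosen by PyTorch
print(layer.weight.min().item(), layer.weight.max().item())
print(layer.weight.mean().item(), layer.weight.std().item())
print(layer.bias.min().item(), layer.bias.max().item())
```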

### Why Initialization Matters

Consider a deep feedforward network:

$$
h_1 = f(W_1 x + b_1), \quad
h_2 = f(W_2 h_1 + b_2), \quad
\ldots, \quad
h_L = f(W_L h_{L-1} + b_L).
$$

Each layer transforms the scale of its input. If the weights are too small, activations shrink as they pass through the network. After many layers, the signal may become close to zero. If the weights are too large, activations grow as they pass through the network. After many layers, the signal may become numerically unstable.

The same issue appears in the backward pass. Gradients are repeatedly multiplied by weight matrices and activation derivatives. If the scale decreases layer after layer, gradients vanish. If the scale increases layer after layer, gradients explode.

Good initialization tries to preserve signal scale across layers.
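
A minimal numerical sketch of this effect, using plain matrix multiplications without activations (the width, depth, and weight scales are illustrative):

```python
import torch

torch.manual_seed(0)

def output_std(weight_std, depth=50, width=256):
    # push a random signal through `depth` linear maps and report its final scale
    h = torch.randn(128, width)
    for _ in range(depth):
        W = torch.randn(width, width) * weight_std
        h = h @ W.T
    return h.std().item()

print(output_std(0.05))            # too small: the signal collapses toward zero
print(output_std(0.08))            # too large: the signal grows by orders of magnitude
print(output_std(1 / 256 ** 0.5))  # about 1/sqrt(width): the scale is roughly preserved
```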

### Symmetry Breaking

A common mistake is to initialize all weights to zero.

Suppose every neuron in a hidden layer has the same weight vector and the same bias. Then every neuron receives the same input, produces the same activation, and receives the same gradient. Gradient descent updates them in the same way. The neurons remain identical throughout training.

This defeats the purpose of having multiple hidden units. Different neurons should be able to learn different features.

Random initialization breaks this symmetry. Each neuron begins with slightly different weights, so different neurons can follow different gradient paths.
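
A small sketch of this failure mode, using a constant rather than a zero initialization so that the gradients are not trivially zero (the sizes and the constant 0.5 are illustrative):

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))

# give every hidden unit identical weights and biases
nn.init.constant_(net[0].weight, 0.5)
nn.init.zeros_(net[0].bias)
nn.init.constant_(net[2].weight, 0.5)
nn.init.zeros_(net[2].bias)

x = torch.randn(8, 4)
net(x).pow(2).mean().backward()

# every row of the gradient is identical, so the three hidden units
# receive the same update and remain identical after the step
print(net[0].weight.grad)
```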

Biases are often initialized to zero because weights already break symmetry:

```python
layer = nn.Linear(128, 64)

nn.init.zeros_(layer.bias)
```

Zero bias initialization is usually safe. Zero weight initialization is usually harmful for hidden layers.

### Variance Preservation

Initialization is usually designed by asking a scale question: if the input to a layer has variance near 1, what weight variance keeps the output variance near 1?

Let

$$
z = Wx.
$$

For one output unit,

$$
z_i = \sum_{j=1}^{n} W_{ij}x_j.
$$

Assume the inputs $x_j$ have mean 0 and variance $\operatorname{Var}(x_j)$. Assume the weights $W_{ij}$ are independent of the inputs and of each other, with mean 0 and variance $\operatorname{Var}(W_{ij})$. The terms $W_{ij}x_j$ are then uncorrelated, so their variances add and

$$
\operatorname{Var}(z_i) =
n \operatorname{Var}(W_{ij})\operatorname{Var}(x_j).
$$

The number $n$ is the number of input connections to the unit. It is called the fan-in.

To keep the output variance close to the input variance, we want

$$
n \operatorname{Var}(W_{ij}) \approx 1.
$$

Thus a natural choice is

$$
\operatorname{Var}(W_{ij}) \approx \frac{1}{n}.
$$

This idea leads to standard initialization schemes.
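
A quick empirical check of this relationship (the fan-in of 512 and the specific weight scales are illustrative):

```python
import torch

n = 512                      # fan-in
x = torch.randn(10_000, n)   # inputs with variance close to 1

for std in (0.01, 1 / n ** 0.5, 0.1):
    W = torch.randn(n, n) * std
    z = x @ W.T
    # predicted: Var(z) ≈ n * Var(W) * Var(x) = n * std**2
    print(f"std={std:.4f}  predicted={n * std ** 2:.3f}  measured={z.var().item():.3f}")
```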

### Fan-In and Fan-Out

For a linear layer with weight shape

$$
[\text{out\_features}, \text{in\_features}],
$$

the fan-in is the number of input units, and the fan-out is the number of output units:

$$
\text{fan\_in} = \text{in\_features},
$$

$$
\text{fan\_out} = \text{out\_features}.
$$

For example:

```python
layer = nn.Linear(128, 64)
```

The weight matrix has shape `[64, 128]`. Therefore,

$$
\text{fan\_in} = 128,
\quad
\text{fan\_out} = 64.
$$

For convolutional layers, fan-in and fan-out also include kernel size. A convolution with

```python
nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
```

has

$$
\text{fan\_in} = 3 \cdot 3 \cdot 3 = 27,
$$

because each output location depends on 3 input channels and a $3\times3$ kernel.

The fan-out is

$$
\text{fan\_out} = 64 \cdot 3 \cdot 3 = 576.
$$

PyTorch can compute these quantities internally for initialization.
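
A minimal sketch of how these quantities follow from the weight shapes, using the layers from the examples above:

```python
from torch import nn

linear = nn.Linear(128, 64)
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

# linear weight: [out_features, in_features]
out_f, in_f = linear.weight.shape
print("linear fan_in =", in_f, " fan_out =", out_f)   # 128, 64

# conv weight: [out_channels, in_channels, kH, kW]
out_c, in_c, kh, kw = conv.weight.shape
print("conv fan_in =", in_c * kh * kw)                # 27
print("conv fan_out =", out_c * kh * kw)              # 576
```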

### Xavier Initialization

Xavier initialization, also called Glorot initialization, is designed to keep the scale of activations and gradients roughly stable in networks with symmetric activation functions such as `tanh`.

It uses both fan-in and fan-out. For a uniform distribution,

$$
W_{ij} \sim U\left(-a, a\right),
$$

where

$$
a = \sqrt{\frac{6}{\text{fan\_in} + \text{fan\_out}}}.
$$

In PyTorch:

```python
layer = nn.Linear(128, 64)
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)
```

There is also a normal version:

```python
nn.init.xavier_normal_(layer.weight)
```

Xavier initialization is commonly used for networks with `tanh` or sigmoid-like activations, and sometimes for attention projections when no more specific rule is preferred.

### Kaiming Initialization

Kaiming initialization, also called He initialization, is designed for ReLU-like activation functions.

ReLU discards negative pre-activations:

$$
\operatorname{ReLU}(z) = \max(0, z).
$$

If $z$ is roughly symmetric around zero, ReLU sets about half of the values to zero, roughly halving the average squared activation. Kaiming initialization compensates for this by doubling the weight variance.

For ReLU networks, a common choice is

$$
\operatorname{Var}(W_{ij}) = \frac{2}{\text{fan\_in}}.
$$

In PyTorch:

```python
layer = nn.Linear(128, 64)

nn.init.kaiming_uniform_(layer.weight, nonlinearity="relu")
nn.init.zeros_(layer.bias)
```

Or with a normal distribution:

```python
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
```

Kaiming initialization is usually the default choice for deep networks with ReLU, Leaky ReLU, and related activations.

### Bias Initialization

Biases control the initial offset of each unit. Many layers initialize biases to zero:

```python
nn.init.zeros_(layer.bias)
```

This is usually adequate because the weights already differ across units.

Sometimes nonzero bias initialization is useful. For example, the forget-gate biases of LSTM layers are often initialized to positive values, which encourages the model to preserve memory early in training.
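
A hedged sketch of this idea for `nn.LSTM` (the sizes and the value 1.0 are illustrative; PyTorch stores the gate biases in input, forget, cell, output order, so the forget gate occupies the second block of `hidden_size` entries):

```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=128, hidden_size=64)

with torch.no_grad():
    for name, bias in lstm.named_parameters():
        if "bias" in name:
            bias.zero_()
            # forget-gate slice: second block of hidden_size entries
            bias[lstm.hidden_size:2 * lstm.hidden_size].fill_(1.0)
```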

For ordinary feedforward networks, zero bias initialization is a good default.

### Initialization in PyTorch Modules

PyTorch modules contain parameters that can be initialized manually. A common pattern is to define an initialization function and apply it recursively to a model.

```python
import torch
from torch import nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

model.apply(init_weights)
```

The method `model.apply(fn)` visits every submodule and calls `fn` on it.

For convolutional networks:

```python
def init_cnn(module):
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

    elif isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
```
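
Applying it follows the same `apply` pattern as before (the small example architecture is illustrative):

```python
cnn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),
)

cnn.apply(init_cnn)
```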

Different layer types may use different initialization rules.

### Initialization and Activation Functions

Initialization should match the activation function.

| Activation | Common initialization |
|---|---|
| Linear | Xavier or small normal |
| Sigmoid | Xavier |
| Tanh | Xavier |
| ReLU | Kaiming |
| Leaky ReLU | Kaiming with correct negative slope |
| GELU | Xavier or small normal, depending on architecture |
| SELU | LeCun normal |

For Leaky ReLU, PyTorch allows the negative slope to be specified:

```python
nn.init.kaiming_normal_(
    layer.weight,
    a=0.01,
    nonlinearity="leaky_relu",
)
```

The parameter `a` is the negative slope of the Leaky ReLU.
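
The slope passed to the initializer should match the slope used by the activation itself, for example:

```python
act = nn.LeakyReLU(negative_slope=0.01)  # matches a=0.01 above
```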

Transformer models often use small normal initialization, especially for embeddings and projection matrices:

```python
nn.init.normal_(layer.weight, mean=0.0, std=0.02)
```

This style appears in many language model architectures.
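
A minimal sketch of such a scheme as an `apply`-style function (treating embeddings and linear projections uniformly is a simplifying assumption; real architectures often add layer- or depth-dependent scaling):

```python
from torch import nn

def init_small_normal(module, std=0.02):
    # small normal weights for embeddings and projections, zero biases
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=std)
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)
```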

### Initialization and Normalization Layers

Normalization layers reduce the sensitivity of networks to initialization. Batch normalization, layer normalization, and residual connections help stabilize signal propagation.

However, initialization still matters. Normalization does not remove all scale problems. It also introduces its own parameters.

For `BatchNorm` and `LayerNorm`, scale parameters are often initialized to 1 and shift parameters to 0:

```python
norm = nn.LayerNorm(128)

nn.init.ones_(norm.weight)
nn.init.zeros_(norm.bias)
```

This makes the normalization layer initially preserve normalized activations without adding an extra learned shift or rescaling.

### Inspecting Initial Parameters

It is often useful to inspect parameter statistics before training.

```python
for name, param in model.named_parameters():
    print(
        name,
        tuple(param.shape),
        param.mean().item(),
        param.std().item(),
    )
```

This gives a quick check for obvious errors. For example, a weight tensor with standard deviation zero contains constant values, which indicates a bad initialization. Very large standard deviations may lead to unstable activations.

A simple diagnostic is to pass random input through the model and inspect activation scales:

```python
x = torch.randn(32, 784)

with torch.no_grad():
    h = x
    for layer in model:
        h = layer(h)
        print(layer.__class__.__name__, h.mean().item(), h.std().item())
```

Large growth or rapid collapse of activation standard deviation indicates a possible initialization problem.

### Practical Rules

For most PyTorch work, the following rules are adequate:

| Model type | Recommended initialization |
|---|---|
| MLP with ReLU | Kaiming initialization |
| CNN with ReLU | Kaiming initialization |
| Tanh network | Xavier initialization |
| Transformer | Architecture-specific small normal initialization |
| Normalization scale | Ones |
| Normalization bias | Zeros |
| Ordinary bias terms | Zeros |

When using standard PyTorch layers, the defaults are usually reasonable. Manual initialization becomes more important when building unusual architectures, very deep networks, custom layers, or models that train unstably.

### Summary

Parameter initialization sets the starting point for optimization. It controls the initial scale of activations and gradients. Random initialization breaks symmetry between hidden units. Variance-preserving initialization keeps signals from shrinking or exploding across layers.

Xavier initialization is suitable for many tanh-like networks. Kaiming initialization is suitable for many ReLU-like networks. Biases are often initialized to zero. Normalization layers usually start with scale 1 and bias 0.

A reliable PyTorch workflow uses sensible defaults, matches initialization to activation functions, and checks parameter and activation statistics when training behaves poorly.

