# Residual and Normalization Layers

A transformer is a deep stack of layers, each built from attention and feedforward sublayers. Without additional structure, such stacks are difficult to optimize. Activations may grow or shrink across layers. Gradients may become unstable. Early layers may be overwritten by later layers.

Residual connections and normalization layers are used to make deep transformers trainable. They are simple mechanisms, but they strongly affect training stability, the depth that can be trained reliably, learning speed, and final quality.

### The Role of Residual Connections

A residual connection adds the input of a sublayer to its output. If $F$ is a neural network sublayer, the residual form is

$$
y = x + F(x).
$$

The sublayer learns a correction to the input rather than a full replacement for it.

In a transformer, residual connections are used around self-attention and the feedforward network:

$$
y = x + \text{SelfAttention}(x),
$$

$$
h = y + \text{FeedForward}(y).
$$

This allows information to flow through the network even when a sublayer is weak, poorly initialized, or temporarily unhelpful during early training.
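
As a minimal illustrative sketch, a residual connection can be expressed as a wrapper that adds a sublayer's output back to its input; the `Residual` class name here is not from any particular library.

```python
import torch
from torch import nn

class Residual(nn.Module):
    """forward(x) returns x + F(x) for a shape-preserving sublayer F."""
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(x)

# The sublayer must preserve the feature dimension so the addition is valid.
block = Residual(nn.Linear(512, 512))
x = torch.randn(2, 16, 512)
print(block(x).shape)  # torch.Size([2, 16, 512])
```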

### Why Residuals Help Optimization

Consider a deep stack of layers. If each layer must learn a complete transformation from scratch, optimization becomes difficult. A residual layer can initially behave close to the identity function:

$$
y \approx x.
$$

This gives the network a stable starting point. Each layer can gradually learn useful deviations from the identity.

Residuals also improve gradient flow. For

$$
y = x + F(x),
$$

the derivative includes a direct path:

$$
\frac{\partial y}{\partial x} =
I + \frac{\partial F(x)}{\partial x}.
$$

The identity term $I$ gives gradients a route through the model even if the sublayer derivative is small. This is one reason residual networks can be much deeper than plain feedforward networks.
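
The identity term can be observed numerically. In the sketch below, the sublayer is a single linear map with very small weights, so $F$ contributes almost nothing, yet gradients still reach $x$ through the identity path.

```python
import torch
from torch import nn

torch.manual_seed(0)
D = 8

f = nn.Linear(D, D)
nn.init.normal_(f.weight, std=1e-3)  # make F(x) nearly negligible
nn.init.zeros_(f.bias)

x = torch.randn(D, requires_grad=True)
y = x + f(x)          # residual form
y.sum().backward()

# dy/dx = I + dF/dx, so every entry of the gradient is close to 1.
print(x.grad)
```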

### Layer Normalization

Layer normalization rescales a vector using statistics computed across its feature dimension.

For a vector

$$
x\in\mathbb{R}^{D},
$$

the mean is

$$
\mu = \frac{1}{D}\sum_{i=1}^{D} x_i,
$$

and the variance is

$$
\sigma^2 = \frac{1}{D}\sum_{i=1}^{D}(x_i-\mu)^2.
$$

Layer normalization computes

$$
\text{LayerNorm}(x)_i =
\gamma_i
\frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}}
+
\beta_i.
$$

Here $\gamma$ and $\beta$ are learned scale and shift parameters, and $\epsilon$ is a small constant for numerical stability.

For a transformer tensor

$$
X\in\mathbb{R}^{B\times T\times D},
$$

layer normalization is usually applied over the final dimension $D$. Each token vector is normalized independently.

In PyTorch:

```python
import torch
from torch import nn

B, T, D = 4, 16, 768

x = torch.randn(B, T, D)
norm = nn.LayerNorm(D)

y = norm(x)

print(y.shape)  # torch.Size([4, 16, 768])
```
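
The formula above can be checked against the module output. This continues the previous snippet and reuses `x`, `norm`, and `y`, recomputing the per-token mean and variance by hand (with the default `eps` and freshly initialized `γ = 1`, `β = 0`).

```python
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # population variance, as in the formula
manual = (x - mu) / torch.sqrt(var + norm.eps)
manual = norm.weight * manual + norm.bias

print(torch.allclose(manual, y, atol=1e-5))  # True
```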

### Why Transformers Use LayerNorm

Batch normalization normalizes using statistics across a batch. This works well in many convolutional networks, but it is less natural for transformers.

Transformers often process variable-length sequences. They may use padding masks. During generation, batch sizes may be small or even one. In distributed training, the effective batch statistics also depend on how the batch is split across devices unless they are explicitly synchronized.

Layer normalization avoids these issues because it normalizes each token representation independently. It does not require a large batch and does not depend on examples in the same batch.

For each token position, LayerNorm only uses the features of that token:

$$
X_{b,t,:}.
$$

This makes it well matched to sequence models.
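
A quick illustrative check of this independence: changing the other sequences in a batch does not change how a given token is normalized.

```python
import torch
from torch import nn

norm = nn.LayerNorm(8)

a = torch.randn(2, 4, 8)
b = a.clone()
b[1] = torch.randn(4, 8)  # change every token in the second sequence

# The first sequence is normalized identically in both cases:
# LayerNorm never looks at other tokens or other batch elements.
print(torch.allclose(norm(a)[0], norm(b)[0]))  # True
```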

### Post-Norm Transformer Layers

The original transformer used post-norm placement. In a post-norm layer, normalization is applied after the residual addition:

$$
y = \text{LayerNorm}(x + \text{SelfAttention}(x)),
$$

$$
h = \text{LayerNorm}(y + \text{FeedForward}(y)).
$$

In PyTorch-like pseudocode:

```python
x = norm1(x + self_attention(x))
x = norm2(x + feedforward(x))
```

Post-norm is conceptually simple. Each sublayer output is normalized before it is passed to the next sublayer.

However, very deep post-norm transformers can be harder to train. The skip connection itself passes through a normalization at every layer, so there is no unimpeded identity path from the output back to early layers, which can make optimization less stable at scale.

### Pre-Norm Transformer Layers

Modern transformer implementations often use pre-norm placement. In a pre-norm layer, normalization is applied before the sublayer:

$$
y = x + \text{SelfAttention}(\text{LayerNorm}(x)),
$$

$$
h = y + \text{FeedForward}(\text{LayerNorm}(y)).
$$

In pseudocode:

```python
x = x + self_attention(norm1(x))
x = x + feedforward(norm2(x))
```

Pre-norm gives the residual path a direct identity route. Gradients can pass through the residual stream with fewer transformations.

This makes pre-norm transformers easier to train when the number of layers is large. Many modern encoder and decoder models use pre-norm or closely related variants.

### Comparing Pre-Norm and Post-Norm

| Design | Formula | Main property |
|---|---|---|
| Post-norm | $\text{LN}(x + F(x))$ | Normalized layer outputs |
| Pre-norm | $x + F(\text{LN}(x))$ | Cleaner residual gradient path |

Post-norm may produce well-normalized activations after each layer, but it can be less stable in very deep models. Pre-norm is usually easier to optimize, but the residual stream can grow in scale unless additional controls are used.

Many production architectures use pre-norm together with a final normalization applied after the last layer:

$$
H = \text{LayerNorm}(H^{(L)}).
$$

The final normalization stabilizes the representation before the output head.
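
A minimal sketch of this layout, assuming each element of `layers` is a pre-norm block that maps `[B, T, D]` to `[B, T, D]` (such as the `PreNormTransformerBlock` implemented later in this section); the `PreNormStack` name is illustrative.

```python
import torch
from torch import nn

class PreNormStack(nn.Module):
    def __init__(self, layers: nn.ModuleList, d_model: int):
        super().__init__()
        self.layers = layers
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)           # pre-norm residual blocks
        return self.final_norm(x)  # H = LayerNorm(H^(L))
```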

### Residual Stream

The residual stream is the main sequence of vectors that flows through a transformer layer stack.

At layer $\ell$, the hidden state is

$$
H^{(\ell)}\in\mathbb{R}^{B\times T\times D}.
$$

Each sublayer reads from the residual stream and writes an update back into it:

$$
H^{(\ell+1)} = H^{(\ell)} + \Delta H^{(\ell)}.
$$

Attention writes token-mixing updates. The feedforward network writes per-token nonlinear updates. Across many layers, the residual stream accumulates information.

This viewpoint is useful for understanding transformer internals. The model repeatedly edits a shared representation rather than replacing it at each layer.
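
One way to make this view concrete is to measure how large each layer's write is relative to the residual stream it is added to. The sketch below assumes `layers` is an iterable of shape-preserving residual blocks and `x` is an input batch; both names are placeholders.

```python
with torch.no_grad():
    h = x
    for i, layer in enumerate(layers):
        h_next = layer(h)
        delta = h_next - h  # the update written into the residual stream
        print(f"layer {i}: |delta| / |H| = {(delta.norm() / h.norm()).item():.3f}")
        h = h_next
```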

### Dropout in Residual Blocks

Dropout is often applied to the output of a sublayer before adding it to the residual stream:

$$
y = x + \text{Dropout}(F(\text{LayerNorm}(x))).
$$

In PyTorch-like code:

```python
x = x + dropout(attention(norm1(x)))
x = x + dropout(feedforward(norm2(x)))
```

Dropout randomly zeroes a fraction of activations during training and rescales the rest. This acts as regularization and reduces reliance on specific pathways.

For large models trained on large datasets, dropout rates are sometimes small or zero. For smaller datasets, dropout can still be important.

### Residual Scaling

When many residual updates are added across many layers, the scale of the residual stream can grow. Some architectures scale residual updates to improve stability.

A simple form is

$$
y = x + \alpha F(x),
$$

where $\alpha$ is a scalar.

For a deep model with $L$ layers, $\alpha$ may be chosen as a function of depth, such as

$$
\alpha = \frac{1}{\sqrt{L}}.
$$

Other architectures use learned residual gates:

$$
y = x + gF(x),
$$

where $g$ is a learned scalar or vector.

Residual scaling is useful when training very deep transformers or when removing normalization from some parts of the model.
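
As an illustrative sketch (not tied to a specific published architecture), a gated residual can be written as a small wrapper; `GatedResidual` and its `init` argument are assumed names.

```python
import torch
from torch import nn

class GatedResidual(nn.Module):
    """y = x + g * F(x), where g is a learned per-feature gate."""
    def __init__(self, sublayer: nn.Module, d_model: int, init: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.gate = nn.Parameter(torch.full((d_model,), init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.gate * self.sublayer(x)

# Fixed depth-dependent scaling instead of a learned gate:
# alpha = 1 / sqrt(L) applied to every residual update.
num_layers = 24
alpha = num_layers ** -0.5
```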

### RMSNorm

RMSNorm is a common alternative to LayerNorm. It normalizes by the root mean square of the feature vector but does not subtract the mean.

For

$$
x\in\mathbb{R}^{D},
$$

RMSNorm computes

$$
\text{RMS}(x)=\sqrt{\frac{1}{D}\sum_{i=1}^{D}x_i^2+\epsilon},
$$

$$
\text{RMSNorm}(x)_i=\gamma_i\frac{x_i}{\text{RMS}(x)}.
$$

RMSNorm has fewer operations than LayerNorm because it avoids mean subtraction. It is common in modern decoder-only language models.

A simple PyTorch implementation:

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor):
        # x: [..., D]
        mean_sq = x.pow(2).mean(dim=-1, keepdim=True)  # mean of squares per token
        x = x * torch.rsqrt(mean_sq + self.eps)        # divide by RMS(x)
        return self.weight * x
```

Because it does not subtract the mean, RMSNorm preserves the direction of the input vector (up to the learned per-feature scale), whereas LayerNorm first removes the mean component and therefore changes the direction.
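
A quick check of this property, reusing the `RMSNorm` class above: with the scale parameter at its initial value of one, the output is parallel to the input.

```python
norm = RMSNorm(64)
v = torch.randn(64)

out = norm(v)
cosine = torch.dot(v, out) / (v.norm() * out.norm())
print(cosine)  # ~1.0: RMSNorm only rescales, it does not rotate the vector
```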

### Normalization and Numerical Stability

Normalization layers include a small constant $\epsilon$ to prevent division by zero:

$$
\sqrt{\sigma^2+\epsilon}.
$$

The choice of $\epsilon$ can matter in low-precision training. Common values include $10^{-5}$, $10^{-6}$, and $10^{-12}$, depending on the architecture and data type.

In mixed precision training, normalization is often computed carefully to avoid overflow or underflow. Framework implementations handle many of these details, but custom normalization code should be tested with `float16` and `bfloat16`.
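
One common pattern, shown here as an illustrative sketch rather than the API of any particular framework, is to compute the statistics in `float32` and cast the result back to the input dtype; the function name and arguments are assumptions.

```python
import torch

def rms_norm_fp32(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Compute the RMS statistics in float32 even when x is float16 or bfloat16.
    x32 = x.float()
    inv_rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + eps)
    # Cast back to the original dtype before applying the learned scale,
    # assuming weight has the same dtype as x.
    return (x32 * inv_rms).to(x.dtype) * weight
```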

### Implementation of a Pre-Norm Block

A minimal pre-norm transformer block looks like this:

```python
import torch
from torch import nn

class PreNormTransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()

        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            batch_first=True,
        )
        self.drop1 = nn.Dropout(dropout)

        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.drop2 = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, attn_mask=None, key_padding_mask=None):
        y = self.norm1(x)

        attn_out, _ = self.attn(
            y, y, y,
            attn_mask=attn_mask,
            key_padding_mask=key_padding_mask,
            need_weights=False,
        )

        x = x + self.drop1(attn_out)

        y = self.norm2(x)
        x = x + self.drop2(self.ffn(y))

        return x
```

The block preserves shape:

```python
B, T, D = 8, 128, 512

block = PreNormTransformerBlock(
    d_model=D,
    n_heads=8,
    d_ff=2048,
)

x = torch.randn(B, T, D)
y = block(x)

print(y.shape)  # torch.Size([8, 128, 512])
```

This shape preservation is what makes transformer layers easy to stack.

### Common Failure Modes

Residual and normalization design errors often appear as training instability.

Common symptoms include:

| Symptom | Possible cause |
|---|---|
| Loss becomes NaN | Too large learning rate, unstable normalization, mixed precision overflow |
| Loss does not decrease | Broken residual path, wrong mask, poor initialization |
| Gradients explode | Missing normalization, too large residual updates |
| Outputs collapse | Excessive normalization, wrong dropout placement, bad initialization |
| Training works only for shallow models | Post-norm instability or poor residual scaling |

Debugging should start with tensor shapes, masks, activation statistics, gradient norms, and learning rate.
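
For the gradient part of that checklist, the global gradient norm can be printed after the backward pass; this sketch assumes `model` is the transformer being trained and `loss.backward()` has already run.

```python
grad_sq = 0.0
for p in model.parameters():
    if p.grad is not None:
        grad_sq += p.grad.detach().float().pow(2).sum().item()
print(f"global grad norm: {grad_sq ** 0.5:.3f}")
```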

A useful check is to monitor the norm of the residual stream across layers:

```python
with torch.no_grad():
    # mean L2 norm of the token vectors in the residual stream at this layer
    print(x.norm(dim=-1).mean())
```

If the norm grows rapidly with depth, residual scaling, smaller initialization, or normalization changes may be needed.

### Practical Defaults

For most PyTorch transformer models, a good default is:

| Component | Default choice |
|---|---|
| Residual layout | Pre-norm |
| Normalization | LayerNorm for general models, RMSNorm for decoder LMs |
| Dropout placement | Sublayer output before residual addition |
| Final normalization | Yes |
| Residual scaling | Usually unnecessary for small and medium models |
| $\epsilon$ | $10^{-5}$ for LayerNorm, $10^{-6}$ for RMSNorm |

For educational implementations, use LayerNorm first. RMSNorm is worth adding once the basic architecture is correct.

### Summary

Residual connections allow transformer layers to learn updates to a shared representation rather than replacing it at every layer. They improve gradient flow and make deep models trainable.

Normalization controls activation scale and improves numerical stability. LayerNorm normalizes each token over its feature dimension. RMSNorm is a lighter alternative often used in modern language models.

The main architectural choice is post-norm versus pre-norm. Post-norm normalizes after each residual addition. Pre-norm normalizes before each sublayer and gives gradients a cleaner path through the residual stream. For deep transformers, pre-norm is usually the safer default.

