# Dropout

Dropout is a regularization method that randomly removes parts of a neural network during training. More precisely, it sets selected activations to zero with some probability, so the model must learn to make useful predictions without depending too heavily on any single hidden unit.

A dropout layer takes an activation tensor $h$ and samples a binary mask $m$. Each entry of the mask is either 0 or 1. During training, the output is

$$
\tilde{h} = \frac{m \odot h}{1-p},
$$

where $p$ is the dropout probability, $m_i \sim \mathrm{Bernoulli}(1-p)$, and $\odot$ denotes elementwise multiplication.

The factor $1/(1-p)$ keeps the expected activation magnitude roughly unchanged during training.

### Motivation

Large neural networks can memorize training data. If some hidden units become highly specialized to accidental patterns, the model may perform well on the training set but poorly on new examples.

Dropout reduces this risk by injecting noise into the network during training. Each update sees a slightly different subnetwork. The model cannot rely on a fixed set of hidden activations always being present.

This encourages redundancy and robustness. Useful information should be distributed across many units rather than concentrated in a small number of fragile paths.

### Dropout as Model Averaging

One way to understand dropout is as approximate model averaging.

A network with dropout represents many subnetworks. Each binary mask selects a different subnetwork. Training with dropout updates parameters under many such masks. At inference time, the full network is used, with scaling chosen so that activation magnitudes match their expected training-time values.

The number of possible subnetworks is enormous: if a layer has $n$ units, there are $2^n$ possible dropout masks. Dropout does not train each subnetwork separately; it samples a small number of masks through stochastic optimization.

This gives some of the benefit of ensembling without training many independent models.
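The mask count grows quickly even for tiny layers. As a small illustration (the layer size here is arbitrary), two independently sampled masks almost always select different subnetworks:

```python
import torch

torch.manual_seed(0)

p = 0.5
n = 8  # a tiny layer: 2 ** 8 = 256 possible masks

# Each mask entry keeps its unit with probability 1 - p.
# Two independent draws rarely coincide, so consecutive
# training steps update different subnetworks.
m1 = torch.bernoulli(torch.full((n,), 1 - p))
m2 = torch.bernoulli(torch.full((n,), 1 - p))

print(m1)
print(m2)
```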

### Inverted Dropout

Modern libraries usually implement inverted dropout. This is the formula shown above:

$$
\tilde{h} = \frac{m \odot h}{1-p}.
$$

The scaling happens during training. At inference time, dropout is disabled and the layer returns the input unchanged.

This convention makes inference simpler.

If $m_i$ has probability $1-p$ of being 1, then

$$
\mathbb{E}[m_i h_i] = (1-p)h_i.
$$

After scaling,

$$
\mathbb{E}\left[\frac{m_i h_i}{1-p}\right] = h_i.
$$

Thus the expected activation during training equals the deterministic activation used during inference.
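This identity is easy to check numerically. The sketch below (the tensor size is arbitrary) averages a large number of dropped activations of a constant input:

```python
import torch
from torch import nn

torch.manual_seed(0)

p = 0.5
dropout = nn.Dropout(p=p)
dropout.train()  # masks are only applied in training mode

# With h = 1 everywhere, the post-dropout mean estimates
# E[m_i h_i / (1 - p)], which should be close to 1.
h = torch.ones(1_000_000)
h_tilde = dropout(h)

print(h_tilde.mean().item())  # close to 1.0
```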

### Dropout in PyTorch

PyTorch provides dropout through `torch.nn.Dropout`.

```python
import torch
from torch import nn

dropout = nn.Dropout(p=0.5)

x = torch.ones(10)
y = dropout(x)

print(y)
```

In training mode, approximately half the entries are set to zero and the surviving entries are scaled by $1/(1-0.5)=2$.

Dropout behavior depends on the module mode.

```python
dropout.train()   # training mode: a fresh mask is applied, survivors are scaled
y_train = dropout(x)

dropout.eval()    # evaluation mode: the layer is the identity, y_eval equals x
y_eval = dropout(x)
```

In training mode, dropout is active. In evaluation mode, dropout is disabled.

For a model:

```python
model.train()  # enables dropout
model.eval()   # disables dropout
```

This distinction is critical. Forgetting to call `model.eval()` during validation or inference can make predictions random and degrade evaluation accuracy.

### Dropout in Feedforward Networks

A common feedforward classifier uses dropout after activation functions:

```python
class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)
```

The dropout layers affect hidden activations. They do not directly modify the weights. Gradients still update all parameters, but only the active paths receive gradient for a given forward pass.
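The mode dependence can be observed directly with a small stand-in network (the layer sizes below are illustrative, not taken from the class above):

```python
import torch
from torch import nn

torch.manual_seed(0)

# A minimal network with one dropout layer; dimensions are arbitrary.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(16, 3),
)

x = torch.randn(4, 8)

model.train()
a = model(x)
b = model(x)
print(torch.equal(a, b))  # False: a fresh mask is sampled per forward pass

model.eval()
c = model(x)
d = model(x)
print(torch.equal(c, d))  # True: dropout disabled, output is deterministic
```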

### Dropout Probability

The dropout probability $p$ controls how much noise is injected.

| Dropout probability | Effect |
|---:|---|
| $p=0.0$ | No dropout |
| $p=0.1$ | Weak regularization |
| $p=0.3$ | Moderate regularization |
| $p=0.5$ | Strong regularization |
| $p>0.5$ | Often too aggressive |

For older fully connected networks, $p=0.5$ was common. For modern architectures with normalization, residual connections, strong data augmentation, and large datasets, smaller values such as $0.1$ or $0.2$ are often more appropriate.

The best value is empirical. It should be selected using validation performance.

### Dropout and Underfitting

Dropout makes training harder. A model trained with dropout sees noisy hidden representations. This can improve generalization, but it can also reduce training accuracy.

If dropout is too strong, the model may underfit. Symptoms include:

| Symptom | Interpretation |
|---|---|
| Training loss remains high | Dropout may be too strong |
| Training and validation loss are both poor | Model lacks effective capacity |
| Removing dropout improves validation | Regularization was excessive |
| Model converges very slowly | Noise level may be too high |

Dropout should be tuned together with model size, learning rate, data augmentation, and weight decay.

### Dropout in Convolutional Networks

Standard dropout can be applied to convolutional feature maps, but it may be less effective than in fully connected layers. Neighboring pixels and channels are highly correlated, so dropping individual elements may not remove much information.

PyTorch provides `Dropout2d`, which randomly drops entire channels in a feature map.

```python
drop = nn.Dropout2d(p=0.2)

x = torch.randn(16, 64, 32, 32)
y = drop(x)

print(y.shape)
```

For an input of shape `[B, C, H, W]`, `Dropout2d` zeroes each channel independently with probability $p$, separately for every sample. This forces the model to avoid relying too heavily on specific feature maps.
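A quick check (shapes are illustrative) confirms that each `(sample, channel)` slice is dropped or kept as a whole, never partially:

```python
import torch
from torch import nn

torch.manual_seed(0)

drop = nn.Dropout2d(p=0.5)
drop.train()

x = torch.randn(2, 8, 4, 4)
y = drop(x)

# Flatten the spatial dimensions and check each [B, C] slice:
# it is either entirely zero (dropped) or entirely nonzero (kept
# and scaled by 1/(1-p)).
per_channel_zero = (y == 0).flatten(2).all(dim=2)
per_channel_kept = (y != 0).flatten(2).all(dim=2)
print((per_channel_zero | per_channel_kept).all().item())  # True
```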

Similarly, `Dropout3d` is used for volumetric data or spatiotemporal tensors.

### Dropout in Recurrent Networks

Dropout in recurrent networks must be used carefully. Applying a different dropout mask at every time step can disrupt temporal memory.

Common variants include:

| Variant | Description |
|---|---|
| Input dropout | Applied to inputs |
| Output dropout | Applied to outputs |
| Recurrent dropout | Applied to recurrent connections |
| Variational dropout | Uses the same mask across time |

PyTorch recurrent modules expose a `dropout` argument, but it applies dropout between stacked recurrent layers, not on recurrent connections inside a single layer.

```python
lstm = nn.LSTM(
    input_size=128,
    hidden_size=256,
    num_layers=3,
    dropout=0.2,
    batch_first=True,
)
```

In this example, dropout is applied between stacked LSTM layers. If `num_layers=1`, the `dropout` argument has no effect, and PyTorch emits a warning.
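PyTorch does not ship a variational dropout module. The sketch below shows one common way to implement a "locked" mask that is reused across time steps; the class name `LockedDropout` and its interface are illustrative, not a PyTorch API.

```python
import torch
from torch import nn

class LockedDropout(nn.Module):
    """Applies the same dropout mask at every time step.

    A sketch of variational dropout for inputs of shape [B, T, D].
    """
    def __init__(self, p=0.2):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        # One mask per (sample, feature), broadcast over the time axis,
        # with inverted-dropout scaling by 1 / (1 - p).
        mask = torch.bernoulli(
            torch.full((x.size(0), 1, x.size(2)), 1 - self.p, device=x.device)
        ) / (1 - self.p)
        return x * mask

torch.manual_seed(0)
drop = LockedDropout(p=0.5)
drop.train()

x = torch.ones(2, 6, 4)
y = drop(x)

# Every time step sees the same mask, so the first and last
# time steps are identical.
print(torch.equal(y[:, 0], y[:, -1]))  # True
```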

### Dropout in Transformers

Transformers often use several dropout locations:

| Location | Purpose |
|---|---|
| Embedding dropout | Regularizes token representations |
| Attention dropout | Drops attention probabilities |
| Residual dropout | Drops outputs before residual addition |
| MLP dropout | Regularizes feedforward blocks |

A simplified transformer block may contain:

```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, dropout_p):
        super().__init__()

        self.attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=dropout_p,  # attention dropout on the attention weights
            batch_first=True,
        )

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Dropout(dropout_p),  # MLP dropout inside the feedforward block
            nn.Linear(4 * d_model, d_model),
        )

        # Residual dropout, applied to sublayer outputs before the
        # residual addition.
        self.resid_drop = nn.Dropout(dropout_p)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.resid_drop(attn_out))

        mlp_out = self.mlp(x)
        x = self.norm2(x + self.resid_drop(mlp_out))

        return x
```

In large-scale language model pretraining, dropout may be small or even zero when the dataset is very large. In fine-tuning, dropout can be useful because task-specific datasets are often much smaller.

### Dropout and Batch Normalization

Dropout and batch normalization can interact in nontrivial ways. Batch normalization estimates activation statistics during training. Dropout changes those activations by randomly zeroing entries. This can make the statistics seen during training differ from those used during inference.

For this reason, many modern convolutional architectures use batch normalization and data augmentation with little or no dropout.

In transformer models, LayerNorm is more common than BatchNorm, and dropout is often easier to use.

### Dropout at Inference

By default, dropout is disabled during inference.

```python
model.eval()

with torch.no_grad():
    logits = model(x)
```

This gives deterministic predictions, except for other stochastic layers or nondeterministic hardware kernels.

There is also a technique called Monte Carlo dropout. It intentionally keeps dropout active during inference and runs the model multiple times. The variation across predictions is used as an approximate uncertainty estimate.

Example pattern:

```python
model.eval()

# Re-enable only the dropout layers; other mode-dependent layers
# (such as batch normalization) stay in evaluation mode.
for module in model.modules():
    if isinstance(module, nn.Dropout):
        module.train()

predictions = []

with torch.no_grad():
    for _ in range(20):
        logits = model(x)
        predictions.append(logits.softmax(dim=-1))

stacked = torch.stack(predictions)
mean_prediction = stacked.mean(dim=0)
uncertainty = stacked.std(dim=0)
```

This method is simple, but it should be treated as an approximation. It does not replace careful uncertainty modeling.

### Dropout Versus Weight Decay

Dropout and weight decay regularize models differently.

| Method | Regularizes by | Main effect |
|---|---|---|
| Weight decay | Penalizing parameter magnitude | Encourages small weights |
| Dropout | Injecting activation noise | Reduces co-adaptation |
| Data augmentation | Changing inputs | Encourages invariance |
| Early stopping | Limiting training time | Prevents late overfitting |

These methods can be combined. A model may use weight decay, dropout, augmentation, and early stopping at the same time.

However, excessive regularization can cause underfitting. Adding more regularizers does not automatically improve validation performance.

### Practical Guidelines

For fully connected networks, start with dropout probabilities between 0.2 and 0.5.

For convolutional networks, prefer data augmentation, weight decay, and normalization first. Use dropout mainly near classifier heads or use channel-wise dropout variants.

For transformers, use small dropout values such as 0.1 as a starting point for moderate-sized datasets. For very large pretraining runs, use validation experiments to decide whether dropout is needed.

Always call `model.train()` during training and `model.eval()` during validation and inference. This controls dropout and other mode-dependent layers.

Use dropout less aggressively when the dataset is large, the model already uses strong augmentation, or validation performance degrades.

### Summary

Dropout randomly sets activations to zero during training. It prevents the model from relying too strongly on individual units and acts as a stochastic regularizer.

In PyTorch, dropout is implemented with modules such as `nn.Dropout`, `nn.Dropout2d`, and `nn.Dropout3d`. Dropout is active in training mode and disabled in evaluation mode.

Dropout remains useful, especially for fully connected layers, smaller datasets, and fine-tuning. In modern architectures, it is usually combined carefully with weight decay, normalization, augmentation, and early stopping.

