
Dropout

Dropout is a regularization method that randomly removes parts of a neural network during training. More precisely, it sets selected activations to zero with some probability, so the model must learn to make useful predictions without depending too heavily on any single hidden unit.

A dropout layer takes an activation tensor $h$ and samples a binary mask $m$. Each entry of the mask is either 0 or 1. During training, the output is

$$\tilde{h} = \frac{m \odot h}{1-p},$$

where $p$ is the dropout probability, $m_i \sim \mathrm{Bernoulli}(1-p)$, and $\odot$ denotes elementwise multiplication.

The factor $1/(1-p)$ keeps the expected activation magnitude roughly unchanged during training.
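
This rule is straightforward to implement by hand. Below is a minimal sketch in PyTorch, for illustration only; in practice the built-in nn.Dropout shown later is preferred.

import torch

def inverted_dropout(h, p=0.5, training=True):
    # At inference time the layer is the identity: no mask, no scaling.
    if not training or p == 0.0:
        return h
    # Sample m_i ~ Bernoulli(1 - p), one entry per activation.
    mask = torch.bernoulli(torch.full_like(h, 1.0 - p))
    # Zero the dropped units and rescale the survivors by 1 / (1 - p).
    return mask * h / (1.0 - p)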

Motivation

Large neural networks can memorize training data. If some hidden units become highly specialized to accidental patterns, the model may perform well on the training set but poorly on new examples.

Dropout reduces this risk by injecting noise into the network during training. Each update sees a slightly different subnetwork. The model cannot rely on a fixed set of hidden activations always being present.

This encourages redundancy and robustness. Useful information should be distributed across many units rather than concentrated in a small number of fragile paths.

Dropout as Model Averaging

One way to understand dropout is as approximate model averaging.

A network with dropout represents many subnetworks. Each binary mask selects a different subnetwork. Training with dropout updates parameters under many such masks. At inference time, the full network is used, with scaling chosen so that activation magnitudes match their expected training-time values.

The exact number of possible subnetworks is enormous. If a layer has $n$ units, there are $2^n$ possible dropout masks. Dropout does not train each subnetwork separately. It samples a small number of masks through stochastic optimization.

This gives some of the benefit of ensembling without training many independent models.

Inverted Dropout

Modern libraries usually implement inverted dropout. This is the formula shown above:

$$\tilde{h} = \frac{m \odot h}{1-p}.$$

The scaling happens during training. At inference time, dropout is disabled and the layer returns its input unchanged.

This convention makes inference simpler. In the original formulation of dropout, activations were left unscaled during training and multiplied by $1-p$ at inference time; inverted dropout moves that scaling into training, so the inference path needs no special handling.

If $m_i$ has probability $1-p$ of being 1, then

$$\mathbb{E}[m_i h_i] = (1-p)\,h_i.$$

After scaling,

$$\mathbb{E}\left[\frac{m_i h_i}{1-p}\right] = h_i.$$

Thus the expected activation during training equals the deterministic activation used during inference.
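
This identity is easy to check numerically. A quick sketch (the exact mean varies slightly from run to run):

import torch
from torch import nn

dropout = nn.Dropout(p=0.5)  # modules start in training mode

x = torch.ones(1_000_000)
y = dropout(x)

# The empirical mean approximates E[m_i x_i / (1 - p)] = 1.0.
print(y.mean())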

Dropout in PyTorch

PyTorch provides dropout through torch.nn.Dropout.

import torch
from torch import nn

dropout = nn.Dropout(p=0.5)  # each entry is zeroed with probability 0.5

x = torch.ones(10)
y = dropout(x)  # in training mode: entries are either 0.0 or 2.0

print(y)

During training mode, approximately half the entries are set to zero, and the remaining entries are scaled by $1/(1-0.5) = 2$.

Dropout behavior depends on the module mode.

dropout.train()
y_train = dropout(x)

dropout.eval()
y_eval = dropout(x)

In training mode, dropout is active. In evaluation mode, dropout is disabled.

For a model:

model.train()  # enables dropout
model.eval()   # disables dropout

This distinction is critical. Forgetting to call model.eval() during validation or inference leaves dropout active, which makes predictions noisy and degrades evaluation accuracy.

Dropout in Feedforward Networks

A common feedforward classifier uses dropout after activation functions:

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)
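
A quick usage sketch (the dimensions are illustrative):

model = MLP(input_dim=784, hidden_dim=256, num_classes=10)

x = torch.randn(32, 784)  # a batch of 32 flattened inputs
logits = model(x)
print(logits.shape)       # torch.Size([32, 10])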

The dropout layers affect hidden activations. They do not directly modify the weights. Gradients still update all parameters, but only the active paths receive gradient for a given forward pass.
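
A small sketch makes this concrete: where the mask zeroed an activation, no gradient flows back through it.

import torch
from torch import nn

x = torch.ones(8, requires_grad=True)
dropout = nn.Dropout(p=0.5)

y = dropout(x)
y.sum().backward()

# Gradient is 0 for dropped entries and 1 / (1 - p) = 2 for kept ones.
print(x.grad)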

Dropout Probability

The dropout probability $p$ controls how much noise is injected.

| Dropout probability | Effect |
| --- | --- |
| $p = 0.0$ | No dropout |
| $p = 0.1$ | Weak regularization |
| $p = 0.3$ | Moderate regularization |
| $p = 0.5$ | Strong regularization |
| $p > 0.5$ | Often too aggressive |

For older fully connected networks, $p = 0.5$ was common. For modern architectures with normalization, residual connections, strong data augmentation, and large datasets, smaller values such as $0.1$ or $0.2$ are often more appropriate.

The best value is empirical. It should be selected using validation performance.

Dropout and Underfitting

Dropout makes training harder. A model trained with dropout sees noisy hidden representations. This can improve generalization, but it can also reduce training accuracy.

If dropout is too strong, the model may underfit. Symptoms include:

| Symptom | Interpretation |
| --- | --- |
| Training loss remains high | Dropout may be too strong |
| Training and validation loss are both poor | Model lacks effective capacity |
| Removing dropout improves validation | Regularization was excessive |
| Model converges very slowly | Noise level may be too high |

Dropout should be tuned together with model size, learning rate, data augmentation, and weight decay.

Dropout in Convolutional Networks

Standard dropout can be applied to convolutional feature maps, but it may be less effective than in fully connected layers. Neighboring pixels and channels are highly correlated, so dropping individual elements may not remove much information.

PyTorch provides Dropout2d, which randomly drops entire channels in a feature map.

drop = nn.Dropout2d(p=0.2)

x = torch.randn(16, 64, 32, 32)
y = drop(x)

print(y.shape)

For an input of shape [B, C, H, W], Dropout2d may zero entire channels for each sample. This forces the model to avoid relying too heavily on specific feature maps.
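
One way to see the channel-wise behavior (a sketch; which channels get dropped is random):

import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout2d(p=0.5)

x = torch.randn(1, 8, 4, 4)
y = drop(x)

# Per-channel sums of absolute values: dropped channels are exactly zero.
print(y.abs().sum(dim=(2, 3)))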

Similarly, Dropout3d is used for volumetric data or spatiotemporal tensors.
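
For example (a minimal sketch with a [B, C, D, H, W] tensor):

drop3d = nn.Dropout3d(p=0.2)

x = torch.randn(4, 16, 8, 32, 32)  # [B, C, D, H, W]
y = drop3d(x)                      # entire 3D channels may be zeroed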

Dropout in Recurrent Networks

Dropout in recurrent networks must be used carefully. Applying a different dropout mask at every time step can disrupt temporal memory.

Common variants include:

| Variant | Description |
| --- | --- |
| Input dropout | Applied to the inputs of the recurrent layer |
| Output dropout | Applied to the outputs of the recurrent layer |
| Recurrent dropout | Applied to the recurrent (hidden-to-hidden) connections |
| Variational dropout | Uses the same mask across time (see the sketch below) |
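
As a sketch of the last variant, here is a minimal variational dropout on a [batch, time, features] tensor, assuming the same mask should apply at every time step:

import torch

def variational_dropout(x, p=0.5, training=True):
    # x has shape [batch, time, features].
    if not training or p == 0.0:
        return x
    # Sample one mask per sequence and broadcast it across time.
    mask = torch.bernoulli(
        torch.full((x.size(0), 1, x.size(2)), 1.0 - p, device=x.device)
    )
    return x * mask / (1.0 - p)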

PyTorch recurrent modules expose a dropout argument, but it applies dropout between stacked recurrent layers, not on recurrent connections inside a single layer.

lstm = nn.LSTM(
    input_size=128,
    hidden_size=256,
    num_layers=3,
    dropout=0.2,
    batch_first=True,
)

In this example, dropout is applied between LSTM layers. If num_layers=1, the dropout argument has no effect.

Dropout in Transformers

Transformers often use several dropout locations:

| Location | Purpose |
| --- | --- |
| Embedding dropout | Regularizes token representations |
| Attention dropout | Drops attention probabilities |
| Residual dropout | Drops outputs before residual addition |
| MLP dropout | Regularizes feedforward blocks |

A simplified transformer block may contain:

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, dropout_p):
        super().__init__()

        self.attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=n_heads,
            dropout=dropout_p,
            batch_first=True,
        )

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Dropout(dropout_p),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout_p),
        )

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)

        mlp_out = self.mlp(x)
        x = self.norm2(x + mlp_out)

        return x
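
A quick shape check (illustrative dimensions):

block = TransformerBlock(d_model=512, n_heads=8, dropout_p=0.1)

x = torch.randn(2, 16, 512)  # [batch, sequence, d_model]
y = block(x)
print(y.shape)               # torch.Size([2, 16, 512])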

In large-scale language model pretraining, dropout may be small or even zero when the dataset is very large. In fine-tuning, dropout can be useful because task-specific datasets are often much smaller.

Dropout and Batch Normalization

Dropout and batch normalization can interact in nontrivial ways. Batch normalization estimates activation statistics during training. Dropout changes those activations by randomly zeroing entries. This can make the statistics seen during training differ from those used during inference.

For this reason, many modern convolutional architectures use batch normalization and data augmentation with little or no dropout.

In transformer models, LayerNorm is more common than BatchNorm, and dropout is often easier to use.

Dropout at Inference

By default, dropout is disabled during inference.

model.eval()

with torch.no_grad():
    logits = model(x)

This gives deterministic predictions, unless the model contains other stochastic layers or runs on nondeterministic hardware kernels.

There is also a technique called Monte Carlo dropout. It intentionally keeps dropout active during inference and runs the model multiple times. The variation across predictions is used as an approximate uncertainty estimate.

Example pattern:

# Assumes `model` and input `x` are already defined.
model.train()  # intentionally keep dropout active

predictions = []

with torch.no_grad():
    for _ in range(20):  # 20 stochastic forward passes
        logits = model(x)
        predictions.append(logits.softmax(dim=-1))

mean_prediction = torch.stack(predictions).mean(dim=0)   # ensemble-style mean
uncertainty = torch.stack(predictions).std(dim=0)        # spread as uncertainty
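
One caveat: model.train() also switches other mode-dependent layers, such as batch normalization, back to training behavior. A common workaround is to re-enable training mode only for the dropout modules (a sketch):

model.eval()  # keep normalization layers in inference mode

for module in model.modules():
    if isinstance(module, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
        module.train()  # keep only dropout stochastic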

This method is simple, but it should be treated as an approximation. It does not replace careful uncertainty modeling.

Dropout Versus Weight Decay

Dropout and weight decay regularize models differently.

| Method | Regularizes by | Main effect |
| --- | --- | --- |
| Weight decay | Penalizing parameter magnitude | Encourages small weights |
| Dropout | Injecting activation noise | Reduces co-adaptation |
| Data augmentation | Changing inputs | Encourages invariance |
| Early stopping | Limiting training time | Prevents late overfitting |

These methods can be combined. A model may use weight decay, dropout, augmentation, and early stopping at the same time.
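
For example, dropout inside the model can be combined with weight decay in the optimizer; a sketch using the MLP defined earlier:

from torch import optim

model = MLP(input_dim=784, hidden_dim=256, num_classes=10)  # contains nn.Dropout layers
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)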

However, excessive regularization can cause underfitting. Adding more regularizers does not automatically improve validation performance.

Practical Guidelines

For fully connected networks, start with dropout probabilities between 0.2 and 0.5.

For convolutional networks, prefer data augmentation, weight decay, and normalization first. Use dropout mainly near classifier heads or use channel-wise dropout variants.

For transformers, use small dropout values such as 0.1 as a starting point for moderate-sized datasets. For very large pretraining runs, use validation experiments to decide whether dropout is needed.

Always call model.train() during training and model.eval() during validation and inference. This controls dropout and other mode-dependent layers.

Use dropout less aggressively when the dataset is large, the model already uses strong augmentation, or validation performance degrades.

Summary

Dropout randomly sets activations to zero during training. It prevents the model from relying too strongly on individual units and acts as a stochastic regularizer.

In PyTorch, dropout is implemented with modules such as nn.Dropout, nn.Dropout2d, and nn.Dropout3d. Dropout is active in training mode and disabled in evaluation mode.

Dropout remains useful, especially for fully connected layers, smaller datasets, and fine-tuning. In modern architectures, it is usually combined carefully with weight decay, normalization, augmentation, and early stopping.