Activation functions should be chosen together with the architecture, loss, initialization, normalization, and training scale. There is no universal best activation; the right choice depends on what the layer must do.
A useful rule is: use simple activations for simple architectures, smooth activations for large transformer-style models, and bounded activations when the output must stay in a specific range.
Hidden-Layer Activations
Hidden layers need nonlinear functions that preserve useful gradients. In most modern networks, hidden activations should avoid strong saturation.
For ordinary multilayer perceptrons, a strong starting point is ReLU:
```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
```

For slightly smoother behavior, use GELU or SiLU:
```python
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.GELU(),
    nn.Linear(256, 256),
    nn.GELU(),
    nn.Linear(256, 10),
)
```

ReLU is cheaper. GELU and SiLU are smoother. In small models, the difference may be minor. In large models, the smoother activations often train better.
Output Activations
Output activations depend on the task.
| Task | Final layer output | Loss |
|---|---|---|
| Multi-class classification | Raw logits | nn.CrossEntropyLoss |
| Binary classification | Raw logits | nn.BCEWithLogitsLoss |
| Multi-label classification | Raw logits per label | nn.BCEWithLogitsLoss |
| Regression | Usually raw value | MSE, MAE, Huber |
| Probability output for inference | Softmax or sigmoid | Usually no loss |
| Bounded regression | Sigmoid or tanh | Task-dependent |
During training, PyTorch losses often expect logits, not probabilities. Apply softmax or sigmoid for inference, logging, calibration, or sampling, but usually not before the loss.
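A minimal sketch of this pattern (the model, inputs, and targets here are placeholders): the loss consumes raw logits during training, and softmax is applied only at inference.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Training step: pass raw logits straight to the loss.
logits = model(inputs)             # shape [B, K]
loss = criterion(logits, targets)  # targets are class indices, shape [B]

# Inference: convert logits to probabilities only when they are needed.
with torch.no_grad():
    probs = torch.softmax(model(inputs), dim=-1)
    preds = probs.argmax(dim=-1)
```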
Architecture-Specific Defaults
Different model families have different activation defaults.
| Architecture | Good default |
|---|---|
| Classical MLP | ReLU, GELU |
| CNN | ReLU, SiLU |
| ResNet-style CNN | ReLU |
| EfficientNet-style CNN | SiLU |
| Transformer encoder | GELU |
| Decoder-only language model | GELU, SiLU |
| RNN hidden state | Tanh |
| RNN gates | Sigmoid |
| GAN discriminator | Leaky ReLU |
| Autoencoder | ReLU, GELU, SiLU |
| VAE latent parameters | Usually no activation |
| Attention weights | Softmax |
These defaults are starting points. The final choice should be checked empirically.
Match Activation and Initialization
Initialization should match the activation.
For ReLU and Leaky ReLU, use Kaiming initialization:
```python
layer = nn.Linear(128, 256)
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
```

For Leaky ReLU:
```python
nn.init.kaiming_normal_(
    layer.weight,
    a=0.01,
    nonlinearity="leaky_relu",
)
```

For tanh, use Xavier initialization:
```python
nn.init.xavier_uniform_(layer.weight)
```

Bad initialization can push activations into saturation or shrink signals layer by layer.
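To apply a matched initialization across a whole model, one option is Module.apply. This is a sketch for a ReLU network, not the only correct recipe:

```python
import torch.nn as nn

def init_for_relu(module: nn.Module) -> None:
    # Kaiming (He) initialization matched to ReLU-family activations.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.apply(init_for_relu)
```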
Match Activation and Normalization
Normalization changes how activations behave. Batch normalization and layer normalization keep pre-activation values in a more stable range.
A CNN block commonly uses:
```python
block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
)
```

A transformer feedforward block commonly uses layer normalization outside or around the feedforward module:
```python
ffn = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)
```

Normalization reduces saturation risk, but it does not make activation choice irrelevant.
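To make "around the feedforward module" concrete, here is a sketch of a pre-norm residual sub-block; the layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PreNormFFN(nn.Module):
    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm: normalize before the feedforward path, then add the residual.
        return x + self.ffn(self.norm(x))
```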
Use Bounded Activations Only When Needed
Sigmoid and tanh are bounded. This is useful when the model output must stay in a range.
Use sigmoid for values in (0, 1):

```python
prob = torch.sigmoid(logits)
```

Use tanh for values in (-1, 1):

```python
value = torch.tanh(raw_output)
```

Avoid sigmoid and tanh as default hidden activations in deep feedforward networks. Their saturation makes optimization harder.
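If the required range is not exactly (0, 1) or (-1, 1), a bounded activation can be rescaled. This is a sketch; the bounds below are arbitrary examples:

```python
import torch

def bounded_output(raw: torch.Tensor, low: float, high: float) -> torch.Tensor:
    # Squash to (0, 1) with sigmoid, then rescale to (low, high).
    return low + (high - low) * torch.sigmoid(raw)

raw_output = torch.randn(4, 1)
value = bounded_output(raw_output, 0.0, 100.0)  # values constrained to (0, 100)
```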
Multi-Class Versus Multi-Label
For multi-class classification, exactly one class is correct. Use softmax semantics through CrossEntropyLoss.
```python
loss = nn.CrossEntropyLoss()(logits, targets)
```

Here logits has shape [B, K], and targets has shape [B].
For multi-label classification, each class is an independent yes/no decision. Use sigmoid semantics through BCEWithLogitsLoss.
```python
loss = nn.BCEWithLogitsLoss()(logits, targets)
```

Here both logits and targets have shape [B, K].
This distinction is one of the most common sources of classification bugs.
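A small sketch with dummy tensors makes the shape and dtype conventions explicit:

```python
import torch
import torch.nn as nn

B, K = 8, 5
logits = torch.randn(B, K)

# Multi-class: integer class indices, shape [B].
mc_targets = torch.randint(0, K, (B,))
mc_loss = nn.CrossEntropyLoss()(logits, mc_targets)

# Multi-label: 0/1 indicators as floats, shape [B, K].
ml_targets = torch.randint(0, 2, (B, K)).float()
ml_loss = nn.BCEWithLogitsLoss()(logits, ml_targets)
```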
Watch Activation Statistics
Activation statistics can reveal bad choices quickly.
For ReLU, inspect the fraction of zero activations:
```python
zero_frac = (activation == 0).float().mean()
```

For sigmoid, inspect whether outputs are close to 0 or 1:

```python
sat_frac = ((activation < 0.01) | (activation > 0.99)).float().mean()
```

For tanh, inspect whether outputs are close to -1 or 1:

```python
sat_frac = ((activation < -0.99) | (activation > 0.99)).float().mean()
```

A high saturation fraction means gradients may be blocked or severely reduced.
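One way to collect these statistics without modifying the model is a forward hook. A sketch, assuming a small ReLU MLP:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
stats = {}

def record_zero_fraction(module, inputs, output):
    # Fraction of exactly-zero outputs from the ReLU for this batch.
    stats["relu_zero_frac"] = (output == 0).float().mean().item()

hook = model[1].register_forward_hook(record_zero_fraction)
model(torch.randn(32, 128))
print(stats["relu_zero_frac"])
hook.remove()
```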
Practical Decision Table
| Situation | Recommended activation |
|---|---|
| Need fast baseline | ReLU |
| Many dead ReLU units | Leaky ReLU |
| Transformer feedforward block | GELU or SiLU |
| Modern CNN block | ReLU or SiLU |
| Binary probability output | Sigmoid at inference, logits during training |
| Multi-class probability output | Softmax at inference, logits during training |
| RNN gate | Sigmoid |
| RNN candidate state | Tanh |
| Output must be between -1 and 1 | Tanh |
| Output must be between 0 and 1 | Sigmoid |
| Training unstable due to saturation | Recheck initialization, normalization, and learning rate |
Minimal PyTorch Experiment
The safest way to choose an activation is to compare candidates under the same training setup.
```python
import torch
import torch.nn as nn
from typing import Callable

class MLP(nn.Module):
    def __init__(self, activation: Callable[[], nn.Module]):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128, 256),
            activation(),
            nn.Linear(256, 256),
            activation(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.net(x)

models = {
    "relu": MLP(nn.ReLU),
    "gelu": MLP(nn.GELU),
    "silu": MLP(nn.SiLU),
    "leaky_relu": MLP(lambda: nn.LeakyReLU(0.01)),
}
```

When comparing them, keep everything else fixed: dataset split, optimizer, learning rate, batch size, initialization, number of steps, and random seed.
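A sketch of such a comparison on synthetic data, reusing the MLP class above; the dataset, optimizer, learning rate, and step count are placeholders:

```python
torch.manual_seed(0)
x = torch.randn(1024, 128)
y = torch.randint(0, 10, (1024,))

results = {}
for name, activation in [("relu", nn.ReLU), ("gelu", nn.GELU), ("silu", nn.SiLU)]:
    torch.manual_seed(0)  # identical weight initialization for every candidate
    model = MLP(activation)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    for _ in range(200):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    results[name] = loss.item()

print(results)
```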
Practical Rules
Use logits during training with stable PyTorch loss functions.
Use ReLU as a baseline for MLPs and CNNs.
Use GELU or SiLU for transformer-style feedforward blocks.
Use Leaky ReLU when ordinary ReLU produces too many inactive units.
Use sigmoid and tanh deliberately, mostly for gates, recurrent states, and bounded outputs.
Treat activation selection as part of a system. A good activation can still fail with poor initialization, bad normalization, unstable learning rates, or incorrectly scaled inputs.
Exercises
Train the same MLP with ReLU, GELU, SiLU, and Leaky ReLU. Compare validation loss.
Measure the fraction of zero activations in a ReLU network over training.
Replace ReLU with Leaky ReLU and check whether inactive units decrease.
Build a binary classifier using BCEWithLogitsLoss. Verify that no sigmoid is used before the loss.
Build a multi-class classifier using CrossEntropyLoss. Verify that no softmax is used before the loss.