Activation functions should be chosen together with the architecture, loss, initialization, normalization, and training scale. There is no universal best activation; the right choice depends on what the layer must do.
A useful rule is: use simple activations for simple architectures, smooth activations for large transformer-style models, and bounded activations when the output must stay in a specific range.
Hidden-Layer Activations
Hidden layers need nonlinear functions that preserve useful gradients. In most modern networks, hidden activations should avoid strong saturation.
For ordinary multilayer perceptrons, a strong starting point is ReLU:
```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
```

For slightly smoother behavior, use GELU or SiLU:
```python
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.GELU(),
    nn.Linear(256, 256),
    nn.GELU(),
    nn.Linear(256, 10),
)
```

ReLU is cheaper. GELU and SiLU are smoother. In small models, the difference may be minor. In large models, the smoother activations often train better.
Output Activations
Output activations depend on the task.
| Task | Final layer output | Loss |
|---|---|---|
| Multi-class classification | Raw logits | nn.CrossEntropyLoss |
| Binary classification | Raw logits | nn.BCEWithLogitsLoss |
| Multi-label classification | Raw logits per label | nn.BCEWithLogitsLoss |
| Regression | Usually raw value | MSE, MAE, Huber |
| Probability output for inference | Softmax or sigmoid | Usually no loss |
| Bounded regression | Sigmoid or tanh | Task-dependent |
During training, PyTorch losses often expect logits, not probabilities. Apply softmax or sigmoid for inference, logging, calibration, or sampling, but usually not before the loss.
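A minimal sketch of this pattern (the model, inputs, and targets here are placeholders): the loss consumes raw logits during training, and softmax is applied only at inference.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Training step: pass raw logits straight to the loss.
logits = model(inputs)             # shape [B, K]
loss = criterion(logits, targets)  # targets are class indices, shape [B]

# Inference: convert logits to probabilities only when they are needed.
with torch.no_grad():
    probs = torch.softmax(model(inputs), dim=-1)
    preds = probs.argmax(dim=-1)
```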
Architecture-Specific Defaults
Different model families have different activation defaults.
| Architecture | Good default |
|---|---|
| Classical MLP | ReLU, GELU |
| CNN | ReLU, SiLU |
| ResNet-style CNN | ReLU |
| EfficientNet-style CNN | SiLU |
| Transformer encoder | GELU |
| Decoder-only language model | GELU, SiLU |
| RNN hidden state | Tanh |
| RNN gates | Sigmoid |
| GAN discriminator | Leaky ReLU |
| Autoencoder | ReLU, GELU, SiLU |
| VAE latent parameters | Usually no activation |
| Attention weights | Softmax |
These defaults are starting points. The final choice should be checked empirically.
Match Activation and Initialization
Initialization should match the activation.
For ReLU and Leaky ReLU, use Kaiming initialization:
```python
layer = nn.Linear(128, 256)
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
```

For Leaky ReLU:
```python
nn.init.kaiming_normal_(
    layer.weight,
    a=0.01,
    nonlinearity="leaky_relu",
)
```

For tanh, use Xavier initialization:
```python
nn.init.xavier_uniform_(layer.weight)
```

Bad initialization can push activations into saturation or shrink signals layer by layer.
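To apply a matched initialization across a whole model, one option is Module.apply. This is a sketch for a ReLU network, not the only correct recipe:

```python
import torch.nn as nn

def init_for_relu(module: nn.Module) -> None:
    # Kaiming (He) initialization matched to ReLU-family activations.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.apply(init_for_relu)
```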
Match Activation and Normalization
Normalization changes how activations behave. Batch normalization and layer normalization keep pre-activation values in a more stable range.
A CNN block commonly uses:
```python
block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
)
```

A transformer feedforward block commonly uses layer normalization outside or around the feedforward module:
```python
ffn = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)
```

Normalization reduces saturation risk, but it does not make activation choice irrelevant.
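To make "around the feedforward module" concrete, here is a sketch of a pre-norm residual sub-block; the layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PreNormFFN(nn.Module):
    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm: normalize before the feedforward path, then add the residual.
        return x + self.ffn(self.norm(x))
```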
Use Bounded Activations Only When Needed
Sigmoid and tanh are bounded. This is useful when the model output must stay in a range.
Use sigmoid for values in (0, 1):

```python
prob = torch.sigmoid(logits)
```

Use tanh for values in (-1, 1):

```python
value = torch.tanh(raw_output)
```

Avoid sigmoid and tanh as default hidden activations in deep feedforward networks. Their saturation makes optimization harder.
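If the required range is not exactly (0, 1) or (-1, 1), a bounded activation can be rescaled. This is a sketch; the bounds below are arbitrary examples:

```python
import torch

def bounded_output(raw: torch.Tensor, low: float, high: float) -> torch.Tensor:
    # Squash to (0, 1) with sigmoid, then rescale to (low, high).
    return low + (high - low) * torch.sigmoid(raw)

raw_output = torch.randn(4, 1)
value = bounded_output(raw_output, 0.0, 100.0)  # values constrained to (0, 100)
```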
Multi-Class Versus Multi-Label
For multi-class classification, exactly one class is correct. Use softmax semantics through CrossEntropyLoss.
```python
loss = nn.CrossEntropyLoss()(logits, targets)
```

Here logits has shape [B, K], and targets has shape [B].
For multi-label classification, each class is an independent yes/no decision. Use sigmoid semantics through BCEWithLogitsLoss.
```python
loss = nn.BCEWithLogitsLoss()(logits, targets)
```

Here both logits and targets have shape [B, K].
This distinction is one of the most common sources of classification bugs.
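A small sketch with dummy tensors makes the shape and dtype conventions explicit:

```python
import torch
import torch.nn as nn

B, K = 8, 5
logits = torch.randn(B, K)

# Multi-class: integer class indices, shape [B].
mc_targets = torch.randint(0, K, (B,))
mc_loss = nn.CrossEntropyLoss()(logits, mc_targets)

# Multi-label: 0/1 indicators as floats, shape [B, K].
ml_targets = torch.randint(0, 2, (B, K)).float()
ml_loss = nn.BCEWithLogitsLoss()(logits, ml_targets)
```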
Watch Activation Statistics
Activation statistics can reveal bad choices quickly.
For ReLU, inspect the fraction of zero activations:
```python
zero_frac = (activation == 0).float().mean()
```

For sigmoid, inspect whether outputs are close to 0 or 1:

```python
sat_frac = ((activation < 0.01) | (activation > 0.99)).float().mean()
```

For tanh, inspect whether outputs are close to -1 or 1:

```python
sat_frac = ((activation < -0.99) | (activation > 0.99)).float().mean()
```

A high saturation fraction means gradients may be blocked or severely reduced.
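One way to collect these statistics without modifying the model is a forward hook. A sketch, assuming a small ReLU MLP:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
stats = {}

def record_zero_fraction(module, inputs, output):
    # Fraction of exactly-zero outputs from the ReLU for this batch.
    stats["relu_zero_frac"] = (output == 0).float().mean().item()

hook = model[1].register_forward_hook(record_zero_fraction)
model(torch.randn(32, 128))
print(stats["relu_zero_frac"])
hook.remove()
```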
Practical Decision Table
| Situation | Recommended activation |
|---|---|
| Need fast baseline | ReLU |
| Many dead ReLU units | Leaky ReLU |
| Transformer feedforward block | GELU or SiLU |
| Modern CNN block | ReLU or SiLU |
| Binary probability output | Sigmoid at inference, logits during training |
| Multi-class probability output | Softmax at inference, logits during training |
| RNN gate | Sigmoid |
| RNN candidate state | Tanh |
| Output must be between -1 and 1 | Tanh |
| Output must be between 0 and 1 | Sigmoid |
| Training unstable due to saturation | Recheck initialization, normalization, and learning rate |
Minimal PyTorch Experiment
The safest way to choose an activation is to compare candidates under the same training setup.
```python
import torch
import torch.nn as nn
from typing import Callable

class MLP(nn.Module):
    def __init__(self, activation: Callable[[], nn.Module]):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128, 256),
            activation(),
            nn.Linear(256, 256),
            activation(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.net(x)

models = {
    "relu": MLP(nn.ReLU),
    "gelu": MLP(nn.GELU),
    "silu": MLP(nn.SiLU),
    "leaky_relu": MLP(lambda: nn.LeakyReLU(0.01)),
}
```

When comparing them, keep everything else fixed: dataset split, optimizer, learning rate, batch size, initialization, number of steps, and random seed.
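A sketch of such a comparison on synthetic data, reusing the MLP class above; the dataset, optimizer, learning rate, and step count are placeholders:

```python
torch.manual_seed(0)
x = torch.randn(1024, 128)
y = torch.randint(0, 10, (1024,))

results = {}
for name, activation in [("relu", nn.ReLU), ("gelu", nn.GELU), ("silu", nn.SiLU)]:
    torch.manual_seed(0)  # identical weight initialization for every candidate
    model = MLP(activation)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    for _ in range(200):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    results[name] = loss.item()

print(results)
```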
Practical Rules
Use logits during training with stable PyTorch loss functions.
Use ReLU as a baseline for MLPs and CNNs.
Use GELU or SiLU for transformer-style feedforward blocks.
Use Leaky ReLU when ordinary ReLU produces too many inactive units.
Use sigmoid and tanh deliberately, mostly for gates, recurrent states, and bounded outputs.
Treat activation selection as part of a system. A good activation can still fail with poor initialization, bad normalization, unstable learning rates, or incorrectly scaled inputs.
Exercises
Train the same MLP with ReLU, GELU, SiLU, and Leaky ReLU. Compare validation loss.
Measure the fraction of zero activations in a ReLU network over training.
Replace ReLU with Leaky ReLU and check whether inactive units decrease.
Build a binary classifier using BCEWithLogitsLoss. Verify that no sigmoid is used before the loss.
Build a multi-class classifier using CrossEntropyLoss. Verify that no softmax is used before the loss.