
Practical Activation Selection

Activation functions should be chosen for the architecture, loss, initialization, normalization, and training scale. There is no universal best activation. The right choice depends on what the layer must do.

A useful rule is: use simple activations for simple architectures, smooth activations for large transformer-style models, and bounded activations when the output must stay in a specific range.

Hidden-Layer Activations

Hidden layers need nonlinear functions that preserve useful gradients. In most modern networks, hidden activations should avoid strong saturation.

For ordinary multilayer perceptrons, a strong starting point is ReLU:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

For slightly smoother behavior, use GELU or SiLU:

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.GELU(),
    nn.Linear(256, 256),
    nn.GELU(),
    nn.Linear(256, 10),
)

ReLU is cheaper. GELU and SiLU are smoother. In small models, the difference may be minor. In large models, the smoother activations often train better.

Output Activations

Output activations depend on the task.

Task                             | Final layer output   | Loss
Multi-class classification       | Raw logits           | nn.CrossEntropyLoss
Binary classification            | Raw logits           | nn.BCEWithLogitsLoss
Multi-label classification       | Raw logits per label | nn.BCEWithLogitsLoss
Regression                       | Usually raw value    | MSE, MAE, Huber
Probability output for inference | Softmax or sigmoid   | Usually no loss
Bounded regression               | Sigmoid or tanh      | Task-dependent

During training, PyTorch losses often expect logits, not probabilities. Apply softmax or sigmoid for inference, logging, calibration, or sampling, but usually not before the loss.
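
As a minimal sketch of this pattern (the linear model and random batch here are placeholders), the loss consumes raw logits and softmax appears only when probabilities are needed:

import torch
import torch.nn as nn

model = nn.Linear(128, 10)            # final layer emits raw logits
loss_fn = nn.CrossEntropyLoss()       # applies log-softmax internally

x = torch.randn(32, 128)              # placeholder batch
targets = torch.randint(0, 10, (32,))

logits = model(x)
loss = loss_fn(logits, targets)       # train on logits; no softmax before the loss

probs = torch.softmax(logits, dim=-1) # softmax only for inference, logging, or calibration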

Architecture-Specific Defaults

Different model families have different activation defaults.

Architecture                | Good default
Classical MLP               | ReLU, GELU
CNN                         | ReLU, SiLU
ResNet-style CNN            | ReLU
EfficientNet-style CNN      | SiLU
Transformer encoder         | GELU
Decoder-only language model | GELU, SiLU
RNN hidden state            | Tanh
RNN gates                   | Sigmoid
GAN discriminator           | Leaky ReLU
Autoencoder                 | ReLU, GELU, SiLU
VAE latent parameters       | Usually no activation
Attention weights           | Softmax

These defaults are starting points. The final choice should be checked empirically.

Match Activation and Initialization

Initialization should match the activation.

For ReLU and Leaky ReLU, use Kaiming initialization:

layer = nn.Linear(128, 256)
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")

For Leaky ReLU:

nn.init.kaiming_normal_(
    layer.weight,
    a=0.01,
    nonlinearity="leaky_relu",
)

For tanh, use Xavier initialization:

nn.init.xavier_uniform_(layer.weight)

Bad initialization can push activations into saturation or shrink signals layer by layer.
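
One way to keep activation and initialization consistent across a whole model is to walk its modules and initialize each linear layer in place. This is a minimal sketch assuming ReLU hidden activations; the helper name is ours, not a library function:

import torch.nn as nn

def init_relu_mlp(model: nn.Module) -> None:
    # Kaiming initialization for every Linear layer, matching ReLU hidden units
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
            if module.bias is not None:
                nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
init_relu_mlp(model)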

Match Activation and Normalization

Normalization changes how activations behave. Batch normalization and layer normalization keep pre-activation values in a more stable range.

A CNN block commonly uses:

block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
)

A transformer feedforward block commonly places layer normalization around the feedforward module rather than inside it:

ffn = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)
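
To show one common way the layer normalization wraps this module, here is a sketch of a pre-norm residual block (the 768 and 3072 sizes match the example above; other arrangements, such as post-norm, are also used):

import torch.nn as nn

class PreNormFFN(nn.Module):
    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # pre-norm: normalize before the feedforward block, then add the residual
        return x + self.ffn(self.norm(x))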

Normalization reduces saturation risk, but it does not make activation choice irrelevant.

Use Bounded Activations Only When Needed

Sigmoid and tanh are bounded. This is useful when the model output must stay in a range.

Use sigmoid for values in (0, 1):

prob = torch.sigmoid(logits)

Use tanh for values in (-1, 1):

value = torch.tanh(raw_output)

Avoid sigmoid and tanh as default hidden activations in deep feedforward networks. Their saturation makes optimization harder.

Multi-Class Versus Multi-Label

For multi-class classification, exactly one class is correct. Use softmax semantics through CrossEntropyLoss.

loss = nn.CrossEntropyLoss()(logits, targets)

Here logits has shape [B, K], and targets has shape [B].

For multi-label classification, each class is an independent yes/no decision. Use sigmoid semantics through BCEWithLogitsLoss.

loss = nn.BCEWithLogitsLoss()(logits, targets)

Here both logits and targets have shape [B, K].

This distinction is one of the most common sources of classification bugs.
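
The difference is easiest to see in the target tensors themselves. A minimal sketch with placeholder values, using B = 4 samples and K = 3 classes:

import torch
import torch.nn as nn

logits = torch.randn(4, 3)  # [B, K] raw scores from the model

# Multi-class: one correct class per sample, targets are class indices
mc_targets = torch.tensor([2, 0, 1, 2])           # shape [B], integer dtype
mc_loss = nn.CrossEntropyLoss()(logits, mc_targets)

# Multi-label: independent yes/no per class, targets are 0/1 floats
ml_targets = torch.tensor([[1., 0., 1.],
                           [0., 0., 0.],
                           [1., 1., 0.],
                           [0., 1., 1.]])          # shape [B, K], float dtype
ml_loss = nn.BCEWithLogitsLoss()(logits, ml_targets)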

Watch Activation Statistics

Activation statistics can reveal bad choices quickly.

For ReLU, inspect the fraction of zero activations:

zero_frac = (activation == 0).float().mean()

For sigmoid, inspect whether outputs are close to 0 or 1:

sat_frac = ((activation < 0.01) | (activation > 0.99)).float().mean()

For tanh, inspect whether outputs are close to -1 or 1:

sat_frac = ((activation < -0.99) | (activation > 0.99)).float().mean()

A high saturation fraction means gradients may be blocked or severely reduced.
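
To obtain the activation tensors in the first place, a forward hook can record each activation module's output during an ordinary forward pass. A minimal sketch, with a placeholder model and the ReLU zero-fraction statistic from above:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        # fraction of exactly-zero outputs, i.e. inactive ReLU units
        stats[name] = (output == 0).float().mean().item()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(torch.randn(32, 128))

print(stats)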

Practical Decision Table

Situation                            | Recommended activation
Need fast baseline                   | ReLU
Many dead ReLU units                 | Leaky ReLU
Transformer feedforward block        | GELU or SiLU
Modern CNN block                     | ReLU or SiLU
Binary probability output            | Sigmoid at inference, logits during training
Multi-class probability output       | Softmax at inference, logits during training
RNN gate                             | Sigmoid
RNN candidate state                  | Tanh
Output must be between -1 and 1      | Tanh
Output must be between 0 and 1       | Sigmoid
Training unstable due to saturation  | Recheck initialization, normalization, and learning rate

Minimal PyTorch Experiment

The safest way to choose an activation is to compare candidates under the same training setup.

from collections.abc import Callable

import torch
import torch.nn as nn

class MLP(nn.Module):
    # activation is a zero-argument factory returning an nn.Module, so plain
    # classes (nn.ReLU) and lambdas (lambda: nn.LeakyReLU(0.01)) both work
    def __init__(self, activation: Callable[[], nn.Module]):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128, 256),
            activation(),
            nn.Linear(256, 256),
            activation(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.net(x)

models = {
    "relu": MLP(nn.ReLU),
    "gelu": MLP(nn.GELU),
    "silu": MLP(nn.SiLU),
    "leaky_relu": MLP(lambda: nn.LeakyReLU(0.01)),
}

When comparing them, keep everything else fixed: dataset split, optimizer, learning rate, batch size, initialization, number of steps, and random seed.
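
A comparison loop under those constraints might look like the following sketch. It uses random placeholder data and restructures the dictionary above slightly so the seed can be reset before each model is built; a real comparison would use an actual dataset, a validation split, and more steps:

torch.manual_seed(0)
x = torch.randn(512, 128)                 # placeholder data, shared by all runs
y = torch.randint(0, 10, (512,))

results = {}
for name, factory in {"relu": nn.ReLU, "gelu": nn.GELU, "silu": nn.SiLU}.items():
    torch.manual_seed(0)                  # same RNG state, so identical initial weights
    model = MLP(factory)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(200):               # same number of steps for every candidate
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

    results[name] = loss.item()

print(results)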

Practical Rules

Use logits during training with stable PyTorch loss functions.

Use ReLU as a baseline for MLPs and CNNs.

Use GELU or SiLU for transformer-style feedforward blocks.

Use Leaky ReLU when ordinary ReLU produces too many inactive units.

Use sigmoid and tanh deliberately, mostly for gates, recurrent states, and bounded outputs.

Treat activation selection as part of a system. A good activation can still fail with poor initialization, bad normalization, unstable learning rates, or incorrectly scaled inputs.

Exercises

  1. Train the same MLP with ReLU, GELU, SiLU, and Leaky ReLU. Compare validation loss.

  2. Measure the fraction of zero activations in a ReLU network over training.

  3. Replace ReLU with Leaky ReLU and check whether inactive units decrease.

  4. Build a binary classifier using BCEWithLogitsLoss. Verify that no sigmoid is used before the loss.

  5. Build a multi-class classifier using CrossEntropyLoss. Verify that no softmax is used before the loss.