ReLU and its variants improved optimization in deep networks, but they still have limitations. ReLU is not smooth at zero, discards all negative values, and can produce dead activations. Later activation functions attempted to preserve the optimization advantages of ReLU while improving gradient behavior, smoothness, and representational flexibility.
Three important modern activations are ELU, GELU, and Swish. These functions appear frequently in modern convolutional networks, transformers, and large language models.
Motivation for Smooth Activations
ReLU is piecewise linear:

$$\mathrm{ReLU}(x) = \max(0, x)$$
Its derivative changes abruptly at zero. For negative inputs, the derivative is exactly zero.
Many later activations attempted to address several issues:
- Preserve gradients for negative inputs
- Avoid dead neurons
- Provide smoother optimization landscapes
- Improve gradient flow
- Preserve some negative information
- Improve large-scale training stability
The central idea is simple: instead of sharply clipping the negative side, use a smoother nonlinear transition.
Exponential Linear Units (ELU)
The exponential linear unit, or ELU, introduces a smooth exponential curve for negative inputs.
The function is defined as

$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha \left(e^{x} - 1\right), & x \le 0 \end{cases}$$

where $\alpha$ controls the saturation value of the negative region.
A common choice is $\alpha = 1$.
For positive inputs, ELU behaves like ReLU. For negative inputs, it smoothly approaches $-\alpha$.
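To make the definition concrete, here is a minimal sketch that evaluates the piecewise formula directly and checks it against PyTorch's built-in `F.elu` (the helper name `elu_manual` is just for illustration):

```python
import torch
import torch.nn.functional as F

def elu_manual(x, alpha=1.0):
    # x for x > 0, alpha * (exp(x) - 1) for x <= 0
    return torch.where(x > 0, x, alpha * (torch.exp(x) - 1))

x = torch.linspace(-4.0, 4.0, steps=9)
print(torch.allclose(elu_manual(x), F.elu(x)))  # expected: True
```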
Properties of ELU
The negative side of ELU is smooth rather than flat.
As $x \to -\infty$, $\mathrm{ELU}(x) \to -\alpha$.
At zero, $\mathrm{ELU}(0) = 0$.
The function is continuous, and when $\alpha = 1$, its derivative is also continuous at zero.
This smooth transition improves gradient behavior compared with ReLU.
In PyTorch:
```python
import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 2.0])
act = nn.ELU(alpha=1.0)
y = act(x)
print(y)
```

Output:

```
tensor([-0.9502, -0.6321, 0.0000, 2.0000])
```

Unlike ReLU, negative inputs remain informative.
Derivative of ELU
For positive inputs:

$$\frac{d}{dx}\,\mathrm{ELU}(x) = 1$$

For negative inputs:

$$\frac{d}{dx}\,\mathrm{ELU}(x) = \alpha e^{x}$$
The derivative remains positive for all inputs. Therefore, negative activations still receive gradients.
This reduces the dying neuron problem seen in ReLU networks.
However, when $x$ becomes very negative, $\alpha e^{x} \to 0$,
so gradients still become small in strongly negative regions.
Thus ELU reduces saturation problems but does not eliminate them completely.
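A quick way to see this numerically is to ask autograd for the gradient of ELU at a few negative inputs; the values follow $\alpha e^{x}$ and are small but never exactly zero (a minimal sketch, assuming the default $\alpha = 1$):

```python
import torch
import torch.nn.functional as F

# Gradients of ELU at negative inputs: small, but never exactly zero.
x = torch.tensor([-6.0, -3.0, -1.0, -0.1], requires_grad=True)
F.elu(x).sum().backward()
print(x.grad)  # approximately alpha * exp(x): tensor([0.0025, 0.0498, 0.3679, 0.9048])
```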
Zero-Centered Activations
One important advantage of ELU is that its outputs can become negative.
ReLU outputs are always nonnegative: $\mathrm{ReLU}(x) \ge 0$.
ELU outputs can be both positive and negative: $\mathrm{ELU}(x) \in (-\alpha, \infty)$.
This produces activations that are more centered around zero. Zero-centered activations often improve optimization because the next layer receives both positive and negative signals.
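A rough way to see the zero-centering effect is to compare mean activations on zero-mean Gaussian inputs (a small sketch; the printed values are approximate and depend on the random draw):

```python
import torch
import torch.nn.functional as F

# Mean activation on zero-mean Gaussian inputs: ReLU output is strongly
# positive on average, ELU output stays noticeably closer to zero.
torch.manual_seed(0)
x = torch.randn(100_000)
print(F.relu(x).mean())  # roughly 0.40
print(F.elu(x).mean())   # roughly 0.16
```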
This effect was historically important before normalization layers became standard.
SELU and Self-Normalization
A closely related activation is SELU (Scaled ELU). SELU introduces carefully chosen scaling constants:

$$\mathrm{SELU}(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha \left(e^{x} - 1\right), & x \le 0 \end{cases}$$

where $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$ are fixed values chosen to stabilize activation statistics across layers.
SELU was designed for self-normalizing neural networks, where activations automatically maintain approximately stable mean and variance.
In practice, SELU requires specific conditions:
- Proper initialization
- Alpha dropout
- Fully connected architectures
- Careful layer design
Because modern networks often use batch normalization or layer normalization, SELU is less common today than ReLU or GELU.
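For reference, `nn.SELU` in PyTorch applies the fixed constants internally, so no extra configuration is needed (a minimal sketch; the printed values are approximate):

```python
import torch
import torch.nn as nn

# nn.SELU applies the fixed scaling constants internally
# (lambda ~ 1.0507, alpha ~ 1.6733).
x = torch.tensor([-3.0, -1.0, 0.0, 2.0])
print(nn.SELU()(x))  # approximately tensor([-1.6706, -1.1113, 0.0000, 2.1014])
```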
Gaussian Error Linear Units (GELU)
The Gaussian Error Linear Unit, or GELU, became especially important in transformers and large language models.
The GELU activation is

$$\mathrm{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.
Intuitively, GELU scales inputs according to their magnitude rather than applying a hard threshold.
Small negative values are not immediately discarded. Instead, they are smoothly reduced.
A common approximation is

$$\mathrm{GELU}(x) \approx \frac{1}{2}\, x \left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\, x^{3}\right)\right)\right)$$
Modern frameworks use optimized implementations internally.
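As a sketch of how close the tanh approximation is, the snippet below evaluates it by hand (the helper `gelu_tanh_approx` is illustrative) and compares it to PyTorch's exact, erf-based `F.gelu`:

```python
import math
import torch
import torch.nn.functional as F

def gelu_tanh_approx(x):
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + torch.tanh(c * (x + 0.044715 * x ** 3)))

x = torch.linspace(-4.0, 4.0, steps=9)
exact = F.gelu(x)                       # exact, erf-based GELU
approx = gelu_tanh_approx(x)
print((exact - approx).abs().max())     # small, typically below 1e-3
```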
In PyTorch:
```python
import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 2.0])
act = nn.GELU()
y = act(x)
print(y)
```

Output:

```
tensor([-0.0040, -0.1587, 0.0000, 1.9545])
```

Notice that negative values are not sharply clipped.
Intuition Behind GELU
ReLU behaves like a hard gate: $\mathrm{ReLU}(x) = x \cdot \mathbf{1}[x > 0]$.
GELU behaves more like a probabilistic gate. Inputs are weighted continuously according to magnitude.
Large positive values pass almost unchanged: as $x \to \infty$, $\Phi(x) \to 1$ and $\mathrm{GELU}(x) \approx x$.
Large negative values shrink toward zero: as $x \to -\infty$, $\Phi(x) \to 0$ and $\mathrm{GELU}(x) \to 0$.
Near zero, the transition is smooth.
This smoother behavior improves optimization in very large models.
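The gate interpretation can be made explicit by computing $\Phi(x)$ directly; a minimal sketch using `torch.distributions`:

```python
import torch

# Phi(x), the gate applied by GELU, moves smoothly from 0 to 1.
x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
gate = torch.distributions.Normal(0.0, 1.0).cdf(x)
print(gate)      # approximately [0.0013, 0.1587, 0.5000, 0.8413, 0.9987]
print(x * gate)  # equals GELU(x) = x * Phi(x)
```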
GELU in Transformers
GELU became standard in transformer architectures after models such as BERT demonstrated strong performance using it.
A transformer feedforward block often computes

$$\mathrm{FFN}(x) = W_2\,\mathrm{GELU}(W_1 x + b_1) + b_2$$
This structure appears in many modern models:
- BERT
- GPT-2
- Vision Transformer
GELU is now one of the most common activations in large transformer systems.
Swish Activation
Swish was introduced as another smooth activation function:

$$\mathrm{Swish}(x) = x \cdot \sigma(x)$$

Since

$$\sigma(x) = \frac{1}{1 + e^{-x}},$$

Swish becomes

$$\mathrm{Swish}(x) = \frac{x}{1 + e^{-x}}.$$
The function resembles GELU but is simpler.
In PyTorch, Swish is commonly implemented as SiLU (Sigmoid Linear Unit):
```python
import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 2.0])
act = nn.SiLU()
y = act(x)
print(y)
```

Output:

```
tensor([-0.1423, -0.2689, 0.0000, 1.7616])
```

PyTorch uses the name SiLU, but mathematically it corresponds to the Swish family.
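As a quick check that SiLU really is $x \cdot \sigma(x)$, the product can be formed by hand and compared with the built-in (a minimal sketch):

```python
import torch
import torch.nn.functional as F

# Swish/SiLU written out as x * sigmoid(x), checked against the built-in.
x = torch.tensor([-3.0, -1.0, 0.0, 2.0])
manual = x * torch.sigmoid(x)
print(torch.allclose(manual, F.silu(x)))  # expected: True
```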
Properties of Swish
Swish has several important properties:
- Smooth everywhere
- Nonmonotonic
- Preserves small negative information
- Strong gradient flow
- Unbounded above
- Bounded below
Unlike ReLU, Swish is not piecewise linear. Unlike sigmoid, it does not saturate strongly for positive inputs.
The nonmonotonic behavior means the function slightly bends downward for small negative inputs before increasing again. This may improve representational flexibility.
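The dip can be located numerically; a small sketch that scans a grid of negative inputs (the grid size is arbitrary):

```python
import torch
import torch.nn.functional as F

# Locate the dip of Swish/SiLU on a fine grid of negative inputs.
x = torch.linspace(-5.0, 0.0, steps=10_001)
y = F.silu(x)
i = torch.argmin(y)
print(x[i], y[i])  # roughly x = -1.28, minimum value around -0.28
```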
Comparing GELU and Swish
GELU and Swish are closely related.
GELU: $\mathrm{GELU}(x) = x \cdot \Phi(x)$

Swish: $\mathrm{Swish}(x) = x \cdot \sigma(x)$
Both multiply the input by a smooth gating function. Both preserve small negative values. Both are smooth alternatives to ReLU.
Empirically, their performance is often similar.
Modern language models frequently use either:
- GELU
- SiLU/Swish
depending on architecture and implementation preference.
Activation Smoothness
Smooth activations can improve optimization because gradients vary continuously.
ReLU has a discontinuous derivative at zero:

$$\frac{d}{dx}\,\mathrm{ReLU}(x) = \begin{cases} 1, & x > 0 \\ 0, & x < 0 \end{cases}$$
GELU and Swish are differentiable everywhere.
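One way to see the difference is to query autograd for derivatives just below and just above zero; a minimal sketch (the values noted in comments are approximate):

```python
import torch
import torch.nn.functional as F

# Derivatives near zero: ReLU jumps from 0 to 1, GELU changes gradually.
x = torch.tensor([-0.10, -0.01, 0.01, 0.10], requires_grad=True)

F.relu(x).sum().backward()
print(x.grad)   # tensor([0., 0., 1., 1.])

x.grad = None
F.gelu(x).sum().backward()
print(x.grad)   # values close to 0.5, varying smoothly
```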
Smooth derivatives may help:
- Optimization stability
- Gradient-based second-order methods
- Large-batch training
- Transformer scaling
However, smoothness alone does not guarantee better performance. Activation quality depends on the full training system.
Computational Cost
Smooth activations are more expensive than ReLU.
ReLU requires only $\max(0, x)$, a single comparison per element.
GELU and Swish require sigmoid, tanh, or Gaussian approximations involving exponentials.
For very large models, activation cost matters because activations are applied billions or trillions of times during training.
Despite the extra cost, modern hardware and optimized kernels make GELU and SiLU practical at scale.
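A very rough way to compare costs is a CPU micro-benchmark like the one below; absolute numbers vary widely with hardware, tensor size, and kernel fusion, so treat it only as a sketch:

```python
import time
import torch
import torch.nn.functional as F

# Rough CPU timing sketch; results depend heavily on hardware and backend.
x = torch.randn(10_000_000)

for name, fn in [("relu", F.relu), ("gelu", F.gelu), ("silu", F.silu)]:
    start = time.perf_counter()
    for _ in range(10):
        fn(x)
    print(name, time.perf_counter() - start)
```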
Activation Functions in Practice
Different architectures tend to favor different activations.
| Architecture | Common activations |
|---|---|
| Classical CNNs | ReLU |
| GAN discriminators | Leaky ReLU |
| Modern CNNs | ReLU, SiLU |
| Transformers | GELU, SiLU |
| Recurrent networks | Tanh, sigmoid |
| Mobile models | ReLU6, SiLU |
The choice is partly historical and partly empirical.
ReLU Versus Smooth Activations
| Property | ReLU | ELU | GELU | Swish/SiLU |
|---|---|---|---|---|
| Smooth | No | Yes (for $\alpha = 1$) | Yes | Yes |
| Negative outputs | No | Yes | Yes | Yes |
| Dead neurons | Possible | Less likely | Rare | Rare |
| Computational cost | Low | Medium | Higher | Higher |
| Transformer usage | Limited | Rare | Very common | Very common |
| CNN usage | Very common | Moderate | Increasing | Increasing |
ReLU remains attractive because it is simple and efficient. GELU and Swish became popular because they improve optimization in large-scale systems.
Activation Choice in PyTorch
PyTorch provides all major activations:
```python
import torch.nn as nn

relu = nn.ReLU()
leaky = nn.LeakyReLU(0.01)
elu = nn.ELU()
gelu = nn.GELU()
silu = nn.SiLU()
```

A transformer feedforward block might use GELU:

```python
block = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)
```

A CNN block might use SiLU:

```python
block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.SiLU(),
)
```

Practical Guidance
Use ReLU when simplicity and efficiency are priorities.
Use Leaky ReLU when dead activations become problematic.
Use GELU for transformers and large language models.
Use SiLU/Swish for modern CNNs and architectures inspired by efficient scaling.
Use ELU when negative activations and smoother transitions are desired, especially in smaller or older architectures.
The best activation depends on the architecture, optimizer, normalization strategy, initialization, and scale of training.
Exercises
Compare the outputs of ReLU, ELU, GELU, and Swish over a range of negative and positive inputs.
Explain why GELU behaves like a soft gate.
Show why ELU avoids completely zero gradients for negative inputs.
Implement the same transformer feedforward block using ReLU and GELU. Compare training stability.
Replace ReLU with SiLU in a CNN and measure the effect on validation accuracy and convergence speed.