
ELU, GELU, and Swish

ReLU and its variants improved optimization in deep networks, but they still have limitations. ReLU is not smooth at zero, discards all negative values, and can produce dead activations. Later activation functions attempted to preserve the optimization advantages of ReLU while improving gradient behavior, smoothness, and representational flexibility.

Three important modern activations are ELU, GELU, and Swish. These functions appear frequently in modern convolutional networks, transformers, and large language models.

Motivation for Smooth Activations

ReLU is piecewise linear:

\mathrm{ReLU}(x)=\max(0,x).

Its derivative changes abruptly at zero. For negative inputs, the derivative is exactly zero.

Many later activations attempted to address several issues:

  • Preserve gradients for negative inputs
  • Avoid dead neurons
  • Provide smoother optimization landscapes
  • Improve gradient flow
  • Preserve some negative information
  • Improve large-scale training stability

The central idea is simple: instead of sharply clipping the negative side, use a smoother nonlinear transition.

Exponential Linear Units (ELU)

The exponential linear unit, or ELU, introduces a smooth exponential curve for negative inputs.

The function is defined as

\mathrm{ELU}(x)= \begin{cases} x & x>0 \\ \alpha(e^x-1) & x\leq 0 \end{cases}

where \alpha>0 controls the saturation value of the negative region.

A common choice is

\alpha=1.

For positive inputs, ELU behaves like ReLU. For negative inputs, it smoothly approaches -\alpha.

Properties of ELU

The negative side of ELU is smooth rather than flat.

As x\to -\infty,

\mathrm{ELU}(x)\to -\alpha.

At zero,

\mathrm{ELU}(0)=0.

The function is continuous, and when \alpha=1, its derivative is also continuous at zero.

This smooth transition improves gradient behavior compared with ReLU.

In PyTorch:

import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 2.0])

act = nn.ELU(alpha=1.0)
y = act(x)

print(y)

Output:

tensor([-0.9502, -0.6321,  0.0000,  2.0000])

Unlike ReLU, negative inputs remain informative.

Derivative of ELU

For positive inputs:

\frac{d}{dx}\mathrm{ELU}(x)=1.

For negative inputs:

\frac{d}{dx}\mathrm{ELU}(x)=\alpha e^x.

The derivative remains positive for all inputs. Therefore, negative activations still receive gradients.

This reduces the dying neuron problem seen in ReLU networks.

However, when x becomes very negative,

e^x\to 0,

so gradients still become small in strongly negative regions.

Thus ELU reduces saturation problems but does not eliminate them completely.
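This derivative can be checked numerically. The sketch below uses autograd to recover the analytic value \alpha e^x at a negative input:

```python
import torch

# Sketch: check the analytic ELU derivative alpha * e^x at a negative input
# using autograd.
alpha = 1.0
x = torch.tensor(-2.0, requires_grad=True)
torch.nn.functional.elu(x, alpha=alpha).backward()

analytic = alpha * torch.exp(x.detach())  # derivative formula for x <= 0
print(x.grad, analytic)  # both about 0.1353, i.e. e^(-2)
```

The gradient is small but strictly positive, matching the discussion above.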

Zero-Centered Activations

One important advantage of ELU is that its outputs can become negative.

ReLU outputs are always nonnegative:

\mathrm{ReLU}(x)\geq 0.

ELU outputs can be both positive and negative:

-\alpha < \mathrm{ELU}(x) < \infty.

This produces activations that are more centered around zero. Zero-centered activations often improve optimization because the next layer receives both positive and negative signals.

This effect was historically important before normalization layers became standard.
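A quick sketch (assuming zero-mean Gaussian inputs) illustrates the centering effect:

```python
import torch

# Sketch: compare the output means of ReLU and ELU on zero-mean inputs.
torch.manual_seed(0)
x = torch.randn(100_000)

relu_mean = torch.relu(x).mean()
elu_mean = torch.nn.functional.elu(x).mean()

print(f"ReLU mean: {relu_mean:.3f}")  # clearly positive, near 0.4
print(f"ELU mean:  {elu_mean:.3f}")   # noticeably closer to zero
```

The negative tail of ELU partially cancels the positive tail, pulling the output mean toward zero.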

SELU and Self-Normalization

A closely related activation is SELU (Scaled ELU). SELU introduces carefully chosen scaling constants:

\mathrm{SELU}(x)= \lambda \begin{cases} x & x>0 \\ \alpha(e^x-1) & x\leq 0 \end{cases}

where \lambda\approx 1.0507 and \alpha\approx 1.6733 are fixed values chosen to stabilize activation statistics across layers.

SELU was designed for self-normalizing neural networks, where activations automatically maintain approximately stable mean and variance.

In practice, SELU requires specific conditions:

  • Proper initialization
  • Alpha dropout
  • Fully connected architectures
  • Careful layer design

Because modern networks often use batch normalization or layer normalization, SELU is less common today than ReLU or GELU.
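As a sketch, SELU can be reproduced from ELU by applying the fixed constants directly (the values below are the ones PyTorch uses internally):

```python
import torch
import torch.nn.functional as F

# Sketch: SELU is lambda * ELU with fixed constants (PyTorch's values).
scale = 1.0507009873554805   # lambda
alpha = 1.6732632423543772

x = torch.linspace(-3.0, 3.0, 101)
manual = scale * F.elu(x, alpha=alpha)

print(torch.allclose(F.selu(x), manual, atol=1e-5))  # True
```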

Gaussian Error Linear Units (GELU)

The Gaussian Error Linear Unit, or GELU, became especially important in transformers and large language models.

The GELU activation is

\mathrm{GELU}(x)=x\Phi(x),

where \Phi(x) is the cumulative distribution function of the standard normal distribution.

Intuitively, GELU scales inputs according to their magnitude rather than applying a hard threshold.

Small negative values are not immediately discarded. Instead, they are smoothly reduced.

A common approximation is

\mathrm{GELU}(x) \approx 0.5x \left( 1+\tanh\left( \sqrt{\frac{2}{\pi}} \left( x+0.044715x^3 \right) \right) \right).

Modern frameworks use optimized implementations internally.
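As a rough check, the exact erf-based definition and the tanh approximation can be compared on a grid:

```python
import math
import torch

# Sketch: exact GELU via the Gaussian CDF (erf) versus the tanh approximation.
x = torch.linspace(-4.0, 4.0, 81)

exact = 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))
approx = 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi)
                                     * (x + 0.044715 * x**3)))

max_err = (exact - approx).abs().max()
print(max_err)  # small; well under 1e-2 on this range
```

The approximation stays within a small fraction of a percent of the exact function over typical input ranges.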

In PyTorch:

import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 2.0])

act = nn.GELU()
y = act(x)

print(y)

Output:

tensor([-0.0040, -0.1587,  0.0000,  1.9545])

Notice that negative values are not sharply clipped.

Intuition Behind GELU

ReLU behaves like a hard gate:

x>0 \Rightarrow \text{keep}, \quad x\leq 0 \Rightarrow \text{discard}.

GELU behaves more like a probabilistic gate. Inputs are weighted continuously according to magnitude.

Large positive values pass almost unchanged:

\mathrm{GELU}(x)\approx x \quad \text{for large } x>0.

Large negative values shrink toward zero:

\mathrm{GELU}(x)\approx 0 \quad \text{for large } x<0.

Near zero, the transition is smooth.

This smoother behavior improves optimization in very large models.

GELU in Transformers

GELU became standard in transformer architectures after models such as BERT demonstrated strong performance using it.

A transformer feedforward block often computes

\mathrm{FFN}(x) = W_2\,\mathrm{GELU}(W_1x+b_1)+b_2.

This structure appears in many modern models:

  • BERT
  • GPT-2
  • Vision Transformer

GELU is now one of the most common activations in large transformer systems.

Swish Activation

Swish was introduced as another smooth activation function:

\mathrm{Swish}(x)=x\sigma(x).

Since

\sigma(x)=\frac{1}{1+e^{-x}},

Swish becomes

\mathrm{Swish}(x)=\frac{x}{1+e^{-x}}.

The function resembles GELU but is simpler to evaluate: it requires only a sigmoid rather than the Gaussian CDF.

In PyTorch, Swish is commonly implemented as SiLU (Sigmoid Linear Unit):

import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 2.0])

act = nn.SiLU()
y = act(x)

print(y)

Output:

tensor([-0.1423, -0.2689,  0.0000,  1.7616])

PyTorch uses the name SiLU, but mathematically it corresponds to the Swish family.
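A minimal sketch confirming that nn.SiLU matches x multiplied by its sigmoid:

```python
import torch

# Sketch: SiLU computed manually as x * sigmoid(x) matches nn.SiLU.
x = torch.tensor([-3.0, -1.0, 0.0, 2.0])

manual = x * torch.sigmoid(x)
builtin = torch.nn.SiLU()(x)

print(torch.allclose(manual, builtin))  # True
```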

Properties of Swish

Swish has several important properties:

  • Smooth everywhere
  • Nonmonotonic
  • Preserves small negative information
  • Strong gradient flow
  • Unbounded above
  • Bounded below

Unlike ReLU, Swish is not piecewise linear. Unlike sigmoid, it does not saturate strongly for positive inputs.

The nonmonotonic behavior means the function slightly bends downward for small negative inputs before increasing again. This may improve representational flexibility.
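The dip can be located numerically. A simple grid search (a sketch, not part of any standard API) finds the minimum near x ≈ -1.28:

```python
import torch

# Sketch: locate the negative-side dip of Swish with a grid search.
x = torch.linspace(-5.0, 0.0, 50_001)
y = x * torch.sigmoid(x)

i = torch.argmin(y)
print(f"minimum near x = {x[i]:.3f}, value = {y[i]:.3f}")
# minimum near x = -1.278, value = -0.278
```

This is the "bounded below" property from the list above: Swish never drops below roughly -0.28 (for the plain x\sigma(x) form).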

Comparing GELU and Swish

GELU and Swish are closely related.

GELU:

\mathrm{GELU}(x)=x\Phi(x)

Swish:

\mathrm{Swish}(x)=x\sigma(x)

Both multiply the input by a smooth gating function. Both preserve small negative values. Both are smooth alternatives to ReLU.

Empirically, their performance is often similar.

Modern language models frequently use either:

  • GELU
  • SiLU/Swish

depending on architecture and implementation preference.
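One way to quantify the similarity is to measure the largest pointwise gap on a typical input range (a rough sketch):

```python
import torch

# Sketch: GELU and SiLU track each other closely on typical input ranges.
x = torch.linspace(-4.0, 4.0, 81)

gelu = torch.nn.functional.gelu(x)
silu = torch.nn.functional.silu(x)

diff = (gelu - silu).abs().max()
print(diff)  # on the order of 0.2, near |x| = 2
```

The two curves agree near zero and for large |x|; the gap peaks at moderate magnitudes, where the Gaussian CDF and the sigmoid differ most.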

Activation Smoothness

Smooth activations can improve optimization because gradients vary continuously.

ReLU has a discontinuous derivative at zero:

\mathrm{ReLU}'(0) \text{ is undefined}.

GELU and Swish are differentiable everywhere.

Smooth derivatives may help:

  • Optimization stability
  • Gradient-based second-order methods
  • Large-batch training
  • Transformer scaling

However, smoothness alone does not guarantee better performance. Activation quality depends on the full training system.
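As a sketch, autograd confirms that both smooth functions have a well-defined derivative at zero (each equals 0.5, since \Phi(0)=\sigma(0)=0.5):

```python
import torch

# Sketch: GELU and SiLU have well-defined derivatives at x = 0.
grads = []
for fn in (torch.nn.functional.gelu, torch.nn.functional.silu):
    x = torch.tensor(0.0, requires_grad=True)
    fn(x).backward()
    grads.append(x.grad.item())

print(grads)  # both approximately 0.5
```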

Computational Cost

Smooth activations are more expensive than ReLU.

ReLU requires only:

\max(0,x).

GELU and Swish require sigmoid, tanh, or Gaussian approximations involving exponentials.

For very large models, activation cost matters because activations are applied billions or trillions of times during training.

Despite the extra cost, modern hardware and optimized kernels make GELU and SiLU practical at scale.
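A rough CPU timing sketch gives a feel for the relative cost (absolute numbers will vary with hardware, thread count, and PyTorch build):

```python
import time
import torch

# Rough timing sketch; results depend heavily on hardware and build.
x = torch.randn(1_000_000)

times = {}
for name, fn in [("relu", torch.relu),
                 ("gelu", torch.nn.functional.gelu),
                 ("silu", torch.nn.functional.silu)]:
    start = time.perf_counter()
    for _ in range(20):
        fn(x)
    times[name] = time.perf_counter() - start
    print(f"{name}: {times[name] * 1e3:.1f} ms")
```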

Activation Functions in Practice

Different architectures tend to favor different activations.

Architecture | Common activations
Classical CNNs | ReLU
GAN discriminators | Leaky ReLU
Modern CNNs | ReLU, SiLU
Transformers | GELU, SiLU
Recurrent networks | Tanh, sigmoid
Mobile models | ReLU6, SiLU

The choice is partly historical and partly empirical.

ReLU Versus Smooth Activations

Property | ReLU | ELU | GELU | Swish/SiLU
Smooth | No | Mostly | Yes | Yes
Negative outputs | No | Yes | Yes | Yes
Dead neurons | Possible | Less likely | Rare | Rare
Computational cost | Low | Medium | Higher | Higher
Transformer usage | Limited | Rare | Very common | Very common
CNN usage | Very common | Moderate | Increasing | Increasing

ReLU remains attractive because it is simple and efficient. GELU and Swish became popular because they improve optimization in large-scale systems.

Activation Choice in PyTorch

PyTorch provides all major activations:

import torch.nn as nn

relu = nn.ReLU()
leaky = nn.LeakyReLU(0.01)
elu = nn.ELU()
gelu = nn.GELU()
silu = nn.SiLU()

A transformer feedforward block might use GELU:

block = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)

A CNN block might use SiLU:

block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.SiLU(),
)

Practical Guidance

Use ReLU when simplicity and efficiency are priorities.

Use Leaky ReLU when dead activations become problematic.

Use GELU for transformers and large language models.

Use SiLU/Swish for modern CNNs and architectures inspired by efficient scaling.

Use ELU when negative activations and smoother transitions are desired, especially in smaller or older architectures.

The best activation depends on the architecture, optimizer, normalization strategy, initialization, and scale of training.

Exercises

  1. Compare the outputs of ReLU, ELU, GELU, and Swish for inputs between -5 and 5.

  2. Explain why GELU behaves like a soft gate.

  3. Show why ELU avoids completely zero gradients for negative inputs.

  4. Implement the same transformer feedforward block using ReLU and GELU. Compare training stability.

  5. Replace ReLU with SiLU in a CNN and measure the effect on validation accuracy and convergence speed.