
ELU, GELU, and Swish

ReLU and its variants improved optimization in deep networks, but they still have limitations. ReLU is not smooth at zero, discards all negative values, and can produce dead activations. Later activation functions attempted to preserve the optimization advantages of ReLU while improving gradient behavior, smoothness, and representational flexibility.

Three important modern activations are ELU, GELU, and Swish. These functions appear frequently in modern convolutional networks, transformers, and large language models.

Motivation for Smooth Activations

ReLU is piecewise linear:

\mathrm{ReLU}(x)=\max(0,x).

Its derivative changes abruptly at zero. For negative inputs, the derivative is exactly zero.

Many later activations attempted to address several issues:

  • Preserve gradients for negative inputs
  • Avoid dead neurons
  • Provide smoother optimization landscapes
  • Improve gradient flow
  • Preserve some negative information
  • Improve large-scale training stability

The central idea is simple: instead of sharply clipping the negative side, use a smoother nonlinear transition.

Exponential Linear Units (ELU)

The exponential linear unit, or ELU, introduces a smooth exponential curve for negative inputs.

The function is defined as

\mathrm{ELU}(x)= \begin{cases} x & x>0 \\ \alpha(e^x-1) & x\leq 0 \end{cases}

where \alpha>0 controls the saturation value of the negative region.

A common choice is

\alpha=1.

For positive inputs, ELU behaves like ReLU. For negative inputs, it smoothly approaches -\alpha.

Properties of ELU

The negative side of ELU is smooth rather than flat.

As x\to -\infty,

\mathrm{ELU}(x)\to -\alpha.

At zero,

\mathrm{ELU}(0)=0.

The function is continuous, and when \alpha=1, its derivative is also continuous at zero.

This smooth transition improves gradient behavior compared with ReLU.

In PyTorch:

import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 2.0])

act = nn.ELU(alpha=1.0)
y = act(x)

print(y)

Output:

tensor([-0.9502, -0.6321,  0.0000,  2.0000])

Unlike ReLU, negative inputs remain informative.

Derivative of ELU

For positive inputs:

\frac{d}{dx}\mathrm{ELU}(x)=1.

For negative inputs:

\frac{d}{dx}\mathrm{ELU}(x)=\alpha e^x.

The derivative remains positive for all inputs. Therefore, negative activations still receive gradients.

This reduces the dying neuron problem seen in ReLU networks.

However, when x becomes very negative,

e^x\to 0,

so gradients still become small in strongly negative regions.

Thus ELU reduces saturation problems but does not eliminate them completely.
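This derivative can be checked numerically. The sketch below uses autograd to recover the analytic value \alpha e^x at a negative input:

```python
import torch

# Sketch: check the analytic ELU derivative alpha * e^x at a negative input
# using autograd.
alpha = 1.0
x = torch.tensor(-2.0, requires_grad=True)
torch.nn.functional.elu(x, alpha=alpha).backward()

analytic = alpha * torch.exp(x.detach())  # derivative formula for x <= 0
print(x.grad, analytic)  # both about 0.1353, i.e. e^(-2)
```

The gradient is small but strictly positive, matching the discussion above.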

Zero-Centered Activations

One important advantage of ELU is that its outputs can become negative.

ReLU outputs are always nonnegative:

\mathrm{ReLU}(x)\geq 0.

ELU outputs can be both positive and negative:

-\alpha < \mathrm{ELU}(x) < \infty.

This produces activations that are more centered around zero. Zero-centered activations often improve optimization because the next layer receives both positive and negative signals.

This effect was historically important before normalization layers became standard.
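A quick sketch (assuming zero-mean Gaussian inputs) illustrates the centering effect:

```python
import torch

# Sketch: compare the output means of ReLU and ELU on zero-mean inputs.
torch.manual_seed(0)
x = torch.randn(100_000)

relu_mean = torch.relu(x).mean()
elu_mean = torch.nn.functional.elu(x).mean()

print(f"ReLU mean: {relu_mean:.3f}")  # clearly positive, near 0.4
print(f"ELU mean:  {elu_mean:.3f}")   # noticeably closer to zero
```

The negative tail of ELU partially cancels the positive tail, pulling the output mean toward zero.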

SELU and Self-Normalization

A closely related activation is SELU (Scaled ELU). SELU introduces carefully chosen scaling constants:

\mathrm{SELU}(x)= \lambda \begin{cases} x & x>0 \\ \alpha(e^x-1) & x\leq 0 \end{cases}

where \lambda\approx 1.0507 and \alpha\approx 1.6733 are fixed values chosen to stabilize activation statistics across layers.

SELU was designed for self-normalizing neural networks, where activations automatically maintain approximately stable mean and variance.

In practice, SELU requires specific conditions:

  • Proper initialization
  • Alpha dropout
  • Fully connected architectures
  • Careful layer design

Because modern networks often use batch normalization or layer normalization, SELU is less common today than ReLU or GELU.
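As a sketch, SELU can be reproduced from ELU by applying the fixed constants directly (the values below are the ones PyTorch uses internally):

```python
import torch
import torch.nn.functional as F

# Sketch: SELU is lambda * ELU with fixed constants (PyTorch's values).
scale = 1.0507009873554805   # lambda
alpha = 1.6732632423543772

x = torch.linspace(-3.0, 3.0, 101)
manual = scale * F.elu(x, alpha=alpha)

print(torch.allclose(F.selu(x), manual, atol=1e-5))  # True
```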

Gaussian Error Linear Units (GELU)

The Gaussian Error Linear Unit, or GELU, became especially important in transformers and large language models.

The GELU activation is

\mathrm{GELU}(x)=x\Phi(x),

where \Phi(x) is the cumulative distribution function of the standard normal distribution.

Intuitively, GELU scales inputs according to their magnitude rather than applying a hard threshold.

Small negative values are not immediately discarded. Instead, they are smoothly reduced.

A common approximation is

\mathrm{GELU}(x) \approx 0.5x \left( 1+\tanh\left( \sqrt{\frac{2}{\pi}} \left( x+0.044715x^3 \right) \right) \right).

Modern frameworks use optimized implementations internally.
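As a rough check, the exact erf-based definition and the tanh approximation can be compared on a grid:

```python
import math
import torch

# Sketch: exact GELU via the Gaussian CDF (erf) versus the tanh approximation.
x = torch.linspace(-4.0, 4.0, 81)

exact = 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))
approx = 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi)
                                     * (x + 0.044715 * x**3)))

max_err = (exact - approx).abs().max()
print(max_err)  # small; well under 1e-2 on this range
```

The approximation stays within a small fraction of a percent of the exact function over typical input ranges.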

In PyTorch:

import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 2.0])

act = nn.GELU()
y = act(x)

print(y)

Output:

tensor([-0.0040, -0.1587,  0.0000,  1.9545])

Notice that negative values are not sharply clipped.

Intuition Behind GELU

ReLU behaves like a hard gate:

x>0 \Rightarrow \text{keep}, \quad x\leq 0 \Rightarrow \text{discard}.

GELU behaves more like a probabilistic gate. Inputs are weighted continuously according to magnitude.

Large positive values pass almost unchanged:

\mathrm{GELU}(x)\approx x \quad \text{for large } x>0.

Large negative values shrink toward zero:

\mathrm{GELU}(x)\approx 0 \quad \text{for large } x<0.

Near zero, the transition is smooth.

This smoother behavior improves optimization in very large models.

GELU in Transformers

GELU became standard in transformer architectures after models such as BERT demonstrated strong performance using it.

A transformer feedforward block often computes

\mathrm{FFN}(x) = W_2\,\mathrm{GELU}(W_1x+b_1)+b_2.

This structure appears in many modern models:

  • BERT
  • GPT-2
  • Vision Transformer

GELU is now one of the most common activations in large transformer systems.

Swish Activation

Swish was introduced as another smooth activation function:

\mathrm{Swish}(x)=x\sigma(x).

Since

\sigma(x)=\frac{1}{1+e^{-x}},

Swish becomes

\mathrm{Swish}(x)=\frac{x}{1+e^{-x}}.

The function resembles GELU but is simpler to evaluate: it requires only a sigmoid rather than the Gaussian CDF.

In PyTorch, Swish is commonly implemented as SiLU (Sigmoid Linear Unit):

import torch
import torch.nn as nn

x = torch.tensor([-3.0, -1.0, 0.0, 2.0])

act = nn.SiLU()
y = act(x)

print(y)

Output:

tensor([-0.1423, -0.2689,  0.0000,  1.7616])

PyTorch uses the name SiLU, but mathematically it corresponds to the Swish family.
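A minimal sketch confirming that nn.SiLU matches x multiplied by its sigmoid:

```python
import torch

# Sketch: SiLU computed manually as x * sigmoid(x) matches nn.SiLU.
x = torch.tensor([-3.0, -1.0, 0.0, 2.0])

manual = x * torch.sigmoid(x)
builtin = torch.nn.SiLU()(x)

print(torch.allclose(manual, builtin))  # True
```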

Properties of Swish

Swish has several important properties:

  • Smooth everywhere
  • Nonmonotonic
  • Preserves small negative information
  • Strong gradient flow
  • Unbounded above
  • Bounded below

Unlike ReLU, Swish is not piecewise linear. Unlike sigmoid, it does not saturate strongly for positive inputs.

The nonmonotonic behavior means the function slightly bends downward for small negative inputs before increasing again. This may improve representational flexibility.
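The dip can be located numerically. A simple grid search (a sketch, not part of any standard API) finds the minimum near x ≈ -1.28:

```python
import torch

# Sketch: locate the negative-side dip of Swish with a grid search.
x = torch.linspace(-5.0, 0.0, 50_001)
y = x * torch.sigmoid(x)

i = torch.argmin(y)
print(f"minimum near x = {x[i]:.3f}, value = {y[i]:.3f}")
# minimum near x = -1.278, value = -0.278
```

This is the "bounded below" property from the list above: Swish never drops below roughly -0.28 (for the plain x\sigma(x) form).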

Comparing GELU and Swish

GELU and Swish are closely related.

GELU:

\mathrm{GELU}(x)=x\Phi(x)

Swish:

\mathrm{Swish}(x)=x\sigma(x)

Both multiply the input by a smooth gating function. Both preserve small negative values. Both are smooth alternatives to ReLU.

Empirically, their performance is often similar.

Modern language models frequently use either:

  • GELU
  • SiLU/Swish

depending on architecture and implementation preference.
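One way to quantify the similarity is to measure the largest pointwise gap on a typical input range (a rough sketch):

```python
import torch

# Sketch: GELU and SiLU track each other closely on typical input ranges.
x = torch.linspace(-4.0, 4.0, 81)

gelu = torch.nn.functional.gelu(x)
silu = torch.nn.functional.silu(x)

diff = (gelu - silu).abs().max()
print(diff)  # on the order of 0.2, near |x| = 2
```

The two curves agree near zero and for large |x|; the gap peaks at moderate magnitudes, where the Gaussian CDF and the sigmoid differ most.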

Activation Smoothness

Smooth activations can improve optimization because gradients vary continuously.

ReLU has a discontinuous derivative at zero:

\mathrm{ReLU}'(0) \text{ is undefined}.

GELU and Swish are differentiable everywhere.

Smooth derivatives may help:

  • Optimization stability
  • Gradient-based second-order methods
  • Large-batch training
  • Transformer scaling

However, smoothness alone does not guarantee better performance. Activation quality depends on the full training system.
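As a sketch, autograd confirms that both smooth functions have a well-defined derivative at zero (each equals 0.5, since \Phi(0)=\sigma(0)=0.5):

```python
import torch

# Sketch: GELU and SiLU have well-defined derivatives at x = 0.
grads = []
for fn in (torch.nn.functional.gelu, torch.nn.functional.silu):
    x = torch.tensor(0.0, requires_grad=True)
    fn(x).backward()
    grads.append(x.grad.item())

print(grads)  # both approximately 0.5
```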

Computational Cost

Smooth activations are more expensive than ReLU.

ReLU requires only:

\max(0,x).

GELU and Swish require sigmoid, tanh, or Gaussian approximations involving exponentials.

For very large models, activation cost matters because activations are applied billions or trillions of times during training.

Despite the extra cost, modern hardware and optimized kernels make GELU and SiLU practical at scale.
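A rough CPU timing sketch gives a feel for the relative cost (absolute numbers will vary with hardware, thread count, and PyTorch build):

```python
import time
import torch

# Rough timing sketch; results depend heavily on hardware and build.
x = torch.randn(1_000_000)

times = {}
for name, fn in [("relu", torch.relu),
                 ("gelu", torch.nn.functional.gelu),
                 ("silu", torch.nn.functional.silu)]:
    start = time.perf_counter()
    for _ in range(20):
        fn(x)
    times[name] = time.perf_counter() - start
    print(f"{name}: {times[name] * 1e3:.1f} ms")
```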

Activation Functions in Practice

Different architectures tend to favor different activations.

Architecture | Common activations
Classical CNNs | ReLU
GAN discriminators | Leaky ReLU
Modern CNNs | ReLU, SiLU
Transformers | GELU, SiLU
Recurrent networks | Tanh, sigmoid
Mobile models | ReLU6, SiLU

The choice is partly historical and partly empirical.

ReLU Versus Smooth Activations

Property | ReLU | ELU | GELU | Swish/SiLU
Smooth | No | Mostly | Yes | Yes
Negative outputs | No | Yes | Yes | Yes
Dead neurons | Possible | Less likely | Rare | Rare
Computational cost | Low | Medium | Higher | Higher
Transformer usage | Limited | Rare | Very common | Very common
CNN usage | Very common | Moderate | Increasing | Increasing

ReLU remains attractive because it is simple and efficient. GELU and Swish became popular because they improve optimization in large-scale systems.

Activation Choice in PyTorch

PyTorch provides all major activations:

import torch.nn as nn

relu = nn.ReLU()
leaky = nn.LeakyReLU(0.01)
elu = nn.ELU()
gelu = nn.GELU()
silu = nn.SiLU()

A transformer feedforward block might use GELU:

block = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)

A CNN block might use SiLU:

block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.SiLU(),
)

Practical Guidance

Use ReLU when simplicity and efficiency are priorities.

Use Leaky ReLU when dead activations become problematic.

Use GELU for transformers and large language models.

Use SiLU/Swish for modern CNNs and architectures inspired by efficient scaling.

Use ELU when negative activations and smoother transitions are desired, especially in smaller or older architectures.

The best activation depends on the architecture, optimizer, normalization strategy, initialization, and scale of training.

Exercises

  1. Compare the outputs of ReLU, ELU, GELU, and Swish for inputs between -5 and 5.

  2. Explain why GELU behaves like a soft gate.

  3. Show why ELU avoids completely zero gradients for negative inputs.

  4. Implement the same transformer feedforward block using ReLU and GELU. Compare training stability.

  5. Replace ReLU with SiLU in a CNN and measure the effect on validation accuracy and convergence speed.