Rectified Linear Units

The rectified linear unit, usually called ReLU, is the most widely used activation function in modern deep learning. ReLU transformed neural network training because it greatly reduced optimization difficulties that appeared in deep sigmoid and tanh networks.

The ReLU function is simple:

\mathrm{ReLU}(x) = \max(0, x).

Unlike sigmoid and tanh, ReLU does not saturate for positive inputs. This allows gradients to propagate more effectively through deep networks.

Definition of ReLU

The ReLU activation passes positive values unchanged and maps negative values to zero:

\mathrm{ReLU}(x) = \begin{cases} x & x > 0 \\ 0 & x \leq 0 \end{cases}

In PyTorch:

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
y = F.relu(x)

print(y)

Output:

tensor([0.0000, 0.0000, 0.0000, 1.5000, 3.0000])

The function is the identity for positive inputs and zero for negative inputs.

Piecewise Linear Structure

ReLU is a piecewise linear function. The input space is divided into two regions:

  • Negative region: x \leq 0
  • Positive region: x > 0

Within each region, the function behaves linearly.

This simple structure has important consequences. A deep ReLU network is composed of many linear regions stitched together. Although each local region is linear, the full network can represent highly complex nonlinear functions.
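
To make this concrete, here is a small sketch (layer sizes and the random seed are arbitrary choices for illustration) that counts how many distinct hidden-unit on/off patterns, and therefore linear regions, a tiny ReLU network produces along a one-dimensional input grid:

import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, for reproducibility
net = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))

# Each distinct on/off pattern of the hidden units corresponds to one
# linear region of the network.
xs = torch.linspace(-3, 3, 1000).unsqueeze(1)
hidden = net[1](net[0](xs))                  # post-ReLU hidden activations
patterns = {tuple((h > 0).tolist()) for h in hidden}
print(len(patterns))                         # number of linear regions crossed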

Modern deep networks often rely on this property. Large transformer models, CNNs, and multilayer perceptrons all use activations that are piecewise linear or approximately linear in some regions.

Derivative of ReLU

The derivative of ReLU is simple:

\frac{d}{dx}\mathrm{ReLU}(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}

At x=0x=0, the derivative is undefined because the function has a sharp corner. In practice, frameworks define a subgradient at zero, usually 0.

PyTorch's automatic differentiation applies this convention for you.

Example:

x = torch.tensor([-2.0, 0.5, 3.0], requires_grad=True)

y = F.relu(x)
loss = y.sum()

loss.backward()

print(x.grad)

Output:

tensor([0., 1., 1.])

Negative inputs receive zero gradient. Positive inputs receive gradient 1.
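
The convention at exactly zero can be checked directly:

# Checking the subgradient convention at exactly x = 0:
x0 = torch.tensor([0.0], requires_grad=True)
F.relu(x0).sum().backward()
print(x0.grad)   # tensor([0.]) — PyTorch uses 0 as the subgradient at zero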

Why ReLU Improved Deep Learning

The main advantage of ReLU is that it avoids saturation in the positive region.

For sigmoid and tanh, large input magnitudes produce very small derivatives. This causes vanishing gradients. ReLU behaves differently:

\frac{d}{dx}\mathrm{ReLU}(x) = 1 \quad \text{for } x > 0.

Thus, positive activations preserve gradient magnitude much better during backpropagation.

This property makes optimization easier in deep networks. Earlier layers receive stronger gradients, and learning proceeds faster.
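
A quick sketch (the depth of 20 and the positive input are arbitrary choices) makes the contrast visible:

# Push a gradient through 20 stacked activations and compare what survives.
x = torch.tensor([2.0], requires_grad=True)
y = x
for _ in range(20):
    y = torch.sigmoid(y)
y.backward()
print(x.grad)    # tiny: each sigmoid layer multiplies the gradient by < 0.25

x = torch.tensor([2.0], requires_grad=True)
y = x
for _ in range(20):
    y = F.relu(y)
y.backward()
print(x.grad)    # tensor([1.]) — ReLU preserves the gradient for positive inputs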

Before ReLU became standard, training very deep networks was difficult. Networks with many sigmoid layers often converged slowly or failed completely. ReLU enabled practical training of much deeper architectures.

This change played a major role in the success of deep convolutional networks after 2012.

Sparse Activations

ReLU produces sparse activations because negative inputs become exactly zero.

Suppose a hidden layer computes

h = \mathrm{ReLU}(Wx + b).

Many components of h may become zero. This means only part of the network is active for a given input.

Sparse activations have several useful effects:

  • Reduced computation in some systems
  • More localized representations
  • Improved gradient flow
  • Reduced interference between features

For example:

x = torch.randn(1000)

y = F.relu(x)

zero_fraction = (y == 0).float().mean()
print(zero_fraction)

Roughly half the activations are often zero when inputs are centered around zero.

ReLU Networks as Feature Selectors

ReLU can be viewed as a simple feature-selection mechanism.

Suppose a neuron computes

z = w^\top x + b.

The activation becomes

h = \max(0, z).

If z \leq 0, the neuron contributes nothing. If z > 0, the neuron activates and passes information forward.

Thus each ReLU unit behaves like a conditional feature detector. Different neurons activate for different regions of input space.

This interpretation becomes especially important in large networks, where many neurons specialize for different patterns.
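
As a toy illustration (with hand-picked weights, purely for exposition), a single unit can act as a detector for the condition "first input exceeds second input":

# A single ReLU neuron as a conditional feature detector.
w = torch.tensor([1.0, -1.0])   # hand-picked weights: fires when x[0] > x[1]
b = torch.tensor(0.0)

for x in (torch.tensor([2.0, 1.0]), torch.tensor([1.0, 2.0])):
    h = F.relu(w @ x + b)
    print(x.tolist(), "->", h.item())
# [2.0, 1.0] -> 1.0   (feature present: the unit activates)
# [1.0, 2.0] -> 0.0   (feature absent: the unit stays silent)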

ReLU in Deep Feedforward Networks

A standard multilayer network with ReLU activations may be written as

h_1 = \mathrm{ReLU}(W_1 x + b_1)
h_2 = \mathrm{ReLU}(W_2 h_1 + b_2)
y = W_3 h_2 + b_3.

Each layer alternates between a linear transformation and a nonlinear activation.

In PyTorch:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

This structure appears throughout modern deep learning.

ReLU and Computational Efficiency

ReLU is computationally cheap.

Sigmoid and tanh require exponentials:

e^x, \quad e^{-x}.

ReLU requires only a comparison:

\max(0, x).

This makes ReLU efficient on GPUs and accelerators. The operation is simple, vectorizable, and memory-efficient.

For large-scale models containing billions of activations, this computational simplicity matters significantly.
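
A crude timing sketch (results depend heavily on hardware and should not be read as a rigorous benchmark):

import time

x = torch.randn(10_000_000)

t0 = time.perf_counter(); torch.sigmoid(x); t1 = time.perf_counter()
t2 = time.perf_counter(); torch.relu(x);    t3 = time.perf_counter()
print(f"sigmoid: {t1 - t0:.4f} s   relu: {t3 - t2:.4f} s")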

The Dying ReLU Problem

Although ReLU improved optimization, it introduced a new issue: dead neurons.

A ReLU neuron becomes inactive when its input stays negative:

z = w^\top x + b < 0.

Then

\mathrm{ReLU}(z) = 0, \quad \frac{d}{dz}\mathrm{ReLU}(z) = 0.

Because the gradient is zero, the neuron receives no parameter updates from gradient descent. If this situation persists, the neuron may never activate again.

This is called the dying ReLU problem.

Example:

x = torch.tensor([-3.0], requires_grad=True)

y = F.relu(x)
y.backward()

print(x.grad)

Output:

tensor([0.])

No gradient flows through the inactive region.

Dead neurons are more likely when:

  • Learning rates are too large
  • Biases become strongly negative
  • Initialization is poor

In practice, some inactive neurons are acceptable. Problems arise when large fractions of the network become permanently inactive.
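
One way to monitor this (a sketch with assumed layer and batch sizes) is to check which hidden units never activate over a batch:

# Estimate the fraction of dead units in one layer over a batch of inputs.
layer = nn.Linear(128, 256)
x = torch.randn(512, 128)          # assumed batch of 512 inputs
h = F.relu(layer(x))
dead = (h == 0).all(dim=0)         # units that never activated in this batch
print(dead.float().mean())         # fraction of (possibly) dead units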

He Initialization

ReLU changes activation statistics because negative values are clipped to zero. This affects variance propagation through deep networks.

Suppose weights are initialized too large. Activations may explode. If weights are too small, activations may vanish.

For ReLU networks, a common initialization is He initialization:

W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right).

Here ninn_{\text{in}} is the number of input connections.

This scaling compensates for the fact that ReLU zeroes out roughly half of each layer's pre-activations, which would otherwise halve the activation variance at every layer.

In PyTorch:

layer = nn.Linear(256, 512)

nn.init.kaiming_normal_(layer.weight)

The term “Kaiming initialization” refers to the same method.

Proper initialization is especially important in deep ReLU networks.
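
A small experiment (depth 20 and width 256 are assumed values) shows why the scaling matters:

# Compare activation scale after 20 ReLU layers under two initializations.
def forward_std(init_fn, depth=20, width=256):
    x = torch.randn(1024, width)
    for _ in range(depth):
        layer = nn.Linear(width, width, bias=False)
        init_fn(layer.weight)
        x = F.relu(layer(x))
    return x.std().item()

print(forward_std(nn.init.kaiming_normal_))                  # stays O(1)
print(forward_std(lambda w: nn.init.normal_(w, std=0.01)))   # collapses toward 0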

ReLU and Gradient Propagation

Consider a deep network:

h^{(l)} = \mathrm{ReLU}(W^{(l)} h^{(l-1)} + b^{(l)}).

During backpropagation, the derivative through each ReLU layer is either 0 or 1.

Unlike with sigmoid, gradients in active regions are not repeatedly multiplied by small fractional derivatives. This helps preserve gradient magnitude across many layers.

However, ReLU does not completely solve optimization problems. Gradients can still explode or vanish due to weight matrices, poor normalization, or unstable architectures.

Modern networks therefore combine ReLU-like activations with:

  • Residual connections
  • Batch normalization
  • Layer normalization
  • Careful initialization
  • Adaptive optimizers

These techniques work together to stabilize deep training.
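
For illustration, here is a minimal sketch of a residual block combining several of these ingredients (the pre-norm layout with layer normalization is one common choice among several):

# A minimal pre-norm residual block: the identity path lets gradients
# bypass the ReLU entirely.
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.fc2(F.relu(self.fc1(self.norm(x))))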

ReLU in Convolutional Networks

ReLU became especially important in convolutional neural networks.

A convolutional block often has the form

\text{Conv} \rightarrow \text{Normalization} \rightarrow \text{ReLU}.

For example:

block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU()
)

After the success of AlexNet, ReLU became the standard activation for CNNs.

Later architectures such as ResNet and EfficientNet also relied heavily on ReLU-family activations.

Limitations of ReLU

Despite its success, ReLU has limitations.

First, negative inputs are discarded entirely. Information about negative magnitude is lost.

Second, ReLU is not zero-centered. Outputs are always nonnegative.

Third, dead neurons may appear.

Fourth, ReLU is nondifferentiable at zero. Although this rarely causes practical issues, it complicates some theoretical analyses.

These limitations motivated improved variants such as:

  • Leaky ReLU
  • Parametric ReLU
  • ELU
  • GELU
  • SiLU/Swish

Modern transformers often prefer GELU or SiLU because they provide smoother behavior.
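
The variants differ mainly in how they treat negative inputs; a quick comparison on the same values:

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

print(F.relu(x))              # hard zero for negatives
print(F.leaky_relu(x, 0.01))  # small negative slope instead of zero
print(F.elu(x))               # smooth exponential curve for negatives
print(F.gelu(x))              # smooth, probabilistic-style gating
print(F.silu(x))              # x * sigmoid(x), also known as Swish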

Comparison with Sigmoid and Tanh

Property             Sigmoid    Tanh       ReLU
Output range         (0, 1)     (-1, 1)    [0, ∞)
Saturation           Strong     Strong     Only negative region
Zero-centered        No         Yes        No
Maximum derivative   0.25       1          1
Sparse outputs       No         No         Yes
Computational cost   High       High       Low
Hidden-layer usage   Rare       Limited    Very common

ReLU solved several optimization problems that affected sigmoid and tanh networks. This made very deep architectures practical.

ReLU in Transformers

Original transformer models often used ReLU activations inside feedforward blocks:

\mathrm{FFN}(x) = W_2\, \mathrm{ReLU}(W_1 x + b_1) + b_2.
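
In PyTorch, such a block might look like the following sketch (the sizes 512 and 2048 are assumed, matching the common fourfold expansion):

# Transformer-style feedforward block with ReLU.
ffn = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)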

Later transformer architectures increasingly adopted GELU activations because they provide smoother gradients and slightly improved empirical performance.

Still, ReLU remains conceptually important because many later activations evolved from the same idea: preserve strong gradient flow while introducing nonlinear behavior.

Practical Guidance

Use ReLU as the default activation for many feedforward and convolutional networks unless there is a strong reason to choose another activation.

Combine ReLU with:

  • He initialization
  • Normalization layers
  • Residual connections

Monitor for dead neurons when training becomes unstable.

For transformer models, consider GELU or SiLU instead of plain ReLU.

For recurrent networks, tanh and sigmoid are still often preferable because bounded activations help stabilize hidden-state dynamics.

Exercises

  1. Compute the derivative of ReLU for positive and negative inputs.

  2. Explain why ReLU reduces vanishing gradient problems compared with sigmoid.

  3. Show why ReLU produces sparse activations.

  4. Implement a multilayer perceptron with nn.ReLU and compare its convergence speed against nn.Sigmoid.

  5. Modify a ReLU network to use Leaky ReLU and compare the fraction of dead neurons during training.