ReLU is simple and effective, but it has one sharp weakness. For negative inputs, the output is zero and the gradient is also zero. A unit that stays in this region receives no useful learning signal through that activation. Leaky ReLU and Parametric ReLU modify the negative side of ReLU so that some signal can still pass through.
Motivation
The standard ReLU is

$$\mathrm{ReLU}(x) = \max(0, x).$$

Its negative side is flat:

$$\mathrm{ReLU}(x) = 0 \quad \text{for } x < 0.$$
This can create dead units. Once a unit’s pre-activation is negative for nearly all inputs, its gradient through ReLU is zero for nearly all examples. Large learning rates, poor initialization, and unfavorable bias values can make this more likely.
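The effect is easy to reproduce. The following sketch (illustrative only) pushes the biases of a randomly initialized linear layer strongly negative, so that almost every pre-activation is negative and ReLU outputs zero almost everywhere:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(100, 50)
# Simulate unfavorable bias values by pushing all biases strongly negative.
with torch.no_grad():
    layer.bias.fill_(-5.0)

x = torch.randn(1000, 100)
activations = torch.relu(layer(x))
# Fraction of (example, unit) pairs where the activation -- and hence the
# gradient through ReLU -- is exactly zero.
dead_fraction = (activations == 0).float().mean()
print(dead_fraction)  # close to 1.0
```

Every unit in this layer is effectively dead: its output, and therefore its gradient signal, is zero for essentially every input.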
Leaky ReLU changes the negative side from a flat line to a line with small slope. Instead of discarding all negative values, it preserves a scaled version of them.
Leaky ReLU
The Leaky ReLU function is defined as

$$\mathrm{LeakyReLU}(x) = \begin{cases} x & \text{if } x \ge 0 \\ \alpha x & \text{if } x < 0 \end{cases}$$

where $\alpha$ is a small positive constant, often $\alpha = 0.01$.

If $x$ is positive, Leaky ReLU behaves like ReLU. If $x$ is negative, the output remains negative but scaled down.

For example, with $\alpha = 0.01$,

$$\mathrm{LeakyReLU}(-5) = -0.05, \qquad \mathrm{LeakyReLU}(2) = 2.$$
In PyTorch:
```python
import torch
import torch.nn as nn

x = torch.tensor([-5.0, -1.0, 0.0, 2.0])
act = nn.LeakyReLU(negative_slope=0.01)
y = act(x)
print(y)
```

Output:

```
tensor([-0.0500, -0.0100,  0.0000,  2.0000])
```

Derivative of Leaky ReLU
The derivative is

$$\frac{d}{dx}\,\mathrm{LeakyReLU}(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha & \text{if } x < 0 \end{cases}$$

At $x = 0$, the derivative is formally undefined, as with ReLU. In practice, automatic differentiation systems choose a convenient subgradient.

The important difference is the negative region. ReLU has derivative 0 there. Leaky ReLU has derivative $\alpha$. Therefore, a negative pre-activation can still receive a nonzero gradient.
This does not guarantee that every unit remains useful, but it reduces the chance that a unit becomes permanently inactive.
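The difference in the negative region can be verified directly with autograd; a minimal sketch:

```python
import torch
import torch.nn.functional as F

# Same negative input through both activations.
x1 = torch.tensor([-2.0], requires_grad=True)
F.relu(x1).sum().backward()
print(x1.grad)  # tensor([0.]) -- ReLU blocks the gradient

x2 = torch.tensor([-2.0], requires_grad=True)
F.leaky_relu(x2, negative_slope=0.01).sum().backward()
print(x2.grad)  # tensor([0.0100]) -- a scaled gradient still flows
```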
Effect of the Negative Slope
The negative slope controls how much signal passes through negative inputs.
| Negative slope $\alpha$ | Behavior |
|---|---|
| 0 | Same as ReLU |
| 0.001 | Very small leakage |
| 0.01 | Common default |
| 0.1 | Stronger negative signal |
| 1 | Linear identity function |
If $\alpha$ is too small, the function behaves almost like ReLU. If $\alpha$ is too large, the nonlinearity becomes weaker because negative and positive inputs pass through with similar slopes.

A common default is $\alpha = 0.01$, but it should be treated as a hyperparameter rather than a law.
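The two extremes of the table are easy to check numerically; a short sketch:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-4.0, -1.0, 0.5, 3.0])
for slope in (0.0, 0.01, 0.1, 1.0):
    print(slope, F.leaky_relu(x, negative_slope=slope))
# slope 0.0 reproduces ReLU exactly; slope 1.0 returns the input unchanged
```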
Leaky ReLU in a Network
A multilayer perceptron using Leaky ReLU is nearly identical to a ReLU network:
```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.LeakyReLU(negative_slope=0.01),
    nn.Linear(256, 256),
    nn.LeakyReLU(negative_slope=0.01),
    nn.Linear(256, 10),
)
```

The activation can also be used from torch.nn.functional:
```python
import torch
import torch.nn.functional as F

x = torch.randn(32, 128)
y = F.leaky_relu(x, negative_slope=0.01)
```

The module form is useful inside nn.Sequential. The functional form is useful inside custom forward methods.
Leaky ReLU and Dead Units
Leaky ReLU addresses the dying ReLU problem by allowing gradients in the negative region.
For ReLU:

$$\frac{d}{dx}\,\mathrm{ReLU}(x) = 0 \quad \text{for } x < 0.$$

For Leaky ReLU:

$$\frac{d}{dx}\,\mathrm{LeakyReLU}(x) = \alpha \quad \text{for } x < 0.$$
This means the parameters feeding into a negative activation can still change. The unit may later move back into the positive region if updates shift its pre-activation.
However, Leaky ReLU does not solve all training problems. Poor data scaling, unstable learning rates, and bad initialization can still damage training. The activation only removes one source of zero gradients.
Parametric ReLU
Parametric ReLU, or PReLU, makes the negative slope learnable.
Instead of choosing a fixed $\alpha$, the model learns it from data:

$$\mathrm{PReLU}(x) = \begin{cases} x & \text{if } x \ge 0 \\ a x & \text{if } x < 0 \end{cases}$$

where $a$ is a trainable parameter.
In PyTorch:
```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.0, 3.0])
act = nn.PReLU()
y = act(x)
print(y)
print(list(act.parameters()))
```

By default, nn.PReLU() creates one learnable slope shared across all channels. For convolutional networks, it can also learn one slope per channel:

```python
act = nn.PReLU(num_parameters=64)
```

This is often used when the activation follows a convolution layer with 64 output channels.
Learning the Negative Slope
For negative inputs, PReLU computes

$$f(x) = a x.$$

The derivative with respect to the input is

$$\frac{\partial f}{\partial x} = a.$$

The derivative with respect to the parameter is

$$\frac{\partial f}{\partial a} = x.$$
This means the model can adjust how much negative information to preserve.
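The derivative with respect to the slope can be checked with autograd; a small sketch using the default nn.PReLU (initial slope 0.25):

```python
import torch
import torch.nn as nn

act = nn.PReLU()  # one shared slope parameter, initialized to 0.25
x = torch.tensor([-2.0])
y = act(x)        # 0.25 * (-2.0) = -0.5
y.sum().backward()
# For a negative input, the gradient of the output w.r.t. the slope is x itself.
print(act.weight.grad)  # tensor([-2.])
```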
If the learned value of $a$ becomes close to 0, PReLU behaves like ReLU. If $a$ becomes a small positive number, it behaves like Leaky ReLU. If $a$ becomes larger, the activation becomes closer to a linear function on both sides.
PReLU and Regularization
PReLU adds parameters to the model. The number of added parameters is usually small, but the activation becomes more flexible.
This flexibility can help performance in some architectures, especially convolutional networks. It can also slightly increase overfitting risk if used without care.
One practical detail: weight decay should usually not be applied blindly to the PReLU slope parameter. Forcing the slope toward zero may unintentionally turn PReLU back into ReLU. In larger projects, optimizer parameter groups can separate ordinary weights from activation parameters.
Example:
```python
decay = []
no_decay = []
# Note: this matching assumes the PReLU modules are registered under
# attribute names containing "prelu" (e.g. self.prelu1 = nn.PReLU());
# checking module types via model.named_modules() is more robust.
for name, param in model.named_parameters():
    if "prelu" in name.lower():
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-4},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```

In real projects, grouping is usually done more systematically, but the principle is the same.
Comparison with ReLU
| Property | ReLU | Leaky ReLU | PReLU |
|---|---|---|---|
| Positive side | $x$ | $x$ | $x$ |
| Negative side | $0$ | $\alpha x$ | $a x$ |
| Negative gradient | 0 | $\alpha$ | $a$ |
| Learnable slope | No | No | Yes |
| Dead unit risk | Higher | Lower | Lower |
| Extra parameters | No | No | Yes |
ReLU is simpler and often sufficient. Leaky ReLU is a small change that improves gradient flow through negative inputs. PReLU gives the network control over the negative slope.
When to Use Each One
Use ReLU when you want a strong, simple default for feedforward or convolutional networks.
Use Leaky ReLU when many units become inactive, or when you want a conservative replacement for ReLU with almost no extra complexity.
Use PReLU when you are willing to add a small number of trainable parameters and want the model to learn the best negative slope.
In modern transformer models, GELU and SiLU are often preferred. In convolutional networks and discriminators in generative models, Leaky ReLU remains common.
Initialization for Leaky ReLU
For ReLU networks, He initialization uses a gain suited to the ReLU nonlinearity. For Leaky ReLU, the gain depends on the negative slope.
PyTorch supports this directly:
```python
layer = nn.Linear(128, 256)
nn.init.kaiming_normal_(
    layer.weight,
    a=0.01,
    nonlinearity="leaky_relu",
)
```

Here a=0.01 should match the negative_slope used in the activation.
For convolutional layers:
```python
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
nn.init.kaiming_normal_(
    conv.weight,
    a=0.01,
    nonlinearity="leaky_relu",
)
```

Matching initialization to the activation helps keep activation variance stable across layers.
Practical Example
The following model uses Leaky ReLU for image classification:
```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.LeakyReLU(0.01),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(0.01),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128),
            nn.LeakyReLU(0.01),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

model = SmallCNN()
x = torch.randn(16, 3, 32, 32)
logits = model(x)
print(logits.shape)
```

Output:

```
torch.Size([16, 10])
```

The model produces raw logits for 10 classes. The final layer has no activation because nn.CrossEntropyLoss expects logits.
Practical Guidance
Leaky ReLU and PReLU are small extensions of ReLU. Their purpose is to keep the negative side trainable.
The most important rule is simple: do not choose an activation in isolation. Match the activation with the initialization, normalization, optimizer, and architecture.
For ordinary MLPs and CNNs, ReLU is a strong baseline. If dead units or unstable training appear, try Leaky ReLU. If the architecture benefits from channel-specific flexibility, try PReLU.
For large language models and many transformer-style architectures, GELU or SiLU is usually a better starting point.
Exercises
Show that ReLU is a special case of Leaky ReLU with $\alpha = 0$.

Compute Leaky ReLU outputs for $x \in \{-5, -1, 0, 2\}$ with $\alpha = 0.01$, and check them against the PyTorch example above.
Explain why Leaky ReLU reduces the dying ReLU problem.
Implement the same MLP with ReLU, Leaky ReLU, and PReLU. Compare validation accuracy and the fraction of zero activations.
Modify the initialization of a Leaky ReLU network using nn.init.kaiming_normal_ with the correct negative slope.