Activation functions control both the forward signal and the backward signal. In the forward pass, they transform pre-activations into activations. In the backward pass, their derivatives decide how much gradient passes to earlier layers.
This second role is critical. A network can have a reasonable forward computation but still train poorly if gradients vanish, explode, or become blocked by saturated activations.
Forward Signal and Backward Signal
Consider one layer of a neural network:

$$a = f(z), \qquad z = W x + b$$

Here $z$ is the pre-activation, $f$ is the activation function, and $a$ is the activation passed to the next layer.
During backpropagation, the gradient through this activation is multiplied by the derivative of $f$:

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \odot f'(z)$$

The symbol $\odot$ denotes elementwise multiplication.
Thus, even if the upstream gradient is useful, the activation derivative may shrink or block it.
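This elementwise multiplication can be checked directly with autograd. The sketch below, using sigmoid as the activation, confirms that the gradient reaching the pre-activation equals the upstream gradient multiplied elementwise by $\sigma(z)\,(1 - \sigma(z))$:

```python
import torch

# The gradient that flows back through a sigmoid activation equals the
# upstream gradient multiplied elementwise by sigma(z) * (1 - sigma(z)).
z = torch.randn(5, requires_grad=True)
a = torch.sigmoid(z)

upstream = torch.ones_like(a)  # stand-in for the gradient from later layers
a.backward(upstream)

sig = torch.sigmoid(z).detach()
manual = upstream * sig * (1 - sig)
print(torch.allclose(z.grad, manual))  # → True
```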
What Saturation Means
An activation function saturates when large regions of its input produce nearly constant outputs.
For sigmoid,

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

When $z$ is very positive, $\sigma(z) \approx 1$. When $z$ is very negative, $\sigma(z) \approx 0$.

In both regions, the curve is nearly flat. A flat curve has a small derivative.
For tanh,

$$\tanh(z) \approx 1 \quad \text{for large positive } z$$

and

$$\tanh(z) \approx -1 \quad \text{for large negative } z$$

Again, the derivative becomes small.
Saturation is a forward-pass property, but its training effect appears in the backward pass.
Saturation and Small Derivatives
The derivative of sigmoid is

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$

Its largest value occurs at $z = 0$:

$$\sigma'(0) = 0.5 \times 0.5 = 0.25$$

As $z$ moves far from zero, the derivative approaches zero.
The derivative of tanh is

$$\tanh'(z) = 1 - \tanh^2(z)$$

Its largest value occurs at $z = 0$:

$$\tanh'(0) = 1$$

As $z$ becomes very positive or very negative, the derivative approaches zero.
This means both sigmoid and tanh pass gradients well only in a limited region near zero.
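A quick numeric check (a sketch using the derivative formulas above) shows how sharply both derivatives collapse away from the origin:

```python
import torch

# Evaluate the activation derivatives at z = -5, 0, and 5.
z = torch.tensor([-5.0, 0.0, 5.0])

sig = torch.sigmoid(z)
d_sig = sig * (1 - sig)           # sigma'(z) = sigma(z)(1 - sigma(z))
d_tanh = 1 - torch.tanh(z) ** 2   # tanh'(z) = 1 - tanh^2(z)

print(d_sig)   # peak 0.25 at z = 0, under 0.01 at |z| = 5
print(d_tanh)  # peak 1.0 at z = 0, under 0.001 at |z| = 5
```

Only a few units of input range separate the useful region from near-total gradient blockage.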
Vanishing Gradients
In a deep network, gradients are repeatedly multiplied by layer derivatives. If many of these derivatives are small, gradients shrink rapidly.
Suppose a gradient passes through ten sigmoid activations near their largest derivative. Even in this favorable case, each derivative is at most 0.25:

$$0.25^{10} \approx 9.5 \times 10^{-7}$$
In practice, many derivatives may be much smaller than 0.25, especially if the units are saturated.
This is the vanishing gradient problem. Earlier layers receive gradients so small that their parameters update very slowly.
Vanishing gradients are especially severe in:
- Deep sigmoid networks
- Deep tanh networks without normalization
- Long recurrent networks
- Poorly initialized networks
- Networks with badly scaled input data
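The effect is easy to observe in a toy model. The sketch below (an illustrative setup, not a recommended architecture) builds a 10-layer sigmoid MLP and compares gradient norms in the first and last linear layers:

```python
import torch
import torch.nn as nn

# A deliberately deep sigmoid MLP to expose vanishing gradients.
torch.manual_seed(0)
layers = []
for _ in range(10):
    layers += [nn.Linear(64, 64), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(64, 1))

x = torch.randn(32, 64)
loss = model(x).pow(2).mean()
loss.backward()

first = model[0].weight.grad.norm().item()   # earliest linear layer
last = model[-1].weight.grad.norm().item()   # final linear layer
print(f"first-layer grad norm: {first:.2e}")
print(f"last-layer  grad norm: {last:.2e}")
```

The first layer's gradient norm is typically orders of magnitude smaller than the last layer's, which is exactly the repeated-multiplication effect described above.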
Exploding Gradients
Gradients can also grow too large. This happens when the backward pass repeatedly multiplies by factors greater than 1.
For a deep network with weight matrices

$$W_1, W_2, \dots, W_L,$$

the gradient contains products of these matrices. If their spectral norms are large, the gradient magnitude can grow exponentially with depth.
Exploding gradients cause unstable training. The loss may become NaN, parameters may grow without bound, and optimization may fail.
Exploding gradients are especially common in recurrent networks because the same transition matrix is applied repeatedly across time.
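The recurrent case can be sketched by applying one fixed matrix repeatedly, as a transition matrix is applied across time steps. The scale factor 0.3 below is an artificial choice that puts the spectral norm well above 1:

```python
import torch

# Repeatedly applying the same matrix scales a vector by roughly its
# largest singular value at every step.
torch.manual_seed(0)
W = torch.randn(32, 32) * 0.3   # spectral norm comfortably above 1
v = torch.randn(32)

norms = []
for _ in range(20):
    v = W @ v
    norms.append(v.norm().item())

print(norms[0], norms[-1])  # the norm grows by orders of magnitude
```

Replacing 0.3 with a much smaller scale produces the mirror-image failure: the norm shrinks toward zero instead.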
ReLU and Gradient Flow
ReLU reduces saturation in the positive region:

$$\mathrm{ReLU}(z) = \max(0, z)$$

For $z > 0$,

$$\frac{d}{dz}\,\mathrm{ReLU}(z) = 1$$

Thus, positive ReLU units pass gradients without shrinkage through the activation function itself.

For $z < 0$,

$$\frac{d}{dz}\,\mathrm{ReLU}(z) = 0$$

Negative ReLU units block gradients entirely.
This creates a tradeoff. ReLU improves gradient flow for active units, but inactive units receive no gradient through the activation.
Dead Units
A ReLU unit is dead when it outputs zero for almost all inputs.
If

$$z \leq 0$$

for nearly every training example, then

$$\mathrm{ReLU}(z) = 0$$

and

$$\frac{d}{dz}\,\mathrm{ReLU}(z) = 0.$$
The unit contributes nothing to the forward pass and receives no useful gradient through the activation. It may remain inactive for the rest of training.
Dead units can result from:
- Large learning rates
- Poor initialization
- Strong negative biases
- Bad input scaling
- Highly imbalanced activations
Leaky ReLU, PReLU, ELU, GELU, and SiLU reduce this problem by allowing some gradient or some signal in the negative region.
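One of these causes, a strong negative bias, is easy to demonstrate. In this sketch the bias value of -5 is an artificial choice that pushes nearly all pre-activations below zero:

```python
import torch
import torch.nn as nn

# A strongly negative bias makes ReLU units output zero for almost all inputs.
torch.manual_seed(0)
layer = nn.Linear(128, 256)
nn.init.constant_(layer.bias, -5.0)   # push pre-activations far negative

x = torch.randn(1024, 128)
a = torch.relu(layer(x))

dead_fraction = (a == 0).float().mean().item()
print(f"fraction of zero activations: {dead_fraction:.3f}")
```

With standard-normal inputs, essentially every unit is inactive, so no gradient reaches this layer's weights through the activation.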
Activation Distributions
A good activation distribution is neither collapsed nor explosive.
If most activations are near zero, the network may underuse its capacity. If most activations are saturated, gradients may vanish. If activations grow too large, later layers may become unstable.
During training, it is useful to inspect activation statistics:
```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(128, 256),
            nn.Tanh(),
            nn.Linear(256, 256),
            nn.Tanh(),
            nn.Linear(256, 10),
        ])

    def forward(self, x):
        stats = []
        for layer in self.layers:
            x = layer(x)
            if isinstance(layer, nn.Tanh):
                stats.append({
                    "mean": x.mean().item(),
                    "std": x.std().item(),
                    "min": x.min().item(),
                    "max": x.max().item(),
                })
        return x, stats

model = MLP()
x = torch.randn(64, 128)
logits, stats = model(x)
print(stats)
```

For tanh, values near -1 or 1 indicate saturation. For ReLU, a high fraction of exact zeros may indicate many inactive units.
Gradient Statistics
Gradient flow can also be monitored directly.
After calling loss.backward(), each parameter has a gradient tensor. We can inspect the gradient norm:
```python
total_norm = 0.0
for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.norm().item()
        print(name, grad_norm)
        total_norm += grad_norm ** 2
total_norm = total_norm ** 0.5
print("total grad norm:", total_norm)
```

Small gradient norms in early layers may indicate vanishing gradients. Very large norms may indicate exploding gradients.
Gradient statistics are not a complete diagnosis, but they are often the first useful signal.
Initialization and Gradient Flow
Initialization controls the scale of pre-activations at the start of training.
If weights are initialized too large, pre-activations may fall into saturated regions. If weights are initialized too small, signals may shrink layer by layer.
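The shrinkage from too-small weights is visible after just a few layers. In this sketch the scale of 0.01 is an artificial choice far below the Xavier/He range:

```python
import torch

# With weights drawn too small, the activation standard deviation
# shrinks layer by layer until the signal nearly disappears.
torch.manual_seed(0)
x = torch.randn(64, 256)
for i in range(5):
    w = torch.randn(256, 256) * 0.01   # far below Xavier/He scale
    x = torch.tanh(x @ w)
    print(f"layer {i}: activation std = {x.std().item():.5f}")
```

Each layer multiplies the signal scale by roughly the same small factor, so the standard deviation decays geometrically with depth.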
For tanh and sigmoid networks, Xavier initialization is often used:

$$W \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n_\text{in} + n_\text{out}}},\; \sqrt{\frac{6}{n_\text{in} + n_\text{out}}}\right)$$

For ReLU-family networks, He initialization is often used:

$$W \sim \mathcal{N}\!\left(0,\; \frac{2}{n_\text{in}}\right)$$
In PyTorch:
```python
layer = nn.Linear(128, 256)
nn.init.xavier_uniform_(layer.weight)   # often used with tanh
nn.init.kaiming_normal_(layer.weight)   # often used with ReLU
```

The initialization should match the activation function.
Normalization and Gradient Flow
Normalization layers stabilize activation distributions.
Batch normalization normalizes activations using batch statistics. Layer normalization normalizes activations using features within each example.
For a layer activation $a$, normalization roughly computes

$$\hat{a} = \frac{a - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

Then it applies learned scale and shift parameters:

$$y = \gamma \hat{a} + \beta$$
Normalization helps keep activations in useful ranges. This reduces saturation and stabilizes gradients.
In PyTorch:
```python
block = nn.Sequential(
    nn.Linear(128, 256),
    nn.LayerNorm(256),
    nn.GELU(),
)
```

Transformers commonly use layer normalization. CNNs often use batch normalization.
Residual Connections
Residual connections provide direct gradient paths through deep networks.
A residual block computes

$$y = x + F(x)$$

During backpropagation, the gradient can flow through the identity path from $y$ to $x$. This reduces the dependence on the derivative of every intermediate activation.
Residual connections were essential for training very deep convolutional networks and are also central to transformers.
In PyTorch:
```python
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.net(x)
```

The identity path helps preserve both forward information and backward gradients.
Gradient Clipping
Gradient clipping prevents exploding gradients by limiting gradient norm.
A common method clips the global norm:
```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```

This operation rescales gradients when their norm exceeds the threshold. It is especially useful in recurrent networks, sequence models, and large transformer training.
Clipping does not fix vanishing gradients, but it prevents unstable updates when gradients become too large.
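The rescaling behavior can be verified directly. In this sketch the input scale of 100 is an artificial choice made to produce an oversized gradient:

```python
import torch
import torch.nn as nn

# clip_grad_norm_ rescales gradients in place when the global norm exceeds
# max_norm, and returns the norm measured before clipping.
torch.manual_seed(0)
model = nn.Linear(10, 10)
loss = model(torch.randn(4, 10) * 100).pow(2).mean()  # exaggerated scale
loss.backward()

before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
after = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
print(before.item(), after.item())  # after is at most 1.0
```

Note that the update direction is preserved; only the magnitude is capped.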
Choosing Activations for Gradient Flow
Activation choice affects gradient propagation.
| Activation | Gradient behavior |
|---|---|
| Sigmoid | Strong saturation; gradients often vanish |
| Tanh | Saturates, but zero-centered |
| ReLU | Good positive-side gradient; zero negative-side gradient |
| Leaky ReLU | Nonzero negative-side gradient |
| ELU | Smooth negative side, but can saturate |
| GELU | Smooth soft gating, good for transformers |
| SiLU/Swish | Smooth gating, good for large models and modern CNNs |
The best activation depends on architecture and scale. For ordinary deep MLPs and CNNs, ReLU-family activations are strong defaults. For transformers, GELU and SiLU are common. For recurrent gates, sigmoid and tanh remain useful.
Practical Debugging Checklist
When training fails, inspect gradient flow before changing the whole model.
Check whether the loss is NaN or infinite. If so, reduce the learning rate, inspect input scaling, and check for unstable operations.
Check activation statistics. If sigmoid or tanh outputs are near their boundaries, the units are saturated. If ReLU outputs are mostly zero, many units may be inactive.
Check gradient norms by layer. If early layers have near-zero gradients, gradients may be vanishing. If gradients are extremely large, apply clipping and reduce the learning rate.
Check initialization. Use Xavier initialization for tanh-like activations and He initialization for ReLU-like activations.
Check normalization. Add batch normalization, layer normalization, or residual connections when training deeper networks.
PyTorch Example: Comparing Activations
The following example measures activation statistics for different activations:
```python
import torch
import torch.nn as nn

def make_model(activation):
    return nn.Sequential(
        nn.Linear(128, 256),
        activation(),
        nn.Linear(256, 256),
        activation(),
        nn.Linear(256, 10),
    )

activations = {
    "sigmoid": nn.Sigmoid,
    "tanh": nn.Tanh,
    "relu": nn.ReLU,
    "gelu": nn.GELU,
    "silu": nn.SiLU,
}

x = torch.randn(64, 128)
for name, act in activations.items():
    model = make_model(act)
    y = x
    print(name)
    for layer in model:
        y = layer(y)
        if isinstance(layer, (nn.Sigmoid, nn.Tanh, nn.ReLU, nn.GELU, nn.SiLU)):
            print({
                "mean": round(y.mean().item(), 4),
                "std": round(y.std().item(), 4),
                "min": round(y.min().item(), 4),
                "max": round(y.max().item(), 4),
            })
```

This kind of simple diagnostic often reveals activation collapse or unstable scaling before the issue becomes difficult to debug.
Summary
Saturation occurs when an activation function becomes nearly flat. Flat regions have small derivatives, so gradients shrink during backpropagation.
Sigmoid and tanh saturate for large positive and negative inputs. ReLU avoids positive-side saturation but blocks gradients for negative inputs. Smooth activations such as GELU and SiLU preserve more information and often improve gradient flow in large models.
Stable training depends on several connected choices: activation function, initialization, normalization, residual connections, learning rate, and gradient clipping. Activation functions should be chosen as part of the full training system, not as isolated formulas.
Exercises
1. Explain why sigmoid derivatives become small when the input has large magnitude.
2. Show how repeated multiplication by derivatives less than 1 causes vanishing gradients.
3. Compare activation statistics for sigmoid, tanh, ReLU, GELU, and SiLU in a 10-layer MLP.
4. Add gradient norm logging to a PyTorch training loop.
5. Train two deep MLPs, one with residual connections and one without. Compare gradient norms in early layers.