Activation functions give neural networks their nonlinear structure. Without nonlinear activation functions, a feedforward network made from many linear layers would still compute only a linear transformation. Depth would add parameters, but it would not add expressive power.
Two classical activation functions are the logistic sigmoid and the hyperbolic tangent. They played a central role in early neural networks and remain useful for gates, probabilities, and bounded outputs.
The Logistic Sigmoid Function
The logistic sigmoid function maps a real number to a value between 0 and 1:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Its output range is the open interval $(0, 1)$. The function is smooth, monotone increasing, and bounded. Large positive inputs are mapped close to 1. Large negative inputs are mapped close to 0. The value at zero is $\sigma(0) = 0.5$.
In PyTorch:

```python
import torch

x = torch.tensor([-2.0, 0.0, 2.0])
y = torch.sigmoid(x)
print(y)
```

The result is approximately

```
tensor([0.1192, 0.5000, 0.8808])
```

The sigmoid function is often used when a model needs to represent a probability for a binary event. For example, in binary classification, a model may output a logit $z$. Applying the sigmoid gives

$$p = \sigma(z),$$

where $p$ can be interpreted as the model's estimated probability that the label is 1.
Sigmoid as a Smooth Threshold
The sigmoid function behaves like a smooth version of a step function. When $x$ is very negative, the output is close to 0. When $x$ is very positive, the output is close to 1. Near zero, the transition is smooth.
This makes sigmoid useful when we want a differentiable gate. A hard threshold would be difficult to train with gradient descent because its derivative is zero almost everywhere and undefined at the threshold. The sigmoid gives a soft threshold with useful derivatives near the transition region.
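A tiny PyTorch sketch makes the contrast concrete: autograd produces a useful gradient through the sigmoid, but not through a hard comparison.

```python
import torch

# A hard threshold gives no useful gradient, while a sigmoid does.
x = torch.tensor(0.5, requires_grad=True)

hard = (x > 0).float()   # step function: flat almost everywhere, no gradient flows
soft = torch.sigmoid(x)  # smooth threshold with a nonzero derivative

soft.backward()
print(hard.requires_grad)  # False: the comparison cut the computation graph
print(x.grad)              # nonzero, so gradient descent can adjust x
```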
For example, a simple gate may be written as

$$g = \sigma(s),$$

where $s$ is a learned score. The gate can then control how much information passes through:

$$h' = g \cdot h.$$

If $g \approx 0$, little information passes. If $g \approx 1$, most information passes.
This idea appears in gated recurrent units, LSTMs, attention mechanisms, and many neural architectures that must learn when to keep, erase, or combine information.
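The gating idea can be sketched in a few lines of PyTorch; the scores and feature values below are invented for illustration:

```python
import torch

# Sketch of a sigmoid gate; scores and features are made up for this example.
scores = torch.tensor([-4.0, 0.0, 4.0])    # learned scores s
gate = torch.sigmoid(scores)               # g = sigmoid(s), each entry in (0, 1)
features = torch.tensor([10.0, 10.0, 10.0])
gated = gate * features                    # the gate scales how much passes through

print(gate)   # roughly [0.018, 0.500, 0.982]
print(gated)  # roughly [0.18, 5.00, 9.82]
```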
Derivative of the Sigmoid
The sigmoid function has a convenient derivative:

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x)).$$

This identity is important because it lets us express the derivative using the output of the sigmoid itself.

Let

$$y = \sigma(x).$$

Then

$$\frac{dy}{dx} = y\,(1 - y).$$

The derivative is largest at $x = 0$, where $\sigma(0) = 0.5$. Thus

$$\sigma'(0) = 0.5 \times 0.5 = 0.25.$$

As $x \to +\infty$, $\sigma(x) \to 1$, so the derivative approaches 0. As $x \to -\infty$, $\sigma(x) \to 0$, so the derivative also approaches 0.
This means the sigmoid is most sensitive near zero and nearly flat far from zero.
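The identity can be checked numerically with autograd; this small sketch compares PyTorch's gradient against $\sigma(x)(1 - \sigma(x))$:

```python
import torch

# Check the identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) with autograd.
x = torch.tensor([-2.0, 0.0, 2.0], requires_grad=True)
y = torch.sigmoid(x)
y.sum().backward()  # d/dx of sum(sigmoid(x)) gives elementwise derivatives

analytic = y.detach() * (1 - y.detach())
print(x.grad)                            # autograd derivative
print(analytic)                          # sigma * (1 - sigma)
print(torch.allclose(x.grad, analytic))  # the two agree; middle entry is 0.25
```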
Saturation
An activation function saturates when large input magnitudes produce outputs in nearly flat regions. The sigmoid saturates on both ends.
For large positive $x$,

$$\sigma(x) \approx 1 \quad \text{and} \quad \sigma'(x) \approx 0.$$

For large negative $x$,

$$\sigma(x) \approx 0 \quad \text{and} \quad \sigma'(x) \approx 0.$$
Saturation causes small gradients. During backpropagation, gradients are multiplied by activation derivatives. If many sigmoid units operate in saturated regions, the gradients passed to earlier layers can become very small.
This is one reason deep sigmoid networks can be difficult to train. The problem becomes more severe as depth increases because many small derivative factors are multiplied together.
Sigmoid and the Vanishing Gradient Problem
Consider a deep network with several sigmoid activations. During backpropagation, the gradient flowing through one sigmoid unit is multiplied by at most 0.25, because

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \le \frac{1}{4}.$$

If a gradient passes through many sigmoid layers, repeated multiplication by values less than 1 can rapidly shrink it. A rough illustration is

$$0.25^{10} \approx 9.5 \times 10^{-7}.$$
This is the vanishing gradient problem. Early layers learn slowly because their gradients become tiny.
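A minimal sketch of this effect chains ten sigmoids and inspects the resulting gradient:

```python
import torch

# Illustration: the gradient through a chain of sigmoids shrinks with depth.
x = torch.tensor(0.0, requires_grad=True)
h = x
for _ in range(10):
    h = torch.sigmoid(h)  # each step multiplies the gradient by sigma' <= 0.25
h.backward()

print(x.grad)  # far smaller than 1 after only 10 sigmoids
```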
The practical consequence is that deep networks with sigmoid hidden activations often train poorly unless additional techniques are used, such as careful initialization, normalization, residual connections, or different activation functions.
For this reason, sigmoid is rarely used as the main hidden-layer activation in modern deep feedforward networks. ReLU, GELU, and related functions are more common.
Sigmoid in Binary Classification
Although sigmoid is less common inside hidden layers, it remains important at output layers for binary classification.
Suppose a model produces a scalar logit $z$. The predicted probability is

$$p = \sigma(z).$$

For a target label $y \in \{0, 1\}$, the binary cross-entropy loss is

$$L = -\left[\,y \log p + (1 - y)\log(1 - p)\,\right].$$
In PyTorch, one should usually avoid applying torch.sigmoid followed by torch.nn.BCELoss directly. The numerically stable choice is torch.nn.BCEWithLogitsLoss, which combines sigmoid and binary cross-entropy in one operation.
Example:

```python
import torch
import torch.nn as nn

logits = torch.tensor([0.2, -1.3, 2.4])
targets = torch.tensor([1.0, 0.0, 1.0])

loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits, targets)
print(loss)
```

This function expects raw logits, not probabilities. Internally, it applies a stable form of the sigmoid cross-entropy calculation.
The Hyperbolic Tangent Function
The hyperbolic tangent function, written $\tanh$, maps real numbers to values between -1 and 1:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Its output range is $(-1, 1)$. The value at zero is $\tanh(0) = 0$.
In PyTorch:

```python
import torch

x = torch.tensor([-2.0, 0.0, 2.0])
y = torch.tanh(x)
print(y)
```

The result is approximately

```
tensor([-0.9640, 0.0000, 0.9640])
```

The hyperbolic tangent is also smooth, monotone increasing, and bounded. It resembles the sigmoid function, but its outputs are centered around zero.
Relationship Between Sigmoid and Tanh
The sigmoid and hyperbolic tangent functions are closely related:

$$\tanh(x) = 2\,\sigma(2x) - 1.$$

Equivalently,

$$\sigma(x) = \frac{1 + \tanh(x/2)}{2}.$$

This relationship shows that tanh is essentially a rescaled and shifted sigmoid. The sigmoid maps values into $(0, 1)$, while tanh maps values into $(-1, 1)$.
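The first identity is easy to verify numerically:

```python
import torch

# Numerical check of the identity tanh(x) = 2 * sigmoid(2x) - 1.
x = torch.linspace(-5.0, 5.0, steps=11)
lhs = torch.tanh(x)
rhs = 2 * torch.sigmoid(2 * x) - 1

print(torch.allclose(lhs, rhs, atol=1e-6))  # True
```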
The zero-centered output of tanh is often preferable in hidden layers. When activations are centered near zero, optimization can behave better because the inputs to the next layer contain both positive and negative values.
Derivative of Tanh
The derivative of tanh is

$$\tanh'(x) = 1 - \tanh^2(x).$$

If

$$y = \tanh(x),$$

then

$$\frac{dy}{dx} = 1 - y^2.$$

The derivative is largest at zero:

$$\tanh'(0) = 1.$$

As $x \to +\infty$, $\tanh(x) \to 1$, so the derivative approaches 0. As $x \to -\infty$, $\tanh(x) \to -1$, so the derivative also approaches 0.
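As with sigmoid, the derivative identity can be checked against autograd:

```python
import torch

# Check tanh'(x) = 1 - tanh(x)^2 with autograd.
x = torch.tensor([-2.0, 0.0, 2.0], requires_grad=True)
y = torch.tanh(x)
y.sum().backward()

analytic = 1 - y.detach() ** 2
print(x.grad)                            # the entry at x = 0 equals 1, the maximum
print(torch.allclose(x.grad, analytic))  # True
```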
Like sigmoid, tanh saturates for large positive and negative inputs. However, its maximum derivative is larger (1 at zero, versus 0.25 for sigmoid), and its outputs are zero-centered.
Tanh as a Hidden Activation
Before ReLU became dominant, tanh was widely used as a hidden-layer activation. It usually trains better than sigmoid in hidden layers because its output is centered around zero.
Consider a hidden layer

$$h = \tanh(Wx + b).$$

The components of $h$ lie between -1 and 1. This bounded range can help control activation magnitude. But tanh still suffers from saturation. If the pre-activation $Wx + b$ has large magnitude, the output becomes close to either -1 or 1, and the derivative becomes small.
In PyTorch:

```python
import torch
import torch.nn as nn

layer = nn.Sequential(
    nn.Linear(10, 32),
    nn.Tanh(),
    nn.Linear(32, 1)
)

x = torch.randn(8, 10)
y = layer(x)
print(y.shape)
```

This model is valid, but for many modern feedforward networks, nn.ReLU, nn.GELU, or related activations would usually be preferred.
Tanh in Recurrent Networks
Tanh remains important in recurrent architectures. In a simple recurrent neural network, the hidden state is often updated by

$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b).$$

Here $x_t$ is the input at time $t$, $h_{t-1}$ is the previous hidden state, and $h_t$ is the new hidden state.
The tanh function keeps the hidden state bounded between -1 and 1. This can reduce uncontrolled growth in recurrent dynamics.
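A hand-written version of this update (with small random weights; the names W_x and W_h are chosen for the sketch) shows the hidden state staying bounded:

```python
import torch

# Hand-written tanh recurrence; weights are random stand-ins, not learned values.
torch.manual_seed(0)
input_size, hidden_size = 4, 3
W_x = torch.randn(hidden_size, input_size) * 0.1
W_h = torch.randn(hidden_size, hidden_size) * 0.1
b = torch.zeros(hidden_size)

h = torch.zeros(hidden_size)                 # initial hidden state h_0
for t in range(5):                           # five steps of random input
    x_t = torch.randn(input_size)
    h = torch.tanh(W_x @ x_t + W_h @ h + b)  # h_t = tanh(W_x x_t + W_h h_{t-1} + b)

print(h)                                     # every entry lies strictly in (-1, 1)
```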
LSTMs also use tanh for candidate cell values and output transformations, while sigmoid functions are used for gates. A simplified LSTM gate structure contains expressions such as

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \qquad i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \qquad \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c).$$
The sigmoid gates decide how much information to forget, write, or expose. The tanh activation proposes signed content values to store in the cell state.
Thus, sigmoid and tanh often work together: sigmoid controls flow, tanh represents bounded signed information.
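This division of labor can be sketched for a single cell update; the gate pre-activations below are random stand-ins rather than learned linear functions of the input and previous state:

```python
import torch

# One LSTM-style cell update; pre-activations are random stand-ins here,
# and the names f, i, o, g follow the usual forget/input/output/candidate roles.
torch.manual_seed(0)
hidden = 3

f = torch.sigmoid(torch.randn(hidden))  # forget gate, entries in (0, 1)
i = torch.sigmoid(torch.randn(hidden))  # input (write) gate, entries in (0, 1)
o = torch.sigmoid(torch.randn(hidden))  # output (expose) gate, entries in (0, 1)
g = torch.tanh(torch.randn(hidden))     # signed candidate content in (-1, 1)

c_prev = torch.zeros(hidden)            # previous cell state
c = f * c_prev + i * g                  # sigmoid gates control what is kept and written
h = o * torch.tanh(c)                   # tanh bounds the exposed hidden state

print(h)                                # bounded, signed values
```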
Comparison of Sigmoid and Tanh
| Property | Sigmoid | Tanh |
|---|---|---|
| Formula | $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ | $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ |
| Output range | $(0, 1)$ | $(-1, 1)$ |
| Value at zero | $0.5$ | $0$ |
| Zero-centered | No | Yes |
| Maximum derivative | $0.25$ | $1$ |
| Saturates | Yes | Yes |
| Common use | Binary probabilities, gates | Recurrent states, bounded hidden values |
Tanh is usually better than sigmoid for hidden activations because it is zero-centered and has a larger derivative near zero. Sigmoid is usually better when the output must behave like a probability or gate.
Numerical Issues
Both sigmoid and tanh involve exponentials. Very large positive or negative inputs can cause numerical overflow or underflow if implemented naively.
For example, directly computing

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

for very negative $x$ may involve $e^{-x}$, which can overflow floating-point limits.
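A quick demonstration in float32, with an input chosen to force the overflow:

```python
import torch

# Naively evaluating 1 / (1 + exp(-x)) for very negative x overflows in float32.
x = torch.tensor([-100.0])

print(torch.exp(-x))            # inf: e^100 exceeds the float32 maximum
print(1 / (1 + torch.exp(-x)))  # the inf propagates and the result collapses to 0
print(torch.sigmoid(x))         # the built-in kernel avoids the overflowing intermediate
```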
PyTorch implements these functions using numerically stable kernels, so ordinary use of torch.sigmoid and torch.tanh is safe. The larger practical concern is gradient saturation, not direct numerical failure.
For losses involving sigmoid, use stable combined loss functions when available:

```python
loss_fn = torch.nn.BCEWithLogitsLoss()
```

instead of manually writing:

```python
prob = torch.sigmoid(logits)
loss = torch.nn.functional.binary_cross_entropy(prob, targets)
```

The combined function avoids unstable intermediate probabilities near 0 or 1.
Practical Guidance
Use sigmoid when the model needs an output in $(0, 1)$, especially for binary classification, multi-label classification, and gates.

Use tanh when the model needs a bounded signed output in $(-1, 1)$, especially in recurrent networks or when representing normalized continuous values.
Avoid sigmoid and tanh as default hidden-layer activations in deep feedforward networks. Their saturation can slow learning. Modern networks usually prefer nonsaturating or weakly saturating activations such as ReLU, Leaky ReLU, GELU, and SiLU.
The main distinction is functional. Sigmoid is a probability-like squashing function. Tanh is a zero-centered squashing function. Both are differentiable, bounded, and historically important, but both can produce vanishing gradients in deep networks.
Exercises
- Compute $\sigma(-2)$, $\sigma(0)$, and $\sigma(2)$. Explain why the outputs are asymmetric around zero.
- Show that $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$.
- Show that $\tanh(x) = 2\,\sigma(2x) - 1$.
- Explain why sigmoid hidden layers can suffer from vanishing gradients.
- Implement a two-layer neural network in PyTorch using `nn.Tanh`. Replace `nn.Tanh` with `nn.ReLU` and compare training behavior on the same dataset.