Sigmoid and Hyperbolic Tangent

Activation functions give neural networks their nonlinearity. Without nonlinear activation functions, a feedforward network built from many linear layers would still compute only a single linear transformation. Depth would add parameters, but it would not add expressive power.
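
To see this collapse concretely, the sketch below composes two bias-free linear layers and compares them with the single linear map obtained by multiplying their weight matrices. The layer sizes and seed are arbitrary; this is only an illustration.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Two linear layers with no activation between them.
f1 = nn.Linear(4, 8, bias=False)
f2 = nn.Linear(8, 3, bias=False)

x = torch.randn(5, 4)

# Their composition equals one linear layer whose weight is W2 @ W1.
stacked = f2(f1(x))
single = x @ (f2.weight @ f1.weight).T

print(torch.allclose(stacked, single, atol=1e-6))  # True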

Two classical activation functions are the logistic sigmoid and the hyperbolic tangent. They played a central role in early neural networks and remain useful for gates, probabilities, and bounded outputs.

The Logistic Sigmoid Function

The logistic sigmoid function maps a real number to a value between 0 and 1:

\sigma(x)=\frac{1}{1+e^{-x}}.

Its output range is

0 < \sigma(x) < 1.

The function is smooth, monotone increasing, and bounded. Large positive inputs are mapped close to 1. Large negative inputs are mapped close to 0. The value at zero is

\sigma(0)=\frac{1}{2}.

In PyTorch:

import torch

x = torch.tensor([-2.0, 0.0, 2.0])
y = torch.sigmoid(x)

print(y)

The result is approximately

tensor([0.1192, 0.5000, 0.8808])

The sigmoid function is often used when a model needs to represent a probability for a binary event. For example, in binary classification, a model may output a logit z\in\mathbb{R}. Applying the sigmoid gives

p = \sigma(z),

where p can be interpreted as the model’s estimated probability that the label is 1.

Sigmoid as a Smooth Threshold

The sigmoid function behaves like a smooth version of a step function. When x is very negative, the output is close to 0. When x is very positive, the output is close to 1. Near zero, the transition is smooth.

This makes sigmoid useful when we want a differentiable gate. A hard threshold would be difficult to train with gradient descent because its derivative is zero almost everywhere and undefined at the threshold. The sigmoid gives a soft threshold with useful derivatives near the transition region.

For example, a simple gate may be written as

g=\sigma(a),

where a is a learned score. The gate g can then control how much information passes through:

y = g x.

If g\approx 0, little information passes. If g\approx 1, most information passes.
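
A tiny numerical sketch of this gating pattern, with the scores fixed by hand rather than learned:

import torch

x = torch.tensor([3.0, -1.0, 2.0])   # values to be gated
a = torch.tensor([-4.0, 0.0, 4.0])   # scores; in a real model these are learned

g = torch.sigmoid(a)                  # gate values in (0, 1)
y = g * x                             # elementwise gating

print(g)  # roughly [0.018, 0.500, 0.982]
print(y)  # the first entry is almost blocked, the last passes nearly unchanged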

This idea appears in gated recurrent units, LSTMs, attention mechanisms, and many neural architectures that must learn when to keep, erase, or combine information.

Derivative of the Sigmoid

The sigmoid function has a convenient derivative:

\sigma'(x)=\sigma(x)(1-\sigma(x)).

This identity is important because it lets us express the derivative using the output of the sigmoid itself.

Let

y=\sigma(x).

Then

\frac{dy}{dx}=y(1-y).

The derivative is largest at x=0, where y=0.5. Thus

\sigma'(0)=0.5(1-0.5)=0.25.

As x\to\infty, \sigma(x)\to 1, so the derivative approaches 0. As x\to-\infty, \sigma(x)\to 0, so the derivative also approaches 0.

This means the sigmoid is most sensitive near zero and nearly flat far from zero.
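
The identity is easy to verify with autograd. The sketch below compares the gradient PyTorch computes for the sigmoid with y(1 - y) at a few points.

import torch

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0], requires_grad=True)
y = torch.sigmoid(x)

# Differentiate the sum so each element of x.grad holds sigma'(x_i).
y.sum().backward()

print(x.grad)
print(torch.allclose(x.grad, (y * (1 - y)).detach()))  # True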

Saturation

An activation function saturates when large input magnitudes produce outputs in nearly flat regions. The sigmoid saturates on both ends.

For large positive x,

\sigma(x)\approx 1, \quad \sigma'(x)\approx 0.

For large negative x,

\sigma(x)\approx 0, \quad \sigma'(x)\approx 0.

Saturation causes small gradients. During backpropagation, gradients are multiplied by activation derivatives. If many sigmoid units operate in saturated regions, the gradients passed to earlier layers can become very small.

This is one reason deep sigmoid networks can be difficult to train. The problem becomes more severe as depth increases because many small derivative factors are multiplied together.

Sigmoid and the Vanishing Gradient Problem

Consider a deep network with several sigmoid activations. During backpropagation, the gradient flowing through one sigmoid unit is multiplied by at most 0.25, because

0 < \sigma'(x) \leq 0.25.

If a gradient passes through many sigmoid layers, repeated multiplication by values less than 1 can rapidly shrink it. A rough illustration is

0.25^{10} \approx 9.5\times 10^{-7}.

This is the vanishing gradient problem. Early layers learn slowly because their gradients become tiny.
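
The effect is easy to reproduce by chaining sigmoids and inspecting the gradient at the input. This toy sketch stacks ten of them.

import torch

x = torch.randn(1, requires_grad=True)

h = x
for _ in range(10):
    h = torch.sigmoid(h)

h.backward()

# The gradient is a product of ten derivatives, each at most 0.25,
# so it comes out extremely small.
print(x.grad)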

The practical consequence is that deep networks with sigmoid hidden activations often train poorly unless additional techniques are used, such as careful initialization, normalization, residual connections, or different activation functions.

For this reason, sigmoid is rarely used as the main hidden-layer activation in modern deep feedforward networks. ReLU, GELU, and related functions are more common.

Sigmoid in Binary Classification

Although sigmoid is less common inside hidden layers, it remains important at output layers for binary classification.

Suppose a model produces a scalar logit z. The predicted probability is

\hat{y}=\sigma(z).

For a target label y\in\{0,1\}, the binary cross-entropy loss is

L(y,\hat{y}) = -\left[ y\log \hat{y} + (1-y)\log(1-\hat{y}) \right].

In PyTorch, one should usually avoid applying torch.sigmoid followed by torch.nn.BCELoss directly. The numerically stable choice is torch.nn.BCEWithLogitsLoss, which combines sigmoid and binary cross-entropy in one operation.

Example:

import torch
import torch.nn as nn

logits = torch.tensor([0.2, -1.3, 2.4])
targets = torch.tensor([1.0, 0.0, 1.0])

loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits, targets)

print(loss)

This function expects raw logits, not probabilities. Internally, it applies a stable form of the sigmoid cross-entropy calculation.

The Hyperbolic Tangent Function

The hyperbolic tangent function, written \tanh, maps real numbers to values between -1 and 1:

\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}.

Its output range is

-1 < \tanh(x) < 1.

The value at zero is

\tanh(0)=0.

In PyTorch:

import torch

x = torch.tensor([-2.0, 0.0, 2.0])
y = torch.tanh(x)

print(y)

The result is approximately

tensor([-0.9640,  0.0000,  0.9640])

The hyperbolic tangent is also smooth, monotone increasing, and bounded. It resembles the sigmoid function, but its outputs are centered around zero.

Relationship Between Sigmoid and Tanh

The sigmoid and hyperbolic tangent functions are closely related:

\tanh(x)=2\sigma(2x)-1.

Equivalently,

\sigma(x)=\frac{1+\tanh(x/2)}{2}.

This relationship shows that tanh is essentially a rescaled and shifted sigmoid. The sigmoid maps values into (0,1), while tanh maps values into (-1,1).
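
The identity is straightforward to check numerically:

import torch

x = torch.linspace(-5.0, 5.0, steps=11)

lhs = torch.tanh(x)
rhs = 2 * torch.sigmoid(2 * x) - 1

print(torch.allclose(lhs, rhs, atol=1e-6))  # True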

The zero-centered output of tanh is often preferable in hidden layers. When activations are centered near zero, optimization can behave better because the inputs to the next layer contain both positive and negative values.

Derivative of Tanh

The derivative of tanh is

\frac{d}{dx}\tanh(x)=1-\tanh^2(x).

If

y=\tanh(x),

then

\frac{dy}{dx}=1-y^2.

The derivative is largest at zero:

\tanh'(0)=1.

As x\to\infty, \tanh(x)\to 1, so the derivative approaches 0. As x\to-\infty, \tanh(x)\to -1, so the derivative also approaches 0.

Like sigmoid, tanh saturates for large positive and negative inputs. However, its derivative can be larger near zero, and its outputs are zero-centered.
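
As with the sigmoid, both the derivative identity and the saturation behavior can be confirmed with autograd:

import torch

x = torch.tensor([-5.0, -1.0, 0.0, 1.0, 5.0], requires_grad=True)
y = torch.tanh(x)
y.sum().backward()

# The gradient matches 1 - tanh^2(x): close to 1 near zero, close to 0 far from zero.
print(x.grad)
print(1 - y.detach() ** 2)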

Tanh as a Hidden Activation

Before ReLU became dominant, tanh was widely used as a hidden-layer activation. It usually trains better than sigmoid in hidden layers because its output is centered around zero.

Consider a hidden layer

h = \tanh(Wx+b).

The components of h lie between -1 and 1. This bounded range can help control activation magnitude. But tanh still suffers from saturation. If the pre-activation Wx+b has large magnitude, the output becomes close to either -1 or 1, and the derivative becomes small.

In PyTorch:

import torch
import torch.nn as nn

layer = nn.Sequential(
    nn.Linear(10, 32),
    nn.Tanh(),
    nn.Linear(32, 1)
)

x = torch.randn(8, 10)
y = layer(x)

print(y.shape)

This model is valid, but for many modern feedforward networks, nn.ReLU, nn.GELU, or related activations would usually be preferred.

Tanh in Recurrent Networks

Tanh remains important in recurrent architectures. In a simple recurrent neural network, the hidden state is often updated by

h_t = \tanh(W_x x_t + W_h h_{t-1} + b).

Here x_t is the input at time t, h_{t-1} is the previous hidden state, and h_t is the new hidden state.

The tanh function keeps the hidden state bounded between -1 and 1. This can reduce uncontrolled growth in recurrent dynamics.
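
A hand-rolled version of this update, with random weights purely for illustration, shows the bounded hidden state directly:

import torch

input_size, hidden_size = 4, 3        # sizes chosen arbitrarily

W_x = torch.randn(hidden_size, input_size)
W_h = torch.randn(hidden_size, hidden_size)
b = torch.zeros(hidden_size)

h = torch.zeros(hidden_size)          # initial hidden state
xs = torch.randn(6, input_size)       # a sequence of six inputs

for x_t in xs:
    h = torch.tanh(W_x @ x_t + W_h @ h + b)
    print(h)                          # every component stays in (-1, 1)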

LSTMs also use tanh for candidate cell values and output transformations, while sigmoid functions are used for gates. A simplified LSTM gate structure contains expressions such as

f_t=\sigma(a_f), \quad i_t=\sigma(a_i), \quad o_t=\sigma(a_o), \quad \tilde{c}_t=\tanh(a_c).

The sigmoid gates decide how much information to forget, write, or expose. The tanh activation proposes signed content values to store in the cell state.

Thus, sigmoid and tanh often work together: sigmoid controls flow, tanh represents bounded signed information.
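
A minimal sketch of how these pieces combine in one LSTM step. The pre-activation scores are random stand-ins for the usual linear projections of the input and previous hidden state.

import torch

hidden_size = 3

# Stand-in scores; a real LSTM computes these from x_t and h_{t-1}.
a_f, a_i, a_o, a_c = (torch.randn(hidden_size) for _ in range(4))

c_prev = torch.zeros(hidden_size)     # previous cell state

f = torch.sigmoid(a_f)                # forget gate: how much old state to keep
i = torch.sigmoid(a_i)                # input gate: how much new content to write
o = torch.sigmoid(a_o)                # output gate: how much of the cell to expose
c_tilde = torch.tanh(a_c)             # candidate content, bounded in (-1, 1)

c = f * c_prev + i * c_tilde          # updated cell state
h = o * torch.tanh(c)                 # hidden state passed onward

print(c, h)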

Comparison of Sigmoid and Tanh

Property            | Sigmoid                      | Tanh
Formula             | \sigma(x)=1/(1+e^{-x})       | \tanh(x)=(e^x-e^{-x})/(e^x+e^{-x})
Output range        | (0,1)                        | (-1,1)
Value at zero       | 0.5                          | 0
Zero-centered       | No                           | Yes
Maximum derivative  | 0.25                         | 1
Saturates           | Yes                          | Yes
Common use          | Binary probabilities, gates  | Recurrent states, bounded hidden values

Tanh is usually better than sigmoid for hidden activations because it is zero-centered and has a larger derivative near zero. Sigmoid is usually better when the output must behave like a probability or gate.

Numerical Issues

Both sigmoid and tanh involve exponentials. Very large positive or negative inputs can cause numerical overflow or underflow if implemented naively.

For example, directly computing

e^{-x}

for very negative x may involve e^{|x|}, which can overflow floating-point limits.

PyTorch implements these functions using numerically stable kernels, so ordinary use of torch.sigmoid and torch.tanh is safe. The larger practical concern is gradient saturation, not direct numerical failure.
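
For example, extreme inputs still produce finite, well-behaved values:

import torch

x = torch.tensor([-100.0, 100.0])

print(torch.sigmoid(x))  # close to 0 and 1, no overflow or NaN
print(torch.tanh(x))     # close to -1 and 1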

For losses involving sigmoid, use stable combined loss functions when available:

loss_fn = torch.nn.BCEWithLogitsLoss()

instead of manually writing:

prob = torch.sigmoid(logits)
loss = torch.nn.functional.binary_cross_entropy(prob, targets)

The combined function avoids unstable intermediate probabilities near 0 or 1.

Practical Guidance

Use sigmoid when the model needs an output in (0,1), especially for binary classification, multi-label classification, and gates.

Use tanh when the model needs a bounded signed output in (-1,1), especially in recurrent networks or when representing normalized continuous values.

Avoid sigmoid and tanh as default hidden-layer activations in deep feedforward networks. Their saturation can slow learning. Modern networks usually prefer nonsaturating or weakly saturating activations such as ReLU, Leaky ReLU, GELU, and SiLU.

The main distinction is functional. Sigmoid is a probability-like squashing function. Tanh is a zero-centered squashing function. Both are differentiable, bounded, and historically important, but both can produce vanishing gradients in deep networks.

Exercises

  1. Compute \sigma(0), \sigma(2), and \sigma(-2). Explain why the outputs are asymmetric around zero.

  2. Show that

\sigma'(x)=\sigma(x)(1-\sigma(x)).

  3. Show that

\tanh'(x)=1-\tanh^2(x).

  4. Explain why sigmoid hidden layers can suffer from vanishing gradients.

  5. Implement a two-layer neural network in PyTorch using nn.Tanh. Replace nn.Tanh with nn.ReLU and compare training behavior on the same dataset.