# Practical Activation Selection

Activation functions should be chosen to fit the architecture, loss, initialization, normalization, and training scale. There is no universal best activation; the right choice depends on what the layer must do.

A useful rule of thumb: use simple activations for simple architectures, smooth activations for large transformer-style models, and bounded activations when the output must stay in a specific range.

### Hidden-Layer Activations

Hidden layers need nonlinear functions that preserve useful gradients. In most modern networks, hidden activations should avoid strong saturation.

For ordinary multilayer perceptrons, a strong starting point is ReLU:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
```

For slightly smoother behavior, use GELU or SiLU:

```python
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.GELU(),
    nn.Linear(256, 256),
    nn.GELU(),
    nn.Linear(256, 10),
)
```

ReLU is cheaper. GELU and SiLU are smoother. In small models, the difference may be minor. In large models, the smoother activations often train better.
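
The difference in smoothness shows up in the gradients near zero: ReLU's gradient jumps from 0 to 1, while GELU and SiLU change gradually. A small sketch to see this directly:

```python
import torch
import torch.nn.functional as F

# Compare gradients of each activation at a few points around zero.
x = torch.linspace(-1.0, 1.0, 5, requires_grad=True)

for name, fn in [("relu", F.relu), ("gelu", F.gelu), ("silu", F.silu)]:
    (grad,) = torch.autograd.grad(fn(x).sum(), x)
    print(name, [round(g, 2) for g in grad.tolist()])
```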

### Output Activations

Output activations depend on the task.

| Task | Final layer output | Loss |
|---|---|---|
| Multi-class classification | Raw logits | `nn.CrossEntropyLoss` |
| Binary classification | Raw logits | `nn.BCEWithLogitsLoss` |
| Multi-label classification | Raw logits per label | `nn.BCEWithLogitsLoss` |
| Regression | Usually raw value | MSE, MAE, Huber |
| Probability output for inference | Softmax or sigmoid | None (inference only) |
| Bounded regression | Sigmoid or tanh | Task-dependent |

During training, PyTorch losses often expect logits, not probabilities. Apply softmax or sigmoid for inference, logging, calibration, or sampling, but usually not before the loss.
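
For example, a multi-class head can feed raw logits to the loss during training and convert to probabilities only at inference (a minimal sketch):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)        # final layer emits raw logits
loss_fn = nn.CrossEntropyLoss()   # applies log-softmax internally

x = torch.randn(32, 128)
targets = torch.randint(0, 10, (32,))

logits = model(x)
loss = loss_fn(logits, targets)        # training: logits go to the loss

probs = torch.softmax(logits, dim=-1)  # inference: convert explicitly
```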

### Architecture-Specific Defaults

Different model families have different activation defaults.

| Architecture | Good default |
|---|---|
| Classical MLP | ReLU, GELU |
| CNN | ReLU, SiLU |
| ResNet-style CNN | ReLU |
| EfficientNet-style CNN | SiLU |
| Transformer encoder | GELU |
| Decoder-only language model | GELU, SiLU |
| RNN hidden state | Tanh |
| RNN gates | Sigmoid |
| GAN discriminator | Leaky ReLU |
| Autoencoder | ReLU, GELU, SiLU |
| VAE latent parameters | Usually no activation |
| Attention weights | Softmax |

These defaults are starting points. The final choice should be checked empirically.
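
The softmax entry for attention weights, for example, refers to normalizing similarity scores across keys so that each query's weights sum to one. A minimal scaled dot-product sketch:

```python
import torch

def attention(q, k, v):
    # q, k, v: [batch, seq, dim]. Softmax turns scores into weights.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = torch.softmax(scores, dim=-1)  # each row sums to one
    return weights @ v
```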

### Match Activation and Initialization

Initialization should match the activation.

For ReLU and Leaky ReLU, use Kaiming initialization:

```python
layer = nn.Linear(128, 256)
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
```

For Leaky ReLU:

```python
nn.init.kaiming_normal_(
    layer.weight,
    a=0.01,
    nonlinearity="leaky_relu",
)
```

For tanh, use Xavier initialization with the matching gain:

```python
nn.init.xavier_uniform_(
    layer.weight,
    gain=nn.init.calculate_gain("tanh"),
)
```

Bad initialization can push activations into saturation or shrink signals layer by layer.
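
The effect is easy to demonstrate by pushing a batch through a deep tanh stack under different weight scales (a minimal sketch; the depth, width, and scales are illustrative):

```python
import torch
import torch.nn as nn

def activation_stds(init_fn, depth: int = 8):
    # Record the standard deviation of activations at each layer.
    torch.manual_seed(0)
    h = torch.randn(512, 128)
    stds = []
    with torch.no_grad():
        for _ in range(depth):
            lin = nn.Linear(128, 128)
            init_fn(lin.weight)
            nn.init.zeros_(lin.bias)
            h = torch.tanh(lin(h))
            stds.append(round(h.std().item(), 3))
    return stds

# Too small: the signal shrinks layer by layer.
print(activation_stds(lambda w: nn.init.normal_(w, std=0.01)))
# Too large: pre-activations explode and tanh saturates near +/-1.
print(activation_stds(lambda w: nn.init.normal_(w, std=1.0)))
# Matched: Xavier with the tanh gain keeps the scale roughly stable.
print(activation_stds(
    lambda w: nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain("tanh"))
))
```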

### Match Activation and Normalization

Normalization changes how activations behave. Batch normalization and layer normalization keep pre-activation values in a more stable range.

A CNN block commonly uses:

```python
block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
)
```

A transformer feedforward block commonly applies layer normalization outside the feedforward module, which itself contains just the two linear layers and a smooth activation:

```python
ffn = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)
```
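
One common arrangement is the pre-norm residual form, where the normalization is applied before the feedforward module and the result is added back (a sketch, using the dimensions above):

```python
import torch.nn as nn

class PreNormFFN(nn.Module):
    # Pre-norm residual: x + FFN(LayerNorm(x)).
    def __init__(self, d_model: int = 768, d_hidden: int = 3072):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return x + self.ffn(self.norm(x))
```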

Normalization reduces saturation risk, but it does not make activation choice irrelevant.

### Use Bounded Activations Only When Needed

Sigmoid and tanh are bounded. This is useful when the model output must stay in a range.

Use sigmoid for values in \((0,1)\):

```python
prob = torch.sigmoid(logits)
```

Use tanh for values in \((-1,1)\):

```python
value = torch.tanh(raw_output)
```
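
To target an arbitrary interval, rescale the tanh output (a small sketch; `low` and `high` are illustrative parameters):

```python
import torch

def bounded_output(raw_output, low: float, high: float):
    # Map tanh's (-1, 1) range affinely onto (low, high).
    return low + (high - low) * (torch.tanh(raw_output) + 1) / 2
```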

Avoid sigmoid and tanh as default hidden activations in deep feedforward networks. Their saturation makes optimization harder.

### Multi-Class Versus Multi-Label

For multi-class classification, exactly one class is correct. Use softmax semantics through `CrossEntropyLoss`.

```python
loss = nn.CrossEntropyLoss()(logits, targets)
```

Here `logits` has shape `[B, K]`, and `targets` has shape `[B]`, containing integer class indices.

For multi-label classification, each class is an independent yes/no decision. Use sigmoid semantics through `BCEWithLogitsLoss`.

```python
loss = nn.BCEWithLogitsLoss()(logits, targets)
```

Here both `logits` and `targets` have shape `[B, K]`, and `targets` is a float tensor of zeros and ones.

This distinction is one of the most common sources of classification bugs.
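
A quick way to keep the two cases straight is to check shapes and dtypes with dummy tensors (a minimal sketch):

```python
import torch
import torch.nn as nn

B, K = 4, 3
logits = torch.randn(B, K)

# Multi-class: targets are integer class indices of shape [B].
mc_targets = torch.randint(0, K, (B,))            # dtype: long
mc_loss = nn.CrossEntropyLoss()(logits, mc_targets)

# Multi-label: targets are 0/1 floats of shape [B, K].
ml_targets = torch.randint(0, 2, (B, K)).float()  # dtype: float
ml_loss = nn.BCEWithLogitsLoss()(logits, ml_targets)
```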

### Watch Activation Statistics

Activation statistics can reveal bad choices quickly.

For ReLU, inspect the fraction of zero activations:

```python
zero_frac = (activation == 0).float().mean()
```

For sigmoid, inspect whether outputs are close to 0 or 1:

```python
sat_frac = ((activation < 0.01) | (activation > 0.99)).float().mean()
```

For tanh, inspect whether outputs are close to -1 or 1:

```python
sat_frac = ((activation < -0.99) | (activation > 0.99)).float().mean()
```

A high saturation fraction means gradients may be blocked or severely reduced.
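
These statistics are easiest to collect with forward hooks, which run on every forward pass. A minimal sketch for the ReLU case:

```python
import torch
import torch.nn as nn

stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Fraction of exactly-zero outputs for this ReLU.
        stats[name] = (output == 0).float().mean().item()
    return hook

model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

for name, module in model.named_modules():
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(make_hook(name))

_ = model(torch.randn(32, 128))
print(stats)
```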

### Practical Decision Table

| Situation | Recommended activation |
|---|---|
| Need fast baseline | ReLU |
| Many dead ReLU units | Leaky ReLU |
| Transformer feedforward block | GELU or SiLU |
| Modern CNN block | ReLU or SiLU |
| Binary probability output | Sigmoid at inference, logits during training |
| Multi-class probability output | Softmax at inference, logits during training |
| RNN gate | Sigmoid |
| RNN candidate state | Tanh |
| Output must be between -1 and 1 | Tanh |
| Output must be between 0 and 1 | Sigmoid |
| Training unstable due to saturation | Recheck initialization, normalization, and learning rate |
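
The two RNN rows describe the standard gated-cell pattern: sigmoid keeps gate values in \((0,1)\), and tanh keeps the candidate state in \((-1,1)\). A minimal GRU-style sketch (an illustrative class, not PyTorch's `nn.GRUCell`):

```python
import torch
import torch.nn as nn

class MinimalGRUCell(nn.Module):
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.z = nn.Linear(d_in + d_hidden, d_hidden)  # update gate
        self.r = nn.Linear(d_in + d_hidden, d_hidden)  # reset gate
        self.c = nn.Linear(d_in + d_hidden, d_hidden)  # candidate state

    def forward(self, x, h):
        xh = torch.cat([x, h], dim=-1)
        z = torch.sigmoid(self.z(xh))  # gate in (0, 1)
        r = torch.sigmoid(self.r(xh))  # gate in (0, 1)
        c = torch.tanh(self.c(torch.cat([x, r * h], dim=-1)))  # in (-1, 1)
        return (1 - z) * h + z * c
```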

### Minimal PyTorch Experiment

The safest way to choose an activation is to compare candidates under the same training setup.

```python
import torch
import torch.nn as nn
from collections.abc import Callable

class MLP(nn.Module):
    def __init__(self, activation: Callable[[], nn.Module]):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128, 256),
            activation(),
            nn.Linear(256, 256),
            activation(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.net(x)

models = {
    "relu": MLP(nn.ReLU),
    "gelu": MLP(nn.GELU),
    "silu": MLP(nn.SiLU),
    "leaky_relu": MLP(lambda: nn.LeakyReLU(0.01)),
}
```

When comparing them, keep everything else fixed: dataset split, optimizer, learning rate, batch size, initialization, number of steps, and random seed.
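
Continuing from the `models` dictionary above, a minimal comparison loop on synthetic data might look like this (a sketch; the data, step count, and learning rate are placeholders, and a strictly fair comparison would also re-seed before constructing each model):

```python
torch.manual_seed(0)
x = torch.randn(2048, 128)         # placeholder inputs
y = torch.randint(0, 10, (2048,))  # placeholder class labels

loss_fn = nn.CrossEntropyLoss()

for name, model in models.items():
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(200):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    print(f"{name}: final training loss {loss.item():.4f}")
```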

### Practical Rules

- Use logits during training with stable PyTorch loss functions.
- Use ReLU as a baseline for MLPs and CNNs.
- Use GELU or SiLU for transformer-style feedforward blocks.
- Use Leaky ReLU when ordinary ReLU produces too many inactive units.
- Use sigmoid and tanh deliberately, mostly for gates, recurrent states, and bounded outputs.

Treat activation selection as part of a system. A good activation can still fail with poor initialization, bad normalization, unstable learning rates, or incorrectly scaled inputs.

### Exercises

1. Train the same MLP with ReLU, GELU, SiLU, and Leaky ReLU. Compare validation loss.

2. Measure the fraction of zero activations in a ReLU network over training.

3. Replace ReLU with Leaky ReLU and check whether inactive units decrease.

4. Build a binary classifier using `BCEWithLogitsLoss`. Verify that no sigmoid is used before the loss.

5. Build a multi-class classifier using `CrossEntropyLoss`. Verify that no softmax is used before the loss.

