Stochastic depth is a regularization method for deep residual networks. During training, it randomly skips entire residual blocks. Instead of dropping individual activations, as dropout does, stochastic depth drops whole computational paths.
A standard residual block computes

$$y = x + F(x)$$

where $x$ is the input and $F(x)$ is the residual branch. With stochastic depth, the residual branch is randomly kept or removed during training:

$$y = x + b \cdot F(x)$$

where $b \in \{0, 1\}$. Here $p = P(b = 1)$ is the probability of keeping the residual branch.
Motivation
Very deep networks can overfit and can also become difficult to optimize. Residual connections make deep training more stable, but large residual networks still contain many layers and many parameters.
Stochastic depth improves regularization by forcing the model to work with many shallower subnetworks during training. On one step, a block may be active. On another step, the same block may be skipped.
This has two effects.
First, it reduces co-adaptation between blocks. A block cannot assume that every previous residual transformation is always present.
Second, it shortens the effective network depth during training. Gradients can pass through fewer transformations, which can improve optimization in very deep networks.
Residual Blocks
Residual networks are built from blocks of the form

$$y = x + F(x)$$

The skip connection gives the network a direct identity path. If $F$ learns a useful transformation, the block modifies the representation. If $F$ is unnecessary, the block can learn a function close to zero.

This structure makes it natural to randomly remove $F(x)$. The identity path remains valid, so the output still has the correct shape.
Stochastic depth works best when the skipped branch and the identity branch have compatible shapes. If a block changes spatial resolution or channel count, the skip path may include a projection. Such blocks require more care.
Training-Time Rule
A common stochastic depth rule uses a binary mask $b$:

$$y = x + b \cdot F(x)$$

where

$$b \sim \mathrm{Bernoulli}(p)$$

Here $p$ is the survival probability. The drop probability is $1 - p$.

If $p = 0.9$, the residual branch is used 90 percent of the time and skipped 10 percent of the time.
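The 90/10 split can be checked with a quick simulation (an illustrative sketch, assuming a survival probability of 0.9):

```python
import torch

torch.manual_seed(0)
p = 0.9  # survival probability

# Sample the keep/skip decision for 10,000 training steps.
decisions = torch.bernoulli(torch.full((10_000,), p))
keep_fraction = decisions.mean().item()
# keep_fraction is close to 0.9: the branch is active on ~90% of steps.
```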
Inverted Stochastic Depth
Many implementations use an inverted form:

$$y = x + \frac{b}{p} \cdot F(x)$$

This keeps the expected residual contribution unchanged during training:

$$\mathbb{E}[y] = x + F(x)$$

At inference time, all residual branches are used and no random dropping is applied. The block becomes the ordinary residual block:

$$y = x + F(x)$$
This mirrors inverted dropout, where scaling is applied during training so inference remains simple.
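The expectation-preserving property of the inverted form can be verified numerically (a small sketch using a constant tensor as a stand-in for the branch output $F(x)$):

```python
import torch

torch.manual_seed(0)
p = 0.8                    # survival probability
f_x = torch.ones(100_000)  # constant stand-in for the branch output F(x)

# Inverted rule: keep with probability p, scale kept branches by 1/p.
mask = torch.bernoulli(torch.full_like(f_x, p))
contribution = f_x * mask / p

# In expectation the contribution equals F(x) itself.
mean_contribution = contribution.mean().item()
# mean_contribution is close to 1.0
```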
Stochastic Depth Versus Dropout
Dropout usually drops individual activation entries. Stochastic depth drops entire residual branches.
| Method | Drops | Common location |
|---|---|---|
| Dropout | Individual activations | MLPs, attention, classifier heads |
| Dropout2d | Channels | CNN feature maps |
| Stochastic depth | Residual branches | ResNets, Transformers, ConvNeXt, ViTs |
Stochastic depth is often better suited to modern residual architectures because it respects the block structure of the model.
Layer-Dependent Drop Rates
In very deep models, stochastic depth often uses different drop probabilities for different layers.
Early layers are usually dropped less often. Later layers are dropped more often.
If a model has $L$ residual blocks, one common schedule is

$$p_l = \frac{l}{L} \, p_{\max}$$

where $p_l$ is the drop probability for block $l$, and $p_{\max}$ is the maximum drop probability at the deepest layer.

The survival probability is

$$1 - p_l = 1 - \frac{l}{L} \, p_{\max}$$
This schedule preserves low-level features in early layers while applying stronger regularization to deeper layers.
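The schedule can be computed with a small helper (a sketch; the function name is illustrative):

```python
def drop_prob_schedule(num_blocks, max_drop_prob):
    # Block l (1-indexed) gets drop probability (l / L) * p_max, so the
    # first block is dropped least and the deepest block most.
    return [l / num_blocks * max_drop_prob for l in range(1, num_blocks + 1)]


probs = drop_prob_schedule(num_blocks=4, max_drop_prob=0.2)
# → approximately [0.05, 0.10, 0.15, 0.20]
```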
PyTorch Implementation
A minimal stochastic depth module can be written as:
```python
import torch
from torch import nn


class StochasticDepth(nn.Module):
    def __init__(self, drop_prob):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        # In eval mode (or with drop_prob == 0) the branch passes through.
        if not self.training or self.drop_prob == 0.0:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One keep/drop decision per batch element, broadcast over the rest.
        shape = [x.shape[0]] + [1] * (x.ndim - 1)
        mask = torch.empty(shape, device=x.device, dtype=x.dtype)
        mask.bernoulli_(keep_prob)
        # Inverted scaling keeps the expected output unchanged.
        return x * mask / keep_prob
```

This module drops whole samples in a batch rather than individual tensor entries. The mask has shape `[B, 1, 1, ...]`, so each example either keeps or drops the branch.
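The per-sample mask construction can be exercised on its own (a standalone sketch repeating the forward-pass arithmetic outside the module):

```python
import torch

torch.manual_seed(0)
drop_prob = 0.5
keep_prob = 1.0 - drop_prob
x = torch.ones(8, 4)

# Same mask construction as the module's forward pass: one decision per
# batch element, broadcast over all remaining dimensions.
shape = [x.shape[0]] + [1] * (x.ndim - 1)
mask = torch.empty(shape).bernoulli_(keep_prob)
out = x * mask / keep_prob

# Each row of `out` is either all zeros (branch dropped) or all
# 1 / keep_prob = 2.0 (branch kept and rescaled).
```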
Used inside a residual block:
```python
class ResidualBlock(nn.Module):
    def __init__(self, dim, drop_prob):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )
        self.drop_path = StochasticDepth(drop_prob)

    def forward(self, x):
        return x + self.drop_path(self.branch(x))
```

During training, the residual branch is randomly removed. During evaluation, it is always used.
Using Built-In PyTorch and Torchvision Utilities
Some PyTorch ecosystem libraries provide stochastic depth or drop path implementations.
In torchvision, stochastic depth is available as:
```python
from torchvision.ops import StochasticDepth
```

A block can use it as:

```python
self.drop_path = StochasticDepth(p=0.1, mode="row")
```

The `mode="row"` setting applies a different binary decision per batch element. The `mode="batch"` setting applies the same decision to the whole batch.
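The difference between the two modes can be illustrated with plain tensors (a sketch of the semantics only, not torchvision's actual implementation):

```python
import torch

torch.manual_seed(0)
x = torch.ones(4, 3)  # batch of 4 examples
p = 0.5               # drop probability

# mode="row": an independent keep/drop decision for each batch element.
row_mask = torch.bernoulli(torch.full((x.shape[0], 1), 1 - p))
row_out = x * row_mask / (1 - p)

# mode="batch": a single decision shared by the whole batch.
batch_mask = torch.bernoulli(torch.full((1, 1), 1 - p))
batch_out = x * batch_mask / (1 - p)
```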
Many vision transformer and ConvNeXt implementations call this module `DropPath`. The name differs, but the idea is the same.
Stochastic Depth in Vision Transformers
Vision transformers are built from residual blocks:

$$x \leftarrow x + \mathrm{Attn}(\mathrm{LN}(x))$$

$$x \leftarrow x + \mathrm{MLP}(\mathrm{LN}(x))$$
Stochastic depth can be applied to either residual branch:
```python
class TransformerBlock(nn.Module):
    def __init__(self, dim, attn, mlp, drop_prob):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = attn
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = mlp
        self.drop_path = StochasticDepth(drop_prob)

    def forward(self, x):
        x = x + self.drop_path(self.attn(self.norm1(x)))
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x
```

In practice, separate drop path modules may be used for the attention and MLP branches.
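Separate, depth-dependent drop probabilities can be wired into a stack of blocks as follows (a sketch: the functional `drop_path`, the `Block` class, and the `nn.Linear` stand-ins for attention and MLP are all illustrative, not from any particular library):

```python
import torch
from torch import nn


def drop_path(x, drop_prob, training):
    # Per-sample drop path: zero the whole branch for some batch
    # elements and rescale survivors by 1 / keep_prob.
    if not training or drop_prob == 0.0:
        return x
    keep_prob = 1.0 - drop_prob
    shape = [x.shape[0]] + [1] * (x.ndim - 1)
    mask = torch.bernoulli(torch.full(shape, keep_prob, dtype=x.dtype, device=x.device))
    return x * mask / keep_prob


class Block(nn.Module):
    # Transformer-style block with separate drop probabilities for the
    # attention and MLP branches; nn.Linear stands in for both branches.
    def __init__(self, dim, attn_drop_prob, mlp_drop_prob):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Linear(dim, dim)
        self.attn_drop_prob = attn_drop_prob
        self.mlp_drop_prob = mlp_drop_prob

    def forward(self, x):
        x = x + drop_path(self.attn(self.norm1(x)), self.attn_drop_prob, self.training)
        x = x + drop_path(self.mlp(self.norm2(x)), self.mlp_drop_prob, self.training)
        return x


# Linearly increasing drop probability over depth.
depth, p_max = 6, 0.3
probs = [l / depth * p_max for l in range(1, depth + 1)]
blocks = nn.Sequential(*[Block(16, p, p) for p in probs])

blocks.eval()  # all branches active, deterministic output
y = blocks(torch.randn(2, 8, 16))
```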
Stochastic depth is common in vision transformers, ConvNeXt-style networks, and other deep residual models.
Stochastic Depth in CNNs
In residual CNNs, stochastic depth is applied to convolutional residual branches.
A simplified residual CNN block:
```python
class ConvResidualBlock(nn.Module):
    def __init__(self, channels, drop_prob):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.drop_path = StochasticDepth(drop_prob)
        self.activation = nn.ReLU()

    def forward(self, x):
        y = x + self.drop_path(self.branch(x))
        return self.activation(y)
```

This is most direct when input and output shapes match. Blocks that change resolution need a projection path, and dropping rules should preserve valid shapes.
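The shape constraint can be made concrete: a branch that changes the channel count breaks the identity addition, which is why such blocks carry a projection on the skip path (a minimal illustration):

```python
import torch
from torch import nn

x = torch.randn(2, 8, 16, 16)                        # [B, C, H, W]
branch = nn.Conv2d(8, 16, kernel_size=3, padding=1)  # changes C: 8 -> 16

try:
    _ = x + branch(x)  # [2, 8, 16, 16] + [2, 16, 16, 16] cannot broadcast
    shapes_match = True
except RuntimeError:
    shapes_match = False

# A 1x1 projection on the skip path restores a valid residual addition.
proj = nn.Conv2d(8, 16, kernel_size=1)
y = proj(x) + branch(x)
```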
Choosing Drop Probabilities
Typical maximum drop probabilities are modest:
| Model size | Typical maximum drop probability |
|---|---|
| Small model | 0.0 to 0.1 |
| Medium model | 0.1 to 0.2 |
| Large model | 0.2 to 0.4 |
| Very large vision model | 0.4 or higher, with validation |
The correct value depends on model depth, data size, augmentation strength, and optimization schedule.
If the drop probability is too high, the model may underfit. Training becomes noisy because too many residual transformations are removed.
Interaction with Other Regularizers
Stochastic depth is often combined with:
| Method | Interaction |
|---|---|
| Weight decay | Regularizes parameters directly |
| Data augmentation | Regularizes input distribution |
| Label smoothing | Softens targets |
| Mixup and CutMix | Strong image regularization |
| Dropout | May still be used in MLP or attention layers |
When several strong regularizers are combined, each one usually needs a smaller strength. For example, a model using Mixup, CutMix, label smoothing, weight decay, and stochastic depth may need less ordinary dropout.
Effects on Training and Inference
Stochastic depth affects training but not inference.
During training:
- different residual paths are sampled,
- effective depth varies across steps,
- gradients flow through random subsets of blocks,
- the model is regularized by architectural noise.
During inference:
- all residual branches are active,
- predictions are deterministic unless other stochastic methods are used,
- there is no extra inference cost.
This makes stochastic depth attractive for deployment. It improves training regularization without slowing the final model.
Failure Modes
Stochastic depth can fail when used too aggressively or in the wrong architecture.
Common problems include:
| Problem | Cause |
|---|---|
| Underfitting | Drop probability too high |
| Unstable training | Too much architectural noise |
| Shape errors | Dropped branch changes tensor shape incorrectly |
| Weak regularization | Drop probability too low |
| Poor early learning | Early layers dropped too often |
Early layers should usually have low drop probabilities. They learn basic features used by later blocks.
Practical Guidelines
Use stochastic depth mainly in residual architectures. It is most natural when the model has explicit skip connections.
Start with a small maximum drop probability such as 0.1. Increase it for deeper models or smaller datasets if validation performance improves.
Use a depth-dependent schedule so later blocks are dropped more often than earlier blocks.
Keep stochastic depth active only during training. Always call `model.eval()` during validation and inference.
Combine it carefully with other regularizers. Excessive regularization can reduce both training and validation performance.
Summary
Stochastic depth randomly removes residual branches during training. It trains an ensemble of shallower subnetworks inside one deep residual model.
It differs from dropout by operating at the block level rather than the activation level. This makes it especially useful for ResNets, ConvNeXt models, vision transformers, and other residual architectures.
At inference time, all blocks are used. The method adds no inference cost while often improving generalization in deep networks.