Pipeline Parallelism

Pipeline parallelism splits a model into sequential stages and places each stage on a different device. It is a form of model parallelism designed to reduce idle time.

Instead of sending one full batch through stage 1, then stage 2, then stage 3, pipeline parallelism divides the batch into smaller microbatches. While one microbatch is processed by a later stage, another microbatch can be processed by an earlier stage.

The model is still sequential:

f_\theta(x) = f_K(f_{K-1}(\cdots f_2(f_1(x)) \cdots)).

Each stage stores one part of the model:

| Stage | Device | Model block |
| --- | --- | --- |
| Stage 1 | GPU 0 | f_1 |
| Stage 2 | GPU 1 | f_2 |
| Stage 3 | GPU 2 | f_3 |
| Stage 4 | GPU 3 | f_4 |

The main goal is to keep all devices busy.

The Pipeline Idea

Consider a model split into four stages. A full batch is split into four microbatches:

B = B_1 \cup B_2 \cup B_3 \cup B_4.

The first stage begins processing B_1. When it finishes, it sends the activation to the second stage and starts processing B_2. The second stage then processes B_1 while the first stage processes B_2.

After the pipeline fills, several stages work at the same time.

| Time | GPU 0 | GPU 1 | GPU 2 | GPU 3 |
| --- | --- | --- | --- | --- |
| 1 | B_1 | idle | idle | idle |
| 2 | B_2 | B_1 | idle | idle |
| 3 | B_3 | B_2 | B_1 | idle |
| 4 | B_4 | B_3 | B_2 | B_1 |

This is the forward pipeline. During training, backward computation must also be scheduled.
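The fill pattern above can be reproduced with a short sketch: at time step t (1-indexed), stage g (0-indexed) works on microbatch t - g if that microbatch exists, and is idle otherwise.

```python
# Sketch: reproduce the forward fill table for K stages and M microbatches.

def forward_schedule(num_stages, num_microbatches):
    rows = []
    for t in range(1, num_stages + num_microbatches):
        row = []
        for g in range(num_stages):
            mb = t - g  # microbatch index at stage g during step t
            row.append(f"B{mb}" if 1 <= mb <= num_microbatches else "idle")
        rows.append(row)
    return rows

schedule = forward_schedule(4, 4)
print(schedule[0])  # step 1: only GPU 0 is busy
print(schedule[3])  # step 4: all four GPUs are busy
```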

Microbatches

A microbatch is a smaller piece of a training batch. If the global batch size is B and we split it into M microbatches, then each microbatch has approximately

B_{\text{micro}} = \frac{B}{M}

examples.

Microbatches allow pipeline stages to overlap. More microbatches usually reduce idle time, but they may also increase overhead.

For example, a batch of 256 examples may be split into 8 microbatches of 32 examples:

global_batch_size = 256
num_microbatches = 8
microbatch_size = global_batch_size // num_microbatches

The optimizer step is usually performed after gradients from all microbatches have been accumulated.

Pipeline Bubbles

A pipeline bubble is idle time caused by filling or draining the pipeline.

At the beginning, later stages are idle because no activations have reached them yet. At the end, earlier stages may be idle while later stages finish remaining microbatches.

If there are K pipeline stages and M microbatches, the approximate fraction of wasted work from bubbles is

\frac{K - 1}{M + K - 1}.

This expression shows why more microbatches improve pipeline utilization.

For example, with 4 stages and 4 microbatches, the bubble fraction is

\frac{3}{7} \approx 0.43.

With 4 stages and 32 microbatches, it becomes

\frac{3}{35} \approx 0.086.

So increasing the number of microbatches can sharply reduce idle time.
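The bubble-fraction formula above is easy to check numerically:

```python
# Sketch: the bubble fraction (K - 1) / (M + K - 1) from the text.

def bubble_fraction(num_stages, num_microbatches):
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

print(bubble_fraction(4, 4))   # 3/7, about 0.43
print(bubble_fraction(4, 32))  # 3/35, about 0.086
```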

GPipe Scheduling

GPipe is one common pipeline schedule. It runs all forward microbatches first, then all backward microbatches.

The schedule has two phases:

  1. forward pass for all microbatches
  2. backward pass for all microbatches

This is simple and stable. However, it stores many activations because all forward passes finish before backward starts.

For a model with many stages and many microbatches, activation memory can become large.
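A minimal sketch of one stage's order of work under GPipe, assuming the common convention that backward passes run in reverse microbatch order, makes the memory cost concrete: every forward finishes before any backward starts, so the number of microbatches with live activations peaks at M.

```python
# Sketch: per-stage GPipe order (all forwards, then all backwards) and the
# resulting peak count of live microbatch activations.

def gpipe_stage_ops(num_microbatches):
    forwards = [f"F{i}" for i in range(1, num_microbatches + 1)]
    backwards = [f"B{i}" for i in range(num_microbatches, 0, -1)]
    return forwards + backwards

def peak_live_activations(ops):
    live = peak = 0
    for op in ops:
        live += 1 if op.startswith("F") else -1  # forward stores, backward frees
        peak = max(peak, live)
    return peak

ops = gpipe_stage_ops(4)
print(ops)                         # F1..F4 followed by B4..B1
print(peak_live_activations(ops))  # 4: all microbatch activations live at once
```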

Activation checkpointing is often used with GPipe. Instead of storing all intermediate activations, the system stores selected tensors and recomputes missing activations during backward.

This trades extra computation for lower memory.

1F1B Scheduling

Another common schedule is one-forward-one-backward, abbreviated as 1F1B.

After the pipeline is filled, each stage alternates between one forward microbatch and one backward microbatch.

This reduces activation memory because backward begins before all forward microbatches have completed.

A simplified schedule looks like this:

| Phase | Behavior |
| --- | --- |
| Warmup | Fill the pipeline with forward microbatches |
| Steady state | Alternate forward and backward work |
| Cooldown | Finish remaining backward microbatches |

1F1B is widely used in large language model training because it improves memory usage and keeps devices active.
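The three phases can be sketched for one stage, assuming the common convention that stage s (0-indexed, with num_stages stages) runs (num_stages - 1 - s) warmup forwards before alternating. The key property: peak live activations per stage is bounded by the pipeline depth, not the number of microbatches.

```python
# Sketch: one stage's 1F1B order (warmup forwards, steady-state alternation,
# cooldown backwards) under the stated warmup convention.

def one_f_one_b_stage_ops(stage, num_stages, num_microbatches):
    warmup = min(num_stages - 1 - stage, num_microbatches)
    ops = [f"F{i}" for i in range(1, warmup + 1)]  # warmup phase
    f_next, b_next = warmup + 1, 1
    while b_next <= num_microbatches:
        if f_next <= num_microbatches:             # steady state: one forward...
            ops.append(f"F{f_next}")
            f_next += 1
        ops.append(f"B{b_next}")                   # ...then one backward
        b_next += 1
    return ops

print(one_f_one_b_stage_ops(0, 4, 8))
```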

Weight Versioning

Pipeline training can introduce a subtle issue: different microbatches may see different versions of model weights if updates happen too early.

To avoid this, pipeline training usually accumulates gradients across all microbatches and applies the optimizer step only after the full batch has completed.

This keeps the training semantics close to ordinary mini-batch training.

The rule is:

one optimizer step per full batch, not per microbatch.

In PyTorch-like pseudocode:

optimizer.zero_grad(set_to_none=True)

for micro_x, micro_y in microbatches:
    # run the microbatch forward through all pipeline stages
    loss = pipeline_forward(micro_x, micro_y)
    # scale so the accumulated gradient matches the full-batch average
    (loss / num_microbatches).backward()

optimizer.step()

Real pipeline implementations hide much of this scheduling, but the optimization rule remains the same.

Partitioning the Model

The quality of pipeline parallelism depends heavily on how the model is partitioned.

A good partition should balance:

| Concern | Goal |
| --- | --- |
| Compute | Each stage takes similar time |
| Memory | No stage exceeds device memory |
| Communication | Minimize activation transfer |
| Structure | Split at clean layer boundaries |

If one stage is much slower than the others, it becomes the bottleneck. All other stages wait for it.

Transformer models are often easier to partition than irregular networks because they contain repeated blocks. A model with 48 transformer layers can be split across 8 stages, with 6 layers per stage.

However, embeddings and output projection layers may be large. They can create imbalance if placed carelessly.
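The 48-layer, 8-stage example can be written as a contiguous index partition (layer indices here are 0-based):

```python
# Sketch: contiguous partition of 48 layers across 8 stages, 6 per stage.

num_layers, num_stages = 48, 8
per_stage = num_layers // num_stages  # 6 layers per stage

stages = [list(range(s * per_stage, (s + 1) * per_stage))
          for s in range(num_stages)]

print(stages[0])   # first stage owns layers 0..5
print(stages[-1])  # last stage owns layers 42..47
```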

Communication Between Stages

Pipeline stages communicate activations during the forward pass and activation gradients during the backward pass.

Forward communication sends:

h_k = f_k(h_{k-1})

from stage k to stage k+1.

Backward communication sends:

\nabla_{h_k} L

from stage k+1 back to stage k.

The amount of communication depends on the size of boundary activations. A bad partition may cut the model at a point where activation tensors are large, causing high communication overhead.

Good partitions reduce both compute imbalance and communication volume.
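A back-of-the-envelope estimate of the boundary traffic, assuming an illustrative [microbatch, seq_len, hidden] activation of 16-bit values (the numbers below are examples, not from the text):

```python
# Sketch: size of the boundary activation h_k sent between adjacent stages.

microbatch_size, seq_len, hidden = 4, 2048, 8192
bytes_per_value = 2  # fp16 / bf16

boundary_bytes = microbatch_size * seq_len * hidden * bytes_per_value
print(boundary_bytes // 2**20, "MiB sent per microbatch, per direction")
```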

Pipeline Parallelism in Transformers

Transformers are natural candidates for pipeline parallelism because they consist of stacked blocks:

x \rightarrow \text{Block}_1 \rightarrow \text{Block}_2 \rightarrow \cdots \rightarrow \text{Block}_L.

A simple partition assigns consecutive blocks to each stage.

For a 24-layer transformer on 4 GPUs:

| GPU | Layers |
| --- | --- |
| GPU 0 | Embedding, layers 1 to 6 |
| GPU 1 | Layers 7 to 12 |
| GPU 2 | Layers 13 to 18 |
| GPU 3 | Layers 19 to 24, output head |

This partition is easy to reason about. Each stage owns a contiguous segment of the network.

For decoder-only language models, the output head may share weights with the embedding matrix. This can complicate placement because the first and last stages may need access to the same large parameter matrix.

Interaction with Data Parallelism

Pipeline parallelism is often combined with data parallelism.

Suppose we have 16 GPUs. We may build two independent 8-GPU pipelines. Each pipeline processes different data, and corresponding stages synchronize gradients across pipelines.

This gives two axes of parallelism:

| Axis | Meaning |
| --- | --- |
| Pipeline parallelism | Splits the model across stages |
| Data parallelism | Replicates the pipeline across data shards |

For very large models, a third axis is also common:

| Axis | Meaning |
| --- | --- |
| Tensor parallelism | Splits large matrix operations inside each stage |

Large-scale transformer training often uses all three.
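The axes multiply to the total GPU count; the two-pipeline example above corresponds to a pipeline size of 8 and a data-parallel size of 2 (with no tensor parallelism):

```python
# Sketch: parallelism axes multiply to the world size.

pipeline_parallel_size = 8
data_parallel_size = 2
tensor_parallel_size = 1

world_size = pipeline_parallel_size * data_parallel_size * tensor_parallel_size
print(world_size)  # 16 GPUs total
```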

PyTorch Pipeline APIs

PyTorch has supported pipeline-style training through multiple APIs and ecosystem tools. The exact API has changed over time, but the conceptual structure is stable: define stages, split inputs into microbatches, and schedule communication.

A simplified conceptual model looks like this:

stage0 = torch.nn.Sequential(
    embedding,
    *layers[:6],
).to("cuda:0")

stage1 = torch.nn.Sequential(
    *layers[6:12],
).to("cuda:1")

stage2 = torch.nn.Sequential(
    *layers[12:18],
).to("cuda:2")

stage3 = torch.nn.Sequential(
    *layers[18:],
    output_head,
).to("cuda:3")

Then a pipeline runtime coordinates microbatch execution.

For serious use, common choices include:

| Tool | Role |
| --- | --- |
| PyTorch distributed pipeline APIs | Native pipeline abstractions |
| DeepSpeed Pipeline Parallelism | Large-scale training system |
| Megatron-LM | Transformer tensor and pipeline parallelism |
| FairScale | Earlier PyTorch scaling utilities |
| Hugging Face Accelerate | Higher-level distributed orchestration |

The manual implementation is useful for learning, but production training usually needs a runtime that handles scheduling, communication, and failure cases.

Activation Checkpointing

Pipeline parallelism often uses activation checkpointing to reduce memory.

During normal backpropagation, intermediate activations are stored so gradients can be computed later. In large models, these activations consume substantial memory.

Activation checkpointing stores only selected activations. During backward, missing activations are recomputed.

This changes the memory-compute tradeoff:

| Choice | Memory | Compute |
| --- | --- | --- |
| Store all activations | High | Lower |
| Checkpoint activations | Lower | Higher |

In PyTorch, checkpointing can be used with:

from torch.utils.checkpoint import checkpoint

def forward(self, x):
    # recompute each block during backward instead of storing its activations
    x = checkpoint(self.block1, x, use_reentrant=False)
    x = checkpoint(self.block2, x, use_reentrant=False)
    return x

In pipeline training, checkpointing is often essential because multiple microbatches may have live activations at the same time.

Throughput and Latency

Pipeline parallelism improves throughput, not latency.

A single example still has to pass through every stage in order. The latency for one example may increase because activations must move between devices.

The benefit comes from processing many microbatches at once. Once the pipeline is full, different devices work on different microbatches simultaneously.

This makes pipeline parallelism suitable for training and batch inference. It is less attractive for low-latency single-request inference unless combined with careful batching and scheduling.
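An idealized timing model makes the throughput/latency distinction concrete, assuming every stage takes a fixed time t_stage per microbatch: a single example still crosses all stages in sequence, while the full batch overlaps across stages once the pipeline is filled.

```python
# Sketch: idealized forward-only timing for K stages and M microbatches.

def single_example_latency(num_stages, t_stage):
    # one example must still traverse every stage in order
    return num_stages * t_stage

def full_batch_time(num_stages, num_microbatches, t_stage):
    # fill time plus one step per remaining microbatch
    return (num_microbatches + num_stages - 1) * t_stage

print(single_example_latency(4, 1.0))  # 4.0: latency is not improved
print(full_batch_time(4, 32, 1.0))     # 35.0, vs 4 * 32 = 128.0 unpipelined
```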

Common Failure Modes

The first failure mode is poor stage balance. If one stage takes twice as long as the others, overall throughput is limited by that slow stage.

The second failure mode is too few microbatches. This creates large pipeline bubbles and poor utilization.

The third failure mode is too many microbatches. Very small microbatches may reduce arithmetic efficiency and increase scheduling overhead.

The fourth failure mode is excessive activation communication. Large boundary tensors can make communication dominate.

The fifth failure mode is complicated debugging. Errors may appear far from the stage where they originate, especially in asynchronous distributed execution.

The sixth failure mode is memory blowup from stored activations. This is especially common with schedules that run many forward microbatches before backward begins.

When to Use Pipeline Parallelism

Use pipeline parallelism when the model has many sequential layers and cannot fit efficiently on one device.

It works well for:

| Model type | Reason |
| --- | --- |
| Large transformers | Many repeated sequential blocks |
| Deep CNNs | Sequential layer groups |
| Encoder-decoder models | Natural stage boundaries |
| Large multimodal models | Components can be staged |

It works poorly when:

| Situation | Problem |
| --- | --- |
| Model has irregular branching | Hard to schedule |
| Boundary activations are huge | Communication-heavy |
| Few layers exist | Poor partitioning |
| Low-latency inference is required | Sequential stage latency remains |

Pipeline parallelism solves the idle-device problem of naive layer-wise model parallelism by turning one batch into a stream of microbatches. Its effectiveness depends on balanced stages, enough microbatches, and efficient communication.