Pipeline Parallelism

Pipeline parallelism splits a model into sequential stages and places each stage on a different device. It is a form of model parallelism designed to reduce idle time.

Instead of sending one full batch through stage 1, then stage 2, then stage 3, pipeline parallelism divides the batch into smaller microbatches. While one microbatch is processed by a later stage, another microbatch can be processed by an earlier stage.

The model is still sequential:

f_\theta(x) = f_K(f_{K-1}(\cdots f_2(f_1(x)) \cdots)).

Each stage stores one part of the model:

| Stage | Device | Model block |
| --- | --- | --- |
| Stage 1 | GPU 0 | f_1 |
| Stage 2 | GPU 1 | f_2 |
| Stage 3 | GPU 2 | f_3 |
| Stage 4 | GPU 3 | f_4 |

The main goal is to keep all devices busy.

The Pipeline Idea

Consider a model split into four stages. A full batch is split into four microbatches:

B = B_1 \cup B_2 \cup B_3 \cup B_4.

The first stage begins processing B_1. When it finishes, it sends the activation to the second stage and starts processing B_2. The second stage then processes B_1 while the first stage processes B_2.

After the pipeline fills, several stages work at the same time.

| Time | GPU 0 | GPU 1 | GPU 2 | GPU 3 |
| --- | --- | --- | --- | --- |
| 1 | B_1 | idle | idle | idle |
| 2 | B_2 | B_1 | idle | idle |
| 3 | B_3 | B_2 | B_1 | idle |
| 4 | B_4 | B_3 | B_2 | B_1 |

This is the forward pipeline. During training, backward computation must also be scheduled.
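The fill pattern above can be reproduced with a short sketch: at time step t (1-indexed), stage g (0-indexed) works on microbatch t - g if that microbatch exists, and is idle otherwise.

```python
# Sketch: reproduce the forward fill table for K stages and M microbatches.

def forward_schedule(num_stages, num_microbatches):
    rows = []
    for t in range(1, num_stages + num_microbatches):
        row = []
        for g in range(num_stages):
            mb = t - g  # microbatch index at stage g during step t
            row.append(f"B{mb}" if 1 <= mb <= num_microbatches else "idle")
        rows.append(row)
    return rows

schedule = forward_schedule(4, 4)
print(schedule[0])  # step 1: only GPU 0 is busy
print(schedule[3])  # step 4: all four GPUs are busy
```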

Microbatches

A microbatch is a smaller piece of a training batch. If the global batch size is B and we split it into M microbatches, then each microbatch has approximately

B_{\text{micro}} = \frac{B}{M}

examples.

Microbatches allow pipeline stages to overlap. More microbatches usually reduce idle time, but they may also increase overhead.

For example, a batch of 256 examples may be split into 8 microbatches of 32 examples:

global_batch_size = 256
num_microbatches = 8
microbatch_size = global_batch_size // num_microbatches

The optimizer step is usually performed after gradients from all microbatches have been accumulated.

Pipeline Bubbles

A pipeline bubble is idle time caused by filling or draining the pipeline.

At the beginning, later stages are idle because no activations have reached them yet. At the end, earlier stages may be idle while later stages finish remaining microbatches.

If there are K pipeline stages and M microbatches, the approximate fraction of wasted work from bubbles is

\frac{K - 1}{M + K - 1}.

This expression shows why more microbatches improve pipeline utilization.

For example, with 4 stages and 4 microbatches, the bubble fraction is

\frac{3}{7} \approx 0.43.

With 4 stages and 32 microbatches, it becomes

\frac{3}{35} \approx 0.086.

So increasing the number of microbatches can sharply reduce idle time.
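The bubble-fraction formula above is easy to check numerically:

```python
# Sketch: the bubble fraction (K - 1) / (M + K - 1) from the text.

def bubble_fraction(num_stages, num_microbatches):
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

print(bubble_fraction(4, 4))   # 3/7, about 0.43
print(bubble_fraction(4, 32))  # 3/35, about 0.086
```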

GPipe Scheduling

GPipe is one common pipeline schedule. It runs all forward microbatches first, then all backward microbatches.

The schedule has two phases:

  1. forward pass for all microbatches
  2. backward pass for all microbatches

This is simple and stable. However, it stores many activations because all forward passes finish before backward starts.

For a model with many stages and many microbatches, activation memory can become large.
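A minimal sketch of one stage's order of work under GPipe, assuming the common convention that backward passes run in reverse microbatch order, makes the memory cost concrete: every forward finishes before any backward starts, so the number of microbatches with live activations peaks at M.

```python
# Sketch: per-stage GPipe order (all forwards, then all backwards) and the
# resulting peak count of live microbatch activations.

def gpipe_stage_ops(num_microbatches):
    forwards = [f"F{i}" for i in range(1, num_microbatches + 1)]
    backwards = [f"B{i}" for i in range(num_microbatches, 0, -1)]
    return forwards + backwards

def peak_live_activations(ops):
    live = peak = 0
    for op in ops:
        live += 1 if op.startswith("F") else -1  # forward stores, backward frees
        peak = max(peak, live)
    return peak

ops = gpipe_stage_ops(4)
print(ops)                         # F1..F4 followed by B4..B1
print(peak_live_activations(ops))  # 4: all microbatch activations live at once
```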

Activation checkpointing is often used with GPipe. Instead of storing all intermediate activations, the system stores selected tensors and recomputes missing activations during backward.

This trades extra computation for lower memory.

1F1B Scheduling

Another common schedule is one-forward-one-backward, abbreviated as 1F1B.

After the pipeline is filled, each stage alternates between one forward microbatch and one backward microbatch.

This reduces activation memory because backward begins before all forward microbatches have completed.

A simplified schedule looks like this:

| Phase | Behavior |
| --- | --- |
| Warmup | Fill the pipeline with forward microbatches |
| Steady state | Alternate forward and backward work |
| Cooldown | Finish remaining backward microbatches |

1F1B is widely used in large language model training because it improves memory usage and keeps devices active.
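The three phases can be sketched for one stage, assuming the common convention that stage s (0-indexed, with num_stages stages) runs (num_stages - 1 - s) warmup forwards before alternating. The key property: peak live activations per stage is bounded by the pipeline depth, not the number of microbatches.

```python
# Sketch: one stage's 1F1B order (warmup forwards, steady-state alternation,
# cooldown backwards) under the stated warmup convention.

def one_f_one_b_stage_ops(stage, num_stages, num_microbatches):
    warmup = min(num_stages - 1 - stage, num_microbatches)
    ops = [f"F{i}" for i in range(1, warmup + 1)]  # warmup phase
    f_next, b_next = warmup + 1, 1
    while b_next <= num_microbatches:
        if f_next <= num_microbatches:             # steady state: one forward...
            ops.append(f"F{f_next}")
            f_next += 1
        ops.append(f"B{b_next}")                   # ...then one backward
        b_next += 1
    return ops

print(one_f_one_b_stage_ops(0, 4, 8))
```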

Weight Versioning

Pipeline training can introduce a subtle issue: different microbatches may see different versions of model weights if updates happen too early.

To avoid this, pipeline training usually accumulates gradients across all microbatches and applies the optimizer step only after the full batch has completed.

This keeps the training semantics close to ordinary mini-batch training.

The rule is:

one optimizer step per full batch, not per microbatch.

In PyTorch-like pseudocode:

optimizer.zero_grad(set_to_none=True)

for micro_x, micro_y in microbatches:
    # run the microbatch forward through all pipeline stages
    loss = pipeline_forward(micro_x, micro_y)
    # scale so the accumulated gradient matches the full-batch average
    (loss / num_microbatches).backward()

optimizer.step()

Real pipeline implementations hide much of this scheduling, but the optimization rule remains the same.

Partitioning the Model

The quality of pipeline parallelism depends heavily on how the model is partitioned.

A good partition should balance:

| Concern | Goal |
| --- | --- |
| Compute | Each stage takes similar time |
| Memory | No stage exceeds device memory |
| Communication | Minimize activation transfer |
| Structure | Split at clean layer boundaries |

If one stage is much slower than the others, it becomes the bottleneck. All other stages wait for it.

Transformer models are often easier to partition than irregular networks because they contain repeated blocks. A model with 48 transformer layers can be split across 8 stages, with 6 layers per stage.

However, embeddings and output projection layers may be large. They can create imbalance if placed carelessly.
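The 48-layer, 8-stage example can be written as a contiguous index partition (layer indices here are 0-based):

```python
# Sketch: contiguous partition of 48 layers across 8 stages, 6 per stage.

num_layers, num_stages = 48, 8
per_stage = num_layers // num_stages  # 6 layers per stage

stages = [list(range(s * per_stage, (s + 1) * per_stage))
          for s in range(num_stages)]

print(stages[0])   # first stage owns layers 0..5
print(stages[-1])  # last stage owns layers 42..47
```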

Communication Between Stages

Pipeline stages communicate activations during the forward pass and activation gradients during the backward pass.

Forward communication sends:

h_k = f_k(h_{k-1})

from stage k to stage k+1.

Backward communication sends:

\nabla_{h_k} L

from stage k+1 back to stage k.

The amount of communication depends on the size of boundary activations. A bad partition may cut the model at a point where activation tensors are large, causing high communication overhead.

Good partitions reduce both compute imbalance and communication volume.
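A back-of-the-envelope estimate of the boundary traffic, assuming an illustrative [microbatch, seq_len, hidden] activation of 16-bit values (the numbers below are examples, not from the text):

```python
# Sketch: size of the boundary activation h_k sent between adjacent stages.

microbatch_size, seq_len, hidden = 4, 2048, 8192
bytes_per_value = 2  # fp16 / bf16

boundary_bytes = microbatch_size * seq_len * hidden * bytes_per_value
print(boundary_bytes // 2**20, "MiB sent per microbatch, per direction")
```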

Pipeline Parallelism in Transformers

Transformers are natural candidates for pipeline parallelism because they consist of stacked blocks:

x \rightarrow \text{Block}_1 \rightarrow \text{Block}_2 \rightarrow \cdots \rightarrow \text{Block}_L.

A simple partition assigns consecutive blocks to each stage.

For a 24-layer transformer on 4 GPUs:

| GPU | Layers |
| --- | --- |
| GPU 0 | Embedding, layers 1 to 6 |
| GPU 1 | Layers 7 to 12 |
| GPU 2 | Layers 13 to 18 |
| GPU 3 | Layers 19 to 24, output head |

This partition is easy to reason about. Each stage owns a contiguous segment of the network.

For decoder-only language models, the output head may share weights with the embedding matrix. This can complicate placement because the first and last stages may need access to the same large parameter matrix.

Interaction with Data Parallelism

Pipeline parallelism is often combined with data parallelism.

Suppose we have 16 GPUs. We may build two independent 8-GPU pipelines. Each pipeline processes different data, and corresponding stages synchronize gradients across pipelines.

This gives two axes of parallelism:

| Axis | Meaning |
| --- | --- |
| Pipeline parallelism | Splits the model across stages |
| Data parallelism | Replicates the pipeline across data shards |

For very large models, a third axis is also common:

| Axis | Meaning |
| --- | --- |
| Tensor parallelism | Splits large matrix operations inside each stage |

Large-scale transformer training often uses all three.
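The axes multiply to the total GPU count; the two-pipeline example above corresponds to a pipeline size of 8 and a data-parallel size of 2 (with no tensor parallelism):

```python
# Sketch: parallelism axes multiply to the world size.

pipeline_parallel_size = 8
data_parallel_size = 2
tensor_parallel_size = 1

world_size = pipeline_parallel_size * data_parallel_size * tensor_parallel_size
print(world_size)  # 16 GPUs total
```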

PyTorch Pipeline APIs

PyTorch has supported pipeline-style training through multiple APIs and ecosystem tools. The exact API has changed over time, but the conceptual structure is stable: define stages, split inputs into microbatches, and schedule communication.

A simplified conceptual model looks like this:

stage0 = torch.nn.Sequential(
    embedding,
    *layers[:6],
).to("cuda:0")

stage1 = torch.nn.Sequential(
    *layers[6:12],
).to("cuda:1")

stage2 = torch.nn.Sequential(
    *layers[12:18],
).to("cuda:2")

stage3 = torch.nn.Sequential(
    *layers[18:],
    output_head,
).to("cuda:3")

Then a pipeline runtime coordinates microbatch execution.

For serious use, common choices include:

| Tool | Role |
| --- | --- |
| PyTorch distributed pipeline APIs | Native pipeline abstractions |
| DeepSpeed Pipeline Parallelism | Large-scale training system |
| Megatron-LM | Transformer tensor and pipeline parallelism |
| FairScale | Earlier PyTorch scaling utilities |
| Hugging Face Accelerate | Higher-level distributed orchestration |

The manual implementation is useful for learning, but production training usually needs a runtime that handles scheduling, communication, and failure cases.

Activation Checkpointing

Pipeline parallelism often uses activation checkpointing to reduce memory.

During normal backpropagation, intermediate activations are stored so gradients can be computed later. In large models, these activations consume substantial memory.

Activation checkpointing stores only selected activations. During backward, missing activations are recomputed.

This changes the memory-compute tradeoff:

| Choice | Memory | Compute |
| --- | --- | --- |
| Store all activations | High | Lower |
| Checkpoint activations | Lower | Higher |

In PyTorch, checkpointing can be used with:

from torch.utils.checkpoint import checkpoint

def forward(self, x):
    # recompute each block during backward instead of storing its activations
    x = checkpoint(self.block1, x, use_reentrant=False)
    x = checkpoint(self.block2, x, use_reentrant=False)
    return x

In pipeline training, checkpointing is often essential because multiple microbatches may have live activations at the same time.

Throughput and Latency

Pipeline parallelism improves throughput, not latency.

A single example still has to pass through every stage in order. The latency for one example may increase because activations must move between devices.

The benefit comes from processing many microbatches at once. Once the pipeline is full, different devices work on different microbatches simultaneously.

This makes pipeline parallelism suitable for training and batch inference. It is less attractive for low-latency single-request inference unless combined with careful batching and scheduling.
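An idealized timing model makes the throughput/latency distinction concrete, assuming every stage takes a fixed time t_stage per microbatch: a single example still crosses all stages in sequence, while the full batch overlaps across stages once the pipeline is filled.

```python
# Sketch: idealized forward-only timing for K stages and M microbatches.

def single_example_latency(num_stages, t_stage):
    # one example must still traverse every stage in order
    return num_stages * t_stage

def full_batch_time(num_stages, num_microbatches, t_stage):
    # fill time plus one step per remaining microbatch
    return (num_microbatches + num_stages - 1) * t_stage

print(single_example_latency(4, 1.0))  # 4.0: latency is not improved
print(full_batch_time(4, 32, 1.0))     # 35.0, vs 4 * 32 = 128.0 unpipelined
```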

Common Failure Modes

The first failure mode is poor stage balance. If one stage takes twice as long as the others, overall throughput is limited by that slow stage.

The second failure mode is too few microbatches. This creates large pipeline bubbles and poor utilization.

The third failure mode is too many microbatches. Very small microbatches may reduce arithmetic efficiency and increase scheduling overhead.

The fourth failure mode is excessive activation communication. Large boundary tensors can make communication dominate.

The fifth failure mode is complicated debugging. Errors may appear far from the stage where they originate, especially in asynchronous distributed execution.

The sixth failure mode is memory blowup from stored activations. This is especially common with schedules that run many forward microbatches before backward begins.

When to Use Pipeline Parallelism

Use pipeline parallelism when the model has many sequential layers and cannot fit efficiently on one device.

It works well for:

| Model type | Reason |
| --- | --- |
| Large transformers | Many repeated sequential blocks |
| Deep CNNs | Sequential layer groups |
| Encoder-decoder models | Natural stage boundaries |
| Large multimodal models | Components can be staged |

It works poorly when:

| Situation | Problem |
| --- | --- |
| Model has irregular branching | Hard to schedule |
| Boundary activations are huge | Communication-heavy |
| Few layers exist | Poor partitioning |
| Low-latency inference is required | Sequential stage latency remains |

Pipeline parallelism solves the idle-device problem of naive layer-wise model parallelism by turning one batch into a stream of microbatches. Its effectiveness depends on balanced stages, enough microbatches, and efficient communication.