# Pipeline Parallelism

Pipeline parallelism splits a model into sequential stages and places each stage on a different device. It is a form of model parallelism designed to reduce the idle time that naive layer-by-layer placement leaves on most devices.

Instead of sending one full batch through stage 1, then stage 2, then stage 3, pipeline parallelism divides the batch into smaller microbatches. While one microbatch is processed by a later stage, another microbatch can be processed by an earlier stage.

The model is still sequential:

$$
f_\theta(x) = f_K(f_{K-1}(\cdots f_2(f_1(x))\cdots)).
$$

Each stage stores one part of the model:

| Stage | Device | Model block |
|---|---|---|
| Stage 1 | GPU 0 | $f_1$ |
| Stage 2 | GPU 1 | $f_2$ |
| Stage 3 | GPU 2 | $f_3$ |
| Stage 4 | GPU 3 | $f_4$ |

The main goal is to keep all devices busy.

### The Pipeline Idea

Consider a model split into four stages. A full batch is split into four microbatches:

$$
B = B_1 \cup B_2 \cup B_3 \cup B_4.
$$

The first stage begins processing $B_1$. When it finishes, it sends the activation to the second stage and starts processing $B_2$. Then the second stage processes $B_1$ while the first stage processes $B_2$.

After the pipeline fills, several stages work at the same time.

| Time | GPU 0 | GPU 1 | GPU 2 | GPU 3 |
|---|---|---|---|---|
| 1 | $B_1$ | idle | idle | idle |
| 2 | $B_2$ | $B_1$ | idle | idle |
| 3 | $B_3$ | $B_2$ | $B_1$ | idle |
| 4 | $B_4$ | $B_3$ | $B_2$ | $B_1$ |

This is the forward pipeline. During training, backward computation must also be scheduled.

### Microbatches

A microbatch is a smaller piece of a training batch. If the global batch size is $B$ and we split it into $M$ microbatches, then each microbatch has approximately

$$
B_{\text{micro}} = \frac{B}{M}
$$

examples.

Microbatches allow pipeline stages to overlap. More microbatches usually reduce idle time, but they may also increase overhead.

For example, a batch of 256 examples may be split into 8 microbatches of 32 examples:

```python
global_batch_size = 256
num_microbatches = 8
microbatch_size = global_batch_size // num_microbatches
```

The optimizer step is usually performed after gradients from all microbatches have been accumulated.

### Pipeline Bubbles

A pipeline bubble is idle time caused by filling or draining the pipeline.

At the beginning, later stages are idle because no activations have reached them yet. At the end, earlier stages may be idle while later stages finish remaining microbatches.

If there are $K$ pipeline stages and $M$ microbatches, the approximate fraction of wasted work from bubbles is

$$
\frac{K - 1}{M + K - 1}.
$$

This expression shows why more microbatches improve pipeline utilization.

For example, with 4 stages and 4 microbatches, the bubble fraction is

$$
\frac{3}{7} \approx 0.43.
$$

With 4 stages and 32 microbatches, it becomes

$$
\frac{3}{35} \approx 0.086.
$$

So increasing the number of microbatches can sharply reduce idle time.
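
These numbers are easy to check with a small helper, a minimal sketch of the formula above:

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Approximate fraction of work lost to filling and draining the pipeline."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

print(bubble_fraction(4, 4))   # ~0.43 with 4 microbatches
print(bubble_fraction(4, 32))  # ~0.086 with 32 microbatches
```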

### GPipe Scheduling

GPipe is one common pipeline schedule. It runs all forward microbatches first, then all backward microbatches.

The schedule has two phases:

1. forward pass for all microbatches
2. backward pass for all microbatches

This is simple and stable. However, it stores many activations because all forward passes finish before backward starts.

For a model with many stages and many microbatches, activation memory can become large.
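
To make the two-phase order concrete, here is a minimal single-stage sketch; `stage`, `loss_fn`, and feeding the stage output straight into the loss are simplifying assumptions, since a real pipeline would send the output to the next stage instead:

```python
def gpipe_like_step(stage, microbatches, loss_fn):
    # Phase 1: run every forward microbatch and keep its output alive.
    stored = []
    for x, y in microbatches:
        stored.append((stage(x), y))

    # Phase 2: only now run the backward passes, accumulating gradients.
    # All outputs from phase 1 are still held in memory at this point,
    # which is why activation memory grows with the microbatch count.
    for out, y in stored:
        loss = loss_fn(out, y) / len(microbatches)
        loss.backward()
```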

Activation checkpointing is often used with GPipe. Instead of storing all intermediate activations, the system stores selected tensors and recomputes missing activations during backward.

This trades extra computation for lower memory.

### 1F1B Scheduling

Another common schedule is one-forward-one-backward, abbreviated as 1F1B.

After the pipeline is filled, each stage alternates between one forward microbatch and one backward microbatch.

This reduces activation memory because backward begins before all forward microbatches have completed.

A simplified schedule looks like this:

| Phase | Behavior |
|---|---|
| Warmup | Fill the pipeline with forward microbatches |
| Steady state | Alternate forward and backward work |
| Cooldown | Finish remaining backward microbatches |

1F1B is widely used in large language model training because it improves memory usage and keeps devices active.
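
A rough per-stage sketch of that ordering (the warmup count shown is the usual one for a non-interleaved schedule; treat this as an illustration rather than a full implementation):

```python
def one_f_one_b_ops(num_stages, num_microbatches, stage_id):
    """Return the order of forward (F) and backward (B) work for one stage."""
    warmup = min(num_stages - stage_id - 1, num_microbatches)
    ops = ["F"] * warmup                       # warmup: fill the pipeline
    for _ in range(num_microbatches - warmup):
        ops += ["F", "B"]                      # steady state: alternate 1F1B
    ops += ["B"] * warmup                      # cooldown: drain remaining backwards
    return ops

# Stage 0 of a 4-stage pipeline with 8 microbatches:
# 3 warmup forwards, then alternation, then 3 cooldown backwards.
print(one_f_one_b_ops(num_stages=4, num_microbatches=8, stage_id=0))
```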

### Weight Versioning

Pipeline training can introduce a subtle issue: different microbatches may see different versions of model weights if updates happen too early.

To avoid this, pipeline training usually accumulates gradients across all microbatches and applies the optimizer step only after the full batch has completed.

This keeps the training semantics close to ordinary mini-batch training.

The rule is: one optimizer step per full batch, not per microbatch.

In PyTorch-like pseudocode:

```python
optimizer.zero_grad(set_to_none=True)

for micro_x, micro_y in microbatches:
    # Run this microbatch through all pipeline stages (hypothetical helper).
    loss = pipeline_forward(micro_x, micro_y)
    # Scale so accumulated gradients match the average over the full batch.
    (loss / num_microbatches).backward()

optimizer.step()
```

Real pipeline implementations hide much of this scheduling, but the optimization rule remains the same.

### Partitioning the Model

The quality of pipeline parallelism depends heavily on how the model is partitioned.

A good partition should balance:

| Concern | Goal |
|---|---|
| Compute | Each stage takes similar time |
| Memory | No stage exceeds device memory |
| Communication | Minimize activation transfer |
| Structure | Split at clean layer boundaries |

If one stage is much slower than the others, it becomes the bottleneck. All other stages wait for it.

Transformer models are often easier to partition than irregular networks because they contain repeated blocks. A model with 48 transformer layers can be split across 8 stages, with 6 layers per stage.
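
A minimal sketch of that kind of count-balanced split (real systems also weight layers by measured compute and memory cost):

```python
def split_into_stages(layers, num_stages):
    """Split a list of layers into contiguous, roughly equal-sized stages."""
    per_stage, remainder = divmod(len(layers), num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = per_stage + (1 if s < remainder else 0)  # spread any leftover layers
        stages.append(layers[start:start + size])
        start += size
    return stages

# 48 transformer blocks across 8 stages -> 6 blocks per stage.
assert all(len(s) == 6 for s in split_into_stages(list(range(48)), 8))
```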

However, embeddings and output projection layers may be large. They can create imbalance if placed carelessly.

### Communication Between Stages

Pipeline stages communicate activations during the forward pass and activation gradients during the backward pass.

Forward communication sends:

$$
h_k = f_k(h_{k-1})
$$

from stage $k$ to stage $k+1$.

Backward communication sends:

$$
\nabla_{h_k} L
$$

from stage $k+1$ back to stage $k$.

The amount of communication depends on the size of boundary activations. A bad partition may cut the model at a point where activation tensors are large, causing high communication overhead.

Good partitions reduce both compute imbalance and communication volume.
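
The point-to-point pattern can be sketched with `torch.distributed` send/recv. This is a forward-only illustration; process-group setup, the boundary tensor shape and dtype, and the `next_microbatch` loader are all assumptions:

```python
import torch
import torch.distributed as dist

def stage_forward(stage, rank, world_size, boundary_shape):
    # Receive the previous stage's activation, except on the first stage.
    if rank > 0:
        h_in = torch.empty(boundary_shape, device="cuda")
        dist.recv(h_in, src=rank - 1)
    else:
        h_in = next_microbatch()  # hypothetical input loader on stage 0

    h_out = stage(h_in)

    # Send the activation to the next stage, except on the last stage.
    if rank < world_size - 1:
        dist.send(h_out, dst=rank + 1)
    return h_out
```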

### Pipeline Parallelism in Transformers

Transformers are natural candidates for pipeline parallelism because they consist of stacked blocks:

$$
x
\rightarrow
\text{Block}_1
\rightarrow
\text{Block}_2
\rightarrow
\cdots
\rightarrow
\text{Block}_L.
$$

A simple partition assigns consecutive blocks to each stage.

For a 24-layer transformer on 4 GPUs:

| GPU | Layers |
|---|---|
| GPU 0 | Embedding, layers 1 to 6 |
| GPU 1 | Layers 7 to 12 |
| GPU 2 | Layers 13 to 18 |
| GPU 3 | Layers 19 to 24, output head |

This partition is easy to reason about. Each stage owns a contiguous segment of the network.

For decoder-only language models, the output head may share weights with the embedding matrix. This can complicate placement because the first and last stages may need access to the same large parameter matrix.

### Interaction with Data Parallelism

Pipeline parallelism is often combined with data parallelism.

Suppose we have 16 GPUs. We may build two independent 8-GPU pipelines. Each pipeline processes different data, and corresponding stages synchronize gradients across pipelines.

This gives two axes of parallelism:

| Axis | Meaning |
|---|---|
| Pipeline parallelism | Splits the model across stages |
| Data parallelism | Replicates the pipeline across data shards |
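
For the 16-GPU example above, a minimal sketch of one possible rank layout (the pipeline-major ordering is an assumption; real frameworks let you configure the grouping):

```python
world_size = 16
pipeline_size = 8                                 # stages per pipeline
data_parallel_size = world_size // pipeline_size  # 2 pipeline replicas

for rank in range(world_size):
    stage_id = rank % pipeline_size     # which pipeline stage this rank owns
    replica_id = rank // pipeline_size  # which data-parallel replica it belongs to
    # Gradients for a stage are all-reduced across ranks sharing the same stage_id.
    print(f"rank {rank}: stage {stage_id}, replica {replica_id}")
```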

For very large models, a third axis is also common:

| Axis | Meaning |
|---|---|
| Tensor parallelism | Splits large matrix operations inside each stage |

Large-scale transformer training often uses all three.

### PyTorch Pipeline APIs

PyTorch has supported pipeline-style training through multiple APIs and ecosystem tools. The exact API has changed over time, but the conceptual structure is stable: define stages, split inputs into microbatches, and schedule communication.

A simplified conceptual model looks like this:

```python
# Assumes `embedding`, `layers` (a list of 24 transformer blocks), and
# `output_head` are already defined; each stage owns a contiguous slice.
stage0 = torch.nn.Sequential(
    embedding,
    *layers[:6],
).to("cuda:0")

stage1 = torch.nn.Sequential(
    *layers[6:12],
).to("cuda:1")

stage2 = torch.nn.Sequential(
    *layers[12:18],
).to("cuda:2")

stage3 = torch.nn.Sequential(
    *layers[18:],
    output_head,
).to("cuda:3")
```

Then a pipeline runtime coordinates microbatch execution.

For serious use, common choices include:

| Tool | Role |
|---|---|
| PyTorch distributed pipeline APIs | Native pipeline abstractions |
| DeepSpeed Pipeline Parallelism | Large-scale training system |
| Megatron-LM | Transformer tensor and pipeline parallelism |
| FairScale | Earlier PyTorch scaling utilities |
| Hugging Face Accelerate | Higher-level distributed orchestration |

The manual implementation is useful for learning, but production training usually needs a runtime that handles scheduling, communication, and failure cases.

### Activation Checkpointing

Pipeline parallelism often uses activation checkpointing to reduce memory.

During normal backpropagation, intermediate activations are stored so gradients can be computed later. In large models, these activations consume substantial memory.

Activation checkpointing stores only selected activations. During backward, missing activations are recomputed.

This changes the memory-compute tradeoff:

| Choice | Memory | Compute |
|---|---:|---:|
| Store all activations | High | Lower |
| Checkpoint activations | Lower | Higher |

In PyTorch, checkpointing can be used with:

```python
from torch.utils.checkpoint import checkpoint

def forward(self, x):
    # Recompute each block's activations during backward instead of storing them.
    # Recent PyTorch versions recommend passing use_reentrant=False explicitly.
    x = checkpoint(self.block1, x, use_reentrant=False)
    x = checkpoint(self.block2, x, use_reentrant=False)
    return x
```

In pipeline training, checkpointing is often essential because multiple microbatches may have live activations at the same time.

### Throughput and Latency

Pipeline parallelism improves throughput, not latency.

A single example still has to pass through every stage in order. The latency for one example may increase because activations must move between devices.

The benefit comes from processing many microbatches at once. Once the pipeline is full, different devices work on different microbatches simultaneously.

This makes pipeline parallelism suitable for training and batch inference. It is less attractive for low-latency single-request inference unless combined with careful batching and scheduling.

### Common Failure Modes

Several failure modes recur in practice:

1. Poor stage balance. If one stage takes twice as long as the others, overall throughput is limited by that slow stage.
2. Too few microbatches. This creates large pipeline bubbles and poor utilization.
3. Too many microbatches. Very small microbatches may reduce arithmetic efficiency and increase scheduling overhead.
4. Excessive activation communication. Large boundary tensors can make communication dominate.
5. Complicated debugging. Errors may appear far from the stage where they originate, especially in asynchronous distributed execution.
6. Memory blowup from stored activations. This is especially common with schedules that run many forward microbatches before backward begins.

### When to Use Pipeline Parallelism

Use pipeline parallelism when the model has many sequential layers and cannot fit efficiently on one device.

It works well for:

| Model type | Reason |
|---|---|
| Large transformers | Many repeated sequential blocks |
| Deep CNNs | Sequential layer groups |
| Encoder-decoder models | Natural stage boundaries |
| Large multimodal models | Components can be staged |

It works poorly when:

| Situation | Problem |
|---|---|
| Model has irregular branching | Hard to schedule |
| Boundary activations are huge | Communication-heavy |
| Few layers exist | Poor partitioning |
| Low-latency inference is required | Sequential stage latency remains |

Pipeline parallelism solves the idle-device problem of naive layer-wise model parallelism by turning one batch into a stream of microbatches. Its effectiveness depends on balanced stages, enough microbatches, and efficient communication.

