Model parallelism splits a model across multiple devices. Instead of copying the whole model onto every GPU, different parts of the model live on different GPUs.
This is useful when the model is too large to fit on one device. Data parallelism replicates the model, so each device must hold a full copy. Model parallelism removes this requirement by partitioning the model itself.
A simple example is a network with two large blocks, $y = \text{Block2}(\text{Block1}(x))$.
We can place `Block1` on GPU 0 and `Block2` on GPU 1:
```python
import torch

class TwoPartModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Block1 and Block2 stand for any two large submodules.
        self.part1 = Block1().to("cuda:0")
        self.part2 = Block2().to("cuda:1")

    def forward(self, x):
        x = x.to("cuda:0")
        h = self.part1(x)
        h = h.to("cuda:1")   # move the activation across devices
        y = self.part2(h)
        return y
```

The activation tensor `h` must move from GPU 0 to GPU 1. This transfer is the main cost of simple model parallelism.
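A minimal usage sketch (assuming two CUDA devices, and that `Block1` accepts the input width chosen here):

```python
model = TwoPartModel()
x = torch.randn(8, 1024)      # illustrative batch; the shape depends on Block1
y = model(x)                  # y lives on cuda:1
loss = y.pow(2).mean()        # placeholder loss
loss.backward()               # gradients cross the cuda:1 -> cuda:0 boundary
```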
## Why Model Parallelism Is Needed
Large models require memory for several objects:
| Object | Description |
|---|---|
| Parameters | Trainable weights |
| Gradients | Derivatives of the loss with respect to parameters |
| Optimizer state | Momentum, variance, and other optimizer tensors |
| Activations | Intermediate tensors saved for backpropagation |
| Temporary buffers | Workspace used by kernels and communication libraries |
For AdamW training in float32, each parameter may require storage for the parameter, gradient, first moment, and second moment, i.e. four 4-byte values, or 16 bytes per parameter. Mixed precision can add master weights and casting buffers.
As models grow, a single GPU may run out of memory even for batch size 1. Model parallelism allows the model state and activations to be spread across devices.
## Layer-Wise Model Parallelism
The simplest form of model parallelism places different layers on different devices.
For example:
```python
self.layers = torch.nn.ModuleList([
    Layer0().to("cuda:0"),
    Layer1().to("cuda:0"),
    Layer2().to("cuda:1"),
    Layer3().to("cuda:1"),
])
```

The forward pass moves activations between devices when needed:
```python
def forward(self, x):
    x = x.to("cuda:0")
    x = self.layers[0](x)
    x = self.layers[1](x)
    x = x.to("cuda:1")   # the only transfer between the two devices
    x = self.layers[2](x)
    x = self.layers[3](x)
    return x
```

This approach is easy to understand but often inefficient. While GPU 0 computes the early layers, GPU 1 waits. After the activation is transferred, GPU 1 computes the later layers while GPU 0 waits.
The result is poor device utilization.
## The Idle Device Problem
Consider a two-stage model $y = f_2(f_1(x))$.
If $f_1$ is on GPU 0 and $f_2$ is on GPU 1, only one GPU may be active at a time for a single batch.
Timeline:
| Time | GPU 0 | GPU 1 |
|---|---|---|
| Step 1 | Compute $f_1(x)$ | Idle |
| Step 2 | Transfer activation | Idle or receive |
| Step 3 | Idle | Compute $f_2$ |
This wastes compute. The model fits in memory, but throughput may be worse than single-GPU training if communication and idle time dominate.
Pipeline parallelism, covered next, addresses this by splitting a batch into microbatches and keeping multiple devices busy at the same time.
## Tensor Parallelism
Tensor parallelism splits individual tensor operations across devices. Instead of placing whole layers on different GPUs, it partitions the matrices inside a layer.
A linear layer computes

$$y = xW + b$$

If $W \in \mathbb{R}^{d_\text{in} \times d_\text{out}}$, we can split $W$ by columns:

$$W = \begin{bmatrix} W_1 & W_2 \end{bmatrix}$$

Each GPU computes part of the output:

$$y_1 = xW_1, \qquad y_2 = xW_2$$

The partial outputs are concatenated:

$$y = \begin{bmatrix} y_1 & y_2 \end{bmatrix}$$

This is column parallelism.
We can also split $W$ by rows, splitting the input to match:

$$W = \begin{bmatrix} W_1 \\ W_2 \end{bmatrix}, \qquad x = \begin{bmatrix} x_1 & x_2 \end{bmatrix}, \qquad y = x_1 W_1 + x_2 W_2$$

Each device computes a partial contribution $x_i W_i$, and the results are summed with an all-reduce.
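A single-device sanity check of both identities, simulating the shards with `torch.chunk` (toy shapes chosen for illustration):

```python
import torch

x = torch.randn(4, 8)                        # batch of 4, input dim 8
W = torch.randn(8, 6)                        # full weight

# Column parallelism: split W along the output dimension.
W1, W2 = W.chunk(2, dim=1)                   # each shard is 8 x 3
y_col = torch.cat([x @ W1, x @ W2], dim=1)   # concatenate partial outputs

# Row parallelism: split W along the input dimension, and x to match.
Wa, Wb = W.chunk(2, dim=0)                   # each shard is 4 x 6
xa, xb = x.chunk(2, dim=1)                   # each slice is 4 x 4
y_row = xa @ Wa + xb @ Wb                    # the sum stands in for the all-reduce

y_full = x @ W
assert torch.allclose(y_col, y_full, atol=1e-5)
assert torch.allclose(y_row, y_full, atol=1e-5)
```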
Tensor parallelism is common in large transformer training because transformer blocks contain huge matrix multiplications in attention and feedforward layers.
## Column-Parallel Linear Layers
A column-parallel linear layer splits the output dimension across devices.
Suppose

$$x \in \mathbb{R}^{B \times d_\text{in}}, \qquad W \in \mathbb{R}^{d_\text{in} \times d_\text{out}}$$

With $N$ devices, each device stores

$$W_i \in \mathbb{R}^{d_\text{in} \times d_\text{out}/N}$$

Each device computes:

$$y_i = x W_i$$

The local output has shape:

$$(B, \; d_\text{out}/N)$$

If the next operation can consume partitioned outputs, no immediate gather is needed. Otherwise, the outputs are gathered to form:

$$y \in \mathbb{R}^{B \times d_\text{out}}$$
Column parallelism reduces parameter memory per device and distributes matrix multiplication.
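A minimal sketch of what a single device might hold, with `world_size` as an illustrative parameter and the cross-device gather omitted:

```python
import torch

class ColumnParallelLinear(torch.nn.Module):
    """One device's shard of a column-parallel linear layer (sketch only)."""

    def __init__(self, d_in, d_out, world_size):
        super().__init__()
        assert d_out % world_size == 0
        # This device stores only d_out / world_size output columns.
        self.weight = torch.nn.Parameter(torch.randn(d_in, d_out // world_size) * d_in ** -0.5)
        self.bias = torch.nn.Parameter(torch.zeros(d_out // world_size))

    def forward(self, x):
        # Local output: shape (batch, d_out / world_size).
        return x @ self.weight + self.bias
```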
## Row-Parallel Linear Layers
A row-parallel linear layer splits the input dimension.
Suppose each device receives part of the input:

$$x_i \in \mathbb{R}^{B \times d_\text{in}/N}$$

and stores a matching partition:

$$W_i \in \mathbb{R}^{d_\text{in}/N \times d_\text{out}}$$

Each device computes:

$$z_i = x_i W_i$$

The full output is the sum:

$$y = \sum_{i=1}^{N} z_i$$
This requires an all-reduce across devices.
Column-parallel and row-parallel layers are often paired. A transformer feedforward block may use column parallelism in the first projection and row parallelism in the second projection, reducing unnecessary communication between them.
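A single-process sketch of this pairing, simulating each device's shards with `torch.chunk`; the final sum plays the role of the all-reduce, and no gather is needed between the two projections:

```python
import torch

def sharded_ffn(x, W1, W2, world_size=2):
    W1_shards = W1.chunk(world_size, dim=1)   # first weight split by columns
    W2_shards = W2.chunk(world_size, dim=0)   # second weight split by rows
    # Each "device" applies its column shard, the nonlinearity, and its
    # row shard without communicating, then contributes a partial sum.
    partials = [torch.relu(x @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
    return sum(partials)                      # all-reduce equivalent

x = torch.randn(4, 16)
W1 = torch.randn(16, 64)   # up projection
W2 = torch.randn(64, 16)   # down projection
assert torch.allclose(sharded_ffn(x, W1, W2), torch.relu(x @ W1) @ W2, atol=1e-5)
```

The identity holds because the elementwise nonlinearity acts independently on each column shard, so the column-parallel output is already the row-partitioned input the second projection expects.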
## Tensor Parallelism in Transformers
A transformer block contains several large linear operations:
| Component | Typical tensor operation |
|---|---|
| Query projection | $Q = XW_Q$ |
| Key projection | $K = XW_K$ |
| Value projection | $V = XW_V$ |
| Attention output projection | $O = AW_O$ |
| Feedforward up projection | $H = XW_1$ |
| Feedforward down projection | $Y = HW_2$ |
These matrices are natural targets for tensor parallelism.
Attention heads can also be partitioned. If a model has 32 attention heads and 4 GPUs, each GPU can process 8 heads. This reduces memory and computation per device while keeping the mathematical structure intact.
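A shape-level sketch of the 32-head, 4-GPU example; the local projection below stands in for one rank's shard:

```python
import torch

d_model, n_heads, tp_size = 4096, 32, 4
heads_per_rank = n_heads // tp_size            # 8 heads on each device
d_head = d_model // n_heads                    # 128

# Each rank's query projection produces only its own heads' channels.
q_proj_local = torch.nn.Linear(d_model, heads_per_rank * d_head)

x = torch.randn(2, 10, d_model)                # (batch, seq, d_model)
q_local = q_proj_local(x)                      # (2, 10, 1024)
q_local = q_local.view(2, 10, heads_per_rank, d_head)  # this rank's 8 heads
```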
## Communication Costs
Model parallelism trades memory savings for communication.
Communication occurs when:
- activations move between layers on different devices
- partial outputs are gathered
- partial sums are all-reduced
- gradients flow backward across partitions
The performance of model parallelism depends on the ratio between computation and communication. Large matrix multiplications are favorable because they perform many arithmetic operations per byte communicated.
Small layers or frequent device transfers are unfavorable.
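A rough illustration of the ratio for a single fp16 matmul whose output is then communicated once (the exact all-reduce volume depends on the algorithm):

```python
# (B x d) @ (d x d) matmul, then move the (B x d) output over the interconnect.
B, d, bytes_per_elem = 8192, 8192, 2

flops = 2 * B * d * d                 # multiply-accumulate count
comm_bytes = B * d * bytes_per_elem   # one pass over the output tensor

print(flops / comm_bytes)             # = 2 * d / bytes_per_elem: wider layers win
```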
This is why model parallelism works best with large dense models and fast interconnects such as NVLink, NVSwitch, or high-bandwidth cluster networks.
## Autograd Across Devices
PyTorch autograd can track operations across devices. If an activation is moved from one GPU to another, autograd records the transfer as part of the computation graph.
Example:
```python
x = x.to("cuda:0")
h = self.part1(x)
h = h.to("cuda:1")   # autograd records this transfer
y = self.part2(h)
loss = loss_fn(y, target.to("cuda:1"))
loss.backward()
```

During the backward pass, gradients flow from GPU 1 back through the transfer operation to GPU 0.
This makes simple model parallelism easy to implement. However, ease of implementation does not guarantee high throughput: communication and scheduling can still dominate performance.
## Model Parallelism Versus Data Parallelism
Data parallelism and model parallelism split different axes of the training problem.
| Method | What is split | What each device stores | Best when |
|---|---|---|---|
| Data parallelism | Batch | Full model replica | Model fits on one device |
| Model parallelism | Model | Part of model | Model does not fit on one device |
| Tensor parallelism | Layer tensors | Slices of parameters | Individual layers are too large |
| Pipeline parallelism | Layer groups | Sequential model stages | Many layers can be staged |
These methods are often combined. Large language model training commonly uses data parallelism across groups of replicas, tensor parallelism inside each transformer block, and pipeline parallelism across blocks.
## Memory Accounting
To decide whether model parallelism is needed, estimate memory per parameter.
For mixed-precision AdamW training, the approximate cost per parameter is:
| Item | Bytes per parameter |
|---|---|
| Parameter in fp16 or bf16 | 2 |
| Gradient in fp16 or bf16 | 2 |
| Master parameter in fp32 | 4 |
| Adam first moment | 4 |
| Adam second moment | 4 |
| Total | 16 |
A 7-billion-parameter model may require about

$$7 \times 10^9 \times 16 = 1.12 \times 10^{11}$$

bytes, or roughly 112 GB, before accounting for activations, temporary buffers, and fragmentation.
This exceeds the memory of many single GPUs. Model parallelism or sharded training becomes necessary.
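The same estimate as a quick helper, using the per-parameter byte counts from the table above (decimal gigabytes, activations excluded):

```python
# param (fp16/bf16) + grad + fp32 master + Adam first and second moments
bytes_per_param = 2 + 2 + 4 + 4 + 4
n_params = 7e9

print(f"{n_params * bytes_per_param / 1e9:.0f} GB")   # ~112 GB of model state
```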
## Practical PyTorch Patterns
For small model-parallel experiments, explicit device placement is enough:
```python
import torch

class SplitMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(4096, 4096).to("cuda:0")
        self.l2 = torch.nn.Linear(4096, 4096).to("cuda:1")

    def forward(self, x):
        x = x.to("cuda:0")
        x = torch.relu(self.l1(x))
        x = x.to("cuda:1")
        x = self.l2(x)
        return x
```

For serious large-model training, manual placement becomes hard to maintain. Common tools include:
- PyTorch FSDP
- PyTorch tensor parallel APIs
- DeepSpeed
- Megatron-LM
- Hugging Face Accelerate
- FairScale
These systems automate partitioning, communication, and memory management.
## Common Failure Modes
The first common failure is excessive activation transfer. If tensors move across GPUs too often, communication dominates runtime.
The second failure is device mismatch. Inputs, targets, and model parts must be on compatible devices.
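For instance, with the two-part model from earlier (an illustrative snippet; `loss_fn` and `target` are placeholders):

```python
y = model(x)                             # output lives on cuda:1
# loss_fn(y, target) raises if target is still on the CPU or on cuda:0:
loss = loss_fn(y, target.to("cuda:1"))   # move the target to the output's device
```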
Example error:

```
Expected all tensors to be on the same device
```

The third failure is poor load balance. If one GPU owns much more computation than another, faster devices wait for the slowest stage.
The fourth failure is hidden memory imbalance. One GPU may hold embeddings, output heads, or normalization layers that make it run out of memory first.
The fifth failure is optimizer state placement. Parameters placed on multiple devices require optimizer state on those devices. A careless optimizer setup may create state in unexpected locations.
## When to Use Model Parallelism
Use model parallelism when the model cannot fit on one GPU, or when a single layer is too large for one device.
Avoid simple layer-wise model parallelism when the model already fits on one GPU and throughput is the main goal. Data parallelism is usually faster and simpler.
Use tensor parallelism when large matrix multiplications dominate and the interconnect is fast.
Use pipeline parallelism when the model has many sequential layers that can be divided into balanced stages.
Use sharded data parallelism when each device could compute the full model, but full replication of parameters, gradients, and optimizer state wastes too much memory.
Model parallelism solves a memory problem first. Efficient model parallelism also solves a scheduling problem. The challenge is to partition the model so that each device stores less, computes enough, and communicates as little as possible.