Multi-node training uses more than one machine for a single training job. Each machine contributes one or more accelerators, and all machines cooperate to train the same model.
A node is one physical or virtual machine. A typical node contains 4 or 8 GPUs. The total GPU count is the number of nodes times the GPUs per node. For 4 nodes and 8 GPUs per node:

$$4 \text{ nodes} \times 8 \text{ GPUs per node} = 32 \text{ GPUs}$$
Multi-node training extends the ideas from single-node distributed training. We still need ranks, process groups, distributed samplers, gradient synchronization, checkpointing, and failure handling. The difference is that communication now crosses machine boundaries, so networking becomes a first-class part of the training system.
Why Multi-Node Training Is Used
A single machine can provide only limited compute and memory. Multi-node training is used when a job needs:
| Need | Explanation |
|---|---|
| More throughput | More GPUs process more examples per second |
| Larger global batch size | Many workers contribute to one optimizer step |
| Larger model capacity | Model, optimizer state, or activations may be partitioned |
| Faster experimentation | Shorter wall-clock training time |
| Large-scale pretraining | Foundation models require many accelerator-hours |
For small models, multi-node training may add unnecessary complexity. For large models, it becomes unavoidable.
Process Layout
The common layout is one process per GPU.
If each node has 8 GPUs and we use 4 nodes, we launch 32 processes.
Each process receives:
| Identifier | Meaning |
|---|---|
| RANK | Global process ID |
| LOCAL_RANK | GPU index within the current node |
| WORLD_SIZE | Total process count |
| LOCAL_WORLD_SIZE | Process count within the current node |
Example:
| Node | Local GPU | Global rank | Local rank |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 0 | 7 | 7 | 7 |
| 1 | 0 | 8 | 0 |
| 1 | 7 | 15 | 7 |
| 3 | 7 | 31 | 7 |
The global rank identifies the process across the whole job. The local rank selects the GPU on the current node.
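As a rough sketch (assuming one process per GPU and the same number of processes on every node), the global rank can be derived from the node rank and the local rank; the function name here is illustrative:

```python
# Sketch: how a global rank is typically derived from node rank and local rank.
# Assumes every node runs the same number of processes (LOCAL_WORLD_SIZE).
def global_rank(node_rank: int, local_rank: int, local_world_size: int) -> int:
    return node_rank * local_world_size + local_rank

# Node 1, GPU 7, with 8 processes per node -> global rank 15 (matches the table).
assert global_rank(node_rank=1, local_rank=7, local_world_size=8) == 15
```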
Launching with torchrun
PyTorch commonly launches multi-node jobs with torchrun.
For a 4-node job with 8 GPUs per node:
```bash
torchrun \
  --nnodes=4 \
  --nproc_per_node=8 \
  --node_rank=0 \
  --master_addr=10.0.0.5 \
  --master_port=29500 \
  train.py
```

On node 1, change only --node_rank:
```bash
torchrun \
  --nnodes=4 \
  --nproc_per_node=8 \
  --node_rank=1 \
  --master_addr=10.0.0.5 \
  --master_port=29500 \
  train.py
```

The same applies to nodes 2 and 3.
The master address points to the node that coordinates rendezvous. It does not mean that rank 0 performs all training. After initialization, all ranks participate in computation.
Initialization Code
Inside the training script, initialization is almost the same as single-node DDP.
```python
import os

import torch
import torch.distributed as dist


def init_distributed():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    return rank, local_rank, world_size
```

Then wrap the model:
```python
from torch.nn.parallel import DistributedDataParallel as DDP

rank, local_rank, world_size = init_distributed()

model = MyModel().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
```

The same code can often run on one node or many nodes, provided it is launched correctly.
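To make that concrete, here is a minimal sketch of a training step, assuming `loader`, `optimizer`, and `loss_fn` are defined elsewhere (the names are illustrative). Nothing in the loop refers to nodes or the network; DDP synchronizes gradients during the backward pass:

```python
# Minimal DDP training step (sketch). Assumes `loader`, `optimizer`,
# and `loss_fn` exist; names are illustrative.
for inputs, targets in loader:
    inputs = inputs.cuda(local_rank, non_blocking=True)
    targets = targets.cuda(local_rank, non_blocking=True)

    optimizer.zero_grad(set_to_none=True)
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()   # DDP all-reduces gradients across all ranks here
    optimizer.step()
```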
Network Communication
Multi-node training depends heavily on network performance.
Within one node, GPUs may communicate through PCIe, NVLink, or NVSwitch. Across nodes, communication uses the network interface.
Common interconnects include:
| Interconnect | Typical use |
|---|---|
| Ethernet | General clusters, lower cost |
| RoCE | RDMA over converged Ethernet |
| InfiniBand | High-performance GPU clusters |
| Cloud provider fabric | Managed accelerator clusters |
Gradient synchronization can require transferring large tensors every training step. If the network is slow, GPUs spend time waiting for communication instead of computing.
The most important network properties are:
| Property | Meaning |
|---|---|
| Bandwidth | How many bytes can move per second |
| Latency | How long a message takes to start |
| Topology | Which nodes communicate efficiently |
| Congestion | Whether other jobs share the fabric |
| Reliability | Whether long jobs survive without resets |
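As a back-of-the-envelope illustration (the numbers below are assumptions, not measurements), per-step gradient traffic for plain data parallelism scales with model size; a ring all-reduce moves roughly 2 × (N − 1)/N of the gradient bytes per rank:

```python
# Rough estimate of per-rank all-reduce traffic per optimizer step.
# All constants are illustrative assumptions.
params = 1_000_000_000          # 1B parameters
bytes_per_grad = 2              # fp16/bf16 gradients
world_size = 32

grad_bytes = params * bytes_per_grad
# A ring all-reduce sends and receives about 2 * (N - 1) / N of the buffer per rank.
traffic_per_rank = 2 * (world_size - 1) / world_size * grad_bytes

print(f"~{traffic_per_rank / 1e9:.1f} GB moved per rank per step")
```

If that volume cannot move across the network within the time of the backward pass, communication becomes the bottleneck.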
NCCL and GPU Communication
For NVIDIA GPU training, PyTorch usually uses NCCL. NCCL provides optimized collective communication primitives such as all-reduce, broadcast, reduce-scatter, and all-gather.
A standard setup uses:
```python
dist.init_process_group(backend="nccl")
```

Useful NCCL debugging variables include:

```bash
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
```

On multi-node systems, NCCL must choose the correct network interface. Sometimes this must be set explicitly:

```bash
export NCCL_SOCKET_IFNAME=eth0
```

or for another interface:

```bash
export NCCL_SOCKET_IFNAME=ib0
```

Wrong interface selection is a common source of slow training or initialization failure.
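A quick way to confirm that NCCL can actually communicate across nodes is a tiny all-reduce smoke test, launched with the same torchrun arguments as the real job. This is a minimal sketch; the filename and tensor size are arbitrary:

```python
# nccl_check.py (hypothetical filename): verifies cross-node NCCL collectives.
# Launch with the same torchrun command as the training job.
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(1024, device=f"cuda:{local_rank}")
dist.all_reduce(x)

# After summing ones from every rank, each element equals the world size.
assert int(x[0].item()) == dist.get_world_size()
if dist.get_rank() == 0:
    print("NCCL all-reduce OK across", dist.get_world_size(), "ranks")
dist.destroy_process_group()
```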
Distributed Sampling Across Nodes
The dataset must be partitioned across all ranks, not just across GPUs within one node.
Use DistributedSampler with the global world size:
```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(
    dataset,
    num_replicas=world_size,
    rank=rank,
    shuffle=True,
)

loader = DataLoader(
    dataset,
    batch_size=local_batch_size,
    sampler=sampler,
    num_workers=8,
    pin_memory=True,
)
```

At each epoch:

```python
sampler.set_epoch(epoch)
```

This ensures each rank receives a unique shard and that shuffling changes between epochs.
Global Batch Size
In multi-node training, the global batch size is:

$$B_{\text{global}} = b \times g \times n \times a$$

where:

| Symbol | Meaning |
|---|---|
| $b$ | Batch size per GPU |
| $g$ | GPUs per node |
| $n$ | Number of nodes |
| $a$ | Gradient accumulation steps |

For example, with local batch size 8, 8 GPUs per node, 4 nodes, and 2 accumulation steps:

$$8 \times 8 \times 4 \times 2 = 512$$
Changing the number of nodes changes the global batch unless local batch or accumulation is adjusted. This may require learning rate retuning.
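A small helper makes that dependency explicit; the function name is illustrative:

```python
# Illustrative helper: compute the global batch size from the run configuration.
def global_batch_size(per_gpu_batch: int, gpus_per_node: int,
                      num_nodes: int, grad_accum_steps: int = 1) -> int:
    return per_gpu_batch * gpus_per_node * num_nodes * grad_accum_steps

# The 4-node example above.
assert global_batch_size(8, 8, 4, 2) == 512

# Doubling the node count doubles the global batch unless something else changes.
assert global_batch_size(8, 8, 8, 2) == 1024
# Halving gradient accumulation restores the original global batch.
assert global_batch_size(8, 8, 8, 1) == 512
```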
Checkpointing in Multi-Node Jobs
Checkpointing becomes more expensive at multi-node scale.
For ordinary DDP, every rank has a full model replica. In that case, rank 0 can save the model:
```python
if rank == 0:
    torch.save(
        {
            "model": model.module.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": step,
        },
        "checkpoint.pt",
    )
```

For sharded training, every rank may hold only part of the state. Then distributed checkpointing is required. Each rank writes its shard, and a manifest records how shards fit together.
A robust checkpoint directory may look like:
```
checkpoint_0001000/
    manifest.json
    rank_00000.pt
    rank_00001.pt
    rank_00002.pt
    ...
    rank_00031.pt
    COMPLETE
```

The COMPLETE marker indicates that all shards were written successfully.
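One way to produce that marker safely is to let every rank write its own shard, wait at a barrier, and only then have rank 0 create the marker. This is a sketch; the directory layout and helper name are illustrative, and a real writer would also emit the manifest:

```python
import os

import torch
import torch.distributed as dist

def save_sharded_checkpoint(ckpt_dir: str, rank: int, shard_state: dict) -> None:
    """Illustrative sharded checkpoint writer with a COMPLETE marker."""
    os.makedirs(ckpt_dir, exist_ok=True)

    # Every rank writes only its own shard.
    torch.save(shard_state, os.path.join(ckpt_dir, f"rank_{rank:05d}.pt"))
    # A real writer would also emit manifest.json describing the shard layout.

    # Wait until all shards are on disk before declaring success.
    dist.barrier()

    if rank == 0:
        with open(os.path.join(ckpt_dir, "COMPLETE"), "w") as f:
            f.write("ok\n")
```

A loader can then refuse to resume from any checkpoint directory that lacks the COMPLETE marker.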
Validation and Metrics
Metrics must be aggregated across all ranks.
For classification accuracy:
```python
correct = torch.tensor(local_correct, device=f"cuda:{local_rank}")
total = torch.tensor(local_total, device=f"cuda:{local_rank}")

dist.all_reduce(correct, op=dist.ReduceOp.SUM)
dist.all_reduce(total, op=dist.ReduceOp.SUM)

accuracy = correct / total
```

Only rank 0 should print the result:

```python
if rank == 0:
    print(f"accuracy: {accuracy.item():.4f}")
```

Loss values can also be averaged across ranks. For exact weighting, reduce both the total loss and the number of examples rather than averaging per-rank averages.
Avoiding Stragglers
A straggler is a worker that runs slower than the others. In synchronous training, all workers must wait for the slowest one.
Stragglers can come from:
| Cause | Example |
|---|---|
| Hardware variation | One GPU throttles |
| Data imbalance | One rank processes longer samples |
| Slow storage | One node reads data slowly |
| Network congestion | One link has lower bandwidth |
| Background processes | CPU contention on one node |
For language model training, sequence length variation can cause stragglers. If one rank receives unusually long sequences, its forward and backward pass may take longer.
Mitigation strategies include:
| Strategy | Effect |
|---|---|
| Length-based batching | Reduces per-batch variation |
| Data prefetching | Hides input latency |
| Balanced sharding | Avoids uneven datasets |
| Health monitoring | Detects slow nodes |
| Dedicated interconnect | Reduces congestion |
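One lightweight form of health monitoring is to time each rank's step and gather the timings on rank 0; a persistent outlier points at a straggler. This is a sketch rather than a production monitor, and the threshold is an arbitrary assumption:

```python
import torch
import torch.distributed as dist

def report_step_times(step_seconds: float, rank: int, world_size: int,
                      local_rank: int, slack: float = 1.5) -> None:
    """Gather per-rank step times and flag outliers on rank 0 (illustrative)."""
    t = torch.tensor([step_seconds], device=f"cuda:{local_rank}")
    times = [torch.zeros_like(t) for _ in range(world_size)]
    dist.all_gather(times, t)

    if rank == 0:
        vals = [x.item() for x in times]
        median = sorted(vals)[len(vals) // 2]
        for r, v in enumerate(vals):
            if v > slack * median:   # arbitrary threshold: 1.5x the median
                print(f"rank {r} is slow: {v:.3f}s vs median {median:.3f}s")
```

Calling this every few hundred steps adds one small all-gather but makes slow nodes visible long before they dominate the run.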
Multi-Node Failure Modes
Multi-node training introduces failure modes that rarely appear on one machine.
Common failures include:
| Failure | Symptom |
|---|---|
| Wrong master address | Processes cannot rendezvous |
| Port blocked | Initialization hangs |
| Wrong network interface | NCCL timeout or very slow training |
| Rank mismatch | Some workers wait forever |
| Different code versions | Silent divergence or crashes |
| Filesystem inconsistency | Checkpoint load fails |
| Clock or timeout issues | Spurious process failure |
A common debugging sequence is:
```bash
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
```

Then verify:

- all nodes can reach MASTER_ADDR:MASTER_PORT
- each node sees the expected number of GPUs
- all nodes run the same code and dependency versions
- the dataset paths are valid on every node
- the network interface is correct
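Some of those checks can be automated in a small preflight script that every node runs before the job starts. This is a sketch using only the standard library and torch; the default GPU count is an assumption about the node type:

```python
# Illustrative preflight checks to run on every node before launching training.
import os
import socket

import torch

def preflight(expected_gpus: int = 8) -> None:
    """Basic sanity checks; expected_gpus is an assumption about the node type."""
    master_addr = os.environ["MASTER_ADDR"]
    master_port = int(os.environ["MASTER_PORT"])

    # 1. The master address must at least resolve from this node.
    socket.getaddrinfo(master_addr, master_port)

    # 2. The node must see the expected number of GPUs.
    found = torch.cuda.device_count()
    assert found >= expected_gpus, f"saw {found} GPUs, expected {expected_gpus}"

    print(f"{socket.gethostname()}: preflight OK")
```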
Reproducibility Across Nodes
Reproducibility is harder across nodes than on a single GPU.
Sources of variation include:
- different communication ordering
- non-deterministic kernels
- different worker restart timing
- filesystem ordering
- random data augmentation
- floating-point reduction order
Set seeds per rank carefully:
```python
base_seed = 1234
seed = base_seed + rank

torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
```

This gives each rank a distinct but reproducible random stream.
For exact reproducibility, save RNG states in checkpoints. For most large training runs, statistical reproducibility is more practical than bitwise identity.
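A sketch of saving and restoring RNG state, extending the rank-0 checkpoint pattern from above; for exact per-rank reproducibility, each rank would need to store and restore its own states rather than sharing rank 0's:

```python
# Save RNG states alongside the model (illustrative; extends the rank-0 pattern).
checkpoint = {
    "model": model.module.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": step,
    "cpu_rng_state": torch.get_rng_state(),
    "cuda_rng_state": torch.cuda.get_rng_state_all(),
}
if rank == 0:
    torch.save(checkpoint, "checkpoint.pt")

# On resume, restore the states after loading the model and optimizer.
state = torch.load("checkpoint.pt", map_location="cpu")
torch.set_rng_state(state["cpu_rng_state"])
torch.cuda.set_rng_state_all(state["cuda_rng_state"])
```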
Combining Multi-Node with Other Parallelisms
Multi-node training is an execution environment, not a single parallelism strategy.
A large run may combine:
| Parallelism | Role |
|---|---|
| Data parallelism | Replicate training across data shards |
| Tensor parallelism | Split large matrix operations |
| Pipeline parallelism | Split model layers into stages |
| Sharded data parallelism | Partition parameters and optimizer states |
For example, a 64-GPU job may use:
| Dimension | Value |
|---|---|
| Data parallel groups | 8 |
| Tensor parallel size | 4 |
| Pipeline parallel size | 2 |
| Total GPUs | 8 × 4 × 2 = 64 |
Each rank belongs to several communication groups. One group handles data parallel gradient synchronization. Another handles tensor-parallel collectives. Another handles pipeline transfers.
Correct group construction is one of the main complexities in large-scale training systems.
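A minimal sketch of building such groups with torch.distributed.new_group, for a small case of 8 ranks split into tensor-parallel groups of size 4 (the sizes and variable names are illustrative, and real frameworks wrap this logic):

```python
import torch.distributed as dist

# Illustrative group construction: tensor-parallel size 4, data-parallel size 2.
# Every rank must call new_group for every group, even ones it does not join.
world_size = dist.get_world_size()
rank = dist.get_rank()
tp_size = 4
dp_size = world_size // tp_size

tp_group = None
dp_group = None

# Tensor-parallel groups: consecutive ranks, e.g. [0,1,2,3] and [4,5,6,7].
for i in range(dp_size):
    ranks = list(range(i * tp_size, (i + 1) * tp_size))
    group = dist.new_group(ranks)
    if rank in ranks:
        tp_group = group

# Data-parallel groups: same position across TP groups, e.g. [0,4], [1,5], ...
for i in range(tp_size):
    ranks = list(range(i, world_size, tp_size))
    group = dist.new_group(ranks)
    if rank in ranks:
        dp_group = group

# Gradient all-reduce then uses dp_group; tensor-parallel collectives use tp_group.
```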
When Multi-Node Training Is Worth It
Use multi-node training when a single node cannot provide enough compute, memory, or throughput.
Avoid it when:
| Situation | Reason |
|---|---|
| Model trains quickly on one node | Added complexity gives little benefit |
| Dataset pipeline is slow | More GPUs will wait for data |
| Network is weak | Communication dominates |
| Code is still unstable | Debugging becomes harder |
| Batch-size scaling is poor | More workers may harm optimization |
A practical progression is:
- make the model correct on one GPU
- scale to all GPUs on one node
- verify DDP correctness and throughput
- scale to multiple nodes
- add sharding, tensor parallelism, or pipeline parallelism only when needed
Multi-node training is mainly an engineering multiplier. It turns more hardware into faster training only when the software, data pipeline, network, and optimization setup scale together.