Data Parallelism

Data parallelism is the simplest and most widely used form of distributed deep learning.
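To make the mechanism concrete, here is a minimal sketch of one data-parallel training step written by hand with torch.distributed: every replica computes gradients on its own shard of the batch, then the gradients are averaged with an all-reduce so all replicas apply the same update. The function name and its arguments are illustrative, and process-group setup is assumed to have happened elsewhere.

```python
import torch
import torch.distributed as dist

def data_parallel_step(model, loss_fn, inputs, targets, optimizer):
    # Each process holds a full replica of the model and sees a
    # different shard of the global batch (inputs/targets are local).
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Average gradients across replicas so every process applies the
    # same update and the model copies stay in sync.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()
    return loss.item()
```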
Distributed Data Parallel

Distributed Data Parallel, usually abbreviated as DDP, is PyTorch’s primary system for synchronous multi-GPU training.
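A minimal DDP training step looks like the sketch below, assuming the script is launched with torchrun so that RANK, LOCAL_RANK, and WORLD_SIZE are set in the environment; the model, batch shapes, and hyperparameters are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 10).cuda(local_rank)
    # DDP wraps the local replica and overlaps gradient all-reduce
    # with the backward pass, bucket by bucket.
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    inputs = torch.randn(32, 512).cuda(local_rank)
    targets = torch.randint(0, 10, (32,)).cuda(local_rank)

    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(ddp_model(inputs), targets)
    loss.backward()   # gradients are synchronized during backward
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=NUM_GPUS this_script.py
```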
Model Parallelism

Model parallelism splits a model across multiple devices. Instead of copying the whole model onto every GPU, different parts of the model live on different GPUs.
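The sketch below splits a toy two-part model across cuda:0 and cuda:1; the class name and layer sizes are illustrative. The important detail is that activations must be copied between devices at the split point.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    # A toy model split across two devices: the first half of the
    # layers lives on cuda:0, the second half on cuda:1.
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Activations cross devices here; this transfer is the main
        # communication cost of a naive model-parallel split.
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(8, 1024))   # the output tensor lives on cuda:1
```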
Pipeline Parallelism

Pipeline parallelism splits a model into sequential stages and places each stage on a different device. It is a form of model parallelism designed to reduce idle time.
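A forward-only sketch of the idea, using the same two-GPU split as above: the batch is divided into micro-batches so the two stages can work on different micro-batches at once (CUDA kernel launches are asynchronous, so stage 1 can begin micro-batch i+1 while stage 2 is still processing micro-batch i). Real pipeline schedules such as GPipe also interleave backward passes; this sketch shows only the micro-batching that makes the overlap possible.

```python
import torch
import torch.nn as nn

stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
stage2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

def pipelined_forward(batch, num_microbatches=4):
    # Split the batch into micro-batches and feed them through the
    # stages in order; the devices overlap work across micro-batches.
    outputs = []
    for mb in batch.chunk(num_microbatches):
        h = stage1(mb.to("cuda:0"))
        outputs.append(stage2(h.to("cuda:1")))
    return torch.cat(outputs)

out = pipelined_forward(torch.randn(64, 1024))
```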
Fault Tolerance

Distributed training systems fail regularly. GPUs crash, network connections reset, processes hang, disks fill, filesystems become unavailable, and nodes disappear from the cluster.
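Checkpointing is the basic defense: persist enough state to resume from the last completed step rather than from scratch. A minimal sketch follows, assuming an illustrative checkpoint path and a step counter as the only extra training state; writing to a temporary file and renaming keeps a crash mid-write from corrupting the latest checkpoint.

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"  # illustrative path

def save_checkpoint(model, optimizer, step):
    # Write atomically: save to a temp file, then rename, so a crash
    # mid-write never leaves a corrupt "latest" checkpoint behind.
    tmp_path = CKPT_PATH + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, tmp_path)
    os.replace(tmp_path, CKPT_PATH)

def load_checkpoint(model, optimizer):
    # Resume from the last completed step if a checkpoint exists;
    # otherwise start from step 0.
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```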
Multi-Node Training

Multi-node training uses more than one machine for a single training job. Each machine contributes one or more accelerators, and all machines cooperate to train the same model.
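A minimal setup sketch, assuming the job is launched with torchrun on every node; the node count, GPU count, hostname, and port in the comment are placeholders.

```python
import os
import torch
import torch.distributed as dist

# Launched on every node with something like (values illustrative):
#   torchrun --nnodes=2 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=node0:29500 train.py

def setup():
    # torchrun provides RANK (global), LOCAL_RANK (within this node),
    # and WORLD_SIZE (total processes across all nodes).
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} ready")
```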
Training Foundation Models

Foundation models are large neural networks trained on broad datasets and adapted to many downstream tasks.
Inference Optimization

Training produces model parameters. Inference uses those parameters to generate predictions.
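Two of the cheapest optimizations apply to almost any PyTorch model: switching to evaluation mode and disabling autograd bookkeeping. A minimal sketch, with a stand-in model:

```python
import torch

model = torch.nn.Linear(512, 10)  # stands in for trained parameters
model.eval()                      # freeze dropout and batch-norm statistics

inputs = torch.randn(4, 512)

# inference_mode skips autograd bookkeeping entirely, saving memory
# and time relative to an ordinary forward pass.
with torch.inference_mode():
    predictions = model(inputs)
```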