
Fault Tolerance

Distributed training systems fail regularly. GPUs crash, network connections reset, processes hang, disks fill, filesystems become unavailable, and nodes disappear from the cluster. As training runs become larger and longer, the probability of failure approaches certainty.

Fault tolerance is the collection of techniques that allow training to recover from these failures without losing excessive work.

A small model trained for one hour on one GPU may not need sophisticated recovery. A large foundation model trained across hundreds of GPUs for several weeks cannot operate without it.

The central idea is simple: periodically save enough training state so that execution can resume after interruption.

Why Distributed Systems Fail

A distributed training job contains many interacting components:

Component     Possible failure
GPU           Out-of-memory, ECC error, hardware reset
CPU process   Crash, deadlock, segmentation fault
Network       Timeout, packet loss, broken connection
Filesystem    Corruption, unavailable storage
Scheduler     Node preemption or eviction
Software      Bugs, NaNs, hangs
Power         Node reboot or shutdown

If training uses N machines, the probability that at least one machine fails increases with N.

Suppose one machine has a 99.9% chance of surviving a day without failure. With 1,000 machines, the probability that all survive the day is approximately

0.999^{1000} \approx 0.37.

So large clusters experience failures frequently.
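
As a quick illustrative sketch (using the same assumed 99.9% daily survival rate per machine), the probability that an entire cluster gets through a day without any failure drops quickly as the cluster grows:

p_survive = 0.999  # assumed probability that one machine survives a full day

for n in (10, 100, 1000, 10000):
    all_survive = p_survive ** n
    print(f"{n:>6} machines: P(no failure today) = {all_survive:.5f}")

# 10 machines    -> ~0.990
# 1000 machines  -> ~0.368
# 10000 machines -> ~0.00005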

Checkpointing

The primary fault-tolerance mechanism is checkpointing.

A checkpoint is a saved snapshot of training state.

At minimum, a checkpoint should include:

State                          Purpose
Model parameters               Restore learned weights
Optimizer state                Restore momentum and adaptive statistics
Scheduler state                Continue learning rate schedule
Current epoch or step          Resume progress
Random number generator state  Preserve reproducibility

In PyTorch:

checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
    "step": global_step,
}

torch.save(checkpoint, "checkpoint.pt")

Restoring:

checkpoint = torch.load("checkpoint.pt")

model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scheduler.load_state_dict(checkpoint["scheduler"])

global_step = checkpoint["step"]

Without optimizer state restoration, adaptive optimizers such as AdamW may behave incorrectly after restart because their momentum statistics are lost.

What Must Be Saved

Large-scale training often requires more than parameters and optimizer state.

Additional state may include:

State                        Why it matters
AMP gradient scaler          Mixed precision stability
Data sampler position        Avoid repeated or skipped data
RNG state                    Deterministic continuation
Pipeline schedule state      Resume distributed execution
Token counters               Correct learning rate schedules
Exponential moving averages  Stable evaluation
Dataset shard progress       Streaming datasets

For exact restart reproducibility:

rng_state = torch.get_rng_state()
cuda_rng_state = torch.cuda.get_rng_state_all()

These states can later be restored:

torch.set_rng_state(rng_state)
torch.cuda.set_rng_state_all(cuda_rng_state)
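
Putting these pieces together, an extended checkpoint dictionary might look like the sketch below. The scaler is a torch.cuda.amp.GradScaler, while tokens_seen and sampler_state are hypothetical bookkeeping values maintained by the training loop rather than standard PyTorch objects:

import torch

checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
    "scaler": scaler.state_dict(),       # AMP gradient scaler state
    "step": global_step,
    "tokens_seen": tokens_seen,          # hypothetical token counter
    "sampler_state": sampler_state,      # hypothetical data sampler position
    "rng": torch.get_rng_state(),
    "cuda_rng": torch.cuda.get_rng_state_all(),
}

torch.save(checkpoint, "checkpoint.pt")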

Full Versus Sharded Checkpoints

In distributed training, checkpoint layout becomes important.

A full checkpoint gathers all model state into one file:

Advantage         Disadvantage
Simple to load    Large communication cost
Easy portability  Huge files
Standard format   Slow checkpoint writing

A sharded checkpoint saves only local state from each worker:

Advantage              Disadvantage
Faster writes          More complex loading
Lower memory overhead  Requires distributed restore
Better scalability     Harder portability

For very large models, full checkpoints may become impractical. A trillion-parameter model may require many terabytes of checkpoint storage.

Modern large-scale systems often use distributed checkpointing where each worker saves only its local parameter shard.
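
A minimal hand-rolled sketch of that idea is shown below: each rank writes its own shard to a rank-specific file in a shared directory (the path is hypothetical, and production systems typically use a dedicated API such as torch.distributed.checkpoint instead):

import os
import torch
import torch.distributed as dist

ckpt_dir = "/checkpoints/step_10000"   # hypothetical shared directory
rank = dist.get_rank()

# Each worker saves only the state it holds locally, for example the
# parameter and optimizer shards it owns under FSDP- or ZeRO-style sharding.
shard = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": global_step,
}
torch.save(shard, os.path.join(ckpt_dir, f"shard_rank{rank}.pt"))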

Checkpoint Frequency

Checkpoint frequency trades off recovery cost against checkpoint overhead.

Suppose:

  • checkpoint time = C
  • mean time between failures = F
  • checkpoint interval = T

If checkpoints are too infrequent, much work is lost after failure. If checkpoints are too frequent, training spends too much time writing checkpoints.

A classical approximation from fault-tolerance theory (Young's formula, later refined by Daly) gives an optimal interval near:

T \approx \sqrt{2CF}.
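
For a purely illustrative set of numbers: a checkpoint that takes 5 minutes to write on a cluster whose mean time between failures is 24 hours suggests checkpointing roughly every two hours:

import math

def optimal_checkpoint_interval(c_minutes, f_minutes):
    # Young/Daly-style approximation: T ~ sqrt(2 * C * F)
    return math.sqrt(2 * c_minutes * f_minutes)

# C = 5 minutes, F = 24 hours = 1440 minutes
print(optimal_checkpoint_interval(5, 1440))   # ~120 minutes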

In practice, large training systems often checkpoint every:

  • few hundred steps
  • few thousand steps
  • fixed wall-clock interval
  • epoch boundary

The correct interval depends on:

  • checkpoint size
  • storage bandwidth
  • cluster reliability
  • acceptable recomputation cost

Distributed Checkpoint Coordination

In distributed training, checkpointing requires coordination.

A common approach:

  1. synchronize workers
  2. ensure gradients are consistent
  3. save distributed state
  4. mark checkpoint as complete

Barrier synchronization is often used:

dist.barrier()

This prevents some workers from advancing while others are still saving.

Usually rank 0 coordinates checkpoint metadata:

if dist.get_rank() == 0:
    save_metadata()

Other workers may save parameter shards simultaneously.

Atomic Checkpoints

A partially written checkpoint is dangerous. If training crashes during checkpoint writing, corrupted files may remain.

To avoid this, checkpoints should be atomic.

Common strategy:

  1. write to temporary path
  2. verify write success
  3. rename to final checkpoint path

Example:

import os

# Write to a temporary file first, then atomically rename it into place.
torch.save(state, "checkpoint.tmp")

os.rename("checkpoint.tmp", "checkpoint.pt")

The rename operation is usually atomic on local filesystems.

Large distributed systems may additionally write:

  • completion markers
  • manifest files
  • checksums
  • version metadata
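
For example, a writer might record a checksum in a small manifest and only then create a completion marker that readers check before trusting the checkpoint (a sketch; the file names and layout are hypothetical):

import hashlib
import json
import os

def finalize_checkpoint(path):
    # Checksum the checkpoint so corruption can be detected at load time.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    # Write the manifest first, then a marker whose presence means "complete".
    with open(path + ".manifest.json", "w") as f:
        json.dump({"file": os.path.basename(path), "sha256": digest}, f)
    with open(path + ".COMPLETE", "w") as f:
        f.write("ok\n")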

Elastic Training

Traditional distributed training assumes a fixed number of workers. If one worker fails, the entire job stops.

Elastic training allows workers to join or leave dynamically.

The system may:

  • restart failed workers
  • continue with fewer workers
  • add replacement workers
  • rebalance distributed state

PyTorch provides elastic launch support through:

torchrun --nnodes=1:4 --nproc_per_node=8 --max_restarts=3 \
    --rdzv_backend=c10d --rdzv_endpoint=$HEAD_NODE:29400 train.py

Here the job may tolerate varying node counts within the specified range of one to four nodes; $HEAD_NODE is a placeholder for a host that every worker can reach for rendezvous.

Elastic systems are especially useful on preemptible or spot instances where machines may disappear unexpectedly.

Preemption and Spot Instances

Cloud providers often offer cheaper compute through preemptible instances or spot instances.

These machines may terminate at any time.

Fault-tolerant training allows systems to exploit these cheaper resources.

The workflow becomes:

  1. train normally
  2. periodically checkpoint
  3. detect preemption
  4. restart from latest checkpoint

Some schedulers provide advance termination warnings. The training system may respond by immediately triggering a checkpoint.
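
One common pattern is a signal handler that sets a flag when the warning arrives, letting the training loop save and exit cleanly at the next step boundary (a sketch; SIGTERM is a typical but not universal preemption signal, and save_checkpoint is a hypothetical helper):

import signal

preempted = False

def _on_sigterm(signum, frame):
    # Many schedulers send SIGTERM shortly before killing the process.
    global preempted
    preempted = True

signal.signal(signal.SIGTERM, _on_sigterm)

# Inside the training loop:
#     if preempted:
#         save_checkpoint(...)   # hypothetical helper
#         break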

Failure Recovery

A typical recovery sequence:

  1. detect failure
  2. terminate remaining workers
  3. relaunch job
  4. restore latest checkpoint
  5. continue training
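
Step 4 usually means scanning the checkpoint directory for the newest checkpoint that finished writing, which is where completion markers pay off (a minimal sketch; the ".COMPLETE" convention follows the atomic-checkpoint example above):

import os
import torch

def latest_complete_checkpoint(ckpt_dir):
    # Consider only checkpoints whose completion marker exists.
    complete = [
        name for name in os.listdir(ckpt_dir)
        if name.endswith(".pt")
        and os.path.exists(os.path.join(ckpt_dir, name + ".COMPLETE"))
    ]
    if not complete:
        return None
    newest = max(complete, key=lambda n: os.path.getmtime(os.path.join(ckpt_dir, n)))
    return torch.load(os.path.join(ckpt_dir, newest))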

Recovery correctness requires restoring:

(\theta, s_{\text{optimizer}}, s_{\text{scheduler}}, s_{\text{rng}}, t)

where:

Symbol                Meaning
\theta                Model parameters
s_{\text{optimizer}}  Optimizer state
s_{\text{scheduler}}  Scheduler state
s_{\text{rng}}        Random state
t                     Current step

Missing any of these may subtly alter training.

Deterministic Restart

A deterministic restart means resumed training produces identical results to uninterrupted training.

This is difficult in distributed systems because:

  • communication ordering may vary
  • floating-point operations are non-associative
  • asynchronous execution changes timing
  • kernels may be nondeterministic

PyTorch provides deterministic modes:

torch.use_deterministic_algorithms(True)

and:

torch.backends.cudnn.deterministic = True

However, deterministic execution may reduce performance.

Many production systems prioritize statistical reproducibility rather than bitwise identical execution.

Handling NaNs and Divergence

Training instability is another form of failure.

Common causes:

Cause                        Result
Excessive learning rate      Exploding updates
Numerical overflow           NaNs or infinities
Bad initialization           Divergence
Mixed precision instability  Invalid gradients

Detection logic often checks:

if not torch.isfinite(loss):
    raise RuntimeError("Non-finite loss detected")

Some systems automatically:

  • reduce learning rate
  • skip updates
  • restore earlier checkpoints
  • adjust loss scaling

Mixed precision training frequently uses dynamic loss scaling to reduce overflow risk.
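
A minimal sketch of a training step with dynamic loss scaling (assuming a CUDA model whose forward pass returns a scalar loss):

import torch

scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch)
    # Scale the loss so small gradients do not underflow in float16.
    scaler.scale(loss).backward()
    # step() skips the update if the unscaled gradients contain infs or NaNs;
    # update() then lowers the loss scale, otherwise it gradually raises it.
    scaler.step(optimizer)
    scaler.update()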

Watchdogs and Health Monitoring

Large systems often include watchdog processes that monitor worker health.

Watchdogs may track:

  • GPU utilization
  • memory usage
  • communication timeouts
  • throughput
  • heartbeat signals

A heartbeat is a periodic signal indicating the worker is alive.

Example conceptually:

while training:
    send_heartbeat(rank)

If heartbeats stop arriving, the scheduler assumes the worker failed.
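
One simple realization is a background thread that periodically refreshes a per-rank heartbeat file whose timestamp a watchdog inspects (a sketch; the path and interval are arbitrary, and rank is the worker's rank as elsewhere in this section):

import threading
import time

def heartbeat_loop(rank, interval_s=30, path="/tmp/heartbeat"):
    # A stale timestamp tells the watchdog this worker has died or hung.
    while True:
        with open(f"{path}_{rank}", "w") as f:
            f.write(str(time.time()))
        time.sleep(interval_s)

threading.Thread(target=heartbeat_loop, args=(rank,), daemon=True).start()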

Distributed Deadlocks

Distributed programs can hang if workers disagree about communication order.

Example:

if rank == 0:
    dist.broadcast(x, src=0)
else:
    dist.all_reduce(y)

Rank 0 waits for a broadcast. Rank 1 waits for an all-reduce. Neither operation completes.

Deadlocks are common failure modes in distributed systems.

Best practices include:

  • identical communication order across workers
  • explicit barriers
  • timeout detection
  • careful conditional logic
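
Timeout detection in particular is often configured when the process group is created, so that a lost or mismatched collective fails loudly instead of hanging forever (a sketch; the 10-minute value is arbitrary, and with NCCL the enforcement behavior also depends on its async error handling settings):

from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))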

Checkpoint Storage Systems

Checkpoint storage systems must support:

Requirement            Reason
High write throughput  Large checkpoints
Durability             Survive node failure
Parallel access        Distributed restore
Scalability            Many checkpoints
Metadata consistency   Correct recovery

Common storage choices:

Storage type             Typical use
Local NVMe               Fast temporary checkpoints
Network filesystem       Shared cluster storage
Object storage           Durable cloud checkpoints
Distributed filesystems  Large HPC systems

Large training systems often checkpoint to local SSD first, then asynchronously upload to durable object storage.
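
A sketch of that pattern: the local write blocks the training loop only briefly, and a background thread handles the durable copy (upload_to_object_store is a hypothetical stand-in for whatever S3/GCS/Azure client the cluster actually uses):

import shutil
import threading

import torch

def upload_to_object_store(local_path, remote_path):
    # Hypothetical: replace with a real object-storage client call.
    shutil.copy(local_path, remote_path)

def checkpoint_then_upload(state, local_path, remote_path):
    torch.save(state, local_path)          # fast write to local NVMe
    threading.Thread(
        target=upload_to_object_store,
        args=(local_path, remote_path),
        daemon=True,
    ).start()                              # training resumes immediately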

Incremental Checkpoints

Saving the full model every time can be expensive.

Incremental checkpointing saves only changed state between checkpoints.

Example strategy:

Checkpoint type    Frequency
Full checkpoint    Every 10,000 steps
Incremental delta  Every 500 steps

This reduces storage bandwidth and checkpoint latency.

Incremental checkpointing becomes increasingly important as model size grows.

Fault Tolerance in Large Foundation Models

Large foundation model training may run for weeks across thousands of GPUs.

In such systems:

  • failures are expected
  • restart automation is mandatory
  • checkpoint corruption must be detected
  • storage bandwidth becomes a bottleneck

Modern training infrastructure therefore includes:

Component                  Role
Distributed checkpointing  Scalable save/restore
Elastic launchers          Worker recovery
Health monitoring          Failure detection
Automatic retry logic      Restart failed jobs
Redundant storage          Prevent checkpoint loss

Without fault tolerance, large-scale training would become economically infeasible because failures would waste too much compute.

The Economics of Recovery

Fault tolerance is fundamentally about preserving compute investment.

Suppose a training run uses:

  • 2,000 GPUs
  • 2 weeks
  • $2 per GPU-hour

Total compute cost:

2000 \times 24 \times 14 \times 2 = 1{,}344{,}000.

A single unrecoverable failure near the end could waste more than one million dollars of computation.

Checkpointing overhead may seem expensive, but it is small compared with restarting massive training runs from scratch.

Design Principles

Reliable distributed training systems generally follow several principles:

Principle                     Meaning
Save frequently               Reduce lost work
Save atomically               Avoid corruption
Recover automatically         Minimize human intervention
Detect failure early          Reduce wasted compute
Keep state synchronized       Preserve correctness
Prefer idempotent operations  Safe retries

As training systems scale, infrastructure engineering becomes as important as model architecture. A trillion-parameter model is not useful if the training system cannot survive ordinary operational failures.