
Fault Tolerance

Distributed training systems fail regularly. GPUs crash, network connections reset, processes hang, disks fill, filesystems become unavailable, and nodes disappear from the cluster. As training runs become larger and longer, the probability of failure approaches certainty.

Fault tolerance is the collection of techniques that allow training to recover from these failures without losing excessive work.

A small model trained for one hour on one GPU may not need sophisticated recovery. A large foundation model trained across hundreds of GPUs for several weeks cannot operate without it.

The central idea is simple: periodically save enough training state so that execution can resume after interruption.

Why Distributed Systems Fail

A distributed training job contains many interacting components:

Component     Possible failure
GPU           Out-of-memory, ECC error, hardware reset
CPU process   Crash, deadlock, segmentation fault
Network       Timeout, packet loss, broken connection
Filesystem    Corruption, unavailable storage
Scheduler     Node preemption or eviction
Software      Bugs, NaNs, hangs
Power         Node reboot or shutdown

If training uses N machines, the probability that at least one machine fails increases with N.

Suppose one machine has a 99.9% chance of surviving a day without failure. With 1,000 machines, the probability that all survive the day is approximately

0.999^{1000} \approx 0.37.

So large clusters experience failures frequently.
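
As a quick illustrative sketch (using the same assumed 99.9% daily survival rate per machine), the probability that an entire cluster gets through a day without any failure drops quickly as the cluster grows:

p_survive = 0.999  # assumed probability that one machine survives a full day

for n in (10, 100, 1000, 10000):
    all_survive = p_survive ** n
    print(f"{n:>6} machines: P(no failure today) = {all_survive:.5f}")

# 10 machines    -> ~0.990
# 1000 machines  -> ~0.368
# 10000 machines -> ~0.00005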

Checkpointing

The primary fault-tolerance mechanism is checkpointing.

A checkpoint is a saved snapshot of training state.

At minimum, a checkpoint should include:

State                          Purpose
Model parameters               Restore learned weights
Optimizer state                Restore momentum and adaptive statistics
Scheduler state                Continue learning rate schedule
Current epoch or step          Resume progress
Random number generator state  Preserve reproducibility

In PyTorch:

checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
    "step": global_step,
}

torch.save(checkpoint, "checkpoint.pt")

Restoring:

checkpoint = torch.load("checkpoint.pt")

model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scheduler.load_state_dict(checkpoint["scheduler"])

global_step = checkpoint["step"]

Without optimizer state restoration, adaptive optimizers such as AdamW may behave incorrectly after restart because their momentum statistics are lost.

What Must Be Saved

Large-scale training often requires more than parameters and optimizer state.

Additional state may include:

State                        Why it matters
AMP gradient scaler          Mixed precision stability
Data sampler position        Avoid repeated or skipped data
RNG state                    Deterministic continuation
Pipeline schedule state      Resume distributed execution
Token counters               Correct learning rate schedules
Exponential moving averages  Stable evaluation
Dataset shard progress       Streaming datasets

For exact restart reproducibility:

rng_state = torch.get_rng_state()
cuda_rng_state = torch.cuda.get_rng_state_all()

These states can later be restored:

torch.set_rng_state(rng_state)
torch.cuda.set_rng_state_all(cuda_rng_state)
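
Putting these pieces together, an extended checkpoint dictionary might look like the sketch below. The scaler is a torch.cuda.amp.GradScaler, while tokens_seen and sampler_state are hypothetical bookkeeping values maintained by the training loop rather than standard PyTorch objects:

import torch

checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
    "scaler": scaler.state_dict(),       # AMP gradient scaler state
    "step": global_step,
    "tokens_seen": tokens_seen,          # hypothetical token counter
    "sampler_state": sampler_state,      # hypothetical data sampler position
    "rng": torch.get_rng_state(),
    "cuda_rng": torch.cuda.get_rng_state_all(),
}

torch.save(checkpoint, "checkpoint.pt")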

Full Versus Sharded Checkpoints

In distributed training, checkpoint layout becomes important.

A full checkpoint gathers all model state into one file:

Advantage         Disadvantage
Simple to load    Large communication cost
Easy portability  Huge files
Standard format   Slow checkpoint writing

A sharded checkpoint saves only local state from each worker:

Advantage              Disadvantage
Faster writes          More complex loading
Lower memory overhead  Requires distributed restore
Better scalability     Harder portability

For very large models, full checkpoints may become impractical. A trillion-parameter model may require many terabytes of checkpoint storage.

Modern large-scale systems often use distributed checkpointing where each worker saves only its local parameter shard.
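
A minimal hand-rolled sketch of that idea is shown below: each rank writes its own shard to a rank-specific file in a shared directory (the path is hypothetical, and production systems typically use a dedicated API such as torch.distributed.checkpoint instead):

import os
import torch
import torch.distributed as dist

ckpt_dir = "/checkpoints/step_10000"   # hypothetical shared directory
rank = dist.get_rank()

# Each worker saves only the state it holds locally, for example the
# parameter and optimizer shards it owns under FSDP- or ZeRO-style sharding.
shard = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": global_step,
}
torch.save(shard, os.path.join(ckpt_dir, f"shard_rank{rank}.pt"))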

Checkpoint Frequency

Checkpoint frequency trades off recovery cost against checkpoint overhead.

Suppose:

  • checkpoint time = C
  • mean time between failures = F
  • checkpoint interval = T

If checkpoints are too infrequent, much work is lost after failure. If checkpoints are too frequent, training spends too much time writing checkpoints.

A classical approximation from fault-tolerance theory (Young's formula, later refined by Daly) gives an optimal interval near:

T \approx \sqrt{2CF}.
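
For a purely illustrative set of numbers: a checkpoint that takes 5 minutes to write on a cluster whose mean time between failures is 24 hours suggests checkpointing roughly every two hours:

import math

def optimal_checkpoint_interval(c_minutes, f_minutes):
    # Young/Daly-style approximation: T ~ sqrt(2 * C * F)
    return math.sqrt(2 * c_minutes * f_minutes)

# C = 5 minutes, F = 24 hours = 1440 minutes
print(optimal_checkpoint_interval(5, 1440))   # ~120 minutes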

In practice, large training systems often checkpoint every:

  • few hundred steps
  • few thousand steps
  • fixed wall-clock interval
  • epoch boundary

The correct interval depends on:

  • checkpoint size
  • storage bandwidth
  • cluster reliability
  • acceptable recomputation cost

Distributed Checkpoint Coordination

In distributed training, checkpointing requires coordination.

A common approach:

  1. synchronize workers
  2. ensure gradients are consistent
  3. save distributed state
  4. mark checkpoint as complete

Barrier synchronization is often used:

dist.barrier()

This prevents some workers from advancing while others are still saving.

Usually rank 0 coordinates checkpoint metadata:

if dist.get_rank() == 0:
    save_metadata()

Other workers may save parameter shards simultaneously.

Atomic Checkpoints

A partially written checkpoint is dangerous. If training crashes during checkpoint writing, corrupted files may remain.

To avoid this, checkpoints should be atomic.

Common strategy:

  1. write to temporary path
  2. verify write success
  3. rename to final checkpoint path

Example:

import os

# Write to a temporary file first, then atomically rename it into place.
torch.save(state, "checkpoint.tmp")

os.rename("checkpoint.tmp", "checkpoint.pt")

The rename operation is usually atomic on local filesystems.

Large distributed systems may additionally write:

  • completion markers
  • manifest files
  • checksums
  • version metadata
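
For example, a writer might record a checksum in a small manifest and only then create a completion marker that readers check before trusting the checkpoint (a sketch; the file names and layout are hypothetical):

import hashlib
import json
import os

def finalize_checkpoint(path):
    # Checksum the checkpoint so corruption can be detected at load time.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    # Write the manifest first, then a marker whose presence means "complete".
    with open(path + ".manifest.json", "w") as f:
        json.dump({"file": os.path.basename(path), "sha256": digest}, f)
    with open(path + ".COMPLETE", "w") as f:
        f.write("ok\n")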

Elastic Training

Traditional distributed training assumes a fixed number of workers. If one worker fails, the entire job stops.

Elastic training allows workers to join or leave dynamically.

The system may:

  • restart failed workers
  • continue with fewer workers
  • add replacement workers
  • rebalance distributed state

PyTorch provides elastic launch support through:

torchrun --nnodes=1:4 --nproc_per_node=8 --max_restarts=3 \
    --rdzv_backend=c10d --rdzv_endpoint=$HEAD_NODE:29400 train.py

Here the job may tolerate varying node counts within the specified range of one to four nodes; $HEAD_NODE is a placeholder for a host that every worker can reach for rendezvous.

Elastic systems are especially useful on preemptible or spot instances where machines may disappear unexpectedly.

Preemption and Spot Instances

Cloud providers often offer cheaper compute through preemptible instances or spot instances.

These machines may terminate at any time.

Fault-tolerant training allows systems to exploit these cheaper resources.

The workflow becomes:

  1. train normally
  2. periodically checkpoint
  3. detect preemption
  4. restart from latest checkpoint

Some schedulers provide advance termination warnings. The training system may respond by immediately triggering a checkpoint.
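
One common pattern is a signal handler that sets a flag when the warning arrives, letting the training loop save and exit cleanly at the next step boundary (a sketch; SIGTERM is a typical but not universal preemption signal, and save_checkpoint is a hypothetical helper):

import signal

preempted = False

def _on_sigterm(signum, frame):
    # Many schedulers send SIGTERM shortly before killing the process.
    global preempted
    preempted = True

signal.signal(signal.SIGTERM, _on_sigterm)

# Inside the training loop:
#     if preempted:
#         save_checkpoint(...)   # hypothetical helper
#         break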

Failure Recovery

A typical recovery sequence:

  1. detect failure
  2. terminate remaining workers
  3. relaunch job
  4. restore latest checkpoint
  5. continue training
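
Step 4 usually means scanning the checkpoint directory for the newest checkpoint that finished writing, which is where completion markers pay off (a minimal sketch; the ".COMPLETE" convention follows the atomic-checkpoint example above):

import os
import torch

def latest_complete_checkpoint(ckpt_dir):
    # Consider only checkpoints whose completion marker exists.
    complete = [
        name for name in os.listdir(ckpt_dir)
        if name.endswith(".pt")
        and os.path.exists(os.path.join(ckpt_dir, name + ".COMPLETE"))
    ]
    if not complete:
        return None
    newest = max(complete, key=lambda n: os.path.getmtime(os.path.join(ckpt_dir, n)))
    return torch.load(os.path.join(ckpt_dir, newest))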

Recovery correctness requires restoring:

(\theta, s_{\text{optimizer}}, s_{\text{scheduler}}, s_{\text{rng}}, t)

where:

Symbol                Meaning
\theta                Model parameters
s_{\text{optimizer}}  Optimizer state
s_{\text{scheduler}}  Scheduler state
s_{\text{rng}}        Random state
t                     Current step

Missing any of these may subtly alter training.

Deterministic Restart

A deterministic restart means resumed training produces identical results to uninterrupted training.

This is difficult in distributed systems because:

  • communication ordering may vary
  • floating-point operations are non-associative
  • asynchronous execution changes timing
  • kernels may be nondeterministic

PyTorch provides deterministic modes:

torch.use_deterministic_algorithms(True)

and:

torch.backends.cudnn.deterministic = True

However, deterministic execution may reduce performance.

Many production systems prioritize statistical reproducibility rather than bitwise identical execution.

Handling NaNs and Divergence

Training instability is another form of failure.

Common causes:

Cause                        Result
Excessive learning rate      Exploding updates
Numerical overflow           NaNs or infinities
Bad initialization           Divergence
Mixed precision instability  Invalid gradients

Detection logic often checks:

if not torch.isfinite(loss):
    raise RuntimeError("Non-finite loss detected")

Some systems automatically:

  • reduce learning rate
  • skip updates
  • restore earlier checkpoints
  • adjust loss scaling

Mixed precision training frequently uses dynamic loss scaling to reduce overflow risk.
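
A minimal sketch of a training step with dynamic loss scaling (assuming a CUDA model whose forward pass returns a scalar loss):

import torch

scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch)
    # Scale the loss so small gradients do not underflow in float16.
    scaler.scale(loss).backward()
    # step() skips the update if the unscaled gradients contain infs or NaNs;
    # update() then lowers the loss scale, otherwise it gradually raises it.
    scaler.step(optimizer)
    scaler.update()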

Watchdogs and Health Monitoring

Large systems often include watchdog processes that monitor worker health.

Watchdogs may track:

  • GPU utilization
  • memory usage
  • communication timeouts
  • throughput
  • heartbeat signals

A heartbeat is a periodic signal indicating the worker is alive.

Example conceptually:

while training:
    send_heartbeat(rank)

If heartbeats stop arriving, the scheduler assumes the worker failed.
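
One simple realization is a background thread that periodically refreshes a per-rank heartbeat file whose timestamp a watchdog inspects (a sketch; the path and interval are arbitrary, and rank is the worker's rank as elsewhere in this section):

import threading
import time

def heartbeat_loop(rank, interval_s=30, path="/tmp/heartbeat"):
    # A stale timestamp tells the watchdog this worker has died or hung.
    while True:
        with open(f"{path}_{rank}", "w") as f:
            f.write(str(time.time()))
        time.sleep(interval_s)

threading.Thread(target=heartbeat_loop, args=(rank,), daemon=True).start()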

Distributed Deadlocks

Distributed programs can hang if workers disagree about communication order.

Example:

if rank == 0:
    dist.broadcast(x, src=0)
else:
    dist.all_reduce(y)

Rank 0 waits for a broadcast. Rank 1 waits for an all-reduce. Neither operation completes.

Deadlocks are common failure modes in distributed systems.

Best practices include:

  • identical communication order across workers
  • explicit barriers
  • timeout detection
  • careful conditional logic
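
Timeout detection in particular is often configured when the process group is created, so that a lost or mismatched collective fails loudly instead of hanging forever (a sketch; the 10-minute value is arbitrary, and with NCCL the enforcement behavior also depends on its async error handling settings):

from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))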

Checkpoint Storage Systems

Checkpoint storage systems must support:

Requirement            Reason
High write throughput  Large checkpoints
Durability             Survive node failure
Parallel access        Distributed restore
Scalability            Many checkpoints
Metadata consistency   Correct recovery

Common storage choices:

Storage type             Typical use
Local NVMe               Fast temporary checkpoints
Network filesystem       Shared cluster storage
Object storage           Durable cloud checkpoints
Distributed filesystems  Large HPC systems

Large training systems often checkpoint to local SSD first, then asynchronously upload to durable object storage.
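
A sketch of that pattern: the local write blocks the training loop only briefly, and a background thread handles the durable copy (upload_to_object_store is a hypothetical stand-in for whatever S3/GCS/Azure client the cluster actually uses):

import shutil
import threading

import torch

def upload_to_object_store(local_path, remote_path):
    # Hypothetical: replace with a real object-storage client call.
    shutil.copy(local_path, remote_path)

def checkpoint_then_upload(state, local_path, remote_path):
    torch.save(state, local_path)          # fast write to local NVMe
    threading.Thread(
        target=upload_to_object_store,
        args=(local_path, remote_path),
        daemon=True,
    ).start()                              # training resumes immediately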

Incremental Checkpoints

Saving the full model every time can be expensive.

Incremental checkpointing saves only changed state between checkpoints.

Example strategy:

Checkpoint type    Frequency
Full checkpoint    Every 10,000 steps
Incremental delta  Every 500 steps

This reduces storage bandwidth and checkpoint latency.

Incremental checkpointing becomes increasingly important as model size grows.

Fault Tolerance in Large Foundation Models

Large foundation model training may run for weeks across thousands of GPUs.

In such systems:

  • failures are expected
  • restart automation is mandatory
  • checkpoint corruption must be detected
  • storage bandwidth becomes a bottleneck

Modern training infrastructure therefore includes:

Component                  Role
Distributed checkpointing  Scalable save/restore
Elastic launchers          Worker recovery
Health monitoring          Failure detection
Automatic retry logic      Restart failed jobs
Redundant storage          Prevent checkpoint loss

Without fault tolerance, large-scale training would become economically infeasible because failures would waste too much compute.

The Economics of Recovery

Fault tolerance is fundamentally about preserving compute investment.

Suppose a training run uses:

  • 2,000 GPUs
  • 2 weeks
  • $2 per GPU-hour

Total compute cost:

2000 \times 24 \times 14 \times 2 = 1{,}344{,}000.

A single unrecoverable failure near the end could waste more than one million dollars of computation.

Checkpointing overhead may seem expensive, but it is small compared with restarting massive training runs from scratch.

Design Principles

Reliable distributed training systems generally follow several principles:

Principle                     Meaning
Save frequently               Reduce lost work
Save atomically               Avoid corruption
Recover automatically         Minimize human intervention
Detect failure early          Reduce wasted compute
Keep state synchronized       Preserve correctness
Prefer idempotent operations  Safe retries

As training systems scale, infrastructure engineering becomes as important as model architecture. A trillion-parameter model is not useful if the training system cannot survive ordinary operational failures.