# Fault Tolerance

Distributed training systems fail regularly. GPUs crash, network connections reset, processes hang, disks fill, filesystems become unavailable, and nodes disappear from the cluster. As training runs become larger and longer, the probability of failure approaches certainty.

Fault tolerance is the collection of techniques that allow training to recover from these failures without losing excessive work.

A small model trained for one hour on one GPU may not need sophisticated recovery. A large foundation model trained across hundreds of GPUs for several weeks cannot operate without it.

The central idea is simple: periodically save enough training state so that execution can resume after interruption.

### Why Distributed Systems Fail

A distributed training job contains many interacting components:

| Component | Possible failure |
|---|---|
| GPU | Out-of-memory, ECC error, hardware reset |
| CPU process | Crash, deadlock, segmentation fault |
| Network | Timeout, packet loss, broken connection |
| Filesystem | Corruption, unavailable storage |
| Scheduler | Node preemption or eviction |
| Software | Bugs, NaNs, hangs |
| Power | Node reboot or shutdown |

If training uses $N$ machines, the probability that at least one machine fails increases with $N$.

Suppose one machine has a 99.9% chance of surviving a day without failure. With 1,000 machines, the probability that all survive the day is approximately

$$
0.999^{1000} \approx 0.37.
$$

So large clusters experience failures frequently.
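The arithmetic can be checked directly. A minimal sketch in plain Python (no framework assumed), computing the probability that an $N$-machine cluster survives a full day given per-machine daily survival probability $p$:

```python id="k3p9ws"
# If each machine independently survives a day with probability p,
# all N machines survive with probability p ** N.
def cluster_survival(p, n):
    return p ** n

for n in (1, 100, 1000, 10000):
    print(n, cluster_survival(0.999, n))
```

At 1,000 machines the daily all-survive probability has already fallen to about 0.37, matching the estimate above; at 10,000 machines a failure-free day is essentially impossible.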

### Checkpointing

The primary fault-tolerance mechanism is checkpointing.

A checkpoint is a saved snapshot of training state.

At minimum, a checkpoint should include:

| State | Purpose |
|---|---|
| Model parameters | Restore learned weights |
| Optimizer state | Restore momentum and adaptive statistics |
| Scheduler state | Continue learning rate schedule |
| Current epoch or step | Resume progress |
| Random number generator state | Preserve reproducibility |

In PyTorch:

```python id="q4x5mu"
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
    "step": global_step,
}

torch.save(checkpoint, "checkpoint.pt")
```

Restoring:

```python id="vkv4ep"
checkpoint = torch.load("checkpoint.pt")

model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scheduler.load_state_dict(checkpoint["scheduler"])

global_step = checkpoint["step"]
```

Without optimizer state restoration, adaptive optimizers such as AdamW may behave incorrectly after restart because their momentum statistics are lost.

### What Must Be Saved

Large-scale training often requires more than parameters and optimizer state.

Additional state may include:

| State | Why it matters |
|---|---|
| AMP gradient scaler | Mixed precision stability |
| Data sampler position | Avoid repeated or skipped data |
| RNG state | Deterministic continuation |
| Pipeline schedule state | Resume distributed execution |
| Token counters | Correct learning rate schedules |
| Exponential moving averages | Stable evaluation |
| Dataset shard progress | Streaming datasets |

For exact restart reproducibility:

```python id="2v7bg6"
rng_state = torch.get_rng_state()
cuda_rng_state = torch.cuda.get_rng_state_all()
```

These states can later be restored:

```python id="f2v2mq"
torch.set_rng_state(rng_state)
torch.cuda.set_rng_state_all(cuda_rng_state)
```
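The same capture-and-restore pattern exists at the language level. As an analogue of the calls above, a sketch using only Python's standard-library `random` module:

```python id="m8d2qa"
import random

random.seed(1234)
saved_state = random.getstate()      # snapshot the generator state

first_draw = [random.random() for _ in range(3)]

random.setstate(saved_state)         # roll the generator back
second_draw = [random.random() for _ in range(3)]

# The restored generator replays exactly the same sequence.
assert first_draw == second_draw
```

A checkpoint that stores generator state in this way lets a resumed run draw the same dropout masks and data shuffles it would have drawn without the interruption.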

### Full Versus Sharded Checkpoints

In distributed training, checkpoint layout becomes important.

A full checkpoint gathers all model state into one file:

| Advantage | Disadvantage |
|---|---|
| Simple to load | Large communication cost |
| Easy portability | Huge files |
| Standard format | Slow checkpoint writing |

A sharded checkpoint saves only local state from each worker:

| Advantage | Disadvantage |
|---|---|
| Faster writes | More complex loading |
| Lower memory overhead | Requires distributed restore |
| Better scalability | Harder portability |

For very large models, full checkpoints may become impractical. A trillion-parameter model may require many terabytes of checkpoint storage.

Modern large-scale systems often use distributed checkpointing where each worker saves only its local parameter shard.
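The partitioning idea can be sketched with tensors stood in by plain strings and ranks simulated in-process (no distributed runtime assumed): each rank owns a disjoint slice of the parameter list and would serialize only that slice.

```python id="x9t4bn"
def shard_for_rank(params, rank, world_size):
    # Round-robin assignment: rank r owns params r, r + world_size, ...
    return params[rank::world_size]

params = [f"layer{i}.weight" for i in range(10)]
world_size = 4

shards = [shard_for_rank(params, r, world_size) for r in range(world_size)]

# Every parameter lands in exactly one shard.
assert sorted(p for shard in shards for p in shard) == sorted(params)
```

Real systems shard by tensor or by flattened byte ranges rather than round-robin, but the invariant is the same: the shards are disjoint and jointly cover the full state.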

### Checkpoint Frequency

Checkpoint frequency trades off recovery cost against checkpoint overhead.

Suppose:

- checkpoint time = $C$
- mean time between failures = $F$
- checkpoint interval = $T$

If checkpoints are too infrequent, much work is lost after failure. If checkpoints are too frequent, training spends too much time writing checkpoints.

An approximate classical result from fault-tolerance theory gives an optimal interval near:

$$
T \approx \sqrt{2CF}.
$$
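As a worked example (the numbers are illustrative, not measured from any particular cluster): with a 5-minute checkpoint time and a one-day mean time between failures, the formula suggests checkpointing roughly every two hours.

```python id="p2w7rc"
import math

C = 5 * 60        # checkpoint time: 5 minutes, in seconds
F = 24 * 3600     # mean time between failures: 1 day, in seconds

# Young's approximation for the optimal checkpoint interval.
T = math.sqrt(2 * C * F)

print(T / 3600)   # interval in hours -> 2.0
```

Note the square-root scaling: making checkpoints 4x cheaper only justifies checkpointing 2x as often.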

In practice, large training systems often checkpoint every:

- few hundred steps
- few thousand steps
- fixed wall-clock interval
- epoch boundary

The correct interval depends on:

- checkpoint size
- storage bandwidth
- cluster reliability
- acceptable recomputation cost

### Distributed Checkpoint Coordination

In distributed training, checkpointing requires coordination.

A common approach:

1. synchronize workers
2. ensure gradients are consistent
3. save distributed state
4. mark checkpoint as complete

Barrier synchronization is often used:

```python id="0h75cx"
dist.barrier()
```

This prevents some workers from advancing while others are still saving.

Usually rank 0 coordinates checkpoint metadata:

```python id="c3qkrl"
if dist.get_rank() == 0:
    save_metadata()
```

Other workers may save parameter shards simultaneously.
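This division of labor can be sketched end to end with the ranks simulated sequentially in one process (the file layout and helper name are illustrative, not a real API):

```python id="j6h3nv"
import json
import os
import tempfile

def save_checkpoint(root, step, shards):
    # Every rank writes its own shard file...
    for rank, shard in shards.items():
        with open(os.path.join(root, f"shard_{rank}.json"), "w") as f:
            json.dump(shard, f)
    # ...and rank 0 alone writes the metadata describing the layout.
    meta = {"step": step, "world_size": len(shards)}
    with open(os.path.join(root, "metadata.json"), "w") as f:
        json.dump(meta, f)

root = tempfile.mkdtemp()
save_checkpoint(root, step=500, shards={0: {"w": [1.0]}, 1: {"w": [2.0]}})
```

Keeping layout metadata in a single rank-0 file avoids write races and gives the restore path one authoritative description of how the shards fit together.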

### Atomic Checkpoints

A partially written checkpoint is dangerous. If training crashes during checkpoint writing, corrupted files may remain.

To avoid this, checkpoints should be atomic.

Common strategy:

1. write to temporary path
2. verify write success
3. rename to final checkpoint path

Example:

```python id="7y3fny"
import os

# Write to a temporary path, then atomically rename into place.
torch.save(state, "checkpoint.tmp")

os.rename("checkpoint.tmp", "checkpoint.pt")
```

A rename within a single filesystem is atomic on POSIX systems; in Python, `os.replace` provides the same overwrite guarantee across platforms.

Large distributed systems may additionally write:

- completion markers
- manifest files
- checksums
- version metadata
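Those extra artifacts combine naturally with the temp-write-then-rename pattern. A standard-library sketch (the sidecar manifest layout is illustrative):

```python id="t5k8me"
import hashlib
import json
import os
import tempfile

def atomic_save(payload, path):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())      # force bytes to disk before renaming
    os.replace(tmp, path)         # atomic rename; safe overwrite on all platforms
    # Sidecar manifest: the checksum lets a restorer detect corruption.
    manifest = {"sha256": hashlib.sha256(payload).hexdigest(),
                "bytes": len(payload)}
    with open(path + ".manifest.json", "w") as f:
        json.dump(manifest, f)

root = tempfile.mkdtemp()
ckpt = os.path.join(root, "checkpoint.pt")
atomic_save(b"fake-checkpoint-bytes", ckpt)
```

On restore, the checksum is recomputed and compared against the manifest; a mismatch means the checkpoint must be discarded and an older one used.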

### Elastic Training

Traditional distributed training assumes a fixed number of workers. If one worker fails, the entire job stops.

Elastic training allows workers to join or leave dynamically.

The system may:

- restart failed workers
- continue with fewer workers
- add replacement workers
- rebalance distributed state

PyTorch provides elastic launch support through:

```bash id="7zmnrm"
# $HEAD_NODE holds the address of the rendezvous host
torchrun \
    --nnodes=1:4 \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$HEAD_NODE:29400 \
    train.py
```

Here the job may tolerate varying node counts within the specified range.

Elastic systems are especially useful on preemptible or spot instances where machines may disappear unexpectedly.

### Preemption and Spot Instances

Cloud providers often offer cheaper compute through preemptible instances or spot instances.

These machines may terminate at any time.

Fault-tolerant training allows systems to exploit these cheaper resources.

The workflow becomes:

1. train normally
2. periodically checkpoint
3. detect preemption
4. restart from latest checkpoint

Some schedulers provide advance termination warnings. The training system may respond by immediately triggering a checkpoint.
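One way to wire up such a warning, sketched with SIGTERM standing in for the scheduler's signal (the handler only sets a flag; the training loop checkpoints at the next safe point rather than inside the handler):

```python id="b4n9zx"
import os
import signal

preempted = False

def on_sigterm(signum, frame):
    # Keep the handler trivial: just record that a checkpoint is needed.
    global preempted
    preempted = True

signal.signal(signal.SIGTERM, on_sigterm)

# Simulate the scheduler's warning by signaling ourselves.
os.kill(os.getpid(), signal.SIGTERM)

if preempted:
    print("would checkpoint now and exit cleanly")
```

Deferring the actual save to the training loop avoids writing a checkpoint from inside a signal handler, where most I/O is unsafe.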

### Failure Recovery

A typical recovery sequence:

1. detect failure
2. terminate remaining workers
3. relaunch job
4. restore latest checkpoint
5. continue training
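Step 4 requires knowing which checkpoint is both newest and complete. A sketch that scans for completion markers (the `step_N.done` naming is illustrative):

```python id="d7q2fl"
import os
import tempfile

def latest_complete_step(root):
    # A checkpoint at step S is usable only if its "step_S.done" marker exists.
    steps = []
    for name in os.listdir(root):
        if name.startswith("step_") and name.endswith(".done"):
            steps.append(int(name[len("step_"):-len(".done")]))
    return max(steps) if steps else None

root = tempfile.mkdtemp()
for step in (100, 200, 300):
    open(os.path.join(root, f"step_{step}.done"), "w").close()
# Step 400 crashed mid-save: checkpoint file exists but no marker.
open(os.path.join(root, "step_400.pt"), "w").close()

assert latest_complete_step(root) == 300
```

Because the marker is written last, its presence certifies that every shard of that checkpoint finished writing.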

Recovery correctness requires restoring:

$$
(\theta, s_{\text{optimizer}}, s_{\text{scheduler}}, s_{\text{rng}}, t)
$$

where:

| Symbol | Meaning |
|---|---|
| $\theta$ | Model parameters |
| $s_{\text{optimizer}}$ | Optimizer state |
| $s_{\text{scheduler}}$ | Scheduler state |
| $s_{\text{rng}}$ | Random state |
| $t$ | Current step |

Missing any of these may subtly alter training.

### Deterministic Restart

A deterministic restart means resumed training produces identical results to uninterrupted training.

This is difficult in distributed systems because:

- communication ordering may vary
- floating-point operations are non-associative
- asynchronous execution changes timing
- kernels may be nondeterministic

PyTorch provides deterministic modes:

```python id="3ak0px"
torch.use_deterministic_algorithms(True)
```

and:

```python id="wfxjlwm"
torch.backends.cudnn.deterministic = True
```

However, deterministic execution may reduce performance.

Many production systems prioritize statistical reproducibility rather than bitwise identical execution.

### Handling NaNs and Divergence

Training instability is another form of failure.

Common causes:

| Cause | Result |
|---|---|
| Excessive learning rate | Exploding updates |
| Numerical overflow | NaNs or infinities |
| Bad initialization | Divergence |
| Mixed precision instability | Invalid gradients |

Detection logic often checks:

```python id="wqcc98"
if not torch.isfinite(loss):
    raise RuntimeError("Non-finite loss detected")
```

Some systems automatically:

- reduce learning rate
- skip updates
- restore earlier checkpoints
- adjust loss scaling

Mixed precision training frequently uses dynamic loss scaling to reduce overflow risk.
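The control loop behind dynamic loss scaling can be sketched without any framework (the class and constants are illustrative; PyTorch's `GradScaler` implements the production version):

```python id="h8s3wy"
class LossScaler:
    def __init__(self, scale=2.0**16, backoff=0.5, growth=2.0,
                 growth_interval=2000):
        self.scale = scale
        self.backoff = backoff                # shrink factor on overflow
        self.growth = growth                  # growth factor after a stable run
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_overflow):
        """Return True if this step's parameter update should be applied."""
        if found_overflow:
            self.scale *= self.backoff        # gradients overflowed: back off
            self.good_steps = 0
            return False                      # skip the update entirely
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            self.scale *= self.growth         # long stable run: try a larger scale
            self.good_steps = 0
        return True

scaler = LossScaler(growth_interval=3)
applied = [scaler.update(overflow)
           for overflow in (False, True, False, False, False)]
```

The scale shrinks immediately on overflow but grows back only after a sustained run of finite gradients, so occasional spikes cost a skipped step rather than a diverged run.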

### Watchdogs and Health Monitoring

Large systems often include watchdog processes that monitor worker health.

Watchdogs may track:

- GPU utilization
- memory usage
- communication timeouts
- throughput
- heartbeat signals

A heartbeat is a periodic signal indicating the worker is alive.

Example conceptually:

```python id="r5z9ff"
while training:
    send_heartbeat(rank)
    time.sleep(heartbeat_interval)   # pace the signals; avoid a busy loop
```

If heartbeats stop arriving, the scheduler assumes the worker failed.
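On the monitoring side, a watchdog compares each worker's last heartbeat timestamp against a timeout. A sketch with an injectable clock so the staleness logic runs without real waiting (class and parameter names are illustrative):

```python id="r2m6kd"
class Watchdog:
    def __init__(self, timeout, clock):
        self.timeout = timeout
        self.clock = clock                 # callable returning the current time
        self.last_seen = {}

    def heartbeat(self, rank):
        self.last_seen[rank] = self.clock()

    def dead_workers(self):
        now = self.clock()
        return [r for r, t in self.last_seen.items() if now - t > self.timeout]

now = [0.0]                                # fake clock we advance by hand
wd = Watchdog(timeout=30.0, clock=lambda: now[0])

wd.heartbeat(0)
wd.heartbeat(1)
now[0] = 20.0
wd.heartbeat(0)                            # rank 0 keeps reporting; rank 1 went silent
now[0] = 45.0

assert wd.dead_workers() == [1]
```

In production the clock is `time.monotonic` and the dead-worker list feeds the scheduler's restart logic.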

### Distributed Deadlocks

Distributed programs can hang if workers disagree about communication order.

Example:

```python id="g2n8vm"
if rank == 0:
    dist.broadcast(x, src=0)   # blocks until every rank joins the broadcast
else:
    dist.all_reduce(y)         # blocks until every rank joins the all-reduce
```

Rank 0 waits for a broadcast. Rank 1 waits for an all-reduce. Neither operation completes.

Deadlocks are common failure modes in distributed systems.

Best practices include:

- identical communication order across workers
- explicit barriers
- timeout detection
- careful conditional logic

### Checkpoint Storage Systems

Checkpoint storage systems must support:

| Requirement | Reason |
|---|---|
| High write throughput | Large checkpoints |
| Durability | Survive node failure |
| Parallel access | Distributed restore |
| Scalability | Many checkpoints |
| Metadata consistency | Correct recovery |

Common storage choices:

| Storage type | Typical use |
|---|---|
| Local NVMe | Fast temporary checkpoints |
| Network filesystem | Shared cluster storage |
| Object storage | Durable cloud checkpoints |
| Distributed filesystems | Large HPC systems |

Large training systems often checkpoint to local SSD first, then asynchronously upload to durable object storage.

### Incremental Checkpoints

Saving the full model every time can be expensive.

Incremental checkpointing saves only changed state between checkpoints.

Example strategy:

| Checkpoint type | Frequency |
|---|---|
| Full checkpoint | Every 10,000 steps |
| Incremental delta | Every 500 steps |

This reduces storage bandwidth and checkpoint latency.

Incremental checkpointing becomes increasingly important as model size grows.
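A minimal sketch of the delta idea over plain dictionaries (real systems diff tensor shards, but the bookkeeping is the same; helper names are illustrative):

```python id="v5c8tp"
def make_delta(base, current):
    # Keep only the entries that changed since the last full checkpoint.
    return {k: v for k, v in current.items() if base.get(k) != v}

def apply_delta(base, delta):
    restored = dict(base)
    restored.update(delta)
    return restored

full = {"layer0": [1.0, 2.0], "layer1": [3.0, 4.0]}
step = {"layer0": [1.0, 2.0], "layer1": [3.5, 4.5]}   # only layer1 changed

delta = make_delta(full, step)
assert delta == {"layer1": [3.5, 4.5]}
assert apply_delta(full, delta) == step
```

Recovery replays the most recent full checkpoint and then applies the deltas up to the last completed one; a longer delta chain saves bandwidth at the cost of a slower restore.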

### Fault Tolerance in Large Foundation Models

Large foundation model training may run for weeks across thousands of GPUs.

In such systems:

- failures are expected
- restart automation is mandatory
- checkpoint corruption must be detected
- storage bandwidth becomes a bottleneck

Modern training infrastructure therefore includes:

| Component | Role |
|---|---|
| Distributed checkpointing | Scalable save/restore |
| Elastic launchers | Worker recovery |
| Health monitoring | Failure detection |
| Automatic retry logic | Restart failed jobs |
| Redundant storage | Prevent checkpoint loss |

Without fault tolerance, large-scale training would become economically infeasible because failures would waste too much compute.

### The Economics of Recovery

Fault tolerance is fundamentally about preserving compute investment.

Suppose a training run uses:

- 2,000 GPUs
- 2 weeks
- \$2 per GPU-hour

Total compute cost:

$$
2{,}000 \times 24 \times 14 \times \$2 = \$1{,}344{,}000.
$$

A single unrecoverable failure near the end could waste more than one million dollars of computation.

Checkpointing overhead may seem expensive, but it is small compared with restarting massive training runs from scratch.

### Design Principles

Reliable distributed training systems generally follow several principles:

| Principle | Meaning |
|---|---|
| Save frequently | Reduce lost work |
| Save atomically | Avoid corruption |
| Recover automatically | Minimize human intervention |
| Detect failure early | Reduce wasted compute |
| Keep state synchronized | Preserve correctness |
| Prefer idempotent operations | Safe retries |

As training systems scale, infrastructure engineering becomes as important as model architecture. A trillion-parameter model is not useful if the training system cannot survive ordinary operational failures.

