Training Foundation Models

Foundation models are large neural networks trained on broad datasets and adapted to many downstream tasks. Examples include large language models, multimodal transformers, vision foundation models, audio-language systems, and general-purpose embedding models.

Training these systems requires coordinated advances in:

  • optimization
  • distributed systems
  • data engineering
  • numerical stability
  • infrastructure reliability
  • hardware utilization

Foundation model training differs from ordinary deep learning mainly in scale. The underlying mathematical principles remain similar, but the operational constraints become much more severe.

A small model may train on one GPU for hours. A foundation model may require thousands of accelerators running continuously for weeks or months.

What Defines a Foundation Model

A foundation model typically has several properties:

Property                         Description
Large parameter count            Millions to trillions of parameters
Broad pretraining data           Diverse internet-scale datasets
General-purpose representations  Useful across many tasks
Transferability                  Fine-tuned or prompted for downstream use
Emergent capabilities            Behaviors not explicitly supervised

Most modern foundation models are transformer-based because transformers scale efficiently with data and compute.

Common foundation model categories include:

Type               Example tasks
Language models    Text generation, reasoning, coding
Vision models      Classification, segmentation, detection
Multimodal models  Vision-language understanding
Audio models       Speech recognition, synthesis
Embedding models   Retrieval and semantic search

Scaling Laws

One of the central discoveries in modern deep learning is that model performance often follows predictable scaling behavior.

Performance depends on:

  • parameter count
  • training data size
  • compute budget

Empirical scaling laws often resemble power-law relationships:

L(N) = A N^{-\alpha} + C,

where:

Symbol  Meaning
L(N)    Loss
N       Scale variable
A, C    Constants
α       Scaling exponent

Increasing model size, data, or compute generally improves performance, though with diminishing returns.
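
As a rough illustration, the sketch below evaluates such a power law in Python. The constants are illustrative, loosely inspired by published Chinchilla-style fits, and do not describe any particular model family.

    # Sketch: evaluate a hypothetical power law L(N) = A * N^(-alpha) + C.
    # The constants are illustrative, not fitted to any real training run.
    A, alpha, C = 406.4, 0.34, 1.69

    def predicted_loss(n_params: float) -> float:
        """Predicted loss at n_params parameters under the assumed power law."""
        return A * n_params ** -alpha + C

    for n in (1e8, 1e9, 1e10, 1e11):
        print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")

Each tenfold increase in scale reduces the predicted loss by less than the previous one, which is the diminishing-returns behavior described above.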

Scaling laws influenced modern foundation model design because they suggested that larger models trained on more data would continue improving predictably.

This shifted research from hand-designed architectures toward large-scale optimization and infrastructure.

Compute-Optimal Training

Training budgets are finite. A key question becomes how to allocate compute between:

  • larger models
  • more training tokens
  • longer training duration

Suppose:

Variable  Meaning
P         Parameter count
T         Training tokens
C         Total compute

Approximate transformer training cost scales as:

C \propto P T.

If the model is too large for the available data, parameters are undertrained. If the dataset is too large for the model, capacity may be insufficient.

Modern training recipes attempt to balance model size and data volume to maximize performance for a fixed compute budget.

This idea is often called compute-optimal training.
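
A minimal sketch of this balancing act, assuming the common approximation that transformer training costs roughly 6 FLOPs per parameter per token (so C ≈ 6PT) and a tokens-per-parameter ratio near 20, a frequently cited compute-optimal heuristic:

    # Sketch: split a fixed compute budget between parameters and tokens.
    # Assumes C = 6 * P * T (a standard rough approximation) and T = r * P,
    # where r ~ 20 is a commonly cited heuristic, used here for illustration.
    def compute_optimal_split(c_flops: float, r: float = 20.0):
        p = (c_flops / (6.0 * r)) ** 0.5   # solve C = 6 * P * (r * P) for P
        t = r * p
        return p, t

    p, t = compute_optimal_split(1e23)     # hypothetical 1e23-FLOP budget
    print(f"params ~{p:.2e}, tokens ~{t:.2e}")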

Token-Based Training

Large language models are usually trained in terms of tokens rather than epochs.

A token is a subword unit produced by tokenization.

Example:

"foundation models are powerful"

might tokenize into:

["foundation", " models", " are", " powerful"]

Training progress is often measured in tokens processed.

For example:

Model                 Approximate training tokens
Small language model  Billions
Mid-scale LLM         Hundreds of billions
Frontier-scale LLM    Trillions

Unlike classical datasets, internet-scale corpora may not have clean epoch boundaries. Data pipelines therefore stream tokens continuously.
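
A toy sketch of epoch-free, token-counted training. The whitespace "tokenizer" below is only a stand-in for a real subword tokenizer:

    # Sketch: epoch-free token streaming with a toy whitespace "tokenizer".
    # Real systems use subword tokenizers (BPE, unigram); this stand-in only
    # illustrates measuring progress in tokens rather than epochs.
    from itertools import cycle

    corpus = ["foundation models are powerful", "tokens not epochs"]
    tokens_seen = 0
    budget = 20                        # hypothetical token budget for the run

    for doc in cycle(corpus):          # stream documents indefinitely
        ids = doc.split()              # stand-in for tokenizer.encode(doc)
        tokens_seen += len(ids)
        if tokens_seen >= budget:      # stop on token count, not epoch count
            break

    print(f"processed {tokens_seen} tokens")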

Data Pipelines

Foundation models require enormous datasets.

Data engineering becomes a major component of the system.

Typical stages include:

Stage                    Purpose
Crawling                 Collect raw data
Deduplication            Remove repeated content
Filtering                Remove low-quality data
Language identification  Separate languages
Safety filtering         Remove harmful content
Tokenization             Convert text to token IDs
Sharding                 Split data across workers

Data quality strongly affects model quality.

A smaller high-quality dataset may outperform a much larger noisy dataset.
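
As one concrete stage, the sketch below implements exact deduplication by hashing normalized text. Production pipelines typically add fuzzy deduplication (for example, MinHash) on top of this simplest form:

    # Sketch: exact deduplication by hashing whitespace- and case-normalized
    # text. Real pipelines also use fuzzy matching to catch near-duplicates.
    import hashlib

    def dedup(docs):
        seen, unique = set(), []
        for doc in docs:
            key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
            if key not in seen:
                seen.add(key)
                unique.append(doc)
        return unique

    print(dedup(["Hello world", "hello   world", "something else"]))
    # -> ['Hello world', 'something else']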

Streaming Datasets

Large datasets are rarely stored as one monolithic file.

Instead, they are sharded into many files:

shard_00000.bin
shard_00001.bin
shard_00002.bin
...

Workers stream shards in parallel.

Advantages include:

Benefit              Reason
Parallel reading     Multiple workers load simultaneously
Fault tolerance      Corruption affects only one shard
Distributed access   Nodes read different shards
Incremental updates  New shards can be added

Streaming avoids loading the entire dataset into memory.
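
A minimal sketch of shard streaming, assuming hypothetical shard files like those above and a simple round-robin assignment of shards to workers:

    # Sketch: each worker streams only its own subset of shards.
    # Shard names and record format are hypothetical; real systems use
    # packed token binaries with an index.
    import glob

    def shard_iterator(pattern: str, rank: int, world_size: int):
        shards = sorted(glob.glob(pattern))
        for path in shards[rank::world_size]:    # round-robin assignment
            with open(path, "rb") as f:
                while chunk := f.read(1 << 20):  # stream 1 MiB at a time
                    yield chunk

    # Worker 0 of 8 reads shard_00000, shard_00008, shard_00016, ...
    stream = shard_iterator("shard_*.bin", rank=0, world_size=8)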

Transformer Training

Most foundation models are transformers.

A simplified decoder-only transformer computes:

x \rightarrow \text{Embedding} \rightarrow \text{Transformer Blocks} \rightarrow \text{Output Projection}.

Each transformer block contains:

  • self-attention
  • feedforward networks
  • residual connections
  • normalization layers

Training is autoregressive.

Given tokens:

[t_1, t_2, \ldots, t_n],

the model predicts each token t_i from the earlier tokens t_{<i}.

The objective is usually next-token prediction:

L = - \sum_{i=1}^{n} \log p_\theta(t_i \mid t_{<i}).
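
A minimal PyTorch sketch of this objective (PyTorch is an assumed framework here; `model` is any module mapping token IDs to logits):

    # Sketch: next-token prediction as a shifted cross-entropy.
    import torch
    import torch.nn.functional as F

    def next_token_loss(model, tokens):       # tokens: [batch, seq]
        logits = model(tokens)                # [batch, seq, vocab]
        shift_logits = logits[:, :-1, :]      # outputs at positions 0..n-2
        shift_labels = tokens[:, 1:]          # each predicts the next token
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
        )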

Mixed Precision Training

Foundation model training almost always uses mixed precision.

Instead of float32 everywhere, systems use:

Format  Common use
fp16    Earlier mixed precision systems
bf16    Modern large-scale training
fp32    Master weights or sensitive operations

Mixed precision reduces:

  • memory usage
  • communication volume
  • training time

Bfloat16 became especially important because it preserves the exponent range of float32, improving numerical stability.

A typical configuration:

Tensor type             Precision
Activations             bf16
Gradients               bf16
Matrix multiplications  bf16
Optimizer state         fp32
Master weights          fp32
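
A minimal PyTorch sketch of this configuration, reusing `next_token_loss` from the sketch above; `model`, `optimizer`, and `loader` are assumed placeholders:

    # Sketch: bf16 autocast with fp32 master weights. Under autocast the
    # parameters stay fp32; matmuls and activations run in bf16. Unlike
    # fp16, bf16 generally needs no loss scaler.
    import torch

    for tokens in loader:                     # hypothetical data loader
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = next_token_loss(model, tokens)
        loss.backward()                       # grads land in the param dtype (fp32)
        optimizer.step()                      # optimizer state kept in fp32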

Parallelism Strategies

Foundation models are too large for simple data parallelism alone.

Modern systems combine multiple parallelism dimensions.

Parallelism           Purpose
Data parallelism      Scale training throughput
Tensor parallelism    Split large matrix operations
Pipeline parallelism  Split sequential layers
Sharded optimizers    Reduce replicated optimizer state
Expert parallelism    Route sparse experts across devices

Large training systems often organize GPUs into groups.

Example:

Parallelism dimension  Size
Data parallel          16
Tensor parallel        8
Pipeline parallel      4

Total GPUs:

16 \times 8 \times 4 = 512.

Each GPU participates in several communication groups simultaneously.
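
One way to see this is to map a global rank to its coordinate in each dimension. The ordering below is one possible convention; real frameworks choose their own:

    # Sketch: global rank -> (data, pipeline, tensor) coordinates for the
    # 16 x 8 x 4 layout above.
    DP, TP, PP = 16, 8, 4             # data, tensor, pipeline parallel sizes

    def coords(rank: int):
        tp = rank % TP                # fastest-varying: tensor parallel index
        pp = (rank // TP) % PP
        dp = rank // (TP * PP)
        return dp, pp, tp

    assert DP * TP * PP == 512
    print(coords(0), coords(511))     # -> (0, 0, 0) (15, 3, 7)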

Memory Optimization

Memory becomes a dominant constraint.

Main memory consumers include:

Component        Scaling behavior
Parameters       Proportional to model size
Optimizer state  Often 2 to 8 times parameter size
Gradients        Similar to parameter size
Activations      Depend on batch size and sequence length

Techniques used to reduce memory include:

Technique                 Purpose
Activation checkpointing  Trade compute for memory
Gradient accumulation     Simulate large batches
ZeRO/FSDP                 Shard optimizer state and parameters
Quantization              Lower-precision storage
Offloading                Move state to CPU or NVMe

Without these methods, large models may not fit even across many GPUs.
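
As one example, gradient accumulation is simple to sketch. This continues the PyTorch sketches above, with `accum`, `model`, `optimizer`, and `loader` as assumed placeholders:

    # Sketch: accumulate gradients over several micro-batches so that one
    # optimizer step sees the gradient of a larger effective batch.
    accum = 8                                 # hypothetical accumulation factor

    optimizer.zero_grad(set_to_none=True)
    for step, tokens in enumerate(loader):
        loss = next_token_loss(model, tokens) / accum  # average across micro-batches
        loss.backward()                       # gradients add up in .grad
        if (step + 1) % accum == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)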

Throughput Optimization

Training cost is dominated by accelerator time.

Suppose a training run uses:

  • 2,000 GPUs
  • $2 per GPU-hour
  • 30 days

Cost:

2000 \times 24 \times 30 \times \$2 = \$2{,}880{,}000.

Even small inefficiencies become expensive.

Important throughput metrics include:

Metric                  Meaning
Tokens per second       Language-model throughput
FLOPs utilization       Fraction of peak compute used
GPU utilization         Accelerator activity
Communication overhead  Time spent synchronizing
Data pipeline latency   Waiting for input data
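
A sketch of the FLOPs-utilization estimate (often called MFU), using the common approximation of about 6 FLOPs per parameter per token; the peak-throughput figure below is a hypothetical hardware number:

    # Sketch: model FLOPs utilization from token throughput.
    def mfu(params: float, tokens_per_sec: float, n_gpus: int,
            peak_flops_per_gpu: float = 1e15) -> float:  # hypothetical peak
        achieved = 6.0 * params * tokens_per_sec         # ~6 FLOPs/param/token
        return achieved / (n_gpus * peak_flops_per_gpu)

    # e.g., a 70e9-parameter model at 4e5 tokens/s on 512 GPUs
    print(f"MFU ~ {mfu(70e9, 4e5, 512):.1%}")            # -> MFU ~ 32.8%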

High-performance training systems carefully overlap:

  • communication
  • data loading
  • computation
  • checkpointing

Learning Rate Schedules

Foundation models often use carefully tuned learning rate schedules.

A common pattern:

  1. warmup
  2. plateau or cosine decay
  3. gradual reduction

Warmup stabilizes early optimization.

Example cosine schedule:

\eta_t = \eta_{\min} + \frac{1}{2} (\eta_{\max} - \eta_{\min}) \left( 1 + \cos\left(\frac{\pi t}{T}\right) \right).

genui{“math_block_widget_always_prefetch_v2”:{“content”:"\eta_t=\eta_{\min}+\frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\left(\frac{\pi t}{T}\right)\right)"}}

Warmup is especially important for large batch training because early gradients can be unstable.
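
A sketch of linear warmup followed by the cosine decay above; all hyperparameters are illustrative:

    # Sketch: warmup-then-cosine learning rate schedule.
    import math

    def lr_at(step: int, warmup: int, total: int,
              lr_max: float = 3e-4, lr_min: float = 3e-5) -> float:
        if step < warmup:                     # linear warmup from zero
            return lr_max * step / warmup
        t = (step - warmup) / max(1, total - warmup)
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

    # lr_at(0, ...) == 0, peaks at lr_max when warmup ends, decays to lr_min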

Gradient Stability

Large models are sensitive to numerical instability.

Common problems include:

Problem              Symptom
Exploding gradients  Loss divergence
Vanishing gradients  Slow learning
Overflow             NaNs
Underflow            Zero gradients
Activation spikes    Instability

Stabilization techniques include:

Technique               Purpose
Gradient clipping       Limit update magnitude
Normalization layers    Stabilize activations
Residual connections    Improve gradient flow
Careful initialization  Prevent early divergence
Adaptive optimizers     Stabilize updates

Gradient clipping often uses:

g \leftarrow g \cdot \min\left( 1, \frac{\tau}{\|g\|} \right),

where τ is the clipping threshold.
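
A sketch of this rule applied over a parameter list; PyTorch ships the same operation as torch.nn.utils.clip_grad_norm_:

    # Sketch: global-norm gradient clipping matching the formula above.
    import torch

    def clip_grad_norm(params, tau: float):
        grads = [p.grad for p in params if p.grad is not None]
        total = torch.sqrt(sum((g.detach() ** 2).sum() for g in grads))
        scale = min(1.0, tau / (total.item() + 1e-6))  # eps avoids div by zero
        for g in grads:
            g.mul_(scale)          # scale == 1.0 leaves gradients unchanged
        return total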

Evaluation During Training

Foundation model evaluation is expensive.

Evaluations may include:

Evaluation type              Example
Validation perplexity        Language modeling
Benchmark suites             Reasoning and QA
Human preference evaluation  Alignment
Safety testing               Harmful outputs
Retrieval quality            Embedding models

Frequent evaluation slows training, but infrequent evaluation risks wasting compute on bad runs.

Many systems therefore run lightweight validation frequently and expensive benchmark suites less often.
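
A sketch of that scheduling pattern; the intervals and evaluation functions are hypothetical placeholders:

    # Sketch: cheap validation often, expensive benchmark suites rarely.
    LIGHT_EVERY, HEAVY_EVERY = 1_000, 25_000  # hypothetical step intervals

    def maybe_evaluate(step: int):
        if step % LIGHT_EVERY == 0:
            run_validation_perplexity()       # fast, small held-out set
        if step % HEAVY_EVERY == 0:
            run_benchmark_suite()             # slow, run sparingly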

Alignment and Post-Training

Pretraining produces a general-purpose model, but not necessarily a helpful or safe assistant.

Modern systems often add:

Stage                    Purpose
Supervised fine-tuning   Teach instruction following
Preference optimization  Align outputs with preferences
RLHF                     Reinforcement learning from human feedback
Constitutional methods   Rule-guided alignment
Safety tuning            Reduce harmful behavior

The final model is therefore the result of several training stages, not just one pretraining run.

Infrastructure Reliability

Foundation model training depends heavily on infrastructure engineering.

Key requirements include:

Requirement                Reason
Fault tolerance            Failures are inevitable
Distributed checkpointing  Large model state
Monitoring systems         Detect hangs and instability
Cluster scheduling         Coordinate resources
High-bandwidth networking  Synchronization efficiency
Storage throughput         Massive datasets and checkpoints

At large scale, infrastructure limitations often dominate algorithmic limitations.

Environmental and Economic Cost

Foundation model training consumes substantial energy and compute resources.

Costs include:

  • accelerator manufacturing
  • electricity
  • cooling
  • datacenter infrastructure
  • engineering labor

Efficiency improvements therefore matter economically and environmentally.

Important efficiency directions include:

Direction                      Goal
Better optimizers              Fewer training steps
Sparse models                  Lower compute
Quantization                   Lower memory and energy
Smaller high-quality datasets  Better data efficiency
Efficient architectures        Higher throughput

Emergent Behavior

As models scale, new capabilities sometimes appear unexpectedly.

Examples may include:

  • in-context learning
  • chain-of-thought reasoning
  • tool use
  • multilingual transfer
  • coding ability

These behaviors are called emergent because they were not explicitly programmed.

However, emergence is often gradual rather than sudden when measured carefully.

Understanding why scaling produces these capabilities remains an active research area.

The Central Constraint

Foundation model training is fundamentally constrained by:

\text{compute} \times \text{data} \times \text{memory} \times \text{communication}.

Every design decision affects one or more of these factors.

For example:

Decision        Tradeoff
Larger model    Better capacity, higher memory
Longer context  Better reasoning, more compute
More GPUs       More throughput, more communication
Larger batches  Better hardware utilization, harder optimization

Training systems therefore balance mathematical efficiency with systems efficiency.

From Research to Infrastructure

Early deep learning research focused primarily on architecture design. Foundation model training shifted much of the challenge toward systems engineering.

Modern training requires expertise in:

  • optimization theory
  • distributed systems
  • networking
  • compiler systems
  • numerical methods
  • storage infrastructure
  • data engineering

As models scale, the boundary between machine learning research and large-scale systems engineering becomes increasingly blurred.

A modern foundation model is simultaneously:

  • a statistical learning system
  • a distributed computation graph
  • a large-scale numerical optimization problem
  • a data processing pipeline
  • a fault-tolerant infrastructure system