Scaling Laws

Modern deep learning systems often improve when we increase three quantities: model size, dataset size, and compute. This empirical regularity is called a scaling law.

A scaling law describes how model performance changes as some resource increases. The resource may be the number of parameters, the number of training tokens, the amount of compute, the dataset size, or the inference-time budget. The performance measure may be loss, accuracy, error rate, reward, perplexity, or another task-specific metric.

In its simplest form, a scaling law says:

\text{loss} \approx aX^{-b} + c

Here X is a resource such as compute or dataset size. The constants a, b, and c are estimated from experiments. The exponent b controls how quickly performance improves. The term c represents an irreducible floor under the chosen setup.

The important idea is simple: progress is often predictable over a wide range of scales.
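
As a concrete illustration, the short sketch below plugs made-up constants into this form (a = 10, b = 0.3, c = 1.5 are hypothetical, not fitted to any real system) and shows how the predicted loss approaches the floor c as the resource grows.

# Hypothetical power-law constants, chosen only for illustration.
a, b, c = 10.0, 0.3, 1.5

for X in [1e3, 1e6, 1e9]:
    predicted_loss = a * X ** (-b) + c
    print(f"X={X:.0e}  predicted loss={predicted_loss:.3f}")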

Why Scaling Laws Matter

Scaling laws matter because large models are expensive. Training a frontier model may require large datasets, specialized hardware, distributed systems, long training runs, and careful engineering. Without scaling laws, model development becomes guesswork.

Scaling laws help answer practical questions:

Question | Scaling-law use
Should we train a larger model? | Estimate whether more parameters will reduce loss
Should we collect more data? | Estimate whether the model is data-limited
Should we train longer? | Estimate whether more compute is useful
How large should the model be for a fixed budget? | Allocate compute between parameters and tokens
When will returns diminish? | Estimate the slope of improvement
Which architecture is better? | Compare loss curves at equal compute

A small experiment can estimate the behavior of a larger run. This does not remove uncertainty, but it turns model scaling into an engineering discipline.

The Three Main Scaling Axes

Deep learning systems usually scale along three axes.

First, we can increase the number of model parameters. A larger model has more capacity. It can represent more complex functions and store more statistical structure from the data.

Second, we can increase the number of training examples or tokens. More data reduces overfitting and exposes the model to more patterns.

Third, we can increase compute. Compute combines model size, dataset size, sequence length, batch size, and training steps. For transformer language models, a rough approximation is that training compute grows with the number of parameters times the number of training tokens.

C \propto ND

Here C is compute, N is the number of parameters, and D is the number of training tokens.

This approximation hides many details, but it captures the basic tradeoff. For a fixed compute budget, we can train a larger model on fewer tokens, or a smaller model on more tokens.
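
As a rough sketch, a commonly used approximation for dense transformers is that training compute is about 6ND floating-point operations. The constant is approximate, and the snippet below only illustrates how a fixed budget forces a tradeoff between parameters and tokens.

# Rough compute estimate for dense transformer training: C ≈ 6 * N * D FLOPs.
# The factor 6 is an approximation; all numbers here are illustrative.
budget = 1e22  # total training FLOPs available

# For a fixed budget, choosing a larger model leaves fewer tokens to train on.
for num_params in [1e9, 1e10, 1e11]:
    num_tokens = budget / (6 * num_params)
    print(f"N={num_params:.0e}  D={num_tokens:.1e} tokens")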

Parameter Scaling

Parameter scaling studies how performance changes as the number of trainable parameters increases.

Suppose we train several models with the same dataset and training procedure:

N_1 < N_2 < N_3 < N_4.

If the training setup is stable, the validation loss often decreases as N increases:

L(N) \approx aN^{-b} + c.

A larger model can fit richer functions. In language modeling, this often means better syntax, broader factual recall, stronger pattern completion, and more robust in-context learning.

However, parameter scaling alone is insufficient. A very large model trained on too little data can underuse its capacity. It may memorize training data, overfit, or fail to reach the loss predicted by ideal scaling. Large models need enough data and enough optimization steps.

Data Scaling

Data scaling studies how performance changes as the dataset grows.

For language models, data is often measured in tokens. For vision models, it may be measured in images. For speech models, it may be measured in audio hours. For reinforcement learning, it may be measured in environment steps or trajectories.

A common empirical pattern is:

L(D) \approx aD^{-b} + c.

More data helps because it reduces sampling error. It also increases coverage. A model trained on a larger and more diverse dataset sees more concepts, styles, domains, and edge cases.

But data quality matters. One billion low-quality examples may be less useful than one hundred million carefully filtered examples. Duplicates, corrupted samples, boilerplate, spam, leakage, and low-information data can weaken scaling. In modern systems, data curation is often as important as raw data volume.

Compute Scaling

Compute scaling asks how loss changes as total training compute increases.

A compute scaling law has the form:

L(C) \approx aC^{-b} + c.

This is useful because compute is usually the actual constraint. A research group may have a fixed number of GPUs for a fixed number of days. The key question is how to spend that budget.

For a fixed compute budget, the optimal model size and dataset size must be balanced. A model that is too large may see too little data. A model that is too small may lack capacity, even if it sees many tokens. Compute-optimal training chooses N and D so that neither the model nor the data is the dominant bottleneck.

This gives the central scaling tradeoff:

\text{compute budget} \quad\Rightarrow\quad \text{choose model size and data size together}.

Compute-Optimal Training

Compute-optimal training means choosing model size and training data size to minimize loss under a fixed compute budget.

Suppose we have budget C. Since training compute roughly scales like

C \propto ND,

we cannot increase both N and D arbitrarily. Increasing model size forces us to reduce the number of training tokens, unless compute also increases.

Earlier large language models often used very large parameter counts with relatively fewer training tokens. Later scaling analyses showed that, for many budgets, smaller models trained on more data can be more compute-efficient.

This changed the practical recipe. Instead of only making models larger, modern training often emphasizes training a moderately large model on many more high-quality tokens.

The lesson is not that smaller models are always better. The lesson is that model size must be matched to data size.
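
A minimal sketch of this matching, under two explicit assumptions: training compute follows C ≈ 6ND, and the target ratio is roughly 20 training tokens per parameter (a figure often quoted from compute-optimal scaling analyses; treat it as a rule of thumb, not a law).

import math

# Sketch of compute-optimal allocation under two assumptions:
#   1. training compute C ≈ 6 * N * D
#   2. a target of roughly 20 tokens per parameter (rule of thumb, not a law)
def allocate(budget_flops, tokens_per_param=20):
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(budget_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = allocate(1e23)
print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")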

Loss, Perplexity, and Downstream Ability

Scaling laws usually measure pretraining loss. In language modeling, this is often cross-entropy loss or perplexity.

Perplexity is related to cross-entropy:

\text{perplexity} = \exp(L)

where L is the average negative log-likelihood in nats.
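
For example, a model with an average cross-entropy of 2.05 nats per token has a perplexity of roughly 7.8:

import math

loss_nats = 2.05                  # average cross-entropy per token, in nats
perplexity = math.exp(loss_nats)  # ≈ 7.8
print(f"perplexity ≈ {perplexity:.2f}")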

Lower pretraining loss generally correlates with better downstream performance, but the relationship is imperfect. Some abilities improve smoothly. Others appear more suddenly when the model reaches a certain scale, data quality, or training regime.

Examples of downstream abilities include:

Ability | Relation to scaling
Text prediction | Usually improves smoothly with loss
Translation | Improves with multilingual data and model capacity
Code generation | Strongly depends on code data and scale
Reasoning | Often improves with scale, data, and inference method
Tool use | Depends heavily on instruction data and environment
Long-context use | Depends on architecture, data, and positional design
Robustness | Requires data diversity and evaluation beyond loss

Pretraining loss is a useful proxy, but it does not fully describe model behavior.

Emergent Behavior

Some model capabilities appear weak or absent at small scale, then become visible at larger scale. These are often called emergent abilities.

Examples may include multi-step reasoning, in-context learning, instruction following, code synthesis, or tool-use behavior. The word “emergent” should be used carefully. Sometimes an ability seems sudden because the metric is discrete or thresholded. A smoother underlying improvement may look abrupt when measured with a hard benchmark.

For example, suppose a model’s probability of solving each reasoning step improves gradually. A benchmark that gives credit only for the final exact answer may show little improvement until the probability crosses a threshold. The measured curve appears sudden, even though the underlying competence changed smoothly.
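
A small numerical sketch makes this concrete. If the per-step success probability improves smoothly with scale, the exact-match rate on a ten-step problem can still look like a sudden jump (the probabilities below are invented for illustration):

# Smooth per-step improvement can look abrupt under an exact-match metric.
step_success = [0.50, 0.70, 0.85, 0.95, 0.99]   # invented per-step probabilities
k = 10                                          # reasoning steps per problem

for p in step_success:
    exact_match = p ** k                        # credit only if every step is correct
    print(f"per-step p={p:.2f}  exact-match rate={exact_match:.3f}")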

Thus emergence can be real at the system level, but the measurement may exaggerate its sharpness.

Scaling Architectures

Scaling laws depend on architecture. A transformer, convolutional network, recurrent network, diffusion model, or graph neural network may have different scaling behavior.

In language modeling, transformers became dominant because they scale well with data and compute. Self-attention allows efficient parallel training over sequence positions. Residual connections, layer normalization, and large-batch optimization make very deep networks trainable.

In vision, scaling shifted from convolutional networks to vision transformers and hybrid models for many large-data regimes. CNNs remain strong when data or compute is limited, but transformers often benefit more from very large datasets.

In generative modeling, diffusion models scale through larger denoisers, better noise schedules, more data, and more sampling compute. Diffusion transformers extend this trend by combining transformer scaling with image and video generation.

Architecture determines the slope of the scaling curve. A better architecture may achieve lower loss at the same compute, or it may improve faster as compute increases.

Scaling Data Quality

Raw dataset size is an incomplete measure. The effective size of a dataset depends on quality, diversity, deduplication, and relevance.

A dataset with many duplicates has lower effective diversity. A dataset with noisy labels weakens supervised learning. A dataset with low-quality text may teach undesirable patterns. A dataset with narrow domain coverage may produce brittle models.

Data scaling therefore has two dimensions:

\text{effective data} = \text{quantity} \times \text{quality}.

Quality is difficult to measure. Common methods include filtering, deduplication, classifier-based quality scoring, human review, domain balancing, and contamination detection.
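
As a minimal sketch of one of these steps, exact deduplication can be approximated by hashing normalized documents and keeping the first occurrence. Real pipelines add near-duplicate detection and quality classifiers; this shows only the simplest case.

import hashlib

# Minimal exact-deduplication sketch: hash normalized text, keep first occurrence.
def deduplicate(documents):
    seen = set()
    kept = []
    for doc in documents:
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the cat sat.  ", "A different sentence."]
print(deduplicate(docs))  # keeps two of the three documents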

For foundation models, data mixture design is a central problem. A language model may train on web text, books, academic papers, code, math, multilingual data, dialogue, and synthetic data. The mixture affects capabilities. More code improves programming. More math improves mathematical reasoning. More multilingual data improves cross-lingual transfer.

Scaling the dataset means scaling the mixture intelligently, not merely adding tokens.

Scaling Context Length

For language models and sequence models, context length is another scaling axis.

A model with context length T processes sequences of up to T tokens. Increasing T allows the model to use longer documents, longer conversations, larger codebases, and more retrieved evidence.

However, standard attention has quadratic complexity in sequence length:

\text{attention cost} \propto T^2.

Doubling context length can more than double memory and compute cost. Long-context scaling therefore requires architectural and systems techniques such as sparse attention, sliding-window attention, memory layers, retrieval, recurrence, compression, or more efficient attention kernels.
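
A rough worked example of the quadratic term, assuming the full attention score matrix is materialized (efficient attention kernels avoid this, which is part of why they matter):

# Approximate size of the attention score matrix alone, ignoring everything else.
# Illustrative setting: batch size 1, 32 heads, fp16 scores (2 bytes each).
num_heads, bytes_per_score = 32, 2

for T in [4_096, 8_192, 16_384]:
    scores_bytes = num_heads * T * T * bytes_per_score
    print(f"T={T:>6}  attention scores ≈ {scores_bytes / 2**30:.1f} GiB")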

Long context also requires appropriate training data. A model with a long context window may still fail to use distant information if it was rarely trained on tasks that require long-range dependency.

Inference-Time Scaling

Training scale is only one part of modern deep learning. Some systems also improve when we spend more compute during inference.

Inference-time scaling can appear in several forms:

Method | Idea
Larger decoding budget | Generate more candidate answers
Reranking | Score multiple outputs and choose the best
Chain-of-thought reasoning | Allocate more tokens to intermediate reasoning
Search | Explore possible solution paths
Tool use | Call external systems during inference
Retrieval | Add relevant context before generation
Self-consistency | Sample multiple solutions and aggregate

This changes the old view of a model as a fixed function. A modern model may be part of a larger inference procedure. The same base model can perform better when given more time, more samples, better retrieval, or external tools.
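
A minimal sketch of one such procedure, self-consistency, assuming a hypothetical generate(prompt) function that returns one sampled answer per call:

import random
from collections import Counter

# Self-consistency sketch: sample several answers and keep the most common one.
# generate(prompt) is a hypothetical sampling function, not a real model API.
def self_consistent_answer(generate, prompt, num_samples=8):
    answers = [generate(prompt) for _ in range(num_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / num_samples

# Stand-in generator that simulates noisy model outputs.
def fake_generate(prompt):
    return random.choice(["42", "42", "42", "41", "43"])

print(self_consistent_answer(fake_generate, "What is 6 * 7?"))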

Inference-time scaling is especially important for reasoning, mathematics, code, planning, and agentic tasks.

Scaling and Optimization Stability

Large-scale training is fragile. Scaling laws assume that training remains stable, but instability can break the curve.

Common sources of instability include exploding gradients, poor initialization, bad learning rate schedules, numerical overflow, data bugs, distributed training failures, optimizer state corruption, and loss spikes.

Several techniques improve stability:

Technique | Purpose
Warmup | Prevent early optimization instability
Learning rate decay | Improve convergence late in training
Gradient clipping | Limit extreme updates
Weight decay | Control parameter growth
Normalization | Stabilize activations
Residual connections | Improve gradient flow
Mixed precision care | Avoid overflow and underflow
Loss monitoring | Detect divergence early
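
As a brief PyTorch sketch of two of these techniques, the snippet below combines gradient clipping with linear warmup followed by cosine decay. The model and loss are trivial placeholders for a real training loop.

import math
import torch

# Placeholder model and loss standing in for a real training setup.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, total_steps = 1_000, 100_000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                 # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay to zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    loss = model(torch.randn(32, 512)).pow(2).mean()       # stand-in loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip extreme updates
    optimizer.step()
    scheduler.step()
    if step >= 2:                                          # keep the demo short
        break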

At small scale, some bugs are harmless. At large scale, small defects can waste large amounts of compute. Scaling therefore requires careful measurement, logging, checkpointing, and reproducibility.

Scaling and Evaluation

A scaling law is only as useful as the metric it predicts. If the metric is poor, the scaling law may optimize the wrong target.

Pretraining loss is easy to measure and mathematically clean. But deployed systems need broader evaluation. A model may have low loss while still hallucinating, failing safety tests, leaking private data, producing biased outputs, or performing poorly on rare domains.

Evaluation must scale with the model. As models become stronger, simple benchmarks saturate. New benchmarks must test harder tasks, longer contexts, better adversarial examples, and more realistic settings.

Good evaluation includes:

Evaluation type | Purpose
Held-out loss | Measure general prediction quality
Task benchmarks | Measure capability on specific tasks
Robustness tests | Measure behavior under shift
Safety evaluations | Measure harmful or unreliable behavior
Calibration tests | Measure confidence quality
Human evaluation | Measure usefulness and preference
Long-context tests | Measure retrieval and synthesis ability
Agentic evaluations | Measure tool use and planning

Scaling performance without scaling evaluation leads to false confidence.

Limits of Scaling

Scaling is powerful, but it has limits.

First, returns diminish. Power laws improve steadily, but each successive unit of improvement costs more than the last. If loss decreases as C^{-b}, further gains become increasingly expensive.

Second, data may become scarce. High-quality human-generated data is finite. As models consume more data, future gains may require better filtering, synthetic data, multimodal data, interaction data, or new training objectives.

Third, compute and energy are constrained. Large-scale training requires power, hardware supply chains, cooling, networking, and capital. Efficiency becomes central.

Fourth, some abilities may require more than scale. Causal reasoning, reliable planning, grounded world models, continual learning, and robust agency may need architectural, algorithmic, or data changes.

Scaling should be treated as a strong empirical tool, not as a complete theory of intelligence.

Scaling Laws and PyTorch Practice

In PyTorch, scaling experiments should be designed systematically. A good scaling study changes one or two variables at a time while keeping the rest of the training recipe controlled.

For example, to study parameter scaling, we may train models with different hidden sizes and layer counts:

# Model configurations for a hypothetical parameter-scaling sweep
configs = [
    {"layers": 4,  "hidden": 256},
    {"layers": 8,  "hidden": 512},
    {"layers": 12, "hidden": 768},
    {"layers": 24, "hidden": 1024},
]

For each configuration, we record parameter count, dataset size, training tokens, compute estimate, validation loss, and downstream metrics.

A minimal logging record might look like this:

# Values such as tokens_seen, step, batch_size, lr, and the losses come from the training loop.
record = {
    "params": num_parameters(model),
    "tokens": tokens_seen,
    "steps": step,
    "batch_size": batch_size,
    "learning_rate": lr,
    "train_loss": train_loss,
    "val_loss": val_loss,
}

A simple parameter counter is:

def num_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

For scaling work, the most important engineering habit is consistency. The same tokenizer, data pipeline, optimizer, schedule, initialization method, and evaluation set should be used unless the experiment is explicitly testing one of those variables.

Fitting a Simple Scaling Law

Suppose we train several models and obtain validation losses at different compute budgets. We can fit a power-law curve.

A simple model is:

L(C) = aC^{-b} + c.

In practice, fitting this equation requires care because c, the irreducible loss floor, is unknown. A rough first analysis often uses log-log plots. If c is small or ignored over the observed range, then

L(C) \approx aC^{-b}.

Taking logs gives

\log L(C) \approx \log a - b\log C.

This becomes a linear regression problem in log space.

import torch

# Example measurements (illustrative values, not from real training runs)
compute = torch.tensor([1e18, 3e18, 1e19, 3e19, 1e20])
loss = torch.tensor([3.20, 2.85, 2.50, 2.25, 2.05])

# Work in log space: log L ≈ log a - b * log C
x = torch.log(compute)
y = torch.log(loss)

# Design matrix with an intercept column
X = torch.stack([torch.ones_like(x), x], dim=1)

# Solve least squares: y ≈ alpha + beta * x
# (y is treated as a single-column matrix for the solve)
solution = torch.linalg.lstsq(X, y.unsqueeze(1)).solution

alpha = solution[0, 0]
beta = solution[1, 0]

a = torch.exp(alpha)
b = -beta

print("a:", a.item())
print("b:", b.item())

This simplified fit should not be used as a final scaling forecast, but it illustrates the method. Serious scaling studies fit more complete equations, account for irreducible loss, and test predictions against held-out training runs.

Practical Scaling Checklist

Before scaling a model, check the following:

Area | Question
Data | Is the dataset clean, diverse, deduplicated, and relevant?
Model | Does the architecture train stably at smaller scale?
Optimizer | Are learning rate, warmup, decay, and weight decay tuned?
Compute | Is the run compute-optimal for the budget?
Systems | Can the pipeline keep accelerators utilized?
Checkpointing | Can training recover from failure?
Evaluation | Are metrics broad enough for the intended use?
Safety | Are harmful behaviors measured before deployment?
Reproducibility | Are seeds, configs, code versions, and data versions tracked?

Scaling amplifies everything. It amplifies model capability, but it also amplifies data problems, optimization bugs, infrastructure defects, and evaluation gaps.

Summary

Scaling laws describe how performance changes with model size, data size, and compute. They are empirical, but they are among the most useful tools in modern deep learning.

The central lesson is that scale must be balanced. More parameters help only when there is enough data and compute. More data helps only when the model has enough capacity. More compute helps only when it is allocated well.

For modern PyTorch systems, scaling is not only a mathematical idea. It is an experimental discipline. It requires controlled runs, careful logging, stable training recipes, clean data, reliable infrastructure, and evaluation that measures more than loss.