Scaling Laws

Modern deep learning systems often improve when we increase three quantities: model size, dataset size, and compute. This empirical regularity is called a scaling law.

A scaling law describes how model performance changes as some resource increases. The resource may be the number of parameters, the number of training tokens, the amount of compute, the dataset size, or the inference-time budget. The performance measure may be loss, accuracy, error rate, reward, perplexity, or another task-specific metric.

In its simplest form, a scaling law says:

\text{loss} \approx aX^{-b} + c

Here X is a resource such as compute or dataset size. The constants a, b, and c are estimated from experiments. The exponent b controls how quickly performance improves. The term c represents an irreducible floor under the chosen setup.

The important idea is simple: progress is often predictable over a wide range of scales.
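
As a concrete illustration, the short sketch below plugs made-up constants into this form (a = 10, b = 0.3, c = 1.5 are hypothetical, not fitted to any real system) and shows how the predicted loss approaches the floor c as the resource grows.

# Hypothetical power-law constants, chosen only for illustration.
a, b, c = 10.0, 0.3, 1.5

for X in [1e3, 1e6, 1e9]:
    predicted_loss = a * X ** (-b) + c
    print(f"X={X:.0e}  predicted loss={predicted_loss:.3f}")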

Why Scaling Laws Matter

Scaling laws matter because large models are expensive. Training a frontier model may require large datasets, specialized hardware, distributed systems, long training runs, and careful engineering. Without scaling laws, model development becomes guesswork.

Scaling laws help answer practical questions:

Question | Scaling-law use
Should we train a larger model? | Estimate whether more parameters will reduce loss
Should we collect more data? | Estimate whether the model is data-limited
Should we train longer? | Estimate whether more compute is useful
How large should the model be for a fixed budget? | Allocate compute between parameters and tokens
When will returns diminish? | Estimate the slope of improvement
Which architecture is better? | Compare loss curves at equal compute

A small experiment can estimate the behavior of a larger run. This does not remove uncertainty, but it turns model scaling into an engineering discipline.

The Three Main Scaling Axes

Deep learning systems usually scale along three axes.

First, we can increase the number of model parameters. A larger model has more capacity. It can represent more complex functions and store more statistical structure from the data.

Second, we can increase the number of training examples or tokens. More data reduces overfitting and exposes the model to more patterns.

Third, we can increase compute. Compute combines model size, dataset size, sequence length, batch size, and training steps. For transformer language models, a rough approximation is that training compute grows with the number of parameters times the number of training tokens.

C \propto ND

Here C is compute, N is the number of parameters, and D is the number of training tokens.

This approximation hides many details, but it captures the basic tradeoff. For a fixed compute budget, we can train a larger model on fewer tokens, or a smaller model on more tokens.
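
As a rough sketch, a commonly used approximation for dense transformers is that training compute is about 6ND floating-point operations. The constant is approximate, and the snippet below only illustrates how a fixed budget forces a tradeoff between parameters and tokens.

# Rough compute estimate for dense transformer training: C ≈ 6 * N * D FLOPs.
# The factor 6 is an approximation; all numbers here are illustrative.
budget = 1e22  # total training FLOPs available

# For a fixed budget, choosing a larger model leaves fewer tokens to train on.
for num_params in [1e9, 1e10, 1e11]:
    num_tokens = budget / (6 * num_params)
    print(f"N={num_params:.0e}  D={num_tokens:.1e} tokens")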

Parameter Scaling

Parameter scaling studies how performance changes as the number of trainable parameters increases.

Suppose we train several models with the same dataset and training procedure:

N_1 < N_2 < N_3 < N_4.

If the training setup is stable, the validation loss often decreases as N increases:

L(N) \approx aN^{-b} + c.

A larger model can fit richer functions. In language modeling, this often means better syntax, broader factual recall, stronger pattern completion, and more robust in-context learning.

However, parameter scaling alone is insufficient. A very large model trained on too little data can underuse its capacity. It may memorize training data, overfit, or fail to reach the loss predicted by ideal scaling. Large models need enough data and enough optimization steps.

Data Scaling

Data scaling studies how performance changes as the dataset grows.

For language models, data is often measured in tokens. For vision models, it may be measured in images. For speech models, it may be measured in audio hours. For reinforcement learning, it may be measured in environment steps or trajectories.

A common empirical pattern is:

L(D) \approx aD^{-b} + c.

More data helps because it reduces sampling error. It also increases coverage. A model trained on a larger and more diverse dataset sees more concepts, styles, domains, and edge cases.

But data quality matters. One billion low-quality examples may be less useful than one hundred million carefully filtered examples. Duplicates, corrupted samples, boilerplate, spam, leakage, and low-information data can weaken scaling. In modern systems, data curation is often as important as raw data volume.

Compute Scaling

Compute scaling asks how loss changes as total training compute increases.

A compute scaling law has the form:

L(C) \approx aC^{-b} + c.

This is useful because compute is usually the actual constraint. A research group may have a fixed number of GPUs for a fixed number of days. The key question is how to spend that budget.

For a fixed compute budget, the optimal model size and dataset size must be balanced. A model that is too large may see too little data. A model that is too small may lack capacity, even if it sees many tokens. Compute-optimal training chooses N and D so that neither the model nor the data is the dominant bottleneck.

This gives the central scaling tradeoff:

\text{compute budget} \quad\Rightarrow\quad \text{choose model size and data size together}.

Compute-Optimal Training

Compute-optimal training means choosing model size and training data size to minimize loss under a fixed compute budget.

Suppose we have budget C. Since training compute roughly scales like

C \propto ND,

we cannot increase both N and D arbitrarily. Increasing model size forces us to reduce the number of training tokens, unless compute also increases.

Earlier large language models often used very large parameter counts with relatively fewer training tokens. Later scaling analyses showed that, for many budgets, smaller models trained on more data can be more compute-efficient.

This changed the practical recipe. Instead of only making models larger, modern training often emphasizes training a moderately large model on many more high-quality tokens.

The lesson is not that smaller models are always better. The lesson is that model size must be matched to data size.
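
A minimal sketch of this matching, under two explicit assumptions: training compute follows C ≈ 6ND, and the target ratio is roughly 20 training tokens per parameter (a figure often quoted from compute-optimal scaling analyses; treat it as a rule of thumb, not a law).

import math

# Sketch of compute-optimal allocation under two assumptions:
#   1. training compute C ≈ 6 * N * D
#   2. a target of roughly 20 tokens per parameter (rule of thumb, not a law)
def allocate(budget_flops, tokens_per_param=20):
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(budget_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = allocate(1e23)
print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")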

Loss, Perplexity, and Downstream Ability

Scaling laws usually measure pretraining loss. In language modeling, this is often cross-entropy loss or perplexity.

Perplexity is related to cross-entropy:

\text{perplexity} = \exp(L)

where L is the average negative log-likelihood in nats.
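
For example, a model with an average cross-entropy of 2.05 nats per token has a perplexity of roughly 7.8:

import math

loss_nats = 2.05                  # average cross-entropy per token, in nats
perplexity = math.exp(loss_nats)  # ≈ 7.8
print(f"perplexity ≈ {perplexity:.2f}")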

Lower pretraining loss generally correlates with better downstream performance, but the relationship is imperfect. Some abilities improve smoothly. Others appear more suddenly when the model reaches a certain scale, data quality, or training regime.

Examples of downstream abilities include:

Ability | Relation to scaling
Text prediction | Usually improves smoothly with loss
Translation | Improves with multilingual data and model capacity
Code generation | Strongly depends on code data and scale
Reasoning | Often improves with scale, data, and inference method
Tool use | Depends heavily on instruction data and environment
Long-context use | Depends on architecture, data, and positional design
Robustness | Requires data diversity and evaluation beyond loss

Pretraining loss is a useful proxy, but it does not fully describe model behavior.

Emergent Behavior

Some model capabilities appear weak or absent at small scale, then become visible at larger scale. These are often called emergent abilities.

Examples may include multi-step reasoning, in-context learning, instruction following, code synthesis, or tool-use behavior. The word “emergent” should be used carefully. Sometimes an ability seems sudden because the metric is discrete or thresholded. A smoother underlying improvement may look abrupt when measured with a hard benchmark.

For example, suppose a model’s probability of solving each reasoning step improves gradually. A benchmark that gives credit only for the final exact answer may show little improvement until the probability crosses a threshold. The measured curve appears sudden, even though the underlying competence changed smoothly.
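
A small numerical sketch makes this concrete. If the per-step success probability improves smoothly with scale, the exact-match rate on a ten-step problem can still look like a sudden jump (the probabilities below are invented for illustration):

# Smooth per-step improvement can look abrupt under an exact-match metric.
step_success = [0.50, 0.70, 0.85, 0.95, 0.99]   # invented per-step probabilities
k = 10                                          # reasoning steps per problem

for p in step_success:
    exact_match = p ** k                        # credit only if every step is correct
    print(f"per-step p={p:.2f}  exact-match rate={exact_match:.3f}")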

Thus emergence can be real at the system level, but the measurement may exaggerate its sharpness.

Scaling Architectures

Scaling laws depend on architecture. A transformer, convolutional network, recurrent network, diffusion model, or graph neural network may have different scaling behavior.

In language modeling, transformers became dominant because they scale well with data and compute. Self-attention allows efficient parallel training over sequence positions. Residual connections, layer normalization, and large-batch optimization make very deep networks trainable.

In vision, scaling shifted from convolutional networks to vision transformers and hybrid models for many large-data regimes. CNNs remain strong when data or compute is limited, but transformers often benefit more from very large datasets.

In generative modeling, diffusion models scale through larger denoisers, better noise schedules, more data, and more sampling compute. Diffusion transformers extend this trend by combining transformer scaling with image and video generation.

Architecture determines the slope of the scaling curve. A better architecture may achieve lower loss at the same compute, or it may improve faster as compute increases.

Scaling Data Quality

Raw dataset size is an incomplete measure. The effective size of a dataset depends on quality, diversity, deduplication, and relevance.

A dataset with many duplicates has lower effective diversity. A dataset with noisy labels weakens supervised learning. A dataset with low-quality text may teach undesirable patterns. A dataset with narrow domain coverage may produce brittle models.

Data scaling therefore has two dimensions:

\text{effective data} = \text{quantity} \times \text{quality}.

Quality is difficult to measure. Common methods include filtering, deduplication, classifier-based quality scoring, human review, domain balancing, and contamination detection.
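
As a minimal sketch of one of these steps, exact deduplication can be approximated by hashing normalized documents and keeping the first occurrence. Real pipelines add near-duplicate detection and quality classifiers; this shows only the simplest case.

import hashlib

# Minimal exact-deduplication sketch: hash normalized text, keep first occurrence.
def deduplicate(documents):
    seen = set()
    kept = []
    for doc in documents:
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the cat sat.  ", "A different sentence."]
print(deduplicate(docs))  # keeps two of the three documents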

For foundation models, data mixture design is a central problem. A language model may train on web text, books, academic papers, code, math, multilingual data, dialogue, and synthetic data. The mixture affects capabilities. More code improves programming. More math improves mathematical reasoning. More multilingual data improves cross-lingual transfer.

Scaling the dataset means scaling the mixture intelligently, not merely adding tokens.

Scaling Context Length

For language models and sequence models, context length is another scaling axis.

A model with context length T processes sequences of up to T tokens. Increasing T allows the model to use longer documents, longer conversations, larger codebases, and more retrieved evidence.

However, standard attention has quadratic complexity in sequence length:

\text{attention cost} \propto T^2.

Doubling context length can more than double memory and compute cost. Long-context scaling therefore requires architectural and systems techniques such as sparse attention, sliding-window attention, memory layers, retrieval, recurrence, compression, or more efficient attention kernels.
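
A rough worked example of the quadratic term, assuming the full attention score matrix is materialized (efficient attention kernels avoid this, which is part of why they matter):

# Approximate size of the attention score matrix alone, ignoring everything else.
# Illustrative setting: batch size 1, 32 heads, fp16 scores (2 bytes each).
num_heads, bytes_per_score = 32, 2

for T in [4_096, 8_192, 16_384]:
    scores_bytes = num_heads * T * T * bytes_per_score
    print(f"T={T:>6}  attention scores ≈ {scores_bytes / 2**30:.1f} GiB")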

Long context also requires appropriate training data. A model with a long context window may still fail to use distant information if it was rarely trained on tasks that require long-range dependency.

Inference-Time Scaling

Training scale is only one part of modern deep learning. Some systems also improve when we spend more compute during inference.

Inference-time scaling can appear in several forms:

Method | Idea
Larger decoding budget | Generate more candidate answers
Reranking | Score multiple outputs and choose the best
Chain-of-thought reasoning | Allocate more tokens to intermediate reasoning
Search | Explore possible solution paths
Tool use | Call external systems during inference
Retrieval | Add relevant context before generation
Self-consistency | Sample multiple solutions and aggregate

This changes the old view of a model as a fixed function. A modern model may be part of a larger inference procedure. The same base model can perform better when given more time, more samples, better retrieval, or external tools.
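
A minimal sketch of one such procedure, self-consistency, assuming a hypothetical generate(prompt) function that returns one sampled answer per call:

import random
from collections import Counter

# Self-consistency sketch: sample several answers and keep the most common one.
# generate(prompt) is a hypothetical sampling function, not a real model API.
def self_consistent_answer(generate, prompt, num_samples=8):
    answers = [generate(prompt) for _ in range(num_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / num_samples

# Stand-in generator that simulates noisy model outputs.
def fake_generate(prompt):
    return random.choice(["42", "42", "42", "41", "43"])

print(self_consistent_answer(fake_generate, "What is 6 * 7?"))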

Inference-time scaling is especially important for reasoning, mathematics, code, planning, and agentic tasks.

Scaling and Optimization Stability

Large-scale training is fragile. Scaling laws assume that training remains stable, but instability can break the curve.

Common sources of instability include exploding gradients, poor initialization, bad learning rate schedules, numerical overflow, data bugs, distributed training failures, optimizer state corruption, and loss spikes.

Several techniques improve stability:

Technique | Purpose
Warmup | Prevent early optimization instability
Learning rate decay | Improve convergence late in training
Gradient clipping | Limit extreme updates
Weight decay | Control parameter growth
Normalization | Stabilize activations
Residual connections | Improve gradient flow
Mixed precision care | Avoid overflow and underflow
Loss monitoring | Detect divergence early
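
As a brief PyTorch sketch of two of these techniques, the snippet below combines gradient clipping with linear warmup followed by cosine decay. The model and loss are trivial placeholders for a real training loop.

import math
import torch

# Placeholder model and loss standing in for a real training setup.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, total_steps = 1_000, 100_000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                 # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay to zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    loss = model(torch.randn(32, 512)).pow(2).mean()       # stand-in loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip extreme updates
    optimizer.step()
    scheduler.step()
    if step >= 2:                                          # keep the demo short
        break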

At small scale, some bugs are harmless. At large scale, small defects can waste large amounts of compute. Scaling therefore requires careful measurement, logging, checkpointing, and reproducibility.

Scaling and Evaluation

A scaling law is only as useful as the metric it predicts. If the metric is poor, the scaling law may optimize the wrong target.

Pretraining loss is easy to measure and mathematically clean. But deployed systems need broader evaluation. A model may have low loss while still hallucinating, failing safety tests, leaking private data, producing biased outputs, or performing poorly on rare domains.

Evaluation must scale with the model. As models become stronger, simple benchmarks saturate. New benchmarks must test harder tasks, longer contexts, better adversarial examples, and more realistic settings.

Good evaluation includes:

Evaluation type | Purpose
Held-out loss | Measure general prediction quality
Task benchmarks | Measure capability on specific tasks
Robustness tests | Measure behavior under shift
Safety evaluations | Measure harmful or unreliable behavior
Calibration tests | Measure confidence quality
Human evaluation | Measure usefulness and preference
Long-context tests | Measure retrieval and synthesis ability
Agentic evaluations | Measure tool use and planning

Scaling performance without scaling evaluation leads to false confidence.

Limits of Scaling

Scaling is powerful, but it has limits.

First, returns diminish. Power laws improve steadily, but each successive unit of improvement costs more than the last. If loss decreases as C^{-b}, further gains become increasingly expensive.

Second, data may become scarce. High-quality human-generated data is finite. As models consume more data, future gains may require better filtering, synthetic data, multimodal data, interaction data, or new training objectives.

Third, compute and energy are constrained. Large-scale training requires power, hardware supply chains, cooling, networking, and capital. Efficiency becomes central.

Fourth, some abilities may require more than scale. Causal reasoning, reliable planning, grounded world models, continual learning, and robust agency may need architectural, algorithmic, or data changes.

Scaling should be treated as a strong empirical tool, not as a complete theory of intelligence.

Scaling Laws and PyTorch Practice

In PyTorch, scaling experiments should be designed systematically. A good scaling study changes one or two variables at a time while keeping the rest of the training recipe controlled.

For example, to study parameter scaling, we may train models with different hidden sizes and layer counts:

# Model configurations for a hypothetical parameter-scaling sweep
configs = [
    {"layers": 4,  "hidden": 256},
    {"layers": 8,  "hidden": 512},
    {"layers": 12, "hidden": 768},
    {"layers": 24, "hidden": 1024},
]

For each configuration, we record parameter count, dataset size, training tokens, compute estimate, validation loss, and downstream metrics.

A minimal logging record might look like this:

# Values such as tokens_seen, step, batch_size, lr, and the losses come from the training loop.
record = {
    "params": num_parameters(model),
    "tokens": tokens_seen,
    "steps": step,
    "batch_size": batch_size,
    "learning_rate": lr,
    "train_loss": train_loss,
    "val_loss": val_loss,
}

A simple parameter counter is:

def num_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

For scaling work, the most important engineering habit is consistency. The same tokenizer, data pipeline, optimizer, schedule, initialization method, and evaluation set should be used unless the experiment is explicitly testing one of those variables.

Fitting a Simple Scaling Law

Suppose we train several models and obtain validation losses at different compute budgets. We can fit a power-law curve.

A simple model is:

L(C) = aC^{-b} + c.

In practice, fitting this equation requires care because c, the irreducible loss floor, is unknown. A rough first analysis often uses log-log plots. If c is small or ignored over the observed range, then

L(C) \approx aC^{-b}.

Taking logs gives

\log L(C) \approx \log a - b\log C.

This becomes a linear regression problem in log space.

import torch

# Example measurements (illustrative values, not from real training runs)
compute = torch.tensor([1e18, 3e18, 1e19, 3e19, 1e20])
loss = torch.tensor([3.20, 2.85, 2.50, 2.25, 2.05])

# Work in log space: log L ≈ log a - b * log C
x = torch.log(compute)
y = torch.log(loss)

# Design matrix with an intercept column
X = torch.stack([torch.ones_like(x), x], dim=1)

# Solve least squares: y ≈ alpha + beta * x
# (y is treated as a single-column matrix for the solve)
solution = torch.linalg.lstsq(X, y.unsqueeze(1)).solution

alpha = solution[0, 0]
beta = solution[1, 0]

a = torch.exp(alpha)
b = -beta

print("a:", a.item())
print("b:", b.item())

This simplified fit should not be used as a final scaling forecast, but it illustrates the method. Serious scaling studies fit more complete equations, account for irreducible loss, and test predictions against held-out training runs.

Practical Scaling Checklist

Before scaling a model, check the following:

Area | Question
Data | Is the dataset clean, diverse, deduplicated, and relevant?
Model | Does the architecture train stably at smaller scale?
Optimizer | Are learning rate, warmup, decay, and weight decay tuned?
Compute | Is the run compute-optimal for the budget?
Systems | Can the pipeline keep accelerators utilized?
Checkpointing | Can training recover from failure?
Evaluation | Are metrics broad enough for the intended use?
Safety | Are harmful behaviors measured before deployment?
Reproducibility | Are seeds, configs, code versions, and data versions tracked?

Scaling amplifies everything. It amplifies model capability, but it also amplifies data problems, optimization bugs, infrastructure defects, and evaluation gaps.

Summary

Scaling laws describe how performance changes with model size, data size, and compute. They are empirical, but they are among the most useful tools in modern deep learning.

The central lesson is that scale must be balanced. More parameters help only when there is enough data and compute. More data helps only when the model has enough capacity. More compute helps only when it is allocated well.

For modern PyTorch systems, scaling is not only a mathematical idea. It is an experimental discipline. It requires controlled runs, careful logging, stable training recipes, clean data, reliable infrastructure, and evaluation that measures more than loss.