Modern deep learning systems often improve when we increase three quantities: model size, dataset size, and compute. This empirical regularity is called a scaling law.
A scaling law describes how model performance changes as some resource increases. The resource may be the number of parameters, the number of training tokens, the amount of compute, the dataset size, or the inference-time budget. The performance measure may be loss, accuracy, error rate, reward, perplexity, or another task-specific metric.
In its simplest form, a scaling law says:

$$L(x) = a \cdot x^{-b} + c$$

Here $x$ is a resource such as compute or dataset size. The constants $a$, $b$, and $c$ are estimated from experiments. The exponent $b$ controls how quickly performance improves. The term $c$ represents an irreducible floor under the chosen setup.
The important idea is simple: progress is often predictable over a wide range of scales.
Why Scaling Laws Matter
Scaling laws matter because large models are expensive. Training a frontier model may require large datasets, specialized hardware, distributed systems, long training runs, and careful engineering. Without scaling laws, model development becomes guesswork.
Scaling laws help answer practical questions:
| Question | Scaling-law use |
|---|---|
| Should we train a larger model? | Estimate whether more parameters will reduce loss |
| Should we collect more data? | Estimate whether the model is data-limited |
| Should we train longer? | Estimate whether more compute is useful |
| How large should the model be for a fixed budget? | Allocate compute between parameters and tokens |
| When will returns diminish? | Estimate the slope of improvement |
| Which architecture is better? | Compare loss curves at equal compute |
A small experiment can estimate the behavior of a larger run. This does not remove uncertainty, but it turns model scaling into an engineering discipline.
The Three Main Scaling Axes
Deep learning systems usually scale along three axes.
First, we can increase the number of model parameters. A larger model has more capacity. It can represent more complex functions and store more statistical structure from the data.
Second, we can increase the number of training examples or tokens. More data reduces overfitting and exposes the model to more patterns.
Third, we can increase compute. Compute combines model size, dataset size, sequence length, batch size, and training steps. For transformer language models, a rough approximation is that training compute grows with the number of parameters times the number of training tokens.
$$C \approx 6ND$$

Here $C$ is compute, $N$ is the number of parameters, and $D$ is the number of training tokens.
This approximation hides many details, but it captures the basic tradeoff. For a fixed compute budget, we can train a larger model on fewer tokens, or a smaller model on more tokens.
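The tradeoff can be sketched numerically. A minimal sketch under the approximation $C \approx 6ND$; the budget and model sizes below are illustrative:

```python
def tokens_for_budget(compute_budget, num_params):
    """Tokens affordable under the rough approximation C ≈ 6 * N * D."""
    return compute_budget / (6 * num_params)

budget = 1e21  # training FLOPs, illustrative

# For a fixed budget, a larger model means fewer training tokens.
for n_params in [1e8, 1e9, 1e10]:
    d = tokens_for_budget(budget, n_params)
    print(f"N = {n_params:.0e} params -> D ≈ {d:.2e} tokens")
```

Each tenfold increase in parameters cuts the affordable token count tenfold, which is the tradeoff the text describes.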
Parameter Scaling
Parameter scaling studies how performance changes as the number of trainable parameters increases.
Suppose we train several models of increasing size $N_1 < N_2 < \dots$ with the same dataset and training procedure. If the training setup is stable, the validation loss often decreases as $N$ increases:

$$L(N) \approx a_N \cdot N^{-b_N} + c_N$$
A larger model can fit richer functions. In language modeling, this often means better syntax, broader factual recall, stronger pattern completion, and more robust in-context learning.
However, parameter scaling alone is insufficient. A very large model trained on too little data can underuse its capacity. It may memorize training data, overfit, or fail to reach the loss predicted by ideal scaling. Large models need enough data and enough optimization steps.
Data Scaling
Data scaling studies how performance changes as the dataset grows.
For language models, data is often measured in tokens. For vision models, it may be measured in images. For speech models, it may be measured in audio hours. For reinforcement learning, it may be measured in environment steps or trajectories.
A common empirical pattern is:

$$L(D) \approx a_D \cdot D^{-b_D} + c_D$$
More data helps because it reduces sampling error. It also increases coverage. A model trained on a larger and more diverse dataset sees more concepts, styles, domains, and edge cases.
But data quality matters. One billion low-quality examples may be less useful than one hundred million carefully filtered examples. Duplicates, corrupted samples, boilerplate, spam, leakage, and low-information data can weaken scaling. In modern systems, data curation is often as important as raw data volume.
Compute Scaling
Compute scaling asks how loss changes as total training compute increases.
A compute scaling law has the form:

$$L(C) \approx a_C \cdot C^{-b_C} + c_C$$
This is useful because compute is usually the actual constraint. A research group may have a fixed number of GPUs for a fixed number of days. The key question is how to spend that budget.
For a fixed compute budget, the optimal model size and dataset size must be balanced. A model that is too large may see too little data. A model that is too small may lack capacity, even if it sees many tokens. Compute-optimal training chooses and so that neither the model nor the data is the dominant bottleneck.
This gives the central scaling tradeoff:

$$\min_{N,\,D} \; L(N, D) \quad \text{subject to} \quad C \approx 6ND$$
Compute-Optimal Training
Compute-optimal training means choosing model size and training data size to minimize loss under a fixed compute budget.
Suppose we have budget $C$. Since training compute roughly scales like

$$C \approx 6ND,$$

we cannot increase both $N$ and $D$ arbitrarily. Increasing model size forces us to reduce the number of training tokens, unless compute also increases.
Earlier large language models often used very large parameter counts with relatively few training tokens. Later scaling analyses showed that, for many budgets, smaller models trained on more data can be more compute-efficient.
This changed the practical recipe. Instead of only making models larger, modern training often emphasizes training a moderately large model on many more high-quality tokens.
The lesson is not that smaller models are always better. The lesson is that model size must be matched to data size.
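As a hedged illustration: one widely cited analysis (the Chinchilla study) found a roughly constant ratio of about 20 training tokens per parameter to be near compute-optimal for its setup. Combining $D \approx 20N$ with $C \approx 6ND$ gives $N \approx \sqrt{C / 120}$, which can be sketched as:

```python
import math

def compute_optimal_sizes(compute_budget, tokens_per_param=20.0):
    """Sketch: solve C = 6 * N * D with D = r * N, so N = sqrt(C / (6 * r))."""
    n = math.sqrt(compute_budget / (6.0 * tokens_per_param))
    d = tokens_per_param * n
    return n, d

n, d = compute_optimal_sizes(1e21)
print(f"N ≈ {n:.2e} params, D ≈ {d:.2e} tokens")
```

The 20:1 ratio is specific to one study's data, architecture, and training recipe; the point is the matching procedure, not the particular constant.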
Loss, Perplexity, and Downstream Ability
Scaling laws usually measure pretraining loss. In language modeling, this is often cross-entropy loss or perplexity.
Perplexity is related to cross-entropy:

$$\text{PPL} = e^{H}$$

where $H$ is the average negative log-likelihood in nats.
Lower pretraining loss generally correlates with better downstream performance, but the relationship is imperfect. Some abilities improve smoothly. Others appear more suddenly when the model reaches a certain scale, data quality, or training regime.
Examples of downstream abilities include:
| Ability | Relation to scaling |
|---|---|
| Text prediction | Usually improves smoothly with loss |
| Translation | Improves with multilingual data and model capacity |
| Code generation | Strongly depends on code data and scale |
| Reasoning | Often improves with scale, data, and inference method |
| Tool use | Depends heavily on instruction data and environment |
| Long-context use | Depends on architecture, data, and positional design |
| Robustness | Requires data diversity and evaluation beyond loss |
Pretraining loss is a useful proxy, but it does not fully describe model behavior.
Emergent Behavior
Some model capabilities appear weak or absent at small scale, then become visible at larger scale. These are often called emergent abilities.
Examples may include multi-step reasoning, in-context learning, instruction following, code synthesis, or tool-use behavior. The word “emergent” should be used carefully. Sometimes an ability seems sudden because the metric is discrete or thresholded. A smoother underlying improvement may look abrupt when measured with a hard benchmark.
For example, suppose a model’s probability of solving each reasoning step improves gradually. A benchmark that gives credit only for the final exact answer may show little improvement until the probability crosses a threshold. The measured curve appears sudden, even though the underlying competence changed smoothly.
Thus emergence can be real at the system level, but the measurement may exaggerate its sharpness.
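The thresholding effect can be simulated. If per-step success improves smoothly, an all-or-nothing metric over a multi-step task still changes sharply (the step count and probabilities are illustrative):

```python
# Smoothly improving per-step success probability vs. an
# all-or-nothing metric over a 10-step task (illustrative).
steps = 10

for p_step in [0.5, 0.7, 0.9, 0.95, 0.99]:
    p_task = p_step ** steps  # credit only if every step succeeds
    print(f"per-step {p_step:.2f} -> full-task {p_task:.4f}")
```

Per-step accuracy moving from 0.5 to 0.9 barely registers on the full-task metric, while the move from 0.9 to 0.99 looks like a jump from roughly 0.35 to 0.90.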
Scaling Architectures
Scaling laws depend on architecture. A transformer, convolutional network, recurrent network, diffusion model, or graph neural network may have different scaling behavior.
In language modeling, transformers became dominant because they scale well with data and compute. Self-attention allows efficient parallel training over sequence positions. Residual connections, layer normalization, and large-batch optimization make very deep networks trainable.
In vision, scaling has shifted from convolutional networks toward vision transformers and hybrid models in many large-data regimes. CNNs remain strong when data or compute is limited, but transformers often benefit more from very large datasets.
In generative modeling, diffusion models scale through larger denoisers, better noise schedules, more data, and more sampling compute. Diffusion transformers extend this trend by combining transformer scaling with image and video generation.
Architecture determines the slope of the scaling curve. A better architecture may achieve lower loss at the same compute, or it may improve faster as compute increases.
Scaling Data Quality
Raw dataset size is an incomplete measure. The effective size of a dataset depends on quality, diversity, deduplication, and relevance.
A dataset with many duplicates has lower effective diversity. A dataset with noisy labels weakens supervised learning. A dataset with low-quality text may teach undesirable patterns. A dataset with narrow domain coverage may produce brittle models.
Data scaling therefore has two dimensions: quantity, meaning how many examples or tokens the model sees, and quality, meaning how much useful signal each one carries.
Quality is difficult to measure. Common methods include filtering, deduplication, classifier-based quality scoring, human review, domain balancing, and contamination detection.
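As one small illustration, exact deduplication can be sketched with hashing; real pipelines typically also use near-duplicate methods such as MinHash, which are beyond this sketch:

```python
import hashlib

def dedup_exact(docs):
    """Keep the first occurrence of each exact duplicate (after light normalization)."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["Hello world", "hello world", "A unique document"]
print(dedup_exact(docs))  # the normalized duplicate is dropped
```

Even this trivial normalization (whitespace, case) changes what counts as a duplicate, which is why deduplication policy is itself a design decision.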
For foundation models, data mixture design is a central problem. A language model may train on web text, books, academic papers, code, math, multilingual data, dialogue, and synthetic data. The mixture affects capabilities. More code improves programming. More math improves mathematical reasoning. More multilingual data improves cross-lingual transfer.
Scaling the dataset means scaling the mixture intelligently, not merely adding tokens.
Scaling Context Length
For language models and sequence models, context length is another scaling axis.
A model with context length $L$ processes sequences of up to $L$ tokens. Increasing $L$ allows the model to use longer documents, longer conversations, larger codebases, and more retrieved evidence.
However, standard attention has quadratic complexity in sequence length:

$$\text{cost} \propto L^2$$
Doubling context length can more than double memory and compute cost. Long-context scaling therefore requires architectural and systems techniques such as sparse attention, sliding-window attention, memory layers, retrieval, recurrence, compression, or more efficient attention kernels.
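A back-of-envelope check of the quadratic term, counting only the $L \times L$ attention-score matrix per head and omitting all other costs:

```python
def attention_score_entries(seq_len, num_heads=1):
    """Entries in the attention-score matrices: num_heads * L^2 per layer."""
    return num_heads * seq_len * seq_len

for L in [1024, 2048, 4096, 8192]:
    print(f"L = {L:5d} -> {attention_score_entries(L):>12,} score entries")
```

Each doubling of $L$ quadruples the score-matrix size, which is why long-context work leans on the sparse, windowed, and kernel-level techniques listed above.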
Long context also requires appropriate training data. A model with a long context window may still fail to use distant information if it was rarely trained on tasks that require long-range dependency.
Inference-Time Scaling
Training scale is only one part of modern deep learning. Some systems also improve when we spend more compute during inference.
Inference-time scaling can appear in several forms:
| Method | Idea |
|---|---|
| Larger decoding budget | Generate more candidate answers |
| Reranking | Score multiple outputs and choose the best |
| Chain-of-thought reasoning | Allocate more tokens to intermediate reasoning |
| Search | Explore possible solution paths |
| Tool use | Call external systems during inference |
| Retrieval | Add relevant context before generation |
| Self-consistency | Sample multiple solutions and aggregate |
This changes the old view of a model as a fixed function. A modern model may be part of a larger inference procedure. The same base model can perform better when given more time, more samples, better retrieval, or external tools.
Inference-time scaling is especially important for reasoning, mathematics, code, planning, and agentic tasks.
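Self-consistency, from the table above, can be sketched as sampling several candidate answers and taking a majority vote; the stochastic function below is a stand-in for a real model:

```python
import random
from collections import Counter

def self_consistency(sample_answer, num_samples=15):
    """Sample several candidate answers and return the most common one."""
    votes = Counter(sample_answer() for _ in range(num_samples))
    return votes.most_common(1)[0][0]

# Stand-in for a stochastic model: correct answer 60% of the time,
# otherwise a random wrong digit.
rng = random.Random(0)

def noisy_model():
    return "42" if rng.random() < 0.6 else str(rng.randint(0, 9))

print(self_consistency(noisy_model))
```

Aggregating many samples concentrates probability on the answer the model produces most often, which is usually the correct one when individual samples are better than chance.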
Scaling and Optimization Stability
Large-scale training is fragile. Scaling laws assume that training remains stable, but instability can break the curve.
Common sources of instability include exploding gradients, poor initialization, bad learning rate schedules, numerical overflow, data bugs, distributed training failures, optimizer state corruption, and loss spikes.
Several techniques improve stability:
| Technique | Purpose |
|---|---|
| Warmup | Prevent early optimization instability |
| Learning rate decay | Improve convergence late in training |
| Gradient clipping | Limit extreme updates |
| Weight decay | Control parameter growth |
| Normalization | Stabilize activations |
| Residual connections | Improve gradient flow |
| Mixed precision care | Avoid overflow and underflow |
| Loss monitoring | Detect divergence early |
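Several of the techniques in the table combine naturally in a standard PyTorch loop. The schedule below, linear warmup into cosine decay, is one common choice rather than the only one, and the model and data are trivial placeholders:

```python
import math
import torch

model = torch.nn.Linear(16, 1)  # tiny placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

warmup_steps, total_steps = 100, 1000

def lr_scale(step):
    """Linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_scale)

for step in range(total_steps):
    x, y = torch.randn(32, 16), torch.randn(32, 1)  # placeholder batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    # Gradient clipping: limit extreme updates
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()
```

The specific warmup length, decay shape, and clipping threshold are tuning choices that vary across training recipes.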
At small scale, some bugs are harmless. At large scale, small defects can waste large amounts of compute. Scaling therefore requires careful measurement, logging, checkpointing, and reproducibility.
Scaling and Evaluation
A scaling law is only as useful as the metric it predicts. If the metric is poor, the scaling law may optimize the wrong target.
Pretraining loss is easy to measure and mathematically clean. But deployed systems need broader evaluation. A model may have low loss while still hallucinating, failing safety tests, leaking private data, producing biased outputs, or performing poorly on rare domains.
Evaluation must scale with the model. As models become stronger, simple benchmarks saturate. New benchmarks must test harder tasks, longer contexts, better adversarial examples, and more realistic settings.
Good evaluation includes:
| Evaluation type | Purpose |
|---|---|
| Held-out loss | Measure general prediction quality |
| Task benchmarks | Measure capability on specific tasks |
| Robustness tests | Measure behavior under shift |
| Safety evaluations | Measure harmful or unreliable behavior |
| Calibration tests | Measure confidence quality |
| Human evaluation | Measure usefulness and preference |
| Long-context tests | Measure retrieval and synthesis ability |
| Agentic evaluations | Measure tool use and planning |
Scaling performance without scaling evaluation leads to false confidence.
Limits of Scaling
Scaling is powerful, but it has limits.
First, returns diminish. Power laws improve steadily, but each additional unit of improvement requires more resources. If loss decreases as $a \cdot C^{-b}$, then large gains become increasingly expensive.
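The arithmetic is stark: if the reducible loss follows $a \cdot C^{-b}$, halving it requires multiplying compute by $2^{1/b}$. A quick check for a few exponents (the values are illustrative):

```python
def compute_multiplier_to_halve(b):
    """Factor by which compute must grow to halve a reducible loss a * C^{-b}."""
    return 2.0 ** (1.0 / b)

for b in [0.5, 0.1, 0.05]:
    print(f"b = {b:.2f} -> need {compute_multiplier_to_halve(b):.3g}x compute")
```

With a shallow exponent like 0.05, halving the reducible loss costs roughly a million times more compute.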
Second, data may become scarce. High-quality human-generated data is finite. As models consume more data, future gains may require better filtering, synthetic data, multimodal data, interaction data, or new training objectives.
Third, compute and energy are constrained. Large-scale training requires power, hardware supply chains, cooling, networking, and capital. Efficiency becomes central.
Fourth, some abilities may require more than scale. Causal reasoning, reliable planning, grounded world models, continual learning, and robust agency may need architectural, algorithmic, or data changes.
Scaling should be treated as a strong empirical tool, not as a complete theory of intelligence.
Scaling Laws and PyTorch Practice
In PyTorch, scaling experiments should be designed systematically. A good scaling study changes one or two variables at a time while keeping the rest of the training recipe controlled.
For example, to study parameter scaling, we may train models with different hidden sizes and layer counts:
```python
configs = [
    {"layers": 4, "hidden": 256},
    {"layers": 8, "hidden": 512},
    {"layers": 12, "hidden": 768},
    {"layers": 24, "hidden": 1024},
]
```

For each configuration, we record parameter count, dataset size, training tokens, compute estimate, validation loss, and downstream metrics.
A minimal logging record might look like this:
```python
record = {
    "params": num_parameters(model),
    "tokens": tokens_seen,
    "steps": step,
    "batch_size": batch_size,
    "learning_rate": lr,
    "train_loss": train_loss,
    "val_loss": val_loss,
}
```

A simple parameter counter is:
```python
def num_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

For scaling work, the most important engineering habit is consistency. The same tokenizer, data pipeline, optimizer, schedule, initialization method, and evaluation set should be used unless the experiment is explicitly testing one of those variables.
Fitting a Simple Scaling Law
Suppose we train several models and obtain validation losses at different compute budgets. We can fit a power-law curve.
A simple model is:

$$L(C) = a \cdot C^{-b} + c$$

In practice, fitting this equation requires care because $c$, the irreducible loss floor, is unknown. A rough first analysis often uses log-log plots. If $c$ is small or can be ignored over the observed range, then

$$L(C) \approx a \cdot C^{-b}.$$

Taking logs gives

$$\log L \approx \log a - b \log C.$$

This becomes a linear regression problem in log space.
```python
import torch

# Example measurements (illustrative)
compute = torch.tensor([1e18, 3e18, 1e19, 3e19, 1e20])
loss = torch.tensor([3.20, 2.85, 2.50, 2.25, 2.05])

x = torch.log(compute)
y = torch.log(loss)

# Design matrix for the linear model y ≈ alpha + beta * x
X = torch.stack([torch.ones_like(x), x], dim=1)

# Solve the least-squares problem in log space
solution = torch.linalg.lstsq(X, y.unsqueeze(1)).solution.squeeze(1)
alpha, beta = solution[0], solution[1]

a = torch.exp(alpha)  # prefactor
b = -beta             # scaling exponent

print("a:", a.item())
print("b:", b.item())
```

This simplified fit should not be used as a final scaling forecast, but it illustrates the method. Serious scaling studies fit more complete equations, account for irreducible loss, and test predictions against held-out training runs.
Practical Scaling Checklist
Before scaling a model, check the following:
| Area | Question |
|---|---|
| Data | Is the dataset clean, diverse, deduplicated, and relevant? |
| Model | Does the architecture train stably at smaller scale? |
| Optimizer | Are learning rate, warmup, decay, and weight decay tuned? |
| Compute | Is the run compute-optimal for the budget? |
| Systems | Can the pipeline keep accelerators utilized? |
| Checkpointing | Can training recover from failure? |
| Evaluation | Are metrics broad enough for the intended use? |
| Safety | Are harmful behaviors measured before deployment? |
| Reproducibility | Are seeds, configs, code versions, and data versions tracked? |
Scaling amplifies everything. It amplifies model capability, but it also amplifies data problems, optimization bugs, infrastructure defects, and evaluation gaps.
Summary
Scaling laws describe how performance changes with model size, data size, and compute. They are empirical, but they are among the most useful tools in modern deep learning.
The central lesson is that scale must be balanced. More parameters help only when there is enough data and compute. More data helps only when the model has enough capacity. More compute helps only when it is allocated well.
For modern PyTorch systems, scaling is not only a mathematical idea. It is an experimental discipline. It requires controlled runs, careful logging, stable training recipes, clean data, reliable infrastructure, and evaluation that measures more than loss.