Scaling Laws for Language Models

Scaling laws describe how model performance changes as we increase compute, parameter count, dataset size, and training tokens. They matter because large language models are expensive to train. Before spending millions of GPU-hours, we want a principled estimate of what a given training run is likely to achieve.

A scaling law usually relates training resources to loss. For language models, the main measured quantity is often cross-entropy loss on held-out text. Lower loss means the model assigns higher probability to the validation data.

The basic observation is simple: when model size, data size, and compute increase together, language modeling loss tends to decrease in a predictable way.

The Scaling Variables

There are three central variables:

N: number of model parameters
D: number of training tokens
C: training compute
L: validation loss

A transformer training run has a rough compute cost proportional to

C \propto N D.

This approximation ignores constants, architecture details, sequence length effects, optimizer overhead, activation recomputation, and hardware efficiency. Still, it captures the dominant tradeoff: a larger model trained on the same number of tokens costs more, and the same model trained for more tokens also costs more.
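
Under these caveats, the relation can still be turned into a back-of-the-envelope estimator. The sketch below uses the common rule of thumb of roughly 6 FLOPs per parameter per training token; the constant 6, the helper name, and the example numbers are illustrative assumptions, not values taken from this text.

def estimate_training_flops(n_params, n_tokens, flops_per_param_token=6.0):
    """Rough training compute estimate: C on the order of 6 * N * D."""
    return flops_per_param_token * n_params * n_tokens

# Hypothetical example: a 7e9-parameter model trained on 2e12 tokens.
print(f"{estimate_training_flops(7e9, 2e12):.2e} FLOPs")   # about 8.4e22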

A scaling law asks questions such as:

Model-size scaling: if we double parameters, how much does loss improve?
Data scaling: if we double training tokens, how much does loss improve?
Compute-optimal scaling: if we double compute, how should we split it between N and D?
Allocation problem: at fixed compute, should we train a smaller model longer or a larger model shorter?

The last question is the most important in practice.

Power-Law Behavior

Empirically, language model loss often follows an approximate power law. A simple form is

L(N) \approx L_\infty + a N^{-\alpha},

where L_\infty is an irreducible loss term, a is a constant, and \alpha controls how quickly loss improves as the model grows.

A similar relationship can be written for dataset size:

L(D) \approx L_\infty + b D^{-\beta}.

These equations say that improvement continues with scale, but each additional unit of scale gives a smaller gain than the previous one. This is diminishing returns.

For example, increasing a model from 100 million to 1 billion parameters may reduce loss substantially. Increasing from 100 billion to 1 trillion parameters may still help, but the improvement per additional parameter is smaller.
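
The diminishing returns can be made concrete with a toy calculation. The constants below (L_\infty = 1.7, a = 8, \alpha = 0.08) are invented for illustration, not fitted values; only the shape of the curve matters here.

def power_law_loss(n_params, l_inf=1.7, a=8.0, alpha=0.08):
    """Evaluate L(N) = L_inf + a * N**(-alpha) with illustrative constants."""
    return l_inf + a * n_params ** (-alpha)

for n in [1e8, 1e9, 1e11, 1e12]:
    print(f"N={n:.0e}  loss={power_law_loss(n):.3f}")
# The drop from 1e8 to 1e9 parameters is larger than the drop from 1e11 to 1e12.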

The practical lesson is that scale helps, but it must be allocated carefully.

Model-Limited and Data-Limited Regimes

A training run can be limited by model size or by data.

In a model-limited regime, the dataset is large enough, but the model is too small to absorb the available structure. Adding parameters helps.

In a data-limited regime, the model is large enough, but it has seen too little data. Training a larger model may waste compute because the model cannot generalize well from insufficient tokens. Adding data or training for more tokens helps.

A small model trained on massive data may underfit because it lacks capacity. A huge model trained on too little data may overfit or fail to use its capacity efficiently.

Compute-optimal training balances these two regimes.

Compute-Optimal Training

Suppose we have a fixed compute budget C. Since compute is roughly proportional to ND, we cannot increase both without limit. We must choose a model size N and token budget D.

A compute-optimal training rule tells us how N and D should grow as C grows.

Earlier large language model practice often favored very large models trained for relatively few tokens. Later empirical work showed that many large models were undertrained relative to their size. In compute-optimal training, a smaller model trained on more tokens can outperform a larger model trained on fewer tokens at the same compute budget.

The tradeoff can be summarized as:

Larger N, smaller D: more capacity, less practice
Smaller N, larger D: less capacity, more practice
Balanced N and D: better compute efficiency

This matters for deployment too. A smaller model trained longer may have lower inference cost than a larger undertrained model with similar quality.
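
A minimal sketch of this allocation: assume compute is roughly 6*N*D FLOPs and pick a target tokens-per-parameter ratio up front, then solve for N and D. The ratio of 20 and the budget below are placeholder values, not recommendations (the ratio is discussed in the next section).

import math

def split_compute(flops_budget, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Given C ~ 6*N*D and a chosen D/N ratio r, N = sqrt(C / (6 * r))."""
    n_params = math.sqrt(flops_budget / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = split_compute(1e23)
print(f"N ~ {n:.2e} parameters, D ~ {d:.2e} tokens")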

Tokens per Parameter

A common heuristic is to compare the number of training tokens to the number of model parameters.

\text{tokens per parameter} = \frac{D}{N}.

This ratio gives a rough sense of whether the model has been trained long enough for its size.

If the ratio is too low, the model may be undertrained. If the ratio is very high, the model may be trained heavily relative to its capacity. The best ratio depends on the architecture, data quality, tokenizer, objective, compute budget, and target use case.

For many modern dense transformer models, compute-optimal training uses substantially more tokens per parameter than early GPT-style scaling practice. This is why recent models often train smaller or medium-size models on very large token counts.

Scaling Data Quality

Scaling laws are often written in terms of token count, but not all tokens have equal value.

A trillion low-quality tokens may train a worse model than a smaller, cleaner, more diverse corpus. Data quality affects loss, downstream performance, factuality, toxicity, code ability, multilingual ability, and reasoning behavior.

Important data properties include:

Deduplication: reduces memorization and benchmark leakage
Filtering: removes low-quality or harmful documents
Diversity: improves domain coverage
Code mixture: improves programming and formal reasoning
Math mixture: improves symbolic and quantitative reasoning
Multilingual balance: improves non-English performance
Recency: improves knowledge of recent facts
Document structure: improves long-context behavior

Scaling token count without controlling quality can produce misleading results. A model trained on more tokens may improve validation loss while becoming worse for specific tasks.

Scaling Architecture

Scaling laws are usually measured within an architecture family. A law estimated for one transformer design may not transfer exactly to another design.

Architectural choices affect scaling efficiency:

Depth: more sequential computation, stronger hierarchical processing
Width: more parallel capacity per layer
Attention heads: more subspace interactions
Context length: more long-range conditioning, higher attention cost
Feedforward size: more token-wise transformation capacity
Normalization placement: affects stability
Activation function: affects optimization and expressivity
Mixture-of-experts: increases parameters without activating all of them per token

A dense transformer activates all parameters for each token. A mixture-of-experts model activates only part of the network per token. This changes the relation between parameter count, compute, and performance. For this reason, total parameters and active parameters should be reported separately.

Scaling Context Length

Context length is another axis of scale. A model with a longer context window can condition on more tokens at inference time.

However, standard attention has quadratic cost in sequence length:

\text{attention cost} \propto T^2.

Doubling context length can make attention much more expensive, especially during training.

Long-context scaling introduces several questions:

Can the model use the full context? Long windows are useless if attention ignores distant tokens.
Was the model trained on long sequences? Extrapolating beyond training length is unreliable.
Does retrieval work better? External retrieval may beat brute-force context expansion.
How is position represented? Positional encoding affects length generalization.
What is the inference memory cost? The KV cache grows with context length.

A longer context window increases capacity for document-level tasks, codebases, multi-turn dialogue, and retrieval-augmented generation. It also increases serving cost.
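
The memory side of that serving cost can be estimated directly. The sketch below computes KV-cache size for a standard dense transformer that caches keys and values at every layer; the model shape (32 layers, 8 KV heads of dimension 128, bf16 cache) is a made-up example.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   batch_size=1, bytes_per_value=2):
    """Rough KV-cache size: one key and one value per layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * batch_size * bytes_per_value

for ctx in [4_096, 32_768, 131_072]:
    gb = kv_cache_bytes(32, 8, 128, ctx) / 1e9
    print(f"context={ctx:>7}  kv_cache ~ {gb:.1f} GB")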

Scaling Inference

Training scaling and inference scaling differ.

Training cost is paid once. Inference cost is paid every time the model is used. A model that is cheap to train but expensive to serve may be unattractive in production.

Inference cost depends on:

Parameter count: larger models require more memory bandwidth and compute
Generated length: more output tokens increase cost
Context length: longer prompts increase prefill cost
Batch size: larger batches improve throughput but may increase latency
KV cache size: long contexts require more memory
Quantization: reduces memory and can improve throughput
Speculative decoding: reduces latency by drafting tokens with a smaller model

For many applications, the best model is not the largest model. It is the model that gives sufficient quality at acceptable latency and cost.
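
A rough rule of thumb is that a dense decoder spends on the order of 2 FLOPs per parameter for each token it processes. The comparison below uses that approximation; the constant, the model sizes, and the request shape are illustrative assumptions.

def request_flops(n_params, prompt_tokens, output_tokens, flops_per_param_token=2.0):
    """Approximate forward-pass FLOPs for one request: prefill plus decode."""
    return flops_per_param_token * n_params * (prompt_tokens + output_tokens)

# Hypothetical 8B and 70B dense models serving the same request.
print(f"8B:  {request_flops(8e9, 2_000, 500):.2e} FLOPs per request")
print(f"70B: {request_flops(70e9, 2_000, 500):.2e} FLOPs per request")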

Emergent Abilities and Smooth Scaling

Some abilities appear to emerge suddenly as models become larger. Examples often include multi-step reasoning, in-context learning, instruction following, tool use, and code synthesis.

However, apparent emergence can sometimes result from the metric. If a task is graded with a strict threshold, smooth improvement in underlying probability may look like sudden improvement in accuracy. For example, a model may gradually assign more probability to correct answers, but accuracy only rises once the correct answer becomes the top choice.
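
A toy simulation shows the effect. Below, the probability assigned to the correct answer rises smoothly with scale, but strict exact-match accuracy stays at zero until that probability overtakes the strongest distractor; all numbers are invented for illustration.

scales = [1, 2, 4, 8, 16, 32]                      # arbitrary model-size units
p_correct = [0.10, 0.15, 0.22, 0.30, 0.42, 0.60]   # smoothly improving, made up
p_best_distractor = 0.35                           # strongest wrong answer, held fixed

for scale, p in zip(scales, p_correct):
    exact_match = 1 if p > p_best_distractor else 0
    print(f"scale={scale:>2}  P(correct)={p:.2f}  exact_match={exact_match}")
# P(correct) improves at every step, but exact match flips from 0 to 1 only once.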

This means we should be careful with claims of emergence. Some abilities may reflect genuinely new internal organization at scale. Others may reflect measurement artifacts, prompting changes, or benchmark saturation.

Scaling Laws and Evaluation

Validation loss is useful because it is stable, cheap to measure, and directly tied to the pretraining objective. But it is incomplete.

A lower language modeling loss may correlate with better downstream performance, but the relationship varies by task. For example, loss may improve while factual calibration, safety, or instruction following remain weak.

A practical evaluation suite should include:

Language modeling: held-out perplexity
Knowledge: question answering
Reasoning: math and logic tasks
Code: unit-tested programming problems
Instruction following: human- or model-graded tasks
Robustness: distribution-shift tests
Safety: harmful-request refusal and jailbreak resistance
Calibration: confidence versus correctness
Efficiency: latency, throughput, memory, cost

Scaling laws help forecast loss. They do not replace evaluation.

Budgeting a Training Run

A training plan should specify:

Target model size: parameters, layers, width, heads
Token budget: number of tokens and data mixture
Compute budget: GPU type, count, duration, utilization
Sequence length: training context window
Batch size: global tokens per optimizer step
Optimizer: usually AdamW or a related variant
Learning rate schedule: warmup, decay, final learning rate
Precision: fp32, fp16, bf16, or mixed
Checkpoint policy: frequency and retention
Evaluation suite: loss and downstream tasks
Stop criteria: compute budget, loss target, or overfitting signal

Scaling estimates should be conservative. Hardware failures, data pipeline bottlenecks, optimizer instability, and poor utilization can dominate the practical cost.
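
One way to keep this checklist explicit is a small configuration object that the training code and the logs both reference. The sketch below is a hypothetical plan with placeholder numbers, not a recommendation.

from dataclasses import dataclass

@dataclass
class TrainingPlan:
    n_params: float            # target parameter count
    token_budget: float        # total training tokens
    sequence_length: int
    global_batch_tokens: int   # tokens per optimizer step
    optimizer: str
    peak_lr: float
    precision: str

plan = TrainingPlan(
    n_params=7e9,
    token_budget=2e12,
    sequence_length=4_096,
    global_batch_tokens=4_000_000,
    optimizer="AdamW",
    peak_lr=3e-4,
    precision="bf16",
)
print(f"~{plan.token_budget / plan.global_batch_tokens:.0f} optimizer steps")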

PyTorch View: Counting Tokens and Parameters

In PyTorch, parameter count can be computed directly:

def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

Trainable parameter count excludes frozen parameters:

def count_trainable_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

Token count is usually tracked by the data pipeline. If each batch has shape [B, T], then each step processes approximately

tokens_per_step = B * T

For distributed training with world_size workers:

global_tokens_per_step = micro_batch_size * sequence_length * grad_accum_steps * world_size

Total tokens after num_steps optimizer steps:

total_tokens = global_tokens_per_step * num_steps

A simple training logger should record both parameter count and cumulative token count:

import torch.nn.functional as F

params = count_parameters(model)
tokens_seen = 0

for step, batch in enumerate(loader):
    input_ids = batch["input_ids"]          # [B, T]
    tokens_seen += input_ids.numel()

    # Predict token t+1 from tokens up to t.
    logits = model(input_ids[:, :-1])       # [B, T-1, vocab]
    targets = input_ids[:, 1:]              # [B, T-1]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if step % 100 == 0:
        print({
            "step": step,
            "params": params,
            "tokens_seen": tokens_seen,
            "loss": float(loss.item()),
        })

This is not enough for a production training run, but it captures the basic accounting needed to reason about scale.

Practical Lessons

Scaling laws give several practical rules.

First, train on enough tokens. A large model trained on too little data often wastes compute.

Second, report both model size and token count. Parameter count alone says little about training quality.

Third, treat data quality as a scaling variable. More data helps only when the additional data improves the training distribution.

Fourth, evaluate beyond loss. Pretraining loss is important, but downstream behavior depends on adaptation, prompting, decoding, retrieval, and safety work.

Fifth, optimize for the full lifecycle. Training cost matters, but inference cost often dominates over time.

Summary

Scaling laws describe predictable relationships between loss, model size, data size, and compute. They help answer a central engineering question: for a fixed budget, how large should the model be, and how long should it be trained?

The key variables are parameters N, training tokens D, compute C, and validation loss L. Since compute is roughly proportional to ND, scaling requires a tradeoff between model capacity and training duration.

Modern language model training favors compute-balanced choices: enough parameters to learn rich structure, enough tokens to train those parameters well, and enough evaluation to detect failures that loss alone cannot show.