# Scaling Laws for Language Models

Scaling laws describe how model performance changes as we increase compute, parameter count, dataset size, and training tokens. They matter because large language models are expensive to train. Before spending millions of GPU-hours, we want a principled estimate of what a given training run is likely to achieve.

A scaling law usually relates training resources to loss. For language models, the main measured quantity is often cross-entropy loss on held-out text. Lower loss means the model assigns higher probability to the validation data.

The basic observation is simple: when model size, data size, and compute increase together, language modeling loss tends to decrease in a predictable way.

### The Scaling Variables

There are three central scaling variables, plus the loss they predict:

| Symbol | Meaning |
|---|---|
| $N$ | Number of model parameters |
| $D$ | Number of training tokens |
| $C$ | Training compute |
| $L$ | Validation loss |

A transformer training run has a rough compute cost proportional to

$$
C \propto N D.
$$

This approximation ignores constants, architecture details, sequence length effects, optimizer overhead, activation recomputation, and hardware efficiency. Still, it captures the dominant tradeoff: a larger model trained on the same number of tokens costs more, and the same model trained for more tokens also costs more.
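A common rule of thumb makes the constant explicit: roughly 6 floating-point operations per parameter per token for a combined forward and backward pass. The sketch below uses that approximation; the exact constant depends on architecture and implementation.

```python
def estimate_training_flops(num_params: float, num_tokens: float) -> float:
    """Rough training compute estimate: C ~ 6 * N * D FLOPs."""
    return 6.0 * num_params * num_tokens

# Hypothetical 7e9-parameter model trained on 2e12 tokens.
flops = estimate_training_flops(7e9, 2e12)
print(f"{flops:.2e} FLOPs")  # prints 8.40e+22 FLOPs
```

Doubling either the parameter count or the token count doubles the estimate, which is exactly the tradeoff the proportionality $C \propto ND$ expresses.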

A scaling law asks questions such as:

| Question | Meaning |
|---|---|
| If we double parameters, how much does loss improve? | Model-size scaling |
| If we double training tokens, how much does loss improve? | Data scaling |
| If we double compute, how should we split it between $N$ and $D$? | Compute-optimal scaling |
| At fixed compute, should we train a smaller model longer or a larger model shorter? | Allocation problem |

The last question is the most important in practice.

### Power-Law Behavior

Empirically, language model loss often follows an approximate power law. A simple form is

$$
L(N) \approx L_\infty + aN^{-\alpha},
$$

where $L_\infty$ is an irreducible loss term, $a$ is a constant, and $\alpha$ controls how quickly loss improves as the model grows.

A similar relationship can be written for dataset size:

$$
L(D) \approx L_\infty + bD^{-\beta}.
$$

These equations say that improvement continues with scale, but each additional unit of scale gives a smaller gain than the previous one. This is diminishing returns.

For example, increasing a model from 100 million to 1 billion parameters may reduce loss substantially. Increasing from 100 billion to 1 trillion parameters may still help, but the improvement per additional parameter is smaller.
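To make the diminishing returns concrete, here is a small sketch that evaluates the power law above. The constants $L_\infty$, $a$, and $\alpha$ are made up for illustration, not fitted to any real model.

```python
def loss_from_params(n: float, l_inf: float = 1.7,
                     a: float = 10.0, alpha: float = 0.08) -> float:
    """Illustrative power law L(N) = L_inf + a * N**(-alpha)."""
    return l_inf + a * n ** (-alpha)

# Each tenfold increase in N improves loss, but by less each time.
for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"N={n:.0e}: L={loss_from_params(n):.3f}")
```

The gap between consecutive lines shrinks at every step: each additional decade of parameters buys a smaller loss reduction than the last.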

The practical lesson is that scale helps, but it must be allocated carefully.

### Model-Limited and Data-Limited Regimes

A training run can be limited by model size or by data.

In a model-limited regime, the dataset is large enough, but the model is too small to absorb the available structure. Adding parameters helps.

In a data-limited regime, the model is large enough, but it has seen too little data. Training a larger model may waste compute because the model cannot generalize well from insufficient tokens. Adding data or training for more tokens helps.

A small model trained on massive data may underfit because it lacks capacity. A huge model trained on too little data may overfit or fail to use its capacity efficiently.

Compute-optimal training balances these two regimes.

### Compute-Optimal Training

Suppose we have a fixed compute budget $C$. Since compute is roughly proportional to $ND$, we cannot increase both without limit. We must choose a model size $N$ and token budget $D$.

A compute-optimal training rule tells us how $N$ and $D$ should grow as $C$ grows.

Earlier large language model practice often favored very large models trained for relatively few tokens. Later empirical work showed that many large models were undertrained relative to their size. In compute-optimal training, a smaller model trained on more tokens can outperform a larger model trained on fewer tokens at the same compute budget.

The tradeoff can be summarized as:

| Choice | Effect |
|---|---|
| Larger $N$, smaller $D$ | More capacity, less practice |
| Smaller $N$, larger $D$ | Less capacity, more practice |
| Balanced $N$ and $D$ | Better compute efficiency |
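As a sketch, we can compare two allocations of the same compute budget under a combined power-law model $L(N, D) \approx L_\infty + aN^{-\alpha} + bD^{-\beta}$. All constants here are hypothetical, chosen only to illustrate the tradeoff.

```python
def loss_nd(n: float, d: float, l_inf: float = 1.7,
            a: float = 400.0, alpha: float = 0.34,
            b: float = 410.0, beta: float = 0.28) -> float:
    """Illustrative combined power law; constants are made up."""
    return l_inf + a * n ** (-alpha) + b * d ** (-beta)

budget = 1e22              # fixed compute, C ~ 6 * N * D
for n in [70e9, 13e9]:     # larger vs. smaller model at the same budget
    d = budget / (6 * n)   # tokens affordable for this model size
    print(f"N={n:.0e}, D={d:.2e}, L={loss_nd(n, d):.3f}")
```

Under these illustrative constants, the smaller model trained on more tokens ends with lower loss than the larger undertrained one, matching the compute-optimal intuition above.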

This matters for deployment too. A smaller model trained longer may have lower inference cost than a larger undertrained model with similar quality.

### Tokens per Parameter

A common heuristic is to compare the number of training tokens to the number of model parameters.

$$
\text{tokens per parameter} = \frac{D}{N}.
$$

This ratio gives a rough sense of whether the model has been trained long enough for its size.

If the ratio is too low, the model may be undertrained. If the ratio is very high, the model may be trained heavily relative to its capacity. The best ratio depends on the architecture, data quality, tokenizer, objective, compute budget, and target use case.

For many modern dense transformer models, compute-optimal training uses substantially more tokens per parameter than early GPT-style scaling practice. This is why recent models often train smaller or medium-size models on very large token counts.
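A quick sketch of the ratio for a few hypothetical model configurations (the parameter and token counts below are illustrative, not descriptions of specific released models):

```python
def tokens_per_parameter(num_tokens: float, num_params: float) -> float:
    """Heuristic training-intensity ratio D / N."""
    return num_tokens / num_params

# Hypothetical runs: (parameters, training tokens)
runs = [(175e9, 300e9), (70e9, 1.4e12), (8e9, 15e12)]
for n, d in runs:
    print(f"N={n:.0e}, D={d:.0e}, tokens/param={tokens_per_parameter(d, n):.1f}")
```

The three runs span the range from likely undertrained (a few tokens per parameter) to very heavily trained (thousands of tokens per parameter).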

### Scaling Data Quality

Scaling laws are often written in terms of token count, but not all tokens have equal value.

A trillion low-quality tokens may train a worse model than a smaller, cleaner, more diverse corpus. Data quality affects loss, downstream performance, factuality, toxicity, code ability, multilingual ability, and reasoning behavior.

Important data properties include:

| Data property | Effect |
|---|---|
| Deduplication | Reduces memorization and benchmark leakage |
| Filtering | Removes low-quality or harmful documents |
| Diversity | Improves domain coverage |
| Code mixture | Improves programming and formal reasoning |
| Math mixture | Improves symbolic and quantitative reasoning |
| Multilingual balance | Improves non-English performance |
| Recency | Improves knowledge of recent facts |
| Document structure | Improves long-context behavior |

Scaling token count without controlling quality can produce misleading results. A model trained on more tokens may improve validation loss while becoming worse for specific tasks.
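As a minimal sketch of one of these steps, here is exact-duplicate removal by content hashing. Real pipelines use fuzzy or MinHash-style deduplication, heavier normalization, and document-quality classifiers; this only drops byte-identical documents after trivial normalization.

```python
import hashlib

def dedup_exact(docs: list[str]) -> list[str]:
    """Drop exact duplicate documents by hashing normalized text."""
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

docs = ["Hello world.", "hello world.", "Different text."]
print(dedup_exact(docs))  # -> ['Hello world.', 'Different text.']
```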

### Scaling Architecture

Scaling laws are usually measured within an architecture family. A law estimated for one transformer design may not transfer exactly to another design.

Architectural choices affect scaling efficiency:

| Choice | Scaling effect |
|---|---|
| Depth | More sequential computation, stronger hierarchical processing |
| Width | More parallel capacity per layer |
| Attention heads | More subspace interactions |
| Context length | More long-range conditioning, higher attention cost |
| Feedforward size | More token-wise transformation capacity |
| Normalization placement | Affects stability |
| Activation function | Affects optimization and expressivity |
| Mixture-of-experts | Increases parameters without activating all of them per token |

A dense transformer activates all parameters for each token. A mixture-of-experts model activates only part of the network per token. This changes the relation between parameter count, compute, and performance. For this reason, total parameters and active parameters should be reported separately.

### Scaling Context Length

Context length is another axis of scale. A model with a longer context window can condition on more tokens at inference time.

However, standard attention has cost quadratic in the sequence length $T$:

$$
\text{attention cost} \propto T^2.
$$

Doubling the context length therefore roughly quadruples attention cost, which is especially expensive during training.
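A small sketch of how quadratic attention cost grows relative to a baseline sequence length (the baseline of 2048 is an arbitrary illustrative choice):

```python
def relative_attention_cost(t: int, t_base: int = 2048) -> float:
    """Relative cost of quadratic attention versus a baseline length."""
    return (t / t_base) ** 2

for t in [2048, 4096, 8192, 32768]:
    print(f"T={t}: {relative_attention_cost(t):.0f}x")
```

A 16x longer context costs 256x more in attention alone, which is why long-context training usually relies on attention variants or careful scheduling rather than brute force.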

Long-context scaling introduces several questions:

| Question | Why it matters |
|---|---|
| Can the model use the full context? | Long windows are useless if attention ignores distant tokens |
| Was the model trained on long sequences? | Extrapolating beyond training length is unreliable |
| Does retrieval work better? | External retrieval may beat brute-force context expansion |
| How is position represented? | Positional encoding affects length generalization |
| What is the inference memory cost? | KV cache grows with context length |

A longer context window increases capacity for document-level tasks, codebases, multi-turn dialogue, and retrieval-augmented generation. It also increases serving cost.

### Scaling Inference

Training scaling and inference scaling differ.

Training cost is paid once. Inference cost is paid every time the model is used. A model that is cheap to train but expensive to serve may be unattractive in production.

Inference cost depends on:

| Factor | Effect |
|---|---|
| Parameter count | Larger models require more memory bandwidth and compute |
| Generated length | More output tokens increase cost |
| Context length | Longer prompts increase prefill cost |
| Batch size | Larger batches improve throughput but may increase latency |
| KV cache size | Long contexts require more memory |
| Quantization | Reduces memory and can improve throughput |
| Speculative decoding | Reduces latency by drafting tokens with a smaller model |

For many applications, the best model is not the largest model. It is the model that gives sufficient quality at acceptable latency and cost.
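For the KV cache specifically, a rough memory estimate for a standard multi-head attention transformer is two cached tensors (keys and values) per layer per token. The configuration numbers below are illustrative, not from a specific model.

```python
def kv_cache_bytes(batch: int, context_len: int, num_layers: int,
                   num_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: 2 (K and V) * layers * tokens * kv_heads * head_dim * bytes."""
    return 2 * num_layers * batch * context_len * num_kv_heads * head_dim * bytes_per_value

# Hypothetical 32-layer model, 32 KV heads of dim 128, fp16 cache, 32k context.
gib = kv_cache_bytes(batch=1, context_len=32768, num_layers=32,
                     num_kv_heads=32, head_dim=128) / 2**30
print(f"{gib:.1f} GiB")  # prints 16.0 GiB
```

This is why long contexts dominate serving memory, and why techniques like grouped-query attention (fewer KV heads) and cache quantization matter in practice.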

### Emergent Abilities and Smooth Scaling

Some abilities appear to emerge suddenly as models become larger. Examples often include multi-step reasoning, in-context learning, instruction following, tool use, and code synthesis.

However, apparent emergence can sometimes result from the metric. If a task is graded with a strict threshold, smooth improvement in underlying probability may look like sudden improvement in accuracy. For example, a model may gradually assign more probability to correct answers, but accuracy only rises once the correct answer becomes the top choice.
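A toy simulation of this metric effect, using entirely synthetic probabilities rather than measurements from any real model: the probability assigned to the correct answer rises smoothly with scale, but a strict argmax accuracy metric stays at zero until that probability crosses 0.5, then jumps.

```python
# Synthetic two-choice task: p(correct answer) improves smoothly with scale.
scales = [1, 2, 4, 8, 16, 32]
p_correct = [0.20, 0.30, 0.42, 0.55, 0.70, 0.85]

for s, p in zip(scales, p_correct):
    acc = 1.0 if p > 0.5 else 0.0  # "emergent" jump under a strict argmax metric
    print(f"scale={s:>2}: p(correct)={p:.2f}, accuracy={acc}")
```

The underlying quantity improves at every scale, yet the thresholded metric reports a sudden transition between scales 4 and 8.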

This means we should be careful with claims of emergence. Some abilities may reflect genuinely new internal organization at scale. Others may reflect measurement artifacts, prompting changes, or benchmark saturation.

### Scaling Laws and Evaluation

Validation loss is useful because it is stable, cheap to measure, and directly tied to the pretraining objective. But it is incomplete.

A lower language modeling loss may correlate with better downstream performance, but the relationship varies by task. For example, loss may improve while factual calibration, safety, or instruction following remain weak.

A practical evaluation suite should include:

| Evaluation type | Example |
|---|---|
| Language modeling | Held-out perplexity |
| Knowledge | Question answering |
| Reasoning | Math and logic tasks |
| Code | Unit-tested programming problems |
| Instruction following | Human or model-graded tasks |
| Robustness | Distribution-shift tests |
| Safety | Harmful request refusal and jailbreak resistance |
| Calibration | Confidence versus correctness |
| Efficiency | Latency, throughput, memory, cost |

Scaling laws help forecast loss. They do not replace evaluation.

### Budgeting a Training Run

A training plan should specify:

| Item | Description |
|---|---|
| Target model size | Parameters, layers, width, heads |
| Token budget | Number of tokens and data mixture |
| Compute budget | GPU type, count, duration, utilization |
| Sequence length | Training context window |
| Batch size | Global tokens per optimizer step |
| Optimizer | Usually AdamW or related variants |
| Learning rate schedule | Warmup, decay, final learning rate |
| Precision | fp32, fp16, bf16, or mixed |
| Checkpoint policy | Frequency and retention |
| Evaluation suite | Loss and downstream tasks |
| Stop criteria | Compute budget, loss target, or overfitting signal |

Scaling estimates should be conservative. Hardware failures, data pipeline bottlenecks, optimizer instability, and poor utilization can dominate the practical cost.
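As a sketch of converting a compute estimate into a wall-clock budget, the function below combines the rough $C \approx 6ND$ FLOP estimate with an assumed per-GPU throughput and utilization. The throughput and utilization figures are assumptions for illustration, not measurements of any particular hardware.

```python
def estimate_gpu_hours(num_params: float, num_tokens: float,
                       peak_flops_per_gpu: float = 1e15,
                       utilization: float = 0.4) -> float:
    """GPU-hours for C ~ 6*N*D FLOPs at an assumed sustained utilization."""
    total_flops = 6 * num_params * num_tokens
    sustained = peak_flops_per_gpu * utilization  # realized FLOP/s per GPU
    return total_flops / sustained / 3600

# Hypothetical 7e9-parameter model on 2e12 tokens at 40% utilization.
print(f"{estimate_gpu_hours(7e9, 2e12):.0f} GPU-hours")
```

Note how sensitive the result is to utilization: halving realized utilization doubles the GPU-hours, which is why the caveat below about practical overheads matters.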

### PyTorch View: Counting Tokens and Parameters

In PyTorch, parameter count can be computed directly:

```python
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())
```

Trainable parameter count excludes frozen parameters:

```python
def count_trainable_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Token count is usually tracked by the data pipeline. If each batch has shape `[B, T]`, then each step processes approximately

```python
tokens_per_step = B * T
```

For distributed training with `world_size` workers:

```python
global_tokens_per_step = micro_batch_size * sequence_length * grad_accum_steps * world_size
```

Total tokens after `num_steps` optimizer steps:

```python
total_tokens = global_tokens_per_step * num_steps
```

A simple training logger should record both parameter count and cumulative token count:

```python
import torch.nn.functional as F

params = count_parameters(model)
tokens_seen = 0

for step, batch in enumerate(loader):
    input_ids = batch["input_ids"]          # [B, T]
    tokens_seen += input_ids.numel()

    # Predict token t+1 from tokens up to t (shifted next-token objective).
    logits = model(input_ids[:, :-1])       # [B, T-1, vocab]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if step % 100 == 0:
        print({
            "step": step,
            "params": params,
            "tokens_seen": tokens_seen,
            "loss": float(loss.item()),
        })
```

This is not enough for a production training run, but it captures the basic accounting needed to reason about scale.

### Practical Lessons

Scaling laws give several practical rules.

First, train on enough tokens. A large model trained on too little data often wastes compute.

Second, report both model size and token count. Parameter count alone says little about training quality.

Third, treat data quality as a scaling variable. More data helps only when the additional data improves the training distribution.

Fourth, evaluate beyond loss. Pretraining loss is important, but downstream behavior depends on adaptation, prompting, decoding, retrieval, and safety work.

Fifth, optimize for the full lifecycle. Training cost matters, but inference cost often dominates over time.

### Summary

Scaling laws describe predictable relationships between loss, model size, data size, and compute. They help answer a central engineering question: for a fixed budget, how large should the model be, and how long should it be trained?

The key variables are parameters $N$, training tokens $D$, compute $C$, and validation loss $L$. Since compute is roughly proportional to $ND$, scaling requires a tradeoff between model capacity and training duration.

Modern language model training favors compute-balanced choices: enough parameters to learn rich structure, enough tokens to train those parameters well, and enough evaluation to detect failures that loss alone cannot show.

