# Language Modeling

Language modeling is the task of assigning probabilities to sequences of tokens. By learning which sequences are likely, a language model captures the statistical structure of language.

Given a token sequence:

$$
x = (x_1, x_2, \dots, x_T),
$$

a language model estimates:

$$
P(x_1, x_2, \dots, x_T).
$$

Modern language models are the foundation of many NLP systems, including text generation, dialogue systems, translation systems, summarizers, code assistants, and retrieval-augmented systems.

### Autoregressive Factorization

A sequence probability can be decomposed using the chain rule:

$$
P(x_1, x_2, \dots, x_T) =
\prod_{t=1}^{T}
P(x_t \mid x_{<t}),
$$

where:

$$
x_{<t} =
(x_1, x_2, \dots, x_{t-1}).
$$

The model predicts the next token conditioned on all previous tokens.

Example:

```text id="9t5rjn"
the cat sat on the
```

The model predicts the next token distribution:

| Token | Probability |
|---|---:|
| `mat` | 0.42 |
| `floor` | 0.11 |
| `chair` | 0.05 |
| `moon` | 0.0001 |

A good language model assigns high probability to plausible continuations.
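
Because of the chain rule, the log-probability of a whole sequence is just the sum of per-token log-probabilities. A minimal sketch, using random logits in place of a trained model:

```python id="p3kq8d"
import torch

# toy setup: random "model outputs" for a 5-token sequence, vocab of 10
T, V = 5, 10
logits = torch.randn(T, V)            # one logit row per position
tokens = torch.randint(0, V, (T,))    # the observed sequence

log_probs = torch.log_softmax(logits, dim=-1)        # [T, V]
token_log_probs = log_probs[torch.arange(T), tokens]  # [T]

# log P(x_1, ..., x_T) = sum_t log P(x_t | x_<t)
sequence_log_prob = token_log_probs.sum()
```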

### Vocabulary and Tokens

Language models operate on token sequences rather than raw text.

A tokenizer converts text into token IDs:

```text id="y59b4d"
"The cat sleeps."
```

may become:

```text id="xg6d2j"
[314, 892, 12011, 13]
```

The vocabulary size is:

$$
|V|.
$$

Each token corresponds to one row in the embedding matrix:

$$
E \in \mathbb{R}^{|V| \times D}.
$$

The input token IDs have shape:

```text id="4t2pup"
[B, T]
```

After embedding:

```text id="v2wr7w"
[B, T, D]
```

where:

| Symbol | Meaning |
|---|---|
| $B$ | Batch size |
| $T$ | Sequence length |
| $D$ | Embedding dimension |
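
A minimal sketch of the lookup, with illustrative sizes:

```python id="m7xw2r"
import torch
import torch.nn as nn

B, T, V, D = 2, 6, 1000, 64             # illustrative sizes
embedding = nn.Embedding(V, D)          # E: [|V|, D]

input_ids = torch.randint(0, V, (B, T))  # [B, T]
x = embedding(input_ids)                 # [B, T, D]
print(x.shape)  # torch.Size([2, 6, 64])
```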

### Next-Token Prediction

The central training objective of autoregressive language models is next-token prediction.

Suppose the token sequence is:

```text id="9wjrfu"
the cat sat
```

The model receives:

| Input | Target |
|---|---|
| `the` | `cat` |
| `the cat` | `sat` |
| `the cat sat` | `<eos>` |

The model predicts one token at each position.

If:

```text id="nd7d3m"
logits: [B, T, V]
```

then the target tensor is:

```text id="q6wh4e"
targets: [B, T]
```

The loss compares predicted logits with the next-token targets.
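
In practice, the input/target pairs above are not built by re-encoding each prefix; the targets are simply the inputs shifted by one position. A minimal sketch, assuming a batch of raw token IDs:

```python id="t5nd9c"
import torch

# one training sequence of token IDs (illustrative values)
tokens = torch.tensor([[5, 17, 42, 2]])  # "the cat sat <eos>"

input_ids = tokens[:, :-1]  # [5, 17, 42] -> model input
targets = tokens[:, 1:]     # [17, 42, 2] -> next-token targets
```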

### Causal Masking

Autoregressive models must not see future tokens during training.

For the sequence:

```text id="9q7k7j"
the cat sat
```

the prediction for `cat` must not depend on `sat`.

Transformers enforce this using a causal attention mask.

For sequence length $T=4$:

```text id="r8twz5"
1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1
```

Position $t$ may attend only to positions $\le t$.

In PyTorch:

```python id="7v9j8j"
import torch

T = 4

mask = torch.tril(torch.ones(T, T))
print(mask)
```

Output:

```text id="j83gmw"
tensor([[1., 0., 0., 0.],
        [1., 1., 0., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.]])
```

Without causal masking, the model could trivially copy future tokens during training.
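
Inside attention, the mask is typically applied by setting disallowed score entries to negative infinity before the softmax, so future positions receive zero weight. A minimal sketch (not the full attention computation):

```python id="h2vr6k"
import torch

T = 4
scores = torch.randn(T, T)               # raw attention scores
mask = torch.tril(torch.ones(T, T))

scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)  # future positions get weight 0
```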

### Cross-Entropy Training Objective

Autoregressive language models usually use cross-entropy loss.

Suppose:

```text id="grs4gh"
logits: [B, T, V]
targets: [B, T]
```

We flatten the tensors:

```python id="uhvczx"
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

B, T, V = logits.shape

loss = loss_fn(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)
```

The target at each position is the next token.

The model minimizes:

$$
-\log P(x_t \mid x_{<t}).
$$

A lower loss means the model assigns higher probability to the correct next token.

### Perplexity

Perplexity is a common evaluation metric for language models.

If the average negative log-likelihood per token is:

$$
L,
$$

then perplexity is:

$$
\operatorname{PPL} =
\exp(L).
$$

Perplexity measures how uncertain the model is.

Interpretation:

| Perplexity | Interpretation |
|---|---|
| Low | Model predicts tokens confidently |
| High | Model is uncertain |

If a model has perplexity 10, it behaves roughly as if it chooses among 10 equally likely options per step.

Lower perplexity usually indicates better language modeling performance, though it does not perfectly correlate with downstream usefulness or factual accuracy.
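
Since $L$ is just the mean per-token cross-entropy, perplexity falls directly out of the training loss. A minimal sketch:

```python id="c8jm4t"
import torch

# suppose `loss` is the mean per-token cross-entropy (illustrative value)
loss = torch.tensor(2.3026)

ppl = torch.exp(loss)
print(ppl)  # ~10: as if choosing among ~10 equally likely tokens per step
```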

### Recurrent Language Models

Before transformers, many language models used recurrent neural networks.

An RNN language model processes tokens sequentially:

$$
h_t = f(h_{t-1}, x_t).
$$

The hidden state summarizes previous tokens.

An LSTM language model:

```python id="vlh6u0"
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            batch_first=True,
        )

        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        h, _ = self.lstm(x)
        logits = self.output(h)
        return logits
```

RNN models struggle with long-range dependencies and parallelization. Transformers largely replaced them for large-scale language modeling.

### Transformer Language Models

Transformer language models use self-attention instead of recurrence.

Advantages:

| Advantage | Description |
|---|---|
| Parallel training | All tokens processed simultaneously |
| Long-range interactions | Direct token-to-token attention |
| Scalable training | Efficient GPU utilization |
| Better representation learning | Rich contextual embeddings |

A decoder-only transformer computes:

```text id="4hm21v"
input_ids: [B, T]
-> embeddings
-> transformer blocks
-> hidden_states: [B, T, D]
-> output projection
-> logits: [B, T, V]
```

Each position predicts the next token.

Modern large language models such as GPT-style systems use this architecture.
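
A minimal sketch of this pipeline, reusing PyTorch's `nn.TransformerEncoder` with a causal mask to get decoder-only behavior (positional encoding omitted for brevity; all sizes are illustrative):

```python id="b6ws3n"
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.output = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):                # input_ids: [B, T]
        T = input_ids.size(1)
        # boolean causal mask: True marks positions that may NOT be attended
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        x = self.embedding(input_ids)            # [B, T, D]
        h = self.blocks(x, mask=causal)          # [B, T, D]
        return self.output(h)                    # [B, T, V]
```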

### Weight Tying

Many language models tie input embeddings and output projection weights.

The embedding matrix:

$$
E \in \mathbb{R}^{|V| \times D}
$$

is reused for output logits:

$$
z_t = h_t E^\top.
$$

Advantages:

| Benefit | Description |
|---|---|
| Fewer parameters | Reduced memory usage |
| Better generalization | Shared token representations |
| Faster training | Smaller model size |

Weight tying is now common in transformer language models.
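
In PyTorch, tying is a one-line assignment, since `nn.Embedding(V, D).weight` and `nn.Linear(D, V).weight` both have shape `[V, D]`. A minimal sketch:

```python id="f9tk7p"
import torch.nn as nn

V, D = 1000, 64  # illustrative sizes

embedding = nn.Embedding(V, D)
output = nn.Linear(D, V, bias=False)

# share one parameter tensor between input and output layers
output.weight = embedding.weight
```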

### Positional Encoding

Transformers do not inherently know token order.

Example:

```text id="djs0md"
dog bites man
man bites dog
```

contain the same tokens but different meanings.

Positional information must therefore be added.

A positional encoding provides a vector:

$$
p_t
$$

for each position $t$.

The transformer input becomes:

$$
x_t = e_t + p_t,
$$

where:

| Symbol | Meaning |
|---|---|
| $e_t$ | Token embedding |
| $p_t$ | Positional embedding |

Modern models use several positional methods:

| Method | Description |
|---|---|
| Learned embeddings | Trainable position vectors |
| Sinusoidal encoding | Fixed trigonometric patterns |
| Rotary embeddings | Rotate hidden dimensions |
| Relative attention | Encode token distance |

Position encoding strongly affects long-context behavior.
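
As one concrete example, the fixed sinusoidal encoding from the original transformer can be computed directly. A minimal sketch, assuming an even dimension $D$:

```python id="a4qn2x"
import torch

def sinusoidal_encoding(T, D):
    # pe[t, 2k]   = sin(t / 10000^(2k/D))
    # pe[t, 2k+1] = cos(t / 10000^(2k/D))
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)  # [T, 1]
    i = torch.arange(0, D, 2, dtype=torch.float32)           # even dims: 2k
    freq = torch.exp(-i / D * torch.log(torch.tensor(10000.0)))
    pe = torch.zeros(T, D)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe  # [T, D], added to token embeddings position-wise
```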

### Context Length

A transformer attends over a finite context window.

If the maximum context length is:

$$
L,
$$

then tokens beyond $L$ positions cannot be attended to directly.

Longer context windows improve:

| Capability | Example |
|---|---|
| Long-document reasoning | Research papers |
| Multi-turn dialogue | Long conversations |
| Code understanding | Large repositories |
| Retrieval integration | Many retrieved passages |

However, the cost of dense self-attention grows as:

$$
O(T^2),
$$

where $T$ is sequence length.

This motivates research into sparse attention, memory systems, state-space models, and linear attention methods.

### Training Data

Language models are trained on large corpora.

Common data sources:

| Source | Example |
|---|---|
| Web pages | Common Crawl |
| Books | Digitized books |
| Code repositories | GitHub |
| Scientific papers | arXiv |
| Dialogues | Chat logs |
| Documentation | Technical manuals |

Training quality depends heavily on data quality.

Problems include:

| Issue | Description |
|---|---|
| Duplicates | Memorization risk |
| Spam | Low-quality language |
| Toxic content | Harmful outputs |
| Imbalance | Overrepresentation of domains |
| Copyright concerns | Legal restrictions |

Data filtering and deduplication are important parts of large-scale language model training.

### Scaling Laws

Large language models exhibit scaling behavior.

Performance improves predictably as:

| Variable | Increases |
|---|---|
| Model parameters | Larger networks |
| Training tokens | More data |
| Compute | More optimization steps |

Empirical scaling laws show approximate power-law relationships between loss and compute scale.

However, scaling eventually encounters constraints:

| Constraint | Example |
|---|---|
| Compute cost | GPU expense |
| Memory limits | Model size |
| Data quality | Finite high-quality text |
| Latency | Inference speed |
| Energy usage | Training power consumption |

Scaling alone does not guarantee reasoning ability, factuality, or safety.

### Inference and KV Caching

Autoregressive generation repeatedly predicts one token at a time.

Naively recomputing all attention states each step is expensive.

Transformers therefore cache previous key and value tensors.

At generation step $t$:

| Cached tensor | Shape |
|---|---|
| Keys | `[B, H, T, D_h]` |
| Values | `[B, H, T, D_h]` |

where:

| Symbol | Meaning |
|---|---|
| $H$ | Number of attention heads |
| $D_h$ | Head dimension |

KV caching avoids recomputing attention over the entire prefix at every step: each new token computes attention only against the cached keys and values, so the per-step attention cost grows linearly rather than quadratically in the prefix length.
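
A minimal sketch of the cache update at one generation step (names and shapes are illustrative, not a specific library's API):

```python id="e7hd5m"
import torch

def append_kv(cache, k_new, v_new):
    # cache: (keys, values) with shape [B, H, T, D_h], or None at step 0
    # k_new, v_new: [B, H, 1, D_h] for the newly generated token
    if cache is None:
        return (k_new, v_new)
    k, v = cache
    return (torch.cat([k, k_new], dim=2),   # keys grow along T
            torch.cat([v, v_new], dim=2))   # values grow along T
```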

### Sampling from Language Models

The model outputs logits:

$$
z_t \in \mathbb{R}^{V}.
$$

A decoding algorithm converts logits into tokens.

Common methods:

| Method | Behavior |
|---|---|
| Greedy decoding | Deterministic highest-probability token |
| Beam search | Explore several sequences |
| Top-k sampling | Restrict to top-k tokens |
| Top-p sampling | Restrict cumulative probability mass |
| Temperature sampling | Adjust randomness |

Generation quality depends strongly on decoding configuration.

Low temperature:

| Effect |
|---|
| More deterministic |
| More repetitive |
| Less creative |

High temperature:

| Effect |
|---|
| More diverse |
| More random |
| Less stable |
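
A minimal sketch combining temperature scaling with top-k filtering (parameter values are illustrative):

```python id="g3cp8v"
import torch

def sample_next_token(logits, temperature=0.8, top_k=50):
    # logits: [V] for the current position; assumes top_k <= V
    logits = logits / temperature               # <1 sharpens, >1 flattens
    values, indices = torch.topk(logits, top_k)
    filtered = torch.full_like(logits, float("-inf"))
    filtered[indices] = values                  # keep only the top-k logits
    probs = torch.softmax(filtered, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```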

### Emergent Behaviors

Large language models sometimes exhibit capabilities not obvious in smaller models.

Examples:

| Capability | Example |
|---|---|
| In-context learning | Learn from prompt examples |
| Few-shot reasoning | Solve unseen tasks |
| Tool coordination | Use external APIs |
| Chain-of-thought reasoning | Multi-step explanations |
| Code synthesis | Generate programs |

The exact causes remain an active research topic.

Some behaviors appear gradually with scale. Others appear more abruptly.

### Failure Modes

Language models have important limitations.

| Failure mode | Example |
|---|---|
| Hallucination | False factual claims |
| Memorization | Reproducing training data |
| Bias | Harmful stereotypes |
| Prompt injection | Unsafe instruction following |
| Context confusion | Losing track of dialogue |
| Arithmetic weakness | Calculation errors |

Language models optimize token prediction, not truth, reasoning correctness, or safety.

This distinction is critical when deploying systems in high-stakes settings.

### Pretraining and Fine-Tuning

Most modern systems use two stages:

| Stage | Purpose |
|---|---|
| Pretraining | Learn general language structure |
| Fine-tuning | Adapt to downstream tasks |

Pretraining uses large-scale next-token prediction.

Fine-tuning adapts the model for:

| Task | Example |
|---|---|
| Dialogue | Chat systems |
| Translation | Multilingual systems |
| Coding | Code generation |
| QA | Reading comprehension |
| Summarization | Condensed outputs |

Instruction tuning and RLHF further shape model behavior.

### PyTorch Training Example

A simplified transformer language model training step:

```python id="23a1zb"
def training_step(model, batch, optimizer):
    input_ids = batch["input_ids"]
    targets = batch["targets"]

    logits = model(input_ids)
    # logits: [B, T, V]

    B, T, V = logits.shape

    loss_fn = nn.CrossEntropyLoss()

    loss = loss_fn(
        logits.reshape(B * T, V),
        targets.reshape(B * T),
    )

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    return loss.item()
```

The targets are usually shifted by one token relative to the inputs.

### Summary

Language modeling predicts token sequences autoregressively. Modern language models use transformer architectures with causal masking and next-token prediction objectives.

Key components include tokenization, embeddings, positional encoding, self-attention, output projections, and decoding algorithms. Training uses cross-entropy loss over large text corpora. Evaluation often uses perplexity.

Large language models extend basic language modeling into dialogue, reasoning, retrieval augmentation, tool use, and multimodal systems, but they still inherit core limitations from probabilistic next-token prediction.

