# Language Modeling

Language modeling is the task of assigning probabilities to sequences of tokens. By learning which sequences are likely, a language model captures the statistical structure of language.

Given a token sequence:

$$
x = (x_1, x_2, \dots, x_T),
$$

a language model estimates:

$$
P(x_1, x_2, \dots, x_T).
$$

Modern language models are the foundation of many NLP systems, including text generation, dialogue systems, translation systems, summarizers, code assistants, and retrieval-augmented systems.

### Autoregressive Factorization

A sequence probability can be decomposed using the chain rule:

$$
P(x_1, x_2, \dots, x_T) =
\prod_{t=1}^{T}
P(x_t \mid x_{<t}),
$$

where:

$$
x_{<t} =
(x_1, x_2, \dots, x_{t-1}).
$$

The model predicts the next token conditioned on all previous tokens.

Example:

```text id="9t5rjn"
the cat sat on the
```

The model predicts the next token distribution:

| Token | Probability |
|---|---:|
| `mat` | 0.42 |
| `floor` | 0.11 |
| `chair` | 0.05 |
| `moon` | 0.0001 |

A good language model assigns high probability to plausible continuations.
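
Because of the chain rule, the log-probability of a whole sequence is just the sum of per-token log-probabilities. A minimal sketch, using random logits in place of a trained model:

```python id="p3kq8d"
import torch

# toy setup: random "model outputs" for a 5-token sequence, vocab of 10
T, V = 5, 10
logits = torch.randn(T, V)            # one logit row per position
tokens = torch.randint(0, V, (T,))    # the observed sequence

log_probs = torch.log_softmax(logits, dim=-1)        # [T, V]
token_log_probs = log_probs[torch.arange(T), tokens]  # [T]

# log P(x_1, ..., x_T) = sum_t log P(x_t | x_<t)
sequence_log_prob = token_log_probs.sum()
```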

### Vocabulary and Tokens

Language models operate on token sequences rather than raw text.

A tokenizer converts text into token IDs:

```text id="y59b4d"
"The cat sleeps."
```

may become:

```text id="xg6d2j"
[314, 892, 12011, 13]
```

The vocabulary size is:

$$
|V|.
$$

Each token corresponds to one row in the embedding matrix:

$$
E \in \mathbb{R}^{|V| \times D}.
$$

The input token IDs have shape:

```text id="4t2pup"
[B, T]
```

After embedding:

```text id="v2wr7w"
[B, T, D]
```

where:

| Symbol | Meaning |
|---|---|
| $B$ | Batch size |
| $T$ | Sequence length |
| $D$ | Embedding dimension |
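
A minimal sketch of the lookup, with illustrative sizes:

```python id="m7xw2r"
import torch
import torch.nn as nn

B, T, V, D = 2, 6, 1000, 64             # illustrative sizes
embedding = nn.Embedding(V, D)          # E: [|V|, D]

input_ids = torch.randint(0, V, (B, T))  # [B, T]
x = embedding(input_ids)                 # [B, T, D]
print(x.shape)  # torch.Size([2, 6, 64])
```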

### Next-Token Prediction

The central training objective of autoregressive language models is next-token prediction.

Suppose the token sequence is:

```text id="9wjrfu"
the cat sat
```

The model receives:

| Input | Target |
|---|---|
| `the` | `cat` |
| `the cat` | `sat` |
| `the cat sat` | `<eos>` |

The model predicts one token at each position.

If:

```text id="nd7d3m"
logits: [B, T, V]
```

then the target tensor is:

```text id="q6wh4e"
targets: [B, T]
```

The loss compares predicted logits with the next-token targets.
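
In practice, the input/target pairs above are not built by re-encoding each prefix; the targets are simply the inputs shifted by one position. A minimal sketch, assuming a batch of raw token IDs:

```python id="t5nd9c"
import torch

# one training sequence of token IDs (illustrative values)
tokens = torch.tensor([[5, 17, 42, 2]])  # "the cat sat <eos>"

input_ids = tokens[:, :-1]  # [5, 17, 42] -> model input
targets = tokens[:, 1:]     # [17, 42, 2] -> next-token targets
```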

### Causal Masking

Autoregressive models must not see future tokens during training.

For the sequence:

```text id="9q7k7j"
the cat sat
```

the prediction for `cat` must not depend on `sat`.

Transformers enforce this using a causal attention mask.

For sequence length $T=4$:

```text id="r8twz5"
1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1
```

Position $t$ may attend only to positions $\le t$.

In PyTorch:

```python id="7v9j8j"
import torch

T = 4

mask = torch.tril(torch.ones(T, T))
print(mask)
```

Output:

```text id="j83gmw"
tensor([[1., 0., 0., 0.],
        [1., 1., 0., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.]])
```

Without causal masking, the model could trivially copy future tokens during training.
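
Inside attention, the mask is typically applied by setting disallowed score entries to negative infinity before the softmax, so future positions receive zero weight. A minimal sketch (not the full attention computation):

```python id="h2vr6k"
import torch

T = 4
scores = torch.randn(T, T)               # raw attention scores
mask = torch.tril(torch.ones(T, T))

scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)  # future positions get weight 0
```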

### Cross-Entropy Training Objective

Autoregressive language models usually use cross-entropy loss.

Suppose:

```text id="grs4gh"
logits: [B, T, V]
targets: [B, T]
```

We flatten the tensors:

```python id="uhvczx"
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

B, T, V = logits.shape

loss = loss_fn(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)
```

The target at each position is the next token.

The model minimizes:

$$
-\log P(x_t \mid x_{<t}).
$$

A lower loss means the model assigns higher probability to the correct next token.

### Perplexity

Perplexity is a common evaluation metric for language models.

If the average negative log-likelihood per token is:

$$
L,
$$

then perplexity is:

$$
\operatorname{PPL} =
\exp(L).
$$

Perplexity measures how uncertain the model is.

Interpretation:

| Perplexity | Interpretation |
|---|---|
| Low | Model predicts tokens confidently |
| High | Model is uncertain |

If a model has perplexity 10, it behaves roughly as if it chooses among 10 equally likely options per step.

Lower perplexity usually indicates better language modeling performance, though it does not perfectly correlate with downstream usefulness or factual accuracy.
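
Since $L$ is just the mean per-token cross-entropy, perplexity falls directly out of the training loss. A minimal sketch:

```python id="c8jm4t"
import torch

# suppose `loss` is the mean per-token cross-entropy (illustrative value)
loss = torch.tensor(2.3026)

ppl = torch.exp(loss)
print(ppl)  # ~10: as if choosing among ~10 equally likely tokens per step
```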

### Recurrent Language Models

Before transformers, many language models used recurrent neural networks.

An RNN language model processes tokens sequentially:

$$
h_t = f(h_{t-1}, x_t).
$$

The hidden state summarizes previous tokens.

An LSTM language model:

```python id="vlh6u0"
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            batch_first=True,
        )

        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        h, _ = self.lstm(x)
        logits = self.output(h)
        return logits
```

RNN models struggle with long-range dependencies and parallelization. Transformers largely replaced them for large-scale language modeling.

### Transformer Language Models

Transformer language models use self-attention instead of recurrence.

Advantages:

| Advantage | Description |
|---|---|
| Parallel training | All tokens processed simultaneously |
| Long-range interactions | Direct token-to-token attention |
| Scalable training | Efficient GPU utilization |
| Better representation learning | Rich contextual embeddings |

A decoder-only transformer computes:

```text id="4hm21v"
input_ids: [B, T]
-> embeddings
-> transformer blocks
-> hidden_states: [B, T, D]
-> output projection
-> logits: [B, T, V]
```

Each position predicts the next token.

Modern large language models such as GPT-style systems use this architecture.
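
A minimal sketch of this pipeline, reusing PyTorch's `nn.TransformerEncoder` with a causal mask to get decoder-only behavior (positional encoding omitted for brevity; all sizes are illustrative):

```python id="b6ws3n"
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.output = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):                # input_ids: [B, T]
        T = input_ids.size(1)
        # boolean causal mask: True marks positions that may NOT be attended
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        x = self.embedding(input_ids)            # [B, T, D]
        h = self.blocks(x, mask=causal)          # [B, T, D]
        return self.output(h)                    # [B, T, V]
```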

### Weight Tying

Many language models tie input embeddings and output projection weights.

The embedding matrix:

$$
E \in \mathbb{R}^{|V| \times D}
$$

is reused for output logits:

$$
z_t = h_t E^\top.
$$

Advantages:

| Benefit | Description |
|---|---|
| Fewer parameters | Reduced memory usage |
| Better generalization | Shared token representations |
| Faster training | Smaller model size |

Weight tying is now common in transformer language models.
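
In PyTorch, tying is a one-line assignment, since `nn.Embedding(V, D).weight` and `nn.Linear(D, V).weight` both have shape `[V, D]`. A minimal sketch:

```python id="f9tk7p"
import torch.nn as nn

V, D = 1000, 64  # illustrative sizes

embedding = nn.Embedding(V, D)
output = nn.Linear(D, V, bias=False)

# share one parameter tensor between input and output layers
output.weight = embedding.weight
```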

### Positional Encoding

Transformers do not inherently know token order.

Example:

```text id="djs0md"
dog bites man
man bites dog
```

contain the same tokens but different meanings.

Positional information must therefore be added.

A positional encoding provides a vector:

$$
p_t
$$

for each position $t$.

The transformer input becomes:

$$
x_t = e_t + p_t,
$$

where:

| Symbol | Meaning |
|---|---|
| $e_t$ | Token embedding |
| $p_t$ | Positional embedding |

Modern models use several positional methods:

| Method | Description |
|---|---|
| Learned embeddings | Trainable position vectors |
| Sinusoidal encoding | Fixed trigonometric patterns |
| Rotary embeddings | Rotate hidden dimensions |
| Relative attention | Encode token distance |

Position encoding strongly affects long-context behavior.
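
As one concrete example, the fixed sinusoidal encoding from the original transformer can be computed directly. A minimal sketch, assuming an even dimension $D$:

```python id="a4qn2x"
import torch

def sinusoidal_encoding(T, D):
    # pe[t, 2k]   = sin(t / 10000^(2k/D))
    # pe[t, 2k+1] = cos(t / 10000^(2k/D))
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)  # [T, 1]
    i = torch.arange(0, D, 2, dtype=torch.float32)           # even dims: 2k
    freq = torch.exp(-i / D * torch.log(torch.tensor(10000.0)))
    pe = torch.zeros(T, D)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe  # [T, D], added to token embeddings position-wise
```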

### Context Length

A transformer attends over a finite context window.

If the maximum context length is:

$$
L,
$$

then tokens beyond $L$ positions cannot be attended to directly.

Longer context windows improve:

| Capability | Example |
|---|---|
| Long-document reasoning | Research papers |
| Multi-turn dialogue | Long conversations |
| Code understanding | Large repositories |
| Retrieval integration | Many retrieved passages |

However, the cost of dense self-attention grows as:

$$
O(T^2),
$$

where $T$ is sequence length.

This motivates research into sparse attention, memory systems, state-space models, and linear attention methods.

### Training Data

Language models are trained on large corpora.

Common data sources:

| Source | Example |
|---|---|
| Web pages | Common Crawl |
| Books | Digitized books |
| Code repositories | GitHub |
| Scientific papers | arXiv |
| Dialogues | Chat logs |
| Documentation | Technical manuals |

Training quality depends heavily on data quality.

Problems include:

| Issue | Description |
|---|---|
| Duplicates | Memorization risk |
| Spam | Low-quality language |
| Toxic content | Harmful outputs |
| Imbalance | Overrepresentation of domains |
| Copyright concerns | Legal restrictions |

Data filtering and deduplication are important parts of large-scale language model training.

### Scaling Laws

Large language models exhibit scaling behavior.

Performance improves predictably as:

| Variable | Increases |
|---|---|
| Model parameters | Larger networks |
| Training tokens | More data |
| Compute | More optimization steps |

Empirical scaling laws show approximate power-law relationships between loss and compute scale.

However, scaling eventually encounters constraints:

| Constraint | Example |
|---|---|
| Compute cost | GPU expense |
| Memory limits | Model size |
| Data quality | Finite high-quality text |
| Latency | Inference speed |
| Energy usage | Training power consumption |

Scaling alone does not guarantee reasoning ability, factuality, or safety.

### Inference and KV Caching

Autoregressive generation repeatedly predicts one token at a time.

Naively recomputing all attention states each step is expensive.

Transformers therefore cache previous key and value tensors.

At generation step $t$:

| Cached tensor | Shape |
|---|---|
| Keys | `[B, H, T, D_h]` |
| Values | `[B, H, T, D_h]` |

where:

| Symbol | Meaning |
|---|---|
| $H$ | Number of attention heads |
| $D_h$ | Head dimension |

KV caching avoids recomputing attention over the entire prefix at every step: each new token computes attention only against the cached keys and values, so the per-step attention cost grows linearly rather than quadratically in the prefix length.
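
A minimal sketch of the cache update at one generation step (names and shapes are illustrative, not a specific library's API):

```python id="e7hd5m"
import torch

def append_kv(cache, k_new, v_new):
    # cache: (keys, values) with shape [B, H, T, D_h], or None at step 0
    # k_new, v_new: [B, H, 1, D_h] for the newly generated token
    if cache is None:
        return (k_new, v_new)
    k, v = cache
    return (torch.cat([k, k_new], dim=2),   # keys grow along T
            torch.cat([v, v_new], dim=2))   # values grow along T
```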

### Sampling from Language Models

The model outputs logits:

$$
z_t \in \mathbb{R}^{V}.
$$

A decoding algorithm converts logits into tokens.

Common methods:

| Method | Behavior |
|---|---|
| Greedy decoding | Deterministic highest-probability token |
| Beam search | Explore several sequences |
| Top-k sampling | Restrict to top-k tokens |
| Top-p sampling | Restrict cumulative probability mass |
| Temperature sampling | Adjust randomness |

Generation quality depends strongly on decoding configuration.

Low temperature:

| Effect |
|---|
| More deterministic |
| More repetitive |
| Less creative |

High temperature:

| Effect |
|---|
| More diverse |
| More random |
| Less stable |
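
A minimal sketch combining temperature scaling with top-k filtering (parameter values are illustrative):

```python id="g3cp8v"
import torch

def sample_next_token(logits, temperature=0.8, top_k=50):
    # logits: [V] for the current position; assumes top_k <= V
    logits = logits / temperature               # <1 sharpens, >1 flattens
    values, indices = torch.topk(logits, top_k)
    filtered = torch.full_like(logits, float("-inf"))
    filtered[indices] = values                  # keep only the top-k logits
    probs = torch.softmax(filtered, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```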

### Emergent Behaviors

Large language models sometimes exhibit capabilities not obvious in smaller models.

Examples:

| Capability | Example |
|---|---|
| In-context learning | Learn from prompt examples |
| Few-shot reasoning | Solve unseen tasks |
| Tool coordination | Use external APIs |
| Chain-of-thought reasoning | Multi-step explanations |
| Code synthesis | Generate programs |

The exact causes remain an active research topic.

Some behaviors appear gradually with scale. Others appear more abruptly.

### Failure Modes

Language models have important limitations.

| Failure mode | Example |
|---|---|
| Hallucination | False factual claims |
| Memorization | Reproducing training data |
| Bias | Harmful stereotypes |
| Prompt injection | Unsafe instruction following |
| Context confusion | Losing track of dialogue |
| Arithmetic weakness | Calculation errors |

Language models optimize token prediction, not truth, reasoning correctness, or safety.

This distinction is critical when deploying systems in high-stakes settings.

### Pretraining and Fine-Tuning

Most modern systems use two stages:

| Stage | Purpose |
|---|---|
| Pretraining | Learn general language structure |
| Fine-tuning | Adapt to downstream tasks |

Pretraining uses large-scale next-token prediction.

Fine-tuning adapts the model for:

| Task | Example |
|---|---|
| Dialogue | Chat systems |
| Translation | Multilingual systems |
| Coding | Code generation |
| QA | Reading comprehension |
| Summarization | Condensed outputs |

Instruction tuning and RLHF further shape model behavior.

### PyTorch Training Example

A simplified transformer language model training step:

```python id="23a1zb"
def training_step(model, batch, optimizer):
    input_ids = batch["input_ids"]
    targets = batch["targets"]

    logits = model(input_ids)
    # logits: [B, T, V]

    B, T, V = logits.shape

    loss_fn = nn.CrossEntropyLoss()

    loss = loss_fn(
        logits.reshape(B * T, V),
        targets.reshape(B * T),
    )

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    return loss.item()
```

The targets are usually shifted by one token relative to the inputs.

### Summary

Language modeling predicts token sequences autoregressively. Modern language models use transformer architectures with causal masking and next-token prediction objectives.

Key components include tokenization, embeddings, positional encoding, self-attention, output projections, and decoding algorithms. Training uses cross-entropy loss over large text corpora. Evaluation often uses perplexity.

Large language models extend basic language modeling into dialogue, reasoning, retrieval augmentation, tool use, and multimodal systems, but they still inherit core limitations from probabilistic next-token prediction.

