Language Modeling

Language modeling is the task of predicting text sequences. A language model assigns probabilities to sequences of tokens and learns the statistical structure of language.

Given a token sequence:

x = (x_1, x_2, \dots, x_T),

a language model estimates:

P(x_1, x_2, \dots, x_T).

Modern language models are the foundation of many NLP systems, including text generation, dialogue, translation, summarization, code assistants, and retrieval-augmented systems.

Autoregressive Factorization

A sequence probability can be decomposed using the chain rule:

P(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{<t}),

where:

x_{<t} = (x_1, x_2, \dots, x_{t-1}).

The model predicts the next token conditioned on all previous tokens.

Example:

the cat sat on the

The model predicts the next token distribution:

Token    Probability
mat      0.42
floor    0.11
chair    0.05
moon     0.0001

A good language model assigns high probability to plausible continuations.
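
As a quick illustration of the chain rule, the log-probability of a whole sequence is the sum of per-step next-token log-probabilities. A minimal sketch, using hypothetical per-step probabilities like those in the table above:

import math

# hypothetical next-token probabilities P(x_t | x_<t) at each step
step_probs = [0.20, 0.42, 0.31, 0.18]

# chain rule: log P(x_1, ..., x_T) = sum_t log P(x_t | x_<t)
log_prob = sum(math.log(p) for p in step_probs)

print(log_prob)            # total sequence log-probability
print(math.exp(log_prob))  # the sequence probability itself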

Vocabulary and Tokens

Language models operate on token sequences rather than raw text.

A tokenizer converts text into token IDs:

"The cat sleeps."

may become:

[314, 892, 12011, 13]

The vocabulary size is:

|V|.

Each token corresponds to one row in the embedding matrix:

E \in \mathbb{R}^{|V| \times D}.

The input token IDs have shape:

[B, T]

After embedding:

[B, T, D]

where:

Symbol    Meaning
B         Batch size
T         Sequence length
D         Embedding dimension
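
A minimal sketch of how these shapes line up in PyTorch, using arbitrary example sizes:

import torch
import torch.nn as nn

B, T, D, vocab_size = 2, 5, 16, 1000              # arbitrary example sizes

embedding = nn.Embedding(vocab_size, D)           # E: [|V|, D]
input_ids = torch.randint(0, vocab_size, (B, T))  # token IDs: [B, T]

x = embedding(input_ids)
print(x.shape)  # torch.Size([2, 5, 16]), i.e. [B, T, D]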

Next-Token Prediction

The central training objective of autoregressive language models is next-token prediction.

Suppose the token sequence is:

the cat sat

The model receives:

Input          Target
the            cat
the cat        sat
the cat sat    <eos>

The model predicts one token at each position.

If:

logits: [B, T, V]

then the target tensor is:

targets: [B, T]

The loss compares predicted logits with the next-token targets.
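
In practice, inputs and targets are usually built by shifting one token sequence against itself. A minimal sketch with hypothetical token IDs for "the cat sat <eos>":

import torch

tokens = torch.tensor([[14, 92, 408, 2]])  # hypothetical IDs for "the cat sat <eos>"

input_ids = tokens[:, :-1]  # "the cat sat"
targets = tokens[:, 1:]     # "cat sat <eos>"

print(input_ids)  # tensor([[ 14,  92, 408]])
print(targets)    # tensor([[ 92, 408,   2]])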

Causal Masking

Autoregressive models must not see future tokens during training.

For the sequence:

the cat sat

the prediction for cat must not depend on sat.

Transformers enforce this using a causal attention mask.

For sequence length T = 4:

1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1

Position t may attend only to positions \le t.

In PyTorch:

import torch

T = 4

mask = torch.tril(torch.ones(T, T))
print(mask)

Output:

tensor([[1., 0., 0., 0.],
        [1., 1., 0., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.]])

Without causal masking, the model could trivially copy future tokens during training.
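
A sketch of how such a mask is typically applied to attention scores before the softmax, assuming scores of shape [B, H, T, T] (details vary between implementations):

import torch
import torch.nn.functional as F

B, H, T = 1, 1, 4
scores = torch.randn(B, H, T, T)                    # raw attention scores

mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))   # block attention to future positions

weights = F.softmax(scores, dim=-1)                 # each row sums to 1 over allowed positions
print(weights[0, 0])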

Cross-Entropy Training Objective

Autoregressive language models usually use cross-entropy loss.

Suppose:

logits: [B, T, V]
targets: [B, T]

We flatten the tensors:

import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

B, T, V = logits.shape

loss = loss_fn(
    logits.reshape(B * T, V),   # flatten to [B * T, V]
    targets.reshape(B * T),     # flatten to [B * T]
)

The target at each position is the next token.

The model minimizes:

-\log P(x_t \mid x_{<t}).

A lower loss means the model assigns higher probability to the correct next token.

Perplexity

Perplexity is a common evaluation metric for language models.

If the average negative log-likelihood per token is:

L,

then perplexity is:

\operatorname{PPL} = \exp(L).

Perplexity measures how uncertain the model is.

Interpretation:

Perplexity    Interpretation
Low           Model predicts tokens confidently
High          Model is uncertain

If a model has perplexity 10, it behaves roughly as if it chooses among 10 equally likely options per step.

Lower perplexity usually indicates better language modeling performance, though it does not perfectly correlate with downstream usefulness or factual accuracy.
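
A minimal sketch of the relationship between average cross-entropy loss and perplexity:

import torch

mean_nll = torch.tensor(2.3026)   # hypothetical average negative log-likelihood per token
perplexity = torch.exp(mean_nll)

print(perplexity)  # ~10.0: roughly like choosing among 10 equally likely tokens per step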

Recurrent Language Models

Before transformers, many language models used recurrent neural networks.

An RNN language model processes tokens sequentially:

h_t = f(h_{t-1}, x_t).

The hidden state summarizes previous tokens.

An LSTM language model:

import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            batch_first=True,
        )

        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_ids):
        # input_ids: [B, T]
        x = self.embedding(input_ids)   # [B, T, embedding_dim]
        h, _ = self.lstm(x)             # [B, T, hidden_dim]
        logits = self.output(h)         # [B, T, vocab_size]
        return logits
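
A quick usage sketch with arbitrary sizes:

import torch

model = LSTMLanguageModel(vocab_size=1000, embedding_dim=64, hidden_dim=128)

input_ids = torch.randint(0, 1000, (2, 10))  # [B, T]
logits = model(input_ids)
print(logits.shape)  # torch.Size([2, 10, 1000]), i.e. [B, T, V]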

RNN models struggle with long-range dependencies and parallelization. Transformers largely replaced them for large-scale language modeling.

Transformer Language Models

Transformer language models use self-attention instead of recurrence.

Advantages:

Advantage                         Description
Parallel training                 All tokens processed simultaneously
Long-range interactions           Direct token-to-token attention
Scalable training                 Efficient GPU utilization
Better representation learning    Rich contextual embeddings

A decoder-only transformer computes:

input_ids: [B, T]
-> embeddings
-> transformer blocks
-> hidden_states: [B, T, D]
-> output projection
-> logits: [B, T, V]

Each position predicts the next token.

Modern large language models such as GPT-style systems use this architecture.
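
A highly simplified decoder-only skeleton that mirrors this flow, assembled from standard PyTorch modules (a sketch under simplified assumptions, not a production GPT architecture):

import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    def __init__(self, vocab_size, d_model=64, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.output = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        B, T = input_ids.shape
        pos = torch.arange(T, device=input_ids.device)
        x = self.token_emb(input_ids) + self.pos_emb(pos)  # [B, T, D]
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(input_ids.device)
        h = self.blocks(x, mask=causal)                    # [B, T, D]
        return self.output(h)                              # [B, T, V]

model = TinyDecoderLM(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 16)))  # -> [2, 16, 1000]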

Weight Tying

Many language models tie input embeddings and output projection weights.

The embedding matrix:

E \in \mathbb{R}^{|V| \times D}

is reused for output logits:

z_t = h_t E^\top.

Advantages:

Benefit                  Description
Fewer parameters         Reduced memory usage
Better generalization    Shared token representations
Faster training          Smaller model size

Weight tying is now common in transformer language models.
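
In PyTorch, tying often amounts to pointing the output projection's weight at the embedding table. A minimal sketch:

import torch.nn as nn

vocab_size, d_model = 1000, 64

embedding = nn.Embedding(vocab_size, d_model)        # E: [|V|, D]
output = nn.Linear(d_model, vocab_size, bias=False)  # weight: [|V|, D], logits = h @ weight^T

output.weight = embedding.weight                     # tie: both layers share one matrix

assert output.weight.data_ptr() == embedding.weight.data_ptr()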

Positional Encoding

Transformers do not inherently know token order.

Example:

dog bites man
man bites dog

contain the same tokens but different meanings.

Positional information must therefore be added.

A positional encoding provides a vector:

p_t

for each position t.

The transformer input becomes:

x_t = e_t + p_t,

where:

Symbol    Meaning
e_t       Token embedding
p_t       Positional embedding

Modern models use several positional methods:

Method                 Description
Learned embeddings     Trainable position vectors
Sinusoidal encoding    Fixed trigonometric patterns
Rotary embeddings      Rotate query/key vectors by position-dependent angles
Relative attention     Encode token distance

Position encoding strongly affects long-context behavior.
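
A minimal sketch of the first method in the table above, learned positional embeddings added to token embeddings:

import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 128, 64

token_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)           # one learned vector per position

input_ids = torch.randint(0, vocab_size, (2, 10))  # [B, T]
positions = torch.arange(input_ids.shape[1])       # [T]

x = token_emb(input_ids) + pos_emb(positions)      # x_t = e_t + p_t, shape [B, T, D]
print(x.shape)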

Context Length

A transformer attends over a finite context window.

If the maximum context length is:

L,

then tokens beyond L positions cannot be attended to directly.

Longer context windows improve:

Capability                 Example
Long-document reasoning    Research papers
Multi-turn dialogue        Long conversations
Code understanding         Large repositories
Retrieval integration      Many retrieved passages

However, self-attention cost grows approximately as:

O(T^2),

where T is the sequence length.

This motivates research into sparse attention, memory systems, state-space models, and linear attention methods.

Training Data

Language models are trained on large corpora.

Common data sources:

Source               Example
Web pages            Common Crawl
Books                Digitized books
Code repositories    GitHub
Scientific papers    arXiv
Dialogues            Chat logs
Documentation        Technical manuals

Training quality depends heavily on data quality.

Problems include:

Issue                 Description
Duplicates            Memorization risk
Spam                  Low-quality language
Toxic content         Harmful outputs
Imbalance             Overrepresentation of domains
Copyright concerns    Legal restrictions

Data filtering and deduplication are important parts of large-scale language model training.

Scaling Laws

Large language models exhibit scaling behavior.

Performance improves predictably as:

Variable            Increases
Model parameters    Larger networks
Training tokens     More data
Compute             More optimization steps

Empirical scaling laws show approximate power-law relationships between loss and compute scale.
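
One commonly fitted functional form (an illustrative sketch; the constants depend on the model family and study) writes the loss as a power law in compute plus an irreducible term:

L(C) \approx a \, C^{-\alpha} + L_\infty,

where C is training compute, a and \alpha are fitted constants, and L_\infty is the irreducible loss.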

However, scaling eventually encounters constraints:

Constraint       Example
Compute cost     GPU expense
Memory limits    Model size
Data quality     Finite high-quality text
Latency          Inference speed
Energy usage     Training power consumption

Scaling alone does not guarantee reasoning ability, factuality, or safety.

Inference and KV Caching

Autoregressive generation repeatedly predicts one token at a time.

Naively recomputing all attention states each step is expensive.

Transformers therefore cache previous key and value tensors.

At generation step t:

Cached tensor    Shape
Keys             [B, H, T, D_h]
Values           [B, H, T, D_h]

where:

Symbol    Meaning
H         Number of attention heads
D_h       Head dimension

KV caching avoids recomputing keys and values for the entire prefix at every step; each new token only requires computing its own query, key, and value, then attending over the cache.
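
A minimal sketch of the caching pattern: keep growing key/value tensors and append one new slice per generated token instead of recomputing them for the whole prefix (the projections that produce the new slices are omitted):

import torch

B, H, D_h = 1, 4, 16

cached_k = torch.empty(B, H, 0, D_h)   # starts empty: [B, H, 0, D_h]
cached_v = torch.empty(B, H, 0, D_h)

for step in range(5):
    # in a real model these come from projecting the newest token's hidden state
    new_k = torch.randn(B, H, 1, D_h)
    new_v = torch.randn(B, H, 1, D_h)

    cached_k = torch.cat([cached_k, new_k], dim=2)  # [B, H, step + 1, D_h]
    cached_v = torch.cat([cached_v, new_v], dim=2)

    # attention for the new token attends over all cached keys and values
    print(step, cached_k.shape)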

Sampling from Language Models

The model outputs logits:

z_t \in \mathbb{R}^{|V|}.

A decoding algorithm converts logits into tokens.

Common methods:

Method                  Behavior
Greedy decoding         Deterministic highest-probability token
Beam search             Explore several sequences
Top-k sampling          Restrict to top-k tokens
Top-p sampling          Restrict cumulative probability mass
Temperature sampling    Adjust randomness

Generation quality depends strongly on decoding configuration.

Low temperature makes outputs more deterministic, more repetitive, and less creative.

High temperature makes outputs more diverse, more random, and less stable.
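
A minimal sketch of temperature plus top-k sampling from a single logits vector (greedy decoding is the limiting case of always taking the argmax):

import torch
import torch.nn.functional as F

vocab_size = 1000
logits = torch.randn(vocab_size)       # z_t for one position

temperature = 0.8
k = 50

scaled = logits / temperature          # temperature rescales the logits
topk_vals, topk_idx = torch.topk(scaled, k)

probs = F.softmax(topk_vals, dim=-1)   # renormalize over the k most probable tokens
next_token = topk_idx[torch.multinomial(probs, num_samples=1)]

print(next_token.item())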

Emergent Behaviors

Large language models sometimes exhibit capabilities not obvious in smaller models.

Examples:

Capability                    Example
In-context learning           Learn from prompt examples
Few-shot reasoning            Solve unseen tasks
Tool coordination             Use external APIs
Chain-of-thought reasoning    Multi-step explanations
Code synthesis                Generate programs

The exact causes remain an active research topic.

Some behaviors appear gradually with scale. Others appear more abruptly.

Failure Modes

Language models have important limitations.

Failure mode           Example
Hallucination          False factual claims
Memorization           Reproducing training data
Bias                   Harmful stereotypes
Prompt injection       Unsafe instruction following
Context confusion      Losing track of dialogue
Arithmetic weakness    Calculation errors

Language models optimize token prediction, not truth, reasoning correctness, or safety.

This distinction is critical when deploying systems in high-stakes settings.

Pretraining and Fine-Tuning

Most modern systems use two stages:

Stage          Purpose
Pretraining    Learn general language structure
Fine-tuning    Adapt to downstream tasks

Pretraining uses large-scale next-token prediction.

Fine-tuning adapts the model for:

Task             Example
Dialogue         Chat systems
Translation      Multilingual systems
Coding           Code generation
QA               Reading comprehension
Summarization    Condensed outputs

Instruction tuning and RLHF further shape model behavior.

PyTorch Training Example

A simplified transformer language model training step:

def training_step(model, batch, optimizer):
    input_ids = batch["input_ids"]
    targets = batch["targets"]

    logits = model(input_ids)
    # logits: [B, T, V]

    B, T, V = logits.shape

    loss_fn = nn.CrossEntropyLoss()

    loss = loss_fn(
        logits.reshape(B * T, V),
        targets.reshape(B * T),
    )

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    return loss.item()

The targets are usually shifted by one token relative to the inputs.

Summary

Language modeling predicts token sequences autoregressively. Modern language models use transformer architectures with causal masking and next-token prediction objectives.

Key components include tokenization, embeddings, positional encoding, self-attention, output projections, and decoding algorithms. Training uses cross-entropy loss over large text corpora. Evaluation often uses perplexity.

Large language models extend basic language modeling into dialogue, reasoning, retrieval augmentation, tool use, and multimodal systems, but they still inherit core limitations from probabilistic next-token prediction.