A language model assigns probabilities to sequences of tokens. The tokens may be words, subwords, characters, bytes, or other discrete symbols. In the classical setting, a sentence is represented as a finite sequence

$$w_1, w_2, \dots, w_T,$$

where each $w_t$ belongs to a vocabulary $V$. A language model defines a probability distribution over such sequences:

$$P(w_1, w_2, \dots, w_T).$$
The central question is simple: how likely is this sequence of text?
This question appears in many tasks. Speech recognition uses language models to prefer fluent transcriptions. Machine translation uses them to prefer natural target sentences. Text generation uses them to sample continuations. Information retrieval, spelling correction, summarization, dialogue, and code completion all use some form of language modeling.
A language model must solve two problems at once. First, it must assign probability mass to valid or plausible sequences. Second, it must generalize to sequences it has never seen. Natural language has a combinatorial structure. Even a small vocabulary can produce an enormous number of possible sentences. No training corpus can contain all valid sentences.
The Chain Rule of Probability
The most basic decomposition of a sequence probability follows from the chain rule of probability:

$$P(w_1, \dots, w_T) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_T \mid w_1, \dots, w_{T-1}).$$

Equivalently,

$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}).$$
This form is called an autoregressive factorization. It says that the probability of a sequence can be computed by predicting each token from its previous context.
For example, for the sequence $w_1, w_2, w_3$, we write

$$P(w_1, w_2, w_3) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2).$$
This factorization is exact. The approximation begins when we decide how much context the model can use.
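In code, the autoregressive factorization turns a product of conditionals into a sum of log-probabilities, which avoids numerical underflow. A minimal sketch with made-up conditional probabilities (not from any trained model):

```python
import math

# Toy conditionals P(w1), P(w2|w1), P(w3|w1,w2), P(w4|w1..w3).
# These numbers are illustrative, not outputs of a real model.
conditionals = [0.2, 0.5, 0.1, 0.4]

# Multiplying many probabilities underflows quickly, so sum log-probabilities.
log_prob = sum(math.log(p) for p in conditionals)
prob = math.exp(log_prob)

print(round(prob, 4))  # 0.2 * 0.5 * 0.1 * 0.4 = 0.004
```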
The Markov Assumption
Classical statistical language models usually approximate the full conditional distribution by using only a fixed number of previous tokens.
A first-order Markov model assumes

$$P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-1}).$$

A second-order Markov model assumes

$$P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-2}, w_{t-1}).$$

More generally, an $n$-gram model assumes

$$P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \dots, w_{t-1}).$$

The context has length $n-1$. A unigram model uses no context. A bigram model uses one previous token. A trigram model uses two previous tokens.
| Model | Approximation |
|---|---|
| Unigram | $P(w_t)$ |
| Bigram | $P(w_t \mid w_{t-1})$ |
| Trigram | $P(w_t \mid w_{t-2}, w_{t-1})$ |
| $n$-gram | $P(w_t \mid w_{t-n+1}, \dots, w_{t-1})$ |
The Markov assumption makes estimation practical. It also limits the model. Long-range dependencies, discourse structure, syntax across clauses, and semantic consistency often require more context than a fixed small window.
Maximum Likelihood Estimation
An $n$-gram model can be estimated by counting occurrences in a corpus.

For a bigram model, the maximum likelihood estimate is

$$P_{\text{MLE}}(w_t \mid w_{t-1}) = \frac{C(w_{t-1}, w_t)}{C(w_{t-1})},$$

where $C(w_{t-1}, w_t)$ is the number of times token $w_t$ follows token $w_{t-1}$, and $C(w_{t-1})$ is the number of times $w_{t-1}$ appears as a previous token.

For a trigram model,

$$P_{\text{MLE}}(w_t \mid w_{t-2}, w_{t-1}) = \frac{C(w_{t-2}, w_{t-1}, w_t)}{C(w_{t-2}, w_{t-1})}.$$

More generally,

$$P_{\text{MLE}}(w_t \mid w_{t-n+1}, \dots, w_{t-1}) = \frac{C(w_{t-n+1}, \dots, w_t)}{C(w_{t-n+1}, \dots, w_{t-1})}.$$
This estimator is intuitive. It assigns probability according to observed frequency. If “deep learning” appears often and “deep ocean” appears less often in a machine learning corpus, then the model assigns a higher probability to “learning” after “deep.”
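The counting estimator is easy to implement directly. A minimal sketch using Python's `collections.Counter` on a toy corpus (the corpus and tokens are illustrative):

```python
from collections import Counter

# Maximum likelihood bigram estimation by counting (toy corpus).
corpus = "deep learning is deep and deep learning is fun".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])  # counts of each token as a *previous* token

def p_mle(next_tok, prev_tok):
    # P(next | prev) = C(prev, next) / C(prev)
    return bigram_counts[(prev_tok, next_tok)] / context_counts[prev_tok]

# "deep" occurs 3 times as a context and is followed by "learning" twice.
print(p_mle("learning", "deep"))  # 2/3
```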
However, maximum likelihood estimates are fragile when counts are small. If an $n$-gram never appears in the training corpus, the estimated probability is zero. This causes serious problems, because many valid sequences are absent from any finite corpus.
Sparsity
Language is sparse. The number of possible $n$-grams grows exponentially with $n$. If the vocabulary has size $|V|$, then the number of possible $n$-grams is

$$|V|^n.$$

A vocabulary of 50,000 words has $50{,}000^3 = 1.25 \times 10^{14}$ possible trigrams. Most possible trigrams will never appear in a training corpus.
This creates the zero-probability problem. Suppose a test sentence contains one trigram that was absent from the training data. A maximum likelihood trigram model assigns that trigram probability zero. Since the sequence probability is a product, the whole sentence receives probability zero.
That result is usually wrong. The sentence may be perfectly grammatical. The corpus was merely incomplete.
Statistical language modeling therefore requires smoothing.
Smoothing
Smoothing adjusts raw count estimates so that unseen events receive nonzero probability and unreliable estimates are moderated.
The simplest method is add-one smoothing, also called Laplace smoothing:

$$P_{\text{add-1}}(w_t \mid w_{t-1}) = \frac{C(w_{t-1}, w_t) + 1}{C(w_{t-1}) + |V|}.$$
This assigns one artificial count to every possible continuation.
Add-one smoothing is easy to understand, but it often performs poorly for language modeling. It gives too much probability mass to unseen events when the vocabulary is large.
A more general form is add-$k$ smoothing:

$$P_{\text{add-}k}(w_t \mid w_{t-1}) = \frac{C(w_{t-1}, w_t) + k}{C(w_{t-1}) + k|V|},$$

where $k$ is usually smaller than 1.
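Add-$k$ smoothing is a few lines on top of the counting estimator. A sketch with an illustrative toy corpus and $k = 0.5$:

```python
from collections import Counter

# Add-k smoothing for a bigram model (toy corpus; k chosen for illustration).
corpus = "the cat sat on the mat".split()
vocab = sorted(set(corpus))
V = len(vocab)
k = 0.5

bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def p_add_k(next_tok, prev_tok):
    # (C(prev, next) + k) / (C(prev) + k * |V|)
    return (bigram_counts[(prev_tok, next_tok)] + k) / (context_counts[prev_tok] + k * V)

# The unseen bigram ("cat", "mat") now receives nonzero probability.
print(p_add_k("mat", "cat") > 0)  # True

# Smoothed probabilities still sum to 1 over the vocabulary for each context.
total = sum(p_add_k(w, "the") for w in vocab)
print(round(total, 6))  # 1.0
```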
More effective smoothing methods include Good-Turing smoothing, Katz backoff, Jelinek-Mercer interpolation, and Kneser-Ney smoothing. These methods differ in how they redistribute probability mass from observed events to unseen events.
The main idea is always the same: a model should trust high-count statistics more than low-count statistics.
Backoff and Interpolation
An $n$-gram model may fail when a long context has poor statistics. Backoff methods handle this by falling back to shorter contexts.
For example, if the trigram count $C(w_{t-2}, w_{t-1}, w_t)$ is unreliable, the model can use the bigram probability $P(w_t \mid w_{t-1})$. If the bigram is also unreliable, it can use the unigram probability $P(w_t)$.
The rough structure is

$$P_{\text{backoff}}(w_t \mid w_{t-2}, w_{t-1}) =
\begin{cases}
P^{*}(w_t \mid w_{t-2}, w_{t-1}) & \text{if } C(w_{t-2}, w_{t-1}, w_t) > 0, \\
\alpha(w_{t-2}, w_{t-1})\, P_{\text{backoff}}(w_t \mid w_{t-1}) & \text{otherwise,}
\end{cases}$$

where $P^{*}$ is a discounted estimate and $\alpha$ redistributes the leftover probability mass to the shorter context.
Interpolation combines several context lengths at once:

$$P_{\text{interp}}(w_t \mid w_{t-2}, w_{t-1}) = \lambda_3\, P(w_t \mid w_{t-2}, w_{t-1}) + \lambda_2\, P(w_t \mid w_{t-1}) + \lambda_1\, P(w_t),$$

where

$$\lambda_1 + \lambda_2 + \lambda_3 = 1, \qquad \lambda_i \ge 0.$$

The weights $\lambda_i$ control how much the model trusts each order. These weights can be fixed or learned from validation data.
Interpolation is often better than pure backoff because shorter-context statistics can provide useful information even when longer-context statistics are available.
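Interpolation can be sketched directly from counts. The weights and toy corpus below are illustrative; in practice the weights are tuned on held-out data, and the count-based denominators here ignore minor edge effects at sequence boundaries:

```python
from collections import Counter

# Fixed-weight interpolation of trigram, bigram, and unigram estimates.
corpus = "a b a b c a b a c b".split()
N = len(corpus)

uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

l3, l2, l1 = 0.6, 0.3, 0.1  # interpolation weights; must sum to 1

def p_interp(w, u, v):
    # lambda3 * P(w | u, v) + lambda2 * P(w | v) + lambda1 * P(w)
    p_tri = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    p_bi = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p_uni = uni[w] / N
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# Nonzero even when the trigram evidence alone is thin.
print(p_interp("c", "a", "b") > 0)  # True
```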
Perplexity
Language models are commonly evaluated using perplexity. Perplexity measures how well a model predicts a sequence.
Given a test sequence $w_1, \dots, w_T$, the average negative log-likelihood is

$$\text{NLL} = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_1, \dots, w_{t-1}).$$

Perplexity is the exponential of this quantity:

$$\text{PPL} = \exp(\text{NLL}) = P(w_1, \dots, w_T)^{-1/T}.$$
Lower perplexity means better predictive performance. A model with low perplexity assigns higher probability to the observed test sequence.
Perplexity can be interpreted as an effective branching factor. If a model has perplexity 20, then on average it is about as uncertain as choosing uniformly among 20 plausible next tokens at each step. This interpretation is approximate, but useful.
Perplexity depends strongly on tokenization. Word-level, subword-level, character-level, and byte-level models can have different perplexity scales. For this reason, perplexity comparisons are meaningful only when the dataset, preprocessing, vocabulary, and tokenization are comparable.
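Both views of perplexity, the exponentiated average negative log-likelihood and the geometric mean of inverse probabilities, can be checked numerically. The per-token probabilities below are made-up values, not outputs of a real model:

```python
import math

# Illustrative per-token probabilities P(w_t | w_<t) for a test sequence.
token_probs = [0.1, 0.25, 0.5, 0.2]

# Perplexity as exponentiated average negative log-likelihood.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)

# Equivalent view: geometric mean of inverse probabilities.
geo = math.prod(1 / p for p in token_probs) ** (1 / len(token_probs))

print(round(perplexity, 4), round(geo, 4))  # both ~4.4721
```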
Statistical Language Models as Tables
An $n$-gram language model is essentially a large conditional probability table.
For a bigram model, the table has one row for each previous token and one column for each next token. Each row is a probability distribution over the vocabulary:

$$\sum_{w' \in V} P(w' \mid w) = 1.$$
A trigram model has a table indexed by two previous tokens and one next token. Higher-order models require larger tables.
This table-based view shows both the strength and weakness of classical statistical language models. They are simple, interpretable, and efficient for small context lengths. But they cannot share statistical strength across similar contexts.
For example, a table-based model treats the contexts “the dog” and “the cat” as unrelated unless both are explicitly observed.

A neural language model can represent “dog” and “cat” as similar vectors, allowing evidence from one context to help another. This is one of the main reasons neural language models replaced classical $n$-gram models in most modern systems.
Limitations of Classical Statistical Models
Classical language models have several structural limitations.
First, they use fixed-length contexts. A trigram model cannot directly condition on information from ten tokens earlier. Increasing $n$ helps, but the number of possible contexts grows rapidly.
Second, they suffer from sparsity. Most long $n$-grams are rare or unseen. Smoothing reduces the damage but does not solve the underlying representation problem.
Third, they treat tokens as atomic symbols. The model does not know that “run,” “running,” and “ran” are related unless such patterns are captured indirectly through counts.
Fourth, they cannot learn distributed representations. In a neural model, each token can be represented by an embedding vector, and similar words can have similar representations. In a classical $n$-gram model, similarity must be manually engineered or approximated by external methods.
Fifth, they struggle with semantic coherence. They may capture local fluency but fail to maintain global meaning over long passages.
These limitations led to neural language models, recurrent networks, attention mechanisms, and eventually transformer-based large language models.
A Minimal Bigram Model in PyTorch
Although classical $n$-gram models are usually implemented with counts rather than neural networks, we can build a simple neural bigram model in PyTorch.

The model learns a table of logits. Given a current token $w_t$, it predicts the next token $w_{t+1}$.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size: int):
        super().__init__()
        self.logits = nn.Embedding(vocab_size, vocab_size)

    def forward(self, x, targets=None):
        # x: [B, T]
        # logits: [B, T, V]
        logits = self.logits(x)
        loss = None
        if targets is not None:
            B, T, V = logits.shape
            loss = F.cross_entropy(
                logits.reshape(B * T, V),
                targets.reshape(B * T),
            )
        return logits, loss
```

Here `nn.Embedding(vocab_size, vocab_size)` acts as a lookup table. Each input token selects one row of logits over the vocabulary. After applying softmax, that row becomes a probability distribution over the next token.
This model is neural in implementation but equivalent in structure to a bigram table. It has no hidden layers and no long context. It is useful because it shows the connection between statistical language modeling and modern neural language modeling.
A training step looks like this:
```python
vocab_size = 10000
model = BigramLanguageModel(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randint(0, vocab_size, (32, 128))
y = torch.randint(0, vocab_size, (32, 128))

logits, loss = model(x, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The input tensor `x` has shape `[B, T]`. The target tensor `y` also has shape `[B, T]`. For each position, the model predicts the next token.
In a real language modeling dataset, y is usually x shifted one position to the left:
```python
tokens = torch.randint(0, vocab_size, (32, 129))
x = tokens[:, :-1]
y = tokens[:, 1:]
```

The model receives tokens $w_t$ and learns to predict $w_{t+1}$.
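A bigram model can also generate text, by repeatedly sampling the next token from the row of logits selected by the current token. A sketch using an untrained logits table (so the samples are meaningless; only the mechanics matter):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# An untrained bigram logits table: row i holds logits over the next token.
vocab_size = 50
logits_table = nn.Embedding(vocab_size, vocab_size)

def generate(start_token, max_new_tokens):
    tokens = [start_token]
    for _ in range(max_new_tokens):
        last = torch.tensor([tokens[-1]])
        probs = F.softmax(logits_table(last), dim=-1)  # [1, V]
        next_tok = torch.multinomial(probs, num_samples=1).item()
        tokens.append(next_tok)
    return tokens

out = generate(start_token=0, max_new_tokens=10)
print(len(out))  # 11: the start token plus 10 sampled tokens
```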
From Statistical to Neural Language Models
A statistical language model estimates probabilities from observed counts. A neural language model estimates probabilities using learned parameters and continuous representations.
The statistical bigram model uses

$$P(w_t \mid w_{t-1}) = \frac{C(w_{t-1}, w_t)}{C(w_{t-1})}.$$

A neural model may use

$$P_\theta(w_t \mid w_{t-1}) = \mathrm{softmax}\big(f_\theta(w_{t-1})\big)_{w_t},$$

where $\theta$ denotes learned parameters. Instead of storing a separate count for every observed context, the neural model maps tokens and contexts into vectors and computes probabilities from those vectors.
This change has large consequences. Neural models can share information across similar tokens, represent long contexts, learn syntactic and semantic patterns, and scale to massive datasets. Transformers extend this idea by allowing every token in a context window to interact through attention.
Still, the basic objective remains the same as in statistical language modeling: assign high probability to observed sequences under the autoregressive factorization

$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}).$$
Modern language models are therefore direct descendants of statistical language models. They use richer parameterizations, larger context windows, learned representations, and much larger datasets, but the probabilistic foundation is the same.