Statistical Language Models

A language model assigns probabilities to sequences of tokens. The tokens may be words, subwords, characters, bytes, or other discrete symbols. In the classical setting, a sentence is represented as a finite sequence

x_{1:T} = (x_1, x_2, \ldots, x_T),

where each x_t belongs to a vocabulary V. A language model defines a probability distribution over such sequences:

p(x_{1:T}).

The central question is simple: how likely is this sequence of text?

This question appears in many tasks. Speech recognition uses language models to prefer fluent transcriptions. Machine translation uses them to prefer natural target sentences. Text generation uses them to sample continuations. Information retrieval, spelling correction, summarization, dialogue, and code completion all use some form of language modeling.

A language model must solve two problems at once. First, it must assign probability mass to valid or plausible sequences. Second, it must generalize to sequences it has never seen. Natural language has a combinatorial structure. Even a small vocabulary can produce an enormous number of possible sentences. No training corpus can contain all valid sentences.

The Chain Rule of Probability

The most basic decomposition of a sequence probability follows from the chain rule of probability:

p(x_{1:T}) = p(x_1)p(x_2 \mid x_1)p(x_3 \mid x_1,x_2)\cdots p(x_T \mid x_1,\ldots,x_{T-1}).

Equivalently,

p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1}).

This form is called an autoregressive factorization. It says that the probability of a sequence can be computed by predicting each token from its previous context.

For example, for the sequence

\text{the cat sat}

we write

p(\text{the cat sat}) = p(\text{the}) p(\text{cat}\mid \text{the}) p(\text{sat}\mid \text{the cat}).

This factorization is exact. The approximation begins when we decide how much context the model can use.
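This factorization can also be checked numerically. The short sketch below uses made-up conditional probabilities, chosen only for illustration, and accumulates log probabilities so that the product of many small numbers does not underflow.

import math

# Hypothetical conditional probabilities for the example sequence "the cat sat".
# The numbers are placeholders for illustration, not estimates from a corpus.
conditionals = [
    ("the", 0.05),   # p(the)
    ("cat", 0.01),   # p(cat | the)
    ("sat", 0.02),   # p(sat | the cat)
]

# Multiply the conditionals by summing their logarithms, then exponentiate.
log_prob = sum(math.log(p) for _, p in conditionals)
print(math.exp(log_prob))  # p(the cat sat) = 0.05 * 0.01 * 0.02 ≈ 1e-05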

The Markov Assumption

Classical statistical language models usually approximate the full conditional distribution by using only a fixed number of previous tokens.

A first-order Markov model assumes

p(x_t \mid x_{1:t-1}) \approx p(x_t \mid x_{t-1}).

A second-order Markov model assumes

p(x_t \mid x_{1:t-1}) \approx p(x_t \mid x_{t-2},x_{t-1}).

More generally, an n-gram model assumes

p(x_t \mid x_{1:t-1}) \approx p(x_t \mid x_{t-n+1:t-1}).

The context has length n - 1. A unigram model uses no context. A bigram model uses one previous token. A trigram model uses two previous tokens.

Model     Approximation
Unigram   p(x_t)
Bigram    p(x_t \mid x_{t-1})
Trigram   p(x_t \mid x_{t-2},x_{t-1})
n-gram    p(x_t \mid x_{t-n+1:t-1})

The Markov assumption makes estimation practical. It also limits the model. Long-range dependencies, discourse structure, syntax across clauses, and semantic consistency often require more context than a fixed small window.
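To make the n-gram context concrete, here is a minimal sketch that extracts the (context, next token) pairs an n-gram model conditions on; the token list is a toy example.

def ngram_pairs(tokens, n):
    """Yield (context, next_token) pairs for an n-gram model (context length n - 1)."""
    context_len = n - 1
    for t in range(context_len, len(tokens)):
        yield tuple(tokens[t - context_len:t]), tokens[t]

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Bigram model: one previous token of context.
print(list(ngram_pairs(tokens, 2)))

# Trigram model: two previous tokens of context.
print(list(ngram_pairs(tokens, 3)))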

Maximum Likelihood Estimation

An n-gram model can be estimated by counting occurrences in a corpus.

For a bigram model, the maximum likelihood estimate is

\hat{p}(x_t = w_j \mid x_{t-1}=w_i) = \frac{C(w_i,w_j)}{C(w_i)},

where C(w_i, w_j) is the number of times token w_j follows token w_i, and C(w_i) is the number of times w_i appears as a previous token.

For a trigram model,

\hat{p}(w_k \mid w_i,w_j) = \frac{C(w_i,w_j,w_k)}{C(w_i,w_j)}.

More generally,

\hat{p}(x_t \mid x_{t-n+1:t-1}) = \frac{C(x_{t-n+1:t})}{C(x_{t-n+1:t-1})}.

This estimator is intuitive. It assigns probability according to observed frequency. If “deep learning” appears often and “deep ocean” appears less often in a machine learning corpus, then the model assigns a higher probability to “learning” after “deep.”

However, maximum likelihood estimates are fragile when counts are small. If an n-gram never appears in the training corpus, the estimated probability is zero. This causes serious problems, because many valid sequences are absent from any finite corpus.
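The counting estimator is easy to implement directly. The sketch below builds a bigram MLE from a toy corpus; a real implementation would also add sentence boundary markers and an unknown-word token.

from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran"]

bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))
unigram_counts = Counter(corpus[:-1])  # counts of w_i as a previous token

def p_mle(next_token, prev_token):
    """Maximum likelihood estimate of p(next_token | prev_token)."""
    if unigram_counts[prev_token] == 0:
        return 0.0
    return bigram_counts[(prev_token, next_token)] / unigram_counts[prev_token]

print(p_mle("cat", "the"))  # 2/3: "the" is followed by "cat" twice and "mat" once
print(p_mle("sat", "the"))  # 0.0: the bigram "the sat" never occurs in this corpus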

Sparsity

Language is sparse. The number of possible n-grams grows exponentially with n. If the vocabulary has size |V|, then the number of possible n-grams is

|V|^n.

A vocabulary of 50,000 words has 50{,}000^3 possible trigrams, roughly 1.25 \times 10^{14}. Most possible trigrams will never appear in a training corpus.

This creates the zero-probability problem. Suppose a test sentence contains one trigram that was absent from the training data. A maximum likelihood trigram model assigns that trigram probability zero. Since the sequence probability is a product, the whole sentence receives probability zero.

That result is usually wrong. The sentence may be perfectly grammatical. The corpus was merely incomplete.

Statistical language modeling therefore requires smoothing.

Smoothing

Smoothing adjusts raw count estimates so that unseen events receive nonzero probability and unreliable estimates are moderated.

The simplest method is add-one smoothing, also called Laplace smoothing:

\hat{p}_{\text{Laplace}}(w_j \mid w_i) = \frac{C(w_i,w_j)+1}{C(w_i)+|V|}.

This assigns one artificial count to every possible continuation.

Add-one smoothing is easy to understand, but it often performs poorly for language modeling. It gives too much probability mass to unseen events when the vocabulary is large.

A more general form is add-α smoothing:

\hat{p}_{\alpha}(w_j \mid w_i) = \frac{C(w_i,w_j)+\alpha}{C(w_i)+\alpha |V|},

where α is a smoothing constant with 0 < α < 1. Values smaller than 1 give unseen events less artificial mass than add-one smoothing does.
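Add-α smoothing is a one-line change to the counting estimator. A minimal sketch, with the toy corpus and α chosen only for illustration:

from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran"]
vocab = sorted(set(corpus))
V = len(vocab)

bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))
unigram_counts = Counter(corpus[:-1])

def p_add_alpha(next_token, prev_token, alpha=0.1):
    """Add-alpha smoothed estimate of p(next_token | prev_token)."""
    return (bigram_counts[(prev_token, next_token)] + alpha) / (
        unigram_counts[prev_token] + alpha * V
    )

print(p_add_alpha("cat", "the"))  # close to the MLE estimate of 2/3
print(p_add_alpha("sat", "the"))  # small but nonzero, unlike the MLE estimate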

More effective smoothing methods include Good-Turing smoothing, Katz backoff, Jelinek-Mercer interpolation, and Kneser-Ney smoothing. These methods differ in how they redistribute probability mass from observed events to unseen events.

The main idea is always the same: a model should trust high-count statistics more than low-count statistics.

Backoff and Interpolation

An n-gram model may fail when a long context has poor statistics. Backoff methods handle this by falling back to shorter contexts.

For example, if the trigram count C(w_i, w_j, w_k) is unreliable, the model can use the bigram probability p(w_k \mid w_j). If the bigram is also unreliable, it can use the unigram probability p(w_k).

The rough structure is

p(w_k \mid w_i,w_j) \rightarrow p(w_k \mid w_j) \rightarrow p(w_k).

Interpolation combines several context lengths at once:

p(w_k \mid w_i,w_j) = \lambda_3 p_{\text{tri}}(w_k \mid w_i,w_j) + \lambda_2 p_{\text{bi}}(w_k \mid w_j) + \lambda_1 p_{\text{uni}}(w_k),

where

\lambda_1+\lambda_2+\lambda_3=1.

The weights λ_1, λ_2, λ_3 control how much the model trusts each order. These weights can be fixed or learned from validation data.

Interpolation often works better than pure backoff because shorter-context statistics can contribute useful information even when longer-context statistics are available.
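The sketch below implements this linear interpolation with fixed weights; in practice the λ values would be tuned on held-out data rather than set by hand, and the counts come from a toy corpus.

from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran"]
N = len(corpus)

uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_interp(w_k, w_i, w_j, lambdas=(0.2, 0.3, 0.5)):
    """Interpolated p(w_k | w_i, w_j) with weights for (unigram, bigram, trigram)."""
    l1, l2, l3 = lambdas
    p_uni = uni[w_k] / N
    p_bi = bi[(w_j, w_k)] / uni[w_j] if uni[w_j] else 0.0
    p_tri = tri[(w_i, w_j, w_k)] / bi[(w_i, w_j)] if bi[(w_i, w_j)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# The trigram "on the sat" is unseen, but the unigram term keeps the estimate nonzero.
print(p_interp("sat", "on", "the"))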

Perplexity

Language models are commonly evaluated using perplexity. Perplexity measures how well a model predicts a sequence.

Given a test sequence x1:Tx_{1:T}, the average negative log-likelihood is

-\frac{1}{T}\sum_{t=1}^{T}\log p(x_t \mid x_{1:t-1}).

Perplexity is the exponential of this quantity:

\text{PPL}(x_{1:T}) = \exp\left( -\frac{1}{T}\sum_{t=1}^{T}\log p(x_t \mid x_{1:t-1}) \right).

Lower perplexity means better predictive performance. A model with low perplexity assigns higher probability to the observed test sequence.

Perplexity can be interpreted as an effective branching factor. If a model has perplexity 20, then on average it is about as uncertain as choosing uniformly among 20 plausible next tokens at each step. This interpretation is approximate, but useful.
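Computing perplexity from per-token probabilities takes only a few lines. The probabilities below are made up for illustration; in practice they would come from the model being evaluated.

import math

# Hypothetical values of p(x_t | x_{1:t-1}) for a short test sequence.
token_probs = [0.2, 0.05, 0.1, 0.25]

avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)

print(perplexity)  # lower is better; here about 8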

Perplexity depends strongly on tokenization. Word-level, subword-level, character-level, and byte-level models can have different perplexity scales. For this reason, perplexity comparisons are meaningful only when the dataset, preprocessing, vocabulary, and tokenization are comparable.

Statistical Language Models as Tables

An n-gram language model is essentially a large conditional probability table.

For a bigram model, the table has one row for each previous token and one column for each next token:

P_{ij} = p(w_j \mid w_i).

Each row is a probability distribution:

\sum_j P_{ij} = 1.

A trigram model has a table indexed by two previous tokens and one next token. Higher-order models require larger tables.
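This table view can be written out directly in PyTorch: a bigram model is a V × V matrix of row-normalized counts, and predicting the next token is a draw from one row. The counts below are random placeholders, used only to show the structure.

import torch

V = 5  # toy vocabulary size

# Placeholder bigram counts; a real model would fill these in from a corpus.
counts = torch.randint(1, 10, (V, V)).float()

# Normalize each row so that P[i, j] = p(next token = j | previous token = i).
P = counts / counts.sum(dim=1, keepdim=True)
print(P.sum(dim=1))  # every row sums to 1

# Predicting the next token after token i is a draw from row i.
prev = 2
next_token = torch.multinomial(P[prev], num_samples=1)
print(next_token.item())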

This table-based view shows both the strength and weakness of classical statistical language models. They are simple, interpretable, and efficient for small context lengths. But they cannot share statistical strength across similar contexts.

For example, a table-based model treats these contexts as unrelated unless both are explicitly observed:

\text{the dog chased} \qquad \text{the cat chased}

A neural language model can represent “dog” and “cat” as similar vectors, allowing evidence from one context to help another. This is one of the main reasons neural language models replaced classical n-gram models in most modern systems.

Limitations of Classical Statistical Models

Classical language models have several structural limitations.

First, they use fixed-length contexts. A trigram model cannot directly condition on information from ten tokens earlier. Increasing n helps, but the number of possible contexts grows rapidly.

Second, they suffer from sparsity. Most long n-grams are rare or unseen. Smoothing reduces the damage but does not solve the underlying representation problem.

Third, they treat tokens as atomic symbols. The model does not know that “run,” “running,” and “ran” are related unless such patterns are captured indirectly through counts.

Fourth, they cannot learn distributed representations. In a neural model, each token can be represented by an embedding vector, and similar words can have similar representations. In a classical n-gram model, similarity must be manually engineered or approximated by external methods.

Fifth, they struggle with semantic coherence. They may capture local fluency but fail to maintain global meaning over long passages.

These limitations led to neural language models, recurrent networks, attention mechanisms, and eventually transformer-based large language models.

A Minimal Bigram Model in PyTorch

Although classical n-gram models are usually implemented with counts rather than neural networks, we can build a simple neural bigram model in PyTorch.

The model learns a table of logits. Given a current token x_t, it predicts the next token x_{t+1}.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size: int):
        super().__init__()
        self.logits = nn.Embedding(vocab_size, vocab_size)

    def forward(self, x, targets=None):
        # x: [B, T]
        # logits: [B, T, V]
        logits = self.logits(x)

        loss = None
        if targets is not None:
            B, T, V = logits.shape
            loss = F.cross_entropy(
                logits.reshape(B * T, V),
                targets.reshape(B * T),
            )

        return logits, loss

Here nn.Embedding(vocab_size, vocab_size) acts as a lookup table. Each input token selects one row of logits over the vocabulary. After applying softmax, that row becomes a probability distribution over the next token.

This model is neural in implementation but equivalent in structure to a bigram table. It has no hidden layers and no long context. It is useful because it shows the connection between statistical language modeling and modern neural language modeling.

A training step looks like this:

vocab_size = 10000
model = BigramLanguageModel(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randint(0, vocab_size, (32, 128))
y = torch.randint(0, vocab_size, (32, 128))

logits, loss = model(x, y)

optimizer.zero_grad()
loss.backward()
optimizer.step()

The input tensor x has shape [B, T]. The target tensor y also has shape [B, T]. For each position, the model predicts the next token.

In a real language modeling dataset, y is usually x shifted one position to the left:

tokens = torch.randint(0, vocab_size, (32, 129))

x = tokens[:, :-1]
y = tokens[:, 1:]

The model receives tokens x_1, …, x_T and learns to predict x_2, …, x_{T+1}.
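After training, the same model can sample continuations one token at a time. A minimal sketch, continuing from the snippets above and starting from an arbitrary token id:

context = torch.zeros((1, 1), dtype=torch.long)  # token id 0 as an arbitrary start

for _ in range(20):
    logits, _ = model(context)                             # [1, T, V]
    probs = F.softmax(logits[:, -1, :], dim=-1)            # distribution over the next token
    next_token = torch.multinomial(probs, num_samples=1)   # [1, 1]
    context = torch.cat([context, next_token], dim=1)

print(context[0].tolist())  # generated token ids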

From Statistical to Neural Language Models

A statistical language model estimates probabilities from observed counts. A neural language model estimates probabilities using learned parameters and continuous representations.

The statistical bigram model uses

p(x_{t+1} \mid x_t).

A neural model may use

p_\theta(x_{t+1} \mid x_{1:t}),

where θ denotes learned parameters. Instead of storing a separate count for every observed context, the neural model maps tokens and contexts into vectors and computes probabilities from those vectors.

This change has large consequences. Neural models can share information across similar tokens, represent long contexts, learn syntactic and semantic patterns, and scale to massive datasets. Transformers extend this idea by allowing every token in a context window to interact through attention.

Still, the basic objective remains the same as in statistical language modeling:

\max_\theta \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).

Modern language models are therefore direct descendants of statistical language models. They use richer parameterizations, larger context windows, learned representations, and much larger datasets, but the probabilistic foundation is the same.