Statistical language models estimate probabilities from discrete counts. Neural language models replace count tables with differentiable functions parameterized by neural networks. Instead of memorizing exact token sequences, the model learns continuous representations that generalize across similar contexts.
A neural language model defines a conditional probability distribution

$$P(w_t \mid w_1, \dots, w_{t-1}; \theta),$$

where $\theta$ denotes the model parameters. These parameters are learned from data by maximizing the likelihood of observed sequences.
The central idea is simple: map tokens into vectors, transform those vectors through neural layers, and use the resulting representation to predict the next token.
Distributed Representations
Classical statistical models treat words as independent symbols. Neural language models instead represent tokens as dense vectors called embeddings.
Suppose the vocabulary size is $|V|$. A one-hot representation of a token is a vector

$$x \in \{0, 1\}^{|V|},$$

where exactly one entry is 1 and all others are 0.
For example, if the vocabulary contains five words,

| Token | One-hot vector |
|---|---|
| cat | [1, 0, 0, 0, 0] |
| dog | [0, 1, 0, 0, 0] |
| bird | [0, 0, 1, 0, 0] |
This representation has no notion of similarity. “Cat” and “dog” are orthogonal vectors.
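The orthogonality of one-hot vectors can be checked directly (a minimal sketch; the token-to-index mapping is an assumption for illustration):

```python
import torch
import torch.nn.functional as F

# Hypothetical vocabulary indices: cat -> 0, dog -> 1 (vocabulary size 5).
cat = F.one_hot(torch.tensor(0), num_classes=5).float()
dog = F.one_hot(torch.tensor(1), num_classes=5).float()

# The dot product of two distinct one-hot vectors is always zero,
# so "cat" and "dog" look maximally dissimilar in this representation.
print(torch.dot(cat, dog).item())  # 0.0
```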
Neural language models learn an embedding matrix

$$E \in \mathbb{R}^{|V| \times d},$$

where $d$ is the embedding dimension. Each token corresponds to one row of the matrix: a token $w$ is therefore represented by a dense vector

$$e_w = E_w \in \mathbb{R}^{d}.$$
Words with similar meanings often acquire similar embeddings during training.
For example, the vectors for “cat” and “dog” typically end up close together in embedding space, with high cosine similarity.
This geometric structure emerges because the model learns embeddings that help predict context.
In PyTorch:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(
    num_embeddings=50000,
    embedding_dim=768,
)

tokens = torch.tensor([10, 42, 900])
x = embedding(tokens)
print(x.shape)  # torch.Size([3, 768])
```

Each integer token ID becomes a 768-dimensional vector.
Feedforward Neural Language Models
One of the earliest neural language models was the feedforward model introduced by Yoshua Bengio and collaborators in 2003.
Instead of storing explicit $n$-gram probabilities, the model learns a neural function over embeddings.

Suppose we use a context window of length $n$:

$$(w_{t-n}, \dots, w_{t-1}).$$

Each token is mapped to an embedding vector:

$$e_{t-i} = E_{w_{t-i}}, \quad i = 1, \dots, n.$$

These vectors are concatenated:

$$x = [e_{t-n}; \dots; e_{t-1}].$$

The concatenated vector is passed through hidden layers:

$$h = \sigma(W x + b),$$

where $\sigma$ is an activation function such as tanh or ReLU.

The output logits are

$$z = U h + c.$$

Finally, a softmax converts logits into probabilities:

$$P(w_t \mid w_{t-n}, \dots, w_{t-1}) = \operatorname{softmax}(z).$$
Unlike count-based models, the neural network shares parameters across contexts. Similar embeddings lead to similar predictions, even for contexts not seen during training.
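The architecture above can be sketched as a small PyTorch module (a minimal sketch of a Bengio-style feedforward model; all sizes here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class FeedforwardLM(nn.Module):
    """Predicts the next token from a fixed window of previous tokens."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, context_len):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_len * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):            # context: [batch, context_len]
        e = self.embedding(context)        # [batch, context_len, embed_dim]
        x = e.flatten(start_dim=1)         # concatenate the window embeddings
        h = torch.tanh(self.hidden(x))     # hidden representation
        return self.output(h)              # logits over the vocabulary

model = FeedforwardLM(vocab_size=1000, embed_dim=32, hidden_dim=64, context_len=4)
logits = model(torch.randint(0, 1000, (8, 4)))
print(logits.shape)  # torch.Size([8, 1000])
```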
The Softmax Function
The output layer of a language model usually produces logits:

$$z = (z_1, \dots, z_{|V|}) \in \mathbb{R}^{|V|}.$$

These logits are unnormalized scores. To convert them into probabilities, we apply the softmax function:

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{|V|} e^{z_j}}.$$

The output distribution satisfies $p_i > 0$ and $\sum_{i=1}^{|V|} p_i = 1$.

The token with the highest probability is the model’s most likely next token.
In PyTorch:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 50000)
probs = F.softmax(logits, dim=-1)
print(probs.shape)  # torch.Size([4, 50000])
```

Computing softmax over large vocabularies can be expensive because every prediction requires normalization across the entire vocabulary.
Several approximations were historically developed to reduce this cost, including hierarchical softmax, sampled softmax, and noise-contrastive estimation.
Modern transformer systems often rely on large-scale GPU parallelism instead.
Cross-Entropy Training
Neural language models are trained using maximum likelihood estimation.
Given a training sequence

$$w_1, w_2, \dots, w_T,$$

the objective is

$$\max_{\theta} \sum_{t=1}^{T} \log P(w_t \mid w_1, \dots, w_{t-1}; \theta).$$

Equivalently, we minimize the negative log-likelihood:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P(w_t \mid w_1, \dots, w_{t-1}; \theta).$$

This objective is implemented using cross-entropy loss.

For one target token $w$, if the model predicts probability distribution $p$, the loss is

$$\mathcal{L} = -\log p_w.$$
If the model assigns high probability to the correct token, the loss is small. If it assigns low probability, the loss becomes large.
For example:
| Correct token probability | Loss |
|---|---|
| 0.9 | 0.105 |
| 0.5 | 0.693 |
| 0.01 | 4.605 |
The logarithm strongly penalizes confident incorrect predictions.
In PyTorch:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 50000)
targets = torch.randint(0, 50000, (8,))
loss = F.cross_entropy(logits, targets)
print(loss)
```

The function `cross_entropy` internally applies log-softmax and computes the negative log-likelihood.
Learning Semantic Structure
Neural language models learn semantic and syntactic structure because predicting language requires understanding patterns in context.
Suppose the training corpus frequently contains sentences such as
- “the cat sat on the mat”
- “the dog sat on the floor”
- “the child sat on the chair”
The model learns that “cat,” “dog,” and “child” often appear in similar grammatical contexts. Their embeddings therefore become similar.
This phenomenon is sometimes summarized by the distributional hypothesis:
> Words appearing in similar contexts tend to have similar meanings.
Embedding geometry often captures surprisingly rich relationships:
| Relationship | Vector pattern |
|---|---|
| gender | king - man + woman ≈ queen |
| tense | walk - walked ≈ run - ran |
| geography | Paris - France ≈ Tokyo - Japan |
These patterns are not explicitly programmed. They emerge from the optimization objective.
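The vector-arithmetic pattern can be illustrated with hand-constructed embeddings (a toy sketch; these 2-dimensional vectors are invented for illustration, not learned from data):

```python
import torch

# Toy 2-d embeddings: dimension 0 ~ "royalty", dimension 1 ~ "gender".
# These values are invented for illustration, not learned embeddings.
emb = {
    "king":  torch.tensor([1.0,  1.0]),
    "queen": torch.tensor([1.0, -1.0]),
    "man":   torch.tensor([0.0,  1.0]),
    "woman": torch.tensor([0.0, -1.0]),
}

query = emb["king"] - emb["man"] + emb["woman"]

# Find the nearest vocabulary vector by cosine similarity.
best = max(emb, key=lambda w: torch.cosine_similarity(query, emb[w], dim=0).item())
print(best)  # queen
```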
Continuous Generalization
A classical $n$-gram model must observe a specific context to estimate its probability reliably.
Neural language models generalize continuously. If two contexts produce similar hidden representations, the model can produce similar predictions even if one context never appeared during training.
For example, the contexts

- “the black cat sat”
- “the white dog sat”

may activate similar internal representations because the embeddings for “black” and “white” are related, and the embeddings for “cat” and “dog” are related.
This parameter sharing is one of the major advantages of neural models.
Instead of memorizing all possible contexts, the model learns smooth functions over representation space.
Context Windows
Early feedforward neural language models used fixed-length contexts.
For example:

$$P(w_t \mid w_{t-4}, w_{t-3}, w_{t-2}, w_{t-1}).$$

Only the previous four tokens are visible. This limitation resembles an $n$-gram model, although the neural representation generalizes better.
Fixed windows create several problems:
- Important information outside the window is inaccessible.
- Larger windows increase parameter count.
- Long-range dependencies remain difficult.
These limitations motivated recurrent neural networks.
Recurrent Neural Language Models
Recurrent neural networks process sequences one token at a time while maintaining a hidden state.
At time step $t$:

$$h_t = f(h_{t-1}, x_t),$$

where $h_t$ is the hidden state and $x_t$ is the embedding of the current token.

The next-token distribution is computed from the hidden state:

$$P(w_{t+1} \mid w_1, \dots, w_t) = \operatorname{softmax}(W_o h_t).$$
The hidden state acts as a compressed summary of previous tokens.
Unlike fixed-window models, recurrent networks can theoretically condition on arbitrarily long contexts.
A simple recurrent update may be written as

$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b).$$
The model repeatedly updates the hidden state as new tokens arrive.
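This update can be sketched directly as a loop (a minimal sketch; the dimensions and random weights are illustrative assumptions):

```python
import torch

embed_dim, hidden_dim, seq_len = 16, 32, 5

# Illustrative random parameters for the simple tanh update.
W_h = torch.randn(hidden_dim, hidden_dim) * 0.1
W_x = torch.randn(hidden_dim, embed_dim) * 0.1
b = torch.zeros(hidden_dim)

h = torch.zeros(hidden_dim)                    # initial hidden state
for x_t in torch.randn(seq_len, embed_dim):    # one embedding per time step
    h = torch.tanh(W_h @ h + W_x @ x_t + b)    # h_t = tanh(W_h h_{t-1} + W_x x_t + b)

print(h.shape)  # torch.Size([32])
```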
In practice, standard RNNs struggle with long-range dependencies because gradients vanish or explode during training.
LSTM and GRU architectures partially solve this problem using gating mechanisms.
Neural Language Modeling in PyTorch
A minimal recurrent language model can be implemented using embeddings, an RNN, and a linear output layer.
```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)       # [batch, seq, embed_dim]
        h, _ = self.rnn(x)          # [batch, seq, hidden_dim]
        logits = self.output(h)     # [batch, seq, vocab_size]
        return logits
```

Suppose:
- batch size = 32
- sequence length = 128
- vocabulary size = 50,000
Then:
| Tensor | Shape |
|---|---|
| Input tokens | [32, 128] |
| Embeddings | [32, 128, d] |
| Hidden states | [32, 128, h] |
| Output logits | [32, 128, 50000] |
Training uses shifted targets:

```python
import torch

tokens = torch.randint(0, 50000, (32, 129))
x = tokens[:, :-1]   # inputs: positions 0..127
y = tokens[:, 1:]    # targets: positions 1..128
```

The model predicts the next token at every position.
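Putting the pieces together, one training step might look like this (a minimal sketch with small illustrative sizes; the model mirrors the GRU language model defined above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embedding(x))
        return self.output(h)

vocab_size = 1000                                # small illustrative vocabulary
model = RNNLanguageModel(vocab_size, embed_dim=32, hidden_dim=64)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (4, 17))   # toy batch
x, y = tokens[:, :-1], tokens[:, 1:]             # shifted targets

logits = model(x)                                # [4, 16, vocab_size]
# cross_entropy expects [N, C] inputs; flatten batch and time dimensions.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))
loss.backward()
optimizer.step()
print(loss.item())
```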
Exposure Bias
During training, autoregressive language models usually receive true previous tokens as context. This is called teacher forcing.
At inference time, however, the model receives its own generated tokens.
This mismatch can cause error accumulation.
Suppose the model generates one incorrect token. That incorrect token becomes part of the future context, potentially causing more errors.
This phenomenon is called exposure bias.
Several methods attempt to reduce it:
| Method | Idea |
|---|---|
| Scheduled sampling | Gradually replace ground-truth tokens with model predictions during training |
| Sequence-level training | Optimize sequence objectives directly |
| Reinforcement learning | Optimize long-horizon generation quality |
Modern transformer models still use teacher forcing during pretraining because it remains computationally efficient and effective at scale.
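The contrast between teacher forcing and free-running generation can be sketched as follows (a minimal sketch; the greedy decoding loop and all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Tiny untrained GRU language model for illustration.
vocab_size = 50
embedding = nn.Embedding(vocab_size, 16)
rnn = nn.GRU(16, 32, batch_first=True)
output = nn.Linear(32, vocab_size)

# Teacher forcing (training): the TRUE previous tokens form the context.
true_tokens = torch.randint(0, vocab_size, (1, 8))
h, _ = rnn(embedding(true_tokens))
logits = output(h)          # predictions at every position, conditioned on ground truth

# Free running (inference): the model's OWN outputs form the context.
generated = torch.randint(0, vocab_size, (1, 1))  # arbitrary start token
for _ in range(7):
    h, _ = rnn(embedding(generated))
    next_token = output(h[:, -1]).argmax(dim=-1, keepdim=True)  # greedy choice
    generated = torch.cat([generated, next_token], dim=1)       # errors feed forward

print(generated.shape)  # torch.Size([1, 8])
```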
Scaling Neural Language Models
Neural language model performance improves strongly with scale.
Three scaling dimensions are especially important:
| Dimension | Meaning |
|---|---|
| Model size | Number of parameters |
| Dataset size | Number of training tokens |
| Compute budget | Total training computation |
Larger models can learn richer statistical structure. Larger datasets expose the model to more language patterns. More compute enables longer training and larger architectures.
Empirical scaling laws show that language model loss decreases predictably as these quantities increase.
This scaling behavior eventually led to transformer-based large language models with billions or trillions of parameters.
Limitations of Early Neural Models
Early neural language models improved greatly over $n$-gram systems, but they still had limitations.
- Feedforward models used fixed windows.
- Recurrent models processed tokens sequentially, limiting parallelism.
- RNN hidden states compressed all past information into one vector, creating bottlenecks.
- Long-range dependencies remained difficult.
- Training very deep recurrent systems was unstable.
These issues motivated attention mechanisms and transformer architectures, which allow direct interactions between tokens across long contexts.
Transition to Transformers
Neural language modeling evolved through several major stages:
| Era | Main idea |
|---|---|
| Statistical models | Count-based conditional probabilities |
| Feedforward neural models | Distributed embeddings and learned functions |
| Recurrent models | Sequential hidden-state processing |
| Attention models | Direct context access |
| Transformers | Fully attention-based sequence modeling |
Transformers removed recurrence entirely and replaced it with self-attention.
Instead of compressing history into one hidden state, the model computes interactions between all tokens in the context window.
This change dramatically improved scalability, optimization, and long-range modeling.
Modern large language models are transformer language models trained autoregressively on massive corpora using the same probabilistic objective introduced in classical statistical language modeling.