Neural Language Models

Statistical language models estimate probabilities from discrete counts. Neural language models replace count tables with differentiable functions parameterized by neural networks. Instead of memorizing exact token sequences, the model learns continuous representations that generalize across similar contexts.

A neural language model defines a conditional probability distribution

p_\theta(x_t \mid x_{1:t-1}),

where \theta denotes the model parameters. These parameters are learned from data by maximizing the likelihood of observed sequences.

The central idea is simple: map tokens into vectors, transform those vectors through neural layers, and use the resulting representation to predict the next token.

Distributed Representations

Classical statistical models treat words as atomic, unrelated symbols. Neural language models instead represent tokens as dense vectors called embeddings.

Suppose the vocabulary size is

|V| = 50{,}000.

A one-hot representation of a token w_i is a vector

x_i \in \mathbb{R}^{|V|},

where exactly one entry is 1 and all others are 0.

For example, if the vocabulary contains five words,

| Token | One-hot vector |
|-------|----------------|
| cat   | [1, 0, 0, 0, 0] |
| dog   | [0, 1, 0, 0, 0] |
| bird  | [0, 0, 1, 0, 0] |

This representation has no notion of similarity. “Cat” and “dog” are orthogonal vectors.
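
This orthogonality is easy to check; a minimal sketch in PyTorch (the toy vocabulary size and token indices are illustrative):

```python
import torch

# One-hot vectors for a toy 5-word vocabulary (indices are illustrative).
vocab_size = 5
cat = torch.zeros(vocab_size)
cat[0] = 1.0
dog = torch.zeros(vocab_size)
dog[1] = 1.0

# Distinct one-hot vectors are orthogonal: their dot product is zero,
# so the representation carries no similarity information.
print(torch.dot(cat, dog).item())  # 0.0
```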

Neural language models learn an embedding matrix

E \in \mathbb{R}^{|V| \times d},

where d is the embedding dimension. Each token corresponds to one row of the matrix:

e_i = E[i].

The token is therefore represented by a dense vector

e_i \in \mathbb{R}^d.

Words with similar meanings often acquire similar embeddings during training.

For example,

\text{king} - \text{man} + \text{woman} \approx \text{queen}.

This geometric structure emerges because the model learns embeddings that help predict context.

In PyTorch:

import torch
import torch.nn as nn

embedding = nn.Embedding(
    num_embeddings=50000,
    embedding_dim=768
)

tokens = torch.tensor([10, 42, 900])

x = embedding(tokens)

print(x.shape)  # torch.Size([3, 768])

Each integer token ID becomes a 768-dimensional vector.

Feedforward Neural Language Models

One of the earliest neural language models was the feedforward model introduced by Yoshua Bengio and collaborators in 2003.

Instead of storing explicit n-gram probabilities, the model learns a neural function over embeddings.

Suppose we use a context window of length n-1:

(x_{t-n+1}, \ldots, x_{t-1}).

Each token is mapped to an embedding vector:

e_{t-i} \in \mathbb{R}^d.

These vectors are concatenated:

h_0 = [e_{t-n+1}; e_{t-n+2}; \cdots ; e_{t-1}].

The concatenated vector is passed through hidden layers:

h_1 = \phi(W_1 h_0 + b_1),

where \phi is an activation function such as tanh or ReLU.

The output logits are

z = W_2 h_1 + b_2.

Finally, a softmax converts logits into probabilities:

p(x_t = w_i \mid x_{t-n+1:t-1}) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}.

Unlike count-based models, the neural network shares parameters across contexts. Similar embeddings lead to similar predictions, even for contexts not seen during training.
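
A minimal sketch of this feedforward architecture in PyTorch (the class name `FFLanguageModel` and all dimensions are illustrative, not taken from the original 2003 model):

```python
import torch
import torch.nn as nn

class FFLanguageModel(nn.Module):
    # Bengio-style feedforward LM: embed a fixed window of n-1 tokens,
    # concatenate the embeddings, apply one hidden layer, and project
    # to vocabulary logits.
    def __init__(self, vocab_size, embed_dim, hidden_dim, context_len):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_len * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):            # x: [batch, context_len]
        e = self.embedding(x)        # [batch, context_len, embed_dim]
        h0 = e.flatten(start_dim=1)  # concatenation: [batch, context_len * embed_dim]
        h1 = torch.tanh(self.hidden(h0))
        return self.output(h1)       # logits: [batch, vocab_size]

model = FFLanguageModel(vocab_size=50000, embed_dim=64,
                        hidden_dim=128, context_len=4)
context = torch.randint(0, 50000, (8, 4))
logits = model(context)
print(logits.shape)  # torch.Size([8, 50000])
```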

The Softmax Function

The output layer of a language model usually produces logits:

z \in \mathbb{R}^{|V|}.

These logits are unnormalized scores. To convert them into probabilities, we apply the softmax function:

genui{“math_block_widget_always_prefetch_v2”:{“content”:“p_i=\frac{e^{z_i}}{\sum_{j=1}^{|V|} e^{z_j}}”}}

The output distribution satisfies

\sum_i p_i = 1.

The token with the highest probability is the model’s most likely next token.

In PyTorch:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 50000)

probs = F.softmax(logits, dim=-1)

print(probs.shape)  # [4, 50000]

Computing the softmax over a large vocabulary is expensive because every prediction requires normalizing over all |V| logits.

Several approximations were historically developed to reduce this cost, including hierarchical softmax, sampled softmax, and noise-contrastive estimation.

Modern transformer systems often rely on large-scale GPU parallelism instead.

Cross-Entropy Training

Neural language models are trained using maximum likelihood estimation.

Given a training sequence

x_{1:T},

the objective is

\max_\theta \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).

Equivalently, we minimize the negative log-likelihood:

\mathcal{L} = - \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).

This objective is implemented using cross-entropy loss.

For one target token y, if the model predicts probability distribution p, the loss is

\mathcal{L} = -\log p(y).

If the model assigns high probability to the correct token, the loss is small. If it assigns low probability, the loss becomes large.

For example:

| Correct token probability | Loss |
|---------------------------|------|
| 0.9  | 0.105 |
| 0.5  | 0.693 |
| 0.01 | 4.605 |

The logarithm strongly penalizes confident incorrect predictions.
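
These table values follow directly from the negative log; a quick check:

```python
import math

# Reproducing the table above: loss = -log(p) for the probability
# the model assigns to the correct token.
for p in [0.9, 0.5, 0.01]:
    print(f"{p}: {-math.log(p):.3f}")
# 0.9: 0.105
# 0.5: 0.693
# 0.01: 4.605
```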

In PyTorch:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 50000)
targets = torch.randint(0, 50000, (8,))

loss = F.cross_entropy(logits, targets)

print(loss)

The function cross_entropy internally applies log-softmax and computes the negative log-likelihood.

Learning Semantic Structure

Neural language models learn semantic and syntactic structure because predicting language requires understanding patterns in context.

Suppose the training corpus frequently contains sentences such as

  • “the cat sat on the mat”
  • “the dog sat on the floor”
  • “the child sat on the chair”

The model learns that “cat,” “dog,” and “child” often appear in similar grammatical contexts. Their embeddings therefore become similar.

This phenomenon is sometimes summarized by the distributional hypothesis:

Words appearing in similar contexts tend to have similar meanings.

Embedding geometry often captures surprisingly rich relationships:

| Relationship | Vector pattern |
|--------------|----------------|
| gender    | king - man + woman ≈ queen |
| tense     | walk - walked ≈ run - ran |
| geography | Paris - France ≈ Tokyo - Japan |

These patterns are not explicitly programmed. They emerge from the optimization objective.

Continuous Generalization

A classical n-gram model must observe a specific context to estimate its probability reliably.

Neural language models generalize continuously. If two contexts produce similar hidden representations, the model can produce similar predictions even if one context never appeared during training.

For example:

\text{the black cat sat}

and

\text{the white dog sat}

may activate similar internal representations because the embeddings for “black” and “white” are related, and the embeddings for “cat” and “dog” are related.

This parameter sharing is one of the major advantages of neural models.

Instead of memorizing all possible contexts, the model learns smooth functions over representation space.

Context Windows

Early feedforward neural language models used fixed-length contexts.

For example:

p(x_t \mid x_{t-4:t-1}).

Only the previous four tokens are visible. This limitation resembles an n-gram model, although the neural representation generalizes better.
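
Under a fixed window, training data reduces to (context, target) pairs; a minimal sketch (the token IDs are illustrative):

```python
# Slide a fixed window of 4 previous tokens over a sequence to produce
# (context, target) training pairs.
tokens = [3, 17, 9, 42, 8, 51, 2]
window = 4

pairs = [
    (tokens[i - window:i], tokens[i])
    for i in range(window, len(tokens))
]
print(pairs[0])  # ([3, 17, 9, 42], 8)
```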

Fixed windows create several problems:

  1. Important information outside the window is inaccessible.
  2. Larger windows increase parameter count.
  3. Long-range dependencies remain difficult.

These limitations motivated recurrent neural networks.

Recurrent Neural Language Models

Recurrent neural networks process sequences one token at a time while maintaining a hidden state.

At time step t:

h_t = f(h_{t-1}, x_t),

where h_t is the hidden state.

The next-token distribution is computed from the hidden state:

p(x_{t+1} \mid x_{1:t}) = g(h_t).

The hidden state acts as a compressed summary of previous tokens.

Unlike fixed-window models, recurrent networks can theoretically condition on arbitrarily long contexts.

A simple recurrent update may be written as

h_t = \tanh(W_h h_{t-1} + W_x x_t + b).

The model repeatedly updates the hidden state as new tokens arrive.
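
This recurrence can be sketched directly in PyTorch (all dimensions and the weight scaling are illustrative):

```python
import torch

# A step-by-step sketch of the simple tanh recurrence
# h_t = tanh(W_h h_{t-1} + W_x x_t + b).
d, h = 16, 32
W_h = torch.randn(h, h) * 0.1
W_x = torch.randn(h, d) * 0.1
b = torch.zeros(h)

h_t = torch.zeros(h)              # initial hidden state
for x_t in torch.randn(5, d):     # five input embeddings, one per step
    h_t = torch.tanh(W_h @ h_t + W_x @ x_t + b)

print(h_t.shape)  # torch.Size([32])
```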

In practice, standard RNNs struggle with long-range dependencies because gradients vanish or explode during training.

LSTM and GRU architectures partially solve this problem using gating mechanisms.

Neural Language Modeling in PyTorch

A minimal recurrent language model can be implemented using embeddings, an RNN, and a linear output layer.

import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(
        self,
        vocab_size,
        embed_dim,
        hidden_dim,
    ):
        super().__init__()

        self.embedding = nn.Embedding(
            vocab_size,
            embed_dim
        )

        self.rnn = nn.GRU(
            embed_dim,
            hidden_dim,
            batch_first=True
        )

        self.output = nn.Linear(
            hidden_dim,
            vocab_size
        )

    def forward(self, x):
        x = self.embedding(x)

        h, _ = self.rnn(x)

        logits = self.output(h)

        return logits

Suppose:

  • batch size = 32
  • sequence length = 128
  • vocabulary size = 50,000

Then:

| Tensor | Shape |
|--------|-------|
| Input tokens  | [32, 128] |
| Embeddings    | [32, 128, d] |
| Hidden states | [32, 128, h] |
| Output logits | [32, 128, 50000] |

Training uses shifted targets:

tokens = torch.randint(
    0,
    50000,
    (32, 129)
)

x = tokens[:, :-1]
y = tokens[:, 1:]

The model predicts the next token at every position.
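
Putting these shapes together, one training step might look like the following sketch (an untrained GRU model with illustrative dimensions; in practice an optimizer would apply the gradients):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One training step on shifted targets, sketched with a small GRU LM.
vocab_size, embed_dim, hidden_dim = 50000, 64, 128
embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
output = nn.Linear(hidden_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (32, 129))
x, y = tokens[:, :-1], tokens[:, 1:]   # inputs and shifted targets

h, _ = rnn(embedding(x))
logits = output(h)                      # [32, 128, vocab_size]

# cross_entropy expects [N, C] logits and [N] targets,
# so flatten the batch and time dimensions together.
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    y.reshape(-1),
)
loss.backward()                         # gradients for an optimizer step
print(loss.item())
```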

Exposure Bias

During training, autoregressive language models usually receive true previous tokens as context. This is called teacher forcing.

At inference time, however, the model receives its own generated tokens.

This mismatch can cause error accumulation.

Suppose the model generates one incorrect token. That incorrect token becomes part of the future context, potentially causing more errors.

This phenomenon is called exposure bias.

Several methods attempt to reduce it:

| Method | Idea |
|--------|------|
| Scheduled sampling | Gradually replace ground-truth tokens with model predictions during training |
| Sequence-level training | Optimize sequence objectives directly |
| Reinforcement learning | Optimize long-horizon generation quality |

Modern transformer models still use teacher forcing during pretraining because it remains computationally efficient and effective at scale.
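
The inference-time loop behind exposure bias can be sketched as greedy decoding, where each prediction is fed back into the context for the next step (the tiny untrained model and start token are illustrative):

```python
import torch
import torch.nn as nn

# Greedy autoregressive decoding: at inference time the model is
# conditioned on its OWN previous outputs, not ground truth. This is
# the mismatch with teacher forcing that causes exposure bias.
vocab_size = 100
embedding = nn.Embedding(vocab_size, 16)
rnn = nn.GRU(16, 32, batch_first=True)
output = nn.Linear(32, vocab_size)

generated = [0]                            # start token (illustrative)
for _ in range(5):
    x = torch.tensor([generated])          # condition on everything so far
    h, _ = rnn(embedding(x))
    next_token = output(h[:, -1]).argmax(dim=-1).item()
    generated.append(next_token)           # the prediction feeds back in

print(generated)  # start token followed by 5 generated tokens
```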

Scaling Neural Language Models

Neural language model performance improves strongly with scale.

Three scaling dimensions are especially important:

| Dimension | Meaning |
|-----------|---------|
| Model size | Number of parameters |
| Dataset size | Number of training tokens |
| Compute budget | Total training computation |

Larger models can learn richer statistical structure. Larger datasets expose the model to more language patterns. More compute enables longer training and larger architectures.

Empirical scaling laws show that language model loss decreases predictably as these quantities increase.
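
One widely reported form of these laws (a sketch; the constant N_c and exponent \alpha_N are empirical fits, and these symbols are not defined elsewhere in this document) expresses loss as a power law in the parameter count N:

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```

Analogous power laws have been reported for dataset size and compute budget.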

This scaling behavior eventually led to transformer-based large language models with billions or trillions of parameters.

Limitations of Early Neural Models

Early neural language models improved greatly over n-gram systems, but they still had limitations.

  • Feedforward models used fixed windows.
  • Recurrent models processed tokens sequentially, limiting parallelism.
  • RNN hidden states compressed all past information into one vector, creating bottlenecks.
  • Long-range dependencies remained difficult.
  • Training very deep recurrent systems was unstable.

These issues motivated attention mechanisms and transformer architectures, which allow direct interactions between tokens across long contexts.

Transition to Transformers

Neural language modeling evolved through several major stages:

| Era | Main idea |
|-----|-----------|
| Statistical models | Count-based conditional probabilities |
| Feedforward neural models | Distributed embeddings and learned functions |
| Recurrent models | Sequential hidden-state processing |
| Attention models | Direct context access |
| Transformers | Fully attention-based sequence modeling |

Transformers removed recurrence entirely and replaced it with self-attention.

Instead of compressing history into one hidden state, the model computes interactions between all tokens in the context window.

This change dramatically improved scalability, optimization, and long-range modeling.

Modern large language models are transformer language models trained autoregressively on massive corpora using the same probabilistic objective introduced in classical statistical language modeling.