Word Embeddings

Natural language models cannot operate directly on words as strings. A neural network receives numbers, performs arithmetic on those numbers, and produces numerical outputs. Before text can be processed by a model, words must be represented as vectors.

A word embedding is a vector representation of a word. Instead of representing a word as a discrete symbol such as "cat" or "run", we represent it as a point in a continuous vector space:

$$\text{cat} \mapsto x_{\text{cat}} \in \mathbb{R}^d.$$

The dimension $d$ is called the embedding dimension. Common embedding dimensions include 50, 100, 300, 768, 1024, or larger values in modern language models.

The central idea is simple: words with related meanings should have related vectors. For example, the vectors for cat, dog, and animal should be closer to one another than the vectors for cat, democracy, and voltage.

From Symbols to Vectors

A vocabulary is a finite set of tokens:

$$V = \{w_1, w_2, \dots, w_{|V|}\}.$$

Each word or token is assigned an integer ID. For example:

| Token | Integer ID |
| --- | --- |
| <pad> | 0 |
| <unk> | 1 |
| the | 2 |
| cat | 3 |
| sat | 4 |
| on | 5 |
| mat | 6 |

A sentence such as

the cat sat on the mat

can then be represented as a sequence of integer IDs:

$$[2, 3, 4, 5, 2, 6].$$

These integers are indices, not numerical measurements. The fact that cat has ID 3 and sat has ID 4 does not mean that sat is numerically greater than cat. The IDs only tell the model where to look in an embedding table.
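
For illustration, the mapping from tokens to IDs can be sketched with an ordinary Python dictionary. The vocabulary below is the toy example from the table above; real tokenizers build this mapping from a corpus:

vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3, "sat": 4, "on": 5, "mat": 6}

sentence = "the cat sat on the mat"
token_ids = [vocab.get(word, vocab["<unk>"]) for word in sentence.split()]

print(token_ids)  # [2, 3, 4, 5, 2, 6]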

One-Hot Encodings

Before embeddings, a simple way to represent a word is by a one-hot vector. If the vocabulary size is $|V|$, then each word is represented by a vector in $\mathbb{R}^{|V|}$ with one entry equal to 1 and all others equal to 0.

For example, if the vocabulary has seven tokens, the word cat with ID 3 may be represented as

$$x_{\text{cat}} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}.$$

One-hot vectors are simple, but they have two major problems.

First, they are high-dimensional. A vocabulary may contain 50,000, 100,000, or millions of tokens. This makes one-hot vectors large and inefficient.

Second, they do not encode similarity. The one-hot vectors for cat and dog are just as different as the one-hot vectors for cat and airplane. Every pair of distinct words has the same dot product, namely zero.

Word embeddings solve both problems by mapping words into dense, lower-dimensional vectors.
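
One way to see the connection: multiplying a one-hot vector by a matrix simply selects one row of that matrix, which is exactly what an embedding lookup does. A minimal sketch, with a random matrix standing in for the embedding table:

import torch

vocab_size, embedding_dim = 7, 4
E = torch.randn(vocab_size, embedding_dim)  # stand-in embedding matrix

one_hot_cat = torch.zeros(vocab_size)
one_hot_cat[3] = 1.0  # cat has ID 3

# The one-hot product picks out row 3 of E, the same as direct indexing.
print(torch.allclose(one_hot_cat @ E, E[3]))  # True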

The Embedding Matrix

An embedding layer is a lookup table. If the vocabulary has size $|V|$ and each embedding has dimension $d$, then the embedding layer stores a matrix

$$E \in \mathbb{R}^{|V| \times d}.$$

Each row of $E$ is the embedding vector for one token. If token $w_i$ has integer ID $i$, then its embedding is the $i$-th row of $E$:

$$x_i = E_i.$$

For a sequence of token IDs

$$[t_1, t_2, \dots, t_T],$$

the embedding layer returns

$$[E_{t_1}, E_{t_2}, \dots, E_{t_T}].$$

The output has shape

$$T \times d.$$

For a batch of $B$ sequences, each of length $T$, the input token IDs have shape

$$B \times T,$$

and the embedding output has shape

$$B \times T \times d.$$

In PyTorch:

import torch
import torch.nn as nn

vocab_size = 10000
embedding_dim = 300

embedding = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([
    [2, 3, 4, 5],
    [2, 8, 9, 0],
])

x = embedding(token_ids)

print(token_ids.shape)  # torch.Size([2, 4])
print(x.shape)          # torch.Size([2, 4, 300])

The input is a batch of two sequences, each with four tokens. The output is a batch of two sequences, where each token has been replaced by a 300-dimensional vector.

Embedding Layers Are Learnable

The embedding matrix is a model parameter. During training, the entries of the embedding matrix are updated by gradient descent.

Suppose a model predicts a label from a sentence. The embedding vectors influence the prediction. The loss measures how wrong the prediction is. Backpropagation computes gradients not only for the later layers of the model, but also for the embedding vectors used in the input.

If the word cat appears in a training example, then the row of the embedding matrix corresponding to cat receives a gradient update. Over many examples, words that occur in similar contexts tend to acquire similar embeddings.

In PyTorch:

embedding = nn.Embedding(10000, 300)

print(embedding.weight.shape)
print(embedding.weight.requires_grad)

Output:

torch.Size([10000, 300])
True

The tensor embedding.weight is the embedding matrix. Since requires_grad is True, PyTorch will compute gradients for it during training.
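
A short sketch, with made-up token IDs and a dummy loss, illustrates that only the rows for tokens that actually appear in a batch receive nonzero gradients:

import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)

token_ids = torch.tensor([[2, 3, 3]])
loss = embedding(token_ids).sum()  # dummy loss, just to trigger backpropagation
loss.backward()

print(embedding.weight.grad[2])  # nonzero: token 2 appeared in the batch
print(embedding.weight.grad[5])  # all zeros: token 5 never appeared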

A Simple Text Classifier with Embeddings

Consider a sentence classification model. The input is a sequence of token IDs. The model embeds each token, averages the token embeddings, and applies a linear classifier.

import torch
import torch.nn as nn

class AverageEmbeddingClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_classes, padding_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(
            vocab_size,
            embedding_dim,
            padding_idx=padding_idx,
        )
        self.padding_idx = padding_idx
        self.classifier = nn.Linear(embedding_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: [batch_size, sequence_length]
        x = self.embedding(token_ids)
        # x: [batch_size, sequence_length, embedding_dim]

        mask = token_ids != self.padding_idx
        # mask: [batch_size, sequence_length]

        mask = mask.unsqueeze(-1)
        # mask: [batch_size, sequence_length, 1]

        x = x * mask
        # padding positions become zero vectors

        lengths = mask.sum(dim=1).clamp(min=1)
        # lengths: [batch_size, 1]

        pooled = x.sum(dim=1) / lengths
        # pooled: [batch_size, embedding_dim]

        logits = self.classifier(pooled)
        # logits: [batch_size, num_classes]

        return logits

This model is simple, but it shows the basic role of embeddings. The embedding layer converts token IDs into vectors. The rest of the network operates on those vectors.
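
A quick usage sketch with arbitrary sizes, running the classifier on a padded batch of token IDs (0 is the padding ID):

model = AverageEmbeddingClassifier(vocab_size=10000, embedding_dim=300, num_classes=2)

token_ids = torch.tensor([
    [2, 3, 4, 0, 0, 0],
    [2, 7, 4, 5, 2, 6],
])

logits = model(token_ids)
print(logits.shape)  # torch.Size([2, 2])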

Padding and padding_idx

Text sequences often have different lengths. Neural networks usually process batches as rectangular tensors, so shorter sequences are padded.

For example:

the cat sat
the dog sat on the mat

may become

the cat sat <pad> <pad> <pad>
the dog sat on    the   mat

The corresponding token ID tensor may be

token_ids = torch.tensor([
    [2, 3, 4, 0, 0, 0],
    [2, 7, 4, 5, 2, 6],
])

Padding tokens carry no meaning, so they should not affect the model’s output. PyTorch’s nn.Embedding supports a padding_idx argument:

embedding = nn.Embedding(
    num_embeddings=10000,
    embedding_dim=300,
    padding_idx=0,
)

The row at index 0 is treated as the padding embedding. PyTorch keeps this vector fixed at zero during training unless manually modified. This is useful because padding tokens are structural artifacts rather than meaningful words.
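
A quick check, with arbitrary sizes, that the padding row starts at zero and receives no gradient:

import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4, padding_idx=0)

print(embedding.weight[0])  # all zeros

loss = embedding(torch.tensor([[0, 3]])).sum()
loss.backward()
print(embedding.weight.grad[0])  # all zeros: the padding row is never updated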

Pretrained Word Embeddings

Embedding matrices can be learned from scratch, but they can also be initialized from pretrained embeddings. Classic pretrained word embeddings include Word2Vec, GloVe, and FastText.

These embeddings are trained on large text corpora before being used in a downstream model. They capture statistical patterns from language. For example, words that appear in similar contexts tend to have similar vectors.

Pretrained embeddings were especially useful before large transformer models became dominant. They remain useful for small datasets, lightweight models, and settings where training a large language model is impractical.

A pretrained embedding matrix can be loaded into PyTorch:

embedding = nn.Embedding(vocab_size, embedding_dim)

pretrained_weight = torch.randn(vocab_size, embedding_dim)
embedding.weight.data.copy_(pretrained_weight)

The embedding matrix may be frozen:

embedding.weight.requires_grad = False

or fine-tuned:

embedding.weight.requires_grad = True

Freezing preserves the original pretrained vectors. Fine-tuning allows the vectors to adapt to the task.
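
PyTorch also provides nn.Embedding.from_pretrained, which builds the layer directly from a weight matrix; passing freeze=True is equivalent to setting requires_grad to False. The random matrix below is again only a stand-in for real pretrained vectors:

pretrained_weight = torch.randn(vocab_size, embedding_dim)  # stand-in for GloVe, Word2Vec, etc.

embedding = nn.Embedding.from_pretrained(
    pretrained_weight,
    freeze=True,
)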

Similarity Between Word Embeddings

Once words are represented as vectors, similarity can be measured with vector operations. A common measure is cosine similarity:

$$\operatorname{cos}(x, y) = \frac{x^\top y}{\|x\|\,\|y\|}.$$

Cosine similarity measures the angle between two vectors. It is often more useful than Euclidean distance because the direction of an embedding vector usually matters more than its length.

In PyTorch:

import torch.nn.functional as F

x = embedding(torch.tensor([3]))  # cat
y = embedding(torch.tensor([7]))  # dog

similarity = F.cosine_similarity(x, y)
print(similarity)

A high cosine similarity means the vectors point in similar directions.
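
The same computation extends to nearest-neighbor lookups: compare one word’s vector against every row of the embedding matrix at once. A sketch continuing with the embedding layer from above (with an untrained matrix the neighbors are meaningless; with trained embeddings they tend to be related words):

query = embedding.weight[3]  # vector for cat

# Cosine similarity between the query and every row of the embedding matrix.
similarities = F.cosine_similarity(query.unsqueeze(0), embedding.weight, dim=1)

top_values, top_ids = similarities.topk(5)
print(top_ids)  # IDs of the five most similar tokens (including cat itself)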

Distributional Meaning

The theoretical motivation for word embeddings comes from the distributional view of language: a word’s meaning is partly determined by the contexts in which it appears.

For example, the words cat and dog often occur near words such as pet, animal, food, owner, and sleep. Because they share contexts, a model trained to predict or use context can learn similar vectors for them.

This idea underlies many embedding methods. Word2Vec learns embeddings by predicting nearby words or predicting a word from nearby words. GloVe learns embeddings from global word co-occurrence statistics. FastText represents words using character n-grams, allowing it to handle rare words and subword structure.

Modern transformer models still use embeddings, but usually at the token or subword level rather than the whole-word level.

Word Embeddings Versus Contextual Embeddings

Classic word embeddings assign one vector to each word type. The word bank has the same vector in both sentences:

I deposited money at the bank.
The boat reached the river bank.

This is a limitation. The word has different meanings in different contexts.

Modern models use contextual embeddings. A contextual embedding depends on the surrounding tokens. In a transformer, the initial token embedding is only the first representation. After self-attention layers, the representation of each token incorporates information from the rest of the sentence.

Thus the token bank can receive different contextual vectors in different sentences.

The distinction is important:

| Representation | Depends on context? | Example |
| --- | --- | --- |
| Word embedding | No | Word2Vec, GloVe |
| Subword embedding | No, initially | BPE token embedding |
| Contextual embedding | Yes | Transformer hidden state |

In PyTorch terms, nn.Embedding usually gives the initial embedding. A recurrent network, convolutional network, or transformer then converts those initial embeddings into contextual representations.
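
A quick illustration of the difference: nn.Embedding looks only at the token ID, so the same ID produces the identical vector regardless of the surrounding sentence. The IDs below are made up, with 42 playing the role of bank:

import torch
import torch.nn as nn

embedding = nn.Embedding(10000, 300)

sentence_1 = torch.tensor([[15, 42, 99]])
sentence_2 = torch.tensor([[42, 7, 3, 64]])

x1 = embedding(sentence_1)
x2 = embedding(sentence_2)

# The initial vector for ID 42 is identical in both sentences.
print(torch.equal(x1[0, 1], x2[0, 0]))  # True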

Embeddings in Language Models

Modern language models usually operate on token embeddings rather than word embeddings. A tokenizer maps text into token IDs. These tokens may be words, subwords, bytes, or byte-pair units.

The model then applies an embedding layer:

$$X = E[T],$$

where $T$ is the matrix of token IDs and $E$ is the embedding matrix.

If

$$T \in \mathbb{N}^{B \times L},$$

then

$$X \in \mathbb{R}^{B \times L \times d}.$$

The transformer then processes $X$ through attention and feedforward layers.

In many language models, the input embedding matrix is also tied to the output projection matrix. This is called weight tying. The same matrix used to embed tokens is reused to map hidden states back to vocabulary logits.

If $h_t \in \mathbb{R}^d$ is the hidden state at position $t$, then the logits over the vocabulary may be computed as

$$z_t = E h_t,$$

or, depending on convention,

$$z_t = h_t E^\top.$$

This reduces the number of parameters and often improves language modeling performance.
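
A minimal sketch of weight tying in PyTorch, assuming the model’s hidden size equals the embedding dimension; the output projection reuses the embedding matrix as its weight:

import torch
import torch.nn as nn

vocab_size, d_model = 50000, 512

embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Tie the weights: both modules now share the same parameter tensor.
lm_head.weight = embedding.weight

hidden = torch.randn(2, 10, d_model)  # stand-in for transformer hidden states
logits = lm_head(hidden)              # computes h_t E^T for every position
print(logits.shape)                   # torch.Size([2, 10, 50000])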

Practical Shape Conventions

In PyTorch NLP models, token IDs are usually stored as integer tensors:

token_ids.dtype == torch.long

The shape is commonly

[batch_size, sequence_length]

After embedding, the shape becomes

[batch_size, sequence_length, embedding_dim]

For example:

B = 8
T = 32
D = 256
V = 50000

embedding = nn.Embedding(V, D)
token_ids = torch.randint(0, V, (B, T))

x = embedding(token_ids)

print(token_ids.shape)  # torch.Size([8, 32])
print(x.shape)          # torch.Size([8, 32, 256])

The input to nn.Embedding must contain integer indices. It should not contain floating-point values.

Common Errors

A common error is passing one-hot vectors into nn.Embedding. The embedding layer expects token IDs, not one-hot encodings.

Correct:

token_ids = torch.tensor([[2, 3, 4]])
x = embedding(token_ids)

Incorrect:

one_hot = torch.tensor([[[0, 0, 1], [0, 1, 0]]])
x = embedding(one_hot)  # silently looks up rows 0 and 1, not the intended tokens

Note that this call does not necessarily raise an error: nn.Embedding treats every entry as a token ID, so the zeros and ones are looked up as tokens 0 and 1 and the result is simply wrong.

Another common error is using the wrong dtype. Token IDs should usually be torch.long:

token_ids = torch.tensor([[2, 3, 4]], dtype=torch.long)

A third common error is forgetting padding masks. If padding embeddings are averaged with real token embeddings, the model may learn length artifacts or receive distorted sentence representations.

Summary

A word embedding maps a discrete word or token to a dense vector. The embedding matrix has one row per vocabulary item and one column per embedding dimension. In PyTorch, nn.Embedding implements this mapping as a learnable lookup table.

Embeddings convert symbolic language into numerical form. They allow neural networks to compare, combine, and transform words using vector operations. Classic embeddings assign one vector per word type. Modern language models begin with token embeddings and then compute contextual embeddings through deeper neural network layers.