Natural language models cannot operate directly on words as strings. A neural network receives numbers, performs arithmetic on those numbers, and produces numerical outputs. Before text can be processed by a model, words must be represented as vectors.
A word embedding is a vector representation of a word. Instead of representing a word as a discrete symbol such as "cat" or "run", we represent it as a point in a continuous vector space:

$$\mathbf{v}_{\text{cat}} \in \mathbb{R}^d$$

The dimension $d$ is called the embedding dimension. Common embedding dimensions include 50, 100, 300, 768, 1024, or larger values in modern language models.
The central idea is simple: words with related meanings should have related vectors. For example, the vectors for cat, dog, and animal should be closer to one another than the vectors for cat, democracy, and voltage.
From Symbols to Vectors
A vocabulary is a finite set of tokens:

$$V = \{w_1, w_2, \dots, w_{|V|}\}$$

Each word or token is assigned an integer ID. For example:
| Token | Integer ID |
|---|---|
| `<pad>` | 0 |
| `<unk>` | 1 |
| the | 2 |
| cat | 3 |
| sat | 4 |
| on | 5 |
| mat | 6 |
A sentence such as

```
the cat sat on the mat
```

can then be represented as a sequence of integer IDs:

```
[2, 3, 4, 5, 2, 6]
```
These integers are indices, not numerical measurements. The fact that cat has ID 3 and sat has ID 4 does not mean that sat is numerically greater than cat. The IDs only tell the model where to look in an embedding table.
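As a concrete sketch, a small vocabulary can be stored as a plain Python dictionary (the names and IDs here simply mirror the table above):

```python
# Map tokens to integer IDs; <unk> handles out-of-vocabulary words.
vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3, "sat": 4, "on": 5, "mat": 6}

def encode(sentence, vocab):
    # Look up each token, falling back to the <unk> ID for unknown tokens.
    return [vocab.get(token, vocab["<unk>"]) for token in sentence.split()]

print(encode("the cat sat on the mat", vocab))  # [2, 3, 4, 5, 2, 6]
```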
One-Hot Encodings
Before embeddings, a simple way to represent a word is by a one-hot vector. If the vocabulary size is $|V|$, then each word is represented by a vector in $\mathbb{R}^{|V|}$ with one entry equal to 1 and all others equal to 0.

For example, if the vocabulary has seven tokens, the word cat with ID 3 may be represented as

$$[0, 0, 0, 1, 0, 0, 0]$$
One-hot vectors are simple, but they have two major problems.
First, they are high-dimensional. A vocabulary may contain 50,000, 100,000, or millions of tokens. This makes one-hot vectors large and inefficient.
Second, they do not encode similarity. The one-hot vectors for cat and dog are just as different as the one-hot vectors for cat and airplane. Every pair of distinct words has the same dot product, namely zero.
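This can be checked directly. A small sketch using `torch.nn.functional.one_hot` (the IDs for "dog" and "airplane" are hypothetical):

```python
import torch
import torch.nn.functional as F

# One-hot vectors for cat (ID 3), dog (ID 5), airplane (ID 6) in a 7-token vocabulary.
one_hot = F.one_hot(torch.tensor([3, 5, 6]), num_classes=7).float()

# Every pair of distinct one-hot vectors has dot product 0,
# so one-hot encoding carries no notion of similarity.
print(one_hot[0] @ one_hot[1])  # tensor(0.)
print(one_hot[0] @ one_hot[2])  # tensor(0.)
```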
Word embeddings solve both problems by mapping words into dense, lower-dimensional vectors.
The Embedding Matrix
An embedding layer is a lookup table. If the vocabulary has size $|V|$ and each embedding has dimension $d$, then the embedding layer stores a matrix

$$E \in \mathbb{R}^{|V| \times d}$$

Each row of $E$ is the embedding vector for one token. If a token has integer ID $i$, then its embedding is the $i$-th row of $E$:

$$\mathbf{v} = E_i$$

For a sequence of token IDs

$$(i_1, i_2, \dots, i_T)$$

the embedding layer returns

$$(E_{i_1}, E_{i_2}, \dots, E_{i_T})$$

The output has shape

$$[T, d]$$

For a batch of $B$ sequences, each of length $T$, the input token IDs have shape

$$[B, T]$$

and the embedding output has shape

$$[B, T, d]$$
In PyTorch:

```python
import torch
import torch.nn as nn

vocab_size = 10000
embedding_dim = 300

embedding = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([
    [2, 3, 4, 5],
    [2, 8, 9, 0],
])

x = embedding(token_ids)

print(token_ids.shape)  # torch.Size([2, 4])
print(x.shape)          # torch.Size([2, 4, 300])
```

The input is a batch of two sequences, each with four tokens. The output is a batch of two sequences, where each token has been replaced by a 300-dimensional vector.
Embedding Layers Are Learnable
The embedding matrix is a model parameter. During training, the entries of the embedding matrix are updated by gradient descent.
Suppose a model predicts a label from a sentence. The embedding vectors influence the prediction. The loss measures how wrong the prediction is. Backpropagation computes gradients not only for the later layers of the model, but also for the embedding vectors used in the input.
If the word cat appears in a training example, then the row of the embedding matrix corresponding to cat receives a gradient update. Over many examples, words that occur in similar contexts tend to acquire similar embeddings.
In PyTorch:

```python
embedding = nn.Embedding(10000, 300)

print(embedding.weight.shape)
print(embedding.weight.requires_grad)
```

Output:

```
torch.Size([10000, 300])
True
```

The tensor `embedding.weight` is the embedding matrix. Since `requires_grad` is `True`, PyTorch will compute gradients for it during training.
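The sparsity of these updates can also be checked directly: after a backward pass, only the rows for tokens that appeared in the batch carry nonzero gradients. A minimal sketch:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)
token_ids = torch.tensor([[2, 3, 3]])  # only tokens 2 and 3 appear

# A trivial "loss" for illustration: the sum of all embedded values.
loss = embedding(token_ids).sum()
loss.backward()

# Only the rows for tokens that appeared receive nonzero gradients.
grad_rows = embedding.weight.grad.abs().sum(dim=1)
print(grad_rows)  # nonzero only at indices 2 and 3
```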
A Simple Text Classifier with Embeddings
Consider a sentence classification model. The input is a sequence of token IDs. The model embeds each token, averages the token embeddings, and applies a linear classifier.
```python
import torch
import torch.nn as nn

class AverageEmbeddingClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_classes, padding_idx=0):
        super().__init__()
        self.padding_idx = padding_idx
        self.embedding = nn.Embedding(
            vocab_size,
            embedding_dim,
            padding_idx=padding_idx,
        )
        self.classifier = nn.Linear(embedding_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: [batch_size, sequence_length]
        x = self.embedding(token_ids)
        # x: [batch_size, sequence_length, embedding_dim]
        mask = token_ids != self.padding_idx
        # mask: [batch_size, sequence_length]
        mask = mask.unsqueeze(-1)
        # mask: [batch_size, sequence_length, 1]
        x = x * mask
        # padding positions become zero vectors
        lengths = mask.sum(dim=1).clamp(min=1)
        # lengths: [batch_size, 1]
        pooled = x.sum(dim=1) / lengths
        # pooled: [batch_size, embedding_dim]
        logits = self.classifier(pooled)
        # logits: [batch_size, num_classes]
        return logits
```

This model is simple, but it shows the basic role of embeddings. The embedding layer converts token IDs into vectors. The rest of the network operates on those vectors.
Padding and padding_idx
Text sequences often have different lengths. Neural networks usually process batches as rectangular tensors, so shorter sequences are padded.
For example:
```
the cat sat
the dog sat on the mat
```

may become

```
the cat sat <pad> <pad> <pad>
the dog sat on the mat
```

The corresponding token ID tensor may be

```python
token_ids = torch.tensor([
    [2, 3, 4, 0, 0, 0],
    [2, 7, 4, 5, 2, 6],
])
```

The padding token should not affect the model's meaning. PyTorch's nn.Embedding supports a padding_idx argument:
```python
embedding = nn.Embedding(
    num_embeddings=10000,
    embedding_dim=300,
    padding_idx=0,
)
```

The row at index 0 is treated as the padding embedding. PyTorch initializes this row to zeros and zeroes its gradient during training, so the padding embedding stays fixed unless modified manually. This is useful because padding tokens are structural artifacts rather than meaningful words.
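Both properties can be verified in a short sketch:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)

# The padding row is initialized to zeros.
print(embedding.weight[0])  # all zeros

# Its gradient is also zero, so gradient descent never moves it.
loss = embedding(torch.tensor([[0, 2, 3]])).sum()
loss.backward()
print(embedding.weight.grad[0])  # tensor([0., 0., 0., 0.])
```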
Pretrained Word Embeddings
Embedding matrices can be learned from scratch, but they can also be initialized from pretrained embeddings. Classic pretrained word embeddings include Word2Vec, GloVe, and FastText.
These embeddings are trained on large text corpora before being used in a downstream model. They capture statistical patterns from language. For example, words that appear in similar contexts tend to have similar vectors.
Pretrained embeddings were especially useful before large transformer models became dominant. They remain useful for small datasets, lightweight models, and settings where training a large language model is impractical.
A pretrained embedding matrix can be loaded into PyTorch:

```python
embedding = nn.Embedding(vocab_size, embedding_dim)

# Stand-in for a real pretrained matrix loaded from disk.
pretrained_weight = torch.randn(vocab_size, embedding_dim)

embedding.weight.data.copy_(pretrained_weight)
```

The embedding matrix may be frozen:

```python
embedding.weight.requires_grad = False
```

or fine-tuned:

```python
embedding.weight.requires_grad = True
```

Freezing preserves the original pretrained vectors. Fine-tuning allows the vectors to adapt to the task.
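PyTorch also provides `nn.Embedding.from_pretrained` as a shortcut, whose `freeze` argument (default `True`) controls the same flag. A sketch with a random stand-in for real pretrained vectors:

```python
import torch
import torch.nn as nn

# Stand-in for a real pretrained matrix (e.g. loaded from GloVe files).
pretrained_weight = torch.randn(10000, 300)

# freeze=True (the default) sets requires_grad=False on the weight.
frozen = nn.Embedding.from_pretrained(pretrained_weight, freeze=True)
tunable = nn.Embedding.from_pretrained(pretrained_weight, freeze=False)

print(frozen.weight.requires_grad)   # False
print(tunable.weight.requires_grad)  # True
```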
Similarity Between Word Embeddings
Once words are represented as vectors, similarity can be measured with vector operations. A common measure is cosine similarity:

$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$$
Cosine similarity measures the angle between two vectors. It is often more useful than Euclidean distance because the direction of an embedding vector usually matters more than its length.
In PyTorch:

```python
import torch.nn.functional as F

x = embedding(torch.tensor([3]))  # cat
y = embedding(torch.tensor([7]))  # dog

similarity = F.cosine_similarity(x, y)
print(similarity)
```

A high cosine similarity means the vectors point in similar directions.
Distributional Meaning
The theoretical motivation for word embeddings comes from the distributional view of language: a word’s meaning is partly determined by the contexts in which it appears.
For example, the words cat and dog often occur near words such as pet, animal, food, owner, and sleep. Because they share contexts, a model trained to predict or use context can learn similar vectors for them.
This idea underlies many embedding methods. Word2Vec learns embeddings by predicting nearby words or predicting a word from nearby words. GloVe learns embeddings from global word co-occurrence statistics. FastText represents words using character n-grams, allowing it to handle rare words and subword structure.
Modern transformer models still use embeddings, but usually at the token or subword level rather than the whole-word level.
Word Embeddings Versus Contextual Embeddings
Classic word embeddings assign one vector to each word type. The word bank has the same vector in both sentences:
```
I deposited money at the bank.
The boat reached the river bank.
```

This is a limitation. The word bank has different meanings in different contexts.
Modern models use contextual embeddings. A contextual embedding depends on the surrounding tokens. In a transformer, the initial token embedding is only the first representation. After self-attention layers, the representation of each token incorporates information from the rest of the sentence.
Thus the token bank can receive different contextual vectors in different sentences.
The distinction is important:
| Representation | Depends on context? | Example |
|---|---|---|
| Word embedding | No | Word2Vec, GloVe |
| Subword embedding | No, initially | BPE token embedding |
| Contextual embedding | Yes | Transformer hidden state |
In PyTorch terms, nn.Embedding usually gives the initial embedding. A recurrent network, convolutional network, or transformer then converts those initial embeddings into contextual representations.
Embeddings in Language Models
Modern language models usually operate on token embeddings rather than word embeddings. A tokenizer maps text into token IDs. These tokens may be words, subwords, bytes, or byte-pair units.
The model then applies an embedding layer:

$$X = E[I]$$

where $I$ is the matrix of token IDs and $E$ is the embedding matrix.

If

$$I \in \mathbb{Z}^{B \times T}, \qquad E \in \mathbb{R}^{|V| \times d},$$

then

$$X = E[I] \in \mathbb{R}^{B \times T \times d}.$$

The transformer then processes $X$ through attention and feedforward layers.
In many language models, the input embedding matrix is also tied to the output projection matrix. This is called weight tying. The same matrix used to embed tokens is reused to map hidden states back to vocabulary logits.
If $\mathbf{h}_t$ is the hidden state at position $t$, then the logits over the vocabulary may be computed as

$$\text{logits}_t = E \mathbf{h}_t$$

or, depending on convention,

$$\text{logits}_t = \mathbf{h}_t E^{\top}$$

This reduces the number of parameters and often improves language modeling performance.
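Weight tying can be sketched in PyTorch by assigning the same parameter tensor to both the embedding and the output projection (the layer names here are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64

embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Tie the weights: both layers now share one parameter tensor.
# nn.Linear stores its weight as [vocab_size, d_model], matching the embedding.
lm_head.weight = embedding.weight

hidden = torch.randn(2, 8, d_model)  # [batch, seq, d_model]
logits = lm_head(hidden)             # computes hidden @ E.T

print(logits.shape)                        # torch.Size([2, 8, 1000])
print(lm_head.weight is embedding.weight)  # True
```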
Practical Shape Conventions
In PyTorch NLP models, token IDs are usually stored as integer tensors:
```python
token_ids.dtype == torch.long
```

The shape is commonly

```
[batch_size, sequence_length]
```

After embedding, the shape becomes

```
[batch_size, sequence_length, embedding_dim]
```

For example:
```python
B = 8
T = 32
D = 256
V = 50000

embedding = nn.Embedding(V, D)
token_ids = torch.randint(0, V, (B, T))
x = embedding(token_ids)

print(token_ids.shape)  # torch.Size([8, 32])
print(x.shape)          # torch.Size([8, 32, 256])
```

The input to nn.Embedding must contain integer indices. It should not contain floating-point values.
Common Errors
A common error is passing one-hot vectors into nn.Embedding. The embedding layer expects token IDs, not one-hot encodings.
Correct:

```python
token_ids = torch.tensor([[2, 3, 4]])
x = embedding(token_ids)
```

Incorrect:

```python
one_hot = torch.tensor([[[0, 0, 1], [0, 1, 0]]])
x = embedding(one_hot)  # the 0s and 1s are treated as token IDs, not as one-hot vectors
```

Another common error is using the wrong dtype. Token IDs should usually be torch.long:

```python
token_ids = torch.tensor([[2, 3, 4]], dtype=torch.long)
```

A third common error is forgetting padding masks. If padding embeddings are averaged with real token embeddings, the model may learn length artifacts or receive distorted sentence representations.
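The masking error can be seen on a single padded batch. In this sketch, the naive mean divides by the full padded length while the masked mean divides by the number of real tokens:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4, padding_idx=0)
token_ids = torch.tensor([[2, 3, 0, 0]])  # two real tokens, two pads
x = embedding(token_ids)

# Naive mean: divides by the full length 4, diluting the real tokens.
naive = x.mean(dim=1)

# Masked mean: divides by the number of real tokens only.
mask = (token_ids != 0).unsqueeze(-1)
masked = (x * mask).sum(dim=1) / mask.sum(dim=1)

print(torch.allclose(naive, masked))  # False: naive divides by 4, masked by 2
```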
Summary
A word embedding maps a discrete word or token to a dense vector. The embedding matrix has one row per vocabulary item and one column per embedding dimension. In PyTorch, nn.Embedding implements this mapping as a learnable lookup table.
Embeddings convert symbolic language into numerical form. They allow neural networks to compare, combine, and transform words using vector operations. Classic embeddings assign one vector per word type. Modern language models begin with token embeddings and then compute contextual embeddings through deeper neural network layers.