After tokenization, text is represented as integer token IDs. A neural language model cannot use these IDs as numerical quantities directly. Token ID 900 is not “larger” or “closer” to token ID 899 in a semantic sense. IDs are just labels.
The model first maps token IDs into vectors. These vectors are called embeddings.
If the vocabulary size is $V$ and the embedding dimension is $d$, the embedding table is a matrix $E \in \mathbb{R}^{V \times d}$.
Each row corresponds to one token. For token ID $i$, the embedding is the $i$-th row, $e_i = E[i] \in \mathbb{R}^d$.
The embedding vector becomes the model’s numerical representation of that token.
Token IDs Are Indices
A tokenizer maps text into token IDs. For example,
"deep learning"
might become:
[2764, 6975]
These IDs are indices into the embedding matrix. They are not ordinal numbers.
In PyTorch:
import torch
import torch.nn as nn
vocab_size = 50000
hidden_dim = 768
embedding = nn.Embedding(vocab_size, hidden_dim)
ids = torch.tensor([2764, 6975])
x = embedding(ids)
print(x.shape) # torch.Size([2, 768])
The tensor x contains two dense vectors, one for each token.
For a batch of sequences:
tokens = torch.randint(0, vocab_size, (32, 128))
x = embedding(tokens)
print(x.shape) # torch.Size([32, 128, 768])
The model now has a tensor of shape $[B, T, d]$, where $B$ is the batch size, $T$ is the sequence length, and $d$ is the embedding dimension.
Embeddings as Lookup Tables
An embedding layer is equivalent to selecting rows from a matrix.
If token $i$ is represented as a one-hot vector $o_i \in \{0, 1\}^V$, with a single 1 at index $i$,
then multiplying by the embedding matrix gives $o_i^\top E = E[i] = e_i$.
The embedding layer performs the same operation more efficiently. It does not construct the one-hot vector. It directly looks up the corresponding row.
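This equivalence is easy to check numerically. A minimal sketch, with arbitrary small sizes:

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim = 10, 4
embedding = nn.Embedding(vocab_size, hidden_dim)

token_id = 7
# Direct lookup: select row 7 of the embedding matrix.
lookup = embedding(torch.tensor(token_id))

# Equivalent one-hot matrix multiplication.
one_hot = torch.zeros(vocab_size)
one_hot[token_id] = 1.0
matmul = one_hot @ embedding.weight

# Both paths produce the same vector.
print(torch.allclose(lookup, matmul))  # True
```

The lookup avoids materializing a $V$-dimensional one-hot vector per token, which matters when $V$ is in the tens of thousands.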
For a sequence of token IDs $(t_1, t_2, \dots, t_T)$,
the embedding layer returns the sequence $(e_{t_1}, e_{t_2}, \dots, e_{t_T})$.
This produces a sequence of vectors that can be processed by recurrent networks, convolutional sequence models, or transformers.
Learned Geometry
Embedding vectors are learned during training. At initialization, they are usually random. During optimization, the model adjusts them to reduce prediction loss.
Tokens that appear in similar contexts tend to receive similar embeddings. This happens because they need to support similar predictions.
For example, these sentences share structure:
the cat slept
the dog slept
the child slept
The model receives similar gradient signals for “cat,” “dog,” and “child.” Their embeddings may move toward nearby regions of representation space.
This creates a learned geometry. Distances and directions in embedding space can encode linguistic regularities, although the geometry is shaped by the training objective, model architecture, and data distribution.
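Similarity in this geometry is commonly measured with cosine similarity. A minimal sketch (the embedding here is untrained, so the value itself carries no meaning; only the measurement pattern matters):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embedding = nn.Embedding(100, 16)

# Hypothetical token IDs standing in for "cat" and "dog" in a toy vocabulary.
cat = embedding(torch.tensor(3))
dog = embedding(torch.tensor(4))

# Cosine similarity compares embedding directions, ignoring magnitude.
sim = F.cosine_similarity(cat, dog, dim=0)
print(sim.item())
```

After training on real text, tokens that occur in similar contexts typically show higher cosine similarity than unrelated tokens.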
Input Embeddings and Contextual Representations
The input embedding for a token is context-independent. The token “bank” receives the same input embedding in both examples:
river bank
bank account
The model must use surrounding tokens to build a contextual representation.
A transformer maps input embeddings $x_1, \dots, x_T$ to contextual hidden states $h_1, \dots, h_T$. Unlike input embeddings, contextual states depend on the whole visible context. Thus the hidden state $h_t$ for the token “bank” can differ between “river bank” and “bank account.”
This distinction is important:
| Representation | Depends on context? | Example |
|---|---|---|
| Input embedding | No | Same vector for same token ID |
| Hidden state | Yes | Different vector depending on surrounding tokens |
Modern language understanding comes mostly from contextual representations, not from static input embeddings alone.
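This difference can be demonstrated directly. The sketch below uses a single self-attention layer as a stand-in for a full transformer, with hypothetical token IDs; ID 7 plays the role of “bank” in two contexts:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d = 50, 32
embedding = nn.Embedding(vocab_size, d)
# One self-attention layer stands in for the transformer; eval() disables dropout.
layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True).eval()

# Hypothetical IDs: 7 is "bank", 11 is "river", 23 is "account".
river_bank = torch.tensor([[11, 7]])
bank_account = torch.tensor([[7, 23]])

x1, x2 = embedding(river_bank), embedding(bank_account)
# The input embedding for token 7 is identical in both sequences.
same_input = torch.equal(x1[0, 1], x2[0, 0])

with torch.no_grad():
    h1, h2 = layer(x1), layer(x2)
# The hidden states for token 7 differ, because they depend on the neighbors.
different_hidden = not torch.allclose(h1[0, 1], h2[0, 0])
print(same_input, different_hidden)  # True True
```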
Positional Information
A transformer without positional information treats a sequence as a set. It can see which tokens are present, but not where they occur.
The sentences
dog bites man
man bites dog
contain the same words but have different meanings.
The model therefore needs position information.
One common approach is learned positional embeddings:
For the token at position $t$, the model adds a position vector $p_t$ to the token embedding $e_t$:
$x_t = e_t + p_t$
The resulting vector contains both token identity and position information.
In PyTorch:
max_length = 1024
token_embedding = nn.Embedding(vocab_size, hidden_dim)
position_embedding = nn.Embedding(max_length, hidden_dim)
tokens = torch.randint(0, vocab_size, (32, 128))
positions = torch.arange(128).unsqueeze(0)
x = token_embedding(tokens) + position_embedding(positions)
print(x.shape) # torch.Size([32, 128, 768])
Other position methods include sinusoidal embeddings, rotary position embeddings, ALiBi, relative position bias, and learned relative position representations.
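As one example, the fixed sinusoidal scheme from the original transformer paper can be sketched as follows:

```python
import math
import torch

def sinusoidal_positions(max_length, d):
    """Fixed sinusoidal position encodings; d must be even."""
    positions = torch.arange(max_length).unsqueeze(1).float()
    # Frequencies decay geometrically across the embedding dimensions.
    div_term = torch.exp(torch.arange(0, d, 2).float() * (-math.log(10000.0) / d))
    pe = torch.zeros(max_length, d)
    pe[:, 0::2] = torch.sin(positions * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(positions * div_term)  # odd dimensions
    return pe

pe = sinusoidal_positions(128, 768)
print(pe.shape)  # torch.Size([128, 768])
```

Because these encodings are fixed functions of position, they add no trainable parameters and can extrapolate past the lengths seen in training, although extrapolation quality varies in practice.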
Segment, Role, and Modality Embeddings
Some models add extra embeddings that describe token type or source.
BERT-style models use segment embeddings to distinguish sentence A from sentence B.
Chat models may use role tokens or role embeddings to distinguish system, user, assistant, and tool messages.
Multimodal models may use modality embeddings to distinguish text tokens from image patches, audio frames, or other input types.
The general pattern is additive composition: $x_t = e_t + p_t + s_t$, where $s_t$ is the segment, role, or modality embedding.
This gives the model a structured input representation without changing the core transformer block.
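A sketch of additive composition with a BERT-style segment embedding (sizes are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim, max_length, num_segments = 50000, 768, 1024, 2

token_embedding = nn.Embedding(vocab_size, hidden_dim)
position_embedding = nn.Embedding(max_length, hidden_dim)
segment_embedding = nn.Embedding(num_segments, hidden_dim)

tokens = torch.randint(0, vocab_size, (32, 128))
positions = torch.arange(128).unsqueeze(0)
# 0 marks sentence A, 1 marks sentence B (first 64 vs last 64 positions here).
segments = torch.cat([torch.zeros(1, 64), torch.ones(1, 64)], dim=1).long()

# Additive composition: token identity + position + segment.
x = token_embedding(tokens) + position_embedding(positions) + segment_embedding(segments)
print(x.shape)  # torch.Size([32, 128, 768])
```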
Output Projection
At the end of a language model, each hidden state must be converted into scores over the vocabulary.
Suppose the final hidden state at position $t$ is $h_t \in \mathbb{R}^d$.
The output projection computes logits $z_t = W h_t + b$, where $W \in \mathbb{R}^{V \times d}$ and $b \in \mathbb{R}^V$.
The result is $z_t \in \mathbb{R}^V$.
Each entry is an unnormalized score for one possible next token.
For a batch of hidden states $H \in \mathbb{R}^{B \times T \times d}$, the projection produces logits $Z \in \mathbb{R}^{B \times T \times V}$.
In PyTorch:
output = nn.Linear(hidden_dim, vocab_size)
hidden = torch.randn(32, 128, hidden_dim)
logits = output(hidden)
print(logits.shape) # torch.Size([32, 128, 50000])
The logits are then passed to cross-entropy loss during training or to a decoding method during generation.
Softmax and Token Probabilities
The output projection produces logits, not probabilities. To obtain a probability distribution over the vocabulary, we apply softmax: $p_{t,i} = \exp(z_{t,i}) / \sum_{j=1}^{V} \exp(z_{t,j})$.
The token with the largest logit receives the largest probability, but probability values also depend on the full set of logits.
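Two useful consequences: softmax preserves the ordering of the logits, and it is invariant to adding a constant to all of them. A small sketch:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
probs = F.softmax(logits, dim=-1)

# The largest logit receives the largest probability.
print(probs.argmax().item())  # 0

# Softmax is shift-invariant: adding a constant to every logit changes nothing.
shifted = F.softmax(logits + 10.0, dim=-1)
print(torch.allclose(probs, shifted, atol=1e-6))  # True
```

Shift invariance is also what makes the standard numerically stable implementation possible: subtracting the maximum logit before exponentiating leaves the probabilities unchanged.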
During training, cross-entropy loss combines log-softmax and negative log-likelihood:
import torch.nn.functional as F
loss = F.cross_entropy(
logits.reshape(-1, vocab_size),
targets.reshape(-1),
)
During generation, probabilities may be modified by temperature, top-k filtering, top-p filtering, or other decoding rules before sampling.
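A minimal sketch of temperature scaling followed by top-k filtering (the vocabulary size and hyperparameters here are arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1000)  # hypothetical next-token logits over a 1000-token vocab

# Temperature rescales logits before softmax; values < 1 sharpen the distribution.
temperature = 0.8
scaled = logits / temperature

# Top-k filtering: keep the k largest logits, mask the rest to -inf.
k = 50
topk_values, _ = torch.topk(scaled, k)
cutoff = topk_values[-1]  # the k-th largest logit
filtered = torch.where(scaled >= cutoff, scaled, torch.full_like(scaled, float("-inf")))

probs = F.softmax(filtered, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
print((probs > 0).sum().item())  # at most k tokens keep nonzero probability
```

Top-p (nucleus) filtering follows the same masking pattern but keeps the smallest set of tokens whose cumulative probability exceeds p.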
Weight Tying
Many language models use weight tying between the input embedding matrix and the output projection.
The input embedding matrix $E$ has shape $V \times d$.
The output projection weight $W$ also has shape $V \times d$.
With weight tying, we set $W = E$.
Then the same parameters are used for reading tokens and scoring output tokens.
In PyTorch:
class TiedLanguageModel(nn.Module):
def __init__(self, vocab_size, hidden_dim):
super().__init__()
self.embedding = nn.Embedding(vocab_size, hidden_dim)
self.backbone = nn.GRU(
hidden_dim,
hidden_dim,
batch_first=True,
)
self.output_bias = nn.Parameter(torch.zeros(vocab_size))
def forward(self, tokens):
x = self.embedding(tokens)
h, _ = self.backbone(x)
logits = h @ self.embedding.weight.T + self.output_bias
        return logits
Weight tying reduces parameter count and often improves language modeling quality because input and output token spaces share structure.
Parameter Cost
Embeddings and output projections can contain many parameters.
Suppose the vocabulary size is $V = 50{,}000$ and the hidden dimension is $d = 4096$. The embedding matrix contains $V \times d = 204{,}800{,}000$ parameters, about 204.8 million.
If the output projection is untied, it adds another 204.8 million weights, plus 50,000 biases.
This is significant, especially for smaller models.
| Vocabulary | Hidden dimension | Embedding parameters |
|---|---|---|
| 32,000 | 768 | 24.6M |
| 50,000 | 768 | 38.4M |
| 50,000 | 4096 | 204.8M |
| 100,000 | 4096 | 409.6M |
A larger vocabulary can reduce sequence length but increases embedding and output-layer cost.
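These counts can be verified directly from the layer parameters:

```python
import torch.nn as nn

vocab_size, hidden_dim = 50000, 4096

embedding = nn.Embedding(vocab_size, hidden_dim)
output = nn.Linear(hidden_dim, vocab_size)

emb_params = sum(p.numel() for p in embedding.parameters())
out_params = sum(p.numel() for p in output.parameters())

print(emb_params)               # 204800000: the 50,000 x 4096 table
print(out_params - emb_params)  # 50000: the extra bias of the untied projection
```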
Embedding Initialization
Embeddings are usually initialized randomly. Common initializations use small Gaussian or uniform distributions.
Example:
embedding = nn.Embedding(vocab_size, hidden_dim)
nn.init.normal_(embedding.weight, mean=0.0, std=0.02)
The exact scale matters. Very large initial embeddings can destabilize early training. Very small embeddings can reduce useful signal.
Some models initialize embeddings with pretrained vectors, such as word2vec or GloVe. This was common before large transformer pretraining. Modern language models usually learn embeddings from scratch as part of full model training.
Freezing and Fine-Tuning Embeddings
In some transfer learning workflows, embeddings may be frozen. This means their parameters are not updated.
embedding.weight.requires_grad = False
Freezing reduces the number of trainable parameters and can help when labeled data is scarce. It can also hurt performance if the downstream task uses vocabulary differently from pretraining.
Fine-tuning allows embeddings to adapt:
embedding.weight.requires_grad = True
Large pretrained transformers usually fine-tune all parameters or use parameter-efficient methods such as adapters or LoRA. In either case, embeddings remain an important part of the model interface.
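When some parameters are frozen, a common pattern is to hand only the trainable ones to the optimizer. A minimal sketch with a toy model:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000))
model[0].weight.requires_grad = False  # freeze the embedding table

# Pass only trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

n_trainable = sum(p.numel() for p in trainable)
print(n_trainable)  # 65000: the Linear weight (64*1000) plus its bias (1000)
```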
Padding Embeddings
Padding tokens are used to make sequences in a batch the same length.
A padding token should not carry semantic content. PyTorch supports a padding_idx argument:
embedding = nn.Embedding(
num_embeddings=vocab_size,
embedding_dim=hidden_dim,
padding_idx=0,
)
The embedding at padding_idx is initialized to zeros and does not receive gradient updates.
However, attention masks and loss masks are still needed. A zero embedding alone does not guarantee that padding has no effect.
For language modeling, labels at padding positions are often ignored:
labels[attention_mask == 0] = -100
Then cross-entropy excludes those positions.
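PyTorch's cross_entropy uses ignore_index=-100 by default, so those positions simply drop out of the loss. A small check:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V = 10
logits = torch.randn(4, V)
targets = torch.tensor([3, 7, -100, -100])  # last two positions are padding

# cross_entropy skips positions whose target equals ignore_index (-100 by default).
masked_loss = F.cross_entropy(logits, targets)
unpadded_loss = F.cross_entropy(logits[:2], targets[:2])
print(torch.allclose(masked_loss, unpadded_loss))  # True
```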
Embeddings for Non-Text Inputs
The embedding idea extends beyond text.
For images, a vision transformer splits an image into patches and projects each patch into a vector.
For audio, frames or learned audio codes are embedded into vectors.
For graphs, node IDs, edge types, or node features can be embedded.
For recommendation systems, users, items, categories, and actions are commonly represented by embeddings.
The general principle is the same: map discrete or structured inputs into continuous vector spaces that neural networks can process.
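A patch-embedding sketch for the vision case, assuming 224×224 RGB images and 16×16 patches; a strided convolution is a common way to extract and project the patches in one step:

```python
import torch
import torch.nn as nn

patch_size, hidden_dim = 16, 768
# Each 16x16x3 patch is projected to a hidden_dim vector; stride = kernel size
# means the patches do not overlap.
patch_embed = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)

images = torch.randn(8, 3, 224, 224)
x = patch_embed(images)           # [8, 768, 14, 14]: a 14x14 grid of patch vectors
x = x.flatten(2).transpose(1, 2)  # [8, 196, 768]: a sequence of 196 patch tokens
print(x.shape)
```

From here the patch sequence is processed exactly like a token-embedding sequence, typically with positional information added.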
Practical Shape Checks
Embedding and output projection code should always be checked by shape.
For a causal language model:
| Tensor | Shape |
|---|---|
| Token IDs | [B, T] |
| Token embeddings | [B, T, d] |
| Hidden states | [B, T, d] |
| Logits | [B, T, V] |
| Targets | [B, T] |
| Loss | scalar |
Minimal check:
B = 4
T = 16
V = 1000
d = 128
tokens = torch.randint(0, V, (B, T))
targets = torch.randint(0, V, (B, T))
embedding = nn.Embedding(V, d)
output = nn.Linear(d, V)
h = embedding(tokens)
logits = output(h)
loss = F.cross_entropy(
logits.reshape(B * T, V),
targets.reshape(B * T),
)
print(h.shape)
print(logits.shape)
print(loss.shape)
Expected output:
torch.Size([4, 16, 128])
torch.Size([4, 16, 1000])
torch.Size([])
The scalar loss can then be backpropagated through both the output projection and the embedding table.
Summary
Embeddings map token IDs into continuous vectors. They are the first learned layer of most language models. Output projections map hidden states back into vocabulary logits. Together, they form the interface between discrete text and differentiable neural computation.
Input embeddings provide context-independent token representations. Transformer layers turn them into contextual hidden states. The output projection converts those hidden states into next-token or masked-token predictions.
Tokenizer vocabulary size determines the shape and cost of both the embedding table and the output projection. In large language models, these layers can contain hundreds of millions of parameters, so their design is a central architectural choice.