Embeddings and Output Projections

After tokenization, text is represented as integer token IDs. A neural language model cannot use these IDs as numerical quantities directly. Token ID 900 is not “larger” or “closer” to token ID 899 in a semantic sense. IDs are just labels.

The model first maps token IDs into vectors. These vectors are called embeddings.

If the vocabulary size is $|V|$ and the embedding dimension is $d$, the embedding table is

$$E \in \mathbb{R}^{|V| \times d}.$$

Each row corresponds to one token. For token $x_t$, the embedding is

$$e_t = E[x_t].$$

The embedding vector becomes the model’s numerical representation of that token.

Token IDs Are Indices

A tokenizer maps text into token IDs:

"deep learning"

might become:

[2764, 6975]

These IDs are indices into the embedding matrix. They are not ordinal numbers.

In PyTorch:

import torch
import torch.nn as nn

vocab_size = 50000
hidden_dim = 768

embedding = nn.Embedding(vocab_size, hidden_dim)

ids = torch.tensor([2764, 6975])
x = embedding(ids)

print(x.shape)  # torch.Size([2, 768])

The tensor x contains two dense vectors, one for each token.

For a batch of sequences:

tokens = torch.randint(0, vocab_size, (32, 128))

x = embedding(tokens)

print(x.shape)  # torch.Size([32, 128, 768])

The model now has a tensor of shape

$$[B, T, d],$$

where $B$ is batch size, $T$ is sequence length, and $d$ is the embedding dimension.

Embeddings as Lookup Tables

An embedding layer is equivalent to selecting rows from a matrix.

If token $i$ is represented as a one-hot vector

$$o_i \in \mathbb{R}^{|V|},$$

then multiplying by the embedding matrix gives

$$o_i^\top E = E[i].$$

The embedding layer performs the same operation more efficiently. It does not construct the one-hot vector. It directly looks up the corresponding row.
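This equivalence is easy to verify directly. The sketch below, with small illustrative sizes, checks that a one-hot row vector multiplied by the embedding matrix selects the same row that `nn.Embedding` looks up:

```python
import torch
import torch.nn as nn

# Small illustrative sizes for the check.
vocab_size, hidden_dim = 10, 4
embedding = nn.Embedding(vocab_size, hidden_dim)

token_id = 3
one_hot = torch.zeros(vocab_size)
one_hot[token_id] = 1.0

via_matmul = one_hot @ embedding.weight          # o_i^T E
via_lookup = embedding(torch.tensor(token_id))   # E[i]

print(torch.allclose(via_matmul, via_lookup))  # True
```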

For a sequence

$$x_{1:T},$$

the embedding layer returns

$$[E[x_1], E[x_2], \ldots, E[x_T]].$$

This produces a sequence of vectors that can be processed by recurrent networks, convolutional sequence models, or transformers.

Learned Geometry

Embedding vectors are learned during training. At initialization, they are usually random. During optimization, the model adjusts them to reduce prediction loss.

Tokens that appear in similar contexts tend to receive similar embeddings. This happens because they need to support similar predictions.

For example, these sentences share structure:

the cat slept
the dog slept
the child slept

The model receives similar gradient signals for “cat,” “dog,” and “child.” Their embeddings may move toward nearby regions of representation space.

This creates a learned geometry. Distances and directions in embedding space can encode linguistic regularities, although the geometry is shaped by the training objective, model architecture, and data distribution.
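Similarity in this geometry is often measured with cosine similarity. The sketch below shows the measurement itself; the token IDs are hypothetical, and a freshly initialized table has random geometry, so meaningful structure only appears after training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embedding = nn.Embedding(50000, 768)

cat_id, dog_id = 2401, 3189  # hypothetical token IDs for illustration
cat_vec = embedding(torch.tensor(cat_id))
dog_vec = embedding(torch.tensor(dog_id))

# Cosine similarity ranges from -1 to 1; for random
# high-dimensional vectors it is typically near 0.
similarity = F.cosine_similarity(cat_vec, dog_vec, dim=0)
print(similarity.item())
```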

Input Embeddings and Contextual Representations

The input embedding for a token is context-independent. The token “bank” receives the same input embedding in both examples:

river bank
bank account

The model must use surrounding tokens to build a contextual representation.

A transformer maps input embeddings

$$e_1, e_2, \ldots, e_T$$

to contextual hidden states

$$h_1, h_2, \ldots, h_T.$$

Unlike input embeddings, contextual states depend on the whole visible context.

Thus $h_{\text{bank}}$ can differ between “river bank” and “bank account.”

This distinction is important:

| Representation | Depends on context? | Example |
| --- | --- | --- |
| Input embedding | No | Same vector for same token ID |
| Hidden state | Yes | Different vector depending on surrounding tokens |

Modern language understanding comes mostly from contextual representations, not from static input embeddings alone.

Positional Information

A transformer without positional information treats a sequence as a set. It can see which tokens are present, but not where they occur.

The sentences

dog bites man
man bites dog

contain the same words but have different meanings.

The model therefore needs position information.

One common approach is learned positional embeddings:

$$P \in \mathbb{R}^{T_{\max} \times d}.$$

For token position $t$, the model adds a position vector:

$$u_t = E[x_t] + P[t].$$

The resulting vector contains both token identity and position information.

In PyTorch:

max_length = 1024

token_embedding = nn.Embedding(vocab_size, hidden_dim)
position_embedding = nn.Embedding(max_length, hidden_dim)

tokens = torch.randint(0, vocab_size, (32, 128))
positions = torch.arange(128).unsqueeze(0)

x = token_embedding(tokens) + position_embedding(positions)

print(x.shape)  # torch.Size([32, 128, 768])

Other position methods include sinusoidal embeddings, rotary position embeddings, ALiBi, relative position bias, and learned relative position representations.
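As one example of a non-learned alternative, sinusoidal embeddings fill even channels with sines and odd channels with cosines at geometrically spaced frequencies, following the construction from the original transformer paper. A minimal sketch:

```python
import math
import torch

def sinusoidal_positions(max_length: int, dim: int) -> torch.Tensor:
    # Positions as a column vector, frequencies spanning dim/2 scales.
    positions = torch.arange(max_length).unsqueeze(1)              # [T, 1]
    freqs = torch.exp(
        torch.arange(0, dim, 2) * (-math.log(10000.0) / dim)
    )                                                              # [dim/2]
    pe = torch.zeros(max_length, dim)
    pe[:, 0::2] = torch.sin(positions * freqs)  # even channels
    pe[:, 1::2] = torch.cos(positions * freqs)  # odd channels
    return pe

pe = sinusoidal_positions(128, 768)
print(pe.shape)  # torch.Size([128, 768])
```

Because the table is computed rather than learned, it adds no parameters and extends to any position.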

Segment, Role, and Modality Embeddings

Some models add extra embeddings that describe token type or source.

BERT-style models use segment embeddings to distinguish sentence A from sentence B:

$$u_t = E[x_t] + P[t] + S[s_t].$$

Chat models may use role tokens or role embeddings to distinguish system, user, assistant, and tool messages.

Multimodal models may use modality embeddings to distinguish text tokens from image patches, audio frames, or other input types.

The general pattern is additive composition:

$$u_t = \text{token information} + \text{position information} + \text{source information}.$$

This gives the model a structured input representation without changing the core transformer block.
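The additive pattern can be sketched for a BERT-style input, where token, position, and segment embeddings share dimension $d$ and are summed elementwise (the split point between sentence A and sentence B below is arbitrary):

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim, max_length, num_segments = 50000, 768, 512, 2

token_embedding = nn.Embedding(vocab_size, hidden_dim)
position_embedding = nn.Embedding(max_length, hidden_dim)
segment_embedding = nn.Embedding(num_segments, hidden_dim)

tokens = torch.randint(0, vocab_size, (8, 64))
positions = torch.arange(64).unsqueeze(0)        # [1, 64], broadcast over batch
segments = torch.zeros(8, 64, dtype=torch.long)  # first half: sentence A
segments[:, 32:] = 1                             # second half: sentence B

u = token_embedding(tokens) + position_embedding(positions) + segment_embedding(segments)
print(u.shape)  # torch.Size([8, 64, 768])
```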

Output Projection

At the end of a language model, each hidden state must be converted into scores over the vocabulary.

Suppose the final hidden state at position $t$ is

$$h_t \in \mathbb{R}^{d}.$$

The output projection computes logits:

$$z_t = h_t W_{\text{out}} + b_{\text{out}},$$

where

$$W_{\text{out}} \in \mathbb{R}^{d \times |V|}$$

and

$$b_{\text{out}} \in \mathbb{R}^{|V|}.$$

The result is

$$z_t \in \mathbb{R}^{|V|}.$$

Each entry is an unnormalized score for one possible next token.

For a batch:

$$H \in \mathbb{R}^{B \times T \times d}$$

produces

$$Z \in \mathbb{R}^{B \times T \times |V|}.$$

In PyTorch:

output = nn.Linear(hidden_dim, vocab_size)

hidden = torch.randn(32, 128, hidden_dim)
logits = output(hidden)

print(logits.shape)  # torch.Size([32, 128, 50000])

The logits are then passed to cross-entropy loss during training or to a decoding method during generation.

Softmax and Token Probabilities

The output projection produces logits, not probabilities. To obtain a probability distribution over the vocabulary, we apply softmax:

$$p(x_{t+1} = i \mid x_{1:t}) = \frac{\exp(z_{t,i})}{\sum_{j=1}^{|V|} \exp(z_{t,j})}.$$

The token with the largest logit receives the largest probability, but probability values also depend on the full set of logits.
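Applying softmax over the vocabulary axis can be checked numerically: every entry becomes nonnegative and each position's distribution sums to 1. A reduced vocabulary of 1,000 is used here to keep the example light:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 128, 1000)     # [B, T, |V|] with a small vocabulary
probs = F.softmax(logits, dim=-1)       # normalize over the vocabulary axis

# Each [b, t] position now holds a probability distribution.
print(torch.allclose(probs.sum(dim=-1), torch.ones(32, 128), atol=1e-4))  # True
print(probs.argmax(dim=-1).shape)  # torch.Size([32, 128]) -- greedy next-token IDs
```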

During training, cross-entropy loss combines log-softmax and negative log-likelihood:

import torch.nn.functional as F

loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    targets.reshape(-1),
)

During generation, probabilities may be modified by temperature, top-k filtering, top-p filtering, or other decoding rules before sampling.
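A minimal sketch of the two most common adjustments, combined in an illustrative helper (`sample_next_token` is not a library function): dividing logits by a temperature sharpens or flattens the distribution, and top-k filtering discards everything outside the k highest-scoring tokens before sampling.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> int:
    # Temperature < 1 sharpens, > 1 flattens the distribution.
    logits = logits / temperature
    # Keep only the top-k logits; set the rest to -inf so softmax zeroes them.
    top_values, top_indices = torch.topk(logits, top_k)
    filtered = torch.full_like(logits, float("-inf"))
    filtered[top_indices] = top_values
    probs = F.softmax(filtered, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.randn(50000)        # logits for a single position
next_id = sample_next_token(logits)
print(0 <= next_id < 50000)        # True
```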

Weight Tying

Many language models use weight tying between the input embedding matrix and the output projection.

The input embedding matrix has shape

$$E \in \mathbb{R}^{|V| \times d}.$$

The output projection weight has shape

$$W_{\text{out}} \in \mathbb{R}^{d \times |V|}.$$

With weight tying, we set

$$W_{\text{out}} = E^\top.$$

Then the same parameters are used for reading tokens and scoring output tokens.

In PyTorch:

class TiedLanguageModel(nn.Module):
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.backbone = nn.GRU(
            hidden_dim,
            hidden_dim,
            batch_first=True,
        )
        self.output_bias = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, tokens):
        x = self.embedding(tokens)
        h, _ = self.backbone(x)

        logits = h @ self.embedding.weight.T + self.output_bias
        return logits

Weight tying reduces parameter count and often improves language modeling quality because input and output token spaces share structure.

Parameter Cost

Embeddings and output projections can contain many parameters.

Suppose

$$|V| = 50{,}000, \quad d = 4096.$$

The embedding matrix contains

$$50{,}000 \times 4096 = 204{,}800{,}000$$

parameters.

If the output projection is untied, it adds another 204.8 million weights, plus 50,000 biases.

This is significant, especially for smaller models.

| Vocabulary | Hidden dimension | Embedding parameters |
| --- | --- | --- |
| 32,000 | 768 | 24.6M |
| 50,000 | 768 | 38.4M |
| 50,000 | 4096 | 204.8M |
| 100,000 | 4096 | 409.6M |

A larger vocabulary can reduce sequence length but increases embedding and output-layer cost.
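These counts can be reproduced directly from the layer shapes. The sketch below constructs the layers on the meta device (available in recent PyTorch), which records shapes without allocating the hundreds of megabytes of actual weights:

```python
import torch.nn as nn

vocab_size, hidden_dim = 50000, 4096

# device="meta" builds shape-only parameters, allocating no storage.
embedding = nn.Embedding(vocab_size, hidden_dim, device="meta")
output = nn.Linear(hidden_dim, vocab_size, device="meta")  # untied projection

emb_params = embedding.weight.numel()
out_params = sum(p.numel() for p in output.parameters())

print(emb_params)  # 204800000
print(out_params)  # 204850000 = 204,800,000 weights + 50,000 biases
```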

Embedding Initialization

Embeddings are usually initialized randomly. Common initializations use small Gaussian or uniform distributions.

Example:

embedding = nn.Embedding(vocab_size, hidden_dim)
nn.init.normal_(embedding.weight, mean=0.0, std=0.02)

The exact scale matters. Very large initial embeddings can destabilize early training. Very small embeddings can reduce useful signal.

Some models initialize embeddings with pretrained vectors, such as word2vec or GloVe. This was common before large transformer pretraining. Modern language models usually learn embeddings from scratch as part of full model training.

Freezing and Fine-Tuning Embeddings

In some transfer learning workflows, embeddings may be frozen. This means their parameters are not updated.

embedding.weight.requires_grad = False

Freezing reduces trainable parameters and can help when labeled data is small. It can also hurt performance if the downstream task uses vocabulary in a different way from pretraining.

Fine-tuning allows embeddings to adapt:

embedding.weight.requires_grad = True

Large pretrained transformers usually fine-tune all parameters or use parameter-efficient methods such as adapters or LoRA. In either case, embeddings remain an important part of the model interface.
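What freezing changes can be made concrete by counting trainable parameters, since an optimizer only updates parameters with `requires_grad=True`. Small illustrative sizes are used here:

```python
import torch.nn as nn

embedding = nn.Embedding(1000, 64)
head = nn.Linear(64, 1000)

# Freeze the embedding table; the head stays trainable.
embedding.weight.requires_grad = False

trainable = sum(
    p.numel()
    for p in list(embedding.parameters()) + list(head.parameters())
    if p.requires_grad
)
print(trainable)  # 65000 = 64*1000 head weights + 1000 biases
```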

Padding Embeddings

Padding tokens are used to make sequences in a batch the same length.

A padding token should not carry semantic content. PyTorch supports a padding_idx argument:

embedding = nn.Embedding(
    num_embeddings=vocab_size,
    embedding_dim=hidden_dim,
    padding_idx=0,
)

The embedding at padding_idx is initialized to zeros and does not receive gradient updates.

However, attention masks and loss masks are still needed. A zero embedding alone does not guarantee that padding has no effect.
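The `padding_idx` behavior can be verified directly: the padding row starts at zero, and its gradient is zeroed so the row never moves during training.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(100, 8, padding_idx=0)
print(embedding.weight[0].abs().sum().item())  # 0.0 -- padding row is zeros

ids = torch.tensor([0, 5, 7])  # includes a padding token
embedding(ids).sum().backward()
print(embedding.weight.grad[0].abs().sum().item())  # 0.0 -- no update for padding
```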

For language modeling, labels at padding positions are often ignored:

labels[attention_mask == 0] = -100

Then cross-entropy excludes those positions.

Embeddings for Non-Text Inputs

The embedding idea extends beyond text.

For images, a vision transformer splits an image into patches and projects each patch into a vector:

$$e_i = W p_i + b.$$

For audio, frames or learned audio codes are embedded into vectors.

For graphs, node IDs, edge types, or node features can be embedded.

For recommendation systems, users, items, categories, and actions are commonly represented by embeddings.

The general principle is the same: map discrete or structured inputs into continuous vector spaces that neural networks can process.
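The image case can be sketched with the standard ViT-style trick: a `Conv2d` whose kernel size and stride both equal the patch size applies the projection $W p_i + b$ to every non-overlapping patch at once, and flattening the spatial grid yields a sequence of patch embeddings:

```python
import torch
import torch.nn as nn

patch_size, hidden_dim = 16, 768
# Kernel = stride = patch size: one projection per non-overlapping patch.
patch_embed = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)

images = torch.randn(8, 3, 224, 224)        # [B, C, H, W]
patches = patch_embed(images)               # [8, 768, 14, 14]
tokens = patches.flatten(2).transpose(1, 2)  # [8, 196, 768]: sequence of patch vectors
print(tokens.shape)  # torch.Size([8, 196, 768])
```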

Practical Shape Checks

Embedding and output projection code should always be checked by shape.

For a causal language model:

| Tensor | Shape |
| --- | --- |
| Token IDs | [B, T] |
| Token embeddings | [B, T, d] |
| Hidden states | [B, T, d] |
| Logits | [B, T, V] |
| Targets | [B, T] |
| Loss | scalar |

Minimal check:

B = 4
T = 16
V = 1000
d = 128

tokens = torch.randint(0, V, (B, T))
targets = torch.randint(0, V, (B, T))

embedding = nn.Embedding(V, d)
output = nn.Linear(d, V)

h = embedding(tokens)
logits = output(h)

loss = F.cross_entropy(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)

print(h.shape)
print(logits.shape)
print(loss.shape)

Expected output:

torch.Size([4, 16, 128])
torch.Size([4, 16, 1000])
torch.Size([])

The scalar loss can then be backpropagated through both the output projection and the embedding table.

Summary

Embeddings map token IDs into continuous vectors. They are the first learned layer of most language models. Output projections map hidden states back into vocabulary logits. Together, they form the interface between discrete text and differentiable neural computation.

Input embeddings provide context-independent token representations. Transformer layers turn them into contextual hidden states. The output projection converts those hidden states into next-token or masked-token predictions.

Tokenizer vocabulary size determines the shape and cost of both the embedding table and the output projection. In large language models, these layers can contain hundreds of millions of parameters, so their design is a central architectural choice.