# Embeddings and Output Projections

After tokenization, text is represented as integer token IDs. A neural language model cannot use these IDs as numerical quantities directly. Token ID 900 is not “larger” or “closer” to token ID 899 in a semantic sense. IDs are just labels.

The model first maps token IDs into vectors. These vectors are called embeddings.

If the vocabulary size is $|V|$ and the embedding dimension is $d$, the embedding table is

$$
E \in \mathbb{R}^{|V| \times d}.
$$

Each row corresponds to one token. For token $x_t$, the embedding is

$$
e_t = E[x_t].
$$

The embedding vector becomes the model’s numerical representation of that token.

### Token IDs Are Indices

A tokenizer maps text into token IDs:

```text
"deep learning"
```

might become:

```text
[2764, 6975]
```

These IDs are indices into the embedding matrix. They are not ordinal numbers.

In PyTorch:

```python
import torch
import torch.nn as nn

vocab_size = 50000
hidden_dim = 768

embedding = nn.Embedding(vocab_size, hidden_dim)

ids = torch.tensor([2764, 6975])
x = embedding(ids)

print(x.shape)  # torch.Size([2, 768])
```

The tensor `x` contains two dense vectors, one for each token.

For a batch of sequences:

```python
tokens = torch.randint(0, vocab_size, (32, 128))

x = embedding(tokens)

print(x.shape)  # torch.Size([32, 128, 768])
```

The model now has a tensor of shape

$$
[B,T,d],
$$

where $B$ is batch size, $T$ is sequence length, and $d$ is the embedding dimension.

### Embeddings as Lookup Tables

An embedding layer is equivalent to selecting rows from a matrix.

If token $i$ is represented as a one-hot vector

$$
o_i \in \mathbb{R}^{|V|},
$$

then multiplying by the embedding matrix gives

$$
o_i^\top E = E[i].
$$

The embedding layer performs the same operation more efficiently. It does not construct the one-hot vector. It directly looks up the corresponding row.
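
As a quick check, the one-hot multiplication and the direct lookup produce the same row. A minimal sketch, reusing the `embedding` and `vocab_size` defined above:

```python
import torch.nn.functional as F

token_id = torch.tensor(2764)

# Explicit one-hot multiplication: [|V|] @ [|V|, d] -> [d]
one_hot = F.one_hot(token_id, num_classes=vocab_size).float()
row_via_matmul = one_hot @ embedding.weight

# Direct row lookup
row_via_lookup = embedding(token_id)

print(torch.allclose(row_via_matmul, row_via_lookup))  # True
```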

For a sequence

$$
x_{1:T},
$$

the embedding layer returns

$$
[E[x_1], E[x_2], \ldots, E[x_T]].
$$

This produces a sequence of vectors that can be processed by recurrent networks, convolutional sequence models, or transformers.

### Learned Geometry

Embedding vectors are learned during training. At initialization, they are usually random. During optimization, the model adjusts them to reduce prediction loss.

Tokens that appear in similar contexts tend to receive similar embeddings. This happens because they need to support similar predictions.

For example, these sentences share structure:

```text
the cat slept
the dog slept
the child slept
```

The model receives similar gradient signals for “cat,” “dog,” and “child,” so their embeddings tend to move toward nearby regions of the embedding space.

This creates a learned geometry. Distances and directions in embedding space can encode linguistic regularities, although the geometry is shaped by the training objective, model architecture, and data distribution.
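
One way to probe this geometry is cosine similarity between embedding rows. A minimal sketch, with hypothetical token IDs standing in for “cat,” “dog,” and “car”:

```python
# Hypothetical token IDs; real IDs depend on the tokenizer.
cat_id, dog_id, car_id = 1234, 5678, 9012

cat_vec = embedding.weight[cat_id]
dog_vec = embedding.weight[dog_id]
car_vec = embedding.weight[car_id]

# After training, related tokens often score higher than unrelated ones.
print(torch.cosine_similarity(cat_vec, dog_vec, dim=0))
print(torch.cosine_similarity(cat_vec, car_vec, dim=0))
```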

### Input Embeddings and Contextual Representations

The input embedding for a token is context-independent. The token “bank” receives the same input embedding in both examples:

```text
river bank
bank account
```

The model must use surrounding tokens to build a contextual representation.
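
This context-independence can be checked directly. A minimal sketch, with hypothetical token IDs, showing that the input embedding row for “bank” is identical regardless of its neighbors:

```python
bank_id = 4321  # hypothetical token ID for "bank"

river_bank = torch.tensor([[8100, bank_id]])    # hypothetical "river bank"
bank_account = torch.tensor([[bank_id, 9023]])  # hypothetical "bank account"

e_river = embedding(river_bank)[0, 1]
e_account = embedding(bank_account)[0, 0]

print(torch.allclose(e_river, e_account))  # True: same input embedding
```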

A transformer maps input embeddings

$$
e_1,e_2,\ldots,e_T
$$

to contextual hidden states

$$
h_1,h_2,\ldots,h_T.
$$

Unlike input embeddings, contextual states depend on the whole visible context.

Thus

$$
h_{\text{bank}}
$$

can differ between “river bank” and “bank account.”

This distinction is important:

| Representation | Depends on context? | Example |
|---|---:|---|
| Input embedding | No | Same vector for same token ID |
| Hidden state | Yes | Different vector depending on surrounding tokens |

Modern language understanding comes mostly from contextual representations, not from static input embeddings alone.

### Positional Information

A transformer without positional information treats a sequence as an unordered collection of tokens. It can see which tokens are present, but not where they occur.

The sentences

```text
dog bites man
man bites dog
```

contain the same words but have different meanings.

The model therefore needs position information.

One common approach is learned positional embeddings:

$$
P \in \mathbb{R}^{T_{\max} \times d}.
$$

For token position $t$, the model adds a position vector:

$$
u_t = E[x_t] + P[t].
$$

The resulting vector contains both token identity and position information.

In PyTorch:

```python
max_length = 1024

token_embedding = nn.Embedding(vocab_size, hidden_dim)
position_embedding = nn.Embedding(max_length, hidden_dim)

tokens = torch.randint(0, vocab_size, (32, 128))
positions = torch.arange(128).unsqueeze(0)

x = token_embedding(tokens) + position_embedding(positions)

print(x.shape)  # torch.Size([32, 128, 768])
```

Other approaches to position information include sinusoidal embeddings, rotary position embeddings (RoPE), ALiBi, relative position bias, and learned relative position representations.
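
Of these, sinusoidal embeddings are simple enough to show in full. A minimal sketch of the fixed sine/cosine table, following the standard transformer formulation:

```python
import math

def sinusoidal_positions(max_length, hidden_dim):
    # One row per position; even dimensions use sine, odd dimensions use cosine.
    positions = torch.arange(max_length, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, hidden_dim, 2, dtype=torch.float32)
        * (-math.log(10000.0) / hidden_dim)
    )
    pe = torch.zeros(max_length, hidden_dim)
    pe[:, 0::2] = torch.sin(positions * div_term)
    pe[:, 1::2] = torch.cos(positions * div_term)
    return pe

pe = sinusoidal_positions(max_length, hidden_dim)
print(pe.shape)  # torch.Size([1024, 768])
```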

### Segment, Role, and Modality Embeddings

Some models add extra embeddings that describe token type or source.

BERT-style models use segment embeddings to distinguish sentence A from sentence B:

$$
u_t = E[x_t] + P[t] + S[s_t].
$$

Chat models may use role tokens or role embeddings to distinguish system, user, assistant, and tool messages.

Multimodal models may use modality embeddings to distinguish text tokens from image patches, audio frames, or other input types.

The general pattern is additive composition:

$$
u_t = \text{token information} + \text{position information} + \text{source information}.
$$

This gives the model a structured input representation without changing the core transformer block.
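
A minimal sketch of this additive pattern, adding a hypothetical segment embedding on top of the token and position embeddings defined earlier:

```python
num_segments = 2  # illustrative: sentence A vs. sentence B
segment_embedding = nn.Embedding(num_segments, hidden_dim)

tokens = torch.randint(0, vocab_size, (32, 128))
positions = torch.arange(128).unsqueeze(0)
segments = torch.zeros(32, 128, dtype=torch.long)  # all "sentence A" here

x = (
    token_embedding(tokens)
    + position_embedding(positions)
    + segment_embedding(segments)
)

print(x.shape)  # torch.Size([32, 128, 768])
```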

### Output Projection

At the end of a language model, each hidden state must be converted into scores over the vocabulary.

Suppose the final hidden state at position $t$ is

$$
h_t \in \mathbb{R}^{d}.
$$

The output projection computes logits:

$$
z_t = h_t W_{\text{out}} + b_{\text{out}},
$$

where

$$
W_{\text{out}} \in \mathbb{R}^{d \times |V|}
$$

and

$$
b_{\text{out}} \in \mathbb{R}^{|V|}.
$$

The result is

$$
z_t \in \mathbb{R}^{|V|}.
$$

Each entry is an unnormalized score for one possible next token.

For a batch:

$$
H \in \mathbb{R}^{B \times T \times d}
$$

produces

$$
Z \in \mathbb{R}^{B \times T \times |V|}.
$$

In PyTorch:

```python
output = nn.Linear(hidden_dim, vocab_size)

hidden = torch.randn(32, 128, hidden_dim)
logits = output(hidden)

print(logits.shape)  # torch.Size([32, 128, 50000])
```

The logits are then passed to cross-entropy loss during training or to a decoding method during generation.

### Softmax and Token Probabilities

The output projection produces logits, not probabilities. To obtain a probability distribution over the vocabulary, we apply softmax:

$$
p(x_{t+1}=i \mid x_{1:t}) =
\frac{\exp(z_{t,i})}
{\sum_{j=1}^{|V|}\exp(z_{t,j})}.
$$

The token with the largest logit receives the largest probability, but probability values also depend on the full set of logits.

During training, cross-entropy loss combines log-softmax and negative log-likelihood:

```python
import torch.nn.functional as F

# targets: [B, T] tensor of next-token IDs, aligned with logits (defined elsewhere)
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    targets.reshape(-1),
)
```

During generation, probabilities may be modified by temperature, top-k filtering, top-p filtering, or other decoding rules before sampling.
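
A minimal sketch of temperature scaling followed by top-k filtering at the last position; the temperature and k values here are arbitrary:

```python
temperature = 0.8
top_k = 50

last_logits = logits[:, -1, :] / temperature            # [B, V]

# Keep only the top-k logits; everything else gets probability zero.
topk_values, _ = torch.topk(last_logits, top_k, dim=-1)
cutoff = topk_values[:, -1].unsqueeze(-1)                # smallest kept logit
filtered = last_logits.masked_fill(last_logits < cutoff, float("-inf"))

probs = F.softmax(filtered, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)     # [B, 1]
```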

### Weight Tying

Many language models use weight tying between the input embedding matrix and the output projection.

The input embedding matrix has shape

$$
E \in \mathbb{R}^{|V| \times d}.
$$

The output projection weight has shape

$$
W_{\text{out}} \in \mathbb{R}^{d \times |V|}.
$$

With weight tying, we set

$$
W_{\text{out}} = E^\top.
$$

Then the same parameters are used for reading tokens and scoring output tokens.

In PyTorch:

```python
class TiedLanguageModel(nn.Module):
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.backbone = nn.GRU(
            hidden_dim,
            hidden_dim,
            batch_first=True,
        )
        self.output_bias = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, tokens):
        x = self.embedding(tokens)
        h, _ = self.backbone(x)

        # Tied output projection: reuse the embedding matrix as W_out = E^T.
        logits = h @ self.embedding.weight.T + self.output_bias
        return logits
```

Weight tying reduces parameter count and often improves language modeling quality because input and output token spaces share structure.
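
An equivalent and common pattern ties an `nn.Linear` output head to the embedding table by sharing the parameter directly (a minimal sketch):

```python
lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)
lm_head.weight = embedding.weight  # both layers now share one [|V|, d] tensor
```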

### Parameter Cost

Embeddings and output projections can contain many parameters.

Suppose

$$
|V| = 50{,}000,\quad d = 4096.
$$

The embedding matrix contains

$$
50{,}000 \times 4096 = 204{,}800{,}000
$$

parameters.

If the output projection is untied, it adds another 204.8 million weights, plus 50,000 biases.

This is a significant fraction of the total parameter count, especially for smaller models.

| Vocabulary | Hidden dimension | Embedding parameters |
|---:|---:|---:|
| 32,000 | 768 | 24.6M |
| 50,000 | 768 | 38.4M |
| 50,000 | 4096 | 204.8M |
| 100,000 | 4096 | 409.6M |

A larger vocabulary can reduce sequence length but increases embedding and output-layer cost.
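
These counts are easy to compute directly (a minimal sketch, with and without weight tying):

```python
def embedding_param_count(vocab_size, hidden_dim, tied=True):
    # Embedding table, plus an untied output projection (weights + biases).
    input_params = vocab_size * hidden_dim
    output_params = 0 if tied else hidden_dim * vocab_size + vocab_size
    return input_params + output_params

print(embedding_param_count(50_000, 4096, tied=True))   # 204800000
print(embedding_param_count(50_000, 4096, tied=False))  # 409650000
```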

### Embedding Initialization

Embeddings are usually initialized randomly. Common initializations use small Gaussian or uniform distributions.

Example:

```python
embedding = nn.Embedding(vocab_size, hidden_dim)
nn.init.normal_(embedding.weight, mean=0.0, std=0.02)
```

The exact scale matters. Very large initial embeddings can destabilize early training. Very small embeddings can reduce useful signal.

Some models initialize embeddings with pretrained vectors, such as word2vec or GloVe. This was common before large transformer pretraining. Modern language models usually learn embeddings from scratch as part of full model training.

### Freezing and Fine-Tuning Embeddings

In some transfer learning workflows, embeddings may be frozen. This means their parameters are not updated.

```python
embedding.weight.requires_grad = False
```

Freezing reduces the number of trainable parameters and can help when labeled data is scarce. It can also hurt performance if the downstream task uses vocabulary differently from pretraining.

Fine-tuning allows embeddings to adapt:

```python
embedding.weight.requires_grad = True
```

Large pretrained transformers usually fine-tune all parameters or use parameter-efficient methods such as adapters or LoRA. In either case, embeddings remain an important part of the model interface.

### Padding Embeddings

Padding tokens are used to make sequences in a batch the same length.

A padding token should not carry semantic content. PyTorch supports a `padding_idx` argument:

```python
embedding = nn.Embedding(
    num_embeddings=vocab_size,
    embedding_dim=hidden_dim,
    padding_idx=0,
)
```

The embedding at `padding_idx` is initialized to zeros and does not receive gradient updates.

However, attention masks and loss masks are still needed. A zero embedding alone does not guarantee that padding has no effect.

For language modeling, labels at padding positions are often ignored:

```python
labels[attention_mask == 0] = -100
```

Then cross-entropy excludes those positions.
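
`F.cross_entropy` skips label value `-100` by default through its `ignore_index` argument, so the masked positions contribute nothing to the loss:

```python
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    labels.reshape(-1),
    ignore_index=-100,  # the default value, shown explicitly
)
```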

### Embeddings for Non-Text Inputs

The embedding idea extends beyond text.

For images, a vision transformer splits an image into patches, flattens each patch into a vector $p_i$, and projects it with a learned linear map:

$$
e_i = W p_i + b.
$$

For audio, frames or learned audio codes are embedded into vectors.

For graphs, node IDs, edge types, or node features can be embedded.

For recommendation systems, users, items, categories, and actions are commonly represented by embeddings.

The general principle is the same: map discrete or structured inputs into continuous vector spaces that neural networks can process.
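
As a concrete instance of the image case, a minimal sketch of a ViT-style patch embedding that implements the projection $e_i = W p_i + b$ with a strided convolution (the image size and patch size are illustrative):

```python
patch_size = 16
in_channels = 3

# Each non-overlapping 16x16 patch is projected to a hidden_dim vector.
patch_embed = nn.Conv2d(
    in_channels,
    hidden_dim,
    kernel_size=patch_size,
    stride=patch_size,
)

images = torch.randn(32, 3, 224, 224)
patches = patch_embed(images)                 # [32, 768, 14, 14]
patches = patches.flatten(2).transpose(1, 2)  # [32, 196, 768]

print(patches.shape)  # torch.Size([32, 196, 768])
```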

### Practical Shape Checks

Embedding and output-projection code should always be checked against expected tensor shapes.

For a causal language model:

| Tensor | Shape |
|---|---|
| Token IDs | `[B, T]` |
| Token embeddings | `[B, T, d]` |
| Hidden states | `[B, T, d]` |
| Logits | `[B, T, V]` |
| Targets | `[B, T]` |
| Loss | scalar |

Minimal check:

```python
B = 4
T = 16
V = 1000
d = 128

tokens = torch.randint(0, V, (B, T))
targets = torch.randint(0, V, (B, T))

embedding = nn.Embedding(V, d)
output = nn.Linear(d, V)

h = embedding(tokens)
logits = output(h)

loss = F.cross_entropy(
    logits.reshape(B * T, V),
    targets.reshape(B * T),
)

print(h.shape)
print(logits.shape)
print(loss.shape)
```

Expected output:

```text
torch.Size([4, 16, 128])
torch.Size([4, 16, 1000])
torch.Size([])
```

The scalar loss can then be backpropagated through both the output projection and the embedding table.

### Summary

Embeddings map token IDs into continuous vectors. They are the first learned layer of most language models. Output projections map hidden states back into vocabulary logits. Together, they form the interface between discrete text and differentiable neural computation.

Input embeddings provide context-independent token representations. Transformer layers turn them into contextual hidden states. The output projection converts those hidden states into next-token or masked-token predictions.

Tokenizer vocabulary size determines the shape and cost of both the embedding table and the output projection. In large language models, these layers can contain hundreds of millions of parameters, so their design is a central architectural choice.

