# Tokenization Systems

A language model does not read raw text directly. It reads tokens. Tokenization is the process that maps a string of text into a sequence of discrete symbols, and later maps generated symbols back into text.

For a text string

$$
s = \text{"Deep learning works."}
$$

a tokenizer produces a token sequence

$$
x_{1:T} = (x_1, x_2, \ldots, x_T).
$$

The model then operates on token IDs, not characters or words directly.

```python
from transformers import AutoTokenizer

# Any pretrained tokenizer works here; GPT-2's is used for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Deep learning works."

tokens = tokenizer.encode(text)
print(tokens)

decoded = tokenizer.decode(tokens)
print(decoded)
```

Tokenization is a design choice with consequences for model quality, vocabulary size, sequence length, multilingual support, code modeling, memory usage, and inference speed.

### Why Tokenization Is Needed

Neural networks operate on numbers. Text must therefore be converted into numerical form.

A tokenizer usually performs two mappings:

$$
\text{text} \rightarrow \text{tokens}
$$

and

$$
\text{tokens} \rightarrow \text{text}.
$$

Each token is assigned an integer ID:

$$
x_t \in \{0,1,\ldots,|V|-1\},
$$

where $|V|$ is the vocabulary size.

For example:

| Token | ID |
|---|---:|
| `Deep` | 9132 |
| ` learning` | 6975 |
| ` works` | 2495 |
| `.` | 13 |

The model sees only the integer sequence:

$$
(9132, 6975, 2495, 13).
$$

An embedding layer maps these IDs into vectors:

$$
E[x_t] \in \mathbb{R}^d.
$$

Thus tokenization is the bridge between human-readable text and model-readable tensors.
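The lookup can be sketched with a toy embedding table (the sizes and IDs are illustrative):

```python
import random

V, d = 50, 8  # toy vocabulary size and embedding dimension
E = [[random.random() for _ in range(d)] for _ in range(V)]  # embedding table

token_ids = [9, 6, 2, 13]            # the integer sequence the model sees
vectors = [E[t] for t in token_ids]  # one d-dimensional vector per token

print(len(vectors), len(vectors[0]))  # 4 8
```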

### Word-Level Tokenization

The simplest approach is word-level tokenization. Text is split into words, often using whitespace and punctuation.

For example:

```text
Deep learning works.
```

may become:

```text
["Deep", "learning", "works", "."]
```
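A naive splitter of this kind can be sketched with a regular expression:

```python
import re

def word_tokenize(text):
    # Words are runs of word characters; punctuation becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Deep learning works."))  # ['Deep', 'learning', 'works', '.']
```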

Word-level tokenization is easy to understand. It also preserves many linguistic units.

However, it has serious limitations.

First, the vocabulary can become very large. It must contain singular and plural forms, verb tenses, names, numbers, spelling variants, and rare terms as separate entries.

Second, unknown words are common. If a word never appears in the vocabulary, it must be replaced by a special unknown token such as `[UNK]`.

Third, word-level tokenization handles multilingual text poorly. Languages differ in spacing rules, morphology, and segmentation.

Fourth, it is weak for code, URLs, identifiers, emojis, and noisy text.

Word-level tokenization was common in early NLP systems. Modern language models usually use subword or byte-level tokenization instead.

### Character-Level Tokenization

Character-level tokenization treats each character as a token.

For example:

```text
cat
```

becomes:

```text
["c", "a", "t"]
```

This approach has a small vocabulary. It can represent almost any word without unknown tokens.

For English text, the vocabulary may include letters, digits, punctuation, spaces, and special symbols.

Character models are robust to misspellings and rare words. They are also useful for morphology-rich languages.

The main disadvantage is sequence length. A sentence that takes 20 word tokens may take 100 or more character tokens. Longer sequences increase compute and memory cost, especially for transformers.

Character-level models also force the network to learn word-level structure from smaller units. This can work, but it usually requires more training computation.
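A minimal character-level tokenizer over a fixed alphabet can be sketched as:

```python
alphabet = sorted(set("abcdefghijklmnopqrstuvwxyz .,"))
stoi = {ch: i for i, ch in enumerate(alphabet)}  # character -> ID
itos = {i: ch for ch, i in stoi.items()}         # ID -> character

def encode(text):
    return [stoi[ch] for ch in text]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("cat")
print(ids)
print(decode(ids))  # cat
```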

### Subword Tokenization

Subword tokenization is the dominant approach in modern language models. It represents frequent words as single tokens and rare words as combinations of smaller pieces.

For example:

```text
unbelievable
```

may become:

```text
["un", "believ", "able"]
```

or:

```text
["un", "##believ", "##able"]
```

depending on the tokenizer.

Subword tokenization balances several goals:

| Goal | Benefit |
|---|---|
| Keep common words short | Efficient representation |
| Split rare words | Fewer unknown tokens |
| Share pieces across words | Better generalization |
| Control vocabulary size | Stable memory and compute |

A subword vocabulary may contain 30,000 to 100,000 tokens. This is large enough to encode common patterns efficiently, but small enough to avoid a massive word-level vocabulary.

### Byte Pair Encoding

Byte Pair Encoding, or BPE, is a common subword tokenization algorithm.

BPE starts with small units, often characters or bytes. It repeatedly merges the most frequent adjacent pair into a new token.

Suppose the corpus contains many occurrences of:

```text
l o w
l o w e r
l o w e s t
```

The pair `l o` may be merged into `lo`, then `lo w` into `low`.

After many merges, frequent strings become single tokens.

The basic BPE procedure is:

1. Initialize the vocabulary with base symbols.
2. Count adjacent symbol pairs in the corpus.
3. Merge the most frequent pair.
4. Repeat until the desired vocabulary size is reached.

BPE is effective because it learns reusable text fragments from data.

For code, BPE can learn pieces of identifiers, keywords, operators, and common naming patterns.
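The procedure above can be sketched on the toy corpus (a minimal illustration, not a production trainer):

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word a tuple of base symbols.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 2}

merges = []
for _ in range(4):
    counts = get_pair_counts(words)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
```

The first merges produce `lo` and then `low`, exactly as described above.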

### WordPiece

WordPiece is another subword algorithm, used by BERT-style models.

Like BPE, WordPiece builds a vocabulary of subword units. The difference is in the merge criterion. WordPiece chooses merges that improve the likelihood of the training corpus under a language model objective, rather than simply merging the most frequent pair.

WordPiece often marks non-initial subword pieces with a prefix such as `##`.

For example:

```text
playing
```

may become:

```text
["play", "##ing"]
```

This notation says that `##ing` continues a previous word.

WordPiece works well for masked language modeling and encoder-based models.

### Unigram Language Model Tokenization

Unigram tokenization starts with a large candidate vocabulary and removes tokens until the desired vocabulary size is reached.

Instead of constructing tokens by deterministic merges, it treats tokenization probabilistically. A word can have multiple possible segmentations, and the algorithm chooses a vocabulary that gives high likelihood to the training corpus.

For example, the word

```text
internationalization
```

may have several possible segmentations:

```text
["international", "ization"]
["inter", "national", "ization"]
["internationalization"]
```

The tokenizer assigns probabilities to pieces and chooses a likely segmentation.
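This choice can be sketched as a Viterbi search over hypothetical piece probabilities (the values are illustrative, not from a trained model):

```python
import math

# Hypothetical piece log-probabilities; the numbers are illustrative only.
log_p = {
    "international": math.log(0.1),
    "ization": math.log(0.1),
    "inter": math.log(0.05),
    "national": math.log(0.05),
    "internationalization": math.log(0.001),
}

def best_segmentation(word, log_p):
    """Viterbi search for the highest-probability segmentation of `word`."""
    n = len(word)
    best = [(-math.inf, 0)] * (n + 1)  # (score, start of last piece)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_p:
                score = best[start][0] + log_p[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    pieces, end = [], n  # backtrack from the end of the word
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

print(best_segmentation("internationalization", log_p))
```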

Unigram tokenization is used in systems such as SentencePiece. It works well for multilingual settings because it does not require whitespace-based word boundaries.

### Byte-Level Tokenization

Byte-level tokenization uses bytes as the base alphabet. Since any UTF-8 string can be represented as bytes, byte-level tokenizers can encode arbitrary text without unknown characters.

This is useful for:

| Text type | Why byte-level helps |
|---|---|
| Multilingual text | Handles many scripts |
| Emojis | Represents arbitrary Unicode |
| Code | Handles symbols and indentation |
| URLs | Handles unusual characters |
| Noisy text | Handles typos and mixed encodings |

Many modern GPT-style tokenizers use byte-level BPE. They begin with byte symbols and learn merges over byte sequences.

The advantage is coverage. The tokenizer can represent nearly any input.

The disadvantage is that rare Unicode text may become many tokens, increasing sequence length.
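The coverage property is easy to see with raw UTF-8 bytes:

```python
text = "héllo 👋"
data = text.encode("utf-8")

print(list(data))            # one integer per byte, each in 0..255
print(len(text), len(data))  # 7 characters become 11 bytes
```

Decoding the bytes recovers the original string exactly, so no character is ever out of vocabulary, but the emoji alone costs four base symbols.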

### Special Tokens

Tokenizers usually reserve special tokens for structural purposes.

Common examples:

| Token | Use |
|---|---|
| `[PAD]` | Padding shorter sequences in a batch |
| `[UNK]` | Unknown token |
| `[CLS]` | Classification representation |
| `[SEP]` | Separator between segments |
| `[MASK]` | Masked language modeling |
| `<BOS>` | Beginning of sequence |
| `<EOS>` | End of sequence |
| `<|user|>` | Chat role marker |
| `<|assistant|>` | Chat role marker |

Special tokens must be handled carefully. They are part of the model’s learned interface.

A chat model may use role tokens to distinguish system, user, assistant, and tool messages. If those tokens are wrong, the model may receive a prompt in a format different from its training distribution.

### Vocabulary Size

Vocabulary size affects both model capacity and efficiency.

A larger vocabulary reduces sequence length because more strings can be represented as single tokens. But it increases the size of the embedding table and output projection.

The embedding matrix has shape

$$
E \in \mathbb{R}^{|V| \times d}.
$$

The output projection often has shape

$$
W \in \mathbb{R}^{d \times |V|}.
$$

If $|V| = 50{,}000$ and $d=4096$, then the embedding table contains

$$
50{,}000 \times 4096 = 204{,}800{,}000
$$

parameters.
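The same arithmetic in code, including the output projection when it is not weight-tied:

```python
V, d = 50_000, 4096

embedding_params = V * d   # input embedding table E
projection_params = d * V  # output projection W (untied case)

print(embedding_params)                      # 204800000
print(embedding_params + projection_params)  # 409600000
```

Some models tie `E` and `W` so the two share one set of parameters, halving this cost.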

The vocabulary also affects the cost of the final softmax. Larger vocabularies require more logits per position.

Thus tokenizer design trades sequence length against embedding and output-layer cost.

### Tokenization and Sequence Length

Different tokenizers can produce different sequence lengths for the same text.

For example:

| Tokenizer style | Likely token count |
|---|---:|
| Word-level | Low for common words, fails on rare words |
| Subword | Moderate |
| Character-level | High |
| Byte-level | Variable, high for rare Unicode |

Sequence length matters because transformer self-attention has quadratic cost in the number of tokens:

$$
O(T^2).
$$

A tokenizer that produces fewer tokens can make the same context cheaper to process.

However, aggressive merging can reduce compositionality. Rare or morphologically complex words may become opaque if represented as large tokens.

### Tokenization for Multilingual Models

Multilingual tokenization is harder than English tokenization.

Different languages use different scripts and segmentation conventions. Some languages use spaces between words. Others do not. Some languages have rich morphology, where one word may encode information that English would express with several words.

A multilingual tokenizer must allocate vocabulary capacity across languages. High-resource languages can dominate the vocabulary if training data is imbalanced.

This creates problems:

| Problem | Effect |
|---|---|
| Vocabulary imbalance | Low-resource languages get inefficient segmentation |
| Script diversity | Many symbols compete for vocabulary slots |
| Morphology | Words split into many pieces |
| Token fertility | Same meaning may require more tokens in some languages |

Token fertility is the number of tokens required to represent a given unit of text. Higher fertility increases compute cost and can reduce effective context length.

Byte-level and unigram tokenizers help with coverage, but vocabulary allocation remains important.

### Tokenization for Code

Code tokenization has different requirements from natural language.

A code tokenizer must handle:

| Element | Example |
|---|---|
| Keywords | `def`, `class`, `return` |
| Operators | `==`, `<=`, `=>`, `::` |
| Indentation | Python block structure |
| Identifiers | `get_user_profile_by_id` |
| Literals | strings, numbers |
| Comments | natural language mixed with code |

Subword tokenization is useful because identifiers often contain meaningful pieces:

```text
get_user_profile_by_id
```

may be split into:

```text
["get", "_user", "_profile", "_by", "_id"]
```

A good code tokenizer should preserve common operators and syntax patterns while still decomposing rare identifiers.

Whitespace may also be semantically meaningful, especially in Python. Tokenizers for code models must avoid destroying structure that the model needs.

### Tokenization and Model Behavior

Tokenization can affect model behavior in subtle ways.

A model may handle a word well if it is a common single token, but poorly if it is split into many rare pieces.

Tokenization affects spelling, arithmetic, string manipulation, multilingual fairness, code completion, and robustness to formatting.

For example, numbers may be tokenized inconsistently:

```text
123456
```

could become:

```text
["123", "456"]
```

or:

```text
["1", "2", "3", "4", "5", "6"]
```

This affects numerical reasoning. A model trained on token chunks may struggle to learn digit-level algorithms.

Similarly, small text changes can change token boundaries. This can alter model probabilities even when the human meaning is nearly identical.

### Padding, Truncation, and Attention Masks

When sequences have different lengths, batching requires padding.

Suppose two tokenized sequences have lengths 5 and 8. We can pad the shorter sequence:

```text
[12, 51, 8, 99, 3, PAD, PAD, PAD]
[7, 14, 91, 2, 5, 88, 30, 4]
```

An attention mask tells the model which tokens are real and which are padding:

```text
[1, 1, 1, 1, 1, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1]
```
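Building these two arrays by hand takes only a few lines (the pad ID 0 is an illustrative choice; real tokenizers expose their own, e.g. `tokenizer.pad_token_id`):

```python
PAD_ID = 0  # illustrative; use the tokenizer's actual pad ID in practice

seqs = [[12, 51, 8, 99, 3], [7, 14, 91, 2, 5, 88, 30, 4]]
max_len = max(len(s) for s in seqs)

# Right-pad every sequence to the batch maximum and mark real positions with 1.
input_ids = [s + [PAD_ID] * (max_len - len(s)) for s in seqs]
attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]

print(input_ids[0])       # [12, 51, 8, 99, 3, 0, 0, 0]
print(attention_mask[0])  # [1, 1, 1, 1, 1, 0, 0, 0]
```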

In PyTorch-style code:

```python
# `texts` is a list of input strings; `tokenizer` as loaded earlier.
batch = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

input_ids = batch["input_ids"]
attention_mask = batch["attention_mask"]
```

Padding tokens should not contribute to loss or attention. For causal language modeling, labels at padding positions are often set to `-100` so cross-entropy ignores them.

### Tokenizer Training

Training a tokenizer usually follows this pipeline:

1. Collect a representative text corpus.
2. Normalize text if needed.
3. Choose a tokenizer algorithm.
4. Choose vocabulary size.
5. Train the tokenizer.
6. Inspect tokenization quality.
7. Freeze the tokenizer before model training.
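Assuming the Hugging Face `tokenizers` library, steps 3 to 6 can be sketched on a toy corpus (the algorithm choice, vocabulary size, and corpus here are illustrative):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = ["low lower lowest", "deep learning works", "tokenizers learn merges"]

# Steps 3-4: choose the algorithm (BPE here) and a small vocabulary size.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"])

# Step 5: train on the corpus.
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Step 6: inspect the result before freezing it.
encoding = tokenizer.encode("lowest learning")
print(encoding.tokens)
```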

The tokenizer should be trained on data similar to the model’s training distribution. A tokenizer trained only on English prose may perform poorly on code, math, logs, or multilingual text.

After a model is trained, changing the tokenizer is difficult because the embedding matrix and output projection depend on the vocabulary.

A tokenizer is therefore part of the model architecture, not merely preprocessing.

### Tokenization in PyTorch Workflows

PyTorch itself provides tensor computation. Tokenization is usually handled by libraries such as Hugging Face Tokenizers or SentencePiece.

A typical workflow:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["Deep learning uses tensors.", "PyTorch stores data in tensors."],
    padding=True,
    truncation=True,
    return_tensors="pt",
)

print(batch["input_ids"].shape)
print(batch["attention_mask"].shape)
```

The model receives tensors:

```python
outputs = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
)
```

For causal language modeling, we often create labels by copying input IDs:

```python
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100
```

The model then shifts the labels internally and is trained to predict the next token at each position; the `-100` positions are excluded from the loss.

### Practical Checks

Tokenizer problems are common. Before training, inspect examples manually.

Useful checks:

| Check | Question |
|---|---|
| Round-trip decoding | Does decode(encode(text)) preserve the text closely? |
| Rare words | Are rare words split sensibly? |
| Numbers | Are numeric strings tokenized consistently? |
| Code | Are indentation and operators preserved? |
| Multilingual text | Are non-English languages over-fragmented? |
| Special tokens | Are role and boundary tokens correct? |
| Sequence lengths | Are examples too long after tokenization? |

A simple inspection script:

```python
# Assumes `tokenizer` has been loaded as in the previous section.
examples = [
    "Deep learning uses tensors.",
    "get_user_profile_by_id(user_id)",
    "The price is $12.99.",
    "Xin chào thế giới.",
]

for text in examples:
    ids = tokenizer.encode(text)
    pieces = tokenizer.convert_ids_to_tokens(ids)

    print(text)
    print(ids)
    print(pieces)
    print()
```

This kind of inspection often catches issues before expensive model training begins.

### Summary

Tokenization maps text into discrete symbols that a language model can process. The main approaches are word-level, character-level, subword, and byte-level tokenization.

Modern models usually use subword or byte-level tokenization because these methods balance vocabulary size, sequence length, coverage, and robustness.

Tokenizer design affects model efficiency, multilingual quality, code understanding, numerical behavior, prompt formatting, and downstream performance. Once a model is trained, the tokenizer becomes part of the model’s interface.

