# Subword Methods

Subword methods split text into units smaller than words but usually larger than single characters. They are the standard tokenization strategy for modern language models because they balance coverage, compression, and generalization.

A word-level tokenizer struggles with rare and unseen words. A character-level tokenizer covers everything but produces very long sequences. Subword tokenization sits between these extremes.

For example:

```text
unhelpfulness
```

may be split as:

```text
["un", "help", "ful", "ness"]
```

The model can represent the word even if the full word did not appear often during training. It can reuse pieces learned from related words:

```text
help
helpful
unhelpful
helpfulness
```

Subword methods are especially important for large language models, multilingual systems, code models, and domains with many rare terms.

### The Open Vocabulary Problem

Natural language has an open vocabulary. New words, names, spellings, hashtags, identifiers, and technical terms appear constantly.

A word-level vocabulary cannot include every possible string. If a word is missing, the tokenizer must map it to an unknown token:

```text
[UNK]
```

This loses information. For example:

```text
antidisestablishmentarianism
```

and

```text
electroencephalography
```

may both become:

```text
[UNK]
```

The model then cannot distinguish them.

Subword tokenization reduces this problem by decomposing rare words into known parts. Even if the full word is unseen, its pieces may be familiar.

```text
electroencephalography
```

might become:

```text
["electro", "ence", "phalo", "graphy"]
```

This preserves more information than `[UNK]`.

### Desiderata for Subword Tokenizers

A good subword tokenizer should satisfy several requirements.

| Requirement | Meaning |
|---|---|
| Coverage | It can encode nearly all input text |
| Compression | It represents common text with few tokens |
| Compositionality | Rare words are split into useful pieces |
| Stability | Small text changes should not cause extreme token changes |
| Multilingual fairness | Different languages should not be over-fragmented |
| Reversibility | Token IDs can be decoded back into text |
| Efficiency | Encoding and decoding should be fast |

There is no perfect tokenizer. Each algorithm chooses a tradeoff.

For example, byte-level BPE has excellent coverage, but rare Unicode text can require many tokens. WordPiece is effective for many NLP tasks, but depends on vocabulary design. Unigram tokenization is flexible, but training and segmentation involve probabilistic modeling.

### Byte Pair Encoding

Byte Pair Encoding, or BPE, is one of the most widely used subword methods.

BPE begins with a base vocabulary, such as characters or bytes. It then repeatedly merges the most frequent adjacent pair.

Suppose the training corpus contains:

```text
low
lower
lowest
```

Initially, these words might be represented as characters:

```text
l o w
l o w e r
l o w e s t
```

If the pair `l o` is frequent, BPE merges it:

```text
lo w
lo w e r
lo w e s t
```

Then `lo w` may be merged into `low`:

```text
low
low e r
low e s t
```

After many merge steps, frequent strings become single tokens.

The learned BPE vocabulary consists of base symbols plus merged symbols. The merge rules determine how new text is segmented.

### BPE Training Algorithm

A simplified BPE training procedure is:

```python
from collections import Counter, defaultdict

def get_pairs(words):
    # Count adjacent symbol pairs, weighted by word frequency.
    counts = defaultdict(int)

    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq

    return counts

def merge_pair(pair, words):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged = {}

    for symbols, freq in words.items():
        new_symbols = []
        i = 0

        while i < len(symbols):
            if (
                i < len(symbols) - 1
                and symbols[i] == pair[0]
                and symbols[i + 1] == pair[1]
            ):
                new_symbols.append(pair[0] + pair[1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1

        merged[tuple(new_symbols)] = freq

    return merged

corpus = [
    "low",
    "lower",
    "lowest",
    "newer",
    "wider",
]

# Represent each word as a tuple of characters, weighted by its corpus count.
words = Counter(tuple(word) for word in corpus)

# Apply a fixed number of merge steps; real tokenizers run until the
# target vocabulary size is reached.
for _ in range(10):
    pairs = get_pairs(words)
    if not pairs:
        break

    best_pair = max(pairs, key=pairs.get)
    words = merge_pair(best_pair, words)

print(words)
```

Real BPE tokenizers add details for whitespace handling, Unicode normalization, byte encoding, special tokens, and fast matching. The central idea remains repeated frequent-pair merging.

### BPE Encoding

After training, BPE encodes new text by applying learned merges in order.

Suppose the learned merges include:

```text
("l", "o") -> "lo"
("lo", "w") -> "low"
("e", "r") -> "er"
("low", "er") -> "lower"
```

Then:

```text
lower
```

can become:

```text
["lower"]
```

If a word is rare, fewer merges apply:

```text
lowestness
```

may become:

```text
["low", "est", "ness"]
```

The tokenizer can represent unseen words by falling back to smaller pieces.

BPE is deterministic once the merge table is fixed.
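
Encoding can be sketched as repeatedly applying the applicable merge with the highest priority (earliest position in the learned merge list), which is how common BPE implementations behave. Using the merge table above:

```python
def bpe_encode(word, merges):
    # merges: list of (left, right) pairs in the order they were learned.
    # A lower index means a higher-priority merge.
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)

    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks
        ]
        if not candidates:
            break

        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]

    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
print(bpe_encode("lower", merges))   # ['lower']
print(bpe_encode("lowest", merges))  # ['low', 'e', 's', 't']
```

For `lowest`, only the first two merges apply, so the tail falls back to single characters.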

### Byte-Level BPE

Character-level BPE still needs a base alphabet. With Unicode text, this can become complicated.

Byte-level BPE uses bytes as the base symbols. Since any UTF-8 string is a sequence of bytes, byte-level BPE can encode arbitrary text.

This avoids unknown characters.

For example, text containing emojis, rare scripts, mathematical symbols, or corrupted input can still be represented.

The tradeoff is that some characters may require several bytes before merges are applied. Rare scripts can be tokenized less efficiently than high-resource scripts seen often during tokenizer training.

Byte-level BPE is common in GPT-style systems.
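
The varying byte width of UTF-8 is easy to see directly, which also shows why rarer characters start out as several base symbols:

```python
# ASCII needs one byte; accented Latin, arrows, and emoji need more.
for text in ["a", "é", "→", "🙂"]:
    data = text.encode("utf-8")
    print(f"{text!r}: {len(data)} byte(s) {list(data)}")
# 1, 2, 3, and 4 bytes respectively
```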

### WordPiece

WordPiece is another subword method. It is closely associated with BERT-style models.

Like BPE, it builds a vocabulary of subword units. But it selects vocabulary pieces using a likelihood-based criterion rather than only raw pair frequency.

A WordPiece tokenizer often uses continuation markers. For example:

```text
unaffordable
```

may become:

```text
["un", "##afford", "##able"]
```

The prefix `##` indicates that the token continues a word rather than starting a new one.

This distinction helps the model separate word beginnings from word interiors.

WordPiece tokenization usually uses a greedy longest-match-first algorithm during encoding. At each position in a word, it selects the longest vocabulary piece that matches.

If no piece matches, the word may become `[UNK]`, depending on the implementation.

### WordPiece Encoding Example

Suppose the vocabulary contains:

```text
["un", "afford", "##afford", "##able", "able", "car"]
```

The word:

```text
unaffordable
```

might be segmented as:

```text
["un", "##afford", "##able"]
```

A simplified greedy encoding procedure:

```python
def wordpiece_tokenize(word, vocab):
    tokens = []
    start = 0

    while start < len(word):
        end = len(word)
        match = None

        # Longest-match-first: try the longest remaining substring,
        # shrinking from the right until a vocabulary piece matches.
        while start < end:
            piece = word[start:end]
            if start > 0:
                # Pieces inside a word carry the continuation marker.
                piece = "##" + piece

            if piece in vocab:
                match = piece
                break

            end -= 1

        if match is None:
            # No vocabulary piece covers this position.
            return ["[UNK]"]

        tokens.append(match)
        start = end

    return tokens
```

Real implementations handle normalization, punctuation, special tokens, and whitespace, but this shows the core idea.

### Unigram Language Model Tokenization

Unigram tokenization takes a different approach. It starts with a large candidate vocabulary and prunes it.

Each token piece has a probability. A word may have many possible segmentations. The tokenizer chooses the highest-probability segmentation, typically found with dynamic programming (the Viterbi algorithm).

For a string $s$, the probability of one segmentation

$$
z = (z_1,z_2,\ldots,z_k)
$$

is

$$
p(z) = \prod_{i=1}^{k} p(z_i).
$$

The tokenizer seeks a segmentation such as

$$
z^\star = \arg\max_z p(z).
$$
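
This arg max can be computed with a short dynamic program over word positions. The sketch below uses illustrative, hand-picked piece probabilities, not values from a trained model:

```python
import math

def viterbi_segment(text, piece_logprobs):
    n = len(text)
    # best[i] = (score of best segmentation of text[:i], start of last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)

    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in piece_logprobs:
                score = best[start][0] + piece_logprobs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)

    if best[n][0] == -math.inf:
        return None  # no segmentation exists with this vocabulary

    # Walk the backpointers to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        start = best[i][1]
        pieces.append(text[start:i])
        i = start
    return pieces[::-1]

# Toy probabilities for illustration only.
logp = {p: math.log(q) for p, q in {
    "inter": 0.05, "national": 0.04,
    "international": 0.01, "ization": 0.03,
}.items()}

print(viterbi_segment("internationalization", logp))
# ['international', 'ization']
```

Two long pieces beat three shorter ones here because each extra piece multiplies in another probability below one.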

During training, the algorithm removes pieces that contribute least to corpus likelihood until the target vocabulary size is reached.

Unigram tokenization is used by SentencePiece and works well in multilingual settings.

### SentencePiece

SentencePiece is a tokenizer framework rather than a single algorithm. It supports BPE and unigram tokenization.

A major design feature is that SentencePiece treats input as a raw Unicode string. It does not require pre-tokenization by whitespace.

Whitespace is represented explicitly, often using a marker such as:

```text
▁
```

For example:

```text
Deep learning
```

may become:

```text
["▁Deep", "▁learning"]
```

This makes tokenization language-neutral. It is useful for languages where whitespace does not separate words in the same way as English.

SentencePiece is widely used in multilingual and sequence-to-sequence models.
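
A simplified sketch of the marker convention (not the actual SentencePiece implementation, which learns subword pieces on top of this representation):

```python
def to_pieces(text, marker="▁"):
    # Attach the word-boundary marker to each whitespace-separated word,
    # including the first one.
    return [marker + w for w in text.split(" ")]

def from_pieces(pieces, marker="▁"):
    # Decoding is reversible: concatenate and turn markers back into spaces.
    return "".join(pieces).replace(marker, " ").lstrip(" ")

pieces = to_pieces("Deep learning")
print(pieces)               # ['▁Deep', '▁learning']
print(from_pieces(pieces))  # Deep learning
```

Because the marker is an ordinary symbol in the vocabulary, round-tripping text through tokens loses no whitespace information.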

### Subword Regularization

Some unigram tokenizers support multiple possible segmentations for the same text. During training, one segmentation can be sampled.

For example:

```text
internationalization
```

might be segmented as:

```text
["international", "ization"]
```

or:

```text
["inter", "national", "ization"]
```

or:

```text
["internationalization"]
```

Sampling different segmentations acts as regularization. The model becomes less dependent on a single fixed segmentation and more robust to token boundary variation.

This technique is called subword regularization.

It is mostly used during training. At inference time, deterministic segmentation is usually preferred.
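
A minimal sketch of such sampling, assuming a toy piece-probability table and enumerating segmentations exhaustively (practical implementations sample from a lattice instead):

```python
import random

def all_segmentations(text, vocab):
    # Enumerate every way of covering `text` with in-vocabulary pieces.
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in all_segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results

def sample_segmentation(text, probs, rng):
    # Sample a segmentation with probability proportional to the
    # product of its piece probabilities.
    segs = all_segmentations(text, probs)
    weights = []
    for seg in segs:
        p = 1.0
        for piece in seg:
            p *= probs[piece]
        weights.append(p)
    return rng.choices(segs, weights=weights, k=1)[0]

# Toy probabilities for illustration only.
probs = {"inter": 0.05, "national": 0.04, "international": 0.01,
         "internationalization": 0.002, "ization": 0.03}

rng = random.Random(0)
print(sample_segmentation("internationalization", probs, rng))
```

Each call may return a different valid segmentation, which is exactly the boundary variation the model is trained to tolerate.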

### Vocabulary Size Tradeoffs

Subword tokenizer vocabulary size controls an important tradeoff.

A small vocabulary produces longer token sequences but smaller embedding and output matrices.

A large vocabulary produces shorter token sequences but larger embedding and output matrices.

| Vocabulary size | Sequence length | Embedding/output cost | Rare word handling |
|---:|---:|---:|---|
| Small | Longer | Lower | More fragmented |
| Medium | Moderate | Moderate | Usually good |
| Large | Shorter | Higher | Less fragmented |

For a language model with embedding dimension $d$, the input embedding table has

$$
|V| \times d
$$

parameters.

If output weights are not tied to input embeddings, the output projection also has

$$
d \times |V|
$$

parameters.

With $|V|=100{,}000$ and $d=4096$, each such matrix has 409.6 million parameters.

Large vocabularies can improve compression but increase memory and softmax cost.
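
A quick check of the arithmetic:

```python
vocab_size = 100_000
d_model = 4096

embedding_params = vocab_size * d_model
print(embedding_params)      # 409_600_000 parameters per matrix

# If the output projection is untied, it adds the same amount again.
print(2 * embedding_params)  # 819_200_000
```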

### Token Fertility

Token fertility is the number of tokens required to represent a span of text, often reported as the average number of tokens per word.

A tokenizer may represent English compactly but split another language into many more tokens. This means users of the second language consume more context length and compute for the same amount of meaning.

For example, two sentences with similar semantic content may become:

| Language | Characters | Tokens | Token fertility |
|---|---:|---:|---:|
| English | 80 | 18 | Low |
| Language B | 80 | 42 | High |

High token fertility can hurt multilingual performance. It reduces effective context length and increases training and inference cost.

Tokenizer training data should therefore be balanced across target languages and domains.
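
Fertility is easy to measure given any tokenizer's encode function. A sketch, using character-level tokenization as an extreme high-fertility stand-in:

```python
def token_fertility(text, tokenize):
    # Average number of tokens per whitespace-separated word.
    words = text.split()
    tokens = tokenize(text)
    return len(tokens) / max(len(words), 1)

# Character-level tokenization as a stand-in tokenizer.
print(token_fertility("deep learning models", list))  # 20 tokens / 3 words
```

Running the same measurement per language over a held-out corpus exposes exactly the imbalance described above.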

### Special Handling for Whitespace

Whitespace is simple for humans but complicated for tokenizers.

Consider:

```text
hello world
```

A tokenizer may produce:

```text
["hello", " world"]
```

where the leading space is part of the second token.

Another tokenizer may produce:

```text
["▁hello", "▁world"]
```

where the marker represents a word boundary.

For code models, whitespace can carry semantics. In Python, indentation affects program structure. For Markdown and YAML, whitespace can also matter.

A tokenizer that destroys or normalizes important whitespace can harm model performance.

### Numbers and Symbols

Numbers create special problems.

A tokenizer might split:

```text
123456789
```

as:

```text
["123", "456", "789"]
```

or:

```text
["1", "2", "3", "4", "5", "6", "7", "8", "9"]
```

or:

```text
["123456", "789"]
```

Chunked number tokens can make exact arithmetic difficult because the model does not see digits uniformly.

Similar issues occur for dates, units, chemical formulas, mathematical notation, and identifiers.

For code and scientific text, tokenizer inspection is mandatory. Poor segmentation can create systematic model weaknesses.
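
One mitigation used by some model families is to pre-split digit runs into single digits before subword merges apply, so the model always sees numbers one digit at a time. A minimal regex sketch:

```python
import re

def split_digits(text):
    # Each digit becomes its own piece; non-digit runs stay intact.
    return re.findall(r"\d|\D+", text)

print(split_digits("price = 123456789"))
# ['price = ', '1', '2', '3', '4', '5', '6', '7', '8', '9']
```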

### Subword Methods for Code

Subword tokenization works well for code because identifiers are often compositional.

For example:

```text
load_user_profile_by_id
```

can be split into meaningful parts:

```text
["load", "_user", "_profile", "_by", "_id"]
```
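
A rough sketch of why such identifiers decompose well: a hypothetical splitter on underscores and case transitions. Real subword tokenizers learn splits from data and often keep the separators attached, as in the example above:

```python
import re

def split_identifier(name):
    # Split on underscores, then on lowercase-to-uppercase transitions.
    parts = []
    for chunk in name.split("_"):
        if chunk:
            parts.extend(re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", chunk))
    return parts

print(split_identifier("load_user_profile_by_id"))
# ['load', 'user', 'profile', 'by', 'id']
print(split_identifier("getUserID"))
# ['get', 'User', 'ID']
```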

However, code also contains syntax where exact characters matter.

Operators such as:

```text
== != <= >= -> => :: :=
```

should be represented consistently.

Whitespace and indentation may also need special treatment.

A code tokenizer should be tested on:

| Feature | Examples |
|---|---|
| Identifiers | `getUserID`, `snake_case_name` |
| Operators | `==`, `!=`, `=>`, `::` |
| Strings | quoted text, escapes |
| Comments | prose inside code |
| Whitespace | indentation and alignment |
| Mixed content | notebooks, Markdown, SQL, HTML |

### Evaluating a Subword Tokenizer

Tokenizer evaluation should include both quantitative and qualitative checks.

Useful quantitative metrics:

| Metric | Meaning |
|---|---|
| Average tokens per character | Measures compression |
| Average tokens per word | Measures fertility |
| Unknown token rate | Should be near zero for modern tokenizers |
| Vocabulary coverage by language | Detects imbalance |
| Sequence length distribution | Affects context and training cost |

Useful qualitative checks:

```python
# `tokenizer` is assumed to be an already-trained tokenizer object with a
# Hugging Face-style interface (`encode`, `convert_ids_to_tokens`).
examples = [
    "Deep learning with PyTorch.",
    "electroencephalography",
    "get_user_profile_by_id(user_id)",
    "Xin chào thế giới.",
    "東京で機械学習を勉強しています。",
    "price = $123,456.78",
]

for text in examples:
    ids = tokenizer.encode(text)
    pieces = tokenizer.convert_ids_to_tokens(ids)

    print(text)
    print(pieces)
    print(len(ids))
    print()
```

The goal is not perfect human-readable pieces. The goal is efficient, stable, and useful segmentation for the model’s training distribution.

### Subword Methods and Model Architecture

Tokenization interacts with architecture.

A tokenizer that produces longer sequences increases attention cost. For a transformer with sequence length $T$, standard self-attention has cost roughly proportional to

$$
T^2.
$$

Thus a tokenizer with poor compression can significantly increase training and inference cost.

A tokenizer with a large vocabulary increases the cost of embeddings and output logits.

The output logits for a batch have shape

$$
[B,T,|V|].
$$

When $|V|$ is large, the final linear layer and softmax can become expensive.

This is why tokenizer choice must be considered part of model design, not merely preprocessing.
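
A rough estimate of that cost, assuming float32 logits and illustrative shapes:

```python
def logits_bytes(batch, seq_len, vocab_size, bytes_per_element=4):
    # Memory for one [B, T, |V|] float32 logits tensor.
    return batch * seq_len * vocab_size * bytes_per_element

# Illustrative shapes: batch 8, context 4096, vocabulary 100k.
gb = logits_bytes(8, 4096, 100_000) / 1024**3
print(f"{gb:.1f} GiB")  # roughly 12.2 GiB
```

A single activation tensor of this size, before any gradients, shows why large vocabularies strain the output layer.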

### Practical PyTorch Workflow

In a PyTorch language model project, tokenization usually happens before tensors enter the model.

A typical dataset returns token IDs:

```python
import torch
from torch.utils.data import Dataset

class TokenDataset(Dataset):
    def __init__(self, texts, tokenizer, block_size):
        self.examples = []

        for text in texts:
            ids = tokenizer.encode(text)

            # Slide a window of block_size + 1 tokens over the sequence.
            # A stride of 1 produces heavily overlapping examples; real
            # pipelines often use a larger stride, such as block_size.
            for i in range(0, len(ids) - block_size):
                chunk = ids[i:i + block_size + 1]
                self.examples.append(torch.tensor(chunk))

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        chunk = self.examples[idx]

        # Next-token prediction: targets are the inputs shifted by one.
        x = chunk[:-1]
        y = chunk[1:]

        return x, y
```

The model receives integer tensors:

```python
x, y = batch

# x: [B, T]
# y: [B, T]

logits = model(x)

# logits: [B, T, V]
```

The embedding layer maps token IDs to vectors:

```python
embedding = torch.nn.Embedding(vocab_size, hidden_dim)

h = embedding(x)

# h: [B, T, D]
```

The tokenizer defines `vocab_size`, and therefore determines the first and last layers of the model.

### Summary

Subword methods solve the open vocabulary problem by splitting rare text into reusable pieces. BPE builds tokens through frequent-pair merges. WordPiece uses likelihood-oriented subword construction and continuation markers. Unigram tokenization treats segmentation probabilistically and prunes a candidate vocabulary. SentencePiece provides a language-neutral framework for raw-text tokenization.

Subword tokenizer design affects sequence length, memory cost, multilingual quality, code modeling, numerical behavior, and model robustness. In modern deep learning systems, the tokenizer is part of the model’s architecture and deployment interface.

