Subword methods split text into units smaller than words but usually larger than single characters. They are the standard tokenization strategy for modern language models because they balance coverage, compression, and generalization.
A word-level tokenizer has trouble with rare words. A character-level tokenizer creates long sequences. Subword tokenization sits between these extremes.
For example, the word:

unhelpfulness

may be split as:

["un", "help", "ful", "ness"]

The model can represent the word even if the full word did not appear often during training. It can reuse pieces learned from related words:

help
helpful
unhelpful
helpfulness

Subword methods are especially important for large language models, multilingual systems, code models, and domains with many rare terms.
The Open Vocabulary Problem
Natural language has an open vocabulary. New words, names, spellings, hashtags, identifiers, and technical terms appear constantly.
A word-level vocabulary cannot include every possible string. If a word is missing, the tokenizer must map it to an unknown token:
[UNK]

This loses information. For example:

antidisestablishmentarianism

and

electroencephalography

may both become:

[UNK]

The model then cannot distinguish them.
Subword tokenization reduces this problem by decomposing rare words into known parts. Even if the full word is unseen, its pieces may be familiar.
For example:

electroencephalography

might become:

["electro", "ence", "phalo", "graphy"]

This preserves more information than [UNK].
Desiderata for Subword Tokenizers
A good subword tokenizer should satisfy several requirements.
| Requirement | Meaning |
|---|---|
| Coverage | It can encode nearly all input text |
| Compression | It represents common text with few tokens |
| Compositionality | Rare words are split into useful pieces |
| Stability | Small text changes should not cause extreme token changes |
| Multilingual fairness | Different languages should not be over-fragmented |
| Reversibility | Token IDs can be decoded back into text |
| Efficiency | Encoding and decoding should be fast |
There is no perfect tokenizer. Each algorithm chooses a tradeoff.
For example, byte-level BPE has excellent coverage, but rare Unicode text can require many tokens. WordPiece is effective for many NLP tasks, but depends on vocabulary design. Unigram tokenization is flexible, but decoding and training involve probabilistic segmentation.
Byte Pair Encoding
Byte Pair Encoding, or BPE, is one of the most widely used subword methods.
BPE begins with a base vocabulary, such as characters or bytes. It then repeatedly merges the most frequent adjacent pair.
Suppose the training corpus contains:
low
lower
lowest

Initially, these words might be represented as characters:

l o w
l o w e r
l o w e s t

If the pair l o is frequent, BPE merges it:

lo w
lo w e r
lo w e s t

Then lo w may be merged into low:

low
low e r
low e s t

After many merge steps, frequent strings become single tokens.
The learned BPE vocabulary consists of base symbols plus merged symbols. The merge rules determine how new text is segmented.
BPE Training Algorithm
A simplified BPE training procedure is:
```python
from collections import Counter, defaultdict

def get_pairs(words):
    # Count adjacent symbol pairs, weighted by word frequency.
    counts = defaultdict(int)
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, words):
    # Replace every occurrence of the pair with its merged symbol.
    merged = {}
    for symbols, freq in words.items():
        new_symbols = []
        i = 0
        while i < len(symbols):
            if (
                i < len(symbols) - 1
                and symbols[i] == pair[0]
                and symbols[i + 1] == pair[1]
            ):
                new_symbols.append(pair[0] + pair[1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        merged[tuple(new_symbols)] = freq
    return merged

corpus = [
    "low",
    "lower",
    "lowest",
    "newer",
    "wider",
]

words = Counter(tuple(word) for word in corpus)

for _ in range(10):
    pairs = get_pairs(words)
    if not pairs:
        break
    best_pair = max(pairs, key=pairs.get)
    words = merge_pair(best_pair, words)

print(words)
```

Real BPE tokenizers add details for whitespace handling, Unicode normalization, byte encoding, special tokens, and fast matching. The central idea remains repeated frequent-pair merging.
BPE Encoding
After training, BPE encodes new text by applying learned merges in order.
Suppose the learned merges include:
("l", "o") -> "lo"
("lo", "w") -> "low"
("e", "r") -> "er"
("low", "er") -> "lower"

Then:

lower

can become:

["lower"]

If a word is rare, fewer merges apply:

lowestness

may become:

["low", "est", "ness"]

The tokenizer can represent unseen words by falling back to smaller pieces.
BPE is deterministic once the merge table is fixed.
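The encoding step can be sketched in a few lines. The merge list here is the hypothetical one from the example above; real implementations use ranked merge tables and much faster matching:

```python
def bpe_encode(word, merges):
    # Start from individual characters and apply merges in learned order.
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                # Collapse the matching pair into one merged symbol.
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
print(bpe_encode("lower", merges))   # ["lower"]
print(bpe_encode("lowest", merges))  # ["low", "e", "s", "t"]
```

Because the merges are applied in a fixed order, the same input always produces the same output.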
Byte-Level BPE
Character-level BPE still needs a base alphabet. With Unicode text, this can become complicated.
Byte-level BPE uses bytes as the base symbols. Since any UTF-8 string is a sequence of bytes, byte-level BPE can encode arbitrary text.
This avoids unknown characters.
For example, text containing emojis, rare scripts, mathematical symbols, or corrupted input can still be represented.
The tradeoff is that some characters may require several bytes before merges are applied. Rare scripts can be tokenized less efficiently than high-resource scripts seen often during tokenizer training.
Byte-level BPE is common in GPT-style systems.
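A small illustration of why bytes give full coverage: every string, including emoji and rare scripts, is already a sequence of UTF-8 bytes, so a base vocabulary of 256 byte values can represent any input before merges are applied:

```python
# Any Unicode string decomposes into UTF-8 bytes; characters outside
# ASCII simply take more than one byte each.
for text in ["low", "café", "東京", "🙂"]:
    data = text.encode("utf-8")
    print(text, len(text), "chars ->", len(data), "bytes:", list(data))
```

Note that the emoji occupies four base symbols before any merges, which is exactly the fertility cost described above.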
WordPiece
WordPiece is another subword method. It is closely associated with BERT-style models.
Like BPE, it builds a vocabulary of subword units. But it selects vocabulary pieces using a likelihood-based criterion rather than only raw pair frequency.
A WordPiece tokenizer often uses continuation markers. For example:
unaffordable

may become:

["un", "##afford", "##able"]

The prefix ## indicates that the token continues a word rather than starting a new one.
This distinction helps the model separate word beginnings from word interiors.
WordPiece tokenization usually uses a greedy longest-match-first algorithm during encoding. At each position in a word, it selects the longest vocabulary piece that matches.
If no piece matches, the word may become [UNK], depending on the implementation.
WordPiece Encoding Example
Suppose the vocabulary contains:
["un", "afford", "##afford", "##able", "able", "car"]

The word:

unaffordable

might be segmented as:

["un", "##afford", "##able"]

A simplified greedy encoding procedure:
```python
def wordpiece_tokenize(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, shrinking until
        # a vocabulary piece matches.
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                # Continuation pieces carry the ## marker.
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        tokens.append(match)
        start = end
    return tokens
```

Real implementations handle normalization, punctuation, special tokens, and whitespace, but this shows the core idea.
Unigram Language Model Tokenization
Unigram tokenization takes a different approach. It starts with a large candidate vocabulary and prunes it.
Each token piece has a probability. A word may have many possible segmentations. The tokenizer chooses the segmentation with high probability, often using dynamic programming.
For a string x, the probability of one segmentation

x = (t_1, t_2, ..., t_k)

is

P(t_1, ..., t_k) = p(t_1) * p(t_2) * ... * p(t_k)

The tokenizer seeks a segmentation that maximizes this product of piece probabilities.
During training, the algorithm removes pieces that contribute least to corpus likelihood until the target vocabulary size is reached.
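The best-segmentation search can be sketched as a Viterbi pass over the word. The piece probabilities below are made up for illustration, and real trainers work in log space over much larger vocabularies:

```python
import math

# Hypothetical piece probabilities for illustration only.
piece_probs = {
    "inter": 0.02, "national": 0.02, "ization": 0.03,
    "international": 0.04, "internationalization": 0.001,
}

def viterbi_segment(word, probs):
    n = len(word)
    # best[i] = (log-prob of best segmentation of word[:i], backpointer)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack to recover the winning pieces.
    pieces, i = [], n
    while i > 0:
        start = best[i][1]
        pieces.append(word[start:i])
        i = start
    return list(reversed(pieces))

print(viterbi_segment("internationalization", piece_probs))
# ["international", "ization"]
```

Here the two-piece segmentation wins because its combined log-probability beats both the single whole-word token and the three-piece split.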
Unigram tokenization is used by SentencePiece and works well in multilingual settings.
SentencePiece
SentencePiece is a tokenizer framework rather than a single algorithm. It supports BPE and unigram tokenization.
A major design feature is that SentencePiece treats input as a raw Unicode string. It does not require pre-tokenization by whitespace.
Whitespace is represented explicitly, often using a marker such as ▁.
For example:

Deep learning

may become:

["▁Deep", "▁learning"]

This makes tokenization language-neutral. It is useful for languages where whitespace does not separate words in the same way as English.
SentencePiece is widely used in multilingual and sequence-to-sequence models.
Subword Regularization
Some unigram tokenizers support multiple possible segmentations for the same text. During training, one segmentation can be sampled.
For example:
internationalization

might be segmented as:

["international", "ization"]

or:

["inter", "national", "ization"]

or:

["internationalization"]

Sampling different segmentations acts as regularization. The model becomes less dependent on a single fixed segmentation and more robust to token boundary variation.
This technique is called subword regularization.
It is mostly used during training. At inference time, deterministic segmentation is usually preferred.
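The sampling step can be sketched as a weighted choice among candidate segmentations. The candidate list and weights below are hypothetical:

```python
import random

# Hypothetical segmentation candidates with illustrative probabilities.
candidates = [
    (["international", "ization"], 0.6),
    (["inter", "national", "ization"], 0.3),
    (["internationalization"], 0.1),
]

def sample_segmentation(candidates):
    segmentations, weights = zip(*candidates)
    return random.choices(segmentations, weights=weights, k=1)[0]

# During training, each pass over the data can see a different split.
for _ in range(3):
    print(sample_segmentation(candidates))
```

In practice the candidates and their probabilities come from the unigram model itself rather than a fixed list.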
Vocabulary Size Tradeoffs
Subword tokenizer vocabulary size controls an important tradeoff.
A small vocabulary produces longer token sequences but smaller embedding and output matrices.
A large vocabulary produces shorter token sequences but larger embedding and output matrices.
| Vocabulary size | Sequence length | Embedding/output cost | Rare word handling |
|---|---|---|---|
| Small | Longer | Lower | More fragmented |
| Medium | Moderate | Moderate | Usually good |
| Large | Shorter | Higher | Less fragmented |
For a language model with vocabulary size V and embedding dimension d, the input embedding table has

V × d

parameters.
If output weights are not tied to input embeddings, the output projection also has

V × d

parameters.
With V = 100,000 and d = 4,096, each such matrix has 409.6 million parameters.
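As a quick check of this arithmetic, assuming V = 100,000 and d = 4,096 (one pair of values consistent with the 409.6 million figure):

```python
# Parameter count of one embedding or output matrix.
V, d = 100_000, 4096
params = V * d
print(params)        # 409600000
print(params / 1e6)  # 409.6 (million)
```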
Large vocabularies can improve compression but increase memory and softmax cost.
Token Fertility
Token fertility is the number of tokens required to represent a span of text.
A tokenizer may represent English compactly but split another language into many more tokens. This means users of the second language consume more context length and compute for the same amount of meaning.
For example, two sentences with similar semantic content may become:
| Language | Characters | Tokens | Token fertility |
|---|---|---|---|
| English | 80 | 18 | Low |
| Language B | 80 | 42 | High |
High token fertility can hurt multilingual performance. It reduces effective context length and increases training and inference cost.
Tokenizer training data should therefore be balanced across target languages and domains.
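Fertility can be measured with a few lines, given any object exposing an encode() method. The ToyTokenizer below is a stand-in used only to make the example runnable, not a real library:

```python
# Tokens per word, averaged over a corpus.
def token_fertility(texts, tokenizer):
    total_tokens = sum(len(tokenizer.encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Hypothetical tokenizer that splits every word roughly in half.
class ToyTokenizer:
    def encode(self, text):
        tokens = []
        for word in text.split():
            mid = max(1, len(word) // 2)
            head, tail = word[:mid], word[mid:]
            tokens.extend([head, tail] if tail else [head])
        return tokens

print(token_fertility(["deep learning models"], ToyTokenizer()))  # 2.0
```

Computing this metric per language, with the same tokenizer, is a direct way to detect the imbalance described above.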
Special Handling for Whitespace
Whitespace is simple for humans but complicated for tokenizers.
Consider:
hello world

A tokenizer may produce:

["hello", " world"]

where the leading space is part of the second token.
Another tokenizer may produce:

["▁hello", "▁world"]

where the marker represents a word boundary.
For code models, whitespace can carry semantics. In Python, indentation affects program structure. For Markdown and YAML, whitespace can also matter.
A tokenizer that destroys or normalizes important whitespace can harm model performance.
Numbers and Symbols
Numbers create special problems.
A tokenizer might split:
123456789as:
["123", "456", "789"]or:
["1", "2", "3", "4", "5", "6", "7", "8", "9"]or:
["123456", "789"]Chunked number tokens can make exact arithmetic difficult because the model does not see digits uniformly.
Similar issues occur for dates, units, chemical formulas, mathematical notation, and identifiers.
For code and scientific text, tokenizer inspection is mandatory. Poor segmentation can create systematic model weaknesses.
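The two chunking policies can be contrasted directly. This is a sketch of the number-segmentation step only, not a full tokenizer:

```python
import re

# Policy 1: every digit is its own token; non-digit runs stay whole.
def split_digits(text):
    return re.findall(r"\d|\D+", text)

# Policy 2: fixed-width digit chunks, as some merge-based vocabularies
# effectively produce.
def split_chunks(text, size=3):
    return [text[i:i + size] for i in range(0, len(text), size)]

print(split_digits("123456789"))  # ['1', '2', ..., '9']
print(split_chunks("123456789"))  # ['123', '456', '789']
```

Inspecting which policy a tokenizer actually implements, on real numbers from the target domain, is part of the mandatory check described above.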
Subword Methods for Code
Subword tokenization works well for code because identifiers are often compositional.
For example:
load_user_profile_by_id

can be split into meaningful parts:

["load", "_user", "_profile", "_by", "_id"]

However, code also contains syntax where exact characters matter.
Operators such as:

== != <= >= -> => :: :=

should be represented consistently.
Whitespace and indentation may also need special treatment.
A code tokenizer should be tested on:
| Feature | Examples |
|---|---|
| Identifiers | getUserID, snake_case_name |
| Operators | ==, !=, =>, :: |
| Strings | quoted text, escapes |
| Comments | prose inside code |
| Whitespace | indentation and alignment |
| Mixed content | notebooks, Markdown, SQL, HTML |
Evaluating a Subword Tokenizer
Tokenizer evaluation should include both quantitative and qualitative checks.
Useful quantitative metrics:
| Metric | Meaning |
|---|---|
| Average tokens per character | Measures compression |
| Average tokens per word | Measures fertility |
| Unknown token rate | Should be near zero for modern tokenizers |
| Vocabulary coverage by language | Detects imbalance |
| Sequence length distribution | Affects context and training cost |
Useful qualitative checks:
```python
examples = [
    "Deep learning with PyTorch.",
    "electroencephalography",
    "get_user_profile_by_id(user_id)",
    "Xin chào thế giới.",
    "東京で機械学習を勉強しています。",
    "price = $123,456.78",
]

for text in examples:
    ids = tokenizer.encode(text)
    pieces = tokenizer.convert_ids_to_tokens(ids)
    print(text)
    print(pieces)
    print(len(ids))
    print()
```

The goal is not perfect human-readable pieces. The goal is efficient, stable, and useful segmentation for the model’s training distribution.
Subword Methods and Model Architecture
Tokenization interacts with architecture.
A tokenizer that produces longer sequences increases attention cost. For a transformer, standard self-attention over a sequence of length T has cost roughly proportional to

T²

Thus a tokenizer with poor compression can significantly increase training and inference cost.
A tokenizer with a large vocabulary increases the cost of embeddings and output logits.
The output logits for a batch of size B have shape

[B, T, V]

When V is large, the final linear layer and softmax can become expensive.
This is why tokenizer choice must be considered part of model design, not merely preprocessing.
Practical PyTorch Workflow
In a PyTorch language model project, tokenization usually happens before tensors enter the model.
A typical dataset returns token IDs:
```python
import torch
from torch.utils.data import Dataset

class TokenDataset(Dataset):
    def __init__(self, texts, tokenizer, block_size):
        self.examples = []
        for text in texts:
            ids = tokenizer.encode(text)
            # Each example holds block_size + 1 IDs so inputs and
            # targets can be offset by one position.
            for i in range(0, len(ids) - block_size):
                chunk = ids[i:i + block_size + 1]
                self.examples.append(torch.tensor(chunk))

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        chunk = self.examples[idx]
        x = chunk[:-1]
        y = chunk[1:]
        return x, y
```

The model receives integer tensors:

```python
x, y = batch
# x: [B, T]
# y: [B, T]
logits = model(x)
# logits: [B, T, V]
```

The embedding layer maps token IDs to vectors:

```python
embedding = torch.nn.Embedding(vocab_size, hidden_dim)
h = embedding(x)
# h: [B, T, D]
```

The tokenizer defines vocab_size, and therefore determines the first and last layers of the model.
Summary
Subword methods solve the open vocabulary problem by splitting rare text into reusable pieces. BPE builds tokens through frequent-pair merges. WordPiece uses likelihood-oriented subword construction and continuation markers. Unigram tokenization treats segmentation probabilistically and prunes a candidate vocabulary. SentencePiece provides a language-neutral framework for raw-text tokenization.
Subword tokenizer design affects sequence length, memory cost, multilingual quality, code modeling, numerical behavior, and model robustness. In modern deep learning systems, the tokenizer is part of the model’s architecture and deployment interface.