Subword Methods

Subword methods split text into units smaller than words but usually larger than single characters. They are the standard tokenization strategy for modern language models because they balance coverage, compression, and generalization.

A word-level tokenizer struggles with rare and unseen words. A character-level tokenizer can represent anything but produces long sequences. Subword tokenization sits between these extremes.

For example:

unhelpfulness

may be split as:

["un", "help", "ful", "ness"]

The model can represent the word even if the full word did not appear often during training. It can reuse pieces learned from related words:

help
helpful
unhelpful
helpfulness

Subword methods are especially important for large language models, multilingual systems, code models, and domains with many rare terms.

The Open Vocabulary Problem

Natural language has an open vocabulary. New words, names, spellings, hashtags, identifiers, and technical terms appear constantly.

A word-level vocabulary cannot include every possible string. If a word is missing, the tokenizer must map it to an unknown token:

[UNK]

This loses information. For example:

antidisestablishmentarianism

and

electroencephalography

may both become:

[UNK]

The model then cannot distinguish them.

Subword tokenization reduces this problem by decomposing rare words into known parts. Even if the full word is unseen, its pieces may be familiar.

electroencephalography

might become:

["electro", "ence", "phalo", "graphy"]

This preserves more information than [UNK].

Desiderata for Subword Tokenizers

A good subword tokenizer should satisfy several requirements.

Requirement | Meaning
Coverage | It can encode nearly all input text
Compression | It represents common text with few tokens
Compositionality | Rare words are split into useful pieces
Stability | Small text changes should not cause extreme token changes
Multilingual fairness | Different languages should not be over-fragmented
Reversibility | Token IDs can be decoded back into text
Efficiency | Encoding and decoding should be fast

There is no perfect tokenizer. Each algorithm chooses a tradeoff.

For example, byte-level BPE has excellent coverage, but rare Unicode text can require many tokens. WordPiece is effective for many NLP tasks, but depends on vocabulary design. Unigram tokenization is flexible, but its training and segmentation rely on probabilistic modeling.

Byte Pair Encoding

Byte Pair Encoding, or BPE, is one of the most widely used subword methods.

BPE begins with a base vocabulary, such as characters or bytes. It then repeatedly merges the most frequent adjacent pair.

Suppose the training corpus contains:

low
lower
lowest

Initially, these words might be represented as characters:

l o w
l o w e r
l o w e s t

If the pair l o is frequent, BPE merges it:

lo w
lo w e r
lo w e s t

Then lo w may be merged into low:

low
low e r
low e s t

After many merge steps, frequent strings become single tokens.

The learned BPE vocabulary consists of base symbols plus merged symbols. The merge rules determine how new text is segmented.

BPE Training Algorithm

A simplified BPE training procedure is:

from collections import Counter, defaultdict

def get_pairs(words):
    counts = defaultdict(int)

    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq

    return counts

def merge_pair(pair, words):
    merged = {}

    for symbols, freq in words.items():
        new_symbols = []
        i = 0

        while i < len(symbols):
            if (
                i < len(symbols) - 1
                and symbols[i] == pair[0]
                and symbols[i + 1] == pair[1]
            ):
                new_symbols.append(pair[0] + pair[1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1

        merged[tuple(new_symbols)] = freq

    return merged

corpus = [
    "low",
    "lower",
    "lowest",
    "newer",
    "wider",
]

words = Counter(tuple(word) for word in corpus)

for _ in range(10):
    pairs = get_pairs(words)
    if not pairs:
        break

    best_pair = max(pairs, key=pairs.get)
    words = merge_pair(best_pair, words)

print(words)

Real BPE tokenizers add details for whitespace handling, Unicode normalization, byte encoding, special tokens, and fast matching. The central idea remains repeated frequent-pair merging.

BPE Encoding

After training, BPE encodes new text by applying learned merges in order.

Suppose the learned merges include:

("l", "o") -> "lo"
("lo", "w") -> "low"
("e", "r") -> "er"
("low", "er") -> "lower"

Then:

lower

can become:

["lower"]

If a word is rare, fewer merges apply:

lowestness

may become:

["low", "est", "ness"]

The tokenizer can represent unseen words by falling back to smaller pieces.

BPE is deterministic once the merge table is fixed.
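
A minimal sketch of this encoding step, assuming merges are stored as an ordered list from earliest to latest learned (real tokenizers look up merge ranks and use faster data structures):

def bpe_encode(word, merges):
    # merges: list of (left, right) pairs in the order they were learned
    symbols = list(word)

    for left, right in merges:
        new_symbols = []
        i = 0

        while i < len(symbols):
            if (
                i < len(symbols) - 1
                and symbols[i] == left
                and symbols[i + 1] == right
            ):
                new_symbols.append(left + right)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1

        symbols = new_symbols

    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

print(bpe_encode("lower", merges))   # ['lower']
print(bpe_encode("lowest", merges))  # ['low', 'e', 's', 't']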

Byte-Level BPE

Character-level BPE still needs a base alphabet. With Unicode text, this alphabet becomes very large and may still miss characters that appear at inference time.

Byte-level BPE uses bytes as the base symbols. Since any UTF-8 string is a sequence of bytes, byte-level BPE can encode arbitrary text.

This avoids unknown characters.

For example, text containing emojis, rare scripts, mathematical symbols, or corrupted input can still be represented.

The tradeoff is that some characters may require several bytes before merges are applied. Rare scripts can be tokenized less efficiently than high-resource scripts seen often during tokenizer training.
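
A quick way to see the byte-level view is to inspect the UTF-8 bytes of a short string; ASCII characters take one byte each, while accented letters and emoji take several:

text = "naïve 🙂"

data = text.encode("utf-8")

print(len(text))   # 7 characters
print(list(data))  # 11 byte values
# 'n', 'a', 'v', 'e' and the space are one byte each,
# 'ï' takes two bytes and the emoji takes four,
# so byte-level BPE starts from 11 base symbols for this 7-character string.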

Byte-level BPE is common in GPT-style systems.

WordPiece

WordPiece is another subword method. It is closely associated with BERT-style models.

Like BPE, it builds a vocabulary of subword units. But it selects vocabulary pieces using a likelihood-based criterion rather than only raw pair frequency.

A WordPiece tokenizer often uses continuation markers. For example:

unaffordable

may become:

["un", "##afford", "##able"]

The prefix ## indicates that the token continues a word rather than starting a new one.

This distinction helps the model separate word beginnings from word interiors.

WordPiece tokenization usually uses a greedy longest-match-first algorithm during encoding. At each position in a word, it selects the longest vocabulary piece that matches.

If no piece matches, the word may become [UNK], depending on the implementation.

WordPiece Encoding Example

Suppose the vocabulary contains:

["un", "afford", "##afford", "##able", "able", "car"]

The word:

unaffordable

might be segmented as:

["un", "##afford", "##able"]

A simplified greedy encoding procedure:

def wordpiece_tokenize(word, vocab):
    tokens = []
    start = 0

    while start < len(word):
        end = len(word)
        match = None

        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece

            if piece in vocab:
                match = piece
                break

            end -= 1

        if match is None:
            return ["[UNK]"]

        tokens.append(match)
        start = end

    return tokens
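
Using the vocabulary from the example above, the function reproduces the segmentation shown earlier and falls back to [UNK] when no piece matches:

vocab = {"un", "afford", "##afford", "##able", "able", "car"}

print(wordpiece_tokenize("unaffordable", vocab))  # ['un', '##afford', '##able']
print(wordpiece_tokenize("xyz", vocab))           # ['[UNK]']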

Real implementations handle normalization, punctuation, special tokens, and whitespace, but this shows the core idea.

Unigram Language Model Tokenization

Unigram tokenization takes a different approach. It starts with a large candidate vocabulary and prunes it.

Each token piece has a probability. A word may have many possible segmentations. The tokenizer chooses the highest-probability segmentation, typically using dynamic programming (the Viterbi algorithm).

For a string s, the probability of one segmentation

z = (z_1, z_2, \ldots, z_k)

is

p(z) = \prod_{i=1}^{k} p(z_i).

The tokenizer seeks the segmentation

z^\star = \arg\max_z p(z).
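
A minimal sketch of this maximization using dynamic programming (the Viterbi algorithm), with a small hypothetical table of piece log-probabilities:

import math

def viterbi_segment(text, piece_logprobs):
    n = len(text)

    # best[i] holds (best log-probability of text[:i], start index of the last piece)
    best = [(-math.inf, 0) for _ in range(n + 1)]
    best[0] = (0.0, 0)

    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]

            if piece in piece_logprobs:
                score = best[start][0] + piece_logprobs[piece]

                if score > best[end][0]:
                    best[end] = (score, start)

    if best[n][0] == -math.inf:
        return None  # the text cannot be covered by the available pieces

    # Walk backwards to recover the best segmentation
    pieces = []
    end = n

    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start

    return list(reversed(pieces))

# Hypothetical log-probabilities, for illustration only
piece_logprobs = {
    "inter": -3.0,
    "national": -3.5,
    "ization": -3.0,
    "international": -5.0,
    "internationalization": -9.5,
}

print(viterbi_segment("internationalization", piece_logprobs))
# ['international', 'ization']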

During training, the algorithm removes pieces that contribute least to corpus likelihood until the target vocabulary size is reached.

Unigram tokenization is used by SentencePiece and works well in multilingual settings.

SentencePiece

SentencePiece is a tokenizer framework rather than a single algorithm. It supports BPE and unigram tokenization.

A major design feature is that SentencePiece treats input as a raw Unicode string. It does not require pre-tokenization by whitespace.

Whitespace is represented explicitly, often using a marker such as:

▁

For example:

Deep learning

may become:

["▁Deep", "▁learning"]

This makes tokenization language-neutral. It is useful for languages where whitespace does not separate words in the same way as English.

SentencePiece is widely used in multilingual and sequence-to-sequence models.
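
With the sentencepiece Python package, training and encoding typically look like the following sketch (the file names and vocabulary size are placeholders):

import sentencepiece as spm

# Train a unigram model directly on raw text; no whitespace pre-tokenization is required
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder path to training text
    model_prefix="tokenizer",
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

print(sp.encode("Deep learning", out_type=str))
# e.g. ['▁Deep', '▁learning']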

Subword Regularization

Some unigram tokenizers support multiple possible segmentations for the same text. During training, one segmentation can be sampled.

For example:

internationalization

might be segmented as:

["international", "ization"]

or:

["inter", "national", "ization"]

or:

["internationalization"]

Sampling different segmentations acts as regularization. The model becomes less dependent on a single fixed segmentation and more robust to token boundary variation.

This technique is called subword regularization.

It is mostly used during training. At inference time, deterministic segmentation is usually preferred.
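
Reusing the sp processor from the SentencePiece sketch above, sampling a segmentation during training can look roughly like this (the parameter values are illustrative):

# Deterministic segmentation, typically used at inference time
print(sp.encode("internationalization", out_type=str))

# Sampled segmentation for training-time subword regularization
print(
    sp.encode(
        "internationalization",
        out_type=str,
        enable_sampling=True,
        alpha=0.1,      # smoothing of the sampling distribution
        nbest_size=-1,  # sample over all candidate segmentations
    )
)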

Vocabulary Size Tradeoffs

Subword tokenizer vocabulary size controls an important tradeoff.

A small vocabulary produces longer token sequences but smaller embedding and output matrices.

A large vocabulary produces shorter token sequences but larger embedding and output matrices.

Vocabulary size | Sequence length | Embedding/output cost | Rare word handling
Small | Longer | Lower | More fragmented
Medium | Moderate | Moderate | Usually good
Large | Shorter | Higher | Less fragmented

For a language model with embedding dimension d, the input embedding table has

|V| \times d

parameters.

If output weights are not tied to input embeddings, the output projection also has

d \times |V|

parameters.

With |V| = 100{,}000 and d = 4096, each such matrix has 409.6 million parameters.

Large vocabularies can improve compression but increase memory and softmax cost.
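
A quick back-of-the-envelope comparison for a few illustrative vocabulary sizes:

hidden_dim = 4096

for vocab_size in (32_000, 100_000, 256_000):
    params = vocab_size * hidden_dim
    print(f"{vocab_size:>7} tokens -> {params / 1e6:.1f}M parameters per matrix")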

Token Fertility

Token fertility is the number of tokens required to represent a given span of text, often reported as the average number of tokens per word.

A tokenizer may represent English compactly but split another language into many more tokens. This means users of the second language consume more context length and compute for the same amount of meaning.

For example, two sentences with similar semantic content may become:

Language | Characters | Tokens | Token fertility
English | 80 | 18 | Low
Language B | 80 | 42 | High

High token fertility can hurt multilingual performance. It reduces effective context length and increases training and inference cost.

Tokenizer training data should therefore be balanced across target languages and domains.
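
A rough compression check that works across languages, assuming a tokenizer object with a Hugging Face-style encode method:

def tokens_per_char(texts, tokenizer):
    # Lower is better; a language-neutral proxy for fertility and compression
    total_tokens = sum(len(tokenizer.encode(text)) for text in texts)
    total_chars = sum(len(text) for text in texts)
    return total_tokens / max(total_chars, 1)

# Compare sets of sentences with similar meaning in different languages
# print(tokens_per_char(english_sentences, tokenizer))
# print(tokens_per_char(other_language_sentences, tokenizer))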

Special Handling for Whitespace

Whitespace is simple for humans but complicated for tokenizers.

Consider:

hello world

A tokenizer may produce:

["hello", " world"]

where the leading space is part of the second token.

Another tokenizer may produce:

["▁hello", "▁world"]

where the marker represents a word boundary.

For code models, whitespace can carry semantics. In Python, indentation affects program structure. For Markdown and YAML, whitespace can also matter.

A tokenizer that destroys or normalizes important whitespace can harm model performance.

Numbers and Symbols

Numbers create special problems.

A tokenizer might split:

123456789

as:

["123", "456", "789"]

or:

["1", "2", "3", "4", "5", "6", "7", "8", "9"]

or:

["123456", "789"]

Chunked number tokens can make exact arithmetic difficult because the model does not see digits uniformly.

Similar issues occur for dates, units, chemical formulas, mathematical notation, and identifiers.

For code and scientific text, tokenizer inspection is mandatory. Poor segmentation can create systematic model weaknesses.

Subword Methods for Code

Subword tokenization works well for code because identifiers are often compositional.

For example:

load_user_profile_by_id

can be split into meaningful parts:

["load", "_user", "_profile", "_by", "_id"]

However, code also contains syntax where exact characters matter.

Operators such as:

== != <= >= -> => :: :=

should be represented consistently.

Whitespace and indentation may also need special treatment.

A code tokenizer should be tested on:

Feature | Examples
Identifiers | getUserID, snake_case_name
Operators | ==, !=, =>, ::
Strings | quoted text, escapes
Comments | prose inside code
Whitespace | indentation and alignment
Mixed content | notebooks, Markdown, SQL, HTML

Evaluating a Subword Tokenizer

Tokenizer evaluation should include both quantitative and qualitative checks.

Useful quantitative metrics:

Metric | Meaning
Average tokens per character | Measures compression
Average tokens per word | Measures fertility
Unknown token rate | Should be near zero for modern tokenizers
Vocabulary coverage by language | Detects imbalance
Sequence length distribution | Affects context and training cost

Useful qualitative checks, here assuming a Hugging Face-style tokenizer with encode and convert_ids_to_tokens methods:

examples = [
    "Deep learning with PyTorch.",
    "electroencephalography",
    "get_user_profile_by_id(user_id)",
    "Xin chào thế giới.",
    "東京で機械学習を勉強しています。",
    "price = $123,456.78",
]

for text in examples:
    ids = tokenizer.encode(text)
    pieces = tokenizer.convert_ids_to_tokens(ids)

    print(text)
    print(pieces)
    print(len(ids))
    print()

The goal is not perfect human-readable pieces. The goal is efficient, stable, and useful segmentation for the model’s training distribution.

Subword Methods and Model Architecture

Tokenization interacts with architecture.

A tokenizer that produces longer sequences increases attention cost. For a transformer, standard self-attention has cost roughly proportional to

T^2.

Thus a tokenizer with poor compression can significantly increase training and inference cost.

A tokenizer with a large vocabulary increases the cost of embeddings and output logits.

The output logits for a batch have shape

[B, T, |V|].

When |V| is large, the final linear layer and softmax can become expensive.
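
To make this concrete, a rough estimate of the memory needed just to hold one batch of fp32 logits, with illustrative batch size, sequence length, and vocabulary size:

B, T, V = 8, 4096, 100_000

logit_floats = B * T * V

print(f"{logit_floats * 4 / 1e9:.1f} GB of fp32 logits per batch")
# roughly 13.1 GB, before any other activation memory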

This is why tokenizer choice must be considered part of model design, not merely preprocessing.

Practical PyTorch Workflow

In a PyTorch language model project, tokenization usually happens before tensors enter the model.

A typical dataset returns token IDs:

import torch
from torch.utils.data import Dataset

class TokenDataset(Dataset):
    def __init__(self, texts, tokenizer, block_size):
        self.examples = []

        for text in texts:
            ids = tokenizer.encode(text)

            for i in range(0, len(ids) - block_size):
                chunk = ids[i:i + block_size + 1]
                self.examples.append(torch.tensor(chunk))

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        chunk = self.examples[idx]

        x = chunk[:-1]
        y = chunk[1:]

        return x, y
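
Batching is then handled by a standard DataLoader; a minimal sketch, assuming texts and tokenizer are defined elsewhere:

from torch.utils.data import DataLoader

dataset = TokenDataset(texts, tokenizer, block_size=128)
loader = DataLoader(dataset, batch_size=8, shuffle=True)

batch = next(iter(loader))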

The model receives integer tensors:

x, y = batch

# x: [B, T]
# y: [B, T]

logits = model(x)

# logits: [B, T, V]

The embedding layer maps token IDs to vectors:

embedding = torch.nn.Embedding(vocab_size, hidden_dim)

h = embedding(x)

# h: [B, T, D]

The tokenizer defines vocab_size, and therefore determines the first and last layers of the model.

Summary

Subword methods solve the open vocabulary problem by splitting rare text into reusable pieces. BPE builds tokens through frequent-pair merges. WordPiece uses likelihood-oriented subword construction and continuation markers. Unigram tokenization treats segmentation probabilistically and prunes a candidate vocabulary. SentencePiece provides a language-neutral framework for raw-text tokenization.

Subword tokenizer design affects sequence length, memory cost, multilingual quality, code modeling, numerical behavior, and model robustness. In modern deep learning systems, the tokenizer is part of the model’s architecture and deployment interface.