A language model cannot process raw text directly. Text must first be converted into a sequence of token IDs. The procedure that performs this conversion is called tokenization.
Older NLP systems often used word-level tokenization. A sentence was split into words, and each word received an integer ID. This approach is simple, but it has a serious limitation: natural language has too many possible words. Names, numbers, spelling variants, compounds, rare terms, and domain-specific words make the vocabulary grow quickly.
Subword tokenization solves this problem by representing text using units smaller than words but usually larger than characters. A rare word can be decomposed into pieces that the model already knows.
For example, unhappiness might be split as un happiness, as un happi ness, or as un ##happiness, depending on the tokenizer.
The model does not need a separate vocabulary entry for every possible word. It only needs a vocabulary of reusable pieces.
Why Word-Level Tokenization Breaks Down
Suppose a vocabulary contains these words:
the, cat, sat, on, mat
The sentence
the cat sat on the mat
can be encoded cleanly:
[the, cat, sat, on, the, mat]
But the sentence
the kitten sat on the carpet
contains kitten and carpet, which are missing from the vocabulary. These are called out-of-vocabulary tokens.
A word-level system may replace each unknown word with a special token:
[the, <unk>, sat, on, the, <unk>]
This loses information. The model cannot distinguish kitten from carpet, because both become <unk>.
Subword tokenization reduces this problem. If the vocabulary contains smaller units such as kit, ten, car, and pet, then the model can still represent rare words:
kitten -> kit ten
carpet -> car pet
The representation is imperfect, but it preserves much more information than <unk>.
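To make the difference concrete, here is a toy sketch in Python. The word-level encoder maps unknown words to <unk>, while a small hypothetical subword vocabulary lets a greedy longest-match split recover rare words. The vocabularies and the splitting rule are illustrative only, not taken from any real tokenizer.

```python
word_vocab = {"the", "cat", "sat", "on", "mat"}
subword_vocab = {"the", "sat", "on", "kit", "ten", "car", "pet"}

def word_level(tokens):
    # Any word outside the vocabulary collapses to <unk>.
    return [t if t in word_vocab else "<unk>" for t in tokens]

def subword_level(word):
    # Greedy longest-match split over the subword vocabulary (illustrative only).
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in subword_vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            pieces.append("<unk>")
            start += 1
    return pieces

print(word_level("the kitten sat on the carpet".split()))
# ['the', '<unk>', 'sat', 'on', 'the', '<unk>']
print(subword_level("kitten"), subword_level("carpet"))
# ['kit', 'ten'] ['car', 'pet']
```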
Characters, Words, and Subwords
There are three common levels of text representation.
| Level | Example tokenization of unhappiness | Strength | Weakness |
|---|---|---|---|
| Character | u n h a p p i n e s s | Handles any word | Long sequences |
| Word | unhappiness | Short sequences | Large vocabulary, many unknowns |
| Subword | un happiness or un happi ness | Balanced vocabulary and sequence length | Token boundaries can look unnatural |
Character-level tokenization avoids unknown words almost entirely, but sequences become long. A sentence of 20 words may become 100 or more characters. Longer sequences are more expensive for transformers, because attention cost grows roughly with the square of sequence length.
Word-level tokenization gives shorter sequences, but the vocabulary can become very large. It also handles rare and newly created words poorly.
Subword tokenization gives a practical compromise. It keeps the vocabulary manageable while avoiding most unknown tokens.
Vocabulary and Token IDs
A tokenizer has a vocabulary. The vocabulary maps token strings to integer IDs:
| Token | ID |
|---|---|
| <pad> | 0 |
| <unk> | 1 |
| <bos> | 2 |
| <eos> | 3 |
| the | 4 |
| cat | 5 |
| un | 6 |
| happy | 7 |
| ness | 8 |
When the tokenizer sees text, it splits the text into tokens and then replaces each token with its ID.
For example, the cat may become [4, 5]. The resulting integer tensor can be passed into an embedding layer:
import torch
import torch.nn as nn
token_ids = torch.tensor([[4, 5]], dtype=torch.long)
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=256)
x = embedding(token_ids)
print(x.shape)  # torch.Size([1, 2, 256])
The tokenizer handles text. The embedding layer handles token IDs. These are separate parts of the language model pipeline.
Byte Pair Encoding
Byte Pair Encoding, or BPE, is one of the most widely used subword tokenization algorithms.
The basic idea is to start with small units and repeatedly merge frequent adjacent pairs. In text tokenization, the initial units may be characters or bytes. The algorithm counts which adjacent pairs occur most often and merges the most frequent pair into a new token.
Suppose the training text contains many examples of
l o w
l o w e r
l o w e s t
The pair l o may be frequent, so it is merged:
lo w
lo w e r
lo w e s t
Then lo w may be frequent, so it is merged:
low
low e r
low e s t
Later, e r may become er, and e s t may become est.
After many merge steps, the vocabulary contains common words and useful subword fragments.
The number of merge steps controls vocabulary size. More merges produce larger vocabularies and shorter sequences. Fewer merges produce smaller vocabularies and longer sequences.
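The counting-and-merging loop is short enough to sketch directly. The following is a minimal illustration on the toy corpus above, assuming each word starts as a sequence of characters; a real BPE trainer also tracks word frequencies, end-of-word markers, and the ordered merge table used later for encoding.

```python
from collections import Counter

# Toy corpus: each word is a tuple of its current symbols.
corpus = [tuple("low"), tuple("lower"), tuple("lowest")] * 10

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])  # the newly merged token
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(tuple(out))
    return merged

for step in range(4):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(step, pair, corpus[:3])
```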
WordPiece
WordPiece is another common subword algorithm. It is used by BERT-style models.
Like BPE, WordPiece builds a vocabulary from smaller units. The difference lies in how candidate merges are selected. WordPiece chooses merges according to a likelihood-based scoring rule rather than pure pair frequency.
WordPiece tokenizers often mark tokens that continue a word with a prefix such as ##.
For example, playing may become play ##ing, and unaffordable may become un ##afford ##able. The ## notation means that the subword does not begin a new word. It is part of the preceding word.
This convention helps preserve word boundary information.
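At encoding time, WordPiece-style tokenizers typically segment each word with a greedy longest-match-first search: take the longest vocabulary entry that matches the start of the word, then continue with ##-prefixed pieces. A minimal sketch with a made-up vocabulary:

```python
vocab = {"un", "play", "##ing", "##afford", "##able"}

def wordpiece(word, vocab, unk="<unk>"):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no valid segmentation for this word
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("playing", vocab))       # ['play', '##ing']
print(wordpiece("unaffordable", vocab))  # ['un', '##afford', '##able']
```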
Unigram Tokenization
Unigram tokenization takes a different approach. Instead of building a vocabulary only by merging pieces, it starts with a large candidate vocabulary and learns which pieces are useful.
Each possible tokenization of a word is assigned a probability. The tokenizer chooses a likely segmentation under the learned subword model.
For example, the word internationalization could be segmented in several ways:
international ization
inter national ization
international iz ation
A unigram tokenizer learns which segmentation is most probable from data.
SentencePiece is a widely used implementation of unigram tokenization. It is especially useful for multilingual models because it can treat whitespace as an ordinary symbol and avoid assumptions about language-specific word boundaries.
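If the sentencepiece package is available, training and using a unigram model looks roughly like the sketch below. The corpus path, model prefix, and vocabulary size are placeholder choices, and the exact segmentation depends entirely on the training data.

```python
import sentencepiece as spm

# Train a unigram model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # placeholder path
    model_prefix="uni",
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="uni.model")
print(sp.encode("internationalization", out_type=str))
# e.g. ['▁intern', 'ation', 'al', 'ization'], depending on the corpus
```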
Byte-Level Tokenization
Many modern language models use byte-level tokenization. Instead of starting from Unicode characters or words, the tokenizer begins from bytes.
A byte-level tokenizer can represent any text that can be encoded as bytes. This greatly reduces the chance of unknown tokens. It also handles mixed languages, symbols, emojis, code, and unusual strings more robustly.
For example, byte-level BPE starts with a base vocabulary over byte values and then learns merges over byte sequences.
The main advantage is coverage. The tokenizer can encode nearly any input text. The main cost is that unusual text may require more tokens.
This tradeoff is usually acceptable for large language models, where robust coverage matters more than linguistically clean token boundaries.
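The starting point is simply the UTF-8 byte sequence of the text, which is why any string can be represented even before any merges are learned. Every byte value lies in the range 0 to 255, so the base vocabulary needs at most 256 entries.

```python
text = "café 🙂"
byte_values = list(text.encode("utf-8"))
print(byte_values)
# [99, 97, 102, 195, 169, 32, 240, 159, 153, 130]
```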
Special Tokens
Tokenizers usually reserve special tokens for structural roles.
| Token | Common meaning |
|---|---|
| <pad> | Padding token used to make sequences the same length |
| <unk> | Unknown token |
| <bos> | Beginning of sequence |
| <eos> | End of sequence |
| <mask> | Masked token for masked language modeling |
| <sep> | Separator between text segments |
| <cls> | Classification token |
Different model families use different conventions. BERT commonly uses [CLS], [SEP], and [MASK]. GPT-style models often use an end-of-text token and may not use a padding token during pretraining.
Special tokens must be handled consistently. The tokenizer, embedding layer, loss function, attention mask, and decoding procedure must agree on their IDs and meanings.
Attention Masks
When sequences have different lengths, they are padded to a common length inside a batch.
For example:
the cat sat
the cat sat on the mat
may become
the cat sat <pad> <pad> <pad>
the cat sat on the mat
The token ID tensor may have shape [batch_size, sequence_length]. But the model also needs to know which positions are real tokens and which positions are padding. This is done with an attention mask.
A typical mask uses 1 for real tokens and 0 for padding:
token_ids = torch.tensor([
[4, 5, 6, 0, 0, 0],
[4, 5, 6, 7, 4, 8],
])
attention_mask = (token_ids != 0).long()
print(attention_mask)
Output:
tensor([[1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1]])
The transformer uses this mask to prevent attention from reading padding positions.
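Inside the model, a common way to apply the mask is to set attention scores at padded key positions to a large negative value before the softmax, so those positions receive essentially zero attention weight. A minimal sketch, reusing the attention_mask above and random scores in place of real query-key products:

```python
scores = torch.randn(2, 6, 6)  # [batch, query positions, key positions]

# Broadcast the padding mask over the query dimension and block padded keys.
key_mask = attention_mask[:, None, :].bool()   # [batch, 1, key positions]
masked_scores = scores.masked_fill(~key_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(weights[0, 0])  # attention from the first query; padded keys get weight 0
```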
Causal Masks
Autoregressive language models predict the next token from previous tokens. During training, the model must not look ahead at future tokens.
For a sequence
the cat sat
the model should predict:
cat from the
sat from the cat
It should not use sat when predicting cat.
This constraint is enforced by a causal mask. In a transformer decoder, each position can attend only to itself and earlier positions.
For a sequence of length 4, the allowed attention pattern is:
1 0 0 0
1 1 0 0
1 1 1 0
1 1 1 1
In PyTorch:
T = 4
causal_mask = torch.tril(torch.ones(T, T))
print(causal_mask)
Output:
tensor([[1., 0., 0., 0.],
        [1., 1., 0., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.]])
Padding masks and causal masks often work together. Padding masks hide padding tokens. Causal masks hide future tokens.
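The combination can be expressed as an elementwise product (a logical AND): a key position is visible only if it is a real token and not in the future. A small sketch:

```python
T = 6
causal = torch.tril(torch.ones(T, T, dtype=torch.long))   # [T, T]
padding = torch.tensor([[1, 1, 1, 0, 0, 0]])               # [batch, T]

# Visible only where the key is not padding AND not in the future.
combined = causal[None, :, :] * padding[:, None, :]        # [batch, T, T]
print(combined[0])
```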
Tokenization and Sequence Length
Tokenization changes the effective length of the input.
A sentence with 20 words might become 24 tokens, 35 tokens, or 80 tokens depending on the tokenizer and language. This matters because transformer computation depends heavily on sequence length.
For self-attention, a sequence of length n creates an attention matrix of size n × n.
Doubling the sequence length roughly quadruples the size of the attention matrix.
This is one reason tokenizer design matters. A tokenizer that produces shorter sequences can make training and inference cheaper. But a tokenizer with too large a vocabulary increases the size of the embedding matrix and output classifier.
The embedding matrix has shape V × d, where V is the vocabulary size and d is the embedding dimension. Increasing vocabulary size directly increases parameter count.
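Rough back-of-the-envelope numbers make the tradeoff concrete. The sizes below are placeholders; note that the output classifier typically adds a matrix of the same shape as the embedding matrix unless the two are tied.

```python
d_model = 1024

for vocab_size in (32_000, 64_000, 128_000):
    embed_params = vocab_size * d_model  # embedding matrix has V * d parameters
    print(vocab_size, f"-> {embed_params / 1e6:.0f}M embedding parameters")

for seq_len in (1024, 2048, 4096):
    # Attention matrix entries per head grow quadratically with sequence length.
    print(seq_len, "->", seq_len * seq_len, "attention entries")
```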
Tokenization for Multilingual Text
Tokenization becomes harder in multilingual settings.
Some languages use spaces between words. Some do not. Some use complex morphology. Some use large character inventories. Some combine multiple scripts in ordinary text.
A tokenizer trained mostly on English may tokenize other languages inefficiently. A word or phrase in another language may become many small pieces, increasing sequence length and reducing model efficiency.
Multilingual tokenizers usually need broad and balanced training data. Byte-level tokenization helps with coverage, but it does not guarantee equal efficiency across languages.
For example, the same semantic content may require more tokens in one language than another. This affects context length, training cost, and downstream performance.
Tokenization for Code
Code has different structure from natural language. It contains identifiers, punctuation, whitespace, indentation, operators, literals, and comments.
A tokenizer for code should handle strings such as:
get_user_profile
HTTPRequestHandler
user.id == request.owner_id
Subword tokenization is useful because identifiers often contain meaningful parts:
get_user_profile -> get user profile
HTTPRequestHandler -> HTTP Request Handler
Byte-level tokenization is also useful because code contains symbols and formatting patterns that may be rare in ordinary text.
For code models, preserving whitespace and punctuation can be important. A tokenizer that discards or normalizes too much structure may damage the model’s ability to reason about syntax.
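Whether such splits actually occur depends on the learned vocabulary, but the boundaries are predictable: underscores in snake_case and case transitions in CamelCase. A small illustrative pre-splitting sketch using a regular expression, not taken from any particular tokenizer:

```python
import re

def split_identifier(name):
    # Split on underscores and on lowercase-to-uppercase or acronym boundaries.
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", name)
    return [p for p in parts if p]

print(split_identifier("get_user_profile"))    # ['get', 'user', 'profile']
print(split_identifier("HTTPRequestHandler"))  # ['HTTP', 'Request', 'Handler']
```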
Tokenization in PyTorch Workflows
PyTorch itself does not define a single standard tokenizer. In practice, tokenization is often handled by external libraries, then token IDs are passed into PyTorch tensors.
A typical workflow is:
texts = [
"the cat sat",
"the dog slept",
]
# Example output from a tokenizer.
token_ids = torch.tensor([
[4, 5, 6],
[4, 7, 8],
], dtype=torch.long)
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=256)
x = embedding(token_ids)
print(x.shape)  # torch.Size([2, 3, 256])
For transformer models, the tokenizer often returns a dictionary:
batch = {
"input_ids": torch.tensor(...),
"attention_mask": torch.tensor(...),
}
The model receives both:
input_ids = batch["input_ids"]
attention_mask = batch["attention_mask"]
The input IDs say what the tokens are. The attention mask says which positions are valid.
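With the Hugging Face transformers library, for example, this dictionary comes straight from the tokenizer call. The sketch below assumes that library and the pretrained gpt2 tokenizer; GPT-2 defines no padding token by default, so the end-of-text token is commonly reused for batching.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

batch = tokenizer(
    ["the cat sat", "the dog slept on the mat"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)    # [2, length of the longest sequence in tokens]
print(batch["attention_mask"][0])  # 1 for real tokens, 0 for padding
```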
Decoding Token IDs Back to Text
Tokenization has an inverse operation called decoding. Decoding maps token IDs back into text.
For example:
[4, 5, 6] -> "the cat sat"For subword tokenization, decoding must merge pieces correctly:
play ##ing -> playing
or
Ġthe Ġcat Ġsat -> the cat sat
Different tokenizers use different internal symbols to represent whitespace and word boundaries. These symbols are tokenizer implementation details. The decoded output should reconstruct readable text.
Generation models repeatedly predict token IDs and decode them into text.
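A minimal decoding sketch for WordPiece-style pieces: the ## prefix marks a continuation to glue onto the previous word, and everything else starts a new space-separated word. Real decoders also handle punctuation, whitespace symbols, and special tokens.

```python
def decode_wordpiece(pieces):
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # continuation: append to the previous word
        else:
            words.append(piece)     # start of a new word
    return " ".join(words)

print(decode_wordpiece(["play", "##ing"]))                      # playing
print(decode_wordpiece(["un", "##afford", "##able", "again"]))  # unaffordable again
```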
Common Errors
One common error is treating token IDs as meaningful numbers. Token IDs are categorical indices. The difference between ID 100 and ID 101 has no semantic meaning. The embedding layer gives meaning to the IDs by mapping them to learned vectors.
Another common error is forgetting the attention mask. If a transformer attends to padding tokens, training quality may degrade.
A third common error is mixing tokenizer and model vocabularies. A model trained with one tokenizer should be used with that same tokenizer. If token IDs are produced by a different tokenizer, the model will look up the wrong embeddings.
A fourth common error is truncating text without noticing. Tokenizers often enforce a maximum sequence length. Text longer than this length may be cut off. This can remove important information.
Summary
Subword tokenization converts text into reusable pieces that are smaller than words but usually larger than characters. It reduces out-of-vocabulary problems while keeping sequence lengths manageable.
BPE, WordPiece, unigram tokenization, and byte-level tokenization are common approaches. Each makes different tradeoffs between vocabulary size, coverage, sequence length, and linguistic structure.
In PyTorch systems, tokenization usually happens before the model receives input. The tokenizer produces integer token IDs and attention masks. The embedding layer converts those IDs into vectors. The transformer or sequence model then computes contextual representations.