16.19 Token Streams
Most string algorithms operate on characters.
16.19 Token Streams
Problem
Most string algorithms operate on characters. Many real systems operate on tokens.
A compiler does not usually search raw source code character by character. It first groups characters into tokens:
if
identifier
=
number
{
}
A search engine does not usually index raw characters. It tokenizes documents into words, terms, phrases, or normalized units.
The challenge is to apply string-algorithm ideas to sequences whose elements are not characters.
Solution
Treat a token stream as a generalized string.
Instead of:
String = Array Char
use:
TokenStream = Array Token
Most algorithms still apply as long as tokens support equality, hashing, and sometimes ordering.
For example, KMP can search for a token pattern inside a token stream. Rabin-Karp can hash token windows. Tries can store token sequences. Suffix arrays can index suffixes of token streams.
The structure changes from text processing to sequence processing.
What Is a Token?
A token is a meaningful unit extracted from raw input.
In source code:
let x = 42
tokens might be:
| Token Type | Lexeme |
|---|---|
| Keyword | let |
| Identifier | x |
| Operator | = |
| Number | 42 |
In natural language:
the quick brown fox
tokens might be:
the
quick
brown
fox
In logs:
ERROR user=42 timeout=3000
tokens might be:
ERROR
user
=
42
timeout
=
3000
The tokenizer defines the algorithm's input model.
Why Tokenization Matters
Consider searching source code for:
x = y + z
A raw string search may match inside comments, string literals, or unrelated formatting.
A token-based search can ignore whitespace and comments:
x=y+z
x = y + z
x = y + z
all become the same token sequence.
This makes matching more robust.
Tokenization moves the problem from textual appearance to structural meaning.
Token Equality
The first design decision is equality.
Should these be equal?
UserID
userid
user_id
The answer depends on the application.
For a compiler:
UserID ≠ userid
For a search engine:
UserID = userid
may be acceptable after normalization.
Token equality must be explicit because it affects every downstream algorithm.
Token Normalization
Token streams often pass through a normalization stage.
Examples:
| Input | Normalized |
|---|---|
| Running | run |
| RUNNING | running |
| user_id | user id |
| 42 | <NUMBER> |
| "hello" | <STRING> |
Normalization can improve matching, but it also removes information.
For example, replacing every number with <NUMBER> makes these equivalent:
timeout = 1000
timeout = 5000
That may be desirable for clone detection but incorrect for exact semantic comparison.
Pattern Matching Over Tokens
Exact matching generalizes directly.
Given:
Stream = [let, x, =, y, +, z]
Pattern = [x, =, y]
find every position where the token sequence occurs.
The naive algorithm is identical to character matching:
for each start position:
compare pattern tokens one by one
KMP also applies because it depends only on equality.
def tokenKMP
[BEq α]
(stream pattern : Array α)
: List Nat :=
by
-- Same structure as character KMP.
-- Equality compares tokens instead of characters.
sorry
The algorithm is generic over the token type.
Token Hashing
Rabin-Karp works over tokens by assigning each token an integer identity.
let → 17
identifier → 29
= → 5
number → 41
Then compute rolling hashes over these integer IDs.
This is useful for:
- Clone detection
- Plagiarism detection
- Log pattern matching
- Large-scale document search
- Repeated phrase detection
The same collision cautions apply. Equal hashes should be verified unless probabilistic equality is acceptable.
Token Tries
A trie can store token sequences.
For phrases:
new york
new york city
new jersey
new delhi
the trie shares:
new
as a common prefix.
Conceptually:
(root)
|
new
/ | \\nyork jersey delhi
|
city
This supports phrase lookup, autocomplete, and multi-token dictionary matching.
With failure links, the same structure becomes Aho-Corasick over tokens.
Token Suffix Arrays
A suffix array can index a token stream.
Document:
to be or not to be
Token suffixes:
to be or not to be
be or not to be
or not to be
not to be
to be
be
Sorting these suffixes enables phrase search and repeated phrase discovery.
This is common in plagiarism detection, corpus analysis, and document indexing.
Example: Code Clone Detection
Suppose two functions differ only in variable names:
x = a + b
y = c + d
Raw text differs.
After token normalization:
<ID> = <ID> + <ID>
<ID> = <ID> + <ID>
the token streams match exactly.
A suffix array, rolling hash, or token-based KMP can detect this repeated structure.
This technique is a simple form of clone detection.
Example: Log Templates
Logs often contain repeated templates:
ERROR user 17 failed login
ERROR user 42 failed login
ERROR user 99 failed login
Normalize numbers:
ERROR user <NUMBER> failed login
Now all three messages share the same token pattern.
This enables aggregation, anomaly detection, and indexing.
Lean Representation
Define tokens explicitly.
inductive Token where
| keyword : String -> Token
| ident : String -> Token
| number : Int -> Token
| symbol : String -> Token
deriving BEq, Repr
A token stream is then:
abbrev TokenStream := Array Token
For normalized comparison, define a separate representation.
inductive NormalizedToken where
| keyword : String -> NormalizedToken
| ident : NormalizedToken
| number : NormalizedToken
| symbol : String -> NormalizedToken
deriving BEq, Repr
This keeps exact tokens and normalized tokens distinct.
Tokenization Pipeline
A robust pipeline has clear stages:
Raw text
↓
Lexing
↓
Token normalization
↓
Algorithm
↓
Result mapping
The final stage maps token positions back to original text offsets.
That mapping is essential for diagnostics, search results, highlighting, and error messages.
Complexity Analysis
Let:
| Symbol | Meaning |
|---|---|
| n | Number of tokens in the stream |
| m | Number of tokens in the pattern |
| Operation | Complexity |
|---|---|
| Tokenization | O(input size) |
| Naive token matching | O(nm) |
| KMP over tokens | O(n + m) |
| Rolling hash preprocessing | O(n) |
| Token suffix array | O(n log n) typical |
| Token trie lookup | O(m) |
The algorithmic complexity mirrors character-based strings. The main additional cost is tokenization.
Correctness
Token-stream algorithms are correct relative to the tokenizer and equality policy.
If tokenization preserves the distinctions required by the application, then sequence algorithms behave as expected.
If normalization merges distinct meanings, the algorithm may report matches that are acceptable for approximate analysis but invalid for exact semantics.
Correctness therefore depends on two layers:
- The sequence algorithm.
- The token model.
Both must match the problem.
Common Pitfalls
Do not treat tokenization as a minor preprocessing detail. It defines the problem.
Do not discard position information if results must be mapped back to source text.
Do not normalize identifiers, numbers, or literals unless that behavior is intended.
Do not mix raw tokens and normalized tokens in the same index.
Do not assume natural-language tokenization and programming-language tokenization follow the same rules.
Takeaway
A token stream is a string over a richer alphabet. Once text is converted into tokens, many classic string algorithms transfer directly: KMP, Rabin-Karp, tries, Aho-Corasick, suffix arrays, and hashing all work over token sequences. The hard part is not usually the algorithm. It is choosing a tokenization and normalization policy that matches the semantics of the task.