16.19 Token Streams

Most string algorithms operate on characters.

16.19 Token Streams

Problem

Most string algorithms operate on characters. Many real systems operate on tokens.

A compiler does not usually search raw source code character by character. It first groups characters into tokens:

if
identifier
=
number
{
}

A search engine does not usually index raw characters. It tokenizes documents into words, terms, phrases, or normalized units.

The challenge is to apply string-algorithm ideas to sequences whose elements are not characters.

Solution

Treat a token stream as a generalized string.

Instead of:

String = Array Char

use:

TokenStream = Array Token

Most algorithms still apply as long as tokens support equality, hashing, and sometimes ordering.

For example, KMP can search for a token pattern inside a token stream. Rabin-Karp can hash token windows. Tries can store token sequences. Suffix arrays can index suffixes of token streams.

The structure changes from text processing to sequence processing.

What Is a Token?

A token is a meaningful unit extracted from raw input.

In source code:

let x = 42

tokens might be:

Token Type Lexeme
Keyword let
Identifier x
Operator =
Number 42

In natural language:

the quick brown fox

tokens might be:

the
quick
brown
fox

In logs:

ERROR user=42 timeout=3000

tokens might be:

ERROR
user
=
42
timeout
=
3000

The tokenizer defines the algorithm's input model.

Why Tokenization Matters

Consider searching source code for:

x = y + z

A raw string search may match inside comments, string literals, or unrelated formatting.

A token-based search can ignore whitespace and comments:

x=y+z
x = y + z
x   =   y   +   z

all become the same token sequence.

This makes matching more robust.

Tokenization moves the problem from textual appearance to structural meaning.

Token Equality

The first design decision is equality.

Should these be equal?

UserID
userid
user_id

The answer depends on the application.

For a compiler:

UserID ≠ userid

For a search engine:

UserID = userid

may be acceptable after normalization.

Token equality must be explicit because it affects every downstream algorithm.

Token Normalization

Token streams often pass through a normalization stage.

Examples:

Input Normalized
Running run
RUNNING running
user_id user id
42 <NUMBER>
"hello" <STRING>

Normalization can improve matching, but it also removes information.

For example, replacing every number with <NUMBER> makes these equivalent:

timeout = 1000
timeout = 5000

That may be desirable for clone detection but incorrect for exact semantic comparison.

Pattern Matching Over Tokens

Exact matching generalizes directly.

Given:

Stream  = [let, x, =, y, +, z]
Pattern = [x, =, y]

find every position where the token sequence occurs.

The naive algorithm is identical to character matching:

for each start position:
    compare pattern tokens one by one

KMP also applies because it depends only on equality.

def tokenKMP
  [BEq α]
  (stream pattern : Array α)
  : List Nat :=
by
  -- Same structure as character KMP.
  -- Equality compares tokens instead of characters.
  sorry

The algorithm is generic over the token type.

Token Hashing

Rabin-Karp works over tokens by assigning each token an integer identity.

let        → 17
identifier → 29
=          → 5
number     → 41

Then compute rolling hashes over these integer IDs.

This is useful for:

  • Clone detection
  • Plagiarism detection
  • Log pattern matching
  • Large-scale document search
  • Repeated phrase detection

The same collision cautions apply. Equal hashes should be verified unless probabilistic equality is acceptable.

Token Tries

A trie can store token sequences.

For phrases:

new york
new york city
new jersey
new delhi

the trie shares:

new

as a common prefix.

Conceptually:

(root)
   |
  new
 / | \\nyork jersey delhi
 |
city

This supports phrase lookup, autocomplete, and multi-token dictionary matching.

With failure links, the same structure becomes Aho-Corasick over tokens.

Token Suffix Arrays

A suffix array can index a token stream.

Document:

to be or not to be

Token suffixes:

to be or not to be
be or not to be
or not to be
not to be
to be
be

Sorting these suffixes enables phrase search and repeated phrase discovery.

This is common in plagiarism detection, corpus analysis, and document indexing.

Example: Code Clone Detection

Suppose two functions differ only in variable names:

x = a + b
y = c + d

Raw text differs.

After token normalization:

<ID> = <ID> + <ID>
<ID> = <ID> + <ID>

the token streams match exactly.

A suffix array, rolling hash, or token-based KMP can detect this repeated structure.

This technique is a simple form of clone detection.

Example: Log Templates

Logs often contain repeated templates:

ERROR user 17 failed login
ERROR user 42 failed login
ERROR user 99 failed login

Normalize numbers:

ERROR user <NUMBER> failed login

Now all three messages share the same token pattern.

This enables aggregation, anomaly detection, and indexing.

Lean Representation

Define tokens explicitly.

inductive Token where
  | keyword : String -> Token
  | ident   : String -> Token
  | number  : Int -> Token
  | symbol  : String -> Token
deriving BEq, Repr

A token stream is then:

abbrev TokenStream := Array Token

For normalized comparison, define a separate representation.

inductive NormalizedToken where
  | keyword : String -> NormalizedToken
  | ident   : NormalizedToken
  | number  : NormalizedToken
  | symbol  : String -> NormalizedToken
deriving BEq, Repr

This keeps exact tokens and normalized tokens distinct.

Tokenization Pipeline

A robust pipeline has clear stages:

Raw text
  ↓
Lexing
  ↓
Token normalization
  ↓
Algorithm
  ↓
Result mapping

The final stage maps token positions back to original text offsets.

That mapping is essential for diagnostics, search results, highlighting, and error messages.

Complexity Analysis

Let:

Symbol Meaning
n Number of tokens in the stream
m Number of tokens in the pattern
Operation Complexity
Tokenization O(input size)
Naive token matching O(nm)
KMP over tokens O(n + m)
Rolling hash preprocessing O(n)
Token suffix array O(n log n) typical
Token trie lookup O(m)

The algorithmic complexity mirrors character-based strings. The main additional cost is tokenization.

Correctness

Token-stream algorithms are correct relative to the tokenizer and equality policy.

If tokenization preserves the distinctions required by the application, then sequence algorithms behave as expected.

If normalization merges distinct meanings, the algorithm may report matches that are acceptable for approximate analysis but invalid for exact semantics.

Correctness therefore depends on two layers:

  1. The sequence algorithm.
  2. The token model.

Both must match the problem.

Common Pitfalls

Do not treat tokenization as a minor preprocessing detail. It defines the problem.

Do not discard position information if results must be mapped back to source text.

Do not normalize identifiers, numbers, or literals unless that behavior is intended.

Do not mix raw tokens and normalized tokens in the same index.

Do not assume natural-language tokenization and programming-language tokenization follow the same rules.

Takeaway

A token stream is a string over a richer alphabet. Once text is converted into tokens, many classic string algorithms transfer directly: KMP, Rabin-Karp, tries, Aho-Corasick, suffix arrays, and hashing all work over token sequences. The hard part is not usually the algorithm. It is choosing a tokenization and normalization policy that matches the semantics of the task.