Tokenization converts a string into a sequence of meaningful units called tokens. A token may be a word, number, identifier, operator, delimiter, quoted string, or whitespace block, depending on the grammar. Tokenization is the first stage of many parsers because it separates low-level character scanning from higher-level syntactic analysis.
Problem
Given an input string, split it into tokens while preserving enough information for later processing.
For example:
sum = price * 3 + tax
A tokenizer might produce:
identifier("sum")
operator("=")
identifier("price")
operator("*")
number("3")
operator("+")
identifier("tax")
Whitespace is usually skipped, but some languages preserve it when indentation or formatting matters.
Token Type
A tokenizer should produce structured values, not just strings.
type TokenKind int
const (
	TokenEOF TokenKind = iota
	TokenIdent
	TokenNumber
	TokenString
	TokenOperator
	TokenLParen
	TokenRParen
	TokenComma
	TokenError
)
type Token struct {
	Kind  TokenKind
	Text  string
	Start int
	End   int
}
Start and End are byte offsets into the original input. Keeping offsets makes diagnostics and source mapping much easier.
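As a sketch of why offsets are worth keeping: a diagnostic renderer can underline a token directly from its range. The helper below, pointAt, is hypothetical (not part of the scanner in this article) and assumes single-line input.

```go
package main

import "strings"

// pointAt renders the source with a caret marker under the half-open
// byte range [start, end). Single-line input is assumed.
func pointAt(src string, start, end int) string {
	return src + "\n" + strings.Repeat(" ", start) + strings.Repeat("^", end-start)
}
```

For example, pointAt("sum = price", 6, 11) underlines the identifier price.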
Basic Scanner Structure
A tokenizer maintains an input string and a current byte offset.
type Scanner struct {
	src string
	pos int
}
func NewScanner(src string) *Scanner {
	return &Scanner{src: src}
}
The main method returns the next token.
func (s *Scanner) Next() Token {
	s.skipSpaces()
	if s.pos >= len(s.src) {
		return Token{Kind: TokenEOF, Start: s.pos, End: s.pos}
	}
	start := s.pos
	c := s.src[s.pos]
	if isLetterASCII(c) || c == '_' {
		return s.scanIdent()
	}
	if isDigitASCII(c) {
		return s.scanNumber()
	}
	switch c {
	case '(':
		s.pos++
		return Token{Kind: TokenLParen, Text: "(", Start: start, End: s.pos}
	case ')':
		s.pos++
		return Token{Kind: TokenRParen, Text: ")", Start: start, End: s.pos}
	case ',':
		s.pos++
		return Token{Kind: TokenComma, Text: ",", Start: start, End: s.pos}
	case '"':
		return s.scanString()
	case '+', '-', '*', '/', '=':
		s.pos++
		return Token{Kind: TokenOperator, Text: s.src[start:s.pos], Start: start, End: s.pos}
	default:
		s.pos++
		return Token{Kind: TokenError, Text: s.src[start:s.pos], Start: start, End: s.pos}
	}
}
The scanner consumes at least one byte on every non-EOF call. This progress condition prevents infinite loops.
Skipping Whitespace
For a simple ASCII grammar, skip spaces, tabs, and newlines.
func (s *Scanner) skipSpaces() {
	for s.pos < len(s.src) {
		switch s.src[s.pos] {
		case ' ', '\t', '\n', '\r':
			s.pos++
		default:
			return
		}
	}
}
Invariant:
After skipSpaces returns, pos is either at EOF or at a non-whitespace byte.
Identifiers
Identifiers usually begin with a letter or underscore and continue with letters, digits, or underscores.
func (s *Scanner) scanIdent() Token {
	start := s.pos
	s.pos++
	for s.pos < len(s.src) {
		c := s.src[s.pos]
		if !isLetterASCII(c) && !isDigitASCII(c) && c != '_' {
			break
		}
		s.pos++
	}
	return Token{
		Kind:  TokenIdent,
		Text:  s.src[start:s.pos],
		Start: start,
		End:   s.pos,
	}
}
func isLetterASCII(c byte) bool {
	return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
}
func isDigitASCII(c byte) bool {
	return '0' <= c && c <= '9'
}
Invariant:
The substring src[start:pos] is a valid identifier prefix.
The loop stops at the first byte that cannot continue an identifier.
Numbers
A simple integer scanner reads consecutive digits.
func (s *Scanner) scanNumber() Token {
	start := s.pos
	for s.pos < len(s.src) && isDigitASCII(s.src[s.pos]) {
		s.pos++
	}
	return Token{
		Kind:  TokenNumber,
		Text:  s.src[start:s.pos],
		Start: start,
		End:   s.pos,
	}
}
This scanner recognizes integers only. If decimals, signs, exponents, or underscores are allowed, encode those rules explicitly instead of adding ad hoc checks.
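For instance, an explicit decimal rule might read: consume digits, then at most one '.' that is immediately followed by a digit. The sketch below, a hypothetical scanNumberAt helper written against a plain string and offset rather than the Scanner, encodes exactly that.

```go
package main

// scanNumberAt returns the end offset of the number that starts at pos.
// It consumes a run of digits, then an optional fractional part: a '.'
// continues the number only if a digit follows it.
func scanNumberAt(src string, pos int) int {
	for pos < len(src) && '0' <= src[pos] && src[pos] <= '9' {
		pos++
	}
	if pos+1 < len(src) && src[pos] == '.' && '0' <= src[pos+1] && src[pos+1] <= '9' {
		pos++ // the '.'
		for pos < len(src) && '0' <= src[pos] && src[pos] <= '9' {
			pos++
		}
	}
	return pos
}
```

With this rule, "3.14" is one number token, while "3.foo" yields the number "3" and leaves the '.' for the operator logic.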
Quoted Strings
String scanning requires special handling because delimiters inside the string may be escaped.
func (s *Scanner) scanString() Token {
	start := s.pos
	s.pos++ // opening quote
	for s.pos < len(s.src) {
		c := s.src[s.pos]
		if c == '\\' {
			s.pos++
			if s.pos < len(s.src) {
				s.pos++
			}
			continue
		}
		if c == '"' {
			s.pos++
			return Token{
				Kind:  TokenString,
				Text:  s.src[start:s.pos],
				Start: start,
				End:   s.pos,
			}
		}
		s.pos++
	}
	return Token{
		Kind:  TokenError,
		Text:  s.src[start:s.pos],
		Start: start,
		End:   s.pos,
	}
}
The scanner returns TokenError if the string reaches EOF before a closing quote.
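Note that the token's Text still includes the surrounding quotes and raw escape sequences. Decoding them into the string's value is best left to a later stage; if the grammar's escapes happen to match Go's, the standard library decoder can be reused. A sketch, assuming Go-compatible escape rules:

```go
package main

import "strconv"

// stringValue decodes a raw quoted token, e.g. `"a\"b"`, into its value.
// strconv.Unquote strips the surrounding quotes and resolves Go-style escapes.
func stringValue(raw string) (string, error) {
	return strconv.Unquote(raw)
}
```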
Tokenizing the Whole Input
The scanner can be used as an iterator.
func Tokenize(src string) []Token {
	s := NewScanner(src)
	var tokens []Token
	for {
		tok := s.Next()
		tokens = append(tokens, tok)
		if tok.Kind == TokenEOF || tok.Kind == TokenError {
			break
		}
	}
	return tokens
}
This approach keeps tokenization simple and makes parser code independent from raw character handling.
Example
Input:
sum = price * 3 + tax
Output:
TokenIdent "sum" [0,3)
TokenOperator "=" [4,5)
TokenIdent "price" [6,11)
TokenOperator "*" [12,13)
TokenNumber "3" [14,15)
TokenOperator "+" [16,17)
TokenIdent "tax" [18,21)
TokenEOF "" [21,21)
The offsets are byte ranges in the original string.
Longest Match Rule
Many tokenizers use the longest match rule: when multiple token forms could match, consume the longest valid token.
For example, with operators:
=
==
=>
the input:
==
should usually produce one == token, not two = tokens.
func (s *Scanner) scanOperator() Token {
	start := s.pos
	if s.pos+1 < len(s.src) {
		two := s.src[s.pos : s.pos+2]
		switch two {
		case "==", "!=", "<=", ">=", "=>":
			s.pos += 2
			return Token{Kind: TokenOperator, Text: two, Start: start, End: s.pos}
		}
	}
	s.pos++
	return Token{Kind: TokenOperator, Text: s.src[start:s.pos], Start: start, End: s.pos}
}
Check longer operators before shorter ones.
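The same rule generalizes to a table of operators. The hypothetical helper below sketches maximal munch over a candidate list: it keeps the longest operator that is a prefix of the remaining input.

```go
package main

// longestOpAt returns the longest operator in ops that matches src at pos,
// or "" if none matches. Because it compares lengths rather than relying on
// list order, longer operators always win over shorter ones.
func longestOpAt(src string, pos int, ops []string) string {
	best := ""
	for _, op := range ops {
		if len(op) > len(best) &&
			pos+len(op) <= len(src) &&
			src[pos:pos+len(op)] == op {
			best = op
		}
	}
	return best
}
```

The operator set here is an example; a real tokenizer would use the grammar's own table.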
Keywords
Keywords can be handled after scanning an identifier.
const (
	TokenIf TokenKind = iota + 100
	TokenElse
	TokenFor
)
func keywordKind(text string) TokenKind {
	switch text {
	case "if":
		return TokenIf
	case "else":
		return TokenElse
	case "for":
		return TokenFor
	default:
		return TokenIdent
	}
}
Then modify scanIdent:
text := s.src[start:s.pos]
return Token{
	Kind:  keywordKind(text),
	Text:  text,
	Start: start,
	End:   s.pos,
}
This avoids duplicating identifier logic for each keyword.
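When the keyword set grows, a map keeps the lookup in one place instead of a long switch. This is a sketch that defines its own minimal TokenKind values so it stands alone; the values are illustrative, not the article's exact ones.

```go
package main

type TokenKind int

const (
	TokenIdent TokenKind = iota
	TokenIf
	TokenElse
	TokenFor
)

// keywords maps reserved words to their token kinds.
var keywords = map[string]TokenKind{
	"if":   TokenIf,
	"else": TokenElse,
	"for":  TokenFor,
}

// keywordKind returns the keyword kind for text, or TokenIdent
// when text is an ordinary identifier.
func keywordKind(text string) TokenKind {
	if kind, ok := keywords[text]; ok {
		return kind
	}
	return TokenIdent
}
```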
Error Tokens
A tokenizer should report lexical errors as tokens or diagnostics rather than panic.
Examples of lexical errors:
unexpected character
unterminated string
invalid number
invalid escape sequence
Returning TokenError is enough for simple systems. Larger systems usually store a diagnostic message as well.
type Token struct {
	Kind    TokenKind
	Text    string
	Start   int
	End     int
	Message string
}
Complexity
A correct tokenizer consumes each byte a constant number of times.
Time: O(n)
Space: O(t)
where t is the number of emitted tokens.
If the tokenizer stores only the current token and streams tokens to a parser, extra space can be reduced to O(1) apart from the input.
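A streaming setup can be sketched with a one-token buffer; the Token type and the next function here are placeholders standing in for the scanner above, not its real API.

```go
package main

// Token is a placeholder standing in for the article's token type.
type Token struct {
	Kind int
	Text string
}

// peekable wraps a token source and buffers at most one token,
// so the parser needs O(1) extra space beyond the input.
type peekable struct {
	next func() Token // produces the next token on demand
	cur  Token        // buffered token, valid only when ok is true
	ok   bool
}

// Peek returns the upcoming token without consuming it.
func (p *peekable) Peek() Token {
	if !p.ok {
		p.cur = p.next()
		p.ok = true
	}
	return p.cur
}

// Take returns the upcoming token and consumes it.
func (p *peekable) Take() Token {
	t := p.Peek()
	p.ok = false
	return t
}
```

A parser built on Peek and Take never needs the whole token slice in memory.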
Common Pitfalls
A token scanner must always make progress. If an error branch returns without advancing, the caller may loop forever.
The order of checks matters. For example, keywords should be recognized after identifier scanning, and multi-character operators should be checked before one-character operators.
Offsets should use a consistent convention. Half-open byte ranges [Start, End) work well.
Skipping whitespace may be wrong for indentation-sensitive languages. In such grammars, whitespace and newlines may need their own token kinds.
String slicing by byte offsets is correct for byte-oriented tokenizers, but diagnostics for user-facing columns may need Unicode-aware position mapping.
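A byte offset can be mapped to a display column with a small helper. The sketch below assumes a single line, an offset on a rune boundary, and that one rune occupies one column (not true for tabs or wide glyphs).

```go
package main

import "unicode/utf8"

// columnOf converts a byte offset into a 1-based column counted in runes,
// which is usually what editors display.
func columnOf(src string, offset int) int {
	return utf8.RuneCountInString(src[:offset]) + 1
}
```

For example, in "héllo" the first 'l' begins at byte offset 3 but sits in column 3, because 'é' occupies two bytes yet one column.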
Takeaway
Tokenization turns raw text into structured tokens. A good tokenizer has explicit token kinds, byte offsets, a progress guarantee, and clear rules for identifiers, numbers, strings, operators, and errors. Once tokenization is correct, later parsing code can work with a cleaner and more stable input model.