2.11 Tokenization

Tokenization converts a string into a sequence of meaningful units called tokens. A token may be a word, number, identifier, operator, delimiter, quoted string, or whitespace block, depending on the grammar. Tokenization is the first stage of many parsers because it separates low-level character scanning from higher-level syntactic analysis.

Problem

Given an input string, split it into tokens while preserving enough information for later processing.

For example:

sum = price * 3 + tax

A tokenizer might produce:

identifier("sum")
operator("=")
identifier("price")
operator("*")
number("3")
operator("+")
identifier("tax")

Whitespace is usually skipped, but some languages preserve it when indentation or formatting matters.

Token Type

A tokenizer should produce structured values, not just strings.

type TokenKind int

const (
	TokenEOF TokenKind = iota
	TokenIdent
	TokenNumber
	TokenString
	TokenOperator
	TokenLParen
	TokenRParen
	TokenComma
	TokenError
)

type Token struct {
	Kind  TokenKind
	Text  string
	Start int
	End   int
}

Start and End are byte offsets into the original input. Keeping offsets makes diagnostics and source mapping much easier.
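
For example, offsets make it cheap to point at a token in an error message. A minimal sketch, assuming a single-line input (pointAt is a hypothetical helper; strings is the standard library package):

func pointAt(src string, t Token) string {
	// Render the source with a caret line underneath the token span.
	return src + "\n" +
		strings.Repeat(" ", t.Start) +
		strings.Repeat("^", t.End-t.Start)
}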

Basic Scanner Structure

A tokenizer maintains an input string and a current byte offset.

type Scanner struct {
	src string
	pos int
}

func NewScanner(src string) *Scanner {
	return &Scanner{src: src}
}

The main method returns the next token.

func (s *Scanner) Next() Token {
	s.skipSpaces()

	if s.pos >= len(s.src) {
		return Token{Kind: TokenEOF, Start: s.pos, End: s.pos}
	}

	start := s.pos
	c := s.src[s.pos]

	if isLetterASCII(c) || c == '_' {
		return s.scanIdent()
	}

	if isDigitASCII(c) {
		return s.scanNumber()
	}

	switch c {
	case '(':
		s.pos++
		return Token{Kind: TokenLParen, Text: "(", Start: start, End: s.pos}
	case ')':
		s.pos++
		return Token{Kind: TokenRParen, Text: ")", Start: start, End: s.pos}
	case ',':
		s.pos++
		return Token{Kind: TokenComma, Text: ",", Start: start, End: s.pos}
	case '"':
		return s.scanString()
	case '+', '-', '*', '/', '=':
		s.pos++
		return Token{Kind: TokenOperator, Text: s.src[start:s.pos], Start: start, End: s.pos}
	default:
		s.pos++
		return Token{Kind: TokenError, Text: s.src[start:s.pos], Start: start, End: s.pos}
	}
}

The scanner consumes at least one byte on every non-EOF call. This progress condition prevents infinite loops.
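
Because pos is unexported, this property is easy to check with a test in the same package. A minimal sketch (the test name and probe input are arbitrary; it belongs in a _test.go file that imports testing):

func TestNextAlwaysAdvances(t *testing.T) {
	s := NewScanner("a @ b")
	for {
		before := s.pos
		tok := s.Next()
		if tok.Kind == TokenEOF {
			break
		}
		// Every non-EOF token, including error tokens, must move pos forward.
		if s.pos <= before {
			t.Fatalf("scanner did not advance past offset %d", before)
		}
	}
}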

Skipping Whitespace

For a simple ASCII grammar, skip spaces, tabs, and newlines.

func (s *Scanner) skipSpaces() {
	for s.pos < len(s.src) {
		switch s.src[s.pos] {
		case ' ', '\t', '\n', '\r':
			s.pos++
		default:
			return
		}
	}
}

Invariant:

After skipSpaces returns, pos is either at EOF or at a non-whitespace byte.

Identifiers

Identifiers usually begin with a letter or underscore and continue with letters, digits, or underscores.

func (s *Scanner) scanIdent() Token {
	start := s.pos
	s.pos++

	for s.pos < len(s.src) {
		c := s.src[s.pos]
		if !isLetterASCII(c) && !isDigitASCII(c) && c != '_' {
			break
		}
		s.pos++
	}

	return Token{
		Kind:  TokenIdent,
		Text:  s.src[start:s.pos],
		Start: start,
		End:   s.pos,
	}
}

func isLetterASCII(c byte) bool {
	return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
}

func isDigitASCII(c byte) bool {
	return '0' <= c && c <= '9'
}

Invariant:

The substring src[start:pos] is a valid identifier prefix.

The loop stops at the first byte that cannot continue an identifier.
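
For example, printing the first token of a mixed input (using fmt):

s := NewScanner("price2*tax")
fmt.Println(s.Next().Text) // price2; pos is now at '*'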

Numbers

A simple integer scanner reads consecutive digits.

func (s *Scanner) scanNumber() Token {
	start := s.pos

	for s.pos < len(s.src) && isDigitASCII(s.src[s.pos]) {
		s.pos++
	}

	return Token{
		Kind:  TokenNumber,
		Text:  s.src[start:s.pos],
		Start: start,
		End:   s.pos,
	}
}

This scanner recognizes integers only. If decimals, signs, exponents, or underscores are allowed, encode those rules explicitly instead of adding ad hoc checks.
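
As an illustration, here is a sketch of an extended scanner that accepts an optional fractional part and exponent (scanNumberExt is hypothetical and not called by Next above):

// scanNumberExt accepts forms like 3, 3.14, 1e9, and 2.5E-3.
func (s *Scanner) scanNumberExt() Token {
	start := s.pos

	for s.pos < len(s.src) && isDigitASCII(s.src[s.pos]) {
		s.pos++
	}

	// Fractional part: '.' counts only if followed by at least one digit.
	if s.pos+1 < len(s.src) && s.src[s.pos] == '.' && isDigitASCII(s.src[s.pos+1]) {
		s.pos += 2
		for s.pos < len(s.src) && isDigitASCII(s.src[s.pos]) {
			s.pos++
		}
	}

	// Exponent: 'e' or 'E', optional sign, then at least one digit.
	if s.pos < len(s.src) && (s.src[s.pos] == 'e' || s.src[s.pos] == 'E') {
		p := s.pos + 1
		if p < len(s.src) && (s.src[p] == '+' || s.src[p] == '-') {
			p++
		}
		if p < len(s.src) && isDigitASCII(s.src[p]) {
			s.pos = p
			for s.pos < len(s.src) && isDigitASCII(s.src[s.pos]) {
				s.pos++
			}
		}
	}

	return Token{Kind: TokenNumber, Text: s.src[start:s.pos], Start: start, End: s.pos}
}

Note that on input like "3." or "1e", the scanner stops after the digits and leaves the trailing byte for the next call, which preserves the progress guarantee.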

Quoted Strings

String scanning requires special handling because delimiters inside the string may be escaped.

func (s *Scanner) scanString() Token {
	start := s.pos
	s.pos++ // opening quote

	for s.pos < len(s.src) {
		c := s.src[s.pos]

		if c == '\\' {
			s.pos++
			if s.pos < len(s.src) {
				s.pos++
			}
			continue
		}

		if c == '"' {
			s.pos++
			return Token{
				Kind:  TokenString,
				Text:  s.src[start:s.pos],
				Start: start,
				End:   s.pos,
			}
		}

		s.pos++
	}

	return Token{
		Kind:  TokenError,
		Text:  s.src[start:s.pos],
		Start: start,
		End:   s.pos,
	}
}

The scanner returns TokenError if the string reaches EOF before a closing quote.
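
Note that Text includes the surrounding quotes and the raw escape sequences. If the grammar's escapes follow Go's string-literal rules (an assumption; other grammars need a custom unescape step), the standard library can decode the text:

value, err := strconv.Unquote(tok.Text)
if err != nil {
	// treat as a lexical error, e.g. an invalid escape sequence
}
_ = value // decoded contents, without quotes or escapes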

Tokenizing the Whole Input

The scanner can be used as an iterator.

func Tokenize(src string) []Token {
	s := NewScanner(src)
	var tokens []Token

	for {
		tok := s.Next()
		tokens = append(tokens, tok)

		if tok.Kind == TokenEOF || tok.Kind == TokenError {
			break
		}
	}

	return tokens
}

This approach keeps tokenization simple and makes parser code independent of raw character handling.

Example

Input:

sum = price * 3 + tax

Output:

TokenIdent    "sum"    [0,3)
TokenOperator "="      [4,5)
TokenIdent    "price"  [6,11)
TokenOperator "*"      [12,13)
TokenNumber   "3"      [14,15)
TokenOperator "+"      [16,17)
TokenIdent    "tax"    [18,21)
TokenEOF      ""       [21,21)

The offsets are byte ranges in the original string.
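
The listing above can be produced by a small driver. This is a sketch; kindName is a hypothetical helper that maps a TokenKind to its printable name:

func main() {
	for _, tok := range Tokenize("sum = price * 3 + tax") {
		fmt.Printf("%-13s %-8q [%d,%d)\n", kindName(tok.Kind), tok.Text, tok.Start, tok.End)
	}
}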

Longest Match Rule

Many tokenizers use the longest match rule: when multiple token forms could match, consume the longest valid token.

For example, with operators:

=
==
=>

the input:

==

should usually produce one == token, not two = tokens.

func (s *Scanner) scanOperator() Token {
	start := s.pos

	if s.pos+1 < len(s.src) {
		two := s.src[s.pos : s.pos+2]
		switch two {
		case "==", "!=", "<=", ">=", "=>":
			s.pos += 2
			return Token{Kind: TokenOperator, Text: two, Start: start, End: s.pos}
		}
	}

	s.pos++
	return Token{Kind: TokenOperator, Text: s.src[start:s.pos], Start: start, End: s.pos}
}

Check longer operators before shorter ones.
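
To wire this in, the single-character operator case in Next can be replaced by a call to scanOperator, extended to cover the first bytes of the two-character operators:

case '+', '-', '*', '/', '=', '!', '<', '>':
	return s.scanOperator()

If a lone '!' is not a valid operator in the grammar, scanOperator should return TokenError for it rather than a one-byte operator token.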

Keywords

Keywords can be handled after scanning an identifier. Starting the keyword kinds at iota + 100 keeps them from colliding with the earlier TokenKind constants.

const (
	TokenIf TokenKind = iota + 100
	TokenElse
	TokenFor
)

func keywordKind(text string) TokenKind {
	switch text {
	case "if":
		return TokenIf
	case "else":
		return TokenElse
	case "for":
		return TokenFor
	default:
		return TokenIdent
	}
}

Then replace the return statement in scanIdent:

text := s.src[start:s.pos]
return Token{
	Kind:  keywordKind(text),
	Text:  text,
	Start: start,
	End:   s.pos,
}

This avoids duplicating identifier logic for each keyword.

Error Tokens

A tokenizer should report lexical errors as tokens or diagnostics rather than panic.

Examples of lexical errors:

unexpected character
unterminated string
invalid number
invalid escape sequence

Returning TokenError is enough for simple systems. Larger systems usually store a diagnostic message as well, for example by adding a Message field to Token:

type Token struct {
	Kind    TokenKind
	Text    string
	Start   int
	End     int
	Message string
}
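
With that field, the EOF path in scanString can report what went wrong. A sketch:

return Token{
	Kind:    TokenError,
	Text:    s.src[start:s.pos],
	Start:   start,
	End:     s.pos,
	Message: "unterminated string literal",
}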

Complexity

A tokenizer of this shape examines each byte a constant number of times.

Time:  O(n)
Space: O(t)

where t is the number of emitted tokens.

If the tokenizer stores only the current token and streams tokens to a parser, extra space can be reduced to O(1) apart from the input.
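
A streaming consumer looks like this:

s := NewScanner(src)
for tok := s.Next(); tok.Kind != TokenEOF; tok = s.Next() {
	// hand tok to the parser; no token slice is retained
}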

Common Pitfalls

A token scanner must always make progress. If an error branch returns without advancing, the caller may loop forever.

The order of checks matters. For example, keywords should be recognized after identifier scanning, and multi-character operators should be checked before one-character operators.

Offsets should use a consistent convention. Half-open byte ranges [Start, End) work well.

Skipping whitespace may be wrong for indentation-sensitive languages. In such grammars, whitespace and newlines may need their own token kinds.

String slicing by byte offsets is correct for byte-oriented tokenizers, but diagnostics for user-facing columns may need Unicode-aware position mapping.
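
For example, a helper that maps a byte offset to a 1-based line and rune column (a sketch; it assumes the offset falls on a rune boundary):

func lineCol(src string, offset int) (line, col int) {
	line, col = 1, 1
	for _, r := range src[:offset] {
		// Ranging over a string decodes runes, so col counts runes, not bytes.
		if r == '\n' {
			line++
			col = 1
		} else {
			col++
		}
	}
	return line, col
}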

Takeaway

Tokenization turns raw text into structured tokens. A good tokenizer has explicit token kinds, byte offsets, a progress guarantee, and clear rules for identifiers, numbers, strings, operators, and errors. Once tokenization is correct, later parsing code can work with a cleaner and more stable input model.