Port of cpython/Parser/lexer/ and cpython/Parser/tokenizer/. Lexer state machine, buffer, token emission, and the four tokenizer drivers (file, string, utf8, readline).
1641. Lexer and tokenizer drivers
What we are porting
CPython’s lexer is split into two layers:
- Parser/lexer/: the state machine that consumes bytes and emits tokens. lexer.c is the FSM, state.c owns the per-tokenizer state struct, buffer.c owns the slidable input buffer.
- Parser/tokenizer/: four drivers that feed the lexer from different sources. utf8_tokenizer.c (in-memory UTF-8 string), string_tokenizer.c (legacy string with encoding detection), file_tokenizer.c (FILE*), readline_tokenizer.c (REPL callback). helpers.c is the shared decode / line-handling surface.
Together they are the most stateful part of CPython’s parser, with roughly 6k lines of C. The lexer tracks indentation, parenthesis depth, type-comment mode, async-aware keywords (3.7+), f-string nesting, and continuation lines.
Go translation
Top-level surface lives in parser/lexer/:
```go
// State is the per-tokenizer struct. Mirrors struct tok_state from
// Parser/lexer/state.h. Fields renamed from snake_case to Go style.
type State struct {
	buf     *Buffer      // input buffer
	indents []int        // indent stack
	parens  []byte       // open paren stack: '(' '[' '{' or 0
	line    int          // 1-based
	col     int          // 0-based code-point offset
	mode    Mode         // file, single, eval, fstring
	async   asyncState   // 3.7 keyword tracking
	fstring []fstringTok // open f-string contexts
	err     *SyntaxError
}

// Mode mirrors Parser/lexer/state.h:Pegen_*Mode.
type Mode int

const (
	ModeFile Mode = iota
	ModeSingle
	ModeEval
	ModeFunctionType
	ModeFString
)
```

Buffer model in parser/lexer/buffer.go:
```go
// Buffer mirrors the buf/cur/inp/end pointer quartet from
// Parser/lexer/buffer.c. We use offsets into a []byte instead of
// raw pointers, but the semantics are identical.
type Buffer struct {
	src       []byte
	cur       int // current read offset
	lineStart int
	eof       bool
}
```

Token emission lives in parser/lexer/lexer.go:
```go
// Tok is the lexer's emitted token. Mirrors struct token from
// Parser/lexer/state.h. Distinct from tokenize.Token (1665), which
// is the public Python-facing surface.
type Tok struct {
	Kind     tokenize.Type
	Bytes    []byte
	Start    Pos
	End      Pos
	Metadata uint32 // packs is_keyword, is_async_keyword, etc.
}

// Next pulls one token. Mirrors tok_get from Parser/lexer/lexer.c.
func (s *State) Next() (Tok, error)
```

Driver dispatch
Each of the four tokenizer drivers is a thin constructor over State:

```go
// FromUTF8 mirrors utf8_tokenizer.c:_PyTokenizer_FromUTF8.
func FromUTF8(src []byte, mode Mode) *State

// FromString mirrors string_tokenizer.c with encoding detection
// (BOM + PEP 263 cookie).
func FromString(src []byte, mode Mode) (*State, error)

// FromFile mirrors file_tokenizer.c. Wraps an io.Reader and
// handles incremental reads.
func FromFile(r io.Reader, mode Mode) *State

// FromReadline mirrors readline_tokenizer.c. The callback returns
// one line at a time; used by the REPL.
func FromReadline(rl func() (string, error), mode Mode) *State
```

Indentation, parens, async
The three pieces of state CPython tracks:
- Indent stack: tabsize=8, alttabsize=1, error on inconsistent tab/space mixing under PEP 8 mode. Same algorithm as tok_get_indent in lexer.c.
- Paren stack: balances (), [], {} across logical lines. A mismatch yields the same unmatched ']' text CPython emits.
- Async-keyword state: the pre-3.7 quirk is gone in 3.14; the field stays so we can re-enable it for older grammar tests if needed.
f-string and t-string nesting
f-strings recursively re-enter the lexer with ModeFString. The
nesting stack is fstring []fstringTok. Each entry tracks the
quote style, the brace depth, and whether we are inside a : format
spec. Same structure as tok->tok_mode_stack in CPython 3.12+.
t-strings (PEP 750, 3.14) reuse the same machinery with a different Tok kind on emission. The nesting algorithm is identical; only the emitted token type differs.
Errors
Lexer errors lift to a *SyntaxError whose text is verbatim from
pegen_errors.c. The mapping table lives in 1643.
File mapping
| C source | Go target |
|---|---|
| Parser/lexer/state.h (struct) | parser/lexer/state.go |
| Parser/lexer/state.c | parser/lexer/state.go |
| Parser/lexer/buffer.c | parser/lexer/buffer.go |
| Parser/lexer/lexer.c | parser/lexer/lexer.go |
| Parser/tokenizer/utf8_tokenizer.c | parser/lexer/driver_utf8.go |
| Parser/tokenizer/string_tokenizer.c | parser/lexer/driver_string.go |
| Parser/tokenizer/file_tokenizer.c | parser/lexer/driver_file.go |
| Parser/tokenizer/readline_tokenizer.c | parser/lexer/driver_readline.go |
| Parser/tokenizer/helpers.c | parser/lexer/helpers.go |
Checklist
Status legend: [x] shipped, [ ] pending, [~] partial / scaffold,
[n] deferred / not in scope this phase.
Files
- [x] parser/lexer/state.go: State struct, Mode constants, New, Free. Indent stack, paren stack, f-string mode stack. Async-keyword tracking is intentionally not wired: 3.14 made async/await full hard keywords, so the pre-3.7 quirk is dead code in CPython too.
- [x] parser/lexer/buffer.go: collapses to a no-op pair plus reserveBuf. The C source's pointer rebase dance is unnecessary because gopy stores offsets.
- [x] parser/lexer/lexer.go: regular-mode FSM (NAME, NUMBER, STRING single + triple, OP, NEWLINE/NL, INDENT/DEDENT, comment, ENDMARKER), type-comment branch, line continuation, and the entry into f-string mode all land. The f-string scanner itself sits in fstring.go. Async-keyword tracking is N/A on 3.14.
- [x] parser/lexer/fstring.go: f-string brace-balance scanner, :-format-spec mode, conversion specifiers.
- [n] parser/lexer/driver_utf8.go: collapsed into driver_string.go because Go strings are already UTF-8.
- [x] parser/lexer/driver_string.go: in-memory driver with BOM + PEP 263 cookie detection. The cookie scanner lives alongside in source.go and is exercised by source_test.go.
- [x] parser/lexer/driver_file.go: io.Reader driver with incremental refill.
- [x] parser/lexer/driver_readline.go: REPL driver over a func() (string, error) callback.
- [x] parser/lexer/helpers.go: shared decode / line slicing / printable-ASCII filter ports of helpers.c.
- [x] parser/lexer/lexer_test.go: tokenisation panels including type comments. Indent/dedent, paren balance, and f-string nesting are pinned by sibling panels under partest/ (indent_test.go, paren_mismatch_test.go, fstring_nesting_test.go, fstring_walrus_test.go).
Surface guarantees
- Token kinds match the table generated for 1665. Pinned by parser/lexer/types_test.go referencing tokenize.Type.
- Indent/dedent emission matches CPython on the Lib/test/test_tokenize.py indentation corpus.
- Paren-mismatch errors quote the same span CPython quotes (start of the opening paren through the current position).
- f-string nesting depth panel: 0..6 levels with mixed : format specs reproduces CPython's emission.
- Encoding detection: UTF-8 BOM, ASCII default, PEP 263 cookies on line 1 and line 2, conflicting BOM-vs-cookie error message.
- CRLF, CR, and LF line-ending normalisation matches CPython.
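The encoding-detection guarantee can be sketched as a small, self-contained routine. This is illustrative, not the cookie scanner in source.go: detectEncoding is a hypothetical name, and the cookie regex follows the pattern given in PEP 263.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// cookieRe matches a PEP 263 coding declaration in a comment,
// e.g. "# -*- coding: latin-1 -*-".
var cookieRe = regexp.MustCompile(`^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)`)

// detectEncoding applies the detection order: UTF-8 BOM first, then a
// cookie on line 1 or line 2, else the UTF-8 default. A BOM combined
// with a non-UTF-8 cookie is an error, as in CPython.
func detectEncoding(src []byte) (string, error) {
	hasBOM := bytes.HasPrefix(src, []byte{0xEF, 0xBB, 0xBF})
	if hasBOM {
		src = src[3:]
	}
	lines := bytes.SplitN(src, []byte("\n"), 3)
	for i := 0; i < len(lines) && i < 2; i++ {
		if m := cookieRe.FindSubmatch(lines[i]); m != nil {
			enc := string(m[1])
			if hasBOM && enc != "utf-8" {
				return "", fmt.Errorf("encoding problem: %s with BOM", enc)
			}
			return enc, nil
		}
	}
	return "utf-8", nil
}

func main() {
	enc, _ := detectEncoding([]byte("# -*- coding: latin-1 -*-\nx = 1\n"))
	fmt.Println(enc) // latin-1
}
```

The real scanner also has to stop considering line 2 once line 1 contains code, and to normalise encoding-name aliases; those details are omitted here.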
Cross-references
- Token table values: 1665.
- SyntaxError text: 1643.
- String literal post-processing: 1644.
Out of scope for v0.5.5
- Interactive readline. Lands in 1645 alongside v0.9 REPL work.
- tok->tok_extra_tokens for COMMENT / NL / ENCODING in extraTokens=true mode. Surface lands in 1665; the lexer side lands here in v0.9.
Out of scope, period
- Free-threaded parser paths. The parser runs under one goroutine.