This chapter covers CPython's PEG parser: Grammar/python.gram, the generated Parser/parser.c, and parse tree construction.
Parsing is the stage that turns a token stream into syntax structure.
The tokenizer recognizes lexical units. The parser recognizes grammatical structure. It decides whether a sequence of tokens is a valid Python program and, when it is valid, builds an abstract syntax tree.
For this source:
def add(a, b):
    return a + b

the tokenizer produces tokens such as:
NAME "def"
NAME "add"
LPAR "("
NAME "a"
COMMA ","
NAME "b"
RPAR ")"
COLON ":"
NEWLINE "\n"
INDENT " "
NAME "return"
NAME "a"
PLUS "+"
NAME "b"
NEWLINE "\n"
DEDENT ""
ENDMARKER ""

The parser gives those tokens structure:
Module
FunctionDef name="add"
arguments
arg "a"
arg "b"
body
Return
BinOp
Name "a"
Add
Name "b"

The difference is important. Tokenization knows that def, add, (, a, and : exist. Parsing knows that together they form a function definition.
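The token stream can be reproduced at Python level with the standard tokenize module (a quick sketch; exact_type distinguishes LPAR, COMMA, and so on from the generic OP type):

```python
import io
import tokenize

src = "def add(a, b):\n    return a + b\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    # exact_type maps punctuation to specific names like LPAR and COLON.
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))
```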
19.1 Position in the Compilation Pipeline
Parsing sits between tokenization and semantic compilation.
source bytes
↓
encoding detection
↓
tokenization
↓
parsing
↓
abstract syntax tree
↓
symbol table
↓
compiler
↓
code object
↓
bytecode execution

The parser consumes tokens and produces an AST. It does not allocate local variable slots, decide closure layout, generate bytecode, run type checks, or execute code.
Its main responsibilities are:
recognize valid Python grammar
reject invalid syntax
group tokens into statements and expressions
apply operator precedence
handle indentation-based blocks
build AST nodes
produce useful syntax errors

The Python language reference exposes the full grammar used by CPython, derived from Grammar/python.gram, while omitting implementation details related to code generation and error recovery. (Python documentation)
19.2 CPython Uses a PEG Parser
Modern CPython uses a PEG parser.
PEG means Parsing Expression Grammar. CPython switched from its older LL(1) parser to a PEG parser in Python 3.9. The PEG parser allows a more direct expression of Python’s grammar and gives CPython more flexibility when adding syntax.
A PEG parser works by trying grammar alternatives in order. The first alternative that succeeds is selected.
A simplified rule might look like:
statement:
| function_def
| class_def
| if_stmt
| while_stmt
| simple_stmt

Given input tokens, the parser attempts to match a statement. If the first alternative fails, it tries the next, and so on.
This ordered choice is different from some context-free grammar systems where alternatives are unordered. In PEG, order matters.
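A toy sketch of ordered choice (not CPython's implementation; the rule names are invented for illustration): each alternative is tried from the same position, and the first one that matches wins.

```python
def match_keyword(tokens, pos, word):
    """Match a single keyword token, or fail with None."""
    if pos < len(tokens) and tokens[pos] == word:
        return word, pos + 1
    return None

def parse_if(tokens, pos):
    return match_keyword(tokens, pos, "if")

def parse_while(tokens, pos):
    return match_keyword(tokens, pos, "while")

def parse_simple(tokens, pos):
    # Fallback: accept any single token as a "simple statement".
    if pos < len(tokens):
        return tokens[pos], pos + 1
    return None

def parse_statement(tokens, pos):
    # Ordered alternatives: the first one that succeeds is selected.
    for alternative in (parse_if, parse_while, parse_simple):
        result = alternative(tokens, pos)
        if result is not None:
            return result
    return None

print(parse_statement(["while", "x"], 0))  # ('while', 1)
```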
19.3 Grammar Rules
The grammar defines valid token sequences.
A simplified function definition rule can be imagined like this:
function_def:
"def" NAME "(" parameters? ")" ":" block

This means:
the keyword def
then a function name
then an opening parenthesis
then optional parameters
then a closing parenthesis
then a colon
then a block

The real CPython grammar is more detailed because it supports decorators, async functions, annotations, positional-only parameters, keyword-only parameters, default values, type parameters, return annotations, and error recovery.
Still, the basic model remains:
tokens are matched against grammar rules
rules call other rules
successful rules produce AST structure
failed rules produce backtracking or syntax errors

19.4 Statements and Expressions
Python syntax has statements and expressions.
Statements perform actions:
x = 1
return x
if x:
    print(x)

Expressions produce values:
x + 1
f(a, b)
items[0]
lambda x: x + 1

The parser must distinguish them because not every expression is allowed everywhere, and not every statement can appear inside an expression.
Valid:
if ready:
    run()

Invalid:

x = if ready: run()

Valid:

value = a if ready else b

The grammar encodes these distinctions. if as a statement belongs to one grammar region. Conditional expressions belong to another.
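The distinction can be checked with ast.parse: the conditional expression parses, while an if statement on the right of an assignment is rejected.

```python
import ast

# A conditional expression is valid inside an assignment.
tree = ast.parse("value = a if ready else b")
print(type(tree.body[0].value).__name__)  # IfExp

# An if statement is not an expression, so this fails to parse.
try:
    ast.parse("x = if ready: run()")
except SyntaxError:
    print("rejected: statement used where an expression is required")
```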
19.5 Blocks and Suites
Python blocks are represented by INDENT and DEDENT tokens.
The parser does not count spaces directly. The tokenizer already converted indentation into structural tokens.
Example:
if x:
    a = 1
    b = 2
c = 3

Token shape:
NAME "if"
NAME "x"
COLON ":"
NEWLINE
INDENT
NAME "a"
EQUAL
NUMBER "1"
NEWLINE
NAME "b"
EQUAL
NUMBER "2"
NEWLINE
DEDENT
NAME "c"
EQUAL
NUMBER "3"
NEWLINE
ENDMARKER

The parser sees the block as a suite:
If
test: Name "x"
body:
Assign a = 1
Assign b = 2
Module body:
Assign c = 3

This design keeps indentation handling out of most grammar rules. The grammar can say “block” and rely on tokenizer-produced INDENT and DEDENT tokens.
19.6 Expression Parsing and Precedence
Parsing expressions requires precedence.
Consider:
x = 1 + 2 * 3

The parser must produce:

1 + (2 * 3)

not:

(1 + 2) * 3

The AST shape is:
Assign
target: Name "x"
value:
BinOp Add
left: Constant 1
right:
BinOp Mult
left: Constant 2
right: Constant 3

Grammar rules encode precedence by layering expressions.
A simplified model:
expression
conditional expression
sum
term
sum "+" term
sum "-" term
term
factor
term "*" factor
term "/" factor
factor
atom
"-" factor
"+" factor
atom
NAME
NUMBER
STRING
"(" expression ")"

Higher-precedence constructs appear deeper in the grammar. Multiplication binds more tightly than addition because addition rules consume multiplication-level expressions as operands.
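The resulting grouping is visible in the AST (a quick check with ast.parse): the multiplication ends up as the right operand of the addition.

```python
import ast

tree = ast.parse("x = 1 + 2 * 3")
value = tree.body[0].value                 # the BinOp for 1 + (2 * 3)
print(type(value.op).__name__)             # Add
print(type(value.right.op).__name__)       # Mult: 2 * 3 was grouped first
```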
19.7 Associativity
Precedence decides which operators bind more tightly. Associativity decides how operators of the same precedence group.
Example:
x = a - b - c

This parses as:

(a - b) - c

not:

a - (b - c)

Subtraction is left-associative.
But exponentiation behaves differently:
x = a ** b ** c

This parses as:

a ** (b ** c)

The grammar must encode this explicitly. Associativity is not a runtime property. It is fixed by parse structure before evaluation begins.
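Both behaviors are visible in the AST (a quick check with ast.parse in eval mode): subtraction nests on the left, exponentiation on the right.

```python
import ast

sub = ast.parse("a - b - c", mode="eval").body
print(type(sub.left).__name__)    # BinOp: (a - b) is built first
print(type(sub.right).__name__)   # Name: c

power = ast.parse("a ** b ** c", mode="eval").body
print(type(power.left).__name__)   # Name: a
print(type(power.right).__name__)  # BinOp: (b ** c) is built first
```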
19.8 Calls, Attributes, and Subscripts
Python has postfix expression forms:
obj.attr
obj.method(x)
items[i]
items[i:j]
f(x)(y).z

These chain tightly.
Example:
result = client.session.get(url).json()["items"][0]

The parser groups this as a sequence of operations on a primary expression:
Name "client"
Attribute "session"
Attribute "get"
Call args=[Name "url"]
Attribute "json"
Call args=[]
Subscript "items"
Subscript 0

This is why calls, attributes, and subscripts bind more tightly than arithmetic operators:

x = f(a) + b.c

The calls and attributes are parsed before the addition expression is formed.
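The chain can be unwrapped by walking the AST (a quick sketch): each postfix node wraps the expression it applies to, so peeling them off one at a time reaches the innermost Name.

```python
import ast

node = ast.parse("client.session.get(url).json()['items'][0]", mode="eval").body
while not isinstance(node, ast.Name):
    print(type(node).__name__)
    # A Call wraps the thing being called (func); Attribute and
    # Subscript wrap the object they apply to (value).
    node = node.func if isinstance(node, ast.Call) else node.value
print(node.id)  # client
```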
19.9 Assignment Parsing
Assignment syntax is more constrained than expression syntax.
Valid assignment targets include:
x = 1
obj.attr = 1
items[0] = 1
a, b = pair

Invalid assignment targets include:
1 = x
a + b = x
f() = x

The parser must recognize that the left side of assignment is not just any expression. It must be a valid target.
This also affects augmented assignment:
x += 1
obj.value *= 2
items[i] -= 3

and annotated assignment:
x: int = 1
name: str

CPython’s parser and AST builder must preserve the difference between an expression used for reading and a target used for writing.
At the AST level, names carry context:
Name id="x", ctx=Load
Name id="x", ctx=Store
Name id="x", ctx=Del

Example:

x = x + 1

The left x is Store. The right x is Load.
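This can be confirmed with ast.parse:

```python
import ast

tree = ast.parse("x = 1 + x")
assign = tree.body[0]
print(type(assign.targets[0].ctx).__name__)   # Store: the target x
print(type(assign.value.right.ctx).__name__)  # Load: the x being read
```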
19.10 Ambiguity and Ordered Choice
PEG parsing uses ordered alternatives.
Consider syntax where multiple grammar alternatives begin similarly. The parser tries one alternative first. If it succeeds, that result is used. If it fails, later alternatives may be tried.
This gives grammar authors a controlled way to resolve ambiguity.
Simplified example:
small_stmt:
| assignment
| expression

The assignment alternative must be considered before the plain expression alternative, because an assignment often starts with an expression-like target.
Input:
x = 1

The prefix x could begin an expression, but the full statement is assignment. The grammar must be ordered and structured so that assignment wins.
PEG also supports a “cut” operator in CPython grammar notation. The language reference notes the ~ marker in the grammar as a cut that commits to the current alternative and fails the rule if the alternative then fails. (Python documentation)
A cut is useful for performance and diagnostics. Once the parser has seen enough tokens to know which construct it is parsing, it can stop considering unrelated alternatives.
19.11 Lookahead
Parsers often need to inspect upcoming tokens before deciding what to parse.
Lookahead asks: “Does the next token or next token sequence match this pattern?”
Example:
x: int

This could begin an annotated assignment.
Example:
x + y

This is an expression statement.
Both start with NAME. The parser needs more context than the first token.
Lookahead also helps reject syntax early and choose better error messages.
A simplified lookahead model:
if next token is NAME and following token is ":":
    parse annotated assignment
else:
    parse expression statement

Real CPython grammar uses richer mechanisms, but the mental model is the same: some decisions require peeking ahead without consuming tokens permanently.
19.12 Backtracking
PEG parsers can backtrack.
Backtracking means the parser tries a grammar path, fails, restores the previous token position, and tries another path.
Example model:
try parse async function definition
if it fails:
restore token position
try parse normal function definition
if it fails:
restore token position
try parse simple statement

Backtracking makes grammar authoring more flexible than a strict single-token lookahead parser.
But unrestricted backtracking can be expensive. CPython’s parser uses grammar structure, cuts, memoization, and targeted rules to keep parsing efficient.
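A toy sketch of backtracking with explicit position restore (not CPython's parser; the class and rule names are invented for illustration): each attempt records where it started and rewinds the token position on failure.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def expect(self, word):
        """Consume one token if it matches, else fail with None."""
        if self.pos < len(self.tokens) and self.tokens[self.pos] == word:
            self.pos += 1
            return word
        return None

    def try_rule(self, rule):
        start = self.pos          # remember where this attempt began
        result = rule()
        if result is None:
            self.pos = start      # backtrack: restore the token position
        return result

    def async_funcdef(self):
        if self.expect("async") and self.expect("def"):
            return "async function"
        return None

    def funcdef(self):
        if self.expect("def"):
            return "function"
        return None

    def statement(self):
        return self.try_rule(self.async_funcdef) or self.try_rule(self.funcdef)

print(Parser(["def", "f"]).statement())           # function
print(Parser(["async", "def", "g"]).statement())  # async function
```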
19.13 AST Construction
Parsing in CPython does not merely accept or reject syntax. It builds an AST.
Example source:
x = 1 + 2

AST shape:
Module
Assign
targets:
Name id="x", ctx=Store
value:
BinOp
left: Constant 1
op: Add
right: Constant 2

At Python level:
import ast
tree = ast.parse("x = 1 + 2\n")
print(ast.dump(tree, indent=4))

This produces a structured tree.
The AST is still not bytecode. It is a syntax tree with semantic annotations such as load/store context and source locations.
19.14 Source Locations
AST nodes carry source location information.
This includes line numbers and column offsets. Modern CPython also tracks end positions for many nodes.
Example:
import ast
tree = ast.parse("x = 1 + 2\n")
node = tree.body[0]
print(node.lineno)
print(node.col_offset)
print(node.end_lineno)
print(node.end_col_offset)

Source locations support:
tracebacks
syntax errors
debuggers
coverage tools
profilers
AST tools
editor integrations

Without accurate source positions, CPython could still execute code, but diagnostics and tooling quality would be much worse.
19.15 Syntax Errors
When parsing fails, CPython raises SyntaxError or a subclass such as IndentationError.
A syntax error should report:
filename
line number
column offset
source text
message

Example:
if x
    pass

The parser sees that the if statement is missing a colon.
A useful error points near the place where parsing became impossible:
SyntaxError: expected ':'

Parsing errors are not always simple. Because the parser may backtrack, it must remember the most useful failure location rather than only the final failed alternative.
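The reported fields can be inspected by catching the error (a quick check; the exact message text varies across CPython versions):

```python
import ast

try:
    ast.parse("if x\n    pass")
except SyntaxError as e:
    print(e.msg)               # e.g. expected ':' on recent versions
    print(e.lineno, e.offset)  # where parsing became impossible
```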
Good syntax errors are a major part of parser engineering.
19.16 Invalid Rules and Better Diagnostics
CPython’s grammar includes rules used specifically for better error messages.
A parser could reject bad syntax generically:
SyntaxError: invalid syntax

But CPython often tries to detect common mistakes:
if x = 1:
    pass

This can produce a more helpful message than a generic parse failure.
Invalid grammar rules are not intended to accept programs. They are used to recognize wrong forms and raise better diagnostics.
The practical model:
normal grammar accepts valid Python
invalid rules recognize common invalid Python
error machinery reports a targeted message

This keeps the parser strict while improving user experience.
19.17 Parser Output Is Not Fully Validated Yet
A successful parse means the source matches Python grammar. It does not mean the program is fully valid in every later sense.
Some checks occur after parsing.
Example:
return 1

At module level, this is syntactically shaped like a return statement, but it is invalid outside a function.
Another example:
break

At module level, this is grammatically a break statement, but it is invalid outside a loop.
These constraints are often checked during AST validation, symbol table analysis, or compilation.
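The split is easy to observe: ast.parse accepts a module-level return, while compile rejects it (a quick check; the exact error message may vary by version).

```python
import ast

tree = ast.parse("return 1")        # parsing succeeds: the tokens form a node
print(type(tree.body[0]).__name__)  # Return

try:
    compile("return 1", "<demo>", "exec")
except SyntaxError as e:
    print(e.msg)  # reported by a later check, not by the parser
```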
The phases are separate:
parser:
Can these tokens form a syntax tree?
later compiler checks:
Is this syntax legal in this context?
Which names are local, global, free, or cell variables?
Can this construct be compiled here?

19.18 ast.parse
The public ast.parse() function exposes CPython’s parsing machinery at Python level.
Example:
import ast
src = """
def add(a, b):
    return a + b
"""
tree = ast.parse(src)
print(ast.dump(tree, indent=4))

This is useful for:
linters
formatters
static analyzers
code generation tools
security scanners
refactoring tools
documentation tools

But ast.parse() returns an AST, not tokens and not bytecode.
For bytecode, use compile() and dis:
import dis
code = compile("x = 1 + 2\n", "<input>", "exec")
dis.dis(code)

The relation is:
tokenize.tokenize(): source to tokens
ast.parse(): source to AST
compile(): source or AST to code object
dis.dis(): code object to bytecode display

19.19 Parse Modes
Python source can be parsed in different modes.
Common modes:
| Mode | Meaning |
|---|---|
| exec | Parse a module or sequence of statements |
| eval | Parse a single expression |
| single | Parse interactive input |
Example:
compile("x = 1", "<input>", "exec")
compile("1 + 2", "<input>", "eval")
compile("print(1)", "<input>", "single")

eval mode rejects statements:

compile("x = 1", "<input>", "eval")

That fails because assignment statements are not expressions.
The parse mode determines the grammar start symbol.
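The same modes are available through ast.parse, which makes the different start symbols visible: eval mode yields an Expression node rather than a Module, and rejects statements.

```python
import ast

tree = ast.parse("1 + 2", mode="eval")
print(type(tree).__name__)  # Expression, not Module

try:
    ast.parse("x = 1", mode="eval")
except SyntaxError:
    print("statements are rejected in eval mode")
```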
19.20 F-Strings and the Parser
Modern CPython parses f-string expressions through the parser machinery.
Before Python 3.12, f-string expression parsing had more special-case behavior. Python 3.12 changed f-strings through PEP 701, bringing them into the grammar more fully. The Python 3.12 documentation notes that f-strings are parsed with the PEG parser, allowing more precise error messages, and that tokenize performance improved due to these changes. (Python documentation)
Example:
name = "Ada"
text = f"hello {name.upper()}"

The embedded expression:

name.upper()

is parsed as Python expression syntax.
This means f-strings are not just string interpolation at runtime. They contain nested syntax that must be parsed and represented structurally.
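The nested expression appears as a real Call node in the AST (a quick check with ast.parse): the f-string becomes a JoinedStr containing a FormattedValue.

```python
import ast

tree = ast.parse('f"hello {name.upper()}"', mode="eval")
joined = tree.body                     # JoinedStr node for the whole f-string
formatted = joined.values[1]           # FormattedValue for {name.upper()}
print(type(formatted.value).__name__)  # Call
```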
19.21 Parser Files in CPython
The important parser-related source areas include:
Grammar/python.gram
Parser/
Python/ast.c
Include/internal/
Lib/ast.py
Lib/test/test_grammar.py
Lib/test/test_syntax.py

The exact file boundaries can change, but the conceptual split is stable:
| Area | Role |
|---|---|
| Grammar | Defines Python syntax |
| Parser generator | Produces parser implementation from grammar |
| Parser runtime | Consumes tokens and applies grammar rules |
| AST construction | Builds AST nodes |
| AST validation | Checks structural correctness |
| Tests | Protect syntax and diagnostics |
When reading CPython, start with the grammar. Then follow how a grammar rule creates AST nodes.
19.22 Example: Parsing a Function
Source:
def square(x):
    return x * x

Token-level view:
NAME "def"
NAME "square"
LPAR "("
NAME "x"
RPAR ")"
COLON ":"
NEWLINE
INDENT
NAME "return"
NAME "x"
STAR "*"
NAME "x"
NEWLINE
DEDENT
ENDMARKER

AST-level view:
Module
FunctionDef
name: "square"
args:
arguments
arg: "x"
body:
Return
BinOp
left: Name "x", Load
op: Mult
right: Name "x", Load

The parser creates this hierarchy because the grammar says:
a module contains statements
a function definition is a statement
a function body is a block
a return statement can contain an expression
a multiplication expression contains two operands
a name expression loads a value

Each level gives meaning to the flat token stream.
19.23 Example: Parsing an if Statement
Source:
if score >= 60:
    result = "pass"
else:
    result = "fail"

AST shape:
Module
If
test:
Compare
left: Name "score"
op: GtE
comparator: Constant 60
body:
Assign
target: Name "result", Store
value: Constant "pass"
orelse:
Assign
target: Name "result", Store
value: Constant "fail"

The parser must attach else to the correct if.
Indentation tokens help here. The else appears after the DEDENT ending the first block, but at the same indentation level as the if.
This is one place where Python’s layout-sensitive syntax interacts directly with parser structure.
19.24 Example: Parsing a Comprehension
Source:
squares = [x * x for x in values if x > 0]

AST shape:
Assign
target: Name "squares", Store
value:
ListComp
element:
BinOp
Name "x"
Mult
Name "x"
generators:
comprehension
target: Name "x", Store
iter: Name "values", Load
ifs:
Compare
Name "x"
Gt
Constant 0

Comprehensions are expression syntax, but they contain assignment-like targets and nested clauses.
The parser must distinguish:
[x * x for x in values]

from a list literal:

[x * x, y]

Both begin with [ and an expression. The token sequence after the first expression determines which structure the parser builds.
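The difference in the resulting node kind can be checked directly:

```python
import ast

comp = ast.parse("[x * x for x in values]", mode="eval").body
lit = ast.parse("[x * x, y]", mode="eval").body
print(type(comp).__name__)  # ListComp
print(type(lit).__name__)   # List
```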
19.25 Parser Boundaries
The parser does not execute code.
It does not call functions:
f()

It only builds a call node.
It does not import modules:
import os

It only builds an import node.
It does not resolve names:
x + y

It only builds name and binary operation nodes.
It does not evaluate constants beyond what the AST layer represents.
The parser’s job is structural. Later phases assign scope, generate bytecode, and execute operations.
19.26 Minimal Mental Model
Use this model:
The tokenizer emits a flat stream of tokens.
The parser matches that stream against Python grammar.
CPython’s modern parser is PEG-based.
Grammar alternatives are ordered.
The parser builds an AST.
The AST records statements, expressions, contexts, and source locations.
Syntax errors are raised when the token stream cannot form valid grammar.
Some validity checks happen after parsing.

Parsing is where source text first becomes tree-shaped. Everything after this stage works with structure rather than raw token order.