19. Parsing

CPython's PEG parser, Grammar/python.gram, the generated Parser/parser.c, and parse tree construction.

Parsing is the stage that turns a token stream into syntax structure.

The tokenizer recognizes lexical units. The parser recognizes grammatical structure. It decides whether a sequence of tokens is a valid Python program and, when it is valid, builds an abstract syntax tree.

For this source:

def add(a, b):
    return a + b

the tokenizer produces tokens such as:

NAME        "def"
NAME        "add"
LPAR        "("
NAME        "a"
COMMA       ","
NAME        "b"
RPAR        ")"
COLON       ":"
NEWLINE     "\n"
INDENT      "    "
NAME        "return"
NAME        "a"
PLUS        "+"
NAME        "b"
NEWLINE     "\n"
DEDENT      ""
ENDMARKER   ""

The parser gives those tokens structure:

Module
    FunctionDef name="add"
        arguments
            arg "a"
            arg "b"
        body
            Return
                BinOp
                    Name "a"
                    Add
                    Name "b"

The difference is important. Tokenization knows that def, add, (, a, and : exist. Parsing knows that together they form a function definition.
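Both views can be inspected from Python itself, using the tokenize module for the lexical stream and the ast module for the structure:

```python
import ast
import io
import tokenize

src = "def add(a, b):\n    return a + b\n"

# Lexical view: a flat stream of tokens.
names = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(src).readline)
]
print(names[:5])  # the first few token kinds, starting with NAME for "def"

# Structural view: a tree of AST nodes.
tree = ast.parse(src)
func = tree.body[0]
print(type(func).__name__, func.name)  # FunctionDef add
```

The token list is flat; the AST already knows that these tokens form a function definition named add.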

19.1 Position in the Compilation Pipeline

Parsing sits between tokenization and semantic compilation.

source bytes
encoding detection
tokenization
parsing
abstract syntax tree
symbol table
compiler
code object
bytecode execution

The parser consumes tokens and produces an AST. It does not allocate local variable slots, decide closure layout, generate bytecode, run type checks, or execute code.

Its main responsibilities are:

recognize valid Python grammar
reject invalid syntax
group tokens into statements and expressions
apply operator precedence
handle indentation-based blocks
build AST nodes
produce useful syntax errors

The Python language reference exposes the full grammar used by CPython, derived from Grammar/python.gram, while omitting implementation details related to code generation and error recovery. (Python documentation)

19.2 CPython Uses a PEG Parser

Modern CPython uses a PEG parser.

PEG means Parsing Expression Grammar. CPython switched from its older LL(1) parser to a PEG parser in Python 3.9. The PEG parser allows a more direct expression of Python’s grammar and gives CPython more flexibility when adding syntax.

A PEG parser works by trying grammar alternatives in order. The first alternative that succeeds is selected.

A simplified rule might look like:

statement:
    | function_def
    | class_def
    | if_stmt
    | while_stmt
    | simple_stmt

Given input tokens, the parser attempts to match a statement. If the first alternative fails, it tries the next, and so on.

This ordered choice is different from some context-free grammar systems where alternatives are unordered. In PEG, order matters.

19.3 Grammar Rules

The grammar defines valid token sequences.

A simplified function definition rule can be imagined like this:

function_def:
    "def" NAME "(" parameters? ")" ":" block

This means:

the keyword def
then a function name
then an opening parenthesis
then optional parameters
then a closing parenthesis
then a colon
then a block

The real CPython grammar is more detailed because it supports decorators, async functions, annotations, positional-only parameters, keyword-only parameters, default values, type parameters, return annotations, and error recovery.

Still, the basic model remains:

tokens are matched against grammar rules
rules call other rules
successful rules produce AST structure
failed rules produce backtracking or syntax errors

19.4 Statements and Expressions

Python syntax has statements and expressions.

Statements perform actions:

x = 1
return x
if x:
    print(x)

Expressions produce values:

x + 1
f(a, b)
items[0]
lambda x: x + 1

The parser must distinguish them because not every expression is allowed everywhere, and not every statement can appear inside an expression.

Valid:

if ready:
    run()

Invalid:

x = if ready: run()

Valid:

value = a if ready else b

The grammar encodes these distinctions. if as a statement belongs to one grammar region. Conditional expressions belong to another.
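The distinction is observable with ast.parse: the conditional expression parses into an IfExp node, while the statement form in expression position is rejected outright:

```python
import ast

# A conditional expression is valid inside an assignment.
assign = ast.parse("value = a if ready else b").body[0]
print(type(assign.value).__name__)  # IfExp

# An if *statement* cannot appear where an expression is required.
try:
    ast.parse("x = if ready: run()")
except SyntaxError as exc:
    print("rejected:", exc.msg)
```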

19.5 Blocks and Suites

Python blocks are represented by INDENT and DEDENT tokens.

The parser does not count spaces directly. The tokenizer already converted indentation into structural tokens.

Example:

if x:
    a = 1
    b = 2
c = 3

Token shape:

NAME      "if"
NAME      "x"
COLON     ":"
NEWLINE
INDENT
NAME      "a"
EQUAL
NUMBER    "1"
NEWLINE
NAME      "b"
EQUAL
NUMBER    "2"
NEWLINE
DEDENT
NAME      "c"
EQUAL
NUMBER    "3"
NEWLINE
ENDMARKER

The parser sees the block as a suite:

If
    test: Name "x"
    body:
        Assign a = 1
        Assign b = 2
Module body:
    Assign c = 3

This design keeps indentation handling out of most grammar rules. The grammar can say “block” and rely on tokenizer-produced INDENT and DEDENT tokens.
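The INDENT and DEDENT tokens for the example above can be observed directly with the tokenize module:

```python
import io
import tokenize

src = "if x:\n    a = 1\n    b = 2\nc = 3\n"

kinds = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.generate_tokens(io.StringIO(src).readline)
]

print("INDENT" in kinds, "DEDENT" in kinds)  # True True
# Every INDENT is balanced by a matching DEDENT.
assert kinds.count("INDENT") == kinds.count("DEDENT")
```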

19.6 Expression Parsing and Precedence

Parsing expressions requires precedence.

Consider:

x = 1 + 2 * 3

The parser must produce:

1 + (2 * 3)

not:

(1 + 2) * 3

The AST shape is:

Assign
    target: Name "x"
    value:
        BinOp Add
            left: Constant 1
            right:
                BinOp Mult
                    left: Constant 2
                    right: Constant 3

Grammar rules encode precedence by layering expressions.

A simplified model:

expression
    conditional expression

sum
    term
    sum "+" term
    sum "-" term

term
    factor
    term "*" factor
    term "/" factor

factor
    atom
    "-" factor
    "+" factor

atom
    NAME
    NUMBER
    STRING
    "(" expression ")"

Higher-precedence constructs appear deeper in the grammar. Multiplication binds more tightly than addition because addition rules consume multiplication-level expressions as operands.
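The resulting AST shape can be verified with ast.parse: the addition is the top-level node, and the multiplication is nested inside its right operand:

```python
import ast

binop = ast.parse("x = 1 + 2 * 3").body[0].value

print(type(binop.op).__name__)        # Add
print(type(binop.right).__name__)     # BinOp
print(type(binop.right.op).__name__)  # Mult
```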

19.7 Associativity

Precedence decides which operators bind more tightly. Associativity decides how operators of the same precedence group.

Example:

x = a - b - c

This parses as:

(a - b) - c

not:

a - (b - c)

Subtraction is left-associative.

But exponentiation behaves differently:

x = a ** b ** c

This parses as:

a ** (b ** c)

The grammar must encode this explicitly. Associativity is not a runtime property. It is fixed by parse structure before evaluation begins.
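Both associativities show up directly in the AST: the nesting side of the BinOp tells you how operands grouped:

```python
import ast

# Subtraction is left-associative: the *left* operand is itself a BinOp.
sub = ast.parse("a - b - c").body[0].value
print(type(sub.left).__name__)   # BinOp  -> (a - b)
print(type(sub.right).__name__)  # Name   -> c

# Exponentiation is right-associative: the *right* operand nests.
pow_ = ast.parse("a ** b ** c").body[0].value
print(type(pow_.left).__name__)   # Name   -> a
print(type(pow_.right).__name__)  # BinOp  -> (b ** c)
```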

19.8 Calls, Attributes, and Subscripts

Python has postfix expression forms:

obj.attr
obj.method(x)
items[i]
items[i:j]
f(x)(y).z

These chain tightly.

Example:

result = client.session.get(url).json()["items"][0]

The parser groups this as a sequence of operations on a primary expression:

Name "client"
    Attribute "session"
    Attribute "get"
    Call args=[Name "url"]
    Attribute "json"
    Call args=[]
    Subscript "items"
    Subscript 0

This is why calls, attributes, and subscripts bind more tightly than arithmetic operators:

x = f(a) + b.c

The calls and attributes are parsed before the addition expression is formed.
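This ordering is visible in the AST: by the time the BinOp is built, its operands are already complete Call and Attribute nodes:

```python
import ast

add = ast.parse("x = f(a) + b.c").body[0].value

print(type(add).__name__)        # BinOp
print(type(add.left).__name__)   # Call       -> f(a)
print(type(add.right).__name__)  # Attribute  -> b.c
```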

19.9 Assignment Parsing

Assignment syntax is more constrained than expression syntax.

Valid assignment targets include:

x = 1
obj.attr = 1
items[0] = 1
a, b = pair

Invalid assignment targets include:

1 = x
a + b = x
f() = x

The parser must recognize that the left side of assignment is not just any expression. It must be a valid target.

This also affects augmented assignment:

x += 1
obj.value *= 2
items[i] -= 3

and annotated assignment:

x: int = 1
name: str

CPython’s parser and AST builder must preserve the difference between an expression used for reading and a target used for writing.

At the AST level, names carry context:

Name id="x", ctx=Load
Name id="x", ctx=Store
Name id="x", ctx=Del

Example:

x = x + 1

The left x is Store. The right x is Load.
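Both contexts can be read off the parsed tree:

```python
import ast

assign = ast.parse("x = x + 1").body[0]

left = assign.targets[0]    # the x being written
right = assign.value.left   # the x being read

print(type(left.ctx).__name__)   # Store
print(type(right.ctx).__name__)  # Load
```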

19.10 Ambiguity and Ordered Choice

PEG parsing uses ordered alternatives.

Consider syntax where multiple grammar alternatives begin similarly. The parser tries one alternative first. If it succeeds, that result is used. If it fails, later alternatives may be tried.

This gives grammar authors a controlled way to resolve ambiguity.

Simplified example:

small_stmt:
    | assignment
    | expression

The assignment alternative must be considered before the plain expression alternative, because an assignment often starts with an expression-like target.

Input:

x = 1

The prefix x could begin an expression, but the full statement is assignment. The grammar must be ordered and structured so that assignment wins.

PEG also supports a “cut” operator in CPython grammar notation. The language reference describes the ~ marker as a cut: once the parser crosses it, it commits to the current alternative, and if the rest of that alternative fails, the whole rule fails instead of trying later alternatives. (Python documentation)

A cut is useful for performance and diagnostics. Once the parser has seen enough tokens to know which construct it is parsing, it can stop considering unrelated alternatives.

19.11 Lookahead

Parsers often need to inspect upcoming tokens before deciding what to parse.

Lookahead asks: “Does the next token or next token sequence match this pattern?”

Example:

x: int

This could begin an annotated assignment.

Example:

x + y

This is an expression statement.

Both start with NAME. The parser needs more context than the first token.

Lookahead also helps reject syntax early and choose better error messages.

A simplified lookahead model:

if next token is NAME and following token is ":":
    parse annotated assignment
else:
    parse expression statement

Real CPython grammar uses richer mechanisms, but the mental model is the same: some decisions require peeking ahead without consuming tokens permanently.
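That mental model can be sketched as a toy token stream with a peek operation. This is illustrative only, not CPython's actual machinery; the token strings and the classify helper are invented for the example:

```python
# Toy lookahead model: peek at upcoming tokens without consuming them.
class TokenStream:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self, offset=0):
        # Inspect a token ahead of the current position, non-destructively.
        i = self.pos + offset
        return self.tokens[i] if i < len(self.tokens) else None

def classify(stream):
    # NAME followed by ":" suggests an annotated assignment;
    # anything else falls back to an expression statement.
    if stream.peek(0) == "NAME" and stream.peek(1) == ":":
        return "annotated assignment"
    return "expression statement"

print(classify(TokenStream(["NAME", ":", "NAME"])))  # annotated assignment
print(classify(TokenStream(["NAME", "+", "NAME"])))  # expression statement
```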

19.12 Backtracking

PEG parsers can backtrack.

Backtracking means the parser tries a grammar path, fails, restores the previous token position, and tries another path.

Example model:

try parse async function definition
    if it fails:
        restore token position
try parse normal function definition
    if it fails:
        restore token position
try parse simple statement

Backtracking makes grammar authoring more flexible than a strict single-token lookahead parser.

But unrestricted backtracking can be expensive. CPython’s parser uses grammar structure, cuts, memoization, and targeted rules to keep parsing efficient.
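Ordered choice with backtracking can be sketched in a few lines. Again, this is a toy model, not CPython's generated parser; the rule functions are invented for illustration:

```python
# Toy PEG-style ordered choice: try alternatives in order, and on failure
# return to the saved position so the next alternative starts fresh.
def parse_first(alternatives, tokens, pos):
    for alt in alternatives:
        result = alt(tokens, pos)  # each attempt starts at the same pos
        if result is not None:
            return result  # first success wins; later alternatives are skipped
    return None  # every alternative failed

def match_keyword(word):
    def rule(tokens, pos):
        if pos < len(tokens) and tokens[pos] == word:
            return (word, pos + 1)  # success: matched value and new position
        return None  # failure: the caller effectively backtracks
    return rule

tokens = ["def"]
result = parse_first(
    [match_keyword("async"), match_keyword("def"), match_keyword("class")],
    tokens, 0,
)
print(result)  # ('def', 1)
```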

19.13 AST Construction

Parsing in CPython does not merely accept or reject syntax. It builds an AST.

Example source:

x = 1 + 2

AST shape:

Module
    Assign
        targets:
            Name id="x", ctx=Store
        value:
            BinOp
                left: Constant 1
                op: Add
                right: Constant 2

At Python level:

import ast

tree = ast.parse("x = 1 + 2\n")
print(ast.dump(tree, indent=4))

This produces a structured tree.

The AST is still not bytecode. It is a syntax tree with semantic annotations such as load/store context and source locations.

19.14 Source Locations

AST nodes carry source location information.

This includes line numbers and column offsets. Modern CPython also tracks end positions for many nodes.

Example:

import ast

tree = ast.parse("x = 1 + 2\n")
node = tree.body[0]

print(node.lineno)
print(node.col_offset)
print(node.end_lineno)
print(node.end_col_offset)

Source locations support:

tracebacks
syntax errors
debuggers
coverage tools
profilers
AST tools
editor integrations

Without accurate source positions, CPython could still execute code, but diagnostics and tooling quality would be much worse.

19.15 Syntax Errors

When parsing fails, CPython raises SyntaxError or a subclass such as IndentationError.

A syntax error should report:

filename
line number
column offset
source text
message

Example:

if x
    pass

The parser sees that the if statement is missing a colon.

A useful error points near the place where parsing became impossible:

SyntaxError: expected ':'

Parsing errors are not always simple. Because the parser may backtrack, it must remember the most useful failure location rather than only the final failed alternative.

Good syntax errors are a major part of parser engineering.
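The reported fields are accessible on the exception object. The exact message text varies across CPython versions, so only the location fields are relied on here:

```python
import ast

err = None
try:
    ast.parse("if x\n    pass\n")
except SyntaxError as exc:
    err = exc

# The exception carries the data a good diagnostic needs.
print(err.msg)                 # e.g. "expected ':'" on recent versions
print(err.lineno, err.offset)  # line and column of the failure
print(err.text)                # the offending source line
```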

19.16 Invalid Rules and Better Diagnostics

CPython’s grammar includes rules used specifically for better error messages.

A parser could reject bad syntax generically:

SyntaxError: invalid syntax

But CPython often tries to detect common mistakes:

if x = 1:
    pass

This can produce a more helpful message than a generic parse failure.

Invalid grammar rules are not intended to accept programs. They are used to recognize wrong forms and raise better diagnostics.

The practical model:

normal grammar accepts valid Python
invalid rules recognize common invalid Python
error machinery reports a targeted message

This keeps the parser strict while improving user experience.
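The effect is easy to observe. For the `if x = 1:` example, recent CPython versions produce a targeted message rather than a bare "invalid syntax"; the exact wording varies by version, so this sketch only prints it:

```python
import ast

err = None
try:
    ast.parse("if x = 1:\n    pass\n")
except SyntaxError as exc:
    err = exc

# On recent versions this suggests '==' or ':=' instead of '='.
print(err.msg)
```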

19.17 Parser Output Is Not Fully Validated Yet

A successful parse means the source matches Python grammar. It does not mean the program is fully valid in every later sense.

Some checks occur after parsing.

Example:

return 1

At module level, this is syntactically shaped like a return statement, but it is invalid outside a function.

Another example:

break

At module level, this is grammatically a break statement, but it is invalid outside a loop.

These constraints are often checked during AST validation, symbol table analysis, or compilation.

The phases are separate:

parser:
    Can these tokens form a syntax tree?

later compiler checks:
    Is this syntax legal in this context?
    Which names are local, global, free, or cell variables?
    Can this construct be compiled here?

19.18 ast.parse

The public ast.parse() function exposes CPython’s parsing machinery at Python level.

Example:

import ast

src = """
def add(a, b):
    return a + b
"""

tree = ast.parse(src)
print(ast.dump(tree, indent=4))

This is useful for:

linters
formatters
static analyzers
code generation tools
security scanners
refactoring tools
documentation tools

But ast.parse() returns an AST, not tokens and not bytecode.

For bytecode, use compile() and dis:

import dis

code = compile("x = 1 + 2\n", "<input>", "exec")
dis.dis(code)

The relation is:

tokenize.tokenize()    source to tokens
ast.parse()            source to AST
compile()              source or AST to code object
dis.dis()              code object to bytecode display

19.19 Parse Modes

Python source can be parsed in different modes.

Common modes:

Mode     Meaning
exec     Parse a module or sequence of statements
eval     Parse a single expression
single   Parse interactive input

Example:

compile("x = 1", "<input>", "exec")
compile("1 + 2", "<input>", "eval")
compile("print(1)", "<input>", "single")

eval mode rejects statements:

compile("x = 1", "<input>", "eval")

That fails because assignment statements are not expressions.

The parse mode determines the grammar start symbol.

19.20 F-Strings and the Parser

Modern CPython parses f-string expressions through the parser machinery.

Before Python 3.12, f-string expression parsing had more special-case behavior. Python 3.12 changed f-strings through PEP 701, bringing them into the grammar more fully. The Python 3.12 documentation notes that f-strings are parsed with the PEG parser, allowing more precise error messages, and that tokenize performance improved due to these changes. (Python documentation)

Example:

name = "Ada"
text = f"hello {name.upper()}"

The embedded expression:

name.upper()

is parsed as Python expression syntax.

This means f-strings are not just string interpolation at runtime. They contain nested syntax that must be parsed and represented structurally.
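The nested structure is visible in the AST: the f-string becomes a JoinedStr node whose parts include a FormattedValue wrapping a real Call subtree:

```python
import ast

joined = ast.parse('f"hello {name.upper()}"').body[0].value

print(type(joined).__name__)  # JoinedStr
for part in joined.values:
    print(type(part).__name__)  # Constant, then FormattedValue

# The embedded expression is a full AST subtree, not a string.
formatted = joined.values[1]
print(type(formatted.value).__name__)  # Call -> name.upper()
```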

19.21 Parser Files in CPython

The important parser-related source areas include:

Grammar/python.gram
Parser/
Python/ast.c
Include/internal/
Lib/ast.py
Lib/test/test_grammar.py
Lib/test/test_syntax.py

The exact file boundaries can change, but the conceptual split is stable:

Area               Role
Grammar            Defines Python syntax
Parser generator   Produces the parser implementation from the grammar
Parser runtime     Consumes tokens and applies grammar rules
AST construction   Builds AST nodes
AST validation     Checks structural correctness
Tests              Protect syntax and diagnostics

When reading CPython, start with the grammar. Then follow how a grammar rule creates AST nodes.

19.22 Example: Parsing a Function

Source:

def square(x):
    return x * x

Token-level view:

NAME      "def"
NAME      "square"
LPAR      "("
NAME      "x"
RPAR      ")"
COLON     ":"
NEWLINE
INDENT
NAME      "return"
NAME      "x"
STAR      "*"
NAME      "x"
NEWLINE
DEDENT
ENDMARKER

AST-level view:

Module
    FunctionDef
        name: "square"
        args:
            arguments
                arg: "x"
        body:
            Return
                BinOp
                    left: Name "x", Load
                    op: Mult
                    right: Name "x", Load

The parser creates this hierarchy because the grammar says:

a module contains statements
a function definition is a statement
a function body is a block
a return statement can contain an expression
a multiplication expression contains two operands
a name expression loads a value

Each level gives meaning to the flat token stream.

19.23 Example: Parsing an if Statement

Source:

if score >= 60:
    result = "pass"
else:
    result = "fail"

AST shape:

Module
    If
        test:
            Compare
                left: Name "score"
                op: GtE
                comparators: Constant 60
        body:
            Assign
                target: Name "result", Store
                value: Constant "pass"
        orelse:
            Assign
                target: Name "result", Store
                value: Constant "fail"

The parser must attach else to the correct if.

Indentation tokens help here. The else appears after the DEDENT ending the first block, but at the same indentation level as the if.

This is one place where Python’s layout-sensitive syntax interacts directly with parser structure.

19.24 Example: Parsing a Comprehension

Source:

squares = [x * x for x in values if x > 0]

AST shape:

Assign
    target: Name "squares", Store
    value:
        ListComp
            elt:
                BinOp
                    Name "x"
                    Mult
                    Name "x"
            generators:
                comprehension
                    target: Name "x", Store
                    iter: Name "values", Load
                    ifs:
                        Compare
                            Name "x"
                            Gt
                            Constant 0

Comprehensions are expression syntax, but they contain assignment-like targets and nested clauses.

The parser must distinguish:

[x * x for x in values]

from a list literal:

[x * x, y]

Both begin with [ and an expression. The token sequence after the first expression determines which structure the parser builds.
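The two shapes diverge in the AST even though both start the same way:

```python
import ast

comp = ast.parse("[x * x for x in values]").body[0].value
lit = ast.parse("[x * x, y]").body[0].value

print(type(comp).__name__)  # ListComp
print(type(lit).__name__)   # List
```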

19.25 Parser Boundaries

The parser does not execute code.

It does not call functions:

f()

It only builds a call node.

It does not import modules:

import os

It only builds an import node.

It does not resolve names:

x + y

It only builds name and binary operation nodes.

It does not evaluate constants beyond what the AST layer represents.

The parser’s job is structural. Later phases assign scope, generate bytecode, and execute operations.

19.26 Minimal Mental Model

Use this model:

The tokenizer emits a flat stream of tokens.
The parser matches that stream against Python grammar.
CPython’s modern parser is PEG-based.
Grammar alternatives are ordered.
The parser builds an AST.
The AST records statements, expressions, contexts, and source locations.
Syntax errors are raised when the token stream cannot form valid grammar.
Some validity checks happen after parsing.

Parsing is where source text first becomes tree-shaped. Everything after this stage works with structure rather than raw token order.