# 19. Parsing

Parsing is the stage that turns a token stream into syntax structure.

The tokenizer recognizes lexical units. The parser recognizes grammatical structure. It decides whether a sequence of tokens is a valid Python program and, when it is valid, builds an abstract syntax tree.

For this source:

```python
def add(a, b):
    return a + b
```

the tokenizer produces tokens such as:

```text
NAME        "def"
NAME        "add"
LPAR        "("
NAME        "a"
COMMA       ","
NAME        "b"
RPAR        ")"
COLON       ":"
NEWLINE     "\n"
INDENT      "    "
NAME        "return"
NAME        "a"
PLUS        "+"
NAME        "b"
NEWLINE     "\n"
DEDENT      ""
ENDMARKER   ""
```

The parser gives those tokens structure:

```text
Module
    FunctionDef name="add"
        arguments
            arg "a"
            arg "b"
        body
            Return
                BinOp
                    Name "a"
                    Add
                    Name "b"
```

The difference is important. Tokenization knows that `def`, `add`, `(`, `a`, and `:` exist. Parsing knows that together they form a function definition.
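The token stream above can be observed directly with the stdlib `tokenize` module. This is a small sketch; note that `tokenize` reports operator tokens like `(` under the generic `OP` name rather than the C tokenizer's internal names such as `LPAR`.

```python
# Inspect the token stream for the add() example using the stdlib
# tokenize module. Token names differ slightly from the C tokenizer's
# internal names shown above (e.g. OP instead of LPAR).
import io
import tokenize

src = "def add(a, b):\n    return a + b\n"
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))

for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

The flat stream, including `INDENT`, `DEDENT`, and `ENDMARKER`, is exactly what the parser consumes.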

## 19.1 Position in the Compilation Pipeline

Parsing sits between tokenization and semantic compilation.

```text
source bytes
    ↓
encoding detection
    ↓
tokenization
    ↓
parsing
    ↓
abstract syntax tree
    ↓
symbol table
    ↓
compiler
    ↓
code object
    ↓
bytecode execution
```

The parser consumes tokens and produces an AST. It does not allocate local variable slots, decide closure layout, generate bytecode, run type checks, or execute code.

Its main responsibilities are:

```text
recognize valid Python grammar
reject invalid syntax
group tokens into statements and expressions
apply operator precedence
handle indentation-based blocks
build AST nodes
produce useful syntax errors
```

The Python language reference exposes the full grammar used by CPython, derived from `Grammar/python.gram`, while omitting implementation details related to code generation and error recovery. ([Python documentation][1])

## 19.2 CPython Uses a PEG Parser

Modern CPython uses a PEG parser.

PEG stands for Parsing Expression Grammar. CPython switched from its older LL(1) parser to a PEG parser in Python 3.9 (PEP 617). The PEG parser allows a more direct expression of Python’s grammar and gives CPython more flexibility when adding syntax.

A PEG parser works by trying grammar alternatives in order. The first alternative that succeeds is selected.

A simplified rule might look like:

```text
statement:
    | function_def
    | class_def
    | if_stmt
    | while_stmt
    | simple_stmt
```

Given input tokens, the parser attempts to match a statement. If the first alternative fails, it tries the next, and so on.

This ordered choice is different from some context-free grammar systems where alternatives are unordered. In PEG, order matters.
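Ordered choice can be sketched in a few lines. This is a toy illustration, not CPython's parser: each hypothetical alternative is tried at the same position, the first success wins, and failure leaves the position unchanged so the next alternative can try.

```python
# Toy illustration of PEG ordered choice (not CPython's parser):
# alternatives are tried in order; the first success wins, and a
# failed alternative restores the token position (backtracking).

def ordered_choice(tokens, pos, alternatives):
    """Try each alternative parser in order; return the first success."""
    for alt in alternatives:
        result, new_pos = alt(tokens, pos)
        if result is not None:
            return result, new_pos   # first match wins; later alts ignored
    return None, pos                 # all alternatives failed: backtrack

# Hypothetical alternatives: match a number token or a name token.
def number(tokens, pos):
    if pos < len(tokens) and tokens[pos].isdigit():
        return ("Number", tokens[pos]), pos + 1
    return None, pos

def name(tokens, pos):
    if pos < len(tokens) and tokens[pos].isidentifier():
        return ("Name", tokens[pos]), pos + 1
    return None, pos

print(ordered_choice(["42"], 0, [number, name]))  # Number wins
print(ordered_choice(["x"], 0, [number, name]))   # Number fails, Name wins
```

Because `number` is listed first, it always gets the first chance; reordering the list changes which parse wins, which is exactly why order matters in PEG.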

## 19.3 Grammar Rules

The grammar defines valid token sequences.

A simplified function definition rule can be imagined like this:

```text
function_def:
    "def" NAME "(" parameters? ")" ":" block
```

This means:

```text
the keyword def
then a function name
then an opening parenthesis
then optional parameters
then a closing parenthesis
then a colon
then a block
```

The real CPython grammar is more detailed because it supports decorators, async functions, annotations, positional-only parameters, keyword-only parameters, default values, type parameters, return annotations, and error recovery.

Still, the basic model remains:

```text
tokens are matched against grammar rules
rules call other rules
successful rules produce AST structure
failed rules produce backtracking or syntax errors
```

## 19.4 Statements and Expressions

Python syntax has statements and expressions.

Statements perform actions:

```python
x = 1
return x
if x:
    print(x)
```

Expressions produce values:

```python
x + 1
f(a, b)
items[0]
lambda x: x + 1
```

The parser must distinguish them because not every expression is allowed everywhere, and not every statement can appear inside an expression.

Valid:

```python
if ready:
    run()
```

Invalid:

```python
x = if ready: run()
```

Valid:

```python
value = a if ready else b
```

The grammar encodes these distinctions. `if` as a statement belongs to one grammar region. Conditional expressions belong to another.
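The AST records this distinction directly. A minimal check: an expression used as a statement is wrapped in an `Expr` node, assignment is its own statement node, and the conditional expression is an `IfExp` expression node.

```python
# The AST preserves the statement/expression split.
import ast

expr_stmt = ast.parse("x + 1").body[0]                    # expression statement
assign_stmt = ast.parse("x = 1").body[0]                  # assignment statement
cond_expr = ast.parse("a if ready else b", mode="eval").body  # expression

print(type(expr_stmt).__name__)    # Expr
print(type(assign_stmt).__name__)  # Assign
print(type(cond_expr).__name__)    # IfExp
```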

## 19.5 Blocks and Suites

Python blocks are represented by `INDENT` and `DEDENT` tokens.

The parser does not count spaces directly. The tokenizer already converted indentation into structural tokens.

Example:

```python
if x:
    a = 1
    b = 2
c = 3
```

Token shape:

```text
NAME      "if"
NAME      "x"
COLON     ":"
NEWLINE
INDENT
NAME      "a"
EQUAL
NUMBER    "1"
NEWLINE
NAME      "b"
EQUAL
NUMBER    "2"
NEWLINE
DEDENT
NAME      "c"
EQUAL
NUMBER    "3"
NEWLINE
ENDMARKER
```

The parser sees the block as a suite:

```text
If
    test: Name "x"
    body:
        Assign a = 1
        Assign b = 2
Module body:
    Assign c = 3
```

This design keeps indentation handling out of most grammar rules. The grammar can say “block” and rely on tokenizer-produced `INDENT` and `DEDENT` tokens.

## 19.6 Expression Parsing and Precedence

Parsing expressions requires precedence.

Consider:

```python
x = 1 + 2 * 3
```

The parser must produce:

```text
1 + (2 * 3)
```

not:

```text
(1 + 2) * 3
```

The AST shape is:

```text
Assign
    target: Name "x"
    value:
        BinOp Add
            left: Constant 1
            right:
                BinOp Mult
                    left: Constant 2
                    right: Constant 3
```

Grammar rules encode precedence by layering expressions.

A simplified model:

```text
expression
    conditional expression

sum
    term
    sum "+" term
    sum "-" term

term
    factor
    term "*" factor
    term "/" factor

factor
    atom
    "-" factor
    "+" factor

atom
    NAME
    NUMBER
    STRING
    "(" expression ")"
```

Higher-precedence constructs appear deeper in the grammar. Multiplication binds more tightly than addition because addition rules consume multiplication-level expressions as operands.
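The precedence shape can be confirmed with `ast`: the outer `Add` node's right operand is the entire multiplication, matching `1 + (2 * 3)`.

```python
# Confirm that multiplication nests inside addition in the AST.
import ast

tree = ast.parse("x = 1 + 2 * 3")
binop = tree.body[0].value           # the right-hand side of the assignment

print(type(binop.op).__name__)         # Add
print(type(binop.left).__name__)       # Constant (1)
print(type(binop.right).__name__)      # BinOp (the nested 2 * 3)
print(type(binop.right.op).__name__)   # Mult
```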

## 19.7 Associativity

Precedence decides which operators bind more tightly. Associativity decides how operators of the same precedence group.

Example:

```python
x = a - b - c
```

This parses as:

```text
(a - b) - c
```

not:

```text
a - (b - c)
```

Subtraction is left-associative.

But exponentiation behaves differently:

```python
x = a ** b ** c
```

This parses as:

```text
a ** (b ** c)
```

The grammar must encode this explicitly. Associativity is not a runtime property. It is fixed by parse structure before evaluation begins.
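Both groupings are visible in the AST. For left-associative subtraction, the nested `BinOp` is the left child of the outer node; for right-associative exponentiation, it is the right child.

```python
# Associativity is fixed by parse structure, visible in the AST.
import ast

left_assoc = ast.parse("a - b - c", mode="eval").body
right_assoc = ast.parse("a ** b ** c", mode="eval").body

# Subtraction: the nested BinOp is on the LEFT -> (a - b) - c
print(type(left_assoc.left).__name__)    # BinOp
print(type(left_assoc.right).__name__)   # Name

# Exponentiation: the nested BinOp is on the RIGHT -> a ** (b ** c)
print(type(right_assoc.left).__name__)   # Name
print(type(right_assoc.right).__name__)  # BinOp
```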

## 19.8 Calls, Attributes, and Subscripts

Python has postfix expression forms:

```python
obj.attr
obj.method(x)
items[i]
items[i:j]
f(x)(y).z
```

These chain tightly.

Example:

```python
result = client.session.get(url).json()["items"][0]
```

The parser groups this as a sequence of operations on a primary expression:

```text
Name "client"
    Attribute "session"
    Attribute "get"
    Call args=[Name "url"]
    Attribute "json"
    Call args=[]
    Subscript "items"
    Subscript 0
```

This is why calls, attributes, and subscripts bind more tightly than arithmetic operators:

```python
x = f(a) + b.c
```

The calls and attributes are parsed before the addition expression is formed.
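Walking a shorter chain (a trimmed version of the example above, with the same hypothetical names) shows the nesting from the outside in: the outermost AST node is the last postfix operation applied.

```python
# The outermost node of a postfix chain is the last operation applied.
import ast

node = ast.parse('client.get(url).json()["items"]', mode="eval").body

print(type(node).__name__)             # Subscript  -> ["items"]
print(type(node.value).__name__)       # Call       -> .json()
print(type(node.value.func).__name__)  # Attribute  -> .json
```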

## 19.9 Assignment Parsing

Assignment syntax is more constrained than expression syntax.

Valid assignment targets include:

```python
x = 1
obj.attr = 1
items[0] = 1
a, b = pair
```

Invalid assignment targets include:

```python
1 = x
a + b = x
f() = x
```

The parser must recognize that the left side of assignment is not just any expression. It must be a valid target.

This also affects augmented assignment:

```python
x += 1
obj.value *= 2
items[i] -= 3
```

and annotated assignment:

```python
x: int = 1
name: str
```

CPython’s parser and AST builder must preserve the difference between an expression used for reading and a target used for writing.

At the AST level, names carry context:

```text
Name id="x", ctx=Load
Name id="x", ctx=Store
Name id="x", ctx=Del
```

Example:

```python
x = x + 1
```

The left `x` is `Store`. The right `x` is `Load`.
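The contexts are easy to inspect at Python level:

```python
# Inspect Load/Store contexts on the two occurrences of x.
import ast

tree = ast.parse("x = x + 1")
assign = tree.body[0]

target = assign.targets[0]       # the left-hand x
operand = assign.value.left      # the right-hand x

print(type(target.ctx).__name__)   # Store
print(type(operand.ctx).__name__)  # Load
```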

## 19.10 Ambiguity and Ordered Choice

PEG parsing uses ordered alternatives.

Consider syntax where multiple grammar alternatives begin similarly. The parser tries one alternative first. If it succeeds, that result is used. If it fails, later alternatives may be tried.

This gives grammar authors a controlled way to resolve ambiguity.

Simplified example:

```text
small_stmt:
    | assignment
    | expression
```

The assignment alternative must be considered before the plain expression alternative, because an assignment often starts with an expression-like target.

Input:

```python
x = 1
```

The prefix `x` could begin an expression, but the full statement is assignment. The grammar must be ordered and structured so that assignment wins.

PEG also supports a “cut” operator in CPython grammar notation. The language reference notes the `~` marker in the grammar as a cut that commits to the current alternative and fails the rule if the alternative then fails. ([Python documentation][1])

A cut is useful for performance and diagnostics. Once the parser has seen enough tokens to know which construct it is parsing, it can stop considering unrelated alternatives.

## 19.11 Lookahead

Parsers often need to inspect upcoming tokens before deciding what to parse.

Lookahead asks: “Does the next token or next token sequence match this pattern?”

Example:

```python
x: int
```

This could begin an annotated assignment.

Example:

```python
x + y
```

This is an expression statement.

Both start with `NAME`. The parser needs more context than the first token.

Lookahead also helps reject syntax early and choose better error messages.

A simplified lookahead model:

```text
if next token is NAME and following token is ":":
    parse annotated assignment
else:
    parse expression statement
```

Real CPython grammar uses richer mechanisms, but the mental model is the same: some decisions require peeking ahead without consuming tokens permanently.

## 19.12 Backtracking

PEG parsers can backtrack.

Backtracking means the parser tries a grammar path, fails, restores the previous token position, and tries another path.

Example model:

```text
try parse async function definition
    if it fails:
        restore token position
try parse normal function definition
    if it fails:
        restore token position
try parse simple statement
```

Backtracking makes grammar authoring more flexible than a strict single-token lookahead parser.

But unrestricted backtracking can be expensive. CPython’s parser uses grammar structure, cuts, memoization, and targeted rules to keep parsing efficient.

## 19.13 AST Construction

Parsing in CPython does not merely accept or reject syntax. It builds an AST.

Example source:

```python
x = 1 + 2
```

AST shape:

```text
Module
    Assign
        targets:
            Name id="x", ctx=Store
        value:
            BinOp
                left: Constant 1
                op: Add
                right: Constant 2
```

At Python level:

```python
import ast

tree = ast.parse("x = 1 + 2\n")
print(ast.dump(tree, indent=4))
```

This produces a structured tree.

The AST is still not bytecode. It is a syntax tree with semantic annotations such as load/store context and source locations.

## 19.14 Source Locations

AST nodes carry source location information.

This includes line numbers and column offsets. Modern CPython also tracks end positions for many nodes.

Example:

```python
import ast

tree = ast.parse("x = 1 + 2\n")
node = tree.body[0]

print(node.lineno)
print(node.col_offset)
print(node.end_lineno)
print(node.end_col_offset)
```

Source locations support:

```text
tracebacks
syntax errors
debuggers
coverage tools
profilers
AST tools
editor integrations
```

Without accurate source positions, CPython could still execute code, but diagnostics and tooling quality would be much worse.

## 19.15 Syntax Errors

When parsing fails, CPython raises `SyntaxError` or a subclass such as `IndentationError`.

A syntax error should report:

```text
filename
line number
column offset
source text
message
```

Example:

```python
if x
    pass
```

The parser sees that the `if` statement is missing a colon.

A useful error points near the place where parsing became impossible:

```text
SyntaxError: expected ':'
```

Parsing errors are not always simple. Because the parser may backtrack, it must remember the most useful failure location rather than only the final failed alternative.

Good syntax errors are a major part of parser engineering.
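The reporting fields listed above are exposed as attributes of the raised exception. A sketch, noting that the exact message wording varies across CPython versions:

```python
# Observe the fields carried by a SyntaxError (message wording
# varies by CPython version; recent versions say "expected ':'").
try:
    compile("if x\n    pass\n", "<input>", "exec")
except SyntaxError as exc:
    print(exc.msg)       # the message
    print(exc.filename)  # <input>
    print(exc.lineno)    # 1
    print(exc.offset)    # column offset (1-based)
    print(exc.text)      # the offending source line
```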

## 19.16 Invalid Rules and Better Diagnostics

CPython’s grammar includes rules used specifically for better error messages.

A parser could reject bad syntax generically:

```text
SyntaxError: invalid syntax
```

But CPython often tries to detect common mistakes:

```python
if x = 1:
    pass
```

This can produce a more helpful message than a generic parse failure.

Invalid grammar rules are not intended to accept programs. They are used to recognize wrong forms and raise better diagnostics.

The practical model:

```text
normal grammar accepts valid Python
invalid rules recognize common invalid Python
error machinery reports a targeted message
```

This keeps the parser strict while improving user experience.
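The effect is observable by compiling the bad `if` from above. The exact wording depends on the CPython version; recent versions suggest `==` or `:=` instead of the stray `=`:

```python
# The invalid-rule machinery produces a targeted message rather than
# a generic "invalid syntax" (wording varies by CPython version).
try:
    compile("if x = 1:\n    pass\n", "<input>", "exec")
except SyntaxError as exc:
    print(exc.msg)
```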

## 19.17 Parser Output Is Not Fully Validated Yet

A successful parse means the source matches Python grammar. It does not mean the program is fully valid in every later sense.

Some checks occur after parsing.

Example:

```python
return 1
```

At module level, this is syntactically shaped like a return statement, but it is invalid outside a function.

Another example:

```python
break
```

At module level, this is grammatically a break statement, but it is invalid outside a loop.

These constraints are often checked during AST validation, symbol table analysis, or compilation.

The phases are separate:

```text
parser:
    Can these tokens form a syntax tree?

later compiler checks:
    Is this syntax legal in this context?
    Which names are local, global, free, or cell variables?
    Can this construct be compiled here?
```

## 19.18 `ast.parse`

The public `ast.parse()` function exposes CPython’s parsing machinery at Python level.

Example:

```python
import ast

src = """
def add(a, b):
    return a + b
"""

tree = ast.parse(src)
print(ast.dump(tree, indent=4))
```

This is useful for:

```text
linters
formatters
static analyzers
code generation tools
security scanners
refactoring tools
documentation tools
```

But `ast.parse()` returns an AST, not tokens and not bytecode.

For bytecode, use `compile()` and `dis`:

```python
import dis

code = compile("x = 1 + 2\n", "<input>", "exec")
dis.dis(code)
```

The relation is:

```text
tokenize.tokenize()    source to tokens
ast.parse()            source to AST
compile()              source or AST to code object
dis.dis()              code object to bytecode display
```

## 19.19 Parse Modes

Python source can be parsed in different modes.

Common modes:

| Mode     | Meaning                                  |
| -------- | ---------------------------------------- |
| `exec`   | Parse a module or sequence of statements |
| `eval`   | Parse a single expression                |
| `single` | Parse interactive input                  |

Example:

```python
compile("x = 1", "<input>", "exec")
compile("1 + 2", "<input>", "eval")
compile("print(1)", "<input>", "single")
```

`eval` mode rejects statements:

```python
compile("x = 1", "<input>", "eval")
```

That fails because assignment statements are not expressions.

The parse mode determines the grammar start symbol.
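The different start symbols show up as different AST root node types, and the rejection of statements in `eval` mode surfaces as a `SyntaxError`:

```python
# Each parse mode maps to a grammar start symbol, visible in the
# AST root node type.
import ast

print(type(ast.parse("x = 1", mode="exec")).__name__)       # Module
print(type(ast.parse("1 + 2", mode="eval")).__name__)       # Expression
print(type(ast.parse("print(1)", mode="single")).__name__)  # Interactive

try:
    ast.parse("x = 1", mode="eval")   # a statement in expression mode
except SyntaxError as exc:
    print("rejected:", exc.msg)
```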

## 19.20 F-Strings and the Parser

Modern CPython parses f-string expressions through the parser machinery.

Before Python 3.12, f-string expression parsing had more special-case behavior. Python 3.12 changed f-strings through PEP 701, bringing them into the grammar more fully. The Python 3.12 documentation notes that f-strings are parsed with the PEG parser, allowing more precise error messages, and that tokenize performance improved due to these changes. ([Python documentation][2])

Example:

```python
name = "Ada"
text = f"hello {name.upper()}"
```

The embedded expression:

```python
name.upper()
```

is parsed as Python expression syntax.

This means f-strings are not just string interpolation at runtime. They contain nested syntax that must be parsed and represented structurally.
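The nested structure is visible in the AST: the f-string becomes a `JoinedStr` whose parts are literal `Constant` pieces and `FormattedValue` nodes wrapping full expression subtrees.

```python
# An f-string parses to JoinedStr with nested expression syntax.
import ast

tree = ast.parse('f"hello {name.upper()}"', mode="eval")
joined = tree.body

print(type(joined).__name__)          # JoinedStr
for part in joined.values:
    print(type(part).__name__)        # Constant, then FormattedValue

embedded = joined.values[1].value     # the name.upper() expression
print(type(embedded).__name__)        # Call
```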

## 19.21 Parser Files in CPython

The important parser-related source areas include:

```text
Grammar/python.gram
Parser/
Python/ast.c
Include/internal/
Lib/ast.py
Lib/test/test_grammar.py
Lib/test/test_syntax.py
```

The exact file boundaries can change, but the conceptual split is stable:

| Area             | Role                                        |
| ---------------- | ------------------------------------------- |
| Grammar          | Defines Python syntax                       |
| Parser generator | Produces parser implementation from grammar |
| Parser runtime   | Consumes tokens and applies grammar rules   |
| AST construction | Builds AST nodes                            |
| AST validation   | Checks structural correctness               |
| Tests            | Protect syntax and diagnostics              |

When reading CPython, start with the grammar. Then follow how a grammar rule creates AST nodes.

## 19.22 Example: Parsing a Function

Source:

```python
def square(x):
    return x * x
```

Token-level view:

```text
NAME      "def"
NAME      "square"
LPAR      "("
NAME      "x"
RPAR      ")"
COLON     ":"
NEWLINE
INDENT
NAME      "return"
NAME      "x"
STAR      "*"
NAME      "x"
NEWLINE
DEDENT
ENDMARKER
```

AST-level view:

```text
Module
    FunctionDef
        name: "square"
        args:
            arguments
                arg: "x"
        body:
            Return
                BinOp
                    left: Name "x", Load
                    op: Mult
                    right: Name "x", Load
```

The parser creates this hierarchy because the grammar says:

```text
a module contains statements
a function definition is a statement
a function body is a block
a return statement can contain an expression
a multiplication expression contains two operands
a name expression loads a value
```

Each level gives meaning to the flat token stream.

## 19.23 Example: Parsing an `if` Statement

Source:

```python
if score >= 60:
    result = "pass"
else:
    result = "fail"
```

AST shape:

```text
Module
    If
        test:
            Compare
                left: Name "score"
                op: GtE
                comparator: Constant 60
        body:
            Assign
                target: Name "result", Store
                value: Constant "pass"
        orelse:
            Assign
                target: Name "result", Store
                value: Constant "fail"
```

The parser must attach `else` to the correct `if`.

Indentation tokens help here. The `else` appears after the `DEDENT` ending the first block, but at the same indentation level as the `if`.

This is one place where Python’s layout-sensitive syntax interacts directly with parser structure.

## 19.24 Example: Parsing a Comprehension

Source:

```python
squares = [x * x for x in values if x > 0]
```

AST shape:

```text
Assign
    target: Name "squares", Store
    value:
        ListComp
            element:
                BinOp
                    Name "x"
                    Mult
                    Name "x"
            generators:
                comprehension
                    target: Name "x", Store
                    iter: Name "values", Load
                    ifs:
                        Compare
                            Name "x"
                            Gt
                            Constant 0
```

Comprehensions are expression syntax, but they contain assignment-like targets and nested clauses.

The parser must distinguish:

```python
[x * x for x in values]
```

from a list literal:

```python
[x * x, y]
```

Both begin with `[` and an expression. The token sequence after the first expression determines which structure the parser builds.
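The two resulting structures are distinct AST node types:

```python
# Same opening tokens, different AST nodes.
import ast

comp = ast.parse("[x * x for x in values]", mode="eval").body
literal = ast.parse("[x * x, y]", mode="eval").body

print(type(comp).__name__)     # ListComp
print(type(literal).__name__)  # List
```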

## 19.25 Parser Boundaries

The parser does not execute code.

It does not call functions:

```python
f()
```

It only builds a call node.

It does not import modules:

```python
import os
```

It only builds an import node.

It does not resolve names:

```python
x + y
```

It only builds name and binary operation nodes.

It does not evaluate constants beyond what the AST layer represents.

The parser’s job is structural. Later phases assign scope, generate bytecode, and execute operations.
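This boundary is easy to demonstrate: parsing an import of a module that does not exist succeeds, because only an `Import` node is built. The module name below is made up for illustration.

```python
# Parsing builds an Import node; it never imports anything.
import ast
import sys

tree = ast.parse("import module_that_does_not_exist_xyz")
print(type(tree.body[0]).__name__)                      # Import
print("module_that_does_not_exist_xyz" in sys.modules)  # False
```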

## 19.26 Minimal Mental Model

Use this model:

```text
The tokenizer emits a flat stream of tokens.
The parser matches that stream against Python grammar.
CPython’s modern parser is PEG-based.
Grammar alternatives are ordered.
The parser builds an AST.
The AST records statements, expressions, contexts, and source locations.
Syntax errors are raised when the token stream cannot form valid grammar.
Some validity checks happen after parsing.
```

Parsing is where source text first becomes tree-shaped. Everything after this stage works with structure rather than raw token order.

[1]: https://docs.python.org/3/reference/grammar.html "10. Full Grammar specification"
[2]: https://docs.python.org/3/whatsnew/3.12.html "What's New In Python 3.12"

