18. Tokenization

The tokenize module, the C tokenizer in Parser/tokenizer.c, and how indentation becomes INDENT/DEDENT tokens.

Tokenization is the first structural stage in CPython’s compilation pipeline.

It receives Python source text and produces a stream of tokens. The parser consumes that token stream and builds syntax structure from it.

At this stage, CPython does not yet know whether a program is meaningful. It only recognizes lexical units: names, numbers, strings, operators, newlines, indentation, and end-of-file markers.

The tokenizer turns this:

def add(a, b):
    return a + b

into a stream shaped like this:

NAME        "def"
NAME        "add"
LPAR        "("
NAME        "a"
COMMA       ","
NAME        "b"
RPAR        ")"
COLON       ":"
NEWLINE     "\n"
INDENT      "    "
NAME        "return"
NAME        "a"
PLUS        "+"
NAME        "b"
NEWLINE     "\n"
DEDENT      ""
ENDMARKER   ""

The exact token names and parser interface vary across CPython versions, but the core idea is stable. The Python standard library exposes a Python-level tokenizer through tokenize, while CPython’s parser uses its own C tokenizer internally. The public tokenize module also returns comments and an initial encoding token, which makes it suitable for tools such as formatters and syntax highlighters. (Python documentation)

18.1 Position in the Compilation Pipeline

The full source-to-execution path is:

bytes from file or string
encoding detection
decoded source text
tokenizer
token stream
parser
abstract syntax tree
symbol table
code object
bytecode execution

Tokenization sits between raw source input and grammar parsing.

The tokenizer answers questions such as:

Where does this logical line end?
Is this identifier a name?
Is this numeric literal well-formed?
Is this string literal closed?
Did indentation increase or decrease?
Is this character part of an operator?
Has the source reached end-of-file?

The parser answers different questions:

Is this a valid function definition?
Is this expression allowed here?
Does this statement match a grammar rule?
Does this sequence form a valid pattern match?
How should these tokens be grouped into an AST?

The tokenizer does not build an AST. It only produces a sequence of lexical events.

18.2 Source Input and Encoding

Python source usually begins as bytes.

Before tokenization can proceed, CPython must determine how to decode those bytes into text. Python source files are UTF-8 by default, but a file may declare another encoding near the top.

Typical encoding declaration:

# -*- coding: latin-1 -*-

or:

# coding: utf-8

The tokenizer must handle this early because it cannot reliably classify source characters until the source is decoded.

At the Python level, tokenize.tokenize() begins by returning an ENCODING token. The token documentation notes that ENCODING exists for the Python tokenize module and is not used by CPython’s C tokenizer. (Python documentation)

The practical model is:

read first source lines
detect encoding declaration if present
decode source bytes
normalize line handling
begin lexical scanning

This is why tokenization is not just a loop over characters. It also owns the boundary between external source bytes and internal source text.
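
At the Python level, this boundary step is exposed as tokenize.detect_encoding, which reads at most the first two lines looking for a BOM or a coding declaration. A minimal sketch (the source bytes are made up for illustration):

import io
import tokenize

# Hypothetical source bytes carrying an explicit coding declaration.
src = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"

# detect_encoding() reads at most two lines and returns the encoding name
# plus the raw byte lines it consumed while looking for the declaration.
encoding, consumed = tokenize.detect_encoding(io.BytesIO(src).readline)
print(encoding)     # the detected (normalized) encoding name
print(consumed)

# Only after this step can the bytes be decoded and scanned as text.
text = src.decode(encoding)
print(text)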

18.3 Physical Lines and Logical Lines

Python source has physical lines and logical lines.

A physical line is one line in the source file.

A logical line is one complete Python statement or expression unit as seen by the parser.

Usually they are the same:

x = 1
y = 2

Here, each physical line is also a logical line.

But Python allows explicit line joining with backslash:

x = 1 + \
    2 + \
    3

This is one logical line spread across three physical lines.

Python also allows implicit line joining inside parentheses, brackets, and braces:

values = [
    1,
    2,
    3,
]

Inside grouping delimiters, newlines do not end the logical statement. The tokenizer tracks nesting depth so it can distinguish significant newlines from non-significant newlines.

Conceptually:

paren_level = 0

when "(" or "[" or "{" appears:
    paren_level += 1

when ")" or "]" or "}" appears:
    paren_level -= 1

when newline appears:
    if paren_level == 0:
        emit NEWLINE
    else:
        ignore as statement terminator

This rule is essential for Python’s readable multiline syntax.

18.4 NEWLINE and NL

Python-level tokenization distinguishes between logical-line newlines and non-terminating newlines.

At the public tokenize level:

Token      Meaning
NEWLINE    Ends a logical line
NL         Newline that does not end a logical line

Example:

x = (
    1 +
    2
)
y = 3

The newlines inside the parentheses are not statement terminators. The parser should not treat 1 + as a complete statement. Those line breaks exist for layout, not grammar.

In a tool-facing tokenizer, they appear as NL. In the parser-facing model, they are ignored or treated differently from real logical newlines.

This distinction matters for formatters and linters. A formatter may care about every physical newline. The parser only needs logical structure.
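
The distinction is easy to observe with the public tokenize module. A small sketch that prints only the newline-shaped tokens for the example above:

import io
import tokenize
from token import tok_name

src = b"x = (\n    1 +\n    2\n)\ny = 3\n"

# Line breaks inside the parentheses come back as NL; only the breaks that
# terminate a logical line come back as NEWLINE.
for tok in tokenize.tokenize(io.BytesIO(src).readline):
    if tok.type in (tokenize.NEWLINE, tokenize.NL):
        print(tok_name[tok.type], "at line", tok.start[0])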

18.5 Indentation as Tokens

Python uses indentation as syntax.

That means the tokenizer must turn leading whitespace into tokens.

Example:

if ready:
    run()
    log()
finish()

The parser cannot understand this source using only names and punctuation. It needs explicit block boundaries.

The tokenizer emits:

NAME      "if"
NAME      "ready"
COLON     ":"
NEWLINE   "\n"
INDENT    "    "
NAME      "run"
LPAR      "("
RPAR      ")"
NEWLINE   "\n"
NAME      "log"
LPAR      "("
RPAR      ")"
NEWLINE   "\n"
DEDENT    ""
NAME      "finish"
LPAR      "("
RPAR      ")"
NEWLINE   "\n"
ENDMARKER ""

Indentation creates a virtual block start. Dedentation creates a virtual block end.

CPython maintains an indentation stack. At the start of a logical line, the tokenizer measures leading whitespace and compares it with the current indentation level.

Simplified model:

indent_stack = [0]

at beginning of logical line:
    col = indentation_column()

    if col > indent_stack[-1]:
        push col
        emit INDENT

    else if col == indent_stack[-1]:
        emit no indentation token

    else:
        while col < indent_stack[-1]:
            pop
            emit DEDENT

        if col != indent_stack[-1]:
            report indentation error

This stack discipline explains why inconsistent indentation is a lexical error before normal parsing can continue.
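
The effect of this stack is visible from Python as well. A brief sketch that filters the public token stream of the earlier example down to its indentation events:

import io
import tokenize
from token import tok_name

src = b"if ready:\n    run()\n    log()\nfinish()\n"

# Only one INDENT is emitted for the block, even though two lines are
# indented, and one DEDENT closes it when finish() returns to column 0.
for tok in tokenize.tokenize(io.BytesIO(src).readline):
    if tok.type in (tokenize.INDENT, tokenize.DEDENT):
        print(tok_name[tok.type], repr(tok.string), "at line", tok.start[0])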

18.6 Tabs, Spaces, and Indentation Columns

Indentation is measured in columns, not raw characters.

Spaces advance by one column. Tabs advance to the next tab stop, which CPython places every eight columns. Python’s tab handling exists for compatibility, but mixing tabs and spaces can produce ambiguous indentation and errors.

Example:

if x:
\tprint("tab")
    print("spaces")

The visual alignment may depend on editor settings. CPython cannot trust how a human editor displays this. It computes indentation using the language’s tab rules and raises errors when indentation is inconsistent.

The important internal point is that indentation is not stored as “number of leading characters.” It is converted into indentation levels. Those levels are compared against the indentation stack.

18.7 Blank Lines and Comment-Only Lines

Blank lines do not usually produce parser-significant tokens.

Example:

x = 1

y = 2

The empty line does not terminate a block or create a statement.

Comment-only lines behave similarly for the parser:

x = 1
# comment
y = 2

The public tokenize module returns comments because tools need them. CPython’s parser does not treat comments as syntax.

This difference is important:

Consumer              Needs comments?   Reason
Parser                No                Comments do not affect grammar
Formatter             Yes               Comments must be preserved
Syntax highlighter    Yes               Comments need styling
Linter                Yes               Comments may contain directives
Type checker          Sometimes         Comments may contain type comments

The public tokenizer is a tool API. The C tokenizer is part of the compiler front end.
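
A quick sketch of the tool-facing view of the comment-only example above: tokenize reports the comment as a COMMENT token followed by an NL, neither of which the compiler’s grammar needs:

import io
import tokenize
from token import tok_name

src = b"x = 1\n# comment\ny = 2\n"

for tok in tokenize.tokenize(io.BytesIO(src).readline):
    if tok.type in (tokenize.COMMENT, tokenize.NL):
        print(tok_name[tok.type], repr(tok.string), "at line", tok.start[0])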

18.8 Names, Keywords, and Soft Keywords

Identifiers are tokenized as names.

Example:

total = price + tax

The tokenizer sees:

NAME "total"
EQUAL "="
NAME "price"
PLUS "+"
NAME "tax"

Traditional Python keywords include words such as:

def
class
if
else
while
for
try
except
return
yield
import
from
with
lambda

At the lexical level, these are name-shaped sequences. The tokenizer or parser can classify them according to grammar needs.

Modern Python also has soft keywords. A soft keyword acts like a keyword only in specific grammar positions.

Examples include words used by pattern matching:

match value:
    case 0:
        pass

match and case can still be used as ordinary names in other contexts where the grammar permits it.

This is one reason tokenization and parsing must cooperate. A tokenizer that permanently converted every occurrence of match into a hard keyword token would reject valid code in contexts where match is only a name.

The practical rule:

hard keyword: reserved everywhere
soft keyword: special only in selected grammar positions
name: ordinary identifier
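
The standard library’s keyword module reflects the same three-way split. A short check (issoftkeyword and softkwlist are available in recent Python versions):

import keyword

print(keyword.iskeyword("def"))        # True: hard keyword, reserved everywhere
print(keyword.iskeyword("match"))      # False: usable as an ordinary name
print(keyword.issoftkeyword("match"))  # True: keyword only in selected positions
print(keyword.softkwlist)              # e.g. ['_', 'case', 'match', 'type']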

18.9 Unicode Identifiers

Python identifiers may contain many Unicode characters.

Example:

π = 3.14159
面积 = 42

The tokenizer must recognize identifier start and continuation characters according to Python’s identifier rules. This gives Python source code broad Unicode support.

But identifiers are still normalized and checked according to language rules. Not every Unicode character is legal in a name, and some visually similar characters can be distinct.

From an internals perspective, identifier handling requires:

Unicode-aware character classification
identifier start validation
identifier continuation validation
normalization rules
error reporting for invalid characters

This makes Python tokenization more complex than a tokenizer for an ASCII-only language.
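
A small illustration of those rules from Python: str.isidentifier applies the identifier grammar, and NFKC normalization means visually different spellings can collapse to the same name:

import unicodedata

print("π".isidentifier())      # True: valid identifier start character
print("面积".isidentifier())   # True
print("2x".isidentifier())     # False: a digit cannot start an identifier

# Identifiers are normalized to NFKC while parsing, so the "ﬁ" ligature
# spelling below refers to the same identifier as the plain ASCII spelling.
print(unicodedata.normalize("NFKC", "ﬁle") == "file")   # True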

18.10 Numeric Literals

The tokenizer recognizes numeric literals before the parser builds expressions.

Examples:

123
0b1010
0o755
0xff
1_000_000
3.14
10.
.5
1e9
1.2e-3
3j

These become number tokens.

The tokenizer must validate lexical form:

base prefixes
digits allowed in each base
underscore placement
decimal points
exponents
imaginary suffix

Some invalid numbers fail during tokenization:

0b102
1__2

The tokenizer does not evaluate arbitrary arithmetic. It only recognizes the literal token. Later compilation stages convert the token text into the corresponding Python object.

Example:

x = 1 + 2

Tokenization sees:

NAME     "x"
EQUAL    "="
NUMBER   "1"
PLUS     "+"
NUMBER   "2"

The fact that 1 + 2 can be folded into 3 belongs to later compiler optimization, not tokenization.
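
A short way to see where that boundary lies: the compiled code object already contains the folded constant, even though the tokenizer only ever saw two separate NUMBER tokens:

import dis

code = compile("x = 1 + 2", "<example>", "exec")

# The folding happened in a later compiler pass: the constants of the code
# object typically already contain 3, and no addition instruction remains
# in the bytecode.
print(code.co_consts)
dis.dis(code)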

18.11 String Literals

String tokenization is more complicated than numeric tokenization.

Python supports:

"hello"
'hello'
"""hello"""
'''hello'''
r"\n"
b"bytes"
f"value={x}"
fr"path={name}\n"

The tokenizer must identify:

string prefixes
quote style
single-line or triple-quoted form
raw strings
bytes strings
f-strings
escape sequences
line continuation rules
string termination

A normal string token is recognized as one lexical unit:

x = "hello"

Token stream:

NAME    "x"
EQUAL   "="
STRING  "\"hello\""

Triple-quoted strings can span physical lines:

text = """
line 1
line 2
"""

The tokenizer must keep scanning until it finds the matching triple quote.

Unterminated strings are tokenizer errors:

x = "missing end

The parser cannot recover meaningful grammar from an unterminated string because the tokenizer cannot produce a valid token stream.
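
A brief sketch of how these failures surface from Python. compile() reports a SyntaxError, and the public tokenizer gives up on an unterminated triple-quoted string with an error of its own (the exact exception has shifted between TokenError and SyntaxError across versions, so the sketch catches both):

import io
import tokenize

try:
    compile('x = "missing end\n', "<example>", "exec")
except SyntaxError as exc:
    print("compile:", exc.msg)

src = b'text = """never closed\n'
try:
    list(tokenize.tokenize(io.BytesIO(src).readline))
except (tokenize.TokenError, SyntaxError) as exc:
    print("tokenize:", type(exc).__name__, exc)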

18.12 F-Strings

F-strings are special because they contain both literal string content and embedded Python expressions.

Example:

name = "Ada"
text = f"hello {name.upper()}"

Inside the string, this part is literal text:

hello 

This part is Python expression syntax:

name.upper()

The tokenizer and parser must cooperate to handle this nested structure.

Conceptually:

enter f-string mode
scan literal characters
when "{" starts expression:
    tokenize embedded Python expression
    parse embedded expression
return to f-string literal scanning
finish at closing quote

Nested expression handling makes f-strings much richer than ordinary string literals. They are not just string tokens with later text replacement. They contain syntax that must be parsed into expression nodes.
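
On sufficiently recent CPython versions (3.12 and later, PEP 701), this cooperation is visible even in the public token stream: the f-string is split into FSTRING_START, FSTRING_MIDDLE, and FSTRING_END tokens, with ordinary tokens for the embedded expression in between. A sketch, assuming such a version (older versions return one opaque STRING token instead):

import io
import tokenize
from token import tok_name

src = b'text = f"hello {name.upper()}"\n'

for tok in tokenize.tokenize(io.BytesIO(src).readline):
    print(tok_name[tok.type], repr(tok.string))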

18.13 Operators and Delimiters

Python operators and delimiters include single-character and multi-character forms.

Examples:

+   -   *   /   //   %   **
=   ==  !=  <   <=   >   >=
:=  ->  @   @=
(   )   [   ]   {   }
,   :   .   ;   ...

The tokenizer usually applies longest-match behavior.

For example, when reading **=, it should produce one power-assignment operator token rather than *, *, and =.

Simplified logic:

if next characters form "**=":
    emit DOUBLESTAREQUAL
else if next characters form "**":
    emit DOUBLESTAR
else if next character is "*":
    emit STAR

This rule is common in tokenizers. It keeps the parser from having to reconstruct multi-character operators from smaller pieces.
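
The behavior is easy to confirm with the public tokenizer, which also records the exact operator type on each OP token:

import io
import tokenize
from token import tok_name

src = b"x **= 2\n"

# The three characters "**=" arrive as one OP token whose exact type is
# DOUBLESTAREQUAL, not as separate "*", "*", and "=" tokens.
for tok in tokenize.tokenize(io.BytesIO(src).readline):
    if tok.type == tokenize.OP:
        print(repr(tok.string), tok_name[tok.exact_type])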

18.14 Error Tokens and Lexical Errors

Some errors appear before parsing.

Examples:

x = "unterminated
if x:
  a = 1
 b = 2
x = 0b123

These are lexical or indentation errors.

The tokenizer must report enough information for useful diagnostics:

filename
line number
column offset
source line
error type
error message

Common tokenization-stage errors include:

Error               Cause
SyntaxError         Invalid lexical structure or token sequence
IndentationError    Invalid indentation level
TabError            Ambiguous indentation from tabs and spaces
TokenError          Public tokenizer error for incomplete input

Not every SyntaxError originates in tokenization. Many come from parsing. But the tokenizer owns errors that prevent a valid token stream from existing.
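
A small sketch that triggers the indentation-related errors through compile(); both IndentationError and TabError are subclasses of SyntaxError:

bad_dedent = "if x:\n  a = 1\n b = 2\n"
mixed_tabs = "if x:\n\tprint('tab')\n        print('spaces')\n"

for src in (bad_dedent, mixed_tabs):
    try:
        compile(src, "<example>", "exec")
    except SyntaxError as exc:
        # IndentationError for the bad dedent, TabError for the tab/space mix.
        print(type(exc).__name__, exc.msg, "at line", exc.lineno)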

18.15 End of File and Synthetic Dedents

At end of file, CPython must close any open indentation blocks.

Example:

if x:
    if y:
        run()

The source ends while two indentation levels are still active. The tokenizer emits synthetic DEDENT tokens before ENDMARKER.

Conceptually:

NAME      "if"
NAME      "x"
COLON     ":"
NEWLINE
INDENT
NAME      "if"
NAME      "y"
COLON     ":"
NEWLINE
INDENT
NAME      "run"
LPAR
RPAR
NEWLINE
DEDENT
DEDENT
ENDMARKER

This lets the parser see block endings even when there are no explicit closing braces.

The tokenizer therefore creates tokens that have no direct character in the source file. INDENT, DEDENT, and ENDMARKER are structural tokens.

18.16 Tokenizer State

A tokenizer is stateful.

It must remember:

current input pointer
current line
current column
current indentation stack
current nesting level
whether scanning begins a line
whether inside a string
whether inside an f-string expression
whether an encoding was detected
whether interactive mode is active
pending INDENT or DEDENT tokens
error state

A stateless scanner would be insufficient for Python because meaning depends on layout and context.

Example:

x = [
    1,
    2,
]

The newline after 1, appears inside brackets. It should not become a logical NEWLINE.

Example:

if x:
    y = 1
z = 2

The leading whitespace before z causes a DEDENT.

Those decisions require remembered state.

18.17 Interactive Tokenization

Interactive input has special cases.

In a REPL, CPython often needs to decide whether input is complete.

Example:

>>> if x:
...

This is incomplete because a block body is expected.

Example:

>>> x = (1 +
...

This is incomplete because the parenthesized expression remains open.

The tokenizer and parser cooperate to decide whether to request another line or raise an error. Interactive mode therefore differs from file mode. End-of-input in a file means true EOF. End-of-input in the REPL may mean “ask for more text.”
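
The standard library exposes this completeness check as codeop.compile_command, which the interactive loop in the code module builds on. It returns a code object for complete input, None for input that could still be continued, and raises SyntaxError for input that can never become valid:

import codeop

print(codeop.compile_command("x = 1"))     # a code object: the statement is complete
print(codeop.compile_command("if x:"))     # None: a block body is still expected
print(codeop.compile_command("x = (1 +"))  # None: the parenthesized expression is open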

18.18 Public tokenize Module

The standard library exposes tokenization through tokenize.

Example:

from io import BytesIO
import tokenize

src = b"x = 1 + 2\n"

for tok in tokenize.tokenize(BytesIO(src).readline):
    print(tok)

Output is shaped like:

TokenInfo(type=ENCODING, string='utf-8', ...)
TokenInfo(type=NAME, string='x', ...)
TokenInfo(type=OP, string='=', ...)
TokenInfo(type=NUMBER, string='1', ...)
TokenInfo(type=OP, string='+', ...)
TokenInfo(type=NUMBER, string='2', ...)
TokenInfo(type=NEWLINE, string='\n', ...)
TokenInfo(type=ENDMARKER, string='', ...)

The public tokenizer is useful for:

formatters
linters
code generators
syntax highlighters
refactoring tools
documentation tools
source-to-source transforms

The tokenize documentation describes it as a lexical scanner for Python source and notes that it returns comments as tokens, which makes it useful for pretty-printers and colorizers. (Python documentation)

18.19 C Tokenizer vs Python Tokenizer

There are two related tokenizer concepts in CPython:

Component          Location                             Purpose
C tokenizer        CPython parser/compiler internals    Feed the parser during compilation
Lib/tokenize.py    Standard library                     Expose tokenization to Python tools

They are not identical interfaces.

The C tokenizer is optimized for CPython’s compiler pipeline. It produces what the parser needs.

The Python tokenizer is a public tool interface. It preserves comments, exposes encoding, returns rich TokenInfo objects, and is designed for external consumers.

This distinction explains why token streams from tokenize may contain information the parser ignores.

18.20 Tokenization Example in Detail

Consider this source:

def area(r):
    pi = 3.14159
    return pi * r * r

A simplified token stream:

NAME       "def"
NAME       "area"
LPAR       "("
NAME       "r"
RPAR       ")"
COLON      ":"
NEWLINE    "\n"
INDENT     "    "
NAME       "pi"
EQUAL      "="
NUMBER     "3.14159"
NEWLINE    "\n"
NAME       "return"
NAME       "pi"
STAR       "*"
NAME       "r"
STAR       "*"
NAME       "r"
NEWLINE    "\n"
DEDENT     ""
ENDMARKER  ""

Important points:

  1. def is lexically name-shaped but grammatically acts as a keyword.
  2. The function body begins because indentation increases after NEWLINE.
  3. 3.14159 is a single number token.
  4. return pi * r * r is one logical line.
  5. The function body ends through a synthetic DEDENT.
  6. The file ends through ENDMARKER.

The parser receives this stream and matches it against grammar rules for function definitions, suites, assignments, return statements, and expressions.

18.21 Tokenization Does Not Understand Full Semantics

The tokenizer does not know that this name is undefined:

print(missing_name)

It does not know that this call will fail:

1()

It does not know whether this import exists:

import does_not_exist

It only emits tokens.

Semantic checks happen later, often at runtime.

Tokenization is intentionally shallow. It recognizes lexical form, not program meaning.

18.22 Why Tokenization Matters

Tokenization seems small, but it shapes the whole language.

It defines:

how indentation becomes syntax
how source bytes become characters
how comments are ignored or preserved
how strings are delimited
how f-strings embed expressions
how operators are recognized
how logical lines are formed
how parser input is structured

For CPython contributors, tokenizer bugs can affect syntax, diagnostics, tools, compatibility, and security. A small lexical change can alter how every Python file is parsed.

For tooling authors, tokenization is often the best layer to work at. It preserves source-level information that the AST discards, including comments, exact spacing, physical lines, and operator spelling.

18.23 Minimal Mental Model

Use this model:

The tokenizer reads decoded Python source.
It emits lexical tokens.
It tracks indentation, nesting, strings, and line boundaries.
It inserts structural tokens such as INDENT, DEDENT, and ENDMARKER.
It reports lexical errors before parsing.
The parser consumes tokens and builds syntax structure.

That is the bridge from raw source text to grammar.