Tokenize module internals, the C tokenizer in Parser/tokenizer.c, and how indentation becomes INDENT/DEDENT tokens.
Tokenization is the first structural stage in CPython’s compilation pipeline.
It receives Python source text and produces a stream of tokens. The parser consumes that token stream and builds syntax structure from it.
At this stage, CPython does not yet know whether a program is meaningful. It only recognizes lexical units: names, numbers, strings, operators, newlines, indentation, and end-of-file markers.
The tokenizer turns this:
def add(a, b):
    return a + b
into a stream shaped like this:
NAME "def"
NAME "add"
LPAR "("
NAME "a"
COMMA ","
NAME "b"
RPAR ")"
COLON ":"
NEWLINE "\n"
INDENT " "
NAME "return"
NAME "a"
PLUS "+"
NAME "b"
NEWLINE "\n"
DEDENT ""
ENDMARKER ""
The exact token names and parser interface vary across CPython versions, but the core idea is stable. The Python standard library exposes a Python-level tokenizer through tokenize, while CPython’s parser uses its own C tokenizer internally. The public tokenize module also returns comments and an initial encoding token, which makes it suitable for tools such as formatters and syntax highlighters. (Python documentation)
18.1 Position in the Compilation Pipeline
The full source-to-execution path is:
bytes from file or string
↓
encoding detection
↓
decoded source text
↓
tokenizer
↓
token stream
↓
parser
↓
abstract syntax tree
↓
symbol table
↓
code object
↓
bytecode execution
Tokenization sits between raw source input and grammar parsing.
The tokenizer answers questions such as:
Where does this logical line end?
Is this identifier a name?
Is this numeric literal well-formed?
Is this string literal closed?
Did indentation increase or decrease?
Is this character part of an operator?
Has the source reached end-of-file?
The parser answers different questions:
Is this a valid function definition?
Is this expression allowed here?
Does this statement match a grammar rule?
Does this sequence form a valid pattern match?
How should these tokens be grouped into an AST?
The tokenizer does not build an AST. It only produces a sequence of lexical events.
18.2 Source Input and Encoding
Python source usually begins as bytes.
Before tokenization can proceed, CPython must determine how to decode those bytes into text. Python source files are UTF-8 by default, but a file may declare another encoding near the top.
Typical encoding declaration:
# -*- coding: latin-1 -*-
or:
# coding: utf-8
The tokenizer must handle this early because it cannot reliably classify source characters until the source is decoded.
At the Python level, tokenize.tokenize() begins by returning an ENCODING token. The token documentation notes that this token type exists for the tokenize module and is not used by CPython’s C tokenizer. (Python documentation)
The practical model is:
read first source lines
detect encoding declaration if present
decode source bytes
normalize line handling
begin lexical scanning
This is why tokenization is not just a loop over characters. It also owns the boundary between external source bytes and internal source text.
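This step is directly observable at the tool level. The sketch below uses tokenize.detect_encoding, which reads at most the first two lines, looks for a coding declaration or a UTF-8 BOM, and returns the encoding together with the lines it consumed; note that the module normalizes the latin-1 spelling.

```python
import io
import tokenize

source = b"# -*- coding: latin-1 -*-\nx = '\xe9'\n"

# detect_encoding() inspects at most the first two lines for a coding cookie.
encoding, consumed_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)              # typically 'iso-8859-1' (normalized from latin-1)

# Only after this step can the bytes be decoded into the text the scanner reads.
text = source.decode(encoding)
print(text.splitlines()[1])  # x = 'é'
```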
18.3 Physical Lines and Logical Lines
Python source has physical lines and logical lines.
A physical line is one line in the source file.
A logical line is one complete Python statement or expression unit as seen by the parser.
Usually they are the same:
x = 1
y = 2
Here, each physical line is also a logical line.
But Python allows explicit line joining with backslash:
x = 1 + \
    2 + \
    3
This is one logical line spread across three physical lines.
Python also allows implicit line joining inside parentheses, brackets, and braces:
values = [
    1,
    2,
    3,
]
Inside grouping delimiters, newlines do not end the logical statement. The tokenizer tracks nesting depth so it can distinguish significant newlines from non-significant newlines.
Conceptually:
paren_level = 0
when "(" or "[" or "{" appears:
    paren_level += 1
when ")" or "]" or "}" appears:
    paren_level -= 1
when newline appears:
    if paren_level == 0:
        emit NEWLINE
    else:
        ignore as statement terminator
This rule is essential for Python’s readable multiline syntax.
18.4 NEWLINE and NL
Python-level tokenization distinguishes between logical-line newlines and non-terminating newlines.
At the public tokenize level:
| Token | Meaning |
|---|---|
| NEWLINE | Ends a logical line |
| NL | Newline that does not end a logical line |
Example:
x = (
    1 +
    2
)
y = 3
The newlines inside the parentheses are not statement terminators. The parser should not treat 1 + as a complete statement. Those line breaks exist for layout, not grammar.
In a tool-facing tokenizer, they appear as NL. In the parser-facing model, they are ignored or treated differently from real logical newlines.
This distinction matters for formatters and linters. A formatter may care about every physical newline. The parser only needs logical structure.
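The distinction is easy to observe with the public tokenizer. In this small sketch, the newlines inside the parentheses come back as NL and only the line breaks that end statements come back as NEWLINE.

```python
import io
import tokenize

src = "x = (\n    1 +\n    2\n)\ny = 3\n"

for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    if tok.type in (tokenize.NEWLINE, tokenize.NL):
        # start is a (row, column) pair; tok_name maps the numeric type to its name
        print(tok.start[0], tokenize.tok_name[tok.type])
```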
18.5 Indentation as Tokens
Python uses indentation as syntax.
That means the tokenizer must turn leading whitespace into tokens.
Example:
if ready:
    run()
    log()
finish()
The parser cannot understand this source using only names and punctuation. It needs explicit block boundaries.
The tokenizer emits:
NAME "if"
NAME "ready"
COLON ":"
NEWLINE "\n"
INDENT " "
NAME "run"
LPAR "("
RPAR ")"
NEWLINE "\n"
NAME "log"
LPAR "("
RPAR ")"
NEWLINE "\n"
DEDENT ""
NAME "finish"
LPAR "("
RPAR ")"
NEWLINE "\n"
ENDMARKER ""
Indentation creates a virtual block start. Dedentation creates a virtual block end.
CPython maintains an indentation stack. At the start of a logical line, the tokenizer measures leading whitespace and compares it with the current indentation level.
Simplified model:
indent_stack = [0]
at beginning of logical line:
    col = indentation_column()
    if col > indent_stack[-1]:
        push col
        emit INDENT
    else if col == indent_stack[-1]:
        emit no indentation token
    else:
        while col < indent_stack[-1]:
            pop
            emit DEDENT
        if col != indent_stack[-1]:
            report indentation error
This stack discipline explains why inconsistent indentation is a lexical error before normal parsing can continue.
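The effect of this stack is visible from Python. A minimal sketch using the public tokenizer, which emits the same INDENT/DEDENT pairing for this input as the parser-facing model described above:

```python
import io
import tokenize

src = "if ready:\n    run()\n    log()\nfinish()\n"

for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    if tok.type in (tokenize.INDENT, tokenize.DEDENT):
        # One INDENT when the block opens, one DEDENT when indentation drops back.
        print(tok.start, tokenize.tok_name[tok.type], repr(tok.string))
```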
18.6 Tabs, Spaces, and Indentation Columns
Indentation is measured in columns, not raw characters.
Spaces advance by one column. Tabs advance to the next tab stop. Python’s tab handling exists for compatibility, but mixing tabs and spaces can produce ambiguous indentation and errors.
Example:
if x:
\tprint("tab")
        print("spaces")
The visual alignment may depend on editor settings. CPython cannot trust how a human editor displays this. It computes indentation using the language’s tab rules and raises errors when indentation is inconsistent.
The important internal point is that indentation is not stored as “number of leading characters.” It is converted into indentation levels. Those levels are compared against the indentation stack.
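A minimal sketch of that column arithmetic, assuming a tab stop of 8 columns, which matches the C tokenizer's historical TABSIZE; the real tokenizer additionally tracks an alternate measurement so it can raise TabError when tabs and spaces make the result ambiguous.

```python
def indentation_column(line: str, tabsize: int = 8) -> int:
    """Count indentation in columns, not characters (sketch, tab stop assumed)."""
    col = 0
    for ch in line:
        if ch == " ":
            col += 1
        elif ch == "\t":
            col = (col // tabsize + 1) * tabsize   # jump to the next tab stop
        else:
            break
    return col

print(indentation_column("    x = 1"))   # 4
print(indentation_column("\tx = 1"))     # 8
print(indentation_column("  \tx = 1"))   # 8: the tab absorbs the two spaces
```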
18.7 Blank Lines and Comment-Only Lines
Blank lines do not usually produce parser-significant tokens.
Example:
x = 1

y = 2
The empty line does not terminate a block or create a statement.
Comment-only lines behave similarly for the parser:
x = 1
# comment
y = 2
The public tokenize module returns comments because tools need them. CPython’s parser does not treat comments as syntax.
This difference is important:
| Consumer | Needs comments? | Reason |
|---|---|---|
| Parser | No | Comments do not affect grammar |
| Formatter | Yes | Comments must be preserved |
| Syntax highlighter | Yes | Comments need styling |
| Linter | Yes | Comments may contain directives |
| Type checker | Sometimes | Comments may contain type comments |
The public tokenizer is a tool API. The C tokenizer is part of the compiler front end.
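The difference shows up directly in the public token stream. In this sketch the comment line contributes a COMMENT token followed by an NL, neither of which carries grammatical weight for the parser.

```python
import io
import tokenize

src = "x = 1\n# comment\ny = 2\n"

for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    if tok.type in (tokenize.COMMENT, tokenize.NL):
        print(tok.start[0], tokenize.tok_name[tok.type], repr(tok.string))
```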
18.8 Names, Keywords, and Soft Keywords
Identifiers are tokenized as names.
Example:
total = price + tax
The tokenizer sees:
NAME "total"
EQUAL "="
NAME "price"
PLUS "+"
NAME "tax"Traditional Python keywords include words such as:
def
class
if
else
while
for
try
except
return
yield
import
from
with
lambda
At the lexical level, these are name-shaped sequences. The tokenizer or parser can classify them according to grammar needs.
Modern Python also has soft keywords. A soft keyword acts like a keyword only in specific grammar positions.
Examples include words used by pattern matching:
match value:
    case 0:
        pass
match and case can still be used as ordinary names in other contexts where the grammar permits it.
This is one reason tokenization and parsing must cooperate. A tokenizer that permanently converted every occurrence of match into a hard keyword token would reject valid code in contexts where match is only a name.
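The standard keyword module reflects this split; the exact soft-keyword list depends on the Python version.

```python
import keyword

print(keyword.iskeyword("def"))         # True: hard keyword, reserved everywhere
print(keyword.iskeyword("match"))       # False: still usable as an ordinary name
print(keyword.issoftkeyword("match"))   # True on versions with pattern matching
print(keyword.softkwlist)               # e.g. ['_', 'case', 'match', 'type'] on recent versions
```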
The practical rule:
hard keyword: reserved everywhere
soft keyword: special only in selected grammar positions
name: ordinary identifier
18.9 Unicode Identifiers
Python identifiers may contain many Unicode characters.
Example:
π = 3.14159
面积 = 42
The tokenizer must recognize identifier start and continuation characters according to Python’s identifier rules. This gives Python source code broad Unicode support.
But identifiers are still normalized and checked according to language rules. Not every Unicode character is legal in a name, and some visually similar characters can be distinct.
From an internals perspective, identifier handling requires:
Unicode-aware character classification
identifier start validation
identifier continuation validation
normalization rules
error reporting for invalid characters
This makes Python tokenization more complex than an ASCII-only language tokenizer.
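These rules have Python-level counterparts that make the checks easy to demonstrate: str.isidentifier applies the start and continuation rules, and identifiers are normalized with NFKC, so visually different spellings can denote the same name.

```python
import unicodedata

print("π".isidentifier())      # True: Greek letters are valid identifier characters
print("面积".isidentifier())    # True: CJK characters are valid as well
print("2x".isidentifier())     # False: identifiers cannot start with a digit

# NFKC normalization folds compatibility characters together, so a fullwidth
# letter spells the same identifier as its ASCII form.
print(unicodedata.normalize("NFKC", "ｘ") == "x")   # True
```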
18.10 Numeric Literals
The tokenizer recognizes numeric literals before the parser builds expressions.
Examples:
123
0b1010
0o755
0xff
1_000_000
3.14
10.
.5
1e9
1.2e-3
3j
These become number tokens.
The tokenizer must validate lexical form:
base prefixes
digits allowed in each base
underscore placement
decimal points
exponents
imaginary suffix
Some invalid numbers fail during tokenization:
0b102
1__2
The tokenizer does not evaluate arbitrary arithmetic. It only recognizes the literal token. Later compilation stages convert the token text into the corresponding Python object.
Example:
x = 1 + 2
Tokenization sees:
NAME "x"
EQUAL "="
NUMBER "1"
PLUS "+"
NUMBER "2"The fact that 1 + 2 can be folded into 3 belongs to later compiler optimization, not tokenization.
18.11 String Literals
String tokenization is more complicated than numeric tokenization.
Python supports:
"hello"
'hello'
"""hello"""
'''hello'''
r"\n"
b"bytes"
f"value={x}"
fr"path={name}\n"The tokenizer must identify:
string prefixes
quote style
single-line or triple-quoted form
raw strings
bytes strings
f-strings
escape sequences
line continuation rules
string termination
A normal string token is recognized as one lexical unit:
x = "hello"Token stream:
NAME "x"
EQUAL "="
STRING "\"hello\""
Triple-quoted strings can span physical lines:
text = """
line 1
line 2
"""The tokenizer must keep scanning until it finds the matching triple quote.
Unterminated strings are tokenizer errors:
x = "missing endThe parser cannot recover meaningful grammar from an unterminated string because the tokenizer cannot produce a valid token stream.
18.12 F-Strings
F-strings are special because they contain both literal string content and embedded Python expressions.
Example:
name = "Ada"
text = f"hello {name.upper()}"Inside the string, this part is literal text:
hello This part is Python expression syntax:
name.upper()The tokenizer and parser must cooperate to handle this nested structure.
Conceptually:
enter f-string mode
scan literal characters
when "{" starts expression:
tokenize embedded Python expression
parse embedded expression
return to f-string literal scanning
finish at closing quoteNested expression handling makes f-strings much richer than ordinary string literals. They are not just string tokens with later text replacement. They contain syntax that must be parsed into expression nodes.
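On Python 3.12 and later (PEP 701), this structure is visible directly in the public token stream as FSTRING_START, FSTRING_MIDDLE, and FSTRING_END tokens surrounding ordinary expression tokens; older versions return the whole f-string as a single STRING token instead.

```python
import io
import tokenize

src = 'text = f"hello {name.upper()}"\n'

# Print every token so the f-string's internal structure (or lack of it,
# on older versions) is visible.
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```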
18.13 Operators and Delimiters
Python operators and delimiters include single-character and multi-character forms.
Examples:
+ - * / // % **
= == != < <= > >=
:= -> @ @=
( ) [ ] { }
, : . ; ...
The tokenizer usually applies longest-match behavior.
For example, when reading **=, it should produce one power-assignment operator token rather than *, *, and =.
Simplified logic:
if next characters form "**=":
    emit DOUBLESTAREQUAL
else if next characters form "**":
    emit DOUBLESTAR
else if next character is "*":
    emit STAR
This rule is common in tokenizers. It keeps the parser from having to reconstruct multi-character operators from smaller pieces.
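The behavior is easy to confirm with the public tokenizer, which reports multi-character operators as single OP tokens:

```python
import io
import tokenize

src = "x **= 2\n"

for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    if tok.type == tokenize.OP:
        print(repr(tok.string))   # '**=' arrives as one token, not '*', '*', '='
```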
18.14 Error Tokens and Lexical Errors
Some errors appear before parsing.
Examples:
x = "unterminatedif x:
a = 1
b = 2x = 0b123These are lexical or indentation errors.
The tokenizer must report enough information for useful diagnostics:
filename
line number
column offset
source line
error type
error message
Common tokenization-stage errors include:
| Error | Cause |
|---|---|
| SyntaxError | Invalid lexical structure or token sequence |
| IndentationError | Invalid indentation level |
| TabError | Ambiguous indentation from tabs and spaces |
| TokenError | Public tokenizer error for incomplete input |
Not every SyntaxError originates in tokenization. Many come from parsing. But the tokenizer owns errors that prevent a valid token stream from existing.
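All three exception classes can be observed by compiling small snippets, since compile() runs the same C tokenizer and parser used for imports. Exact messages vary across versions; IndentationError and TabError are subclasses of SyntaxError, so one handler catches them all.

```python
bad_sources = [
    'x = "missing end\n',                   # unterminated string literal
    "if x:\n        a = 1\n  b = 2\n",      # dedent to a level that was never pushed
    "if x:\n\ta = 1\n        b = 2\n",      # tab vs. spaces at the same block level
]

for src in bad_sources:
    try:
        compile(src, "<example>", "exec")
    except SyntaxError as exc:
        print(type(exc).__name__, "-", exc.msg)
```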
18.15 End of File and Synthetic Dedents
At end of file, CPython must close any open indentation blocks.
Example:
if x:
    if y:
        run()
The source ends while two indentation levels are still active. The tokenizer emits synthetic DEDENT tokens before ENDMARKER.
Conceptually:
NAME "if"
NAME "x"
COLON ":"
NEWLINE
INDENT
NAME "if"
NAME "y"
COLON ":"
NEWLINE
INDENT
NAME "run"
LPAR
RPAR
NEWLINE
DEDENT
DEDENT
ENDMARKER
This lets the parser see block endings even when there are no explicit closing braces.
The tokenizer therefore creates tokens that have no direct character in the source file. INDENT, DEDENT, and ENDMARKER are structural tokens.
18.16 Tokenizer State
A tokenizer is stateful.
It must remember:
current input pointer
current line
current column
current indentation stack
current nesting level
whether scanning begins a line
whether inside a string
whether inside an f-string expression
whether an encoding was detected
whether interactive mode is active
pending INDENT or DEDENT tokens
error state
A stateless scanner would be insufficient for Python because meaning depends on layout and context.
Example:
x = [
    1,
    2,
]
The newline after 1, appears inside brackets. It should not become a logical NEWLINE.
Example:
if x:
    y = 1
z = 2
The reduced indentation before z causes a DEDENT.
Those decisions require remembered state.
18.17 Interactive Tokenization
Interactive input has special cases.
In a REPL, CPython often needs to decide whether input is complete.
Example:
>>> if x:
...
This is incomplete because a block body is expected.
Example:
>>> x = (1 +
...
This is incomplete because the parenthesized expression remains open.
The tokenizer and parser cooperate to decide whether to request another line or raise an error. Interactive mode therefore differs from file mode. End-of-input in a file means true EOF. End-of-input in the REPL may mean “ask for more text.”
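The standard codeop module exposes this decision to Python code, and the REPL machinery in the code module is built on it: compile_command returns None when more input is needed, a code object when the input is complete, and raises SyntaxError when the input can never become valid.

```python
import codeop

print(codeop.compile_command("if x:"))          # None: a block body is still expected
print(codeop.compile_command("x = (1 +"))       # None: the parentheses are still open
print(codeop.compile_command("x = 1") is None)  # False: a complete statement compiles
```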
18.18 Public tokenize Module
The standard library exposes tokenization through tokenize.
Example:
from io import BytesIO
import tokenize
src = b"x = 1 + 2\n"
for tok in tokenize.tokenize(BytesIO(src).readline):
    print(tok)
Output is shaped like:
TokenInfo(type=ENCODING, string='utf-8', ...)
TokenInfo(type=NAME, string='x', ...)
TokenInfo(type=OP, string='=', ...)
TokenInfo(type=NUMBER, string='1', ...)
TokenInfo(type=OP, string='+', ...)
TokenInfo(type=NUMBER, string='2', ...)
TokenInfo(type=NEWLINE, string='\n', ...)
TokenInfo(type=ENDMARKER, string='', ...)
The public tokenizer is useful for:
formatters
linters
code generators
syntax highlighters
refactoring tools
documentation tools
source-to-source transforms
The tokenize documentation describes it as a lexical scanner for Python source and notes that it returns comments as tokens, which makes it useful for pretty-printers and colorizers. (Python documentation)
18.19 C Tokenizer vs Python Tokenizer
There are two related tokenizer concepts in CPython:
| Component | Location | Purpose |
|---|---|---|
| C tokenizer | CPython parser/compiler internals | Feed the parser during compilation |
| Lib/tokenize.py | Standard library | Expose tokenization to Python tools |
They are not identical interfaces.
The C tokenizer is optimized for CPython’s compiler pipeline. It produces what the parser needs.
The Python tokenizer is a public tool interface. It preserves comments, exposes encoding, returns rich TokenInfo objects, and is designed for external consumers.
This distinction explains why token streams from tokenize may contain information the parser ignores.
18.20 Tokenization Example in Detail
Consider this source:
def area(r):
    pi = 3.14159
    return pi * r * r
A simplified token stream:
NAME "def"
NAME "area"
LPAR "("
NAME "r"
RPAR ")"
COLON ":"
NEWLINE "\n"
INDENT " "
NAME "pi"
EQUAL "="
NUMBER "3.14159"
NEWLINE "\n"
NAME "return"
NAME "pi"
STAR "*"
NAME "r"
STAR "*"
NAME "r"
NEWLINE "\n"
DEDENT ""
ENDMARKER ""Important points:
- def is lexically name-shaped but grammatically acts as a keyword.
- The function body begins because indentation increases after NEWLINE.
- 3.14159 is a single number token.
- return pi * r * r is one logical line.
- The function body ends through a synthetic DEDENT.
- The file ends through ENDMARKER.
The parser receives this stream and matches it against grammar rules for function definitions, suites, assignments, return statements, and expressions.
18.21 Tokenization Does Not Understand Full Semantics
The tokenizer does not know that this name is undefined:
print(missing_name)
It does not know that this call will fail:
1()
It does not know whether this import exists:
import does_not_exist
It only emits tokens.
Semantic checks happen later, often at runtime.
Tokenization is intentionally shallow. It recognizes lexical form, not program meaning.
18.22 Why Tokenization Matters
Tokenization seems small, but it shapes the whole language.
It defines:
how indentation becomes syntax
how source bytes become characters
how comments are ignored or preserved
how strings are delimited
how f-strings embed expressions
how operators are recognized
how logical lines are formed
how parser input is structured
For CPython contributors, tokenizer bugs can affect syntax, diagnostics, tools, compatibility, and security. A small lexical change can alter how every Python file is parsed.
For tooling authors, tokenization is often the best layer to work at. It preserves source-level information that the AST discards, including comments, exact spacing, physical lines, and operator spelling.
18.23 Minimal Mental Model
Use this model:
The tokenizer reads decoded Python source.
It emits lexical tokens.
It tracks indentation, nesting, strings, and line boundaries.
It inserts structural tokens such as INDENT, DEDENT, and ENDMARKER.
It reports lexical errors before parsing.
The parser consumes tokens and builds syntax structure.
That is the bridge from raw source text to grammar.