# 18. Tokenization

Tokenization is the first structural stage in CPython’s compilation pipeline.

It receives Python source text and produces a stream of tokens. The parser consumes that token stream and builds syntax structure from it.

At this stage, CPython does not yet know whether a program is meaningful. It only recognizes lexical units: names, numbers, strings, operators, newlines, indentation, and end-of-file markers.

The tokenizer turns this:

```python
def add(a, b):
    return a + b
```

into a stream shaped like this:

```text
NAME        "def"
NAME        "add"
LPAR        "("
NAME        "a"
COMMA       ","
NAME        "b"
RPAR        ")"
COLON       ":"
NEWLINE     "\n"
INDENT      "    "
NAME        "return"
NAME        "a"
PLUS        "+"
NAME        "b"
NEWLINE     "\n"
DEDENT      ""
ENDMARKER   ""
```

The exact token names and parser interface vary across CPython versions, but the core idea is stable. The Python standard library exposes a Python-level tokenizer through `tokenize`, while CPython’s parser uses its own C tokenizer internally. The public `tokenize` module also returns comments and an initial encoding token, which makes it suitable for tools such as formatters and syntax highlighters. ([Python documentation][1])
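The conceptual stream above can be reproduced with the standard library's `tokenize` module. One wrinkle: the public tokenizer reports punctuation under the generic `OP` type, and the specific names (`LPAR`, `COMMA`, `PLUS`, ...) are exposed through each token's `exact_type` attribute:

```python
import io
import tokenize

src = "def add(a, b):\n    return a + b\n"

# generate_tokens() scans already-decoded text, one readline at a time.
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))

# exact_type distinguishes LPAR, COMMA, PLUS, ... within the generic OP type.
names = [tokenize.tok_name[tok.exact_type] for tok in tokens]

for name, tok in zip(names, tokens):
    print(f"{name:<12}{tok.string!r}")
```

The output closely matches the stream shown above, including the synthetic `INDENT`, `DEDENT`, and `ENDMARKER` tokens.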

## 18.1 Position in the Compilation Pipeline

The full source-to-execution path is:

```text
bytes from file or string
    ↓
encoding detection
    ↓
decoded source text
    ↓
tokenizer
    ↓
token stream
    ↓
parser
    ↓
abstract syntax tree
    ↓
symbol table
    ↓
code object
    ↓
bytecode execution
```

Tokenization sits between raw source input and grammar parsing.

The tokenizer answers questions such as:

```text
Where does this logical line end?
Is this identifier a name?
Is this numeric literal well-formed?
Is this string literal closed?
Did indentation increase or decrease?
Is this character part of an operator?
Has the source reached end-of-file?
```

The parser answers different questions:

```text
Is this a valid function definition?
Is this expression allowed here?
Does this statement match a grammar rule?
Does this sequence form a valid pattern match?
How should these tokens be grouped into an AST?
```

The tokenizer does not build an AST. It only produces a sequence of lexical events.

## 18.2 Source Input and Encoding

Python source usually begins as bytes.

Before tokenization can proceed, CPython must determine how to decode those bytes into text. Python source files are UTF-8 by default, but a file may declare another encoding near the top.

Typical encoding declaration:

```python
# -*- coding: latin-1 -*-
```

or:

```python
# coding: utf-8
```

The tokenizer must handle this early because it cannot reliably classify source characters until the source is decoded.

At the Python level, `tokenize.tokenize()` begins by returning an `ENCODING` token. The `token` documentation notes that this `ENCODING` token is needed for the Python `tokenize` module and is not used by CPython’s C tokenizer in the same way. ([Python documentation][2])

The practical model is:

```text
read first source lines
detect encoding declaration if present
decode source bytes
normalize line handling
begin lexical scanning
```

This is why tokenization is not just a loop over characters. It also owns the boundary between external source bytes and internal source text.
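The detection step is exposed directly as `tokenize.detect_encoding()`, which reads at most the first two lines of the byte stream. Note that it normalizes some declared names (for example, `latin-1` aliases):

```python
import io
import tokenize

# Default: no declaration means UTF-8.
enc, _lines = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)
print(enc)  # utf-8

# A coding declaration in the first two lines overrides the default.
declared, _lines = tokenize.detect_encoding(
    io.BytesIO(b"# -*- coding: latin-1 -*-\nx = 1\n").readline
)
print(declared)
```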

## 18.3 Physical Lines and Logical Lines

Python source has physical lines and logical lines.

A physical line is one line in the source file.

A logical line is one complete Python statement or expression unit as seen by the parser.

Usually they are the same:

```python
x = 1
y = 2
```

Here, each physical line is also a logical line.

But Python allows explicit line joining with backslash:

```python
x = 1 + \
    2 + \
    3
```

This is one logical line spread across three physical lines.

Python also allows implicit line joining inside parentheses, brackets, and braces:

```python
values = [
    1,
    2,
    3,
]
```

Inside grouping delimiters, newlines do not end the logical statement. The tokenizer tracks nesting depth so it can distinguish significant newlines from non-significant newlines.

Conceptually:

```text
paren_level = 0

when "(" or "[" or "{" appears:
    paren_level += 1

when ")" or "]" or "}" appears:
    paren_level -= 1

when newline appears:
    if paren_level == 0:
        emit NEWLINE
    else:
        ignore as statement terminator
```

This rule is essential for Python’s readable multiline syntax.

## 18.4 `NEWLINE` and `NL`

Python-level tokenization distinguishes between logical-line newlines and non-terminating newlines.

At the public `tokenize` level:

| Token     | Meaning                                  |
| --------- | ---------------------------------------- |
| `NEWLINE` | Ends a logical line                      |
| `NL`      | Newline that does not end a logical line |

Example:

```python
x = (
    1 +
    2
)
y = 3
```

The newlines inside the parentheses are not statement terminators. The parser should not treat `1 +` as a complete statement. Those line breaks exist for layout, not grammar.

In a tool-facing tokenizer, they appear as `NL`. In the parser-facing model, they are ignored or treated differently from real logical newlines.

This distinction matters for formatters and linters. A formatter may care about every physical newline. The parser only needs logical structure.
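The distinction is easy to observe with the public tokenizer. In the five-line example above, the three line breaks inside the parentheses come back as `NL`, and only the two statement-ending breaks come back as `NEWLINE`:

```python
import io
import tokenize

src = "x = (\n    1 +\n    2\n)\ny = 3\n"
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))

nl_count = sum(1 for t in tokens if t.type == tokenize.NL)
newline_count = sum(1 for t in tokens if t.type == tokenize.NEWLINE)
print(nl_count, newline_count)  # 3 2
```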

## 18.5 Indentation as Tokens

Python uses indentation as syntax.

That means the tokenizer must turn leading whitespace into tokens.

Example:

```python
if ready:
    run()
    log()
finish()
```

The parser cannot understand this source using only names and punctuation. It needs explicit block boundaries.

The tokenizer emits:

```text
NAME      "if"
NAME      "ready"
COLON     ":"
NEWLINE   "\n"
INDENT    "    "
NAME      "run"
LPAR      "("
RPAR      ")"
NEWLINE   "\n"
NAME      "log"
LPAR      "("
RPAR      ")"
NEWLINE   "\n"
DEDENT    ""
NAME      "finish"
LPAR      "("
RPAR      ")"
NEWLINE   "\n"
ENDMARKER ""
```

Indentation creates a virtual block start. Dedentation creates a virtual block end.

CPython maintains an indentation stack. At the start of a logical line, the tokenizer measures leading whitespace and compares it with the current indentation level.

Simplified model:

```text
indent_stack = [0]

at beginning of logical line:
    col = indentation_column()

    if col > indent_stack[-1]:
        push col
        emit INDENT

    else if col == indent_stack[-1]:
        emit no indentation token

    else:
        while col < indent_stack[-1]:
            pop
            emit DEDENT

        if col != indent_stack[-1]:
            report indentation error
```

This stack discipline explains why inconsistent indentation is a lexical error before normal parsing can continue.
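The stack model above can be sketched in Python. This is an illustrative simplification, not CPython's actual C implementation; the helper name `indentation_tokens` is invented for the example:

```python
def indentation_tokens(columns):
    """Turn a sequence of per-line indentation columns into
    INDENT/DEDENT events, mirroring the stack discipline above."""
    stack = [0]
    events = []
    for col in columns:
        if col > stack[-1]:
            stack.append(col)
            events.append("INDENT")
        else:
            while col < stack[-1]:
                stack.pop()
                events.append("DEDENT")
            if col != stack[-1]:
                raise IndentationError(
                    "unindent does not match any outer indentation level"
                )
    # Close any blocks still open at end of file.
    while len(stack) > 1:
        stack.pop()
        events.append("DEDENT")
    return events

print(indentation_tokens([0, 4, 4, 0]))  # ['INDENT', 'DEDENT']
print(indentation_tokens([0, 4, 8]))     # ['INDENT', 'INDENT', 'DEDENT', 'DEDENT']
```

A dedent to a column that was never pushed, such as `[0, 4, 2]`, raises the indentation error before any parsing happens.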

## 18.6 Tabs, Spaces, and Indentation Columns

Indentation is measured in columns, not raw characters.

Spaces advance by one column. Tabs advance to the next tab stop, which the language reference defines as the next multiple of eight columns. Python's tab handling exists for compatibility, but mixing tabs and spaces can produce ambiguous indentation and errors.

Example:

```python
if x:
\tprint("tab")
    print("spaces")
```

The visual alignment may depend on editor settings. CPython cannot trust how a human editor displays this. It computes indentation using the language’s tab rules and raises errors when indentation is inconsistent.

The important internal point is that indentation is not stored as “number of leading characters.” It is converted into indentation levels. Those levels are compared against the indentation stack.
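A sketch of that column computation, using eight-column tab stops as the language reference specifies (the helper name `indent_columns` is invented for illustration):

```python
def indent_columns(line, tabsize=8):
    """Compute the indentation column of a line's leading whitespace,
    treating each tab as a jump to the next multiple of `tabsize`."""
    col = 0
    for ch in line:
        if ch == " ":
            col += 1
        elif ch == "\t":
            col = (col // tabsize + 1) * tabsize
        else:
            break
    return col

print(indent_columns("\tx"))       # 8: tab jumps to the first tab stop
print(indent_columns("    \tx"))   # 8: four spaces, then tab jumps to the stop
print(indent_columns("        x")) # 8: eight literal spaces
```

All three prefixes land on the same column even though they contain different characters, which is exactly why raw character counts are not a usable measure of indentation.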

## 18.7 Blank Lines and Comment-Only Lines

Blank lines do not usually produce parser-significant tokens.

Example:

```python
x = 1

y = 2
```

The empty line does not terminate a block or create a statement.

Comment-only lines behave similarly for the parser:

```python
x = 1
# comment
y = 2
```

The public `tokenize` module returns comments because tools need them. CPython’s parser does not treat comments as syntax.

This difference is important:

| Consumer           | Needs comments? | Reason                             |
| ------------------ | --------------: | ---------------------------------- |
| Parser             |              No | Comments do not affect grammar     |
| Formatter          |             Yes | Comments must be preserved         |
| Syntax highlighter |             Yes | Comments need styling              |
| Linter             |             Yes | Comments may contain directives    |
| Type checker       |       Sometimes | Comments may contain type comments |

The public tokenizer is a tool API. The C tokenizer is part of the compiler front end.
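The tool-facing behavior is visible directly: the public tokenizer returns the comment as a `COMMENT` token, followed by an `NL` for the line break that ends no statement:

```python
import io
import tokenize

src = "x = 1\n# comment\ny = 2\n"
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))

comments = [t.string for t in tokens if t.type == tokenize.COMMENT]
print(comments)  # ['# comment']
```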

## 18.8 Names, Keywords, and Soft Keywords

Identifiers are tokenized as names.

Example:

```python
total = price + tax
```

The tokenizer sees:

```text
NAME "total"
EQUAL "="
NAME "price"
PLUS "+"
NAME "tax"
```

Traditional Python keywords include words such as:

```text
def
class
if
else
while
for
try
except
return
yield
import
from
with
lambda
```

At the lexical level, these are name-shaped sequences. The tokenizer or parser can classify them according to grammar needs.

Modern Python also has soft keywords. A soft keyword acts like a keyword only in specific grammar positions.

Examples include words used by pattern matching:

```python
match value:
    case 0:
        pass
```

`match` and `case` can still be used as ordinary names in other contexts where the grammar permits it.

This is one reason tokenization and parsing must cooperate. A tokenizer that permanently converted every occurrence of `match` into a hard keyword token would reject valid code in contexts where `match` is only a name.

The practical rule:

```text
hard keyword: reserved everywhere
soft keyword: special only in selected grammar positions
name: ordinary identifier
```

## 18.9 Unicode Identifiers

Python identifiers may contain many Unicode characters.

Example:

```python
π = 3.14159
面积 = 42
```

The tokenizer must recognize identifier start and continuation characters according to Python’s identifier rules. This gives Python source code broad Unicode support.

But identifiers are still normalized and checked according to language rules. Not every Unicode character is legal in a name, and some visually similar characters can be distinct.

From an internals perspective, identifier handling requires:

```text
Unicode-aware character classification
identifier start validation
identifier continuation validation
normalization rules
error reporting for invalid characters
```

This makes Python tokenization more complex than an ASCII-only language tokenizer.
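The identifier rules are exposed at the Python level through `str.isidentifier()`, which applies the Unicode start/continuation classification (though not the keyword check, and separately from the NFKC normalization the compiler applies):

```python
# str.isidentifier() applies Python's Unicode identifier rules.
print("π".isidentifier())       # True
print("面积".isidentifier())     # True
print("1x".isidentifier())      # False: a digit cannot start a name
print("my-var".isidentifier())  # False: "-" is an operator, not a name character
```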

## 18.10 Numeric Literals

The tokenizer recognizes numeric literals before the parser builds expressions.

Examples:

```python
123
0b1010
0o755
0xff
1_000_000
3.14
10.
.5
1e9
1.2e-3
3j
```

These become number tokens.

The tokenizer must validate lexical form:

```text
base prefixes
digits allowed in each base
underscore placement
decimal points
exponents
imaginary suffix
```

Some invalid numbers fail during tokenization:

```python
0b102
1__2
```
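These failures can be observed with `compile()`, which runs the tokenizer as part of compilation. The helper `lexes_ok` below is an illustrative name, not a CPython API:

```python
def lexes_ok(src):
    """Return True if `src` survives CPython's tokenizer and parser."""
    try:
        compile(src, "<example>", "exec")
        return True
    except SyntaxError:
        return False

print(lexes_ok("x = 0b1010"))     # True
print(lexes_ok("x = 0b102"))      # False: 2 is not a binary digit
print(lexes_ok("x = 1_000_000"))  # True
print(lexes_ok("x = 1__2"))       # False: consecutive underscores
```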

The tokenizer does not evaluate arbitrary arithmetic. It only recognizes the literal token. Later compilation stages convert the token text into the corresponding Python object.

Example:

```python
x = 1 + 2
```

Tokenization sees:

```text
NAME     "x"
EQUAL    "="
NUMBER   "1"
PLUS     "+"
NUMBER   "2"
```

The fact that `1 + 2` can be folded into `3` belongs to later compiler optimization, not tokenization.

## 18.11 String Literals

String tokenization is more complicated than numeric tokenization.

Python supports:

```python
"hello"
'hello'
"""hello"""
'''hello'''
r"\n"
b"bytes"
f"value={x}"
fr"path={name}\n"
```

The tokenizer must identify:

```text
string prefixes
quote style
single-line or triple-quoted form
raw strings
bytes strings
f-strings
escape sequences
line continuation rules
string termination
```

A normal string token is recognized as one lexical unit:

```python
x = "hello"
```

Token stream:

```text
NAME    "x"
EQUAL   "="
STRING  "\"hello\""
```

Triple-quoted strings can span physical lines:

```python
text = """
line 1
line 2
"""
```

The tokenizer must keep scanning until it finds the matching triple quote.
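Even though it spans four physical lines, the whole literal arrives as a single `STRING` token whose start and end positions are on different rows:

```python
import io
import tokenize

src = 'text = """\nline 1\nline 2\n"""\n'
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))

strings = [t for t in tokens if t.type == tokenize.STRING]
print(len(strings))  # 1: the whole literal is one token
print(strings[0].start[0], strings[0].end[0])  # starts on row 1, ends on row 4
```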

Unterminated strings are tokenizer errors:

```python
x = "missing end
```

The parser cannot recover meaningful grammar from an unterminated string because the tokenizer cannot produce a valid token stream.

## 18.12 F-Strings

F-strings are special because they contain both literal string content and embedded Python expressions.

Example:

```python
name = "Ada"
text = f"hello {name.upper()}"
```

Inside the string, this part is literal text:

```text
hello 
```

This part is Python expression syntax:

```python
name.upper()
```

The tokenizer and parser must cooperate to handle this nested structure.

Conceptually:

```text
enter f-string mode
scan literal characters
when "{" starts expression:
    tokenize embedded Python expression
    parse embedded expression
return to f-string literal scanning
finish at closing quote
```

Nested expression handling makes f-strings much richer than ordinary string literals. They are not just string tokens with later text replacement. They contain syntax that must be parsed into expression nodes.
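The resulting structure is visible in the AST: an f-string parses to a `JoinedStr` node whose parts alternate between literal `Constant` text and `FormattedValue` nodes holding fully parsed expressions:

```python
import ast

tree = ast.parse('f"hello {name.upper()}"', mode="eval")
node = tree.body

print(type(node).__name__)  # JoinedStr
for part in node.values:
    print(type(part).__name__)
# Constant        (the literal "hello " text)
# FormattedValue  (the parsed name.upper() expression)
```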

## 18.13 Operators and Delimiters

Python operators and delimiters include single-character and multi-character forms.

Examples:

```text
+   -   *   /   //   %   **
=   ==  !=  <   <=   >   >=
:=  ->  @   @=
(   )   [   ]   {   }
,   :   .   ;   ...
```

The tokenizer usually applies longest-match behavior.

For example, when reading `**=`, it should produce one power-assignment operator token rather than `*`, `*`, and `=`.

Simplified logic:

```text
if next characters form "**=":
    emit DOUBLESTAREQUAL
else if next characters form "**":
    emit DOUBLESTAR
else if next character is "*":
    emit STAR
```

This rule is common in tokenizers. It keeps the parser from having to reconstruct multi-character operators from smaller pieces.
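Longest match is observable through `exact_type`: tokenizing `x **= 2` yields a single operator token, not three fragments:

```python
import io
import tokenize

src = "x **= 2\n"
ops = [
    (tokenize.tok_name[t.exact_type], t.string)
    for t in tokenize.generate_tokens(io.StringIO(src).readline)
    if t.type == tokenize.OP
]
print(ops)  # [('DOUBLESTAREQUAL', '**=')]
```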

## 18.14 Error Tokens and Lexical Errors

Some errors appear before parsing.

Examples:

```python
x = "unterminated
```

```python
if x:
  a = 1
 b = 2
```

```python
x = 0b123
```

These are lexical or indentation errors.

The tokenizer must report enough information for useful diagnostics:

```text
filename
line number
column offset
source line
error type
error message
```

Common tokenization-stage errors include:

| Error              | Cause                                       |
| ------------------ | ------------------------------------------- |
| `SyntaxError`      | Invalid lexical structure or token sequence |
| `IndentationError` | Invalid indentation level                   |
| `TabError`         | Ambiguous indentation from tabs and spaces  |
| `TokenError`       | Public tokenizer error for incomplete input |

Not every `SyntaxError` originates in tokenization. Many come from parsing. But the tokenizer owns errors that prevent a valid token stream from existing.
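The exception hierarchy can be probed with `compile()`. The helper `error_class` below is an illustrative name; it relies on `IndentationError` and `TabError` being subclasses of `SyntaxError`:

```python
def error_class(src):
    """Compile `src` and report which exception class the front end raises."""
    try:
        compile(src, "<example>", "exec")
        return None
    except SyntaxError as exc:  # IndentationError and TabError are subclasses
        return type(exc).__name__

print(error_class('x = "unterminated'))                # SyntaxError
print(error_class("if x:\n  a = 1\n b = 2\n"))         # IndentationError
print(error_class("if x:\n\ta = 1\n        b = 2\n"))  # TabError
```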

## 18.15 End of File and Synthetic Dedents

At end of file, CPython must close any open indentation blocks.

Example:

```python
if x:
    if y:
        run()
```

The source ends while two indentation levels are still active. The tokenizer emits synthetic `DEDENT` tokens before `ENDMARKER`.

Conceptually:

```text
NAME      "if"
NAME      "x"
COLON     ":"
NEWLINE
INDENT
NAME      "if"
NAME      "y"
COLON     ":"
NEWLINE
INDENT
NAME      "run"
LPAR
RPAR
NEWLINE
DEDENT
DEDENT
ENDMARKER
```

This lets the parser see block endings even when there are no explicit closing braces.

The tokenizer therefore creates tokens that have no direct character in the source file. `INDENT`, `DEDENT`, and `ENDMARKER` are structural tokens.
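The synthetic tokens are easy to count with the public tokenizer: the doubly nested example above produces exactly two `DEDENT` tokens before the final `ENDMARKER`:

```python
import io
import tokenize

src = "if x:\n    if y:\n        run()\n"
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))

dedents = sum(1 for t in tokens if t.type == tokenize.DEDENT)
print(dedents)  # 2: both open blocks are closed at end of file
print(tokenize.tok_name[tokens[-1].type])  # ENDMARKER
```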

## 18.16 Tokenizer State

A tokenizer is stateful.

It must remember:

```text
current input pointer
current line
current column
current indentation stack
current nesting level
whether scanning begins a line
whether inside a string
whether inside an f-string expression
whether an encoding was detected
whether interactive mode is active
pending INDENT or DEDENT tokens
error state
```

A stateless scanner would be insufficient for Python because meaning depends on layout and context.

Example:

```python
x = [
    1,
    2,
]
```

The newline after `1,` appears inside brackets. It should not become a logical `NEWLINE`.

Example:

```python
if x:
    y = 1
z = 2
```

The leading whitespace before `z` causes a `DEDENT`.

Those decisions require remembered state.

## 18.17 Interactive Tokenization

Interactive input has special cases.

In a REPL, CPython often needs to decide whether input is complete.

Example:

```python
>>> if x:
...
```

This is incomplete because a block body is expected.

Example:

```python
>>> x = (1 +
...
```

This is incomplete because the parenthesized expression remains open.

The tokenizer and parser cooperate to decide whether to request another line or raise an error. Interactive mode therefore differs from file mode. End-of-input in a file means true EOF. End-of-input in the REPL may mean “ask for more text.”
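The standard library exposes this completeness check through the `codeop` module, which the `code` module uses to implement interactive loops. `compile_command()` returns `None` for input that looks incomplete rather than raising an error:

```python
import codeop

# Incomplete: a block body is still expected, so the REPL should prompt again.
print(codeop.compile_command("if x:"))  # None

# Complete: a code object is returned.
print(codeop.compile_command("x = 1") is not None)  # True
```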

## 18.18 Public `tokenize` Module

The standard library exposes tokenization through `tokenize`.

Example:

```python
from io import BytesIO
import tokenize

src = b"x = 1 + 2\n"

for tok in tokenize.tokenize(BytesIO(src).readline):
    print(tok)
```

Output is shaped like:

```text
TokenInfo(type=ENCODING, string='utf-8', ...)
TokenInfo(type=NAME, string='x', ...)
TokenInfo(type=OP, string='=', ...)
TokenInfo(type=NUMBER, string='1', ...)
TokenInfo(type=OP, string='+', ...)
TokenInfo(type=NUMBER, string='2', ...)
TokenInfo(type=NEWLINE, string='\n', ...)
TokenInfo(type=ENDMARKER, string='', ...)
```

The public tokenizer is useful for:

```text
formatters
linters
code generators
syntax highlighters
refactoring tools
documentation tools
source-to-source transforms
```

The `tokenize` documentation describes it as a lexical scanner for Python source and notes that it returns comments as tokens, which makes it useful for pretty-printers and colorizers. ([Python documentation][1])

## 18.19 C Tokenizer vs Python Tokenizer

There are two related tokenizer concepts in CPython:

| Component         | Location                          | Purpose                             |
| ----------------- | --------------------------------- | ----------------------------------- |
| C tokenizer       | CPython parser/compiler internals | Feed the parser during compilation  |
| `Lib/tokenize.py` | Standard library                  | Expose tokenization to Python tools |

They are not identical interfaces.

The C tokenizer is optimized for CPython’s compiler pipeline. It produces what the parser needs.

The Python tokenizer is a public tool interface. It preserves comments, exposes encoding, returns rich `TokenInfo` objects, and is designed for external consumers.

This distinction explains why token streams from `tokenize` may contain information the parser ignores.

## 18.20 Tokenization Example in Detail

Consider this source:

```python
def area(r):
    pi = 3.14159
    return pi * r * r
```

A simplified token stream:

```text
NAME       "def"
NAME       "area"
LPAR       "("
NAME       "r"
RPAR       ")"
COLON      ":"
NEWLINE    "\n"
INDENT     "    "
NAME       "pi"
EQUAL      "="
NUMBER     "3.14159"
NEWLINE    "\n"
NAME       "return"
NAME       "pi"
STAR       "*"
NAME       "r"
STAR       "*"
NAME       "r"
NEWLINE    "\n"
DEDENT     ""
ENDMARKER  ""
```

Important points:

1. `def` is lexically name-shaped but grammatically acts as a keyword.
2. The function body begins because indentation increases after `NEWLINE`.
3. `3.14159` is a single number token.
4. `return pi * r * r` is one logical line.
5. The function body ends through a synthetic `DEDENT`.
6. The file ends through `ENDMARKER`.

The parser receives this stream and matches it against grammar rules for function definitions, suites, assignments, return statements, and expressions.

## 18.21 Tokenization Does Not Understand Full Semantics

The tokenizer does not know that this name is undefined:

```python
print(missing_name)
```

It does not know that this call will fail:

```python
1()
```

It does not know whether this import exists:

```python
import does_not_exist
```

It only emits tokens.

Semantic checks happen later, often at runtime.

Tokenization is intentionally shallow. It recognizes lexical form, not program meaning.

## 18.22 Why Tokenization Matters

Tokenization seems small, but it shapes the whole language.

It defines:

```text
how indentation becomes syntax
how source bytes become characters
how comments are ignored or preserved
how strings are delimited
how f-strings embed expressions
how operators are recognized
how logical lines are formed
how parser input is structured
```

For CPython contributors, tokenizer bugs can affect syntax, diagnostics, tools, compatibility, and security. A small lexical change can alter how every Python file is parsed.

For tooling authors, tokenization is often the best layer to work at. It preserves source-level information that the AST discards, including comments, exact spacing, physical lines, and operator spelling.

## 18.23 Minimal Mental Model

Use this model:

```text
The tokenizer reads decoded Python source.
It emits lexical tokens.
It tracks indentation, nesting, strings, and line boundaries.
It inserts structural tokens such as INDENT, DEDENT, and ENDMARKER.
It reports lexical errors before parsing.
The parser consumes tokens and builds syntax structure.
```

That is the bridge from raw source text to grammar.

[1]: https://docs.python.org/3/library/tokenize.html "tokenize — Tokenizer for Python source"
[2]: https://docs.python.org/3/library/token.html "token — Constants used with Python parse trees"

