# 72. Interpreter Dispatch

Interpreter dispatch is the mechanism that drives bytecode execution inside CPython.

The compiler transforms Python source code into bytecode instructions. The interpreter loop then repeatedly fetches, decodes, and executes those instructions. Dispatch is the process that selects the implementation for each opcode and transfers control to it.

This chapter focuses on the execution engine inside CPython:

```text
Python source
    ↓
compiler
    ↓
bytecode
    ↓
evaluation loop
    ↓
opcode dispatch
    ↓
object operations
```

The dispatch mechanism dominates interpreter performance. Even small changes to dispatch logic can affect the speed of nearly every Python program.
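These stages can be observed from Python itself. A small sketch (the exact opcode names vary across CPython versions):

```python
import dis

# The compiler stage: source text becomes a code object.
code = compile("x + 1", "<example>", "eval")

# The bytecode the evaluation loop would then execute.
names = [ins.opname for ins in dis.get_instructions(code)]
print(names)
```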

## 72.1 The Evaluation Loop

The core interpreter loop lives in:

```text
Python/ceval.c
```

Since Python 3.6 (PEP 523), the default frame evaluation function has been:

```c
_PyEval_EvalFrameDefault()
```

Earlier releases used `PyEval_EvalFrameEx()`.

This function executes one Python frame.

A simplified conceptual model looks like:

```c
for (;;) {
    opcode = fetch_next_opcode();
    switch (opcode) {
        case LOAD_FAST:
            ...
            break;

        case BINARY_OP:
            ...
            break;

        case RETURN_VALUE:
            ...
            return result;
    }
}
```

The real implementation is substantially more complex:

```text
instruction decoding
specialized opcodes
inline caches
stack manipulation
exception propagation
reference counting
signal handling
tracing hooks
adaptive optimization
computed goto dispatch
```

Despite this complexity, the structure remains fundamentally iterative:

```text
fetch
decode
dispatch
execute
repeat
```
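This fetch/decode/dispatch/execute cycle can be sketched as a toy Python stack machine. The opcodes and encoding here are invented for illustration; they are not CPython's:

```python
# Toy opcodes for a minimal stack machine (not CPython's real ones).
LOAD_CONST, BINARY_ADD, RETURN_VALUE = 0, 1, 2

def run(code, consts):
    stack = []
    pc = 0
    while True:
        opcode, oparg = code[pc]        # fetch + decode
        pc += 1
        if opcode == LOAD_CONST:        # dispatch + execute
            stack.append(consts[oparg])
        elif opcode == BINARY_ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif opcode == RETURN_VALUE:
            return stack.pop()

# Equivalent of `return 1 + 2`.
program = [(LOAD_CONST, 0), (LOAD_CONST, 1), (BINARY_ADD, 0), (RETURN_VALUE, 0)]
print(run(program, (1, 2)))  # → 3
```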

## 72.2 Frames as Execution Contexts

The interpreter executes bytecode inside a frame.

A frame contains the execution state for one active function call.

Conceptually:

```text
code object
instruction pointer
evaluation stack
local variables
globals
builtins
exception state
closure references
```

The interpreter loop repeatedly updates this frame state.

Each function call creates a new execution frame:

```python
def add(a, b):
    return a + b

add(1, 2)
```

At runtime:

```text
create frame
initialize locals
execute bytecode
return result
destroy frame
```

The frame acts as the working memory of the virtual machine.
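Frames can be inspected at runtime. This sketch uses `sys._getframe()`, a CPython-specific helper:

```python
import sys

def inner():
    frame = sys._getframe()          # the frame executing this call
    # f_back is the caller's frame; each has its own code object.
    return frame.f_code.co_name, frame.f_back.f_code.co_name

def outer():
    return inner()

print(outer())  # → ('inner', 'outer')
```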

## 72.3 Bytecode Instruction Format

CPython bytecode consists of instruction bytes stored in a code object.

Modern CPython uses a wordcode format: each instruction occupies a fixed-size two-byte unit, one opcode byte followed by one operand byte.

Conceptually:

```text
opcode
operand
opcode
operand
opcode
operand
```

Instructions are stored in:

```python
f.__code__.co_code
```

Example:

```python
def f(x):
    return x + 1
```

Disassembly:

```python
import dis
dis.dis(f)
```

Possible output:

```text
LOAD_FAST      0 (x)
LOAD_CONST     1 (1)
BINARY_OP      0 (+)
RETURN_VALUE
```

The interpreter processes these instructions sequentially unless control flow changes.
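The fixed-size layout can be verified directly from the raw instruction bytes:

```python
def f(x):
    return x + 1

raw = f.__code__.co_code
# Instructions (and, on 3.11+, their inline cache entries) occupy
# fixed-size two-byte units, so the stream length is always even.
print(len(raw), len(raw) % 2)
```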

## 72.4 Instruction Fetch

The dispatch loop begins by fetching the next opcode.

Conceptually:

```c
opcode = *next_instr++;
```

The interpreter maintains an instruction pointer into the bytecode stream.

Historically this was byte-oriented. Modern CPython stores instructions in a more structured format for decoding efficiency and inline cache integration.

The fetch phase must be extremely cheap because it executes for every instruction in every Python program.

Minor inefficiencies multiply across billions of executed opcodes.

## 72.5 Decode Phase

After fetching an opcode, the interpreter decodes its meaning.

Example:

```text
LOAD_FAST
```

means:

```text
load local variable onto evaluation stack
```

Example:

```text
BINARY_OP
```

means:

```text
pop two values
perform operation
push result
```

Some instructions include operands:

```text
LOAD_CONST 3
```

The operand identifies an entry in the constants table.

The decode stage interprets both opcode and operand fields.
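The `dis` module exposes the decoded form of each instruction: the raw operand index and the value it resolves to.

```python
import dis

def g():
    x = 42
    return x

# `arg` is the raw operand; `argval` is the resolved value
# (here, an entry in the constants table).
for ins in dis.get_instructions(g):
    if ins.opname == "LOAD_CONST":
        print(ins.arg, ins.argval)
```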

## 72.6 Stack-Based Execution Model

CPython uses a stack machine.

Most instructions consume values from the stack and push results back.

Example:

```python
x + y
```

Bytecode:

```text
LOAD_FAST x
LOAD_FAST y
BINARY_OP +
```

Execution:

```text
push x
push y
pop y
pop x
compute x + y
push result
```

Stack machines simplify code generation because intermediate values naturally flow through the stack.

The tradeoff is higher instruction count compared to register machines.
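The net stack effect of each instruction can be queried with `dis.stack_effect`; the `BINARY_OP` opcode exists on CPython 3.11+ only, so the sketch guards for it:

```python
import dis

# LOAD_FAST pushes one value onto the evaluation stack.
print(dis.stack_effect(dis.opmap["LOAD_FAST"], 0))      # → 1

if "BINARY_OP" in dis.opmap:                            # CPython 3.11+
    # BINARY_OP pops two operands and pushes one result.
    print(dis.stack_effect(dis.opmap["BINARY_OP"], 0))  # → -1
```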

## 72.7 Dispatch Mechanisms

Several dispatch strategies exist in virtual machines.

### Switch Dispatch

The simplest form:

```c
switch (opcode) {
    case LOAD_FAST:
        ...
        break;
}
```

Advantages:

```text
portable
simple
easy to debug
```

Disadvantages:

```text
branch-heavy
poor branch prediction
higher dispatch overhead
```

### Computed Goto Dispatch

Modern CPython primarily uses computed gotos on supported compilers.

Conceptually:

```c
goto *opcode_targets[opcode];
```

Each opcode jumps directly to its implementation label.

Advantages:

```text
fewer branch mispredictions
better CPU pipeline behavior
lower dispatch overhead
```

This significantly improves interpreter throughput.

The technique depends on compiler extensions such as GCC labels-as-values.

CPython falls back to switch dispatch on unsupported compilers.
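The spirit of computed goto can be sketched in Python as table-based dispatch: the opcode indexes directly into a handler table instead of walking a branch chain. Opcodes and handlers here are invented for illustration:

```python
# Toy handlers; each mutates a shared interpreter state.
def op_load_const(state, oparg):
    state["stack"].append(state["consts"][oparg])

def op_add(state, oparg):
    b = state["stack"].pop()
    a = state["stack"].pop()
    state["stack"].append(a + b)

# The opcode number is a direct index into this table,
# analogous to `goto *opcode_targets[opcode]`.
HANDLERS = [op_load_const, op_add]

def run(code, consts):
    state = {"stack": [], "consts": consts}
    for opcode, oparg in code:
        HANDLERS[opcode](state, oparg)   # indexed jump, no if/elif chain
    return state["stack"].pop()

print(run([(0, 0), (0, 1), (1, 0)], (2, 3)))  # → 5
```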

## 72.8 Why Dispatch Overhead Matters

Interpreter dispatch overhead is large relative to useful work.

Example:

```python
x = a + b
```

In a compiled language, this statement might become just a few native machine instructions.

In CPython it involves:

```text
opcode fetch
opcode decode
stack operations
reference count updates
dynamic type checks
slot lookup
possible method dispatch
overflow handling
result allocation
```

Even before actual arithmetic occurs, the interpreter performs substantial runtime work.

This is why interpreter optimization focuses heavily on reducing dispatch cost.
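The expansion factor is visible with `dis`: one source-level addition becomes several interpreted instructions, each paying fetch/decode/dispatch overhead.

```python
import dis

def f(a, b):
    x = a + b
    return x

ops = [i.opname for i in dis.get_instructions(f)]
print(len(ops), ops)   # several instructions for one addition
```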

## 72.9 Opcode Prediction

Older CPython versions used opcode prediction techniques.

Certain opcode pairs occur frequently:

```text
LOAD_FAST
LOAD_FAST
BINARY_OP
```

or:

```text
LOAD_FAST
RETURN_VALUE
```

The interpreter could speculate about likely next instructions and jump directly to them.

This reduced dispatch overhead slightly.

Modern adaptive specialization has reduced the importance of manual prediction schemes.

## 72.10 Computed Goto and CPU Pipelines

Modern CPUs rely heavily on branch prediction and instruction pipelines.

A naive switch dispatch creates unpredictable branches:

```text
switch(opcode)
```

The CPU may frequently mispredict the next branch target.

Mispredictions flush pipelines and waste cycles.

Computed goto dispatch improves locality:

```text
opcode directly selects target address
```

This reduces indirect branching overhead and improves branch predictor performance.

Interpreter engineering increasingly depends on CPU microarchitecture awareness.

## 72.11 Inline Caches

Modern CPython inserts inline cache entries into bytecode streams.

Certain operations repeatedly observe similar runtime types:

```python
obj.x
```

Often:

```text
obj has same type
attribute layout stable
lookup path unchanged
```

Instead of performing full dynamic lookup every time, the interpreter caches previous lookup information.

Dispatch then becomes:

```text
check cache validity
fast path if valid
fallback if invalid
```

This dramatically improves common operations.
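The cache entries are embedded directly in the instruction stream, so on 3.11+ the raw stream contains more two-byte units than `dis` shows by default:

```python
import dis

def get_x(obj):
    return obj.x

units = len(get_x.__code__.co_code) // 2       # two-byte units in the stream
shown = sum(1 for _ in dis.get_instructions(get_x))
# On 3.11+ the difference is inline cache entries following opcodes
# such as LOAD_ATTR; on older versions it is zero.
print(units, shown, units - shown)
```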

## 72.12 Adaptive Specialization

CPython 3.11 introduced a specializing adaptive interpreter.

Generic opcodes transform into specialized versions after observing runtime behavior.

Example:

```text
BINARY_OP
```

may specialize into:

```text
BINARY_OP_ADD_INT
```

or equivalent internal optimized forms.

This allows dispatch to skip generic dynamic logic.

Instead of:

```text
check types
select operation
dispatch arithmetic
```

the specialized opcode already assumes:

```text
both operands are integers
```

Specialized dispatch reduces runtime overhead substantially.
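Specialization can be observed after warming a function up; on 3.11+, `dis` accepts `adaptive=True` to show the quickened instructions actually in use:

```python
import dis
import sys

def add(a, b):
    return a + b

for _ in range(1000):          # warm up so the interpreter can specialize
    add(1, 2)

if sys.version_info >= (3, 11):
    # With int operands this may include a specialized form
    # such as BINARY_OP_ADD_INT, depending on version and build.
    print([i.opname for i in dis.get_instructions(add, adaptive=True)])
```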

## 72.13 Superinstructions

Some interpreters combine multiple common operations into fused instructions.

Example:

```text
LOAD_FAST
LOAD_FAST
```

might become:

```text
LOAD_FAST_LOAD_FAST
```

Advantages:

```text
fewer dispatches
better instruction locality
reduced interpreter overhead
```

Modern CPython includes several fused instructions.

These are interpreter-level analogues of instruction fusion in CPUs.
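The opcode name table can be scanned for fused load instructions; whether any appear, and under what names, varies across CPython versions, so this sketch makes no assumption about the result:

```python
import dis

# Fused instruction names contain both component names.
fused = sorted(n for n in dis.opname if n.count("LOAD_FAST") >= 2)
print(fused)   # possibly empty, depending on version
```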

## 72.14 Opcode Handlers

Each opcode has a handler implementation.

Example conceptual handler:

```c
TARGET(LOAD_FAST) {
    PyObject *value = GETLOCAL(oparg);

    if (value == NULL) {
        goto unbound_local_error;  /* raises UnboundLocalError */
    }

    Py_INCREF(value);
    PUSH(value);

    DISPATCH();
}
```

Important operations:

```text
load local variable
increment reference count
push onto stack
continue dispatch
```

Even simple handlers must carefully maintain interpreter invariants.

## 72.15 The Evaluation Stack

The frame contains a value stack.

Conceptually:

```text
bottom
  x
  y
  z
top
```

Instructions manipulate this stack directly.

Example:

```text
LOAD_CONST 1
LOAD_CONST 2
BINARY_OP +
```

Execution:

```text
push 1
push 2
pop 2
pop 1
compute result
push 3
```

The stack pointer moves constantly during execution.

Efficient stack access is critical for interpreter performance.

## 72.16 Error Handling During Dispatch

Opcode handlers can fail.

Example:

```python
1 / 0
```

The division opcode detects division by zero and raises an exception.

Interpreter flow changes:

```text
normal execution
    ↓
exception raised
    ↓
unwind stack
    ↓
find exception handler
    ↓
resume or terminate
```

The dispatch loop therefore integrates tightly with exception machinery.
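The unwinding path is visible from Python: execution leaves the failing opcode, finds the handler, and resumes there. On 3.11+, handler ranges live in a compact side table consulted only when an exception actually propagates.

```python
import sys

def divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        return None            # execution resumes in this handler

print(divide(1, 0))  # → None

if sys.version_info >= (3, 11):
    # Non-empty for functions containing try blocks.
    print(len(divide.__code__.co_exceptiontable))
```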

## 72.17 Reference Counting Inside Dispatch

Almost every opcode manipulates references.

Example:

```python
x = y
```

requires:

```text
increment new reference
decrement overwritten reference
```

Arithmetic operations may allocate new objects:

```python
x + y
```

Handlers must correctly maintain ownership.

Reference counting overhead is deeply intertwined with interpreter dispatch cost.
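The reference traffic is observable with the CPython-specific `sys.getrefcount` (which counts its own temporary argument reference, so only the difference matters):

```python
import sys

x = object()
before = sys.getrefcount(x)    # includes the temporary argument reference
y = x                          # the store opcode added one reference
after = sys.getrefcount(x)
print(after - before)  # → 1
```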

## 72.18 Fast Locals

Local variables are optimized heavily.

Instead of dictionary lookups, frames store locals in arrays:

```text
localsplus[0]
localsplus[1]
localsplus[2]
```

`LOAD_FAST` becomes array indexing instead of hash lookup.

This explains why local variables are faster than globals.

Global access:

```text
dictionary lookup
hash computation
namespace resolution
```

Local access:

```text
array access
```

Dispatch efficiency depends heavily on such layout optimizations.
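The slot layout is exposed on the code object: `LOAD_FAST`'s operand is an index into this tuple rather than a key hashed into a dictionary.

```python
def f(a, b):
    c = a + b
    return c

# Fast locals occupy fixed slots, in declaration order.
print(f.__code__.co_varnames)  # → ('a', 'b', 'c')
```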

## 72.19 Instruction Pointer Management

The interpreter tracks the current instruction position.

Control flow instructions modify it:

```text
JUMP_FORWARD
POP_JUMP_IF_FALSE
FOR_ITER
RETURN_VALUE
```

Loops work by rewinding the instruction pointer.

Example:

```python
while x:
    ...
```

Conceptually:

```text
evaluate condition
jump if false
execute body
jump backward
```

Dispatch must therefore support arbitrary control flow changes.
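The jumps implementing a loop appear directly in the disassembly; the exact opcode names vary by CPython version, but conditional and backward jumps are always present:

```python
import dis

def count(n):
    while n:
        n -= 1

# Jump instructions move the instruction pointer to implement the loop.
jumps = [i.opname for i in dis.get_instructions(count) if "JUMP" in i.opname]
print(jumps)
```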

## 72.20 Dispatch and the GIL

Traditional CPython executes bytecode under the Global Interpreter Lock.

Only one thread executes Python bytecode at a time inside one interpreter.

This simplifies dispatch logic because opcode handlers can often assume internal runtime structures remain stable.

Without the GIL:

```text
reference counts require synchronization
object layouts require synchronization
many runtime invariants become concurrent
```

Work on free-threaded Python significantly complicates the dispatch implementation.

## 72.21 Tracing and Profiling Hooks

Debuggers and profilers integrate into the dispatch loop.

Features include:

```text
line tracing
opcode tracing
profiling callbacks
coverage measurement
debug breakpoints
```

These hooks add conditional checks into execution paths.

Optimized fast paths attempt to minimize overhead when tracing is disabled.
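A minimal tracing hook shows the dispatch loop invoking a callback as execution proceeds:

```python
import sys

events = []

def tracer(frame, event, arg):
    events.append((event, frame.f_code.co_name))
    return tracer               # keep tracing inside this frame

def target():
    x = 1
    return x

sys.settrace(tracer)            # checked by the dispatch loop
target()
sys.settrace(None)              # disable so fast paths apply again
print(events[0])  # → ('call', 'target')
```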

## 72.22 Signal Handling

The interpreter periodically checks for signals:

```text
KeyboardInterrupt
termination signals
pending calls
async events
```

This usually occurs between opcode executions.

The dispatch loop therefore serves as a scheduling checkpoint for runtime events.
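The same periodic checkpoint also drives thread switching; the switch interval bounds how long one thread may run before the check considers releasing the GIL:

```python
import sys

# Seconds between eval-loop checkpoints for thread switching.
print(sys.getswitchinterval())  # default is 0.005
```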

## 72.23 Dispatch and Cache Locality

Interpreter performance depends heavily on memory locality.

Important structures:

```text
opcode handlers
bytecode stream
evaluation stack
frame data
object headers
type objects
inline caches
```

Poor locality increases cache misses.

Modern interpreter optimization increasingly resembles systems-level CPU-aware engineering.

## 72.24 Why CPython Uses a Stack Machine

Stack machines have advantages:

```text
simple compiler
compact bytecode
easy portability
simple operand encoding
```

Disadvantages:

```text
more instructions
more stack traffic
higher dispatch frequency
```

Register-based VMs often reduce instruction count but complicate compilation and operand encoding.

CPython historically prioritized simplicity and portability.

## 72.25 Dispatch Costs vs Native Execution

Native compiled code executes directly on hardware instructions.

CPython execution adds layers:

```text
bytecode fetch
bytecode decode
dispatch branch
runtime checks
dynamic typing
object allocation
reference counting
```

This explains much of Python’s performance profile.

The interpreter is highly dynamic and flexible, but every layer has cost.

## 72.26 The Adaptive Interpreter Architecture

Modern CPython increasingly behaves like a lightweight runtime optimizer.

Execution flow:

```text
start generic
observe runtime behavior
specialize instructions
cache lookup data
deoptimize if assumptions fail
```

This moves CPython closer to modern VM designs while preserving compatibility and simplicity.

The dispatch system is no longer merely a bytecode switch loop.

It is an adaptive runtime execution engine.

## 72.27 Relationship to JIT Compilation

A JIT compiler may eliminate much interpreter dispatch entirely.

Instead of:

```text
fetch opcode
dispatch opcode
execute opcode
```

the JIT generates native machine code.

CPython historically emphasized interpreter optimization rather than aggressive JIT compilation.

Recent work explores optional JIT systems layered atop adaptive specialization infrastructure.

## 72.28 Reading ceval.c

When reading `Python/ceval.c`, focus on:

| Area | Purpose |
|---|---|
| Frame evaluation | Main execution engine |
| Opcode handlers | Bytecode implementations |
| Stack macros | Value stack operations |
| Dispatch macros | Control transfer |
| Inline caches | Adaptive optimization |
| Error handling | Exception propagation |
| Fast paths | Performance-critical cases |

The file is performance-sensitive and macro-heavy.

Many constructs prioritize execution speed over readability.

## 72.29 A Mental Model

A useful mental model:

```text
CPython is a dynamic stack machine.
```

The interpreter:

```text
reads bytecode
moves PyObject pointers on a stack
dispatches opcode handlers
updates reference counts
maintains frames
handles exceptions
repeats continuously
```

Every Python program ultimately becomes this process.

## 72.30 Chapter Summary

Interpreter dispatch is the execution core of CPython.

The evaluation loop fetches bytecode instructions, decodes them, dispatches opcode handlers, manipulates Python objects, updates frame state, and continues until execution completes.

Modern CPython uses several important optimization techniques:

```text
computed goto dispatch
inline caches
adaptive specialization
superinstructions
fast locals
specialized opcode handlers
```

These mechanisms reduce dispatch overhead while preserving Python’s dynamic semantics.

Understanding interpreter dispatch is essential for understanding CPython performance, runtime behavior, bytecode execution, and the architecture of the virtual machine itself.
