72. Interpreter Dispatch

Computed goto dispatch table, the switch fallback, and how opcode prediction reduces branch mispredictions.

Interpreter dispatch is the mechanism that drives bytecode execution inside CPython.

The compiler transforms Python source code into bytecode instructions. The interpreter loop then repeatedly fetches, decodes, and executes those instructions. Dispatch is the process that selects the implementation for each opcode and transfers control to it.

This chapter focuses on the execution engine inside CPython:

Python source → compiler → bytecode → evaluation loop → opcode dispatch → object operations

The dispatch mechanism dominates interpreter performance. Even small changes to dispatch logic can affect the speed of nearly every Python program.

72.1 The Evaluation Loop

The core interpreter loop lives in:

Python/ceval.c

The main entry point since Python 3.6 has been:

_PyEval_EvalFrameDefault()

(Earlier versions used PyEval_EvalFrameEx().)

This function executes one Python frame.

A simplified conceptual model looks like:

for (;;) {
    opcode = fetch_next_opcode();
    switch (opcode) {
        case LOAD_FAST:
            ...
            break;

        case BINARY_OP:
            ...
            break;

        case RETURN_VALUE:
            ...
            return result;
    }
}

The real implementation is substantially more complex:

instruction decoding
specialized opcodes
inline caches
stack manipulation
exception propagation
reference counting
signal handling
tracing hooks
adaptive optimization
computed goto dispatch

Despite this complexity, the structure remains fundamentally iterative:

fetch
decode
dispatch
execute
repeat
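This fetch/decode/dispatch/execute cycle can be sketched as a toy interpreter in Python. The opcode names and instruction format below are invented for illustration; CPython's real loop is the C code in ceval.c:

```python
# Toy fetch/decode/dispatch/execute loop. Invented opcodes and a
# (opcode, oparg) tuple format -- not CPython's real instruction set.
def run(code, consts):
    stack = []
    pc = 0                                 # instruction pointer
    while True:
        opcode, oparg = code[pc]           # fetch + decode
        pc += 1
        if opcode == "LOAD_CONST":         # dispatch...
            stack.append(consts[oparg])    # ...and execute
        elif opcode == "BINARY_ADD":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif opcode == "RETURN_VALUE":
            return stack.pop()

result = run(
    [("LOAD_CONST", 0), ("LOAD_CONST", 1),
     ("BINARY_ADD", 0), ("RETURN_VALUE", 0)],
    consts=[1, 2],
)
```

The if/elif chain here plays the role of the C switch statement; later sections show the table-driven alternative.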

72.2 Frames as Execution Contexts

The interpreter executes bytecode inside a frame.

A frame contains the execution state for one active function call.

Conceptually:

code object
instruction pointer
evaluation stack
local variables
globals
builtins
exception state
closure references

The interpreter loop repeatedly updates this frame state.

Each function call creates a new execution frame:

def add(a, b):
    return a + b

add(1, 2)

At runtime:

create frame
initialize locals
execute bytecode
return result
destroy frame

The frame acts as the working memory of the virtual machine.
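On CPython, this frame state can be observed from Python itself via sys._getframe (a CPython implementation detail, not a portable API):

```python
import sys

def add(a, b):
    frame = sys._getframe()                  # the frame executing this call
    snapshot = {
        "function": frame.f_code.co_name,    # the code object's name
        "lasti": frame.f_lasti,              # current instruction offset
        "locals": dict(frame.f_locals),      # local variables
    }
    return a + b, snapshot

value, info = add(1, 2)
```

Each call to add creates a fresh frame, so the snapshot reflects that call's locals only.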

72.3 Bytecode Instruction Format

CPython bytecode consists of instruction bytes stored in a code object.

Modern CPython (3.6+) uses a wordcode format in which each instruction occupies a fixed-size two-byte unit: one opcode byte and one operand byte.

Conceptually:

opcode
operand
opcode
operand
opcode
operand

Instructions are stored in:

f.__code__.co_code

Example:

def f(x):
    return x + 1

Disassembly:

import dis
dis.dis(f)

Possible output:

LOAD_FAST      0 (x)
LOAD_CONST     1 (1)
BINARY_OP      0 (+)
RETURN_VALUE

The interpreter processes these instructions sequentially unless control flow changes.

72.4 Instruction Fetch

The dispatch loop begins by fetching the next opcode.

Conceptually:

opcode = *next_instr++;

The interpreter maintains an instruction pointer into the bytecode stream.

Historically this was byte-oriented. Modern CPython stores instructions in a more structured format for decoding efficiency and inline cache integration.

The fetch phase must be extremely cheap because it executes for every instruction in every Python program.

Minor inefficiencies multiply across billions of executed opcodes.

72.5 Decode Phase

After fetching an opcode, the interpreter decodes its meaning.

Example:

LOAD_FAST

means:

load local variable onto evaluation stack

Example:

BINARY_OP

means:

pop two values
perform operation
push result

Some instructions include operands:

LOAD_CONST 3

The operand identifies an entry in the constants table.

The decode stage interprets both opcode and operand fields.
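The opcode/operand pairing can be seen by walking co_code by hand with the stdlib dis module (this assumes the 3.6+ two-byte wordcode layout; the exact opcodes present vary by version):

```python
import dis

def f(x):
    return x + 1

raw = f.__code__.co_code                     # one 2-byte unit per instruction
pairs = [(dis.opname[raw[i]], raw[i + 1])    # decode: opcode byte, operand byte
         for i in range(0, len(raw), 2)]
names = [name for name, _ in pairs]
```

On recent versions the stream also contains CACHE entries reserved for inline caches, which decode like ordinary instructions.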

72.6 Stack-Based Execution Model

CPython uses a stack machine.

Most instructions consume values from the stack and push results back.

Example:

x + y

Bytecode:

LOAD_FAST x
LOAD_FAST y
BINARY_OP +

Execution:

push x
push y
pop y
pop x
compute x + y
push result

Stack machines simplify compiler generation because intermediate values naturally flow through the stack.

The tradeoff is higher instruction count compared to register machines.
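Because every instruction has a fixed effect on stack depth, the stdlib exposes it directly through dis.stack_effect:

```python
import dis

# Each opcode changes the evaluation stack depth by a known amount.
push_const = dis.stack_effect(dis.opmap["LOAD_CONST"], 0)  # pushes one value
pop_top = dis.stack_effect(dis.opmap["POP_TOP"])           # pops one value
```

The compiler uses exactly this information to compute the maximum stack depth a code object needs.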

72.7 Dispatch Mechanisms

Several dispatch strategies exist in virtual machines.

Switch Dispatch

The simplest form:

switch (opcode) {
    case LOAD_FAST:
        ...
        break;
}

Advantages:

portable
simple
easy to debug

Disadvantages:

branch-heavy
poor branch prediction
higher dispatch overhead

Computed Goto Dispatch

Modern CPython primarily uses computed gotos on supported compilers.

Conceptually:

goto *opcode_targets[opcode];

Each opcode jumps directly to its implementation label.

Advantages:

fewer branch mispredictions
better CPU pipeline behavior
lower dispatch overhead

This significantly improves interpreter throughput.

The technique depends on compiler extensions such as GCC labels-as-values.

CPython falls back to switch dispatch on unsupported compilers.
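Computed goto is a C compiler feature, but its table-driven shape can be approximated in Python: handlers live in a table indexed directly by opcode number, so dispatch is one indexed lookup instead of a chain of comparisons. Everything below (opcode numbers, handler names, frame layout) is invented for illustration:

```python
# Table-driven dispatch: the opcode number indexes straight into a
# handler table, the Python-level analogue of `goto *opcode_targets[opcode];`.
def load_const(frame, oparg):
    frame["stack"].append(frame["consts"][oparg])

def binary_add(frame, oparg):
    s = frame["stack"]
    b = s.pop()
    a = s.pop()
    s.append(a + b)

def return_value(frame, oparg):
    frame["result"] = frame["stack"].pop()
    frame["done"] = True

HANDLERS = [load_const, binary_add, return_value]  # indexed by opcode number

def run(code, consts):
    frame = {"stack": [], "consts": consts, "done": False, "result": None}
    pc = 0
    while not frame["done"]:
        opcode, oparg = code[pc]
        pc += 1
        HANDLERS[opcode](frame, oparg)     # direct jump through the table
    return frame["result"]

result = run([(0, 0), (0, 1), (1, 0), (2, 0)], consts=[20, 22])
```

In C, each table entry is a code address rather than a function, so the jump lands inside the same function body with no call overhead.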

72.8 Why Dispatch Overhead Matters

Interpreter dispatch overhead is large relative to useful work.

Example:

x = a + b

In a compiled language this statement might become just a few native machine instructions.

In CPython it involves:

opcode fetch
opcode decode
stack operations
reference count updates
dynamic type checks
slot lookup
possible method dispatch
overflow handling
result allocation

Even before actual arithmetic occurs, the interpreter performs substantial runtime work.

This is why interpreter optimization focuses heavily on reducing dispatch cost.

72.9 Opcode Prediction

Older CPython versions used opcode prediction techniques.

Certain opcode pairs occur frequently:

LOAD_FAST
LOAD_FAST
BINARY_OP

or:

LOAD_FAST
RETURN_VALUE

The interpreter could speculate about likely next instructions and jump directly to them.

This reduced dispatch overhead slightly.

Modern adaptive specialization mechanisms reduced the importance of manual prediction schemes.

72.10 Computed Goto and CPU Pipelines

Modern CPUs rely heavily on branch prediction and instruction pipelines.

A naive switch dispatch creates unpredictable branches:

switch(opcode)

The CPU may frequently mispredict the next branch target.

Mispredictions flush pipelines and waste cycles.

Computed goto dispatch improves locality:

opcode directly selects target address

This reduces indirect branching overhead and improves branch predictor performance.

Interpreter engineering increasingly depends on CPU microarchitecture awareness.

72.11 Inline Caches

Modern CPython inserts inline cache entries into bytecode streams.

Certain operations repeatedly observe similar runtime types:

obj.x

Often:

obj has same type
attribute layout stable
lookup path unchanged

Instead of performing full dynamic lookup every time, the interpreter caches previous lookup information.

Dispatch then becomes:

check cache validity
fast path if valid
fallback if invalid

This dramatically improves common operations.
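The check/fast-path/fallback shape can be sketched with a toy one-entry attribute cache. This is a structural illustration only; CPython's real caches live inside the bytecode stream and are keyed on type version tags:

```python
# Toy one-entry inline cache for attribute lookup (structure only;
# the point is the validity check, not actual speed).
class AttrCache:
    def __init__(self, name):
        self.name = name
        self.cached_type = None
        self.cached_getter = None

    def load(self, obj):
        tp = type(obj)
        if tp is self.cached_type:             # check cache validity
            return self.cached_getter(obj)     # fast path if valid
        value = getattr(obj, self.name)        # fallback: full dynamic lookup
        self.cached_type = tp                  # refill the cache
        self.cached_getter = lambda o, n=self.name: getattr(o, n)
        return value

class Point:
    def __init__(self, x):
        self.x = x

cache = AttrCache("x")
first = cache.load(Point(1))    # slow path, populates the cache
second = cache.load(Point(2))   # fast path, same type as last time
```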

72.12 Adaptive Specialization

CPython 3.11 introduced a specializing adaptive interpreter.

Generic opcodes transform into specialized versions after observing runtime behavior.

Example:

BINARY_OP

may specialize into:

BINARY_OP_ADD_INT

or equivalent internal optimized forms.

This allows dispatch to skip generic dynamic logic.

Instead of:

check types
select operation
dispatch arithmetic

the specialized opcode already assumes:

both operands are integers

Specialized dispatch reduces runtime overhead substantially.

72.13 Superinstructions

Some interpreters combine multiple common operations into fused instructions.

Example:

LOAD_FAST
LOAD_FAST

might become:

LOAD_FAST_LOAD_FAST

Advantages:

fewer dispatches
better instruction locality
reduced interpreter overhead

Modern CPython includes several fused instructions.

These are interpreter-level analogues of instruction fusion in CPUs.
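Fusion itself is a simple peephole rewrite over the instruction stream, sketched here with a toy fuser. The tuple representation is invented; CPython performs an analogous rewrite internally during compilation:

```python
# Toy peephole pass: fuse adjacent LOAD_FAST pairs into one superinstruction.
def fuse(instructions):
    out = []
    i = 0
    while i < len(instructions):
        if (i + 1 < len(instructions)
                and instructions[i][0] == "LOAD_FAST"
                and instructions[i + 1][0] == "LOAD_FAST"):
            out.append(("LOAD_FAST_LOAD_FAST",
                        (instructions[i][1], instructions[i + 1][1])))
            i += 2                  # one dispatch where there were two
        else:
            out.append(instructions[i])
            i += 1
    return out

fused = fuse([("LOAD_FAST", 0), ("LOAD_FAST", 1), ("BINARY_OP", 0)])
```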

72.14 Opcode Handlers

Each opcode has a handler implementation.

Example conceptual handler:

TARGET(LOAD_FAST) {
    PyObject *value = GETLOCAL(oparg);

    if (value == NULL) {
        /* unbound local: raise UnboundLocalError */
        goto error;
    }

    Py_INCREF(value);
    PUSH(value);

    DISPATCH();
}

Important operations:

load local variable
increment reference count
push onto stack
continue dispatch

Even simple handlers must carefully maintain interpreter invariants.

72.15 The Evaluation Stack

The frame contains a value stack.

Conceptually:

bottom
  x
  y
  z
top

Instructions manipulate this stack directly.

Example:

LOAD_CONST 1
LOAD_CONST 2
BINARY_OP +

Execution:

push 1
push 2
pop 2
pop 1
compute result
push 3

The stack pointer moves constantly during execution.

Efficient stack access is critical for interpreter performance.

72.16 Error Handling During Dispatch

Opcode handlers can fail.

Example:

1 / 0

The division opcode detects division by zero and raises an exception.

Interpreter flow changes:

normal execution
exception raised
unwind stack
find exception handler
resume or terminate

The dispatch loop therefore integrates tightly with exception machinery.

72.17 Reference Counting Inside Dispatch

Almost every opcode manipulates references.

Example:

x = y

requires:

increment new reference
decrement overwritten reference

Arithmetic operations may allocate new objects:

x + y

Handlers must correctly maintain ownership.

Reference counting overhead is deeply intertwined with interpreter dispatch cost.
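The effect of binding on reference counts can be observed with sys.getrefcount (CPython-specific; the reported count includes the temporary reference held by the call itself):

```python
import sys

x = object()
before = sys.getrefcount(x)    # module binding + the call's temporary reference
y = x                          # binding a new name increments the count
after = sys.getrefcount(x)
```

Every opcode that pushes, pops, or overwrites a reference must account for these counts with the same care.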

72.18 Fast Locals

Local variables are optimized heavily.

Instead of dictionary lookups, frames store locals in arrays:

localsplus[0]
localsplus[1]
localsplus[2]

LOAD_FAST becomes array indexing instead of hash lookup.

This explains why local variables are faster than globals.

Global access:

dictionary lookup
hash computation
namespace resolution

Local access:

array access

Dispatch efficiency depends heavily on such layout optimizations.
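The difference is visible directly in the bytecode (exact opcode names vary slightly across versions, hence the prefix match on LOAD_FAST):

```python
import dis

g = 10

def use_local(x):
    return x          # local: compiles to a LOAD_FAST-family opcode

def use_global():
    return g          # global: compiles to LOAD_GLOBAL

local_ops = {i.opname for i in dis.get_instructions(use_local)}
global_ops = {i.opname for i in dis.get_instructions(use_global)}
```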

72.19 Instruction Pointer Management

The interpreter tracks the current instruction position.

Control flow instructions modify it:

JUMP_FORWARD
POP_JUMP_IF_FALSE
FOR_ITER
RETURN_VALUE

Loops work by rewinding the instruction pointer.

Example:

while x:
    ...

Conceptually:

evaluate condition
jump if false
execute body
jump backward

Dispatch must therefore support arbitrary control flow changes.
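Jumps are just assignments to the instruction pointer, as in this toy VM where a backward jump implements the while loop above (opcodes invented for illustration):

```python
# Toy VM: control flow is implemented by overwriting pc instead of
# letting it fall through to the next instruction.
def run(code):
    stack, env, pc = [], {"x": 3, "total": 0}, 0
    while pc < len(code):
        op, arg = code[pc]
        pc += 1                      # default: advance to next instruction
        if op == "LOAD":
            stack.append(env[arg])
        elif op == "STORE":
            env[arg] = stack.pop()
        elif op == "CONST":
            stack.append(arg)
        elif op == "ADD":
            b = stack.pop()
            stack.append(stack.pop() + b)
        elif op == "JUMP_IF_FALSE":
            if not stack.pop():
                pc = arg             # forward jump: exit the loop
        elif op == "JUMP":
            pc = arg                 # backward jump: rewind to loop head
    return env

# Equivalent of: while x: total += x; x -= 1
env = run([
    ("LOAD", "x"),           # 0: loop head, evaluate condition
    ("JUMP_IF_FALSE", 11),   # 1: jump past the loop when x is falsy
    ("LOAD", "total"),       # 2
    ("LOAD", "x"),           # 3
    ("ADD", None),           # 4
    ("STORE", "total"),      # 5: total += x
    ("LOAD", "x"),           # 6
    ("CONST", -1),           # 7
    ("ADD", None),           # 8
    ("STORE", "x"),          # 9: x -= 1
    ("JUMP", 0),             # 10: back to the loop head
])
```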

72.20 Dispatch and the GIL

Traditional CPython executes bytecode under the Global Interpreter Lock.

Only one thread executes Python bytecode at a time inside one interpreter.

This simplifies dispatch logic because opcode handlers can often assume internal runtime structures remain stable.

Without the GIL:

reference counts require synchronization
object layouts require synchronization
many runtime invariants become concurrent

The free-threaded Python effort therefore significantly complicates the dispatch implementation.

72.21 Tracing and Profiling Hooks

Debuggers and profilers integrate into the dispatch loop.

Features include:

line tracing
opcode tracing
profiling callbacks
coverage measurement
debug breakpoints

These hooks add conditional checks into execution paths.

Optimized fast paths attempt to minimize overhead when tracing is disabled.
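The hook mechanism can be exercised with sys.settrace, which receives call and line events from the dispatch loop:

```python
import sys

events = []

def tracer(frame, event, arg):
    # Record (event, function name) for every hook invocation.
    events.append((event, frame.f_code.co_name))
    return tracer            # returning a local trace fn keeps line tracing on

def work():
    a = 1
    b = 2
    return a + b

sys.settrace(tracer)
result = work()
sys.settrace(None)           # always disable the hook afterwards

line_events = [e for e in events if e[0] == "line"]
```

While the hook is installed, the interpreter pays the cost of these callbacks on every traced line, which is why the fast paths check for tracing up front.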

72.22 Signal Handling

The interpreter periodically checks for signals:

KeyboardInterrupt
termination signals
pending calls
async events

This usually occurs between opcode executions.

The dispatch loop therefore serves as a scheduling checkpoint for runtime events.

72.23 Dispatch and Cache Locality

Interpreter performance depends heavily on memory locality.

Important structures:

opcode handlers
bytecode stream
evaluation stack
frame data
object headers
type objects
inline caches

Poor locality increases cache misses.

Modern interpreter optimization increasingly resembles systems-level CPU-aware engineering.

72.24 Why CPython Uses a Stack Machine

Stack machines have advantages:

simple compiler
compact bytecode
easy portability
simple operand encoding

Disadvantages:

more instructions
more stack traffic
higher dispatch frequency

Register-based VMs often reduce instruction count but complicate compilation and operand encoding.

CPython historically prioritized simplicity and portability.

72.25 Dispatch Costs vs Native Execution

Native compiled code executes directly on hardware instructions.

CPython execution adds layers:

bytecode fetch
bytecode decode
dispatch branch
runtime checks
dynamic typing
object allocation
reference counting

This explains much of Python’s performance profile.

The interpreter is highly dynamic and flexible, but every layer has cost.

72.26 The Adaptive Interpreter Architecture

Modern CPython increasingly behaves like a lightweight runtime optimizer.

Execution flow:

start generic
observe runtime behavior
specialize instructions
cache lookup data
deoptimize if assumptions fail

This moves CPython closer to modern VM designs while preserving compatibility and simplicity.

The dispatch system is no longer merely a bytecode switch loop.

It is an adaptive runtime execution engine.

72.27 Relationship to JIT Compilation

A JIT compiler can eliminate much of the interpreter's dispatch overhead entirely.

Instead of:

fetch opcode
dispatch opcode
execute opcode

the JIT generates native machine code.

CPython historically emphasized interpreter optimization rather than aggressive JIT compilation.

Recent work explores optional JIT systems layered atop adaptive specialization infrastructure.

72.28 Reading ceval.c

When reading Python/ceval.c, focus on:

Frame evaluation: main execution engine
Opcode handlers: bytecode implementations
Stack macros: value stack operations
Dispatch macros: control transfer
Inline caches: adaptive optimization
Error handling: exception propagation
Fast paths: performance-critical cases

The file is performance-sensitive and macro-heavy.

Many constructs prioritize execution speed over readability.

72.29 A Mental Model

A useful mental model:

CPython is a dynamic stack machine.

The interpreter:

reads bytecode
moves PyObject pointers on a stack
dispatches opcode handlers
updates reference counts
maintains frames
handles exceptions
repeats continuously

Every Python program ultimately becomes this process.

72.30 Chapter Summary

Interpreter dispatch is the execution core of CPython.

The evaluation loop fetches bytecode instructions, decodes them, dispatches opcode handlers, manipulates Python objects, updates frame state, and continues until execution completes.

Modern CPython uses several important optimization techniques:

computed goto dispatch
inline caches
adaptive specialization
superinstructions
fast locals
specialized opcode handlers

These mechanisms reduce dispatch overhead while preserving Python’s dynamic semantics.

Understanding interpreter dispatch is essential for understanding CPython performance, runtime behavior, bytecode execution, and the architecture of the virtual machine itself.