The computed goto dispatch table, the switch fallback, and how opcode prediction reduces branch mispredictions.
Interpreter dispatch is the mechanism that drives bytecode execution inside CPython.
The compiler transforms Python source code into bytecode instructions. The interpreter loop then repeatedly fetches, decodes, and executes those instructions. Dispatch is the process that selects the implementation for each opcode and transfers control to it.
This chapter focuses on the execution engine inside CPython:
```
Python source
      ↓
   compiler
      ↓
   bytecode
      ↓
evaluation loop
      ↓
opcode dispatch
      ↓
object operations
```

The dispatch mechanism dominates interpreter performance. Even small changes to dispatch logic can affect the speed of nearly every Python program.
72.1 The Evaluation Loop
The core interpreter loop lives in:

```
Python/ceval.c
```

For many releases the main entry point has been:

```c
_PyEval_EvalFrameDefault()
```

This function executes one Python frame.
A simplified conceptual model looks like:
```c
for (;;) {
    opcode = fetch_next_opcode();
    switch (opcode) {
        case LOAD_FAST:
            ...
            break;
        case BINARY_OP:
            ...
            break;
        case RETURN_VALUE:
            ...
            return result;
    }
}
```

The real implementation is substantially more complex:
- instruction decoding
- specialized opcodes
- inline caches
- stack manipulation
- exception propagation
- reference counting
- signal handling
- tracing hooks
- adaptive optimization
- computed goto dispatch

Despite this complexity, the structure remains fundamentally iterative:
- fetch
- decode
- dispatch
- execute
- repeat

72.2 Frames as Execution Contexts
The interpreter executes bytecode inside a frame.
A frame contains the execution state for one active function call.
Conceptually:
- code object
- instruction pointer
- evaluation stack
- local variables
- globals
- builtins
- exception state
- closure references

The interpreter loop repeatedly updates this frame state.
Each function call creates a new execution frame:
```python
def add(a, b):
    return a + b

add(1, 2)
```

At runtime:

- create frame
- initialize locals
- execute bytecode
- return result
- destroy frame

The frame acts as the working memory of the virtual machine.
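Much of this frame state is observable from Python itself. Here is a minimal sketch, using the CPython-specific sys._getframe(), that inspects the current frame's code object, instruction offset, locals, and builtins:

```python
import sys

def show_frame_state():
    frame = sys._getframe()              # the frame executing this call (CPython-specific)
    note = "hello"                       # becomes an entry in f_locals
    print(frame.f_code.co_name)          # code object name: 'show_frame_state'
    print(frame.f_lasti)                 # current bytecode instruction offset
    print(sorted(frame.f_locals))        # local variables: ['frame', 'note']
    print("print" in frame.f_builtins)   # builtins namespace: True

show_frame_state()
```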
72.3 Bytecode Instruction Format
CPython bytecode consists of instruction bytes stored in a code object.
Modern CPython uses a wordcode-style format where instructions are organized into fixed-size units.
Conceptually:
```
opcode  operand
opcode  operand
opcode  operand
```

Instructions are stored in:

```python
f.__code__.co_code
```

Example:
```python
def f(x):
    return x + 1
```

Disassembly:

```python
import dis
dis.dis(f)
```

Possible output:

```
LOAD_FAST        0 (x)
LOAD_CONST       1 (1)
BINARY_OP        0 (+)
RETURN_VALUE
```

The interpreter processes these instructions sequentially unless control flow changes.
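The same instruction stream can also be walked programmatically. A short sketch using dis.get_instructions() to print each instruction's offset, opcode name, and operand (exact output varies by CPython version):

```python
import dis

def f(x):
    return x + 1

# Each entry pairs an opcode with its integer operand (None if absent).
for instr in dis.get_instructions(f):
    print(instr.offset, instr.opname, instr.arg)
```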
72.4 Instruction Fetch
The dispatch loop begins by fetching the next opcode.
Conceptually:

```c
opcode = *next_instr++;
```

The interpreter maintains an instruction pointer into the bytecode stream.
Historically this was byte-oriented. Modern CPython stores instructions in a more structured format for decoding efficiency and inline cache integration.
The fetch phase must be extremely cheap because it executes for every instruction in every Python program.
Minor inefficiencies multiply across billions of executed opcodes.
72.5 Decode Phase
After fetching an opcode, the interpreter decodes its meaning.
Example:
```
LOAD_FAST
```

means:

- load a local variable onto the evaluation stack

Example:

```
BINARY_OP
```

means:

- pop two values
- perform the operation
- push the result

Some instructions include operands:

```
LOAD_CONST 3
```

The operand identifies an entry in the constants table.
The decode stage interprets both opcode and operand fields.
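The operand-to-table mapping is easy to verify from Python. A small sketch showing that LOAD_CONST's operand indexes the code object's constants tuple:

```python
def g():
    return 3

# The LOAD_CONST operand is an index into this tuple; its exact
# contents vary by CPython version, but the constant 3 appears here.
print(g.__code__.co_consts)
```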
72.6 Stack-Based Execution Model
CPython uses a stack machine.
Most instructions consume values from the stack and push results back.
Example:
```python
x + y
```

Bytecode:

```
LOAD_FAST x
LOAD_FAST y
BINARY_OP +
```

Execution:

- push x
- push y
- pop y
- pop x
- compute x + y
- push result

Stack machines simplify compiler generation because intermediate values naturally flow through the stack.
The tradeoff is higher instruction count compared to register machines.
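The stack discipline of individual opcodes can be queried with dis.stack_effect(). A sketch, assuming the Python 3.11+ opcode names LOAD_FAST and BINARY_OP:

```python
import dis

# Net stack effect per opcode: LOAD_FAST pushes one value;
# BINARY_OP pops two operands and pushes one result.
print(dis.stack_effect(dis.opmap["LOAD_FAST"], 0))   # expected: 1
print(dis.stack_effect(dis.opmap["BINARY_OP"], 0))   # expected: -1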
72.7 Dispatch Mechanisms
Several dispatch strategies exist in virtual machines.
Switch Dispatch
The simplest form:
```c
switch (opcode) {
    case LOAD_FAST:
        ...
        break;
}
```

Advantages:

- portable
- simple
- easy to debug

Disadvantages:

- branch-heavy
- poor branch prediction
- higher dispatch overhead

Computed Goto Dispatch
Modern CPython primarily uses computed gotos on supported compilers.
Conceptually:

```c
goto *opcode_targets[opcode];
```

Each opcode jumps directly to its implementation label.
Advantages:

- fewer branch mispredictions
- better CPU pipeline behavior
- lower dispatch overhead

This significantly improves interpreter throughput.
The technique depends on compiler extensions such as GCC labels-as-values.
CPython falls back to switch dispatch on unsupported compilers.
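To make the strategy concrete, here is a toy Python analogue of jump-table dispatch (purely illustrative; the mini instruction set and handler table are invented for this sketch, not CPython code). The opcode directly indexes a table of handlers, just as a computed goto indexes a table of label addresses:

```python
# Invented three-opcode instruction set for illustration only.
LOAD_CONST, ADD, HALT = 0, 1, 2

def run(program):
    stack = []
    pc = 0

    def load_const(arg):
        stack.append(arg)

    def add(_):
        stack.append(stack.pop() + stack.pop())

    def halt(_):
        raise StopIteration

    # The handler table plays the role of a computed-goto jump table:
    # the opcode value directly selects its handler.
    handlers = [load_const, add, halt]

    try:
        while True:
            opcode, arg = program[pc]   # fetch
            pc += 1                     # advance the instruction pointer
            handlers[opcode](arg)       # dispatch + execute
    except StopIteration:
        return stack.pop()

print(run([(LOAD_CONST, 1), (LOAD_CONST, 2), (ADD, None), (HALT, None)]))  # 3
```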
72.8 Why Dispatch Overhead Matters
Interpreter dispatch overhead is large relative to useful work.
Example:
```python
x = a + b
```

In a compiled language this might become a few native machine instructions.
In CPython it involves:
- opcode fetch
- opcode decode
- stack operations
- reference count updates
- dynamic type checks
- slot lookup
- possible method dispatch
- overflow handling
- result allocation

Even before actual arithmetic occurs, the interpreter performs substantial runtime work.
This is why interpreter optimization focuses heavily on reducing dispatch cost.
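The overhead is easy to see in the instruction stream. Disassembling the statement shows that one addition compiles to several opcodes, each paying fetch, decode, and dispatch costs (exact output varies by version):

```python
import dis

# dis compiles the string and prints every opcode the statement needs.
dis.dis("x = a + b")
```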
72.9 Opcode Prediction
Older CPython versions used opcode prediction techniques.
Certain opcode pairs occur frequently:
```
LOAD_FAST
LOAD_FAST
BINARY_OP
```

or:

```
LOAD_FAST
RETURN_VALUE
```

The interpreter could speculate about likely next instructions and jump directly to them.
This reduced dispatch overhead slightly.
Modern adaptive specialization mechanisms have reduced the importance of manual prediction schemes.
72.10 Computed Goto and CPU Pipelines
Modern CPUs rely heavily on branch prediction and instruction pipelines.
A naive switch dispatch creates unpredictable branches:
```c
switch (opcode)
```

The CPU may frequently mispredict the next branch target.
Mispredictions flush pipelines and waste cycles.
Computed goto dispatch improves prediction: the opcode directly selects the target address, and each handler ends with its own indirect jump. Because the branch predictor can track those jumps separately, it learns per-opcode transition patterns. This reduces indirect branching overhead and improves branch predictor performance.
Interpreter engineering increasingly depends on CPU microarchitecture awareness.
72.11 Inline Caches
Modern CPython inserts inline cache entries into bytecode streams.
Certain operations repeatedly observe similar runtime types:
```python
obj.x
```

Often:

- obj has the same type
- the attribute layout is stable
- the lookup path is unchanged

Instead of performing a full dynamic lookup every time, the interpreter caches previous lookup information.
Dispatch then becomes:
- check cache validity
- fast path if valid
- fallback if invalid

This dramatically improves common operations.
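The cache slots are visible in disassembly. A sketch using dis.dis() with show_caches=True (Python 3.11+), which exposes the CACHE entries that follow instructions such as LOAD_ATTR:

```python
import dis

def get_x(obj):
    return obj.x

# The CACHE slots interleaved with the instructions hold the
# inline cache data described above (presentation varies by version).
dis.dis(get_x, show_caches=True)
```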
72.12 Adaptive Specialization
CPython 3.11 introduced a specializing adaptive interpreter.
Generic opcodes transform into specialized versions after observing runtime behavior.
Example:
```
BINARY_OP
```

may specialize into:

```
BINARY_OP_ADD_INT
```

or equivalent internal optimized forms.
This allows dispatch to skip generic dynamic logic.
Instead of:
- check types
- select the operation
- dispatch the arithmetic

the specialized opcode already assumes that both operands are integers.

Specialized dispatch reduces runtime overhead substantially.
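Specialization can be observed directly. A sketch using dis.dis() with adaptive=True (Python 3.11+); after warm-up, the disassembly may show specialized forms such as BINARY_OP_ADD_INT, though exact names and thresholds vary by version:

```python
import dis

def add(a, b):
    return a + b

for _ in range(100):        # warm up so the adaptive interpreter can specialize
    add(1, 2)

# adaptive=True shows the currently executing (possibly specialized) opcodes.
dis.dis(add, adaptive=True)
```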
72.13 Superinstructions
Some interpreters combine multiple common operations into fused instructions.
Example:
```
LOAD_FAST
LOAD_FAST
```

might become:

```
LOAD_FAST_LOAD_FAST
```

Advantages:

- fewer dispatches
- better instruction locality
- reduced interpreter overhead

Modern CPython includes several fused instructions.
These are interpreter-level analogues of instruction fusion in CPUs.
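Whether fusion occurs is version-dependent but easy to check. On some recent CPython releases the two consecutive loads below disassemble as a single fused instruction:

```python
import dis

def pair(a, b):
    return a + b

# Depending on the version, the two argument loads may appear as one
# fused instruction (e.g. LOAD_FAST_LOAD_FAST) or as two LOAD_FASTs.
dis.dis(pair)
```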
72.14 Opcode Handlers
Each opcode has a handler implementation.
Example conceptual handler:
```c
TARGET(LOAD_FAST) {
    PyObject *value = GETLOCAL(oparg);
    if (value == NULL) {
        /* unbound local: report UnboundLocalError and unwind */
        goto error;
    }
    Py_INCREF(value);
    PUSH(value);
    DISPATCH();
}
```

Important operations:

- load the local variable
- increment its reference count
- push it onto the stack
- continue dispatch

Even simple handlers must carefully maintain interpreter invariants.
72.15 The Evaluation Stack
The frame contains a value stack.
Conceptually:
```
top    → z
         y
bottom → x
```

Instructions manipulate this stack directly.
Example:
```
LOAD_CONST 1
LOAD_CONST 2
BINARY_OP +
```

Execution:

- push 1
- push 2
- pop 2
- pop 1
- compute the result
- push 3

The stack pointer moves constantly during execution.
Efficient stack access is critical for interpreter performance.
72.16 Error Handling During Dispatch
Opcode handlers can fail.
Example:
```python
1 / 0
```

The division opcode detects division by zero and raises an exception.
Interpreter flow changes:
```
normal execution
       ↓
exception raised
       ↓
  unwind stack
       ↓
find exception handler
       ↓
resume or terminate
```

The dispatch loop therefore integrates tightly with exception machinery.
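The handler-lookup machinery is visible in modern disassembly. In Python 3.11+ the output of dis.dis() includes an ExceptionTable section mapping instruction ranges to handler targets, as this sketch shows:

```python
import dis

def safe_div(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        return None

# On 3.11+ the listing ends with an "ExceptionTable" section that
# records which instruction ranges are covered by which handler.
dis.dis(safe_div)
```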
72.17 Reference Counting Inside Dispatch
Almost every opcode manipulates references.
Example:
```python
x = y
```

requires:

- incrementing the new reference
- decrementing the overwritten reference

Arithmetic operations may allocate new objects:

```python
x + y
```

Handlers must correctly maintain ownership.
Reference counting overhead is deeply intertwined with interpreter dispatch cost.
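The effect of bindings on reference counts can be observed with sys.getrefcount(), keeping in mind that the reported values are CPython-specific:

```python
import sys

x = object()
print(sys.getrefcount(x))   # includes the temporary reference made by the call itself
y = x                       # the new binding adds one reference
print(sys.getrefcount(x))   # one higher than before
```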
72.18 Fast Locals
Local variables are optimized heavily.
Instead of dictionary lookups, frames store locals in arrays:
```
localsplus[0]
localsplus[1]
localsplus[2]
```

LOAD_FAST becomes array indexing instead of a hash lookup.
This explains why local variables are faster than globals.
Global access:
- dictionary lookup
- hash computation
- namespace resolution

Local access:

- array indexing

Dispatch efficiency depends heavily on such layout optimizations.
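The difference is measurable. A rough timing sketch (absolute numbers depend on hardware and version, and recent releases cache global lookups, narrowing the gap):

```python
import timeit

g = 0

def read_global():
    for _ in range(1000):
        g               # LOAD_GLOBAL: namespace lookup (cached in recent versions)

def read_local():
    x = 0
    for _ in range(1000):
        x               # LOAD_FAST: direct array indexing

print("global:", timeit.timeit(read_global, number=1_000))
print("local: ", timeit.timeit(read_local, number=1_000))
```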
72.19 Instruction Pointer Management
The interpreter tracks the current instruction position.
Control flow instructions modify it:
```
JUMP_FORWARD
POP_JUMP_IF_FALSE
FOR_ITER
RETURN_VALUE
```

Loops work by rewinding the instruction pointer.
Example:
```python
while x:
    ...
```

Conceptually:

- evaluate condition
- jump if false
- execute body
- jump backward

Dispatch must therefore support arbitrary control flow changes.
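The backward jump is visible in disassembly. On Python 3.11+ the loop below ends with a JUMP_BACKWARD that rewinds the instruction pointer to the loop test (the opcode name varies across versions):

```python
import dis

def countdown(x):
    while x:
        x -= 1

# The loop body ends with a backward jump back to the condition check.
dis.dis(countdown)
```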
72.20 Dispatch and the GIL
Traditional CPython executes bytecode under the Global Interpreter Lock.
Only one thread executes Python bytecode at a time inside one interpreter.
This simplifies dispatch logic because opcode handlers can often assume internal runtime structures remain stable.
Without the GIL:

- reference counts require synchronization
- object layouts require synchronization
- many runtime invariants must hold under concurrent access

The free-threaded Python effort therefore significantly complicates the dispatch implementation.
72.21 Tracing and Profiling Hooks
Debuggers and profilers integrate into the dispatch loop.
Features include:
- line tracing
- opcode tracing
- profiling callbacks
- coverage measurement
- debug breakpoints

These hooks add conditional checks into execution paths.
Optimized fast paths attempt to minimize overhead when tracing is disabled.
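These hooks are exposed through sys.settrace(). A sketch that requests per-opcode trace events (available since Python 3.7 via the frame's f_trace_opcodes flag) and prints the instruction offset before each opcode executes:

```python
import sys

def tracer(frame, event, arg):
    frame.f_trace_opcodes = True          # request per-opcode events for this frame
    if event == "opcode":
        print("about to execute offset", frame.f_lasti)
    return tracer                         # keep tracing inside this frame

def f(x):
    return x + 1

sys.settrace(tracer)
f(1)
sys.settrace(None)
```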
72.22 Signal Handling
The interpreter periodically checks for signals:
- KeyboardInterrupt
- termination signals
- pending calls
- async events

This usually occurs between opcode executions.
The dispatch loop therefore serves as a scheduling checkpoint for runtime events.
72.23 Dispatch and Cache Locality
Interpreter performance depends heavily on memory locality.
Important structures:
- opcode handlers
- bytecode stream
- evaluation stack
- frame data
- object headers
- type objects
- inline caches

Poor locality increases cache misses.
Modern interpreter optimization increasingly resembles systems-level CPU-aware engineering.
72.24 Why CPython Uses a Stack Machine
Stack machines have advantages:
- simple compiler
- compact bytecode
- easy portability
- simple operand encoding

Disadvantages:

- more instructions
- more stack traffic
- higher dispatch frequency

Register-based VMs often reduce instruction count but complicate compilation and operand encoding.
CPython historically prioritized simplicity and portability.
72.25 Dispatch Costs vs Native Execution
Native compiled code executes directly on hardware instructions.
CPython execution adds layers:
- bytecode fetch
- bytecode decode
- dispatch branch
- runtime checks
- dynamic typing
- object allocation
- reference counting

This explains much of Python’s performance profile.
The interpreter is highly dynamic and flexible, but every layer has cost.
72.26 The Adaptive Interpreter Architecture
Modern CPython increasingly behaves like a lightweight runtime optimizer.
Execution flow:
- start generic
- observe runtime behavior
- specialize instructions
- cache lookup data
- deoptimize if assumptions fail

This moves CPython closer to modern VM designs while preserving compatibility and simplicity.
The dispatch system is no longer merely a bytecode switch loop.
It is an adaptive runtime execution engine.
72.27 Relationship to JIT Compilation
A JIT compiler may eliminate much interpreter dispatch entirely.
Instead of:
- fetch opcode
- dispatch opcode
- execute opcode

the JIT generates native machine code.
CPython historically emphasized interpreter optimization rather than aggressive JIT compilation.
Recent work explores optional JIT systems layered atop adaptive specialization infrastructure.
72.28 Reading ceval.c
When reading Python/ceval.c, focus on:
| Area | Purpose |
|---|---|
| Frame evaluation | Main execution engine |
| Opcode handlers | Bytecode implementations |
| Stack macros | Value stack operations |
| Dispatch macros | Control transfer |
| Inline caches | Adaptive optimization |
| Error handling | Exception propagation |
| Fast paths | Performance-critical cases |
The file is performance-sensitive and macro-heavy.
Many constructs prioritize execution speed over readability.
72.29 A Mental Model
A useful mental model:
CPython is a dynamic stack machine.

The interpreter:

- reads bytecode
- moves PyObject pointers on a stack
- dispatches opcode handlers
- updates reference counts
- maintains frames
- handles exceptions
- repeats continuously

Every Python program ultimately becomes this process.
72.30 Chapter Summary
Interpreter dispatch is the execution core of CPython.
The evaluation loop fetches bytecode instructions, decodes them, dispatches opcode handlers, manipulates Python objects, updates frame state, and continues until execution completes.
Modern CPython uses several important optimization techniques:
- computed goto dispatch
- inline caches
- adaptive specialization
- superinstructions
- fast locals
- specialized opcode handlers

These mechanisms reduce dispatch overhead while preserving Python’s dynamic semantics.
Understanding interpreter dispatch is essential for understanding CPython performance, runtime behavior, bytecode execution, and the architecture of the virtual machine itself.