
73. Inline Caches

How inline cache entries are appended after instructions in the bytecode array as CACHE slots, and how their layout varies per opcode family.

Inline caches are small cache records stored near bytecode instructions. They let CPython remember facts discovered during previous executions of an instruction, then reuse those facts on later executions.

The goal is simple: avoid repeating expensive dynamic lookups when the runtime situation has not changed.

For example, this expression looks small:

obj.name

But generic attribute lookup may involve:

look at obj type
check descriptors
check instance dictionary
check class dictionary
check base classes
handle __getattribute__
handle __getattr__
raise AttributeError if missing

If the same bytecode sees the same object type many times, CPython can cache the lookup path.

LOAD_ATTR
    cache: expected type
    cache: dictionary version
    cache: offset or descriptor data

The next execution can check the cache quickly. If the check succeeds, the interpreter takes a fast path. If it fails, CPython falls back to the generic slow path.

73.1 Why Inline Caches Exist

Python operations are dynamic.

This code:

x.value

does not statically mean “load field at offset 8”.

It means:

perform Python attribute access semantics

That semantic operation may involve many runtime decisions.

However, real programs often behave predictably:

for user in users:
    total += user.score

Inside the loop, user is usually the same type each time. The attribute name is fixed. The class layout is usually stable. The lookup result is often the same kind of operation.

Inline caches exploit that regularity.

They preserve Python semantics while optimizing the common case.

73.2 Inline Cache Position

An inline cache sits beside the instruction it supports.

Conceptually:

LOAD_ATTR score
CACHE
CACHE
CACHE

The cache entries are part of the bytecode stream layout, but they are not normal source-level operations. They are reserved storage used by the interpreter.

The instruction owns its cache records.

This differs from a global hash table cache:

global cache:
    key = operation + type + name
    value = resolved lookup

Inline caches are local:

bytecode offset 42:
    LOAD_ATTR score
    cache for this exact LOAD_ATTR

Locality matters. The interpreter can reach the cache directly from the instruction pointer.

73.3 Monomorphic Caches

The simplest useful inline cache is monomorphic.

It remembers one observed shape.

Example:

def f(obj):
    return obj.x

If f is called repeatedly with instances of the same class, the cache can store:

expected type: Point
attribute name: x
lookup data: instance dict offset or slot offset
version tag: current class dictionary version

Then execution becomes:

if type(obj) is Point and cached versions still match:
    use cached fast path
else:
    use generic lookup

This is called monomorphic because the call site or attribute site has one dominant type.

Many Python programs have many monomorphic sites.
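The guard-then-fast-path shape can be sketched in pure Python. This is a toy model, not CPython's C implementation: the SiteCache class and load_attr function are invented for illustration, and the guard deliberately omits the version-tag checks CPython also performs.

```python
class SiteCache:
    """Toy per-site cache: remembers one expected type (monomorphic)."""
    def __init__(self):
        self.expected_type = None  # the single observed type, or None

def load_attr(site, obj, name):
    # Fast path: a single identity check, then a direct instance-dict read.
    if type(obj) is site.expected_type:
        return obj.__dict__[name]
    # Slow path: generic lookup, then specialize only if the shape is
    # cacheable (attribute lives in the instance dict, nothing on the
    # class shadows it). CPython also records version tags; this toy does not.
    value = getattr(obj, name)
    if name in obj.__dict__ and not hasattr(type(obj), name):
        site.expected_type = type(obj)
    return value

class Point:
    def __init__(self, x):
        self.x = x

site = SiteCache()
assert load_attr(site, Point(1), "x") == 1  # miss: fills the cache
assert site.expected_type is Point
assert load_attr(site, Point(2), "x") == 2  # hit: guard passes, fast path
```

The fast path never re-derives the lookup; it only re-checks the assumption under which the lookup was cached.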

73.4 Polymorphism and Cache Misses

A bytecode location may observe more than one type.

def get_name(obj):
    return obj.name

get_name(user)
get_name(team)
get_name(project)

The same LOAD_ATTR name instruction sees User, Team, and Project.

A simple monomorphic cache can only remember one of them. When a different type appears, the cache misses.

A miss does not produce incorrect behavior. It only means CPython uses the generic path and may update or deoptimize the cache.

Possible outcomes:

cache hit
    fast path

cache miss with useful new pattern
    adapt or respecialize

cache miss with unstable pattern
    stay generic or back off

Inline caches are speculative. They optimize what actually happens, not what the source code might do.

73.5 Attribute Lookup Caches

Attribute access is one of the most important cache targets.

The bytecode instruction:

LOAD_ATTR

supports expressions such as:

obj.x

Generic lookup must respect the descriptor protocol and class hierarchy.

A cache may remember:

object type
type version
dictionary version
attribute offset
descriptor pointer
whether result comes from instance or class

A slot-based class can be especially efficient:

class Point:
    __slots__ = ("x", "y")

For p.x, the cache can often reduce lookup to a checked offset load.

For normal instance dictionaries, the cache may remember dictionary layout information.

73.6 Method Lookup Caches

Method calls are another major target.

Consider:

obj.method(arg)

This involves two logical steps:

load method
call method

Naively, loading a method creates a bound method object:

function + self

Creating bound method objects repeatedly is expensive.

CPython has fast paths for method calls that avoid temporary bound method allocation in common cases.

Conceptually, using the opcode names from CPython 3.11:

LOAD_METHOD method
PRECALL
CALL

(CPython 3.12 folds LOAD_METHOD into LOAD_ATTR and removes PRECALL, but the caching idea is unchanged.)

The method lookup cache can remember the resolved method and the assumptions that make it valid.

Fast method calls matter because object-oriented Python code performs method dispatch constantly.

73.7 Global and Builtin Caches

Global name access is dynamic.

len(xs)

The name len is resolved through the usual namespace chain:

locals
globals
builtins

Inside a function, the compiler already knows len is not a local variable, so at runtime LOAD_GLOBAL checks the function's globals dictionary and then the builtins dictionary.

A cache can remember:

globals dictionary version
builtins dictionary version
resolved object

If neither dictionary has changed, the cached object remains valid.

This makes repeated global and builtin access faster.

Example:

for item in xs:
    total += len(item)

The len lookup can usually be cached after the first few iterations.
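The double-version check can be modeled in a few lines of Python. Everything here is a stand-in for illustration: the Namespace class, the load_global function, and the explicit version counters are invented; real CPython maintains per-dictionary version tags in C.

```python
# Toy model of a LOAD_GLOBAL cache: the cached object stays valid only
# while both namespace versions are unchanged.

class Namespace:
    def __init__(self, **items):
        self.items = dict(items)
        self.version = 0
    def set(self, key, value):
        self.items[key] = value
        self.version += 1          # any mutation bumps the version

builtins_ns = Namespace(len=len)
globals_ns = Namespace()

# One cache per bytecode site; the looked-up name is fixed at that site.
cache = {"value": None, "g_ver": -1, "b_ver": -1}

def load_global(name):
    # Fast path: two integer comparisons validate the cached object.
    if cache["g_ver"] == globals_ns.version and cache["b_ver"] == builtins_ns.version:
        return cache["value"]
    # Slow path: globals first, then builtins, then refill the cache.
    value = globals_ns.items.get(name, builtins_ns.items.get(name))
    cache.update(value=value, g_ver=globals_ns.version, b_ver=builtins_ns.version)
    return value

assert load_global("len") is len        # miss, then cached
assert load_global("len") is len        # hit: versions unchanged
globals_ns.set("len", lambda x: 0)      # shadow the builtin
assert load_global("len")([1, 2]) == 0  # version changed: re-resolved
```

Shadowing the builtin bumps the globals version, so the stale cached object can never be returned.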

73.8 Binary Operation Caches

Binary operations are dynamic.

a + b

The operation depends on runtime types:

int + int
str + str
list + list
custom __add__
custom __radd__
unsupported pair

The generic path must handle all of this.

An inline cache can specialize the site for common pairs:

int + int
float + float
str + str

For integer addition, the fast path can avoid much of the generic numeric dispatch. It still must handle overflow and object allocation rules, but it can skip broad type-dispatch logic.
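The exact-type guard matters more than it looks. The sketch below is a toy stand-in for a specialized BINARY_OP site (the function names are invented); note that bool, although a subclass of int, fails the exact-type check and takes the generic path, exactly as a real specialized instruction would require.

```python
# Toy BINARY_OP site specialized for int + int.

def generic_add(a, b):
    return a + b  # full Python semantics: __add__, __radd__, errors

def specialized_add(a, b):
    if type(a) is int and type(b) is int:   # exact-type guard
        return int.__add__(a, b)            # skip generic dispatch
    return generic_add(a, b)                # fall back, stay correct

assert specialized_add(2, 3) == 5           # fast path
assert specialized_add("a", "b") == "ab"    # guard fails: generic path
assert specialized_add(True, True) == 2     # bool is not exactly int: generic, still correct
```

Using isinstance instead of an exact-type check would be wrong here: a subclass could override __add__, and the fast path would silently skip it.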

73.9 Subscript Caches

Subscript access is also dynamic:

obj[key]

Common cases include:

list[index]
tuple[index]
dict[key]
str[index]
custom __getitem__

A cache can specialize:

list with integer index
tuple with integer index
dict with exact key type

Example:

for i in range(len(xs)):
    item = xs[i]

When xs is a list and i is an integer, the interpreter can use a specialized fast path.

73.10 Cache Validation

Every cache needs validation.

A cached result is valid only while its assumptions remain true.

For attribute lookup, assumptions may include:

object has expected type
type dictionary has not changed
instance dictionary layout has not changed
descriptor has not changed

For global lookup:

globals dictionary has not changed
builtins dictionary has not changed

For binary operations:

left operand has expected exact type
right operand has expected exact type

Validation must be cheaper than the full operation. Otherwise the cache does not help.

A typical fast path is:

check type pointer
check version tag
load cached value

73.11 Version Tags

Version tags are small change counters used to detect mutation.

For example, a dictionary can carry a version value that changes when the dictionary changes.

A cache can store:

expected dictionary version = 12345

Later:

if dict.version == 12345:
    cached lookup remains valid
else:
    cache miss

This avoids scanning the dictionary to prove that nothing changed.

Version tags turn invalidation into a cheap comparison.
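A minimal version tag can be illustrated with a dict subclass. This VersionedDict class is invented for the sketch; CPython tracks dictionary versions internally in C rather than through Python-level subclassing.

```python
# Mutation bumps a counter, so "has anything changed?" becomes
# a single integer comparison instead of a scan.

class VersionedDict(dict):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.version = 0
    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.version += 1
    def __delitem__(self, key):
        super().__delitem__(key)
        self.version += 1

d = VersionedDict(x=1)
seen = d.version            # what a cache would remember
assert d.version == seen    # nothing changed: cached lookup still valid
d["x"] = 2                  # mutation invalidates
assert d.version != seen    # cheap comparison detects it
```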

73.12 Cache Invalidation

Some virtual machines use active invalidation: when a class changes, all dependent caches are found and cleared.

CPython primarily favors cheap validation at the use site.

That means the cache often remains physically present, but it stops matching once the relevant version tag changes.

Example:

class C:
    x = 1

def f(obj):
    return obj.x

C.x = 2

After C.x = 2, the class dictionary version changes. The cached lookup for obj.x fails validation and falls back to generic lookup.

The bytecode site can then adapt again.

73.13 Cache Warmup

Inline caches need warmup.

At first execution, the interpreter does not know what types a site will see.

Execution starts generic:

LOAD_ATTR

After enough executions, CPython gathers enough evidence to specialize:

LOAD_ATTR_INSTANCE_VALUE

or another specialized form.

The exact opcode names and thresholds vary by Python version.

The pattern is stable:

start generic
observe behavior
specialize
hit fast path
deoptimize if assumptions fail
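The warmup cycle can be observed directly with the dis module on Python 3.11 and later, where dis.dis accepts an adaptive parameter. The specialized opcode names in the output vary by CPython version, so treat what you see as illustrative.

```python
import dis
import sys

class Point:
    def __init__(self, x):
        self.x = x

def f(p):
    return p.x

for _ in range(1000):        # warmup: give the attribute site time to specialize
    f(Point(1))

if sys.version_info >= (3, 11):
    # adaptive=True shows the possibly-specialized opcodes currently in
    # memory; LOAD_ATTR may appear as e.g. LOAD_ATTR_INSTANCE_VALUE.
    dis.dis(f, adaptive=True)
else:
    dis.dis(f)               # older versions have no adaptive view
```

Disassembling before and after the warmup loop is an easy way to watch an instruction move from its generic to its specialized form.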

73.14 Adaptive Instructions

Modern CPython uses adaptive instructions as part of specialization.

An adaptive instruction counts executions and misses.

Conceptually:

LOAD_ATTR_ADAPTIVE
    counter
    cache entries

(In CPython 3.11, adaptive forms were separate opcodes such as LOAD_ATTR_ADAPTIVE; since 3.12, the base instruction itself carries the counter in its inline cache. The mechanism is the same either way.)

When the counter reaches a threshold, CPython attempts specialization.

If specialization succeeds, the instruction changes into a more specific opcode.

If specialization fails repeatedly, CPython can delay future specialization attempts.

This prevents unstable code from wasting time on constant respecialization.

73.15 Deoptimization

Deoptimization means returning from a specialized form to a more generic form.

Example:

LOAD_ATTR_INSTANCE_VALUE
LOAD_ATTR_ADAPTIVE
LOAD_ATTR

A specialized instruction may deoptimize when assumptions fail too often.

Reasons include:

many unrelated object types
mutating class dictionaries
custom attribute hooks
changing globals
unusual descriptors

Deoptimization preserves correctness.

Optimization is optional. Semantics always come from the generic operation.

73.16 Cache Entries Are Hidden From Normal Disassembly

Inline caches occupy bytecode space, but normal disassembly hides them by default.

You can ask dis to show caches in recent Python versions:

import dis

def f(obj):
    return obj.x

dis.dis(f, show_caches=True)

Conceptually, you may see:

LOAD_FAST                0 (obj)
LOAD_ATTR                0 (x)
CACHE
CACHE
CACHE
CACHE
RETURN_VALUE

The exact number of cache entries depends on the instruction.

This makes dis useful for studying interpreter specialization.
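The hidden entries can also be counted programmatically. On Python 3.11 and 3.12, get_instructions with show_caches=True yields separate CACHE rows; Python 3.13 instead attaches cache details to the owning instruction, so the count below may legitimately be zero there.

```python
import dis
import sys

def f(obj):
    return obj.x

cache_count = 0
if sys.version_info >= (3, 11):
    for ins in dis.get_instructions(f, show_caches=True):
        if ins.opname == "CACHE":
            cache_count += 1

# The count differs between CPython versions, so we inspect rather than
# hard-code an expected number.
print(cache_count, "CACHE entries in f's bytecode")
```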

73.17 Specialized Bytecode Is Runtime State

Specialization mutates executable bytecode state in memory.

The source code and logical code object remain the same from the language perspective, but the interpreter’s internal instruction stream may change as the program runs.

This has several consequences:

bytecode execution can become faster after warmup
the same source can specialize differently in different runs
debug and tracing modes may inhibit specialization
different Python versions may show different opcodes

A performance investigation should account for warmup.

A single cold execution may measure generic bytecode more than specialized execution.

73.18 Inline Caches and Correctness

Inline caches must preserve full Python semantics.

They cannot assume that Python behaves like a static language.

For example, this must still work:

class C:
    x = 1

obj = C()

def f():
    return obj.x

print(f())
C.x = 2
print(f())

The second call must see the updated value.

Therefore the cache must notice the class dictionary change.

Similarly, monkey patching builtins must remain visible:

import builtins

old_len = builtins.len
builtins.len = lambda x: 42

try:
    print(len([1, 2, 3]))
finally:
    builtins.len = old_len

A cache for len must not ignore the mutated builtins dictionary.

73.19 Inline Caches and the C API

The C API complicates caching.

Native extensions can mutate objects, dictionaries, types, and descriptors through C-level operations.

CPython’s caches must remain valid under those mutations.

Version tags and runtime checks provide the bridge:

C extension mutates dictionary
dictionary version changes
cached lookup fails validation
generic path runs

This is one reason CPython cannot freely use aggressive assumptions without careful invalidation rules.

73.20 Inline Caches and Type Stability

Inline caches reward type-stable code.

Type-stable code repeatedly presents the same types at the same bytecode sites.

Example:

def area(rectangles):
    total = 0
    for r in rectangles:
        total += r.width * r.height
    return total

If every r is a Rectangle, the attribute sites specialize well.

Less stable code:

def read(obj):
    return obj.value

called with many unrelated object types may remain generic or miss often.

This does not make dynamic code wrong. It just changes the optimization profile.

73.21 Inline Caches and Object Layout

Object layout affects cache quality.

Instance dictionaries are flexible but require dictionary machinery.

Slots provide fixed storage:

class Point:
    __slots__ = ("x", "y")

A slot access can often be cached as:

expected type
slot offset

This is closer to field access in static languages.

However, __slots__ changes object semantics and should be used for concrete reasons such as memory use, layout stability, or very frequent attribute access.
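Those semantic changes are easy to see directly: a slotted instance has no per-instance __dict__, and attributes outside the declared slots are rejected instead of landing in a dictionary.

```python
class Point:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)
assert not hasattr(p, "__dict__")   # no instance dictionary at all

try:
    p.z = 3                         # not a declared slot
except AttributeError:
    pass
else:
    raise AssertionError("slotted class accepted an unknown attribute")

assert (p.x, p.y) == (1, 2)         # slot reads behave like normal attributes
```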

73.22 Inline Caches and Descriptors

Descriptors make attribute access powerful.

Examples include:

functions
property objects
staticmethod
classmethod
custom descriptors

A descriptor can define:

__get__
__set__
__delete__

This affects whether an attribute is loaded from the instance, class, or descriptor result.

Inline caches must encode descriptor-sensitive paths.

For a normal method:

obj.method()

the cache may optimize method lookup.

For a property:

obj.value

the property getter must still execute.

The cache cannot replace a descriptor call with a raw value unless semantics allow it.
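A property with a side effect makes this concrete. In the sketch below (the Ticker class is invented for illustration), a cache may remember that obj.value routes through this property, but each access must still invoke the getter:

```python
import itertools

class Ticker:
    _counter = itertools.count()    # shared counter, advanced per read

    @property
    def value(self):
        return next(self._counter)  # side effect: every read differs

t = Ticker()
first, second = t.value, t.value
assert (first, second) == (0, 1)    # two accesses, two getter calls
```

Caching the first result as a raw value would freeze it at 0, which would be visibly wrong.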

73.23 Inline Caches and Globals

Global caching depends on namespace stability.

Example:

def f(xs):
    return len(xs)

The function’s global dictionary and the builtins dictionary define name resolution.

A cache for len remains valid while both dictionaries keep the same relevant version state.

If the module assigns a new global:

len = lambda x: 0

the global dictionary changes. The lookup must be redone.

This keeps Python’s dynamic namespace behavior intact.
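This dynamic behavior can be demonstrated with a function compiled into its own namespace dictionary, which plays the role of a module's globals here:

```python
# The dict ns acts as the function's module globals.
ns = {}
exec("def f(xs):\n    return len(xs)", ns)

assert ns["f"]([1, 2, 3]) == 3    # builtin len via the builtins fallback
ns["len"] = lambda xs: 0          # assign a module-level global named len
assert ns["f"]([1, 2, 3]) == 0    # any cached lookup must notice the change
del ns["len"]
assert ns["f"]([1, 2, 3]) == 3    # deleting the global restores the builtin
```

Both the assignment and the deletion change the globals dictionary, so a version-guarded cache re-resolves the name in each case.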

73.24 Inline Caches and Imports

Imports often create global names:

import math

def f(x):
    return math.sqrt(x)

This involves two cacheable operations:

LOAD_GLOBAL math
LOAD_ATTR sqrt

After warmup:

math lookup can be cached
sqrt attribute lookup can be cached

This is why moving imports outside hot loops helps, but repeated module attribute access can still become relatively efficient after specialization.

73.25 Inline Caches and Loops

Loops amplify cache benefits.

A small loop body may execute millions of times:

for p in points:
    total += p.x

The LOAD_ATTR x instruction may execute once per iteration.

A cache miss on the first few iterations matters little. Cache hits on the remaining iterations matter a lot.

This is the main performance argument for inline caches:

pay small warmup cost
reduce repeated dynamic overhead

73.26 Inline Caches vs Memoization

Inline caching differs from user-level memoization.

Memoization caches function results:

same input
same output

Inline caching caches operation resolution:

same runtime shape
same fast path

Example:

obj.x

The cache does not necessarily store the final value of obj.x. It may store how to find it quickly.

For mutable objects, storing the final value would often be wrong.
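The difference is easy to show with a mutable object. In the sketch below, cached_path is an invented stand-in for a cached lookup path: it remembers how to find the attribute, not what the attribute was.

```python
class Box:
    def __init__(self, x):
        self.x = x

b = Box(1)

memoized_value = b.x          # value-level cache: stores the result itself

def cached_path(obj):
    return obj.__dict__["x"]  # path-level cache: "x is a plain dict entry"

b.x = 2
assert memoized_value == 1    # stale: the stored value missed the mutation
assert cached_path(b) == 2    # still correct: the load is re-executed
```

Inline caches sit on the path-level side: the guarded load still happens, so mutation stays visible.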

73.27 Inline Caches vs CPU Caches

Inline caches are interpreter-level data structures.

CPU caches are hardware-level memory caches.

They are unrelated mechanisms, but they interact.

Inline caches improve interpreter logic by avoiding expensive lookups.

CPU caches improve memory access by keeping recently used memory close to the processor.

A good inline cache design also considers hardware locality:

cache entries near bytecode
small fixed-size records
few pointer chases
cheap validation

73.28 Performance Shape

Inline caches improve common dynamic operations:

attribute access
method calls
global lookup
binary operations
subscript access
unpacking
calls

They help most when code has:

hot loops
stable types
stable globals
stable class dictionaries
repeated operations at the same bytecode sites

They help less when code has:

frequent monkey patching
many unrelated types at one site
custom dynamic lookup hooks
heavy tracing
mostly cold execution

73.29 Reading Inline Cache Code

When reading CPython source, look for:

adaptive opcode families
specialization counters
cache structures
version checks
deoptimization paths
miss handlers
generic fallback calls

Relevant areas include:

Python/bytecodes.c
Python/generated_cases.c.h
Python/specialize.c
Include/internal/pycore_code.h
Lib/dis.py

The exact file organization can change between CPython versions, but the concepts remain recognizable.

73.30 Mental Model

A useful model:

An inline cache turns repeated dynamic lookup into checked direct access.

The check preserves correctness.

The direct access improves performance.

The fallback preserves Python semantics.

generic operation
observe
specialize
validate cache
fast path
fallback if invalid

73.31 Chapter Summary

Inline caches are local runtime caches attached to bytecode instructions.

They store facts such as:

observed types
dictionary versions
attribute offsets
resolved descriptors
global lookup results
operation-specific fast paths

They are central to modern CPython performance because they reduce repeated dynamic dispatch overhead without changing Python semantics.

Inline caches work best when a bytecode site sees stable runtime behavior. When assumptions fail, CPython falls back to generic execution and may respecialize or deoptimize later.