
79. Memory Locality

Object allocation locality, arena page placement, freelists for common types, and cache-line–aware design.

Memory locality is the degree to which a program's memory accesses are close together in time and space.

CPython is a dynamic object runtime. Most Python values are heap-allocated objects, accessed through pointers. This gives Python flexibility, but it also means execution often follows chains of references through memory.

Example:

total += obj.value

At the machine level, this may touch:

current frame
evaluation stack
local variable slot
object header
type object
inline cache
instance dictionary or slot storage
attribute value object
integer object

Each additional pointer can lead to a CPU cache miss. Memory locality therefore affects CPython performance even when the algorithmic complexity looks unchanged.

79.1 CPU Caches

Modern CPUs are much faster than main memory.

To reduce the gap, CPUs keep recently used memory in cache.

Typical hierarchy:

registers
L1 cache
L2 cache
L3 cache
main memory

Accessing data in L1 cache is much faster than fetching it from main memory.

A program with good locality repeatedly touches nearby memory. A program with poor locality jumps between unrelated memory locations and waits on memory loads.

CPython performance often depends on this difference.

79.2 Temporal Locality

Temporal locality means recently used data is likely to be used again soon.

Example:

for item in items:
    total += item.price

The loop repeatedly touches:

same bytecode instructions
same frame
same local variables
same attribute cache
same type object

These structures may stay hot in CPU caches.

The object data may also have temporal locality if the objects were allocated near each other and accessed repeatedly.

79.3 Spatial Locality

Spatial locality means nearby memory addresses are likely to be used together.

Example:

xs = [1, 2, 3, 4]

A CPython list stores a contiguous array of pointers to objects.

The list’s pointer array has good spatial locality:

ob_item[0]
ob_item[1]
ob_item[2]
ob_item[3]

But the objects pointed to by those entries may live elsewhere on the heap.

For small integers, CPython may reuse cached integer objects. For general objects, the list stores references, not inline values.

79.4 Python Objects Are Pointer-Based

Most Python values are heap objects.

A Python list of integers:

xs = [1000, 1001, 1002]

is not usually stored like a C array of raw integers.

Conceptually:

list object
    pointer array
        -> PyLongObject(1000)
        -> PyLongObject(1001)
        -> PyLongObject(1002)

This representation supports arbitrary object values:

xs = [1, "hello", None, object()]

But it creates pointer indirection.

A numeric loop over Python integers therefore pays costs for:

pointer loads
object header reads
type checks
reference count updates
integer object allocation for results
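The per-element overhead can be observed with sys.getsizeof: each boxed integer carries its own object header in addition to the list's pointer slot. A small sketch; the exact byte counts depend on the CPython version and platform:

```python
import sys

xs = [1000, 1001, 1002]

# The list itself stores only object pointers.
list_size = sys.getsizeof(xs)

# Each element is a separate PyLongObject with its own header.
element_sizes = [sys.getsizeof(x) for x in xs]

print(list_size)                        # list object + pointer array
print(element_sizes)                    # per-integer sizes (platform-dependent)
print(list_size + sum(element_sizes))   # closer to the true footprint
```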

79.5 Object Headers

Every normal Python object has an object header.

Conceptually:

typedef struct {
    Py_ssize_t ob_refcnt;
    PyTypeObject *ob_type;
} PyObject;

Variable-sized objects include a size field:

typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size;
} PyVarObject;

This header is necessary for dynamic typing and memory management.

It also means small values carry metadata.

A Python integer is not just the integer bits. It includes object identity, reference count, type pointer, and integer digit storage.
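This header overhead shows up directly in sys.getsizeof: an integer's reported size includes the header plus its variable-length digit storage, so it grows with magnitude rather than being a fixed machine word. Exact numbers vary by version and platform:

```python
import sys

# Reported sizes include the object header plus digit storage.
print(sys.getsizeof(0))        # smallest representation
print(sys.getsizeof(10**10))   # a few more digits
print(sys.getsizeof(10**100))  # many more digits
```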

79.6 Lists and Locality

A CPython list is a dynamic array of object pointers.

Conceptually:

PyListObject
    length
    allocated capacity
    pointer to contiguous PyObject* array

This gives good locality for scanning the list pointer array:

for x in xs:
    ...

The loop reads consecutive pointers.

But each element access may then follow a pointer to a separate object.

For object lists, locality depends on where the objects were allocated.

For lists of boxed numbers, the pointer array is contiguous, but the numeric objects are separate heap objects.
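The "allocated capacity" field means a growing list over-allocates its pointer array. This is observable: sys.getsizeof jumps in steps as the array is reallocated, rather than growing on every append. A sketch (the exact step pattern is version-dependent):

```python
import sys

xs = []
sizes = []
for i in range(20):
    xs.append(i)
    sizes.append(sys.getsizeof(xs))

# The size stays flat while appends fit in spare capacity,
# then jumps when the pointer array is reallocated.
print(sizes)
```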

79.7 Tuples and Locality

A tuple stores object references in fixed-size inline storage after the tuple header.

Conceptually:

tuple object
    header
    item pointer 0
    item pointer 1
    item pointer 2

Tuples have good locality for their item pointer storage because the tuple and its references are in one allocation.

Like lists, tuples store references to objects, not inline arbitrary values.

Tuples are immutable, so CPython can make assumptions that are harder for lists.
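Because a tuple's pointer slots live inline in a single allocation, a tuple is typically smaller than a list holding the same items: the list needs a separately allocated pointer array, often with spare capacity. A quick comparison, with the caveat that exact sizes vary by version and platform:

```python
import sys

t = (1, 2, 3)
l = [1, 2, 3]

print(sys.getsizeof(t))  # header + three inline pointer slots
print(sys.getsizeof(l))  # header + separately allocated pointer array
```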

79.8 Dictionaries and Locality

Dictionaries are hash tables. Their locality behavior is more complex.

Modern CPython dictionaries use compact layouts with separate index and entry storage.

Lookup may touch:

dictionary object
keys object
indices array
entries array
key object
value object

Compared with older sparse designs, compact dictionaries improve locality and iteration performance.

Iteration is especially locality-sensitive:

for key, value in d.items():
    ...

The compact entries layout lets iteration walk entries more sequentially.
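One visible consequence of the dense entries array is that dictionaries iterate in insertion order (guaranteed language behavior since Python 3.7): iteration walks the entries array from start to end rather than probing the sparse index table.

```python
d = {}
d["zebra"] = 1
d["apple"] = 2
d["mango"] = 3

# Iteration walks the dense entries array, so insertion
# order is preserved rather than hash order.
print(list(d))  # ['zebra', 'apple', 'mango']
```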

79.9 Instance Dictionaries

Object attributes are often stored in instance dictionaries.

Example:

obj.x

may require:

load object
load type
check descriptor tables
load instance dictionary
hash attribute name
probe dictionary
load value

Split dictionaries improve memory use and locality for many instances of the same class.

Instead of each instance storing duplicate keys, instances can share the key table and store only values.

Conceptually:

shared keys:
    name
    age
    email

instance values:
    "Alice"
    30
    "[email protected]"

This improves memory footprint and helps cache behavior for object-heavy programs.
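The effect of key sharing can be approximated with tracemalloc: many instances of one class typically cost less than the same number of standalone dictionaries with identical keys, because the instances share a single key table. This is a sketch, not exact accounting; the `User` class and sizes here are illustrative and version-dependent:

```python
import tracemalloc

class User:
    def __init__(self, name, age, email):
        self.name = name
        self.age = age
        self.email = email

def peak_bytes(make, n=50_000):
    # Measure peak traced allocation while building n objects.
    tracemalloc.start()
    objs = [make(i) for i in range(n)]
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

instances = peak_bytes(lambda i: User(i, i, i))
plain_dicts = peak_bytes(lambda i: {"name": i, "age": i, "email": i})

# Instances share one key table; each plain dict carries its own keys.
print(instances, plain_dicts)
```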

79.10 Slots and Locality

__slots__ can improve locality by storing fixed fields directly in the object layout.

Example:

class Point:
    __slots__ = ("x", "y")

A Point instance can store references for x and y at fixed offsets.

Conceptually:

Point object
    header
    x pointer
    y pointer

This avoids a per-instance dictionary for those fields.

Benefits:

less memory per object
fewer pointer indirections
faster fixed-offset access
better cache density

Slots are useful for many small objects with stable fields.
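The footprint difference can be sketched with tracemalloc by comparing a slotted class against an otherwise identical dict-based class. The magnitude of the saving varies by CPython version (modern versions already shrink plain instances considerably), so treat the numbers as indicative:

```python
import tracemalloc

class PointDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointSlots:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x = x
        self.y = y

def peak_bytes(cls, n=100_000):
    # Peak traced allocation while building n instances.
    tracemalloc.start()
    points = [cls(i, i) for i in range(n)]
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

dict_peak = peak_bytes(PointDict)
slots_peak = peak_bytes(PointSlots)
print(dict_peak, slots_peak)  # slotted instances are usually smaller
```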

79.11 Frames and Locality

A frame stores execution state for a function call.

Important frame data includes:

code object
instruction pointer
locals
evaluation stack
block and exception state

Modern CPython uses compact internal frame representations and avoids materializing full Python frame objects unless needed.

This improves locality for ordinary function calls.

Fast locals are stored in array-like slots:

localsplus[0]
localsplus[1]
localsplus[2]

This is much better than storing active locals in a dictionary during normal execution.

79.12 Fast Locals

Local variables are locality-friendly.

Example:

def f(a, b):
    c = a + b
    return c

The variables a, b, and c are stored in fast local slots.

Bytecode such as LOAD_FAST and STORE_FAST can access those slots by index.

Conceptually:

LOAD_FAST 0
LOAD_FAST 1
BINARY_OP +
STORE_FAST 2

This avoids dictionary lookup and keeps local state near the frame.

This is one reason locals are faster than globals.
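The fast-local access pattern can be inspected with dis. Exact opcode names shift across versions (newer CPythons may fuse adjacent loads into superinstructions), but the access is index-based rather than name-based:

```python
import dis

def f(a, b):
    c = a + b
    return c

dis.dis(f)

# All local access goes through indexed fast-local opcodes.
opnames = [ins.opname for ins in dis.get_instructions(f)]
print(opnames)
```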

79.13 Globals and Locality

Global variables live in dictionaries.

Example:

x = GLOBAL_VALUE

requires lookup in the module dictionary.

A global lookup touches more memory than a local variable load.

Modern CPython uses inline caches to reduce repeated global lookup overhead, but the underlying data is still dictionary-backed.

Global access may involve:

bytecode cache
globals dictionary
builtins dictionary
resolved object

Local access usually involves:

frame slot
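The difference can be sampled with timeit. Inline caches narrow the gap on modern CPython, so treat this as a probe rather than a fixed ratio; the function names below are illustrative:

```python
import timeit

GLOBAL_VALUE = 42

def read_global(n=100_000):
    total = 0
    for _ in range(n):
        total += GLOBAL_VALUE  # dictionary-backed lookup (inline-cached)
    return total

def read_local(n=100_000):
    value = GLOBAL_VALUE       # hoist into a fast-local slot once
    total = 0
    for _ in range(n):
        total += value         # frame-slot access
    return total

print(timeit.timeit(read_global, number=20))
print(timeit.timeit(read_local, number=20))
```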

79.14 Bytecode Locality

The interpreter repeatedly executes bytecode instructions stored in code objects.

Hot loops have good bytecode locality:

for x in xs:
    total += x

The same small group of instructions runs many times.

This helps instruction cache behavior.

Modern CPython also stores inline caches near bytecode instructions. This improves locality between the instruction and its runtime metadata.

Conceptually:

LOAD_ATTR x
CACHE
CACHE
CACHE

The interpreter can fetch the instruction and nearby cache state with fewer memory jumps.
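On CPython 3.11 and later, those cache entries occupy slots in the code object's raw bytecode buffer itself, so co_code contains more 2-byte code units than the instructions dis lists by default. A small probe (the function here is illustrative):

```python
import dis

def get_value(obj):
    return obj.value  # LOAD_ATTR carries inline cache entries

code = get_value.__code__

raw_units = len(code.co_code) // 2                    # code units, caches included
listed = len(list(dis.get_instructions(get_value)))   # real instructions only

# On 3.11+ the raw buffer is larger because CACHE slots sit
# beside instructions such as LOAD_ATTR.
print(raw_units, listed)
```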

79.15 Dispatch Locality

Opcode dispatch itself depends on locality.

The interpreter repeatedly jumps among opcode handlers.

Computed goto dispatch helps reduce branch overhead, but the CPU still needs to keep handler code hot.

Common opcode handlers such as:

LOAD_FAST
STORE_FAST
LOAD_CONST
LOAD_ATTR
CALL
BINARY_OP
RETURN_VALUE

are executed frequently and tend to remain in instruction cache.

Specialized opcodes improve locality by concentrating common behavior in shorter fast paths.

79.16 Reference Counting and Locality

Reference counting touches object headers.

Every Py_INCREF and Py_DECREF reads and writes the object’s reference count.

This means common operations frequently modify memory inside object headers.

Benefits:

immediate reclamation
simple ownership model
predictable finalization in many cases

Costs:

extra memory writes
cache line traffic
contention in free-threaded designs

Reference count writes can be expensive when objects are spread across memory or shared across CPU cores.

79.17 Allocation Locality

Objects allocated near each other may have better locality when used together.

CPython’s small object allocator, commonly called pymalloc, manages memory in arenas, pools, and blocks.

This improves allocation speed and tends to cluster similarly sized objects.

Better allocation locality helps workloads that create many objects of similar size.

However, Python object graphs can still become scattered due to long lifetimes, mixed object sizes, and complex references.

79.18 Small Object Allocator

CPython uses a specialized allocator for small objects.

The allocator groups memory roughly like:

arena
    pool
        fixed-size blocks

Objects of similar sizes are allocated from pools for that size class.

This reduces overhead compared with calling the system allocator for every small object.

It also improves locality because many similarly sized Python objects live near each other.

79.19 Free Lists

Some CPython object types use free lists.

A free list stores recently freed objects for reuse.

Historically, common examples include:

tuples
lists
frames
floats

The exact set changes across versions.

A free list can improve locality because a newly needed object may reuse recently touched memory.

It also reduces allocator overhead.

The tradeoff is additional memory retention and type-specific complexity.
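Free-list reuse can sometimes be glimpsed through id(): a newly created float may land at the address of a recently freed one. This is an implementation detail, not guaranteed behavior, so the sketch below only prints the addresses rather than relying on them matching:

```python
# A freed float's memory may be handed back for the next float
# allocation via the float free list. Not guaranteed, and it
# varies across versions and builds.
a = 1.5 + 2.5
addr_a = id(a)
del a            # the float object becomes eligible for reuse

b = 3.5 + 4.5
addr_b = id(b)

print(addr_a, addr_b, addr_a == addr_b)
```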

79.20 Cache Lines

CPUs move memory in cache-line chunks, commonly 64 bytes.

If a program reads one byte, the CPU usually loads the whole cache line.

This favors compact layouts.

A compact object layout means useful adjacent fields can arrive together.

A scattered layout means each field may require a separate cache miss.

CPython data structure design often tries to pack hot fields tightly and avoid unnecessary pointer chasing.

79.21 False Sharing

False sharing occurs when different CPU cores write unrelated data that happens to live on the same cache line.

Traditional CPython’s GIL reduces many multi-core interpreter-level races, but false sharing can still matter in native extensions, free-threaded builds, and allocator structures.

Reference counts can become especially sensitive in free-threaded designs because object headers are frequently modified.

This is one reason free-threading work must revisit memory layout and synchronization costs.

79.22 Memory Locality and the GIL

The GIL simplifies many memory safety problems, but it does not remove memory locality concerns.

A single-threaded interpreter can still suffer from:

cache misses
pointer chasing
large working sets
poor object layout
unfriendly allocation patterns

The GIL mainly serializes bytecode execution. It does not make scattered memory access cheap.

79.23 Free-Threaded Builds

Free-threaded CPython changes locality concerns.

Without a single global interpreter lock, reference counts and object state updates need more careful synchronization.

Potential costs include:

atomic reference count operations
cache-line contention
more synchronization metadata
different allocator behavior

These costs are deeply tied to memory locality.

A design that is fast under the GIL may need adjustment when multiple CPU cores mutate object state concurrently.

79.24 Numeric Arrays vs Python Lists

A Python list of numbers stores pointers to Python number objects.

A numeric array library can store raw numbers contiguously.

Python list:

list pointer array
    -> PyLongObject
    -> PyLongObject
    -> PyLongObject

Numeric array:

int64
int64
int64

The numeric array has much better locality and lower per-element overhead.

This explains why libraries such as NumPy can be much faster for numeric workloads. They reduce Python object overhead and use contiguous machine-level data.
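The gap is visible without third-party libraries: the stdlib array module stores raw machine values contiguously, like the numeric array sketch above. Comparing total footprints for the same values:

```python
import sys
from array import array

n = 10_000
boxed = list(range(1000, 1000 + n))         # pointers to PyLongObjects
packed = array("q", range(1000, 1000 + n))  # contiguous signed 64-bit ints

# The list's getsizeof covers only the pointer array; the integer
# objects it points to must be counted separately.
boxed_total = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
packed_total = sys.getsizeof(packed)

print(boxed_total, packed_total)
```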

79.25 Struct-of-Arrays vs Array-of-Structs

Object-heavy Python code often resembles an array of pointers to objects.

Example:

points = [Point(x, y) for x, y in pairs]

Memory shape:

list
    -> Point(x, y)
    -> Point(x, y)
    -> Point(x, y)

A locality-friendly representation may separate fields:

xs = [...]
ys = [...]

This resembles struct-of-arrays layout.

It can improve scanning performance because each list contains related values.

In pure Python, values are still objects, but the access pattern can still improve cache behavior and interpreter specialization.
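Both layouts compute the same result; the struct-of-arrays version scans two homogeneous lists instead of chasing a pointer per point. A sketch, with an illustrative Point class:

```python
class Point:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x = x
        self.y = y

pairs = [(i, i * 2) for i in range(1000)]

# Array-of-structs: one object per point, one pointer hop per field access.
points = [Point(x, y) for x, y in pairs]
aos_sum = sum(p.x for p in points)

# Struct-of-arrays: one homogeneous list per field, sequential scans.
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
soa_sum = sum(xs)

print(aos_sum, soa_sum)
```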

79.26 Object Graphs

Python programs often create object graphs:

user
    profile
        address
            country

Access:

user.profile.address.country

requires multiple attribute loads and pointer traversals.

Each level may involve:

object load
type check
dictionary or slot lookup
cache validation
value load

Deep object graphs can be expressive, but they increase memory indirection.

In hot paths, flattening or caching frequently used values in locals can help.

79.27 Local Variables as Locality Tools

A local variable stores a reference in the frame’s fast-local array.

Example:

profile = user.profile
address = profile.address
country = address.country

This can avoid repeated lookup if country or address is used many times.

Example:

price = item.price
tax = item.tax

total += price + price * tax

The local names price and tax are cheap to reload.

Modern CPython specializes attribute access, so this should be measured. But for repeated use, locals remain a useful locality and lookup optimization.

79.28 Working Set Size

The working set is the memory a program actively uses during a period.

A small working set fits in cache.

A large working set spills to slower memory.

Python programs can have large working sets because each object has overhead.

Example:

rows = [{"id": i, "name": str(i)} for i in range(1_000_000)]

This creates many dictionaries, strings, and integers.

A more compact representation can reduce memory pressure:

ids = [...]
names = [...]

or a columnar representation in a data library.

Lower memory use often improves speed because more data fits in CPU caches.
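tracemalloc can compare the row-of-dicts shape against the columnar shape directly. A sketch; absolute numbers depend on the version, but the direction of the difference is usually large:

```python
import tracemalloc

def peak_bytes(build):
    # Peak traced allocation while building the data structure.
    tracemalloc.start()
    data = build()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

rows_peak = peak_bytes(lambda: [{"id": i, "name": str(i)}
                                for i in range(100_000)])
cols_peak = peak_bytes(lambda: ([i for i in range(100_000)],
                                [str(i) for i in range(100_000)]))

print(rows_peak, cols_peak)  # columnar layout typically uses far less
```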

79.29 Branch Prediction and Locality

Memory locality interacts with branch prediction.

Generic dynamic operations branch on runtime type and layout:

is exact int?
is exact unicode?
does type match cache?
does dict version match?
is descriptor present?

Stable data improves both cache locality and branch prediction.

If a hot site repeatedly sees the same type, the CPU predicts the same branch outcomes.

If a site alternates among many unrelated types, both specialization and branch prediction suffer.

79.30 Inline Caches and Locality

Inline caches are explicitly locality-oriented.

They store runtime metadata beside the instruction.

Benefits:

fast access to cache data
fewer global lookup structures
better instruction-cache behavior
better data-cache locality

For an attribute access, the cache may store:

expected type
version tag
offset
descriptor information

The interpreter can validate and execute the fast path with a small amount of nearby data.

79.31 Specialization and Locality

Specialized opcodes improve locality by reducing generic work.

Example:

generic BINARY_OP
    many branches
    helper calls
    type protocol lookup

specialized integer add
    check exact int types
    perform fast integer path

The specialized handler is smaller and more predictable.

This helps:

instruction cache
branch predictor
data cache

Specialization is not only about fewer Python-level operations. It also improves CPU-level execution shape.

79.32 Measuring Locality Effects

Memory locality is harder to measure than algorithmic complexity.

Useful signals include:

runtime changes from data layout changes
memory usage changes
cache-miss counters from profilers
allocation counts
object counts
RSS growth

Tools may include:

pyperf
perf on Linux
Instruments on macOS
heap profilers
tracemalloc
sys.getsizeof

timeit can show symptoms, but low-level profilers are needed to see cache behavior directly.

79.33 sys.getsizeof Limitations

sys.getsizeof(obj) reports the direct size of an object.

It does not recursively include referenced objects.

Example:

import sys

xs = [1000, 1001, 1002]
print(sys.getsizeof(xs))

This reports the list object and pointer array, not the full size of all integer objects.

For memory locality, transitive object graphs matter.

A small direct object can point to a large scattered graph.
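A rough "deep" size can be computed by walking referents with gc.get_referents. This is a sketch, not exact accounting: shared objects are counted once, and some interpreter internals are pulled in or skipped depending on the object graph.

```python
import gc
import sys

def deep_getsizeof(obj):
    """Approximate transitive size of an object graph in bytes."""
    seen = set()
    stack = [obj]
    total = 0
    while stack:
        o = stack.pop()
        if id(o) in seen:
            continue
        seen.add(id(o))
        total += sys.getsizeof(o)
        stack.extend(gc.get_referents(o))
    return total

xs = [1000, 1001, 1002]
print(sys.getsizeof(xs))   # list header + pointer array only
print(deep_getsizeof(xs))  # includes the integer objects
```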

79.34 tracemalloc

tracemalloc tracks Python memory allocations.

Example:

import tracemalloc

tracemalloc.start()

data = [{"x": i} for i in range(100_000)]

current, peak = tracemalloc.get_traced_memory()
print(current, peak)

This helps identify allocation-heavy code.

It does not directly report CPU cache misses, but allocation pressure and working set size often correlate with locality problems.

79.35 Practical Guidelines

Memory locality suggests several practical rules.

Prefer compact data representations in hot paths: smaller working sets fit caches better.
Use lists or arrays for sequential scans: contiguous pointer or value storage improves locality.
Use __slots__ for many small fixed-shape objects: it reduces per-object memory and indirection.
Avoid excessive tiny dictionaries in large datasets: dictionary overhead expands the working set.
Cache repeated deep lookups in locals: fast locals avoid repeated pointer traversal.
Use numeric arrays for numeric workloads: raw contiguous values beat boxed Python objects.

These are performance rules, not design rules. Measure before changing clean code.

79.36 Reading CPython for Locality

Important source areas:

Objects/listobject.c: dynamic array of object pointers.
Objects/tupleobject.c: fixed-size object pointer arrays.
Objects/dictobject.c: compact hash table layout.
Objects/longobject.c: integer object representation.
Objects/typeobject.c: type layout and attribute lookup.
Objects/obmalloc.c: small object allocator.
Include/object.h: core object headers.

When reading the source, look for:

contiguous arrays
pointer indirection
cached hashes
version tags
object headers
allocation size classes
free lists

These details explain much of CPython’s real performance behavior.

79.37 Mental Model

A useful model:

CPython moves pointers more often than raw values.

Most operations manipulate PyObject * references.

Performance depends on:

how many pointers are followed
how close the pointed-to objects are
how often object headers are touched
how stable runtime shapes are
how much data fits in CPU cache

Good locality makes the interpreter and CPU cooperate better.

Poor locality turns dynamic flexibility into memory stalls.

79.38 Chapter Summary

Memory locality is a major hidden factor in CPython performance.

CPython uses pointer-based dynamic objects. This supports Python’s flexible semantics, but it creates indirection, object headers, reference count writes, dictionary lookups, and scattered object graphs.

Important locality mechanisms include:

compact dictionaries
split instance dictionaries
fast locals
inline caches
specialized opcodes
small object allocator
slots
contiguous list and tuple pointer storage

Good Python performance often comes from reducing the number of objects touched, keeping hot data compact, and making repeated operations see stable runtime shapes.