Object allocation locality, arena page placement, freelists for common types, and cache-line–aware design.
Memory locality is the degree to which a program accesses memory that is close together in time and space.
CPython is a dynamic object runtime. Most Python values are heap-allocated objects, accessed through pointers. This gives Python flexibility, but it also means execution often follows chains of references through memory.
Example:
total += obj.value
At the machine level, this may touch:
current frame
evaluation stack
local variable slot
object header
type object
inline cache
instance dictionary or slot storage
attribute value object
integer object
Each additional pointer can lead to a CPU cache miss. Memory locality therefore affects CPython performance even when the algorithmic complexity looks unchanged.
79.1 CPU Caches
Modern CPUs are much faster than main memory.
To reduce the gap, CPUs keep recently used memory in cache.
Typical hierarchy:
registers
L1 cache
L2 cache
L3 cache
main memory
Accessing data in L1 cache is much faster than fetching it from main memory.
A program with good locality repeatedly touches nearby memory. A program with poor locality jumps between unrelated memory locations and waits on memory loads.
CPython performance often depends on this difference.
79.2 Temporal Locality
Temporal locality means recently used data is likely to be used again soon.
Example:
for item in items:
    total += item.price
The loop repeatedly touches:
same bytecode instructions
same frame
same local variables
same attribute cache
same type object
These structures may stay hot in CPU caches.
The object data may also have temporal locality if the objects were allocated near each other and accessed repeatedly.
79.3 Spatial Locality
Spatial locality means nearby memory addresses are likely to be used together.
Example:
xs = [1, 2, 3, 4]
A CPython list stores a contiguous array of pointers to objects.
The list’s pointer array has good spatial locality:
ob_item[0]
ob_item[1]
ob_item[2]
ob_item[3]
But the objects pointed to by those entries may live elsewhere on the heap.
For small integers, CPython may reuse cached integer objects. For general objects, the list stores references, not inline values.
79.4 Python Objects Are Pointer-Based
Most Python values are heap objects.
A Python list of integers:
xs = [1000, 1001, 1002]
This is not usually stored like a C array of raw integers.
Conceptually:
list object
pointer array
-> PyLongObject(1000)
-> PyLongObject(1001)
-> PyLongObject(1002)
This representation supports arbitrary object values:
xs = [1, "hello", None, object()]
But it creates pointer indirection.
A numeric loop over Python integers therefore pays costs for:
pointer loads
object header reads
type checks
reference count updates
integer object allocation for results
79.5 Object Headers
Every normal Python object has an object header.
Conceptually:
typedef struct {
    Py_ssize_t ob_refcnt;
    PyTypeObject *ob_type;
} PyObject;
Variable-sized objects include a size field:
typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size;
} PyVarObject;
This header is necessary for dynamic typing and memory management.
It also means small values carry metadata.
A Python integer is not just the integer bits. It includes object identity, reference count, type pointer, and integer digit storage.
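A quick way to see this overhead is sys.getsizeof, which reports the direct allocation size. A minimal sketch (exact byte counts vary by CPython version and platform):
```python
# Direct allocation sizes; exact numbers vary by CPython version and platform,
# but even a small int carries a refcount, a type pointer, a size field, and
# digit storage.
import sys

print(sys.getsizeof(0))        # small int: header plus minimal digit storage
print(sys.getsizeof(10**30))   # larger ints grow by additional digits
print(sys.getsizeof(0.0))      # float: header plus one C double
```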
79.6 Lists and Locality
A CPython list is a dynamic array of object pointers.
Conceptually:
PyListObject
length
allocated capacity
pointer to contiguous PyObject* array
This gives good locality for scanning the list pointer array:
for x in xs:
    ...
The loop reads consecutive pointers.
But each element access may then follow a pointer to a separate object.
For object lists, locality depends on where the objects were allocated.
For lists of boxed numbers, the pointer array is contiguous, but the numeric objects are separate heap objects.
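The dynamic-array behavior can be observed by watching the reported size change as the list grows. A small sketch (the exact over-allocation pattern is an implementation detail):
```python
# The pointer array is one contiguous block that is reallocated as the list
# grows; the resize steps printed below are an implementation detail.
import sys

xs = []
last = sys.getsizeof(xs)
print(f"len=0  size={last}")
for i in range(32):
    xs.append(i)
    size = sys.getsizeof(xs)
    if size != last:          # a resize of the pointer array happened
        print(f"len={len(xs):<2} size={size}")
        last = size
```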
79.7 Tuples and Locality
A tuple stores object references in fixed-size inline storage after the tuple header.
Conceptually:
tuple object
header
item pointer 0
item pointer 1
item pointer 2
Tuples have good locality for their item pointer storage because the header and item pointers share a single allocation.
Like lists, tuples store references to objects, not inline arbitrary values.
Tuples are immutable, so CPython can make assumptions about them that are harder to make for lists.
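A rough size comparison illustrates the single-allocation layout. A sketch (numbers vary by version, and sys.getsizeof never includes the element objects themselves):
```python
# The tuple keeps its item pointers directly after the header; the list
# reports its header plus the capacity of a separately allocated pointer
# array. Neither number includes the element objects.
import sys

print(sys.getsizeof((1, 2, 3)))   # header + 3 item pointers, one allocation
print(sys.getsizeof([1, 2, 3]))   # header + pointer array capacity
```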
79.8 Dictionaries and Locality
Dictionaries are hash tables. Their locality behavior is more complex.
Modern CPython dictionaries use compact layouts with separate index and entry storage.
Lookup may touch:
dictionary object
keys object
indices array
entries array
key object
value object
Compared with older sparse designs, compact dictionaries improve locality and iteration performance.
Iteration is especially locality-sensitive:
for key, value in d.items():
    ...
The compact entries layout lets iteration walk entries more sequentially.
79.9 Instance Dictionaries
Object attributes are often stored in instance dictionaries.
Example:
obj.x
This may require:
load object
load type
check descriptor tables
load instance dictionary
hash attribute name
probe dictionary
load value
Split dictionaries improve memory use and locality for many instances of the same class.
Instead of each instance storing duplicate keys, instances can share the key table and store only values.
Conceptually:
shared keys:
name
age
email
instance values:
"Alice"
30
"[email protected]"This improves memory footprint and helps cache behavior for object-heavy programs.
79.10 Slots and Locality
__slots__ can improve locality by storing fixed fields directly in the object layout.
Example:
class Point:
    __slots__ = ("x", "y")
A Point instance can store references for x and y at fixed offsets.
Conceptually:
Point object
header
x pointer
y pointer
This avoids a per-instance dictionary for those fields.
Benefits:
less memory per object
fewer pointer indirections
faster fixed-offset access
better cache density
Slots are useful for many small objects with stable fields.
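A minimal sketch of the footprint difference (the class names here are illustrative, and exact sizes vary by CPython version):
```python
# Comparing a dict-based instance with a __slots__ instance. For the
# dict-based class, the instance dictionary must be added separately;
# sys.getsizeof of the object alone would miss it.
import sys

class PointDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointSlots:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x = x
        self.y = y

d = PointDict(1.0, 2.0)
s = PointSlots(1.0, 2.0)
print(sys.getsizeof(d) + sys.getsizeof(d.__dict__))  # object + its dict
print(sys.getsizeof(s))                              # object only, fixed slots
```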
79.11 Frames and Locality
A frame stores execution state for a function call.
Important frame data includes:
code object
instruction pointer
locals
evaluation stack
block and exception state
Modern CPython uses compact internal frame representations and avoids materializing full Python frame objects unless needed.
This improves locality for ordinary function calls.
Fast locals are stored in array-like slots:
localsplus[0]
localsplus[1]
localsplus[2]
This is much better than storing active locals in a dictionary during normal execution.
79.12 Fast Locals
Local variables are locality-friendly.
Example:
def f(a, b):
    c = a + b
    return c
The variables a, b, and c are stored in fast local slots.
Bytecode such as LOAD_FAST and STORE_FAST can access those slots by index.
Conceptually:
LOAD_FAST 0
LOAD_FAST 1
BINARY_OP +
STORE_FAST 2
This avoids dictionary lookup and keeps local state near the frame.
This is one reason locals are faster than globals.
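Disassembling the function above shows the slot-indexed access. A sketch (opcode names and numbering vary across CPython versions, and newer versions may fuse adjacent loads):
```python
# dis shows LOAD_FAST/STORE_FAST operating on fast-local slot indexes.
import dis

def f(a, b):
    c = a + b
    return c

dis.dis(f)   # expect LOAD_FAST, BINARY_OP, STORE_FAST, RETURN_VALUE variants
```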
79.13 Globals and Locality
Global variables live in dictionaries.
Example:
x = GLOBAL_VALUE
This requires lookup in the module dictionary.
A global lookup touches more memory than a local variable load.
Modern CPython uses inline caches to reduce repeated global lookup overhead, but the underlying data is still dictionary-backed.
Global access may involve:
bytecode cache
globals dictionary
builtins dictionary
resolved object
Local access usually involves:
frame slot
79.14 Bytecode Locality
The interpreter repeatedly executes bytecode instructions stored in code objects.
Hot loops have good bytecode locality:
for x in xs:
    total += x
The same small group of instructions runs many times.
This helps instruction cache behavior.
Modern CPython also stores inline caches near bytecode instructions. This improves locality between the instruction and its runtime metadata.
Conceptually:
LOAD_ATTR x
CACHE
CACHE
CACHE
The interpreter can fetch the instruction and nearby cache state with fewer memory jumps.
79.15 Dispatch Locality
Opcode dispatch itself depends on locality.
The interpreter repeatedly jumps among opcode handlers.
Computed goto dispatch helps reduce branch overhead, but the CPU still needs to keep handler code hot.
Common opcode handlers such as:
LOAD_FAST
STORE_FAST
LOAD_CONST
LOAD_ATTR
CALL
BINARY_OP
RETURN_VALUE
These are executed frequently and tend to remain in the instruction cache.
Specialized opcodes improve locality by concentrating common behavior in shorter fast paths.
79.16 Reference Counting and Locality
Reference counting touches object headers.
Every Py_INCREF and Py_DECREF reads and writes the object’s reference count.
This means common operations frequently modify memory inside object headers.
Benefits:
immediate reclamation
simple ownership model
predictable finalization in many cases
Costs:
extra memory writes
cache line traffic
contention in free-threaded designs
Reference count writes can be expensive when objects are spread across memory or shared across CPU cores.
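sys.getrefcount gives a rough view of these counts. A sketch (the reported value includes the temporary reference created by the call itself, and some objects such as interned constants report implementation-defined values):
```python
# Observing reference counts. The call itself adds one temporary reference,
# so a freshly created object with one name typically reports 2.
import sys

x = ["a", "b"]
print(sys.getrefcount(x))   # typically 2: the name x plus the call argument
y = x
print(sys.getrefcount(x))   # typically 3 after binding a second name
```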
79.17 Allocation Locality
Objects allocated near each other may have better locality when used together.
CPython’s small object allocator, commonly called pymalloc, manages memory in arenas, pools, and blocks.
This improves allocation speed and tends to cluster similarly sized objects.
Better allocation locality helps workloads that create many objects of similar size.
However, Python object graphs can still become scattered due to long lifetimes, mixed object sizes, and complex references.
79.18 Small Object Allocator
CPython uses a specialized allocator for small objects.
The allocator groups memory roughly like:
arena
pool
fixed-size blocks
Objects of similar sizes are allocated from pools for that size class.
This reduces overhead compared with calling the system allocator for every small object.
It also improves locality because many similarly sized Python objects live near each other.
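CPython can dump pymalloc statistics directly. A sketch (sys._debugmallocstats is a CPython implementation detail, its output format varies by version, and it writes to stderr):
```python
# Dump arena/pool/block usage of the small object allocator to stderr.
# This is CPython-specific diagnostic output, not a stable API.
import sys

data = [object() for _ in range(10_000)]  # allocate many small objects
sys._debugmallocstats()                   # arena, pool, and size-class stats
```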
79.19 Free Lists
Some CPython object types use free lists.
A free list stores recently freed objects for reuse.
Historically, common examples include:
tuples
lists
frames
floats
The exact set changes across versions.
A free list can improve locality because a newly needed object may reuse recently touched memory.
It also reduces allocator overhead.
The tradeoff is additional memory retention and type-specific complexity.
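Free-list reuse can sometimes be observed indirectly through object addresses. A sketch (this is purely an implementation detail and not guaranteed; the floats are built at runtime so they are not cached constants):
```python
# A freed float's memory may be handed back out for the next float created,
# which is one way free lists keep recently touched memory in use.
a = float("3.14")
addr = id(a)
del a                 # the float is freed and may land on a free list
b = float("2.71")
print(id(b) == addr)  # often True on CPython builds with a float free list
```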
79.20 Cache Lines
CPUs move memory in cache-line chunks, commonly 64 bytes.
If a program reads one byte, the CPU usually loads the whole cache line.
This favors compact layouts.
A compact object layout means useful adjacent fields can arrive together.
A scattered layout means each field may require a separate cache miss.
CPython data structure design often tries to pack hot fields tightly and avoid unnecessary pointer chasing.
79.21 False Sharing
False sharing occurs when different CPU cores write unrelated data that happen to live on the same cache line.
Traditional CPython’s GIL reduces many multi-core interpreter-level races, but false sharing can still matter in native extensions, free-threaded builds, and allocator structures.
Reference counts can become especially sensitive in free-threaded designs because object headers are frequently modified.
This is one reason free-threading work must revisit memory layout and synchronization costs.
79.22 Memory Locality and the GIL
The GIL simplifies many memory safety problems, but it does not remove memory locality concerns.
A single-threaded interpreter can still suffer from:
cache misses
pointer chasing
large working sets
poor object layout
unfriendly allocation patterns
The GIL mainly serializes bytecode execution. It does not make scattered memory access cheap.
79.23 Free-Threaded Builds
Free-threaded CPython changes locality concerns.
Without a single global interpreter lock, reference counts and object state updates need more careful synchronization.
Potential costs include:
atomic reference count operations
cache-line contention
more synchronization metadata
different allocator behavior
These costs are deeply tied to memory locality.
A design that is fast under the GIL may need adjustment when multiple CPU cores mutate object state concurrently.
79.24 Numeric Arrays vs Python Lists
A Python list of numbers stores pointers to Python number objects.
A numeric array library can store raw numbers contiguously.
Python list:
list pointer array
-> PyLongObject
-> PyLongObject
-> PyLongObject
Numeric array:
int64
int64
int64
The numeric array has much better locality and lower per-element overhead.
This explains why libraries such as NumPy can be much faster for numeric workloads. They reduce Python object overhead and use contiguous machine-level data.
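The standard library's array module shows the same idea without third-party dependencies. A sketch (sizes are approximate and version-dependent):
```python
# array.array stores raw machine values contiguously; the list stores only
# pointers, and its int objects are separate heap allocations not counted here.
import array
import sys

n = 100_000
as_list = list(range(n))
as_array = array.array("q", range(n))   # "q": signed 64-bit integers

print(sys.getsizeof(as_list))    # pointer array only
print(sys.getsizeof(as_array))   # raw values, roughly 8 bytes per element
```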
79.25 Struct-of-Arrays vs Array-of-Structs
Object-heavy Python code often resembles an array of pointers to objects.
Example:
points = [Point(x, y) for x, y in pairs]
Memory shape:
list
-> Point(x, y)
-> Point(x, y)
-> Point(x, y)
A locality-friendly representation may separate fields:
xs = [...]
ys = [...]
This resembles a struct-of-arrays layout.
It can improve scanning performance because each list contains related values.
In pure Python, values are still objects, but the access pattern can still improve cache behavior and interpreter specialization.
79.26 Object Graphs
Python programs often create object graphs:
user
profile
address
country
Access:
user.profile.address.country
This requires multiple attribute loads and pointer traversals.
Each level may involve:
object load
type check
dictionary or slot lookup
cache validation
value load
Deep object graphs can be expressive, but they increase memory indirection.
In hot paths, flattening or caching frequently used values in locals can help.
79.27 Local Variables as Locality Tools
A local variable stores a reference in the frame’s fast-local array.
Example:
profile = user.profile
address = profile.address
country = address.country
This can avoid repeated lookup if country or address is used many times.
Example:
price = item.price
tax = item.tax
total += price + price * tax
The local names price and tax are cheap to reload.
Modern CPython specializes attribute access, so this should be measured. But for repeated use, locals remain a useful locality and lookup optimization.
79.28 Working Set Size
The working set is the memory a program actively uses during a period.
A small working set fits in cache.
A large working set spills to slower memory.
Python programs can have large working sets because each object has overhead.
Example:
rows = [{"id": i, "name": str(i)} for i in range(1_000_000)]
This creates many dictionaries, strings, and integers.
A more compact representation can reduce memory pressure:
ids = [...]
names = [...]
or a columnar representation in a data library.
Lower memory use often improves speed because more data fits in CPU caches.
79.29 Branch Prediction and Locality
Memory locality interacts with branch prediction.
Generic dynamic operations branch on runtime type and layout:
is exact int?
is exact unicode?
does type match cache?
does dict version match?
is descriptor present?
Stable data improves both cache locality and branch prediction.
If a hot site repeatedly sees the same type, the CPU predicts the same branch outcomes.
If a site alternates among many unrelated types, both specialization and branch prediction suffer.
79.30 Inline Caches and Locality
Inline caches are explicitly locality-oriented.
They store runtime metadata beside the instruction.
Benefits:
fast access to cache data
fewer global lookup structures
better instruction-cache behavior
better data-cache locality
For an attribute access, the cache may store:
expected type
version tag
offset
descriptor information
The interpreter can validate and execute the fast path with a small amount of nearby data.
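dis can make the cache entries visible. A sketch (show_caches was added in CPython 3.11; later versions fold cache details into the normal output, so the flag may be unnecessary or ignored there):
```python
# Show the CACHE entries stored next to LOAD_ATTR in the bytecode stream.
import dis

class C:
    def __init__(self):
        self.x = 1

def get_x(obj):
    return obj.x

dis.dis(get_x, show_caches=True)   # LOAD_ATTR followed by CACHE entries
```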
79.31 Specialization and Locality
Specialized opcodes improve locality by reducing generic work.
Example:
generic BINARY_OP
many branches
helper calls
type protocol lookup
specialized integer add
check exact int types
perform fast integer path
The specialized handler is smaller and more predictable.
This helps:
instruction cache
branch predictor
data cache
Specialization is not only about fewer Python-level operations. It also improves CPU-level execution shape.
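The adaptive interpreter's rewrites can be inspected after a warm-up. A sketch (adaptive=True exists in CPython 3.11+, and the specific specialized opcode names vary by version):
```python
# After enough calls with exact int operands, BINARY_OP may be specialized;
# dis with adaptive=True shows the rewritten instructions.
import dis

def add(a, b):
    return a + b

for _ in range(1000):
    add(1, 2)                # warm up with stable int operands

dis.dis(add, adaptive=True)  # may show a form such as BINARY_OP_ADD_INT
```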
79.32 Measuring Locality Effects
Memory locality is harder to measure than algorithmic complexity.
Useful signals include:
runtime changes from data layout changes
memory usage changes
cache-miss counters from profilers
allocation counts
object counts
RSS growth
Tools may include:
pyperf
perf on Linux
Instruments on macOS
heap profilers
tracemalloc
sys.getsizeof
timeit can show symptoms, but low-level profilers are needed to see cache behavior directly.
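A pyperf benchmark is often the first step before reaching for hardware counters. A sketch (pyperf is a third-party package, and the benchmark name and function here are illustrative):
```python
# Benchmark a simple sequential scan with pyperf. Layout changes usually show
# up as timing changes here before cache counters are consulted.
import pyperf

def scan(xs):
    total = 0
    for x in xs:
        total += x
    return total

runner = pyperf.Runner()
data = list(range(10_000))
runner.bench_func("scan-list", scan, data)
```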
79.33 sys.getsizeof Limitations
sys.getsizeof(obj) reports the direct size of an object.
It does not recursively include referenced objects.
Example:
import sys
xs = [1000, 1001, 1002]
print(sys.getsizeof(xs))
This reports the list object and pointer array, not the full size of all integer objects.
For memory locality, transitive object graphs matter.
A small direct object can point to a large scattered graph.
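A recursive estimate is often more informative. A rough sketch (deep_sizeof is an illustrative helper, not a standard library function, and it only handles common container types):
```python
# Estimate the transitive size of an object graph. Shared objects are counted
# once; container internals and exact numbers vary by CPython version.
import sys

def deep_sizeof(obj, seen=None):
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size

xs = [1000, 1001, 1002]
print(sys.getsizeof(xs), deep_sizeof(xs))   # direct size vs graph estimate
```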
79.34 tracemalloc
tracemalloc tracks Python memory allocations.
Example:
import tracemalloc
tracemalloc.start()
data = [{"x": i} for i in range(100_000)]
current, peak = tracemalloc.get_traced_memory()
print(current, peak)
This helps identify allocation-heavy code.
It does not directly report CPU cache misses, but allocation pressure and working set size often correlate with locality problems.
79.35 Practical Guidelines
Memory locality suggests several practical rules.
| Pattern | Reason |
|---|---|
| Prefer compact data representations in hot paths | Smaller working sets fit cache better |
| Use lists or arrays for sequential scans | Contiguous pointer or value storage improves locality |
| Use __slots__ for many small fixed-shape objects | Reduces per-object memory and indirection |
| Avoid excessive tiny dictionaries in large datasets | Dict overhead expands working set |
| Cache repeated deep lookups in locals | Fast locals avoid repeated pointer traversal |
| Use numeric arrays for numeric workloads | Raw contiguous values beat boxed Python objects |
These are performance rules, not design rules. Measure before changing clean code.
79.36 Reading CPython for Locality
Important source areas:
| Area | Purpose |
|---|---|
| Objects/listobject.c | Dynamic array of object pointers |
| Objects/tupleobject.c | Fixed-size object pointer arrays |
| Objects/dictobject.c | Compact hash table layout |
| Objects/longobject.c | Integer object representation |
| Objects/typeobject.c | Type layout and attribute lookup |
| Objects/obmalloc.c | Small object allocator |
| Include/object.h | Core object headers |
When reading the source, look for:
contiguous arrays
pointer indirection
cached hashes
version tags
object headers
allocation size classes
free lists
These details explain much of CPython's real performance behavior.
79.37 Mental Model
A useful model:
CPython moves pointers more often than raw values.
Most operations manipulate PyObject * references.
Performance depends on:
how many pointers are followed
how close the pointed-to objects are
how often object headers are touched
how stable runtime shapes are
how much data fits in CPU cache
Good locality makes the interpreter and CPU cooperate better.
Poor locality turns dynamic flexibility into memory stalls.
79.38 Chapter Summary
Memory locality is a major hidden factor in CPython performance.
CPython uses pointer-based dynamic objects. This supports Python’s flexible semantics, but it creates indirection, object headers, reference count writes, dictionary lookups, and scattered object graphs.
Important locality mechanisms include:
compact dictionaries
split instance dictionaries
fast locals
inline caches
specialized opcodes
small object allocator
slots
contiguous list and tuple pointer storage
Good Python performance often comes from reducing the number of objects touched, keeping hot data compact, and making repeated operations see stable runtime shapes.