# 79. Memory Locality

Memory locality is the degree to which a program accesses the same or nearby memory locations within a short span of time.

CPython is a dynamic object runtime. Most Python values are heap-allocated objects, accessed through pointers. This gives Python flexibility, but it also means execution often follows chains of references through memory.

Example:

```python
total += obj.value
```

At the machine level, this may touch:

```text
current frame
evaluation stack
local variable slot
object header
type object
inline cache
instance dictionary or slot storage
attribute value object
integer object
```

Each additional pointer dereference can trigger a CPU cache miss. Memory locality therefore affects CPython performance even when the algorithmic complexity looks unchanged.

## 79.1 CPU Caches

Modern CPUs are much faster than main memory.

To reduce the gap, CPUs keep recently used memory in cache.

Typical hierarchy:

```text
registers
L1 cache
L2 cache
L3 cache
main memory
```

Accessing data in L1 cache is much faster than fetching it from main memory.

A program with good locality repeatedly touches nearby memory. A program with poor locality jumps between unrelated memory locations and waits on memory loads.

CPython performance often depends on this difference.
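
To get a feel for this, the sketch below sums the same list in sequential order and in a shuffled order. This is a rough microbenchmark, not a definitive measurement: absolute timings vary by machine, and interpreter overhead dilutes the cache effect, but the access-order difference is the point.

```python
import random
import time

# A large list: the pointer array is contiguous, but the
# order in which we walk it changes cache behavior.
data = list(range(1_000_000))
indices = list(range(len(data)))
random.shuffle(indices)

start = time.perf_counter()
sequential_sum = sum(data[i] for i in range(len(data)))
sequential_time = time.perf_counter() - start

start = time.perf_counter()
shuffled_sum = sum(data[i] for i in indices)
shuffled_time = time.perf_counter() - start

# Same work, same result; only the memory access order differs.
print(f"sequential: {sequential_time:.3f}s  shuffled: {shuffled_time:.3f}s")
```

Both loops compute the same sum; any timing gap comes from access pattern, not arithmetic.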

## 79.2 Temporal Locality

Temporal locality means recently used data is likely to be used again soon.

Example:

```python
for item in items:
    total += item.price
```

The loop repeatedly touches:

```text
same bytecode instructions
same frame
same local variables
same attribute cache
same type object
```

These structures may stay hot in CPU caches.

The object data may also have temporal locality if the objects were allocated near each other and accessed repeatedly.

## 79.3 Spatial Locality

Spatial locality means nearby memory addresses are likely to be used together.

Example:

```python
xs = [1, 2, 3, 4]
```

A CPython list stores a contiguous array of pointers to objects.

The list’s pointer array has good spatial locality:

```text
ob_item[0]
ob_item[1]
ob_item[2]
ob_item[3]
```

But the objects pointed to by those entries may live elsewhere on the heap.

For small integers, CPython may reuse cached integer objects. For general objects, the list stores references, not inline values.
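
The small-integer cache is observable with `is`. This is a CPython implementation detail (roughly the range -5 through 256), not a language guarantee; `int(str)` is used here to avoid the compiler merging equal constants.

```python
# CPython caches small integers and reuses the same objects.
a = int("100")
b = int("100")
print(a is b)   # True in CPython: both names reference the cached object

# Larger integers are allocated fresh each time.
c = int("10000")
d = int("10000")
print(c is d)   # False in CPython: two separate heap objects
```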

## 79.4 Python Objects Are Pointer-Based

Most Python values are heap objects.

A Python list of integers:

```python
xs = [1000, 1001, 1002]
```

is not usually stored like a C array of raw integers.

Conceptually:

```text
list object
    pointer array
        -> PyLongObject(1000)
        -> PyLongObject(1001)
        -> PyLongObject(1002)
```

This representation supports arbitrary object values:

```python
xs = [1, "hello", None, object()]
```

But it creates pointer indirection.

A numeric loop over Python integers therefore pays costs for:

```text
pointer loads
object header reads
type checks
reference count updates
integer object allocation for results
```
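
The stdlib `array` module shows the contrast: it stores raw machine values contiguously instead of pointers to boxed objects. Note that `sys.getsizeof` on the array includes the actual values, while for the list it covers only the pointer array, not the thousand separate integer objects behind it.

```python
import array
import sys

# A list of ints stores pointers to separate PyLongObject values.
xs_list = list(range(1000))

# array.array('q') stores raw signed 64-bit integers contiguously.
xs_arr = array.array("q", range(1000))

print(sys.getsizeof(xs_list))  # list object + 1000 pointers (ints excluded)
print(sys.getsizeof(xs_arr))   # array object + 1000 * 8 raw bytes (values included)
print(xs_arr.itemsize)         # bytes per element
```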

## 79.5 Object Headers

Every normal Python object has an object header.

Conceptually:

```c
typedef struct {
    Py_ssize_t ob_refcnt;
    PyTypeObject *ob_type;
} PyObject;
```

Variable-sized objects include a size field:

```c
typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size;
} PyVarObject;
```

This header is necessary for dynamic typing and memory management.

It also means small values carry metadata.

A Python integer is not just the integer bits. It includes object identity, reference count, type pointer, and integer digit storage.
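
`sys.getsizeof` makes the header overhead visible. Exact numbers vary by CPython version and platform, so the values below are indicative only.

```python
import sys

# Even a small integer carries a header plus digit storage.
print(sys.getsizeof(1))         # ~28 bytes on a typical 64-bit build
print(sys.getsizeof(2 ** 100))  # larger: more digit storage, same header
```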

## 79.6 Lists and Locality

A CPython list is a dynamic array of object pointers.

Conceptually:

```text
PyListObject
    length
    allocated capacity
    pointer to contiguous PyObject* array
```

This gives good locality for scanning the list pointer array:

```python
for x in xs:
    ...
```

The loop reads consecutive pointers.

But each element access may then follow a pointer to a separate object.

For object lists, locality depends on where the objects were allocated.

For lists of boxed numbers, the pointer array is contiguous, but the numeric objects are separate heap objects.

## 79.7 Tuples and Locality

A tuple stores object references in fixed-size inline storage after the tuple header.

Conceptually:

```text
tuple object
    header
    item pointer 0
    item pointer 1
    item pointer 2
```

Tuples have good locality for their item pointer storage because the tuple and its references are in one allocation.

Like lists, tuples store references to objects, not inline arbitrary values.

Tuples are immutable, so CPython can make assumptions about them that are harder to justify for mutable lists.
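
The single-allocation layout shows up in direct sizes. On a typical 64-bit CPython build, a tuple's inline item pointers make it smaller than a list of the same length, which also carries a length, a capacity, and a separate pointer array.

```python
import sys

t = (1, 2, 3)
l = [1, 2, 3]

# The tuple keeps its item pointers inline in one allocation;
# the list manages a separate, possibly over-allocated array.
print(sys.getsizeof(t))  # tuple header + 3 inline pointers
print(sys.getsizeof(l))  # list header + capacity + pointer array
```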

## 79.8 Dictionaries and Locality

Dictionaries are hash tables. Their locality behavior is more complex.

Modern CPython dictionaries use compact layouts with separate index and entry storage.

Lookup may touch:

```text
dictionary object
keys object
indices array
entries array
key object
value object
```

Compared with older sparse designs, compact dictionaries improve locality and iteration performance.

Iteration is especially locality-sensitive:

```python
for key, value in d.items():
    ...
```

The compact entries layout lets iteration walk entries more sequentially.
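
One visible consequence of the compact layout: new entries are appended to the entries array, which is why dictionaries preserve insertion order (a language guarantee since Python 3.7).

```python
# Entries land in the entries array in insertion order.
d = {}
d["b"] = 2
d["a"] = 1
d["c"] = 3

print(list(d.items()))  # [('b', 2), ('a', 1), ('c', 3)]
```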

## 79.9 Instance Dictionaries

Object attributes are often stored in instance dictionaries.

Example:

```python
obj.x
```

may require:

```text
load object
load type
check descriptor tables
load instance dictionary
hash attribute name
probe dictionary
load value
```

Split dictionaries improve memory use and locality for many instances of the same class.

Instead of each instance storing duplicate keys, instances can share the key table and store only values.

Conceptually:

```text
shared keys:
    name
    age
    email

instance values:
    "Alice"
    30
    "a@example.com"
```

This improves memory footprint and helps cache behavior for object-heavy programs.

## 79.10 Slots and Locality

`__slots__` can improve locality by storing fixed fields directly in the object layout.

Example:

```python
class Point:
    __slots__ = ("x", "y")
```

A `Point` instance can store references for `x` and `y` at fixed offsets.

Conceptually:

```text
Point object
    header
    x pointer
    y pointer
```

This avoids a per-instance dictionary for those fields.

Benefits:

```text
less memory per object
fewer pointer indirections
faster fixed-offset access
better cache density
```

Slots are useful for many small objects with stable fields.
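
A short comparison makes the tradeoff concrete: the slotted class drops the per-instance `__dict__` and rejects attributes outside its declared fields. Exact sizes vary by version, so the test only checks the structural difference.

```python
import sys

class PointDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointSlots:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x = x
        self.y = y

pd = PointDict(1.0, 2.0)
ps = PointSlots(1.0, 2.0)

# The slotted instance has no per-instance __dict__.
print(hasattr(pd, "__dict__"))  # True
print(hasattr(ps, "__dict__"))  # False

# Arbitrary new attributes are rejected on slotted instances.
try:
    ps.z = 3.0
except AttributeError:
    print("no new attributes")

# Total footprint per instance is typically smaller with slots.
print(sys.getsizeof(pd) + sys.getsizeof(pd.__dict__), sys.getsizeof(ps))
```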

## 79.11 Frames and Locality

A frame stores execution state for a function call.

Important frame data includes:

```text
code object
instruction pointer
locals
evaluation stack
block and exception state
```

Modern CPython uses compact internal frame representations and avoids materializing full Python frame objects unless needed.

This improves locality for ordinary function calls.

Fast locals are stored in array-like slots:

```text
localsplus[0]
localsplus[1]
localsplus[2]
```

This is much better than storing active locals in a dictionary during normal execution.

## 79.12 Fast Locals

Local variables are locality-friendly.

Example:

```python
def f(a, b):
    c = a + b
    return c
```

The variables `a`, `b`, and `c` are stored in fast local slots.

Bytecode such as `LOAD_FAST` and `STORE_FAST` can access those slots by index.

Conceptually:

```text
LOAD_FAST 0
LOAD_FAST 1
BINARY_OP +
STORE_FAST 2
```

This avoids dictionary lookup and keeps local state near the frame.

This is one reason locals are faster than globals.
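
The slot-based access is visible with the `dis` module. Opcode names vary across versions (CPython 3.12+ may emit combined forms such as `LOAD_FAST_LOAD_FAST`), but the locals are addressed by index either way.

```python
import dis

def f(a, b):
    c = a + b
    return c

# List the opcode names; fast locals appear as LOAD_FAST /
# STORE_FAST variants addressed by slot index, not by name lookup.
ops = [ins.opname for ins in dis.get_instructions(f)]
print(ops)
```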

## 79.13 Globals and Locality

Global variables live in dictionaries.

Example:

```python
x = GLOBAL_VALUE
```

requires lookup in the module dictionary.

A global lookup touches more memory than a local variable load.

Modern CPython uses inline caches to reduce repeated global lookup overhead, but the underlying data is still dictionary-backed.

Global access may involve:

```text
bytecode cache
globals dictionary
builtins dictionary
resolved object
```

Local access usually involves:

```text
frame slot
```
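
Disassembly shows the difference directly. The sketch below assumes a module-level `GLOBAL_VALUE` like the one in the snippet above: the global read compiles to `LOAD_GLOBAL` (dictionary-backed, inline-cached), while the local read uses a frame-slot `LOAD_FAST` variant.

```python
import dis

GLOBAL_VALUE = 42

def read_global():
    return GLOBAL_VALUE

def read_local():
    x = 42
    return x

g_ops = [ins.opname for ins in dis.get_instructions(read_global)]
l_ops = [ins.opname for ins in dis.get_instructions(read_local)]

print(g_ops)  # includes LOAD_GLOBAL
print(l_ops)  # includes a LOAD_FAST variant
```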

## 79.14 Bytecode Locality

The interpreter repeatedly executes bytecode instructions stored in code objects.

Hot loops have good bytecode locality:

```python
for x in xs:
    total += x
```

The same small group of instructions runs many times.

This helps instruction cache behavior.

Modern CPython also stores inline caches near bytecode instructions. This improves locality between the instruction and its runtime metadata.

Conceptually:

```text
LOAD_ATTR x
CACHE
CACHE
CACHE
```

The interpreter can fetch the instruction and nearby cache state with fewer memory jumps.

## 79.15 Dispatch Locality

Opcode dispatch itself depends on locality.

The interpreter repeatedly jumps among opcode handlers.

Computed goto dispatch helps reduce branch overhead, but the CPU still needs to keep handler code hot.

Common opcode handlers such as:

```text
LOAD_FAST
STORE_FAST
LOAD_CONST
LOAD_ATTR
CALL
BINARY_OP
RETURN_VALUE
```

are executed frequently and tend to remain in instruction cache.

Specialized opcodes improve locality by concentrating common behavior in shorter fast paths.

## 79.16 Reference Counting and Locality

Reference counting touches object headers.

Every `Py_INCREF` and `Py_DECREF` reads and writes the object’s reference count.

This means common operations frequently modify memory inside object headers.

Benefits:

```text
immediate reclamation
simple ownership model
predictable finalization in many cases
```

Costs:

```text
extra memory writes
cache line traffic
contention in free-threaded designs
```

Reference count writes can be expensive when objects are spread across memory or shared across CPU cores.

## 79.17 Allocation Locality

Objects allocated near each other may have better locality when used together.

CPython’s small object allocator, commonly called `pymalloc`, manages memory in arenas, pools, and blocks.

This improves allocation speed and tends to cluster similarly sized objects.

Better allocation locality helps workloads that create many objects of similar size.

However, Python object graphs can still become scattered due to long lifetimes, mixed object sizes, and complex references.

## 79.18 Small Object Allocator

CPython uses a specialized allocator for small objects.

The allocator groups memory roughly like:

```text
arena
    pool
        fixed-size blocks
```

Objects of similar sizes are allocated from pools for that size class.

This reduces overhead compared with calling the system allocator for every small object.

It also improves locality because many similarly sized Python objects live near each other.

## 79.19 Free Lists

Some CPython object types use free lists.

A free list stores recently freed objects for reuse.

Historically, common examples include:

```text
tuples
lists
frames
floats
```

The exact set changes across versions.

A free list can improve locality because a newly needed object may reuse recently touched memory.

It also reduces allocator overhead.

The tradeoff is additional memory retention and type-specific complexity.

## 79.20 Cache Lines

CPUs move memory in cache-line chunks, commonly 64 bytes.

If a program reads one byte, the CPU usually loads the whole cache line.

This favors compact layouts.

A compact object layout means useful adjacent fields can arrive together.

A scattered layout means each field may require a separate cache miss.

CPython data structure design often tries to pack hot fields tightly and avoid unnecessary pointer chasing.

## 79.21 False Sharing

False sharing occurs when different CPU cores write unrelated data that happens to live on the same cache line.

Traditional CPython’s GIL reduces many multi-core interpreter-level races, but false sharing can still matter in native extensions, free-threaded builds, and allocator structures.

Reference counts can become especially sensitive in free-threaded designs because object headers are frequently modified.

This is one reason free-threading work must revisit memory layout and synchronization costs.

## 79.22 Memory Locality and the GIL

The GIL simplifies many memory safety problems, but it does not remove memory locality concerns.

A single-threaded interpreter can still suffer from:

```text
cache misses
pointer chasing
large working sets
poor object layout
unfriendly allocation patterns
```

The GIL mainly serializes bytecode execution. It does not make scattered memory access cheap.

## 79.23 Free-Threaded Builds

Free-threaded CPython changes locality concerns.

Without a single global interpreter lock, reference counts and object state updates need more careful synchronization.

Potential costs include:

```text
atomic reference count operations
cache-line contention
more synchronization metadata
different allocator behavior
```

These costs are deeply tied to memory locality.

A design that is fast under the GIL may need adjustment when multiple CPU cores mutate object state concurrently.

## 79.24 Numeric Arrays vs Python Lists

A Python list of numbers stores pointers to Python number objects.

A numeric array library can store raw numbers contiguously.

Python list:

```text
list pointer array
    -> PyLongObject
    -> PyLongObject
    -> PyLongObject
```

Numeric array:

```text
int64
int64
int64
```

The numeric array has much better locality and lower per-element overhead.

This explains why libraries such as NumPy can be much faster for numeric workloads. They reduce Python object overhead and use contiguous machine-level data.

## 79.25 Struct-of-Arrays vs Array-of-Structs

Object-heavy Python code often resembles an array of pointers to objects.

Example:

```python
points = [Point(x, y) for x, y in pairs]
```

Memory shape:

```text
list
    -> Point(x, y)
    -> Point(x, y)
    -> Point(x, y)
```

A locality-friendly representation may separate fields:

```python
xs = [...]
ys = [...]
```

This resembles struct-of-arrays layout.

It can improve scanning performance because each list contains related values.

In pure Python, values are still objects, but the access pattern can still improve cache behavior and interpreter specialization.
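
A minimal sketch of the two layouts, using a hypothetical `Point` class: both compute the same result, but the struct-of-arrays version scans contiguous pointer arrays instead of hopping through one object per element.

```python
class Point:
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x = x
        self.y = y

pairs = [(i, i * 2) for i in range(1000)]

# Array-of-structs: one object per point, fields behind a pointer.
points = [Point(x, y) for x, y in pairs]
aos_total = sum(p.x for p in points)

# Struct-of-arrays: one contiguous pointer array per field.
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
soa_total = sum(xs)

print(aos_total == soa_total)  # True: same result, different layout
```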

## 79.26 Object Graphs

Python programs often create object graphs:

```text
user
    profile
        address
            country
```

Access:

```python
user.profile.address.country
```

requires multiple attribute loads and pointer traversals.

Each level may involve:

```text
object load
type check
dictionary or slot lookup
cache validation
value load
```

Deep object graphs can be expressive, but they increase memory indirection.

In hot paths, flattening or caching frequently used values in locals can help.

## 79.27 Local Variables as Locality Tools

A local variable stores a reference in the frame’s fast-local array.

Example:

```python
profile = user.profile
address = profile.address
country = address.country
```

This can avoid repeated lookup if `country` or `address` is used many times.

Example:

```python
price = item.price
tax = item.tax

total += price + price * tax
```

The local names `price` and `tax` are cheap to reload.

Modern CPython specializes attribute access, so this should be measured. But for repeated use, locals remain a useful locality and lookup optimization.

## 79.28 Working Set Size

The working set is the memory a program actively uses during a period.

A small working set fits in cache.

A large working set spills to slower memory.

Python programs can have large working sets because each object has overhead.

Example:

```python
rows = [{"id": i, "name": str(i)} for i in range(1_000_000)]
```

This creates many dictionaries, strings, and integers.

A more compact representation can reduce memory pressure:

```python
ids = [...]
names = [...]
```

or a columnar representation in a data library.

Lower memory use often improves speed because more data fits in CPU caches.
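
A rough direct-size comparison of the two shapes above. It counts only container overhead and ignores the element objects themselves, which the two layouts largely share, but the per-row dictionary cost is already visible.

```python
import sys

N = 10_000

# Row-per-dict representation.
rows = [{"id": i, "name": str(i)} for i in range(N)]

# Columnar representation of the same data.
ids = list(range(N))
names = [str(i) for i in range(N)]

rows_bytes = sys.getsizeof(rows) + sum(sys.getsizeof(r) for r in rows)
cols_bytes = sys.getsizeof(ids) + sys.getsizeof(names)

print(rows_bytes, cols_bytes)  # rows pay a dictionary per record
```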

## 79.29 Branch Prediction and Locality

Memory locality interacts with branch prediction.

Generic dynamic operations branch on runtime type and layout:

```text
is exact int?
is exact unicode?
does type match cache?
does dict version match?
is descriptor present?
```

Stable data improves both cache locality and branch prediction.

If a hot site repeatedly sees the same type, the CPU predicts the same branch outcomes.

If a site alternates among many unrelated types, both specialization and branch prediction suffer.

## 79.30 Inline Caches and Locality

Inline caches are explicitly locality-oriented.

They store runtime metadata beside the instruction.

Benefits:

```text
fast access to cache data
fewer global lookup structures
better instruction-cache behavior
better data-cache locality
```

For an attribute access, the cache may store:

```text
expected type
version tag
offset
descriptor information
```

The interpreter can validate and execute the fast path with a small amount of nearby data.

## 79.31 Specialization and Locality

Specialized opcodes improve locality by reducing generic work.

Example:

```text
generic BINARY_OP
    many branches
    helper calls
    type protocol lookup

specialized integer add
    check exact int types
    perform fast integer path
```

The specialized handler is smaller and more predictable.

This helps:

```text
instruction cache
branch predictor
data cache
```

Specialization is not only about fewer Python-level operations. It also improves CPU-level execution shape.

## 79.32 Measuring Locality Effects

Memory locality is harder to measure than algorithmic complexity.

Useful signals include:

```text
runtime changes from data layout changes
memory usage changes
cache-miss counters from profilers
allocation counts
object counts
RSS growth
```

Tools may include:

```text
pyperf
perf on Linux
Instruments on macOS
heap profilers
tracemalloc
sys.getsizeof
```

`timeit` can show symptoms, but low-level profilers are needed to see cache behavior directly.

## 79.33 `sys.getsizeof` Limitations

`sys.getsizeof(obj)` reports the direct size of an object.

It does not recursively include referenced objects.

Example:

```python
import sys

xs = [1000, 1001, 1002]
print(sys.getsizeof(xs))
```

This reports the list object and pointer array, not the full size of all integer objects.

For memory locality, transitive object graphs matter.

A small direct object can point to a large scattered graph.
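
When the transitive size matters, a traversal can be sketched with `gc.get_referents`. This is an approximation, not an exact accounting: it double-counts nothing thanks to the `seen` set, but it follows every reference the garbage collector knows about, including some you may not intend.

```python
import gc
import sys

def total_size(obj):
    """Rough transitive size of an object graph; a sketch, not exact."""
    seen = set()
    stack = [obj]
    total = 0
    while stack:
        o = stack.pop()
        if id(o) in seen:
            continue
        seen.add(id(o))
        total += sys.getsizeof(o)
        stack.extend(gc.get_referents(o))
    return total

xs = [10 ** 6, 10 ** 6 + 1, 10 ** 6 + 2]
print(sys.getsizeof(xs))  # direct size: list object + pointer array
print(total_size(xs))     # also includes the integer objects
```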

## 79.34 `tracemalloc`

`tracemalloc` tracks Python memory allocations.

Example:

```python
import tracemalloc

tracemalloc.start()

data = [{"x": i} for i in range(100_000)]

current, peak = tracemalloc.get_traced_memory()
print(current, peak)
```

This helps identify allocation-heavy code.

It does not directly report CPU cache misses, but allocation pressure and working set size often correlate with locality problems.
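
Snapshots narrow allocation pressure down to source lines. The sketch below groups traced allocations by line; `statistics("lineno")` returns the largest allocation sites first.

```python
import tracemalloc

tracemalloc.start()

data = [{"x": i} for i in range(10_000)]  # allocation-heavy work

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")

# The biggest allocation sites come first.
for stat in top[:3]:
    print(stat)

tracemalloc.stop()
```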

## 79.35 Practical Guidelines

Memory locality suggests several practical rules.

| Pattern | Reason |
|---|---|
| Prefer compact data representations in hot paths | Smaller working sets fit cache better |
| Use lists or arrays for sequential scans | Contiguous pointer or value storage improves locality |
| Use `__slots__` for many small fixed-shape objects | Reduces per-object memory and indirection |
| Avoid excessive tiny dictionaries in large datasets | Dict overhead expands working set |
| Cache repeated deep lookups in locals | Fast locals avoid repeated pointer traversal |
| Use numeric arrays for numeric workloads | Raw contiguous values beat boxed Python objects |

These are performance rules, not design rules. Measure before changing clean code.

## 79.36 Reading CPython for Locality

Important source areas:

| Area | Purpose |
|---|---|
| `Objects/listobject.c` | Dynamic array of object pointers |
| `Objects/tupleobject.c` | Fixed-size object pointer arrays |
| `Objects/dictobject.c` | Compact hash table layout |
| `Objects/longobject.c` | Integer object representation |
| `Objects/typeobject.c` | Type layout and attribute lookup |
| `Objects/obmalloc.c` | Small object allocator |
| `Include/object.h` | Core object headers |

When reading the source, look for:

```text
contiguous arrays
pointer indirection
cached hashes
version tags
object headers
allocation size classes
free lists
```

These details explain much of CPython’s real performance behavior.

## 79.37 Mental Model

A useful model:

```text
CPython moves pointers more often than raw values.
```

Most operations manipulate `PyObject *` references.

Performance depends on:

```text
how many pointers are followed
how close the pointed-to objects are
how often object headers are touched
how stable runtime shapes are
how much data fits in CPU cache
```

Good locality makes the interpreter and CPU cooperate better.

Poor locality turns dynamic flexibility into memory stalls.

## 79.38 Chapter Summary

Memory locality is a major hidden factor in CPython performance.

CPython uses pointer-based dynamic objects. This supports Python’s flexible semantics, but it creates indirection, object headers, reference count writes, dictionary lookups, and scattered object graphs.

Important locality mechanisms include:

```text
compact dictionaries
split instance dictionaries
fast locals
inline caches
specialized opcodes
small object allocator
slots
contiguous list and tuple pointer storage
```

Good Python performance often comes from reducing the number of objects touched, keeping hot data compact, and making repeated operations see stable runtime shapes.
