# 92. Free-Threaded CPython

Free-threaded CPython is a major redesign of the interpreter runtime that removes the traditional Global Interpreter Lock (GIL) and allows multiple threads to execute Python bytecode concurrently within the same interpreter.

Historically, CPython relied on the GIL to serialize execution of Python code. The GIL simplified memory management, reference counting, object mutation, and allocator coordination, and it preserved the interpreter's internal invariants. Only one thread at a time executed Python bytecode within an interpreter.

Free-threaded CPython changes this model.

The runtime must now preserve interpreter correctness while multiple CPU threads simultaneously manipulate Python objects, dictionaries, frames, reference counts, caches, and internal runtime structures.

This chapter examines:

```text
why the GIL existed
why removing it is difficult
how free-threaded CPython works
how memory management changes
how object access changes
how container synchronization works
how extension compatibility changes
what performance tradeoffs appear
```

The free-threaded work is one of the largest architectural changes in CPython history.

## 92.1 Historical Background

CPython traditionally used a single global lock protecting interpreter execution.

Conceptually:

```text
Thread A acquires GIL
    executes bytecode
Thread B waits
Thread A releases GIL
Thread B acquires GIL
```

This gave CPython several properties:

| Property | Effect |
|---|---|
| Reference counting updates are serialized | `ob_refcnt` operations stay simple |
| Object mutation is implicitly protected | Many internals avoid fine-grained locking |
| Interpreter state remains coherent | Frames and caches avoid races |
| Extension authors assume single-threaded interpreter execution | Simpler C APIs |

The cost was limited parallel execution for CPU-bound Python code.

Example:

```python
import threading

def work():
    total = 0
    for i in range(100_000_000):
        total += i

threads = [threading.Thread(target=work) for _ in range(4)]

for t in threads:
    t.start()

for t in threads:
    t.join()
```

Traditional CPython does not achieve near-4x CPU scaling here: despite four threads, wall-clock time stays close to that of serial execution because the threads take turns holding the GIL.

The GIL became one of the defining implementation characteristics of CPython.

## 92.2 Why the GIL Was Difficult to Remove

The GIL was not merely a scheduling mechanism.

It acted as a global correctness boundary.

Without the GIL, nearly every runtime subsystem becomes concurrently mutable:

```text
reference counts
object headers
dictionaries
lists
type caches
attribute caches
allocator metadata
garbage collector state
interned strings
import state
frame stacks
exception state
```

Consider a simple increment:

```python
x += 1
```

Under the GIL:

```text
load x
compute x + 1
store x
```

No other thread can mutate interpreter state during these bytecode operations.

Without the GIL:

```text
Thread A reads x
Thread B reads x
Thread A writes x + 1
Thread B writes stale value
```

The runtime must now enforce synchronization explicitly.
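This lost-update hazard, and the explicit fix, can be sketched at the application level with `threading`. The counter and function names below are illustrative; the lock plays the role the free-threaded runtime must otherwise play internally:

```python
import threading

# A read-modify-write like `x += 1` compiles to separate load, add,
# and store steps. If two threads interleave those steps on shared
# state, one update is lost. An explicit lock restores atomicity.

counter = 0
lock = threading.Lock()

def incr(n):
    global counter
    for _ in range(n):
        with lock:  # load + add + store now happen as one atomic unit
            counter += 1

threads = [threading.Thread(target=incr, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000: no updates lost
```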

The challenge extends far beyond Python-level semantics.

Even this operation becomes unsafe:

```c
Py_INCREF(obj);
```

Traditional CPython used plain integer increments:

```c
++obj->ob_refcnt;
```

Without the GIL, concurrent increments can race.

The free-threaded runtime therefore changes fundamental assumptions across the interpreter.

## 92.3 The Free-Threaded Build

Modern CPython (3.13 and later, following PEP 703) offers an experimental free-threaded build, produced by configuring the interpreter with `--disable-gil`.

The build disables the traditional GIL and enables runtime mechanisms required for concurrent execution.

Conceptually:

```text
traditional build
    one thread executes Python bytecode at a time

free-threaded build
    multiple threads execute Python bytecode simultaneously
```

This is not merely a runtime flag.

Large parts of the interpreter behave differently:

```text
reference counting strategy
container synchronization
allocator coordination
object access rules
C extension requirements
runtime invariants
```

The free-threaded runtime aims to preserve Python language semantics while changing interpreter-level concurrency guarantees.
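Assuming CPython 3.13 or later, both the build flavor and the GIL's runtime status can be detected roughly as follows; on older versions the sketch degrades gracefully to reporting a traditional build:

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 when the interpreter was compiled as a
# free-threaded build; on traditional builds it is 0 or absent.
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# On 3.13+ free-threaded builds, sys._is_gil_enabled() additionally
# reports whether the GIL is active right now (it can be re-enabled,
# for example when an incompatible extension is loaded).
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()

print(free_threaded_build, gil_enabled)
```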

## 92.4 Atomic Reference Counting

Reference counting is one of the central problems in free-threaded CPython.

Traditional CPython:

```c
obj->ob_refcnt++;
obj->ob_refcnt--;
```

This is unsafe under concurrent execution.

Free-threaded CPython uses atomic operations for many reference count updates.

Conceptually:

```c
atomic_fetch_add(&obj->ob_refcnt, 1);
atomic_fetch_sub(&obj->ob_refcnt, 1);
```

Atomic operations guarantee correctness under concurrent modification.

However, they introduce costs:

| Cost | Reason |
|---|---|
| Higher instruction overhead | Atomic operations are more expensive |
| Cache synchronization | CPU cores coordinate cache lines |
| Memory ordering constraints | Stronger synchronization semantics |
| Reduced locality | Shared objects bounce between cores |

Reference counting becomes one of the major scalability bottlenecks in a highly parallel runtime.

## 92.5 Biased Reference Counting

Free-threaded CPython introduces techniques to reduce atomic overhead.

One important strategy is biased reference counting.

The idea:

```text
most objects are heavily used by one thread
avoid global atomic synchronization when possible
delay or batch cross-thread coordination
```

Conceptually:

```text
thread-local reference ownership
    +
shared atomic reference state
```

A thread can manipulate references cheaply while ownership remains local.

Cross-thread sharing requires synchronization.

This reduces contention for common cases:

```python
def local_work():
    xs = []
    for i in range(1_000_000):
        xs.append(i)
```

Most objects here remain thread-local.

The runtime attempts to avoid expensive global atomic traffic for such objects.

## 92.6 Object Immortality

Another optimization is immortal objects.

Some objects are effectively permanent:

```text
None
True
False
small integers
interned constants
builtin singletons
```

Traditionally, these still participated in reference counting.

CPython 3.12 introduced immortal objects (PEP 683): their reference counts are pinned at a sentinel value and never reach zero, and the free-threaded build leans on them heavily.

Conceptually:

```text
immortal object
    refcount never reaches zero
    no deallocation
    many INCREF/DECREF operations skipped
```

This reduces synchronization overhead for heavily shared objects.

For example:

```python
x = None
```

would otherwise produce enormous cross-thread reference count traffic.

Immortal objects remove much of this pressure.
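A rough way to observe this from Python, assuming CPython 3.12+ semantics for `sys.getrefcount`: creating many new references to an immortal object leaves its reported count unchanged, whereas on older versions the count grows:

```python
import sys

# sys.getrefcount reports an object's reference count. On CPython
# 3.12+, immortal objects such as None report a fixed, very large
# sentinel count; on older versions it is an ordinary varying count.
before = sys.getrefcount(None)
refs = [None] * 1000          # a thousand new references to None
after = sys.getrefcount(None)

# Immortality-aware build: before == after despite the new references.
# Older build: after is roughly before + 1000.
print(before, after)
```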

## 92.7 Container Synchronization

Containers become major synchronization points.

Examples:

```python
list.append(x)
dict[key] = value
set.add(x)
```

Under the GIL, internal container state was implicitly protected.

Without the GIL, concurrent mutations must coordinate safely.

The runtime introduces internal synchronization mechanisms.

Conceptually:

```text
per-container locks
atomic state transitions
careful resize coordination
safe iteration invariants
```

A dictionary resize becomes particularly difficult.

Traditional dict resize:

```text
allocate new table
rehash entries
replace table pointer
free old table
```

Without synchronization, another thread may:

```text
read partially migrated table
follow invalid pointer
observe inconsistent state
```

The free-threaded runtime must guarantee container integrity during concurrent access.
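That guarantee of structural integrity can be exercised from Python. In the stress sketch below, every append must land exactly once regardless of interleaving (under the GIL because execution is serialized, in the free-threaded build because of the container's internal synchronization):

```python
import threading

shared = []
N_THREADS, N_APPENDS = 8, 10_000

def worker():
    for i in range(N_APPENDS):
        shared.append(i)  # structurally safe; ordering is unspecified

threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every append landed exactly once: no corruption, no lost elements.
print(len(shared))  # 80000
```

Note that only the list's *structure* is protected; the interleaving of elements from different threads remains arbitrary.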

## 92.8 Memory Allocation Under Concurrency

CPython includes specialized allocators:

```text
pymalloc
arena allocators
object free lists
small object allocators
```

These systems historically assumed GIL protection.

Free-threaded execution requires allocator synchronization.

Challenges include:

```text
concurrent allocation
concurrent free
free list corruption
arena reuse races
cache locality degradation
false sharing
```

The runtime attempts to preserve allocation performance while ensuring correctness.

Thread-local allocation structures become increasingly important.

## 92.9 Garbage Collection Changes

The cyclic garbage collector must also adapt.

Traditional CPython could often assume interpreter-wide serialization during GC-sensitive operations.

Free-threaded execution introduces new problems:

```text
objects mutate during collection
reference graphs change concurrently
container traversal races appear
finalizers execute concurrently
```

The collector must coordinate safely with running threads.

Key challenges:

| Problem | Example |
|---|---|
| Object mutation during traversal | List contents change while scanning |
| Concurrent resurrection | `__del__` creates new references |
| Cross-thread visibility | One thread frees object seen by another |
| Container instability | Dict resize during traversal |

The collector therefore requires stronger synchronization and more careful state management.
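One of these hazards, resurrection, is reproducible from pure Python: a `__del__` finalizer may create a fresh reference to the dying object, and the runtime must detect this and keep the object alive. The class name below is illustrative:

```python
import gc

resurrected = []

class Phoenix:
    def __del__(self):
        # The finalizer creates a brand-new reference to the object
        # that was about to die, "resurrecting" it.
        resurrected.append(self)

obj = Phoenix()
del obj        # refcount hits zero, __del__ runs, the object survives
gc.collect()

print(len(resurrected))  # 1: the object outlived its own destruction
```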

## 92.10 Interpreter State Isolation

Traditional CPython relied heavily on process-global state.

Examples:

```text
interned strings
type caches
import caches
runtime registries
allocator state
```

Free-threaded work pushes CPython toward improved interpreter isolation.

This overlaps with subinterpreter work.

The runtime increasingly distinguishes:

```text
process-global state
interpreter-local state
thread-local state
```

This decomposition is necessary for scalable concurrency.

## 92.11 Frame Execution Under Parallelism

Frames represent active execution contexts.

A frame contains:

```text
instruction pointer
locals
stack
exception state
code object
```

Traditional CPython assumed only one thread executed a frame at a time.

Free-threaded CPython must enforce stronger ownership guarantees.

Conceptually:

```text
a frame belongs to one executing thread
shared frame access requires synchronization
```

Debuggers, profilers, tracers, and introspection tools become more complicated because execution can now proceed simultaneously across many interpreter threads.
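The introspection surface such tools build on is visible from Python via `sys._current_frames()` (a documented, CPython-specific API), which snapshots each live thread's topmost frame. In a free-threaded build those frames may all be advancing while the snapshot is examined:

```python
import sys
import threading

done = threading.Event()

def worker():
    done.wait()   # park the thread so it stays alive in a known frame

t = threading.Thread(target=worker)
t.start()         # start() returns only once the thread is running

# Snapshot: {thread_id: topmost frame object} for every live thread.
frames = sys._current_frames()
present = t.ident in frames
print(present)  # True

done.set()
t.join()
```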

## 92.12 Bytecode Evaluation Without the GIL

The evaluation loop changes substantially.

Traditional interpreter:

```text
acquire GIL
execute bytecode
release GIL periodically
```

Free-threaded interpreter:

```text
execute bytecode concurrently
coordinate mutable shared state explicitly
```

This affects:

```text
attribute caches
inline caches
specialization metadata
object access
exception handling
call machinery
```

The specializing adaptive interpreter (PEP 659, introduced in CPython 3.11) must now operate correctly while other threads concurrently mutate its inline caches.

## 92.13 C Extension Compatibility

C extensions are one of the hardest compatibility problems.

Many extensions historically assumed:

```text
the GIL protects internal state
PyObject operations are serialized
reference counting is implicitly safe
container access is effectively single-threaded
```

These assumptions become invalid in free-threaded mode.

Unsafe example:

```c
static PyObject *global_cache;
```

Multiple threads may now mutate or access this simultaneously.

Extension authors must reconsider:

```text
locking
thread ownership
reference lifetime
global state
borrowed references
shared buffers
```

Extensions that do not declare free-threaded support (via the `Py_mod_gil` module slot) cause the free-threaded runtime to re-enable the GIL when they are imported; some extensions remain incompatible until rewritten.

## 92.14 Borrowed References Become Dangerous

Borrowed references are especially problematic.

Traditional CPython often relied on the GIL:

```c
PyObject *item = PyList_GET_ITEM(list, 0);
```

This returns a borrowed reference.

Under the GIL:

```text
another thread cannot concurrently destroy the list item
```

Without the GIL:

```text
another thread may mutate list
another thread may delete object
borrowed pointer may become invalid
```

This creates severe safety hazards.

Free-threaded CPython pushes toward safer ownership models and stronger APIs.

## 92.15 Performance Tradeoffs

Removing the GIL does not automatically improve performance.

Single-thread performance may decrease due to:

```text
atomic operations
extra synchronization
cache contention
larger metadata
locking overhead
memory fences
```

Parallel workloads may improve substantially.

Typical tradeoff:

| Workload | Effect |
|---|---|
| Single-thread CPU-bound | Often slower |
| Multi-thread CPU-bound | Potentially much faster |
| I/O-bound | Smaller difference |
| Allocation-heavy | May suffer from contention |
| Shared-object-heavy | May suffer from cache synchronization |

The runtime therefore balances:

```text
single-thread efficiency
parallel scalability
compatibility
implementation complexity
```

## 92.16 False Sharing and Cache Coherence

Modern multicore systems introduce hardware-level costs.

Suppose two threads repeatedly update reference counts on nearby objects.

CPU cache lines may bounce between cores:

```text
Core A modifies cache line
Core B invalidates cache line
Core A reloads cache line
```

This is called false sharing.

Even logically independent objects can interfere through cache coherence protocols.

Free-threaded runtime design therefore depends heavily on:

```text
memory layout
allocator design
object placement
cache locality
ownership heuristics
```

Concurrency performance is often dominated by hardware memory behavior rather than algorithmic complexity alone.

## 92.17 Lock Granularity

A free-threaded runtime must decide lock granularity carefully.

Coarse-grained locks:

```text
simpler correctness
less parallelism
more contention
```

Fine-grained locks:

```text
better scalability
higher complexity
deadlock risk
larger metadata cost
```

CPython historically favored simplicity through the GIL.

Free-threaded CPython must move toward more localized synchronization without making the runtime unmaintainable.

This is one of the core architectural tensions in the project.
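A middle point between the two extremes is lock striping: hash keys onto a fixed pool of locks, so unrelated keys rarely contend while metadata cost stays bounded. A minimal sketch, with illustrative names:

```python
import threading

class StripedCounter:
    """Counts events under striped locking: keys hash onto a fixed
    pool of lock/dict stripes, so threads working on different
    stripes never contend with each other."""

    def __init__(self, stripes=16):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._maps = [{} for _ in range(stripes)]

    def _stripe(self, key):
        return hash(key) % len(self._locks)

    def incr(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            self._maps[i][key] = self._maps[i].get(key, 0) + 1

    def get(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._maps[i].get(key, 0)

c = StripedCounter()
threads = [threading.Thread(target=lambda: [c.incr("hits") for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(c.get("hits"))  # 4000
```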

## 92.18 Thread Safety of Built-in Types

Built-in operations acquire new semantics under parallelism.

Questions include:

```text
Can two threads append to one list safely?
Can iteration proceed during mutation?
What operations are atomic?
What consistency guarantees exist?
```

The runtime attempts to preserve intuitive safety while avoiding excessive locking.

However, Python programs should still avoid unsynchronized shared mutable state where possible.

Example:

```python
shared = []

def worker():
    for i in range(1000):
        shared.append(i)
```

The runtime may preserve structural integrity of the list, but logical ordering and higher-level invariants still require application-level synchronization.
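When an invariant spans multiple operations, the application must supply the synchronization itself. For example, "append only while under a capacity limit" is a check-then-act sequence that no per-container lock can make atomic on the program's behalf. A sketch:

```python
import threading

shared = []
lock = threading.Lock()
CAPACITY = 5000

def bounded_worker():
    for i in range(10_000):
        # The check and the append must form one atomic unit, or two
        # threads can both pass the check and overshoot the capacity.
        with lock:
            if len(shared) < CAPACITY:
                shared.append(i)

threads = [threading.Thread(target=bounded_worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared))  # 5000: the invariant held
```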

## 92.19 Interaction With Subinterpreters

Subinterpreters and free-threading are related but distinct.

Subinterpreters isolate runtime state:

```text
modules
globals
builtins
execution state
```

Free-threading allows concurrent execution inside one interpreter.

Together, they support future scalability directions:

```text
multiple isolated interpreters
parallel execution
reduced global runtime state
better multicore utilization
```

The long-term architecture increasingly moves away from large globally shared runtime structures.

## 92.20 Runtime Invariants Become Explicit

The GIL historically hid many implicit assumptions.

Example assumptions:

```text
reference counts never race
dict mutation is serialized
frame stacks are stable
object lifetime is predictable
```

Free-threaded CPython forces these assumptions to become explicit runtime invariants.

Every subsystem must answer:

```text
Who owns this object?
Who may mutate this state?
What synchronization protects this structure?
When is this pointer valid?
What ordering guarantees exist?
```

This changes the engineering style of the interpreter itself.

## 92.21 Tooling and Debugging Challenges

Concurrent runtimes are harder to debug.

Problems include:

```text
race conditions
deadlocks
heisenbugs
timing-sensitive corruption
memory visibility bugs
```

Traditional deterministic assumptions become weaker.

Debugging tools must handle:

```text
simultaneous frame execution
parallel object mutation
cross-thread reference lifetime
concurrent allocator activity
```

Testing also becomes more difficult because many concurrency bugs appear nondeterministically.

## 92.22 Free-Threading and Python Semantics

The Python language itself changes relatively little.

Most user-visible semantics remain stable:

```python
x = [1, 2, 3]
x.append(4)
```

still behaves as expected.

The major changes are implementation-level:

```text
actual parallel bytecode execution
different performance characteristics
different extension safety requirements
different memory synchronization costs
```

The goal is preserving Python behavior while changing runtime scalability.

## 92.23 Long-Term Implications

Free-threaded CPython affects nearly every part of the ecosystem:

| Area | Impact |
|---|---|
| Interpreter runtime | Fundamental redesign |
| C extensions | Compatibility changes |
| Scientific computing | Better multicore scaling potential |
| Web servers | Improved concurrent execution |
| Tooling | Harder concurrency debugging |
| Allocators | Higher synchronization complexity |
| Object model | New lifetime rules |
| Performance engineering | Cache behavior becomes central |

The project represents a shift from:

```text
single-thread simplicity
```

toward:

```text
parallel runtime scalability
```

while attempting to preserve compatibility with decades of Python software.

## 92.24 Chapter Summary

Free-threaded CPython removes the traditional Global Interpreter Lock and allows multiple threads to execute Python bytecode concurrently inside one interpreter.

Achieving this requires major runtime redesigns:

```text
atomic and biased reference counting
container synchronization
allocator coordination
garbage collector changes
safer ownership models
interpreter state isolation
extension compatibility work
```

The GIL historically acted as a global correctness mechanism. Removing it forces CPython to make synchronization explicit across the entire runtime.

The result is a more parallel interpreter, but also a more complex one.
