PEP 703 no-GIL build: per-object locking, biased reference counting, and the free-threaded evaluation loop.
Free-threaded CPython is a major redesign of the interpreter runtime that removes the traditional Global Interpreter Lock (GIL) and allows multiple threads to execute Python bytecode concurrently within the same interpreter.
Historically, CPython relied on the GIL to serialize execution of Python code. The GIL simplified memory management, reference counting, object mutation, allocator coordination, and internal runtime invariants, because only one thread at a time executed Python bytecode inside a given interpreter.
Free-threaded CPython changes this model.
The runtime must now preserve interpreter correctness while multiple CPU threads simultaneously manipulate Python objects, dictionaries, frames, reference counts, caches, and internal runtime structures.
This chapter examines:
why the GIL existed
why removing it is difficult
how free-threaded CPython works
how memory management changes
how object access changes
how container synchronization works
how extension compatibility changes
what performance tradeoffs appear
The free-threaded work is one of the largest architectural changes in CPython history.
92.1 Historical Background
CPython traditionally used a single global lock protecting interpreter execution.
Conceptually:
Thread A acquires GIL
executes bytecode
Thread B waits
Thread A releases GIL
Thread B acquires GIL
This gave CPython several properties:
| Property | Effect |
|---|---|
| Reference counting updates are serialized | ob_refcnt operations stay simple |
| Object mutation is implicitly protected | Many internals avoid fine-grained locking |
| Interpreter state remains coherent | Frames and caches avoid races |
| Extension authors assume single-threaded interpreter execution | Simpler C APIs |
The cost was limited parallel execution for CPU-bound Python code.
Example:
import threading
def work():
total = 0
for i in range(100_000_000):
total += i
threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
t.start()
for t in threads:
t.join()
Traditional CPython usually does not achieve near-4x CPU scaling here because threads compete for the GIL.
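Before free-threading, the standard workaround for CPU-bound parallelism was process-based: each process has its own interpreter and its own GIL. A minimal sketch (the chunk boundaries and worker count are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(bounds):
    lo, hi = bounds
    return sum(range(lo, hi))

if __name__ == "__main__":
    # Four processes, each with its own interpreter and its own GIL,
    # so the four partial sums genuinely run in parallel.
    chunks = [(0, 25_000), (25_000, 50_000), (50_000, 75_000), (75_000, 100_000)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        total = sum(pool.map(partial_sum, chunks))
    assert total == sum(range(100_000))
```

The cost of this workaround — serialization overhead and no shared objects — is part of the motivation for free-threading.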
The GIL became one of the defining implementation characteristics of CPython.
92.2 Why the GIL Was Difficult to Remove
The GIL was not merely a scheduling mechanism.
It acted as a global correctness boundary.
Without the GIL, nearly every runtime subsystem becomes concurrently mutable:
reference counts
object headers
dictionaries
lists
type caches
attribute caches
allocator metadata
garbage collector state
interned strings
import state
frame stacks
exception state
Consider a simple increment:
x += 1
Under the GIL:
load x
compute x + 1
store x
No other thread can mutate interpreter state during any single one of these bytecode operations (a thread switch may still occur between bytecodes, so even under the GIL x += 1 is not atomic at the Python level).
Without the GIL:
Thread A reads x
Thread B reads x
Thread A writes x + 1
Thread B writes stale value
The runtime must now enforce synchronization explicitly.
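This lost-update interleaving is observable from pure Python even today, because x += 1 compiles to several bytecodes; application code fixes it with a lock. A minimal sketch:

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_inc(n):
    # Shown for contrast, not run below: counter += 1 is a
    # read-modify-write and can lose updates between threads.
    global counter
    for _ in range(n):
        counter += 1

def safe_inc(n):
    global counter
    for _ in range(n):
        with lock:          # serialize the whole read-modify-write
            counter += 1

threads = [threading.Thread(target=safe_inc, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == 40_000    # the locked version never loses updates
```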
The challenge extends far beyond Python-level semantics.
Even this operation becomes unsafe:
Py_INCREF(obj);
Traditional CPython used plain integer increments:
++obj->ob_refcnt;
Without the GIL, concurrent increments can race.
The free-threaded runtime therefore changes fundamental assumptions across the interpreter.
92.3 The Free-Threaded Build
CPython 3.13 introduces an experimental free-threaded build configuration (PEP 703).
The build disables the traditional GIL and enables runtime mechanisms required for concurrent execution.
Conceptually:
traditional build
one thread executes Python bytecode at a time
free-threaded build
multiple threads execute Python bytecode simultaneously
This is not merely a runtime flag.
Large parts of the interpreter behave differently:
reference counting strategy
container synchronization
allocator coordination
object access rules
C extension requirements
runtime invariants
The free-threaded runtime aims to preserve Python language semantics while changing interpreter-level concurrency guarantees.
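Code can detect which build it is running on. A sketch using the Py_GIL_DISABLED build variable and, where available, sys._is_gil_enabled() (both exist on CPython 3.13+; on older versions the variable is absent and the helper falls back):

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 on free-threaded builds, 0 or None elsewhere.
is_free_threaded = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# On 3.13+, sys._is_gil_enabled() reports whether the GIL is active
# right now: a free-threaded build can still re-enable the GIL, for
# example when an incompatible extension module is imported.
gil_active = sys._is_gil_enabled() if hasattr(sys, "_is_gil_enabled") else True
```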
92.4 Atomic Reference Counting
Reference counting is one of the central problems in free-threaded CPython.
Traditional CPython:
obj->ob_refcnt++;
obj->ob_refcnt--;
This is unsafe under concurrent execution.
Free-threaded CPython uses atomic operations for many reference count updates.
Conceptually:
atomic_fetch_add(&obj->ob_refcnt, 1);
atomic_fetch_sub(&obj->ob_refcnt, 1);
Atomic operations guarantee correctness under concurrent modification.
However, they introduce costs:
| Cost | Reason |
|---|---|
| Higher instruction overhead | Atomic operations are more expensive |
| Cache synchronization | CPU cores coordinate cache lines |
| Memory ordering constraints | Stronger synchronization semantics |
| Reduced locality | Shared objects bounce between cores |
Reference counting becomes one of the major scalability bottlenecks in a highly parallel runtime.
92.5 Biased Reference Counting
Free-threaded CPython introduces techniques to reduce atomic overhead.
One important strategy is biased reference counting.
The idea:
most objects are heavily used by one thread
avoid global atomic synchronization when possible
delay or batch cross-thread coordination
Conceptually:
thread-local reference ownership
+
shared atomic reference state
A thread can manipulate references cheaply while ownership remains local.
Cross-thread sharing requires synchronization.
This reduces contention for common cases:
def local_work():
xs = []
for i in range(1_000_000):
xs.append(i)
Most objects here remain thread-local.
The runtime attempts to avoid expensive global atomic traffic for such objects.
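The local-plus-shared split can be modeled with a toy class. This is a pedagogical sketch only: in real biased reference counting the two counts live in the object header, and the shared half uses atomic instructions rather than a mutex:

```python
import threading

class BiasedCount:
    """Toy model of a biased reference count: the owning thread updates
    a plain integer; all other threads go through a lock-protected
    shared counter (standing in for atomic operations)."""

    def __init__(self):
        self.owner = threading.get_ident()  # thread that created the object
        self.local = 1                      # uncontended, owner-only count
        self.shared = 0                     # cross-thread contributions
        self._lock = threading.Lock()

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1                 # cheap: no synchronization
        else:
            with self._lock:                # expensive: cross-thread path
                self.shared += 1

    def total(self):
        with self._lock:
            return self.local + self.shared

rc = BiasedCount()
rc.incref()                                 # owner fast path
t = threading.Thread(target=rc.incref)      # non-owner slow path
t.start()
t.join()
assert rc.total() == 3
```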
92.6 Object Immortality
Another optimization is immortal objects.
Some objects are effectively permanent:
None
True
False
small integers
interned constants
builtin singletons
Traditionally, these still participated in reference counting.
CPython introduced immortal objects (PEP 683, in 3.12), whose reference counts no longer behave normally; the free-threaded build relies on them heavily.
Conceptually:
immortal object
refcount never reaches zero
no deallocation
many INCREF/DECREF operations skipped
This reduces synchronization overhead for heavily shared objects.
For example:
x = None
This assignment would otherwise produce enormous cross-thread reference count traffic.
Immortal objects remove much of this pressure.
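The effect is observable from Python via sys.getrefcount. In the sketch below, on builds with immortal objects (CPython 3.12+) the reported count for None does not move, while older builds report roughly a million new references:

```python
import sys

before = sys.getrefcount(None)
refs = [None] * 1_000_000      # a million new references to None
after = sys.getrefcount(None)
delta = after - before

# On builds with immortal objects (PEP 683), delta is 0: the count is
# pinned and INCREF/DECREF on None are effectively no-ops.
# On older builds, delta is about one million.
```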
92.7 Container Synchronization
Containers become major synchronization points.
Examples:
list.append(x)
dict[key] = value
set.add(x)
Under the GIL, internal container state was implicitly protected.
Without the GIL, concurrent mutations must coordinate safely.
The runtime introduces internal synchronization mechanisms.
Conceptually:
per-container locks
atomic state transitions
careful resize coordination
safe iteration invariants
A dictionary resize becomes particularly difficult.
Traditional dict resize:
allocate new table
rehash entries
replace table pointer
free old table
Without synchronization, another thread may:
read partially migrated table
follow invalid pointer
observe inconsistent state
The free-threaded runtime must guarantee container integrity during concurrent access.
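At the Python level, the idea can be approximated by guarding a dict with one explicit lock; CPython's internal scheme is finer-grained than this single mutex, so treat this as a sketch of the invariant, not the implementation:

```python
import threading

class LockedDict:
    """A dict guarded by one lock: coarse-grained but race-free."""

    def __init__(self):
        self._d = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._d[key] = value   # a resize, if triggered, happens under the lock

    def get(self, key, default=None):
        with self._lock:
            return self._d.get(key, default)

    def __len__(self):
        with self._lock:
            return len(self._d)

d = LockedDict()
writers = [
    threading.Thread(target=lambda base=b: [d.set((base, i), i) for i in range(1000)])
    for b in range(4)
]
for t in writers:
    t.start()
for t in writers:
    t.join()
assert len(d) == 4000          # every write survived, including resizes
```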
92.8 Memory Allocation Under Concurrency
CPython includes specialized allocators:
pymalloc
arena allocators
object free lists
small object allocators
These systems historically assumed GIL protection.
Free-threaded execution requires allocator synchronization.
Challenges include:
concurrent allocation
concurrent free
free list corruption
arena reuse races
cache locality degradation
false sharing
The runtime attempts to preserve allocation performance while ensuring correctness.
Thread-local allocation structures become increasingly important.
92.9 Garbage Collection Changes
The cyclic garbage collector must also adapt.
Traditional CPython could often assume interpreter-wide serialization during GC-sensitive operations.
Free-threaded execution introduces new problems:
objects mutate during collection
reference graphs change concurrently
container traversal races appear
finalizers execute concurrently
The collector must coordinate safely with running threads.
Key challenges:
| Problem | Example |
|---|---|
| Object mutation during traversal | List contents change while scanning |
| Concurrent resurrection | __del__ creates new references |
| Cross-thread visibility | One thread frees object seen by another |
| Container instability | Dict resize during traversal |
The collector therefore requires stronger synchronization and more careful state management.
92.10 Interpreter State Isolation
Traditional CPython relied heavily on process-global state.
Examples:
interned strings
type caches
import caches
runtime registries
allocator state
Free-threaded work pushes CPython toward improved interpreter isolation.
This overlaps with subinterpreter work.
The runtime increasingly distinguishes:
process-global state
interpreter-local state
thread-local state
This decomposition is necessary for scalable concurrency.
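The thread-local tier of this decomposition is already visible to Python programs through threading.local, which gives each thread an independent namespace:

```python
import threading

state = threading.local()      # each thread sees its own attribute namespace
results = {}

def worker(name):
    state.value = name         # does not clash with other threads' writes
    results[name] = state.value

threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert results == {f"t{i}": f"t{i}" for i in range(4)}
```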
92.11 Frame Execution Under Parallelism
Frames represent active execution contexts.
A frame contains:
instruction pointer
locals
stack
exception state
code object
Traditional CPython assumed only one thread executed a frame at a time.
Free-threaded CPython must enforce stronger ownership guarantees.
Conceptually:
a frame belongs to one executing thread
shared frame access requires synchronization
Debuggers, profilers, tracers, and introspection tools become more complicated because execution can now proceed simultaneously across many interpreter threads.
92.12 Bytecode Evaluation Without the GIL
The evaluation loop changes substantially.
Traditional interpreter:
acquire GIL
execute bytecode
release GIL periodically
Free-threaded interpreter:
execute bytecode concurrently
coordinate mutable shared state explicitly
This affects:
attribute caches
inline caches
specialization metadata
object access
exception handling
call machinery
The adaptive interpreter introduced in newer CPython versions must now operate correctly under concurrent mutation.
92.13 C Extension Compatibility
C extensions are one of the hardest compatibility problems.
Many extensions historically assumed:
the GIL protects internal state
PyObject operations are serialized
reference counting is implicitly safe
container access is effectively single-threaded
These assumptions become invalid in free-threaded mode.
Unsafe example:
static PyObject *global_cache;
Multiple threads may now mutate or access this simultaneously.
Extension authors must reconsider:
locking
thread ownership
reference lifetime
global state
borrowed references
shared buffers
Some extensions remain incompatible until rewritten.
92.14 Borrowed References Become Dangerous
Borrowed references are especially problematic.
Traditional CPython often relied on the GIL:
PyObject *item = PyList_GET_ITEM(list, 0);
This returns a borrowed reference.
Under the GIL:
another thread cannot concurrently destroy list item
Without the GIL:
another thread may mutate list
another thread may delete object
borrowed pointer may become invalid
This creates severe safety hazards.
Free-threaded CPython pushes toward safer ownership models and stronger APIs.
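Pure-Python code does not face this hazard, because every name holds an owned (strong) reference; the C-level danger is precisely that a borrowed pointer carries no such guarantee. A contrast sketch:

```python
items = [[1, 2, 3]]
first = items[0]      # Python gives an owned reference, not a borrowed one
items.clear()         # drops the list's reference to the inner object...
first.append(4)       # ...but `first` keeps it alive
assert first == [1, 2, 3, 4]
# At the C level, PyList_GET_ITEM would have returned a borrowed pointer
# here, and the clear() could have freed the object out from under it.
```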
92.15 Performance Tradeoffs
Removing the GIL does not automatically improve performance.
Single-thread performance may decrease due to:
atomic operations
extra synchronization
cache contention
larger metadata
locking overhead
memory fences
Parallel workloads may improve substantially.
Typical tradeoff:
| Workload | Effect |
|---|---|
| Single-thread CPU-bound | Often slower |
| Multi-thread CPU-bound | Potentially much faster |
| I/O-bound | Smaller difference |
| Allocation-heavy | May suffer from contention |
| Shared-object-heavy | May suffer from cache synchronization |
The runtime therefore balances:
single-thread efficiency
parallel scalability
compatibility
implementation complexity
92.16 False Sharing and Cache Coherence
Modern multicore systems introduce hardware-level costs.
Suppose two threads repeatedly update reference counts on nearby objects.
CPU cache lines may bounce between cores:
Core A modifies cache line
Core B invalidates cache line
Core A reloads cache line
This is called false sharing.
Even logically independent objects can interfere through cache coherence protocols.
Free-threaded runtime design therefore depends heavily on:
memory layout
allocator design
object placement
cache locality
ownership heuristics
Concurrency performance is often dominated by hardware memory behavior rather than algorithmic complexity alone.
92.17 Lock Granularity
A free-threaded runtime must decide lock granularity carefully.
Coarse-grained locks:
simpler correctness
less parallelism
more contention
Fine-grained locks:
better scalability
higher complexity
deadlock risk
larger metadata cost
CPython historically favored simplicity through the GIL.
Free-threaded CPython must move toward more localized synchronization without making the runtime unmaintainable.
This is one of the core architectural tensions in the project.
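One common middle ground between the two extremes is lock striping: partition the state across a fixed set of locks. A sketch (the stripe count and hashing scheme are illustrative):

```python
import threading

class StripedCounter:
    """Counters partitioned across N locks: less contention than one
    global lock, less metadata than one lock per key."""

    def __init__(self, stripes=8):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._counts = [dict() for _ in range(stripes)]

    def _stripe(self, key):
        return hash(key) % len(self._locks)   # pick the stripe for this key

    def incr(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            self._counts[i][key] = self._counts[i].get(key, 0) + 1

    def get(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._counts[i].get(key, 0)

c = StripedCounter()
threads = [
    threading.Thread(target=lambda: [c.incr("hits") for _ in range(1000)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert c.get("hits") == 4000
```

Threads touching different stripes never contend, while correctness within a stripe is still guaranteed by its lock.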
92.18 Thread Safety of Built-in Types
Built-in operations acquire new semantics under parallelism.
Questions include:
Can two threads append to one list safely?
Can iteration proceed during mutation?
What operations are atomic?
What consistency guarantees exist?
The runtime attempts to preserve intuitive safety while avoiding excessive locking.
However, Python programs should still avoid unsynchronized shared mutable state where possible.
Example:
shared = []
def worker():
for i in range(1000):
shared.append(i)
The runtime may preserve structural integrity of the list, but logical ordering and higher-level invariants still require application-level synchronization.
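A lock makes such a higher-level invariant explicit. Extending the example above (the recorded positions are an illustrative invariant that append alone cannot guarantee):

```python
import threading

shared = []
lock = threading.Lock()

def worker(base):
    for _ in range(1000):
        with lock:
            # The lock makes the read of len() and the append one unit,
            # so each recorded position matches the item's final index.
            shared.append((base, len(shared)))

threads = [threading.Thread(target=worker, args=(b,)) for b in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(shared) == 4000
assert all(pos == i for i, (_, pos) in enumerate(shared))
```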
92.19 Interaction With Subinterpreters
Subinterpreters and free-threading are related but distinct.
Subinterpreters isolate runtime state:
modules
globals
builtins
execution state
Free-threading allows concurrent execution inside one interpreter.
Together, they support future scalability directions:
multiple isolated interpreters
parallel execution
reduced global runtime state
better multicore utilization
The long-term architecture increasingly moves away from large globally shared runtime structures.
92.20 Runtime Invariants Become Explicit
The GIL historically hid many implicit assumptions.
Example assumptions:
reference counts never race
dict mutation is serialized
frame stacks are stable
object lifetime is predictable
Free-threaded CPython forces these assumptions to become explicit runtime invariants.
Every subsystem must answer:
Who owns this object?
Who may mutate this state?
What synchronization protects this structure?
When is this pointer valid?
What ordering guarantees exist?
This changes the engineering style of the interpreter itself.
92.21 Tooling and Debugging Challenges
Concurrent runtimes are harder to debug.
Problems include:
race conditions
deadlocks
heisenbugs
timing-sensitive corruption
memory visibility bugs
Traditional deterministic assumptions become weaker.
Debugging tools must handle:
simultaneous frame execution
parallel object mutation
cross-thread reference lifetime
concurrent allocator activity
Testing also becomes more difficult because many concurrency bugs appear nondeterministically.
92.22 Free-Threading and Python Semantics
The Python language itself changes relatively little.
Most user-visible semantics remain stable:
x = [1, 2, 3]
x.append(4)
This still behaves as expected.
The major changes are implementation-level:
actual parallel bytecode execution
different performance characteristics
different extension safety requirements
different memory synchronization costs
The goal is preserving Python behavior while changing runtime scalability.
92.23 Long-Term Implications
Free-threaded CPython affects nearly every part of the ecosystem:
| Area | Impact |
|---|---|
| Interpreter runtime | Fundamental redesign |
| C extensions | Compatibility changes |
| Scientific computing | Better multicore scaling potential |
| Web servers | Improved concurrent execution |
| Tooling | Harder concurrency debugging |
| Allocators | Higher synchronization complexity |
| Object model | New lifetime rules |
| Performance engineering | Cache behavior becomes central |
The project represents a shift from single-thread simplicity toward parallel runtime scalability, while attempting to preserve compatibility with decades of Python software.
92.24 Chapter Summary
Free-threaded CPython removes the traditional Global Interpreter Lock and allows multiple threads to execute Python bytecode concurrently inside one interpreter.
Achieving this requires major runtime redesigns:
atomic and biased reference counting
container synchronization
allocator coordination
garbage collector changes
safer ownership models
interpreter state isolation
extension compatibility work
The GIL historically acted as a global correctness mechanism. Removing it forces CPython to make synchronization explicit across the entire runtime.
The result is a more parallel interpreter, but also a more complex one.