PEP 703 no-GIL build: per-object locking, biased reference counting, and the free-threaded evaluation loop.
Free-threaded CPython is a major redesign of the interpreter runtime that removes the traditional Global Interpreter Lock (GIL) and allows multiple threads to execute Python bytecode concurrently within the same interpreter.
Historically, CPython relied on the GIL to serialize execution of Python code. The GIL simplified memory management, reference counting, object mutation, allocator coordination, and internal runtime invariants, because only one thread at a time executed Python bytecode inside a given interpreter.
Free-threaded CPython changes this model.
The runtime must now preserve interpreter correctness while multiple CPU threads simultaneously manipulate Python objects, dictionaries, frames, reference counts, caches, and internal runtime structures.
This chapter examines:
why the GIL existed
why removing it is difficult
how free-threaded CPython works
how memory management changes
how object access changes
how container synchronization works
how extension compatibility changes
what performance tradeoffs appear
The free-threaded work is one of the largest architectural changes in CPython history.
92.1 Historical Background
CPython traditionally used a single global lock protecting interpreter execution.
Conceptually:
Thread A acquires GIL
executes bytecode
Thread B waits
Thread A releases GIL
Thread B acquires GIL
This gave CPython several properties:
| Property | Effect |
|---|---|
| Reference counting updates are serialized | ob_refcnt operations stay simple |
| Object mutation is implicitly protected | Many internals avoid fine-grained locking |
| Interpreter state remains coherent | Frames and caches avoid races |
| Extension authors assume single-threaded interpreter execution | Simpler C APIs |
The cost was limited parallel execution for CPU-bound Python code.
Example:
import threading
def work():
total = 0
for i in range(100_000_000):
total += i
threads = [threading.Thread(target=work) for _ in range(4)]
for t in threads:
t.start()
for t in threads:
t.join()
Traditional CPython usually does not achieve near-4x CPU scaling here because threads compete for the GIL.
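Before free-threading, the standard workaround for CPU-bound parallelism was process-based: each process has its own interpreter and its own GIL. A minimal sketch (the chunk boundaries and worker count are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(bounds):
    lo, hi = bounds
    return sum(range(lo, hi))

if __name__ == "__main__":
    # Four processes, each with its own interpreter and its own GIL,
    # so the four partial sums genuinely run in parallel.
    chunks = [(0, 25_000), (25_000, 50_000), (50_000, 75_000), (75_000, 100_000)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        total = sum(pool.map(partial_sum, chunks))
    assert total == sum(range(100_000))
```

The cost of this workaround — serialization overhead and no shared objects — is part of the motivation for free-threading.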
The GIL became one of the defining implementation characteristics of CPython.
92.2 Why the GIL Was Difficult to Remove
The GIL was not merely a scheduling mechanism.
It acted as a global correctness boundary.
Without the GIL, nearly every runtime subsystem becomes concurrently mutable:
reference counts
object headers
dictionaries
lists
type caches
attribute caches
allocator metadata
garbage collector state
interned strings
import state
frame stacks
exception state
Consider a simple increment:
x += 1
Under the GIL:
load x
compute x + 1
store x
No other thread can mutate interpreter state during any single one of these bytecode operations (a thread switch may still occur between bytecodes, so even under the GIL x += 1 is not atomic at the Python level).
Without the GIL:
Thread A reads x
Thread B reads x
Thread A writes x + 1
Thread B writes stale value
The runtime must now enforce synchronization explicitly.
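This lost-update interleaving is observable from pure Python even today, because x += 1 compiles to several bytecodes; application code fixes it with a lock. A minimal sketch:

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_inc(n):
    # Shown for contrast, not run below: counter += 1 is a
    # read-modify-write and can lose updates between threads.
    global counter
    for _ in range(n):
        counter += 1

def safe_inc(n):
    global counter
    for _ in range(n):
        with lock:          # serialize the whole read-modify-write
            counter += 1

threads = [threading.Thread(target=safe_inc, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == 40_000    # the locked version never loses updates
```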
The challenge extends far beyond Python-level semantics.
Even this operation becomes unsafe:
Py_INCREF(obj);
Traditional CPython used plain integer increments:
++obj->ob_refcnt;
Without the GIL, concurrent increments can race.
The free-threaded runtime therefore changes fundamental assumptions across the interpreter.
92.3 The Free-Threaded Build
CPython 3.13 introduces an experimental free-threaded build configuration (PEP 703).
The build disables the traditional GIL and enables runtime mechanisms required for concurrent execution.
Conceptually:
traditional build
one thread executes Python bytecode at a time
free-threaded build
multiple threads execute Python bytecode simultaneously
This is not merely a runtime flag.
Large parts of the interpreter behave differently:
reference counting strategy
container synchronization
allocator coordination
object access rules
C extension requirements
runtime invariants
The free-threaded runtime aims to preserve Python language semantics while changing interpreter-level concurrency guarantees.
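Code can detect which build it is running on. A sketch using the Py_GIL_DISABLED build variable and, where available, sys._is_gil_enabled() (both exist on CPython 3.13+; on older versions the variable is absent and the helper falls back):

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 on free-threaded builds, 0 or None elsewhere.
is_free_threaded = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# On 3.13+, sys._is_gil_enabled() reports whether the GIL is active
# right now: a free-threaded build can still re-enable the GIL, for
# example when an incompatible extension module is imported.
gil_active = sys._is_gil_enabled() if hasattr(sys, "_is_gil_enabled") else True
```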
92.4 Atomic Reference Counting
Reference counting is one of the central problems in free-threaded CPython.
Traditional CPython:
obj->ob_refcnt++;
obj->ob_refcnt--;
This is unsafe under concurrent execution.
Free-threaded CPython uses atomic operations for many reference count updates.
Conceptually:
atomic_fetch_add(&obj->ob_refcnt, 1);
atomic_fetch_sub(&obj->ob_refcnt, 1);
Atomic operations guarantee correctness under concurrent modification.
However, they introduce costs:
| Cost | Reason |
|---|---|
| Higher instruction overhead | Atomic operations are more expensive |
| Cache synchronization | CPU cores coordinate cache lines |
| Memory ordering constraints | Stronger synchronization semantics |
| Reduced locality | Shared objects bounce between cores |
Reference counting becomes one of the major scalability bottlenecks in a highly parallel runtime.
92.5 Biased Reference Counting
Free-threaded CPython introduces techniques to reduce atomic overhead.
One important strategy is biased reference counting.
The idea:
most objects are heavily used by one thread
avoid global atomic synchronization when possible
delay or batch cross-thread coordination
Conceptually:
thread-local reference ownership
+
shared atomic reference state
A thread can manipulate references cheaply while ownership remains local.
Cross-thread sharing requires synchronization.
This reduces contention for common cases:
def local_work():
xs = []
for i in range(1_000_000):
xs.append(i)
Most objects here remain thread-local.
The runtime attempts to avoid expensive global atomic traffic for such objects.
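The local-plus-shared split can be modeled with a toy class. This is a pedagogical sketch only: in real biased reference counting the two counts live in the object header, and the shared half uses atomic instructions rather than a mutex:

```python
import threading

class BiasedCount:
    """Toy model of a biased reference count: the owning thread updates
    a plain integer; all other threads go through a lock-protected
    shared counter (standing in for atomic operations)."""

    def __init__(self):
        self.owner = threading.get_ident()  # thread that created the object
        self.local = 1                      # uncontended, owner-only count
        self.shared = 0                     # cross-thread contributions
        self._lock = threading.Lock()

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1                 # cheap: no synchronization
        else:
            with self._lock:                # expensive: cross-thread path
                self.shared += 1

    def total(self):
        with self._lock:
            return self.local + self.shared

rc = BiasedCount()
rc.incref()                                 # owner fast path
t = threading.Thread(target=rc.incref)      # non-owner slow path
t.start()
t.join()
assert rc.total() == 3
```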
92.6 Object Immortality
Another optimization is immortal objects.
Some objects are effectively permanent:
None
True
False
small integers
interned constants
builtin singletons
Traditionally, these still participated in reference counting.
CPython introduced immortal objects (PEP 683, in 3.12), whose reference counts no longer behave normally; the free-threaded build relies on them heavily.
Conceptually:
immortal object
refcount never reaches zero
no deallocation
many INCREF/DECREF operations skipped
This reduces synchronization overhead for heavily shared objects.
For example:
x = None
This assignment would otherwise produce enormous cross-thread reference count traffic.
Immortal objects remove much of this pressure.
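The effect is observable from Python via sys.getrefcount. In the sketch below, on builds with immortal objects (CPython 3.12+) the reported count for None does not move, while older builds report roughly a million new references:

```python
import sys

before = sys.getrefcount(None)
refs = [None] * 1_000_000      # a million new references to None
after = sys.getrefcount(None)
delta = after - before

# On builds with immortal objects (PEP 683), delta is 0: the count is
# pinned and INCREF/DECREF on None are effectively no-ops.
# On older builds, delta is about one million.
```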
92.7 Container Synchronization
Containers become major synchronization points.
Examples:
list.append(x)
dict[key] = value
set.add(x)
Under the GIL, internal container state was implicitly protected.
Without the GIL, concurrent mutations must coordinate safely.
The runtime introduces internal synchronization mechanisms.
Conceptually:
per-container locks
atomic state transitions
careful resize coordination
safe iteration invariants
A dictionary resize becomes particularly difficult.
Traditional dict resize:
allocate new table
rehash entries
replace table pointer
free old table
Without synchronization, another thread may:
read partially migrated table
follow invalid pointer
observe inconsistent state
The free-threaded runtime must guarantee container integrity during concurrent access.
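At the Python level, the idea can be approximated by guarding a dict with one explicit lock; CPython's internal scheme is finer-grained than this single mutex, so treat this as a sketch of the invariant, not the implementation:

```python
import threading

class LockedDict:
    """A dict guarded by one lock: coarse-grained but race-free."""

    def __init__(self):
        self._d = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._d[key] = value   # a resize, if triggered, happens under the lock

    def get(self, key, default=None):
        with self._lock:
            return self._d.get(key, default)

    def __len__(self):
        with self._lock:
            return len(self._d)

d = LockedDict()
writers = [
    threading.Thread(target=lambda base=b: [d.set((base, i), i) for i in range(1000)])
    for b in range(4)
]
for t in writers:
    t.start()
for t in writers:
    t.join()
assert len(d) == 4000          # every write survived, including resizes
```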
92.8 Memory Allocation Under Concurrency
CPython includes specialized allocators:
pymalloc
arena allocators
object free lists
small object allocators
These systems historically assumed GIL protection.
Free-threaded execution requires allocator synchronization.
Challenges include:
concurrent allocation
concurrent free
free list corruption
arena reuse races
cache locality degradation
false sharing
The runtime attempts to preserve allocation performance while ensuring correctness.
Thread-local allocation structures become increasingly important.
92.9 Garbage Collection Changes
The cyclic garbage collector must also adapt.
Traditional CPython could often assume interpreter-wide serialization during GC-sensitive operations.
Free-threaded execution introduces new problems:
objects mutate during collection
reference graphs change concurrently
container traversal races appear
finalizers execute concurrently
The collector must coordinate safely with running threads.
Key challenges:
| Problem | Example |
|---|---|
| Object mutation during traversal | List contents change while scanning |
| Concurrent resurrection | __del__ creates new references |
| Cross-thread visibility | One thread frees object seen by another |
| Container instability | Dict resize during traversal |
The collector therefore requires stronger synchronization and more careful state management.
92.10 Interpreter State Isolation
Traditional CPython relied heavily on process-global state.
Examples:
interned strings
type caches
import caches
runtime registries
allocator state
Free-threaded work pushes CPython toward improved interpreter isolation.
This overlaps with subinterpreter work.
The runtime increasingly distinguishes:
process-global state
interpreter-local state
thread-local state
This decomposition is necessary for scalable concurrency.
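The thread-local tier of this decomposition is already visible to Python programs through threading.local, which gives each thread an independent namespace:

```python
import threading

state = threading.local()      # each thread sees its own attribute namespace
results = {}

def worker(name):
    state.value = name         # does not clash with other threads' writes
    results[name] = state.value

threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert results == {f"t{i}": f"t{i}" for i in range(4)}
```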
92.11 Frame Execution Under Parallelism
Frames represent active execution contexts.
A frame contains:
instruction pointer
locals
stack
exception state
code object
Traditional CPython assumed only one thread executed a frame at a time.
Free-threaded CPython must enforce stronger ownership guarantees.
Conceptually:
a frame belongs to one executing thread
shared frame access requires synchronization
Debuggers, profilers, tracers, and introspection tools become more complicated because execution can now proceed simultaneously across many interpreter threads.
92.12 Bytecode Evaluation Without the GIL
The evaluation loop changes substantially.
Traditional interpreter:
acquire GIL
execute bytecode
release GIL periodically
Free-threaded interpreter:
execute bytecode concurrently
coordinate mutable shared state explicitly
This affects:
attribute caches
inline caches
specialization metadata
object access
exception handling
call machinery
The adaptive interpreter introduced in newer CPython versions must now operate correctly under concurrent mutation.
92.13 C Extension Compatibility
C extensions are one of the hardest compatibility problems.
Many extensions historically assumed:
the GIL protects internal state
PyObject operations are serialized
reference counting is implicitly safe
container access is effectively single-threaded
These assumptions become invalid in free-threaded mode.
Unsafe example:
static PyObject *global_cache;
Multiple threads may now mutate or access this simultaneously.
Extension authors must reconsider:
locking
thread ownership
reference lifetime
global state
borrowed references
shared buffers
Some extensions remain incompatible until rewritten.
92.14 Borrowed References Become Dangerous
Borrowed references are especially problematic.
Traditional CPython often relied on the GIL:
PyObject *item = PyList_GET_ITEM(list, 0);
This returns a borrowed reference.
Under the GIL:
another thread cannot concurrently destroy list item
Without the GIL:
another thread may mutate list
another thread may delete object
borrowed pointer may become invalid
This creates severe safety hazards.
Free-threaded CPython pushes toward safer ownership models and stronger APIs.
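Pure-Python code does not face this hazard, because every name holds an owned (strong) reference; the C-level danger is precisely that a borrowed pointer carries no such guarantee. A contrast sketch:

```python
items = [[1, 2, 3]]
first = items[0]      # Python gives an owned reference, not a borrowed one
items.clear()         # drops the list's reference to the inner object...
first.append(4)       # ...but `first` keeps it alive
assert first == [1, 2, 3, 4]
# At the C level, PyList_GET_ITEM would have returned a borrowed pointer
# here, and the clear() could have freed the object out from under it.
```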
92.15 Performance Tradeoffs
Removing the GIL does not automatically improve performance.
Single-thread performance may decrease due to:
atomic operations
extra synchronization
cache contention
larger metadata
locking overhead
memory fences
Parallel workloads may improve substantially.
Typical tradeoff:
| Workload | Effect |
|---|---|
| Single-thread CPU-bound | Often slower |
| Multi-thread CPU-bound | Potentially much faster |
| I/O-bound | Smaller difference |
| Allocation-heavy | May suffer from contention |
| Shared-object-heavy | May suffer from cache synchronization |
The runtime therefore balances:
single-thread efficiency
parallel scalability
compatibility
implementation complexity
92.16 False Sharing and Cache Coherence
Modern multicore systems introduce hardware-level costs.
Suppose two threads repeatedly update reference counts on nearby objects.
CPU cache lines may bounce between cores:
Core A modifies cache line
Core B invalidates cache line
Core A reloads cache line
This is called false sharing.
Even logically independent objects can interfere through cache coherence protocols.
Free-threaded runtime design therefore depends heavily on:
memory layout
allocator design
object placement
cache locality
ownership heuristics
Concurrency performance is often dominated by hardware memory behavior rather than algorithmic complexity alone.
92.17 Lock Granularity
A free-threaded runtime must decide lock granularity carefully.
Coarse-grained locks:
simpler correctness
less parallelism
more contention
Fine-grained locks:
better scalability
higher complexity
deadlock risk
larger metadata cost
CPython historically favored simplicity through the GIL.
Free-threaded CPython must move toward more localized synchronization without making the runtime unmaintainable.
This is one of the core architectural tensions in the project.
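One common middle ground between the two extremes is lock striping: partition the state across a fixed set of locks. A sketch (the stripe count and hashing scheme are illustrative):

```python
import threading

class StripedCounter:
    """Counters partitioned across N locks: less contention than one
    global lock, less metadata than one lock per key."""

    def __init__(self, stripes=8):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._counts = [dict() for _ in range(stripes)]

    def _stripe(self, key):
        return hash(key) % len(self._locks)   # pick the stripe for this key

    def incr(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            self._counts[i][key] = self._counts[i].get(key, 0) + 1

    def get(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._counts[i].get(key, 0)

c = StripedCounter()
threads = [
    threading.Thread(target=lambda: [c.incr("hits") for _ in range(1000)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert c.get("hits") == 4000
```

Threads touching different stripes never contend, while correctness within a stripe is still guaranteed by its lock.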
92.18 Thread Safety of Built-in Types
Built-in operations acquire new semantics under parallelism.
Questions include:
Can two threads append to one list safely?
Can iteration proceed during mutation?
What operations are atomic?
What consistency guarantees exist?
The runtime attempts to preserve intuitive safety while avoiding excessive locking.
However, Python programs should still avoid unsynchronized shared mutable state where possible.
Example:
shared = []
def worker():
for i in range(1000):
shared.append(i)
The runtime may preserve structural integrity of the list, but logical ordering and higher-level invariants still require application-level synchronization.
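A lock makes such a higher-level invariant explicit. Extending the example above (the recorded positions are an illustrative invariant that append alone cannot guarantee):

```python
import threading

shared = []
lock = threading.Lock()

def worker(base):
    for _ in range(1000):
        with lock:
            # The lock makes the read of len() and the append one unit,
            # so each recorded position matches the item's final index.
            shared.append((base, len(shared)))

threads = [threading.Thread(target=worker, args=(b,)) for b in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(shared) == 4000
assert all(pos == i for i, (_, pos) in enumerate(shared))
```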
92.19 Interaction With Subinterpreters
Subinterpreters and free-threading are related but distinct.
Subinterpreters isolate runtime state:
modules
globals
builtins
execution state
Free-threading allows concurrent execution inside one interpreter.
Together, they support future scalability directions:
multiple isolated interpreters
parallel execution
reduced global runtime state
better multicore utilization
The long-term architecture increasingly moves away from large globally shared runtime structures.
92.20 Runtime Invariants Become Explicit
The GIL historically hid many implicit assumptions.
Example assumptions:
reference counts never race
dict mutation is serialized
frame stacks are stable
object lifetime is predictable
Free-threaded CPython forces these assumptions to become explicit runtime invariants.
Every subsystem must answer:
Who owns this object?
Who may mutate this state?
What synchronization protects this structure?
When is this pointer valid?
What ordering guarantees exist?
This changes the engineering style of the interpreter itself.
92.21 Tooling and Debugging Challenges
Concurrent runtimes are harder to debug.
Problems include:
race conditions
deadlocks
heisenbugs
timing-sensitive corruption
memory visibility bugs
Traditional deterministic assumptions become weaker.
Debugging tools must handle:
simultaneous frame execution
parallel object mutation
cross-thread reference lifetime
concurrent allocator activity
Testing also becomes more difficult because many concurrency bugs appear nondeterministically.
92.22 Free-Threading and Python Semantics
The Python language itself changes relatively little.
Most user-visible semantics remain stable:
x = [1, 2, 3]
x.append(4)
This still behaves as expected.
The major changes are implementation-level:
actual parallel bytecode execution
different performance characteristics
different extension safety requirements
different memory synchronization costs
The goal is preserving Python behavior while changing runtime scalability.
92.23 Long-Term Implications
Free-threaded CPython affects nearly every part of the ecosystem:
| Area | Impact |
|---|---|
| Interpreter runtime | Fundamental redesign |
| C extensions | Compatibility changes |
| Scientific computing | Better multicore scaling potential |
| Web servers | Improved concurrent execution |
| Tooling | Harder concurrency debugging |
| Allocators | Higher synchronization complexity |
| Object model | New lifetime rules |
| Performance engineering | Cache behavior becomes central |
The project represents a shift from single-thread simplicity toward parallel runtime scalability, while attempting to preserve compatibility with decades of Python software.
92.24 Chapter Summary
Free-threaded CPython removes the traditional Global Interpreter Lock and allows multiple threads to execute Python bytecode concurrently inside one interpreter.
Achieving this requires major runtime redesigns:
atomic and biased reference counting
container synchronization
allocator coordination
garbage collector changes
safer ownership models
interpreter state isolation
extension compatibility work
The GIL historically acted as a global correctness mechanism. Removing it forces CPython to make synchronization explicit across the entire runtime.
The result is a more parallel interpreter, but also a more complex one.