This chapter covers the pyperformance benchmark suite, microbenchmark pitfalls, timer resolution, and interpreter warm-up effects.
Benchmarking CPython means measuring execution speed in a controlled, repeatable way. Profiling answers where time is spent. Benchmarking answers how much a change affects performance.
A benchmark should compare two states:
baseline
candidate
The baseline may be an older CPython build, an earlier version of a function, or a different implementation strategy. The candidate is the version being tested.
The goal is not to get one impressive number. The goal is to produce a trustworthy comparison.
81.1 Why Benchmarking Is Hard
Small timing differences are easy to misread.
A Python benchmark can be affected by:
CPU frequency scaling
thermal throttling
background processes
memory pressure
operating system scheduling
Python build options
compiler version
CPU cache state
adaptive interpreter warmup
garbage collection
imports
randomized hashing
I/O variability
A single run rarely means much.
Reliable benchmarking reduces noise, repeats measurements, and compares distributions.
81.2 Benchmarking vs Profiling
Profiling and benchmarking are related but different.
| Activity | Main question | Example |
|---|---|---|
| Profiling | Where is time spent? | cProfile, perf, tracemalloc |
| Benchmarking | Did performance change? | pyperf, benchmark suites |
| Microbenchmarking | How fast is one small operation? | timeit, pyperf timeit |
| Macrobenchmarking | How fast is a realistic workload? | full application or suite |
A profiler may show that attribute lookup dominates a loop. A benchmark tells whether changing the object layout improved total runtime.
81.3 What to Benchmark
Benchmark the workload that matters.
Possible benchmark targets:
small language operation
standard library function
application hot path
import time
serialization workload
web request handler
test suite runtime
compiler performance
interpreter startup
C extension boundary
A small benchmark is easier to interpret. A large benchmark is more representative.
Good performance work often uses both:
microbenchmark:
isolate mechanism
macrobenchmark:
confirm real-world effect
81.4 Baselines
A baseline must be explicit.
Bad:
this feels faster
Better:
CPython main branch at commit A
candidate branch at commit B
same compiler
same machine
same benchmark command
A benchmark result without a baseline is just a measurement.
A useful comparison records:
Python version
git commit
compiler
optimization flags
operating system
CPU model
benchmark command
environment variables
number of runs
81.5 Use pyperf for Serious Benchmarks
pyperf is the standard tool for reliable Python benchmarking.
It handles:
warmups
multiple worker processes
statistics
metadata
JSON output
result comparison
system tuning helpers
Example:
python -m pyperf timeit \
-s 'xs = list(range(1000))' \
'sum(xs)'
Save results:
python -m pyperf timeit \
-o baseline.json \
-s 'xs = list(range(1000))' \
'sum(xs)'
Compare:
python -m pyperf compare_to baseline.json candidate.json
This is much better than reading one manual timer result.
81.6 timeit
timeit is useful for quick checks.
Example:
python -m timeit -s 'x = 1' 'x + 1'
In Python code:
import timeit

duration = timeit.timeit(
    "obj.x",
    setup="""
class C:
    pass

obj = C()
obj.x = 1
""",
    number=10_000_000,
)
print(duration)
timeit is convenient, but it gives less control than pyperf.
Use timeit for exploration. Use pyperf for claims.
81.7 Warmup
Modern CPython uses adaptive specialization.
This means early executions may differ from later executions.
Example:
def f(obj):
    return obj.x
The first few executions may use generic attribute lookup. Later executions may use a specialized LOAD_ATTR path.
A benchmark must allow warmup.
Bad benchmark:
import time
start = time.perf_counter()
f(obj)
end = time.perf_counter()
print(end - start)
This mostly measures one cold call.
Better:
for _ in range(10_000):
    f(obj)

start = time.perf_counter()
for _ in range(1_000_000):
    f(obj)
end = time.perf_counter()
print(end - start)
pyperf manages warmup more carefully.
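If your pyperf version supports it (recent releases expose a --warmups option on runner commands), warmup can also be controlled explicitly from the command line; a minimal sketch:
python -m pyperf timeit \
--warmups 10 \
-s 'xs = list(range(1000))' \
'sum(xs)'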
81.8 Dead Code Elimination Is Less Relevant, But Still Think About Results
CPython generally does not perform aggressive compiler dead-code elimination like a native optimizing compiler.
This means:
x + 1
inside timeit is still executed.
However, benchmark structure still matters.
Bad:
timeit("f()", setup="def f(): return 1")This may mostly measure call overhead, not useful work.
Better:
timeit("total += f(i)", setup="def f(x): return x + 1\ntotal = 0\ni = 1")Even then, assignment scope and setup shape matter.
A benchmark should force the operation you care about to happen in the same shape as real code.
81.9 Avoid I/O in Microbenchmarks
I/O is noisy.
Avoid including these in tight microbenchmarks:
disk reads
network requests
database queries
printing
subprocesses
random sleeps
For I/O workloads, benchmark at a larger level and measure wall time, latency, throughput, and variance.
For interpreter mechanics, keep the benchmark CPU-bound and deterministic.
81.10 Avoid Measuring Setup
Keep setup outside the timed section.
Bad:
timeit("[i for i in range(1000)]")This measures both range iteration and list construction, which may be desired, but often setup accidentally dominates.
For lookup benchmarking:
timeit(
    "d['key']",
    setup="d = {'key': 1}",
)
This measures lookup, not dictionary creation.
For construction benchmarking:
timeit(
    "{str(i): i for i in range(1000)}",
)
This intentionally measures construction.
Be explicit about what the benchmark includes.
81.11 Microbenchmarks
Microbenchmarks isolate small operations.
Examples:
local variable load
global lookup
attribute access
method call
list append
dictionary lookup
function call
integer addition
exception raising
Example:
python -m pyperf timeit \
-s 'class C: pass' \
-s 'obj = C(); obj.x = 1' \
'obj.x'
Microbenchmarks are useful when testing a specific interpreter mechanism.
They are dangerous when used to claim application-level improvement.
A 20 percent improvement in one microbenchmark may produce no visible application speedup if that operation is not a dominant cost.
81.12 Macrobenchmarks
Macrobenchmarks measure larger workloads.
Examples:
run a web request handler
parse a large JSON file
render templates
run a test suite
import a package
compile many Python files
run a CLI command
execute a data-processing pipeline
Macrobenchmarks include many interacting costs.
They are harder to explain, but more representative.
A good macrobenchmark should have:
fixed input data
fixed environment
repeatable command
clear metric
low external I/O variability
81.13 Throughput and Latency
Different workloads need different metrics.
Throughput:
operations per second
requests per second
files processed per second
objects parsed per second
Latency:
time per request
time to first response
p50 latency
p95 latency
p99 latency
A change can improve throughput while worsening tail latency.
For services, benchmark distributions, not just averages.
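As a minimal sketch, percentile latencies can be computed from collected samples with the standard statistics module (the sample values here are made up):
import statistics

latencies_ms = [12.1, 12.3, 11.9, 15.8, 12.0, 30.2, 12.2, 12.4]  # illustrative samples

# quantiles(n=100) returns 99 cut points: index 49 is p50, 94 is p95, 98 is p99
cuts = statistics.quantiles(latencies_ms, n=100)
print("p50:", cuts[49])
print("p95:", cuts[94])
print("p99:", cuts[98])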
81.14 Mean, Median, and Variance
Benchmark results are distributions.
Useful values:
| Metric | Meaning |
|---|---|
| Mean | Average across runs |
| Median | Middle result, less sensitive to outliers |
| Standard deviation | Spread of results |
| Min | Best observed result |
| Max | Worst observed result |
A small speedup with high variance is weak evidence.
Example:
baseline: 100 ms ± 5 ms
candidate: 98 ms ± 6 ms
This difference may be noise.
A stronger result:
baseline: 100 ms ± 1 ms
candidate: 92 ms ± 1 ms
The distributions are clearly separated.
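pyperf can report these statistics directly from a saved result file, which is usually easier than computing them by hand; for example, using the baseline.json file saved earlier:
python -m pyperf stats baseline.json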
81.15 Geometric Mean
Benchmark suites often use geometric mean for ratios.
If each benchmark produces a speed ratio:
candidate_time / baseline_time
the geometric mean summarizes multiplicative changes better than the arithmetic mean.
Example:
benchmark A: 0.90x
benchmark B: 1.10x
benchmark C: 1.00x
The geometric mean treats relative changes consistently.
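A quick way to compute it for the ratios above, using the standard library:
import statistics

ratios = [0.90, 1.10, 1.00]  # per-benchmark candidate_time / baseline_time
print(statistics.geometric_mean(ratios))  # about 0.997, i.e. nearly no net change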
For CPython suite comparisons, avoid over-focusing on one aggregate. Look at individual benchmarks too.
81.16 Speedup and Slowdown
Use clear ratio language.
If baseline is 10 seconds and candidate is 8 seconds:
candidate is 1.25x as fast
candidate takes 20 percent less time
Calculation:
speed ratio = baseline_time / candidate_time = 10 / 8 = 1.25
time reduction = (10 - 8) / 10 = 20 percent
Do not confuse “25 percent faster” with “20 percent less time.” They are related but not identical.
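The same arithmetic as a small Python check:
baseline_time = 10.0
candidate_time = 8.0

speed_ratio = baseline_time / candidate_time                        # 1.25x as fast
time_reduction = (baseline_time - candidate_time) / baseline_time   # 0.20

print(f"{speed_ratio:.2f}x as fast, {time_reduction:.0%} less time")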
81.17 Comparing Python Versions
When comparing Python versions, use the same benchmark suite and environment.
Example:
python3.12 -m pyperf timeit -o py312.json 'sum(range(1000))'
python3.13 -m pyperf timeit -o py313.json 'sum(range(1000))'
python3.13 -m pyperf compare_to py312.json py313.json
Record:
exact Python versions
build type
compiler
CPU
operating system
command
Python version changes may affect bytecode, specialization, object layout, and standard library behavior.
81.18 Comparing CPython Branches
For CPython development, compare builds from different branches or commits.
Typical workflow:
git checkout main
./configure --prefix=/tmp/py-main CFLAGS="-O3 -g"
make -j
make install
git checkout my-branch
./configure --prefix=/tmp/py-branch CFLAGS="-O3 -g"
make -j
make install
Then run the same benchmark using both interpreters.
Avoid comparing:
debug build vs release build
different compiler flags
different system load
different dependency versions
Those differences can swamp the change being tested.
81.19 Debug Builds Distort Speed
CPython debug builds are useful for correctness work.
They add checks and diagnostics.
They may change:
object layout
reference counting overhead
assertion cost
allocator behavior
execution speed
Use debug builds for finding bugs.
Use release-like builds for speed measurements.
A common performance build uses optimization and debug symbols:
./configure CFLAGS="-O3 -g"This keeps native profiling symbols while preserving realistic optimization.
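Release binaries are typically built with profile-guided and link-time optimization as well; if you want your performance build to match that configuration, CPython's configure script supports it directly, at the cost of a much longer build:
./configure --enable-optimizations --with-lto
make -j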
81.20 System Isolation
Benchmark on a quiet system.
Good conditions:
few background tasks
stable CPU frequency
no thermal throttling
consistent power mode
enough free memory
same terminal environment
same filesystem state where relevant
On laptops, power and thermal behavior can heavily affect results.
For serious CPython work, use a dedicated machine or carefully controlled environment.
81.21 CPU Frequency
Modern CPUs change frequency dynamically.
This can distort benchmarks.
Examples:
turbo boost
thermal throttling
power saving mode
background load
laptop battery mode
A candidate benchmark may look faster simply because the CPU was running at a higher frequency.
The pyperf system tune command (python -m pyperf system tune) can help on supported systems, but you still need to understand the machine.
81.22 Hash Randomization
Python randomizes string hashes by default.
This can affect dictionary and set workloads.
For reproducibility, you may set:
PYTHONHASHSEED=0
This makes hash behavior deterministic.
However, fixed hash seeds may hide behavior that appears under normal randomized execution.
Use fixed seeds for repeatability when needed. Use varied seeds when testing robustness.
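For a quick check with the stdlib timeit module (which runs in a single process, so the environment variable applies directly), something like this keeps string hashing deterministic; the dictionary workload is only illustrative:
PYTHONHASHSEED=0 python -m timeit \
-s "d = {str(i): i for i in range(1000)}" \
"d['500']"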
81.23 Garbage Collection Effects
Garbage collection can affect benchmarks.
A benchmark that creates many container cycles may trigger cyclic GC.
You can inspect GC behavior:
import gc
print(gc.get_count())
print(gc.get_stats())
Some microbenchmarks disable GC:
import gc

gc.disable()
try:
    run_benchmark()
finally:
    gc.enable()
This is valid only if GC is not part of what you want to measure.
For allocation-heavy real workloads, disabling GC may produce misleading results.
81.24 Allocation Effects
Allocation-heavy benchmarks are sensitive to allocator state.
A benchmark that creates many objects may be affected by:
free lists
pymalloc pools
arena reuse
system allocator behavior
memory fragmentation
Repeated runs may become faster or slower as allocator state changes.
Use multiple worker processes to reduce carryover effects.
This is another reason pyperf is preferable.
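pyperf already spawns multiple worker processes by default, and the count can be raised from the command line; a sketch, assuming your pyperf version accepts -p/--processes (recent releases do):
python -m pyperf timeit \
-p 20 \
-s 'xs = list(range(1000))' \
'[x * 2 for x in xs]'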
81.25 Cache Effects
CPU cache state affects microbenchmarks.
A small benchmark may fit entirely in cache.
A real workload may have a much larger working set.
Example:
xs = list(range(100))
may benchmark very differently from:
xs = list(range(10_000_000))
Both test “loop over a list,” but their memory behavior differs.
Choose input sizes that match the question.
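One way to see the difference is to time the same statement at both sizes and divide by the element count yourself; the per-element cost usually rises once the working set stops fitting in cache:
python -m pyperf timeit -s 'xs = list(range(100))' 'sum(xs)'
python -m pyperf timeit -s 'xs = list(range(10_000_000))' 'sum(xs)'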
81.26 Branch Prediction Effects
Stable data can make branch prediction highly effective.
Example:
for x in xs:
    if x > 0:
        total += x
If every x is positive, the branch is predictable.
If signs are random, the branch may be less predictable.
CPython itself has many type and cache validation branches. Stable runtime types help both interpreter specialization and CPU branch prediction.
81.27 Adaptive Specialization Effects
Modern CPython rewrites bytecode execution paths based on runtime behavior.
This affects benchmarks of:
attribute access
global lookup
binary operations
function calls
method calls
subscript operations
To inspect specialization:
import dis
from types import SimpleNamespace

obj = SimpleNamespace(x=1)  # f() needs an object with an .x attribute

def f(obj):
    return obj.x + 1

for _ in range(10_000):
    f(obj)

dis.dis(f, adaptive=True, show_caches=True)
A benchmark should specify whether it measures cold or warm behavior.
Most throughput benchmarks should measure warm behavior.
Startup benchmarks may intentionally measure cold behavior.
81.28 Cold Benchmarks
Cold benchmarks measure first-run behavior.
Examples:
interpreter startup
first import
first request after process start
first function execution
first regex compile
first template renderCold benchmarks matter for:
CLIs
serverless functions
short-lived scripts
developer tools
test runners
Adaptive specialization may not help much if code runs only once.
81.29 Warm Benchmarks
Warm benchmarks measure steady-state behavior.
Examples:
long-running service
worker process
data pipeline
training loop
repeated request handler
Warm benchmarks should allow:
bytecode specialization
cache population
allocator stabilization
import completion
They answer a different question from cold benchmarks.
A system can have excellent warm throughput and poor cold startup.
81.30 Benchmark Suites
A benchmark suite collects multiple workloads.
For CPython, suites can include:
startup
regex
JSON
pickle
logging
template rendering
async workloads
numeric Python loops
object-heavy workloads
compiler workloads
A suite helps avoid optimizing one narrow case while slowing many others.
When interpreting a suite:
look at aggregate result
inspect individual wins
inspect individual regressions
explain outliers
One large regression may matter more than a small aggregate improvement.
81.31 The pyperformance Suite
CPython performance work often uses the pyperformance benchmark suite.
It contains a collection of Python workloads intended to track interpreter-level performance over time.
A typical workflow is:
pyperformance run -o baseline.json --python=/path/to/python-main
pyperformance run -o candidate.json --python=/path/to/python-branch
pyperformance compare baseline.json candidate.json
This provides broader coverage than a single microbenchmark.
Use it to catch unintended regressions.
81.32 Benchmarking Standard Library Changes
For standard library changes, benchmark both isolated functions and realistic use.
Example change: JSON encoder optimization.
Microbenchmark:
python -m pyperf timeit \
-s 'import json; data = {"x": list(range(1000))}' \
'json.dumps(data)'
Macrobenchmark:
application workload that serializes real payloads
The microbenchmark confirms the target improved. The macrobenchmark confirms the improvement matters.
81.33 Benchmarking Interpreter Changes
Interpreter changes can have broad effects.
Examples:
opcode handler changes
reference counting changes
dictionary layout changes
frame layout changes
allocator changes
inline cache changes
These require broad benchmark coverage.
A change that improves one opcode may regress another path through instruction cache pressure, branch behavior, or larger data structures.
Use both focused benchmarks and benchmark suites.
81.34 Benchmarking Memory
Speed is not the only metric.
Memory benchmarks measure:
peak RSS
allocated bytes
object count
arena count
working set size
GC pressure
A change may improve speed by using more memory.
That tradeoff may or may not be acceptable.
Use tools such as:
tracemalloc
resource module
platform memory tools
external RSS measurement
heap profilers
81.35 Benchmarking Startup
Startup benchmarking should isolate phases.
Useful commands:
python -S -c pass
python -c pass
python -X importtime -c "import package"Questions:
how much is interpreter startup?
how much is site import?
how much is application import?
how much is module-level work?
Startup benchmarks are sensitive to filesystem cache and environment.
Run repeatedly and compare under controlled conditions.
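One repeatable approach is to wrap each startup command with pyperf, which runs it many times and saves a comparable result file; a sketch, assuming pyperf's command subcommand accepts -o like its other runner commands:
python -m pyperf command -o startup_no_site.json -- python -S -c pass
python -m pyperf command -o startup_full.json -- python -c pass
python -m pyperf compare_to startup_no_site.json startup_full.json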
81.36 Benchmarking Imports
Import benchmarks are important for developer tools and CLIs.
Use:
python -X importtime -c "import your_package"For repeated measurement:
python -m pyperf command \
-- python -c "import your_package"Import time can regress when modules add eager imports, perform runtime type work, or execute expensive module-level initialization.
81.37 Benchmarking C Extensions
C extension benchmarks should separate:
call overhead
argument conversion
native computation
data copying
GIL behavior
result construction
Example:
benchmark empty call
benchmark small input
benchmark large input
benchmark repeated calls
benchmark batched call
A C extension may be fast internally but slow overall if it copies data or creates many Python objects.
81.38 Benchmarking Async Code
Async benchmarks need care.
Measure:
throughput
latency
event loop lag
task scheduling overhead
queue delay
I/O simulation realism
Avoid fake benchmarks that only await already-completed coroutines, unless that is the mechanism under test.
For networked async code, use controlled local servers or mocks to reduce external noise.
81.39 Benchmarking Threaded Code
Thread benchmarks should distinguish:
CPU-bound Python code
I/O-bound waiting
native code that releases the GIL
lock contention
queue overhead
Traditional CPython serializes Python bytecode execution under the GIL, so CPU-bound Python threads often do not scale.
Free-threaded builds require different benchmarks because synchronization and reference count behavior change.
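A minimal sketch of that distinction, using raw perf_counter instead of pyperf so all threads stay in one process; the workload and sizes are illustrative:
# bench_threads.py
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_work(n):
    # pure-Python, CPU-bound loop: holds the GIL on traditional builds
    total = 0
    for i in range(n):
        total += i * i
    return total

def run(threads, n=2_000_000):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(cpu_work, [n] * threads))
    return time.perf_counter() - start

# With the GIL, 4 threads each doing the same work take roughly 4x as long as 1 thread;
# a free-threaded build may scale much closer to 1x given enough cores.
print("1 thread :", run(1))
print("4 threads:", run(4))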
81.40 Benchmarking Free-Threaded CPython
Free-threaded CPython benchmarking should measure both single-thread and multi-thread behavior.
Important dimensions:
single-thread overhead
multi-thread scaling
reference count contention
allocator contention
object sharing
lock granularity
C extension compatibility
A free-threaded build may improve parallel workloads while slowing single-thread workloads.
Benchmark both. Do not report only the favorable side.
81.41 Benchmarking With Native Profilers
When a benchmark changes, use profiling to explain why.
Example workflow:
benchmark shows 8 percent slowdown
run native profiler
inspect hot symbols
find increased dict lookup cost
inspect code change
create focused microbenchmark
fix or justify regression
Benchmarks detect changes. Profilers explain changes.
The two tools should be used together.
81.42 Reporting Results
A good benchmark report includes:
summary
baseline and candidate
environment
commands
raw result files
main wins
main regressions
interpretation
known limitations
Avoid vague claims.
Bad:
this is faster
Good:
On this machine, candidate reduces median runtime for benchmark X from 120.4 ms to 111.8 ms across 20 pyperf runs. Benchmark Y regresses from 80.1 ms to 83.0 ms. Raw pyperf JSON files are attached.
81.43 Statistical Significance
Do not overstate tiny changes.
If a result is inside noise, say so.
Example:
candidate: 1.01x faster, but run-to-run variation is 1.5 percent
This is weak evidence.
pyperf compare_to helps identify meaningful differences, but human judgment still matters.
Look for stable, explainable changes.
81.44 Regression Hunting
When performance regresses, reduce the search space.
Useful workflow:
confirm regression
find affected benchmark
bisect commits
profile before and after
create smaller reproducer
identify mechanism
fix or document tradeoff
For CPython, git bisect plus a repeatable benchmark command is powerful.
A regression without a reproducer is hard to fix.
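A sketch of that bisection, assuming check_perf.sh is a script you write that rebuilds the interpreter, runs the benchmark, and exits nonzero when the result is slower than a chosen threshold (git bisect run treats a nonzero exit status, other than 125, as "bad"):
git bisect start
git bisect bad                  # current commit shows the regression
git bisect good v3.12.0         # a commit or tag known to be fast
git bisect run ./check_perf.sh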
81.45 Benchmark Reproducers
A benchmark reproducer should be:
small enough to run easily
large enough to show the effect
deterministic
documented
independent of external services
Include setup, command, input data, and expected comparison.
A good reproducer lets another developer validate the result.
81.46 Benchmarking Pitfalls in Python Code
Common pitfalls:
| Pitfall | Example | Problem |
|---|---|---|
| Measuring import in setup accidentally | setup imports large package | Hides runtime cost |
| Measuring printing | print(x) in loop | Terminal I/O dominates |
| Measuring random data generation | generate input in timed code | Mixes setup and target |
| Too few iterations | one call | High noise |
| Overly tiny operation | x + 1 only | Timer overhead and dispatch dominate |
| Wrong scope | globals instead of locals | Measures lookup difference accidentally |
A good benchmark has a narrow, explicit target.
81.47 Benchmarking Example: Attribute Access
Compare normal attributes and slots.
# bench_attr.py
import pyperf

runner = pyperf.Runner()

class Normal:
    def __init__(self):
        self.x = 1

class Slotted:
    __slots__ = ("x",)

    def __init__(self):
        self.x = 1

normal = Normal()
slotted = Slotted()

def read_normal():
    return normal.x

def read_slotted():
    return slotted.x

runner.bench_func("normal_attr", read_normal)
runner.bench_func("slotted_attr", read_slotted)
Run:
python bench_attr.py -o attr.json
This tests one narrow mechanism. It does not prove that all slotted classes are better for all applications.
81.48 Benchmarking Example: Function Call
Compare direct expression and helper function.
# bench_call.py
import pyperf

runner = pyperf.Runner()

def inc(x):
    return x + 1

def direct_loop():
    total = 0
    for i in range(10_000):
        total += i + 1
    return total

def call_loop():
    total = 0
    for i in range(10_000):
        total += inc(i)
    return total

runner.bench_func("direct_loop", direct_loop)
runner.bench_func("call_loop", call_loop)
This measures the cost of repeated Python calls in a loop.
It is useful for understanding call overhead. It does not mean helper functions should be avoided everywhere.
81.49 Benchmarking Example: Dictionary Lookup
# bench_dict.py
import pyperf

runner = pyperf.Runner()

d = {str(i): i for i in range(1000)}
keys = [str(i) for i in range(1000)]

def lookup_loop():
    total = 0
    for key in keys:
        total += d[key]
    return total

runner.bench_func("dict_lookup_loop", lookup_loop)
This benchmark includes:
string key hashing
dictionary lookup
loop overhead
integer addition
If you want only dictionary lookup, you need a narrower benchmark. If you want realistic dictionary use, this may be suitable.
81.50 Mental Model
A useful model:
Benchmarking is controlled comparison.
The core loop is:
define workload
measure baseline
measure candidate
compare distributions
explain change
confirm with profiling
A benchmark result is useful only when the workload, environment, and comparison are clear.
81.51 Chapter Summary
Benchmarking CPython requires discipline.
Use timeit for quick checks, pyperf for reliable measurement, and benchmark suites for broad coverage. Account for warmup, adaptive specialization, CPU behavior, garbage collection, allocation effects, and system noise.
Good benchmarking separates cold and warm behavior, distinguishes microbenchmarks from macrobenchmarks, records the environment, and compares distributions rather than single numbers.
For CPython work, benchmarks should be paired with profiling. Benchmarks show that performance changed. Profiling explains why.