This chapter covers the pyperformance benchmark suite, microbenchmark pitfalls, timer resolution, and interpreter warm-up effects.
Benchmarking CPython means measuring execution speed in a controlled, repeatable way. Profiling answers where time is spent. Benchmarking answers how much a change affects performance.
A benchmark should compare two states:
baseline
candidate
The baseline may be an older CPython build, an earlier version of a function, or a different implementation strategy. The candidate is the version being tested.
The goal is not to get one impressive number. The goal is to produce a trustworthy comparison.
81.1 Why Benchmarking Is Hard
Small timing differences are easy to misread.
A Python benchmark can be affected by:
CPU frequency scaling
thermal throttling
background processes
memory pressure
operating system scheduling
Python build options
compiler version
CPU cache state
adaptive interpreter warmup
garbage collection
imports
randomized hashing
I/O variability
A single run rarely means much.
Reliable benchmarking reduces noise, repeats measurements, and compares distributions.
81.2 Benchmarking vs Profiling
Profiling and benchmarking are related but different.
| Activity | Main question | Example |
|---|---|---|
| Profiling | Where is time spent? | cProfile, perf, tracemalloc |
| Benchmarking | Did performance change? | pyperf, benchmark suites |
| Microbenchmarking | How fast is one small operation? | timeit, pyperf timeit |
| Macrobenchmarking | How fast is a realistic workload? | full application or suite |
A profiler may show that attribute lookup dominates a loop. A benchmark tells whether changing the object layout improved total runtime.
81.3 What to Benchmark
Benchmark the workload that matters.
Possible benchmark targets:
small language operation
standard library function
application hot path
import time
serialization workload
web request handler
test suite runtime
compiler performance
interpreter startup
C extension boundary
A small benchmark is easier to interpret. A large benchmark is more representative.
Good performance work often uses both:
microbenchmark:
isolate mechanism
macrobenchmark:
confirm real-world effect
81.4 Baselines
A baseline must be explicit.
Bad:
this feels faster
Better:
CPython main branch at commit A
candidate branch at commit B
same compiler
same machine
same benchmark command
A benchmark result without a baseline is just a measurement.
A useful comparison records:
Python version
git commit
compiler
optimization flags
operating system
CPU model
benchmark command
environment variables
number of runs
81.5 Use pyperf for Serious Benchmarks
pyperf is the standard tool for reliable Python benchmarking.
It handles:
warmups
multiple worker processes
statistics
metadata
JSON output
result comparison
system tuning helpers
Example:
python -m pyperf timeit \
-s 'xs = list(range(1000))' \
'sum(xs)'
Save results:
python -m pyperf timeit \
-o baseline.json \
-s 'xs = list(range(1000))' \
'sum(xs)'
Compare:
python -m pyperf compare_to baseline.json candidate.json
This is much better than reading one manual timer result.
81.6 timeit
timeit is useful for quick checks.
Example:
python -m timeit -s 'x = 1' 'x + 1'
In Python code:
import timeit

duration = timeit.timeit(
    "obj.x",
    setup="""
class C:
    pass

obj = C()
obj.x = 1
""",
    number=10_000_000,
)
print(duration)
timeit is convenient, but it gives less control than pyperf.
Use timeit for exploration. Use pyperf for claims.
81.7 Warmup
Modern CPython uses adaptive specialization.
This means early executions may differ from later executions.
Example:
def f(obj):
    return obj.x
The first few executions may use generic attribute lookup. Later executions may use a specialized LOAD_ATTR path.
A benchmark must allow warmup.
Bad benchmark:
import time
start = time.perf_counter()
f(obj)
end = time.perf_counter()
print(end - start)
This mostly measures one cold call.
Better:
for _ in range(10_000):
    f(obj)

start = time.perf_counter()
for _ in range(1_000_000):
    f(obj)
end = time.perf_counter()
print(end - start)
pyperf manages warmup more carefully.
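If your pyperf version supports it (recent releases expose a --warmups option on runner commands), warmup can also be controlled explicitly from the command line; a minimal sketch:
python -m pyperf timeit \
--warmups 10 \
-s 'xs = list(range(1000))' \
'sum(xs)'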
81.8 Dead Code Elimination Is Less Relevant, But Still Think About Results
CPython generally does not perform aggressive compiler dead-code elimination like a native optimizing compiler.
This means:
x + 1
inside timeit is still executed.
However, benchmark structure still matters.
Bad:
timeit("f()", setup="def f(): return 1")This may mostly measure call overhead, not useful work.
Better:
timeit("total += f(i)", setup="def f(x): return x + 1\ntotal = 0\ni = 1")Even then, assignment scope and setup shape matter.
A benchmark should force the operation you care about to happen in the same shape as real code.
81.9 Avoid I/O in Microbenchmarks
I/O is noisy.
Avoid including these in tight microbenchmarks:
disk reads
network requests
database queries
printing
subprocesses
random sleeps
For I/O workloads, benchmark at a larger level and measure wall time, latency, throughput, and variance.
For interpreter mechanics, keep the benchmark CPU-bound and deterministic.
81.10 Avoid Measuring Setup
Keep setup outside the timed section.
Bad:
timeit("[i for i in range(1000)]")This measures both range iteration and list construction, which may be desired, but often setup accidentally dominates.
For lookup benchmarking:
timeit(
    "d['key']",
    setup="d = {'key': 1}",
)
This measures lookup, not dictionary creation.
For construction benchmarking:
timeit(
    "{str(i): i for i in range(1000)}",
)
This intentionally measures construction.
Be explicit about what the benchmark includes.
81.11 Microbenchmarks
Microbenchmarks isolate small operations.
Examples:
local variable load
global lookup
attribute access
method call
list append
dictionary lookup
function call
integer addition
exception raising
Example:
python -m pyperf timeit \
-s 'class C: pass' \
-s 'obj = C(); obj.x = 1' \
'obj.x'
Microbenchmarks are useful when testing a specific interpreter mechanism.
They are dangerous when used to claim application-level improvement.
A 20 percent improvement in one microbenchmark may produce no visible application speedup if that operation is not a dominant cost.
81.12 Macrobenchmarks
Macrobenchmarks measure larger workloads.
Examples:
run a web request handler
parse a large JSON file
render templates
run a test suite
import a package
compile many Python files
run a CLI command
execute a data-processing pipeline
Macrobenchmarks include many interacting costs.
They are harder to explain, but more representative.
A good macrobenchmark should have:
fixed input data
fixed environment
repeatable command
clear metric
low external I/O variability
81.13 Throughput and Latency
Different workloads need different metrics.
Throughput:
operations per second
requests per second
files processed per second
objects parsed per second
Latency:
time per request
time to first response
p50 latency
p95 latency
p99 latency
A change can improve throughput while worsening tail latency.
For services, benchmark distributions, not just averages.
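As a minimal sketch, percentile latencies can be computed from collected samples with the standard statistics module (the sample values here are made up):
import statistics

latencies_ms = [12.1, 12.3, 11.9, 15.8, 12.0, 30.2, 12.2, 12.4]  # illustrative samples

# quantiles(n=100) returns 99 cut points: index 49 is p50, 94 is p95, 98 is p99
cuts = statistics.quantiles(latencies_ms, n=100)
print("p50:", cuts[49])
print("p95:", cuts[94])
print("p99:", cuts[98])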
81.14 Mean, Median, and Variance
Benchmark results are distributions.
Useful values:
| Metric | Meaning |
|---|---|
| Mean | Average across runs |
| Median | Middle result, less sensitive to outliers |
| Standard deviation | Spread of results |
| Min | Best observed result |
| Max | Worst observed result |
A small speedup with high variance is weak evidence.
Example:
baseline: 100 ms ± 5 ms
candidate: 98 ms ± 6 ms
This difference may be noise.
A stronger result:
baseline: 100 ms ± 1 ms
candidate: 92 ms ± 1 ms
The distributions are clearly separated.
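pyperf can report these statistics directly from a saved result file, which is usually easier than computing them by hand; for example, using the baseline.json file saved earlier:
python -m pyperf stats baseline.json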
81.15 Geometric Mean
Benchmark suites often use geometric mean for ratios.
If each benchmark produces a speed ratio:
candidate_time / baseline_time
the geometric mean summarizes multiplicative changes better than the arithmetic mean.
Example:
benchmark A: 0.90x
benchmark B: 1.10x
benchmark C: 1.00x
The geometric mean treats relative changes consistently.
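A quick way to compute it for the ratios above, using the standard library:
import statistics

ratios = [0.90, 1.10, 1.00]  # per-benchmark candidate_time / baseline_time
print(statistics.geometric_mean(ratios))  # about 0.997, i.e. nearly no net change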
For CPython suite comparisons, avoid over-focusing on one aggregate. Look at individual benchmarks too.
81.16 Speedup and Slowdown
Use clear ratio language.
If baseline is 10 seconds and candidate is 8 seconds:
candidate is 1.25x as fast
candidate takes 20 percent less time
Calculation:
speed ratio = baseline_time / candidate_time = 10 / 8 = 1.25
time reduction = (10 - 8) / 10 = 20 percent
Do not confuse “25 percent faster” with “20 percent less time.” They are related but not identical.
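The same arithmetic as a small Python check:
baseline_time = 10.0
candidate_time = 8.0

speed_ratio = baseline_time / candidate_time                        # 1.25x as fast
time_reduction = (baseline_time - candidate_time) / baseline_time   # 0.20

print(f"{speed_ratio:.2f}x as fast, {time_reduction:.0%} less time")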
81.17 Comparing Python Versions
When comparing Python versions, use the same benchmark suite and environment.
Example:
python3.12 -m pyperf timeit -o py312.json 'sum(range(1000))'
python3.13 -m pyperf timeit -o py313.json 'sum(range(1000))'
python3.13 -m pyperf compare_to py312.json py313.json
Record:
exact Python versions
build type
compiler
CPU
operating system
command
Python version changes may affect bytecode, specialization, object layout, and standard library behavior.
81.18 Comparing CPython Branches
For CPython development, compare builds from different branches or commits.
Typical workflow:
git checkout main
./configure --prefix=/tmp/py-main CFLAGS="-O3 -g"
make -j
make install
git checkout my-branch
./configure --prefix=/tmp/py-branch CFLAGS="-O3 -g"
make -j
make install
Then run the same benchmark using both interpreters.
Avoid comparing:
debug build vs release build
different compiler flags
different system load
different dependency versions
Those differences can swamp the change being tested.
81.19 Debug Builds Distort Speed
CPython debug builds are useful for correctness work.
They add checks and diagnostics.
They may change:
object layout
reference counting overhead
assertion cost
allocator behavior
execution speed
Use debug builds for finding bugs.
Use release-like builds for speed measurements.
A common performance build uses optimization and debug symbols:
./configure CFLAGS="-O3 -g"This keeps native profiling symbols while preserving realistic optimization.
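Release binaries are typically built with profile-guided and link-time optimization as well; if you want your performance build to match that configuration, CPython's configure script supports it directly, at the cost of a much longer build:
./configure --enable-optimizations --with-lto
make -j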
81.20 System Isolation
Benchmark on a quiet system.
Good conditions:
few background tasks
stable CPU frequency
no thermal throttling
consistent power mode
enough free memory
same terminal environment
same filesystem state where relevant
On laptops, power and thermal behavior can heavily affect results.
For serious CPython work, use a dedicated machine or carefully controlled environment.
81.21 CPU Frequency
Modern CPUs change frequency dynamically.
This can distort benchmarks.
Examples:
turbo boost
thermal throttling
power saving mode
background load
laptop battery mode
A candidate benchmark may look faster simply because the CPU was running at a higher frequency.
The pyperf system tune command (python -m pyperf system tune) can help on supported systems, but you still need to understand the machine.
81.22 Hash Randomization
Python randomizes string hashes by default.
This can affect dictionary and set workloads.
For reproducibility, you may set:
PYTHONHASHSEED=0
This makes hash behavior deterministic.
However, fixed hash seeds may hide behavior that appears under normal randomized execution.
Use fixed seeds for repeatability when needed. Use varied seeds when testing robustness.
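For a quick check with the stdlib timeit module (which runs in a single process, so the environment variable applies directly), something like this keeps string hashing deterministic; the dictionary workload is only illustrative:
PYTHONHASHSEED=0 python -m timeit \
-s "d = {str(i): i for i in range(1000)}" \
"d['500']"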
81.23 Garbage Collection Effects
Garbage collection can affect benchmarks.
A benchmark that creates many container cycles may trigger cyclic GC.
You can inspect GC behavior:
import gc
print(gc.get_count())
print(gc.get_stats())
Some microbenchmarks disable GC:
import gc

gc.disable()
try:
    run_benchmark()
finally:
    gc.enable()
This is valid only if GC is not part of what you want to measure.
For allocation-heavy real workloads, disabling GC may produce misleading results.
81.24 Allocation Effects
Allocation-heavy benchmarks are sensitive to allocator state.
A benchmark that creates many objects may be affected by:
free lists
pymalloc pools
arena reuse
system allocator behavior
memory fragmentation
Repeated runs may become faster or slower as allocator state changes.
Use multiple worker processes to reduce carryover effects.
This is another reason pyperf is preferable.
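pyperf already spawns multiple worker processes by default, and the count can be raised from the command line; a sketch, assuming your pyperf version accepts -p/--processes (recent releases do):
python -m pyperf timeit \
-p 20 \
-s 'xs = list(range(1000))' \
'[x * 2 for x in xs]'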
81.25 Cache Effects
CPU cache state affects microbenchmarks.
A small benchmark may fit entirely in cache.
A real workload may have a much larger working set.
Example:
xs = list(range(100))
may benchmark very differently from:
xs = list(range(10_000_000))
Both test “loop over a list,” but their memory behavior differs.
Choose input sizes that match the question.
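One way to see the difference is to time the same statement at both sizes and divide by the element count yourself; the per-element cost usually rises once the working set stops fitting in cache:
python -m pyperf timeit -s 'xs = list(range(100))' 'sum(xs)'
python -m pyperf timeit -s 'xs = list(range(10_000_000))' 'sum(xs)'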
81.26 Branch Prediction Effects
Stable data can make branch prediction highly effective.
Example:
for x in xs:
    if x > 0:
        total += x
If every x is positive, the branch is predictable.
If signs are random, the branch may be less predictable.
CPython itself has many type and cache validation branches. Stable runtime types help both interpreter specialization and CPU branch prediction.
81.27 Adaptive Specialization Effects
Modern CPython rewrites bytecode execution paths based on runtime behavior.
This affects benchmarks of:
attribute access
global lookup
binary operations
function calls
method calls
subscript operations
To inspect specialization:
import dis
from types import SimpleNamespace

obj = SimpleNamespace(x=1)  # f() needs an object with an .x attribute

def f(obj):
    return obj.x + 1

for _ in range(10_000):
    f(obj)

dis.dis(f, adaptive=True, show_caches=True)
A benchmark should specify whether it measures cold or warm behavior.
Most throughput benchmarks should measure warm behavior.
Startup benchmarks may intentionally measure cold behavior.
81.28 Cold Benchmarks
Cold benchmarks measure first-run behavior.
Examples:
interpreter startup
first import
first request after process start
first function execution
first regex compile
first template renderCold benchmarks matter for:
CLIs
serverless functions
short-lived scripts
developer tools
test runners
Adaptive specialization may not help much if code runs only once.
81.29 Warm Benchmarks
Warm benchmarks measure steady-state behavior.
Examples:
long-running service
worker process
data pipeline
training loop
repeated request handler
Warm benchmarks should allow:
bytecode specialization
cache population
allocator stabilization
import completion
They answer a different question from cold benchmarks.
A system can have excellent warm throughput and poor cold startup.
81.30 Benchmark Suites
A benchmark suite collects multiple workloads.
For CPython, suites can include:
startup
regex
JSON
pickle
logging
template rendering
async workloads
numeric Python loops
object-heavy workloads
compiler workloads
A suite helps avoid optimizing one narrow case while slowing many others.
When interpreting a suite:
look at aggregate result
inspect individual wins
inspect individual regressions
explain outliers
One large regression may matter more than a small aggregate improvement.
81.31 The pyperformance Suite
CPython performance work often uses the pyperformance benchmark suite.
It contains a collection of Python workloads intended to track interpreter-level performance over time.
A typical workflow is:
pyperformance run -o baseline.json --python=/path/to/python-main
pyperformance run -o candidate.json --python=/path/to/python-branch
pyperformance compare baseline.json candidate.json
This provides broader coverage than a single microbenchmark.
Use it to catch unintended regressions.
81.32 Benchmarking Standard Library Changes
For standard library changes, benchmark both isolated functions and realistic use.
Example change: JSON encoder optimization.
Microbenchmark:
python -m pyperf timeit \
-s 'import json; data = {"x": list(range(1000))}' \
'json.dumps(data)'
Macrobenchmark:
application workload that serializes real payloads
The microbenchmark confirms the target improved. The macrobenchmark confirms the improvement matters.
81.33 Benchmarking Interpreter Changes
Interpreter changes can have broad effects.
Examples:
opcode handler changes
reference counting changes
dictionary layout changes
frame layout changes
allocator changes
inline cache changes
These require broad benchmark coverage.
A change that improves one opcode may regress another path through instruction cache pressure, branch behavior, or larger data structures.
Use both focused benchmarks and benchmark suites.
81.34 Benchmarking Memory
Speed is not the only metric.
Memory benchmarks measure:
peak RSS
allocated bytes
object count
arena count
working set size
GC pressure
A change may improve speed by using more memory.
That tradeoff may or may not be acceptable.
Use tools such as:
tracemalloc
resource module
platform memory tools
external RSS measurement
heap profilers
81.35 Benchmarking Startup
Startup benchmarking should isolate phases.
Useful commands:
python -S -c pass
python -c pass
python -X importtime -c "import package"Questions:
how much is interpreter startup?
how much is site import?
how much is application import?
how much is module-level work?
Startup benchmarks are sensitive to filesystem cache and environment.
Run repeatedly and compare under controlled conditions.
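One repeatable approach is to wrap each startup command with pyperf, which runs it many times and saves a comparable result file; a sketch, assuming pyperf's command subcommand accepts -o like its other runner commands:
python -m pyperf command -o startup_no_site.json -- python -S -c pass
python -m pyperf command -o startup_full.json -- python -c pass
python -m pyperf compare_to startup_no_site.json startup_full.json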
81.36 Benchmarking Imports
Import benchmarks are important for developer tools and CLIs.
Use:
python -X importtime -c "import your_package"For repeated measurement:
python -m pyperf command \
-- python -c "import your_package"Import time can regress when modules add eager imports, perform runtime type work, or execute expensive module-level initialization.
81.37 Benchmarking C Extensions
C extension benchmarks should separate:
call overhead
argument conversion
native computation
data copying
GIL behavior
result construction
Example:
benchmark empty call
benchmark small input
benchmark large input
benchmark repeated calls
benchmark batched call
A C extension may be fast internally but slow overall if it copies data or creates many Python objects.
81.38 Benchmarking Async Code
Async benchmarks need care.
Measure:
throughput
latency
event loop lag
task scheduling overhead
queue delay
I/O simulation realism
Avoid fake benchmarks that only await already-completed coroutines, unless that is the mechanism under test.
For networked async code, use controlled local servers or mocks to reduce external noise.
81.39 Benchmarking Threaded Code
Thread benchmarks should distinguish:
CPU-bound Python code
I/O-bound waiting
native code that releases the GIL
lock contention
queue overhead
Traditional CPython serializes Python bytecode execution under the GIL, so CPU-bound Python threads often do not scale.
Free-threaded builds require different benchmarks because synchronization and reference count behavior change.
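A minimal sketch of that distinction, using raw perf_counter instead of pyperf so all threads stay in one process; the workload and sizes are illustrative:
# bench_threads.py
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_work(n):
    # pure-Python, CPU-bound loop: holds the GIL on traditional builds
    total = 0
    for i in range(n):
        total += i * i
    return total

def run(threads, n=2_000_000):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(cpu_work, [n] * threads))
    return time.perf_counter() - start

# With the GIL, 4 threads each doing the same work take roughly 4x as long as 1 thread;
# a free-threaded build may scale much closer to 1x given enough cores.
print("1 thread :", run(1))
print("4 threads:", run(4))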
81.40 Benchmarking Free-Threaded CPython
Free-threaded CPython benchmarking should measure both single-thread and multi-thread behavior.
Important dimensions:
single-thread overhead
multi-thread scaling
reference count contention
allocator contention
object sharing
lock granularity
C extension compatibility
A free-threaded build may improve parallel workloads while slowing single-thread workloads.
Benchmark both. Do not report only the favorable side.
81.41 Benchmarking With Native Profilers
When a benchmark changes, use profiling to explain why.
Example workflow:
benchmark shows 8 percent slowdown
run native profiler
inspect hot symbols
find increased dict lookup cost
inspect code change
create focused microbenchmark
fix or justify regression
Benchmarks detect changes. Profilers explain changes.
The two tools should be used together.
81.42 Reporting Results
A good benchmark report includes:
summary
baseline and candidate
environment
commands
raw result files
main wins
main regressions
interpretation
known limitations
Avoid vague claims.
Bad:
this is faster
Good:
On this machine, candidate reduces median runtime for benchmark X from 120.4 ms to 111.8 ms across 20 pyperf runs. Benchmark Y regresses from 80.1 ms to 83.0 ms. Raw pyperf JSON files are attached.
81.43 Statistical Significance
Do not overstate tiny changes.
If a result is inside noise, say so.
Example:
candidate: 1.01x faster, but run-to-run variation is 1.5 percent
This is weak evidence.
pyperf compare_to helps identify meaningful differences, but human judgment still matters.
Look for stable, explainable changes.
81.44 Regression Hunting
When performance regresses, reduce the search space.
Useful workflow:
confirm regression
find affected benchmark
bisect commits
profile before and after
create smaller reproducer
identify mechanism
fix or document tradeoff
For CPython, git bisect plus a repeatable benchmark command is powerful.
A regression without a reproducer is hard to fix.
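A sketch of that bisection, assuming check_perf.sh is a script you write that rebuilds the interpreter, runs the benchmark, and exits nonzero when the result is slower than a chosen threshold (git bisect run treats a nonzero exit status, other than 125, as "bad"):
git bisect start
git bisect bad                  # current commit shows the regression
git bisect good v3.12.0         # a commit or tag known to be fast
git bisect run ./check_perf.sh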
81.45 Benchmark Reproducers
A benchmark reproducer should be:
small enough to run easily
large enough to show the effect
deterministic
documented
independent of external services
Include setup, command, input data, and expected comparison.
A good reproducer lets another developer validate the result.
81.46 Benchmarking Pitfalls in Python Code
Common pitfalls:
| Pitfall | Example | Problem |
|---|---|---|
| Measuring import in setup accidentally | setup imports large package | Hides runtime cost |
| Measuring printing | print(x) in loop | Terminal I/O dominates |
| Measuring random data generation | generate input in timed code | Mixes setup and target |
| Too few iterations | one call | High noise |
| Overly tiny operation | x + 1 only | Timer overhead and dispatch dominate |
| Wrong scope | globals instead of locals | Measures lookup difference accidentally |
A good benchmark has a narrow, explicit target.
81.47 Benchmarking Example: Attribute Access
Compare normal attributes and slots.
# bench_attr.py
import pyperf

runner = pyperf.Runner()

class Normal:
    def __init__(self):
        self.x = 1

class Slotted:
    __slots__ = ("x",)

    def __init__(self):
        self.x = 1

normal = Normal()
slotted = Slotted()

def read_normal():
    return normal.x

def read_slotted():
    return slotted.x

runner.bench_func("normal_attr", read_normal)
runner.bench_func("slotted_attr", read_slotted)
Run:
python bench_attr.py -o attr.json
This tests one narrow mechanism. It does not prove that all slotted classes are better for all applications.
81.48 Benchmarking Example: Function Call
Compare direct expression and helper function.
# bench_call.py
import pyperf

runner = pyperf.Runner()

def inc(x):
    return x + 1

def direct_loop():
    total = 0
    for i in range(10_000):
        total += i + 1
    return total

def call_loop():
    total = 0
    for i in range(10_000):
        total += inc(i)
    return total

runner.bench_func("direct_loop", direct_loop)
runner.bench_func("call_loop", call_loop)
This measures the cost of repeated Python calls in a loop.
It is useful for understanding call overhead. It does not mean helper functions should be avoided everywhere.
81.49 Benchmarking Example: Dictionary Lookup
# bench_dict.py
import pyperf

runner = pyperf.Runner()

d = {str(i): i for i in range(1000)}
keys = [str(i) for i in range(1000)]

def lookup_loop():
    total = 0
    for key in keys:
        total += d[key]
    return total

runner.bench_func("dict_lookup_loop", lookup_loop)
This benchmark includes:
string key hashing
dictionary lookup
loop overhead
integer addition
If you want only dictionary lookup, you need a narrower benchmark. If you want realistic dictionary use, this may be suitable.
81.50 Mental Model
A useful model:
Benchmarking is controlled comparison.
The core loop is:
define workload
measure baseline
measure candidate
compare distributions
explain change
confirm with profiling
A benchmark result is useful only when the workload, environment, and comparison are clear.
81.51 Chapter Summary
Benchmarking CPython requires discipline.
Use timeit for quick checks, pyperf for reliable measurement, and benchmark suites for broad coverage. Account for warmup, adaptive specialization, CPU behavior, garbage collection, allocation effects, and system noise.
Good benchmarking separates cold and warm behavior, distinguishes microbenchmarks from macrobenchmarks, records the environment, and compares distributions rather than single numbers.
For CPython work, benchmarks should be paired with profiling. Benchmarks show that performance changed. Profiling explains why.