# 81. Benchmarking CPython

Benchmarking CPython means measuring execution speed in a controlled, repeatable way. Profiling answers where time is spent. Benchmarking answers how much a change affects performance.

A benchmark should compare two states:

```text
baseline
candidate
```

The baseline may be an older CPython build, an earlier version of a function, or a different implementation strategy. The candidate is the version being tested.

The goal is not to get one impressive number. The goal is to produce a trustworthy comparison.

## 81.1 Why Benchmarking Is Hard

Small timing differences are easy to misread.

A Python benchmark can be affected by:

```text
CPU frequency scaling
thermal throttling
background processes
memory pressure
operating system scheduling
Python build options
compiler version
CPU cache state
adaptive interpreter warmup
garbage collection
imports
randomized hashing
I/O variability
```

A single run rarely means much.

Reliable benchmarking reduces noise, repeats measurements, and compares distributions.

## 81.2 Benchmarking vs Profiling

Profiling and benchmarking are related but different.

| Activity | Main question | Example |
|---|---|---|
| Profiling | Where is time spent? | `cProfile`, `perf`, `tracemalloc` |
| Benchmarking | Did performance change? | `pyperf`, benchmark suites |
| Microbenchmarking | How fast is one small operation? | `timeit`, `pyperf timeit` |
| Macrobenchmarking | How fast is a realistic workload? | full application or suite |

A profiler may show that attribute lookup dominates a loop. A benchmark tells whether changing the object layout improved total runtime.

## 81.3 What to Benchmark

Benchmark the workload that matters.

Possible benchmark targets:

```text
small language operation
standard library function
application hot path
import time
serialization workload
web request handler
test suite runtime
compiler performance
interpreter startup
C extension boundary
```

A small benchmark is easier to interpret. A large benchmark is more representative.

Good performance work often uses both:

```text
microbenchmark:
    isolate mechanism

macrobenchmark:
    confirm real-world effect
```

## 81.4 Baselines

A baseline must be explicit.

Bad:

```text
this feels faster
```

Better:

```text
CPython main branch at commit A
candidate branch at commit B
same compiler
same machine
same benchmark command
```

A benchmark result without a baseline is just a measurement.

A useful comparison records:

```text
Python version
git commit
compiler
optimization flags
operating system
CPU model
benchmark command
environment variables
number of runs
```

## 81.5 Use `pyperf` for Serious Benchmarks

`pyperf` is the standard tool for reliable Python benchmarking.

It handles:

```text
warmups
multiple worker processes
statistics
metadata
JSON output
result comparison
system tuning helpers
```

Example:

```bash
python -m pyperf timeit \
  -s 'xs = list(range(1000))' \
  'sum(xs)'
```

Save results:

```bash
python -m pyperf timeit \
  -o baseline.json \
  -s 'xs = list(range(1000))' \
  'sum(xs)'
```

Compare:

```bash
python -m pyperf compare_to baseline.json candidate.json
```

This is much better than reading one manual timer result.

## 81.6 `timeit`

`timeit` is useful for quick checks.

Example:

```bash
python -m timeit -s 'x = 1' 'x + 1'
```

In Python code:

```python
import timeit

duration = timeit.timeit(
    "obj.x",
    setup="""
class C:
    pass
obj = C()
obj.x = 1
""",
    number=10_000_000,
)

print(duration)
```

`timeit` is convenient, but it gives less control than `pyperf`.

Use `timeit` for exploration. Use `pyperf` for claims.

## 81.7 Warmup

Modern CPython uses adaptive specialization.

This means early executions may differ from later executions.

Example:

```python
def f(obj):
    return obj.x
```

The first few executions may use generic attribute lookup. Later executions may use a specialized `LOAD_ATTR` path.

A benchmark must allow warmup.

Bad benchmark:

```python
import time

class C:
    pass

obj = C()
obj.x = 1

start = time.perf_counter()
f(obj)
end = time.perf_counter()

print(end - start)
```

This mostly measures one cold call.

Better:

```python
for _ in range(10_000):
    f(obj)

start = time.perf_counter()
for _ in range(1_000_000):
    f(obj)
end = time.perf_counter()

print(end - start)
```

`pyperf` manages warmup more carefully.

## 81.8 Dead Code Elimination Is Less Relevant, But Still Think About Results

CPython generally does not perform the aggressive dead-code elimination that a native optimizing compiler does.

This means:

```python
x + 1
```

inside `timeit` is still executed.

However, benchmark structure still matters.

Bad:

```python
import timeit

timeit.timeit("f()", setup="def f(): return 1")
```

This may mostly measure call overhead, not useful work.

Better:

```python
import timeit

timeit.timeit(
    "total += f(i)",
    setup="def f(x): return x + 1\ntotal = 0\ni = 1",
)
```

Even then, details such as variable scope and the shape of the setup code affect the result.

A benchmark should force the operation you care about to happen in the same shape as real code.

## 81.9 Avoid I/O in Microbenchmarks

I/O is noisy.

Avoid including these in tight microbenchmarks:

```text
disk reads
network requests
database queries
printing
subprocesses
random sleeps
```

For I/O workloads, benchmark at a larger level and measure wall time, latency, throughput, and variance.

For interpreter mechanics, keep the benchmark CPU-bound and deterministic.

## 81.10 Avoid Measuring Setup

Keep setup outside the timed section.

Bad:

```python
import timeit

timeit.timeit("[i for i in range(1000)]")
```

This measures both range iteration and list construction. That may be what you want, but mixing costs makes the result harder to interpret, and setup work can easily end up dominating the timed statement.

For lookup benchmarking:

```python
import timeit

timeit.timeit(
    "d['key']",
    setup="d = {'key': 1}",
)
```
```

This measures lookup, not dictionary creation.

For construction benchmarking:

```python
import timeit

timeit.timeit(
    "{str(i): i for i in range(1000)}",
)
```

This intentionally measures construction.

Be explicit about what the benchmark includes.

## 81.11 Microbenchmarks

Microbenchmarks isolate small operations.

Examples:

```text
local variable load
global lookup
attribute access
method call
list append
dictionary lookup
function call
integer addition
exception raising
```

Example:

```bash
python -m pyperf timeit \
  -s $'class C: pass\nobj = C()\nobj.x = 1' \
  'obj.x'
```

Microbenchmarks are useful when testing a specific interpreter mechanism.

They are dangerous when used to claim application-level improvement.

A 20 percent improvement in one microbenchmark may produce no visible application speedup if that operation is not a dominant cost.

## 81.12 Macrobenchmarks

Macrobenchmarks measure larger workloads.

Examples:

```text
run a web request handler
parse a large JSON file
render templates
run a test suite
import a package
compile many Python files
run a CLI command
execute a data-processing pipeline
```

Macrobenchmarks include many interacting costs.

They are harder to explain, but more representative.

A good macrobenchmark should have:

```text
fixed input data
fixed environment
repeatable command
clear metric
low external I/O variability
```
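
When the repeatable command is a whole process, `pyperf` can time it end to end with `Runner.bench_command`. A minimal sketch; the command here is a trivial placeholder, so substitute the real CLI or pipeline invocation:

```python
# bench_cli.py -- macrobenchmark sketch: time an entire child process.
import sys

import pyperf

runner = pyperf.Runner()

# bench_command spawns the command repeatedly and measures its wall time.
# The workload below is a placeholder.
runner.bench_command("import_json", [sys.executable, "-c", "import json"])
```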

## 81.13 Throughput and Latency

Different workloads need different metrics.

Throughput:

```text
operations per second
requests per second
files processed per second
objects parsed per second
```

Latency:

```text
time per request
time to first response
p50 latency
p95 latency
p99 latency
```

A change can improve throughput while worsening tail latency.

For services, benchmark distributions, not just averages.
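
Percentiles are easy to compute from collected samples with the standard library. A minimal sketch; the latency values below are placeholder data with one slow outlier:

```python
from statistics import median, quantiles

# Latency samples in milliseconds (placeholder data).
latencies = [12.1, 12.4, 12.2, 13.0, 12.3, 45.7, 12.2, 12.5]

cuts = quantiles(latencies, n=100, method="inclusive")  # 99 cut points
print(f"p50 = {median(latencies):.1f} ms")
print(f"p95 = {cuts[94]:.1f} ms")
print(f"p99 = {cuts[98]:.1f} ms")
```

The single outlier barely moves the median but dominates the tail percentiles, which is exactly why services should report both.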

## 81.14 Mean, Median, and Variance

Benchmark results are distributions.

Useful values:

| Metric | Meaning |
|---|---|
| Mean | Average across runs |
| Median | Middle result, less sensitive to outliers |
| Standard deviation | Spread of results |
| Min | Best observed result |
| Max | Worst observed result |

A small speedup with high variance is weak evidence.

Example:

```text
baseline: 100 ms ± 5 ms
candidate: 98 ms ± 6 ms
```

This difference may be noise.

A stronger result:

```text
baseline: 100 ms ± 1 ms
candidate: 92 ms ± 1 ms
```

The distributions are clearly separated.
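
These values can be computed directly from raw run times. A minimal sketch; the run times below are placeholder data:

```python
from statistics import mean, median, stdev

# Run times in milliseconds from repeated runs (placeholder data).
runs = [100.2, 99.8, 101.0, 100.5, 99.9, 100.1]

print(f"mean   = {mean(runs):.2f} ms")
print(f"median = {median(runs):.2f} ms")
print(f"stdev  = {stdev(runs):.2f} ms")
print(f"min = {min(runs):.2f} ms, max = {max(runs):.2f} ms")
```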

## 81.15 Geometric Mean

Benchmark suites often use geometric mean for ratios.

If each benchmark produces a speed ratio:

```text
candidate_time / baseline_time
```

the geometric mean summarizes multiplicative changes better than arithmetic mean.

Example:

```text
benchmark A: 0.90x
benchmark B: 1.10x
benchmark C: 1.00x
```

The geometric mean treats relative changes consistently.
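
The standard library computes it directly. Using the ratios from the example above:

```python
from statistics import geometric_mean

# candidate_time / baseline_time for each benchmark (from the example above).
ratios = [0.90, 1.10, 1.00]

print(geometric_mean(ratios))  # about 0.997: essentially no overall change
```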

For CPython suite comparisons, avoid over-focusing on one aggregate. Look at individual benchmarks too.

## 81.16 Speedup and Slowdown

Use clear ratio language.

If baseline is 10 seconds and candidate is 8 seconds:

```text
candidate is 1.25x as fast
candidate takes 20 percent less time
```

Calculation:

```text
speed ratio = baseline_time / candidate_time = 10 / 8 = 1.25
time reduction = (10 - 8) / 10 = 20 percent
```

Do not confuse “25 percent faster” with “20 percent less time.” They are related but not identical.
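
A quick check of the arithmetic:

```python
baseline_time = 10.0  # seconds
candidate_time = 8.0

print(baseline_time / candidate_time)      # 1.25 -> "1.25x as fast"
print(1 - candidate_time / baseline_time)  # 0.20 -> "20 percent less time"
```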

## 81.17 Comparing Python Versions

When comparing Python versions, use the same benchmark suite and environment.

Example:

```bash
python3.12 -m pyperf timeit -o py312.json 'sum(range(1000))'
python3.13 -m pyperf timeit -o py313.json 'sum(range(1000))'

python3.13 -m pyperf compare_to py312.json py313.json
```

Record:

```text
exact Python versions
build type
compiler
CPU
operating system
command
```

Python version changes may affect bytecode, specialization, object layout, and standard library behavior.

## 81.18 Comparing CPython Branches

For CPython development, compare builds from different branches or commits.

Typical workflow:

```bash
git checkout main
./configure --prefix=/tmp/py-main CFLAGS="-O3 -g"
make -j
make install

git checkout my-branch
./configure --prefix=/tmp/py-branch CFLAGS="-O3 -g"
make -j
make install
```

Then run the same benchmark using both interpreters.

Avoid comparing:

```text
debug build vs release build
different compiler flags
different system load
different dependency versions
```

Those differences can swamp the change being tested.

## 81.19 Debug Builds Distort Speed

CPython debug builds are useful for correctness work.

They add checks and diagnostics.

They may change:

```text
object layout
reference counting overhead
assertion cost
allocator behavior
execution speed
```

Use debug builds for finding bugs.

Use release-like builds for speed measurements.

A common performance build uses optimization and debug symbols:

```bash
./configure CFLAGS="-O3 -g"
```

This keeps native profiling symbols while preserving realistic optimization.

## 81.20 System Isolation

Benchmark on a quiet system.

Good conditions:

```text
few background tasks
stable CPU frequency
no thermal throttling
consistent power mode
enough free memory
same terminal environment
same filesystem state where relevant
```

On laptops, power and thermal behavior can heavily affect results.

For serious CPython work, use a dedicated machine or carefully controlled environment.

## 81.21 CPU Frequency

Modern CPUs change frequency dynamically.

This can distort benchmarks.

Examples:

```text
turbo boost
thermal throttling
power saving mode
background load
laptop battery mode
```

A candidate benchmark may look faster simply because the CPU was running at a higher frequency.

`pyperf system tune` can help on supported systems, but you still need to understand the machine.

## 81.22 Hash Randomization

Python randomizes string hashes by default.

This can affect dictionary and set workloads.

For reproducibility, you may set:

```bash
export PYTHONHASHSEED=0
```

This makes hash behavior deterministic.

However, fixed hash seeds may hide behavior that appears under normal randomized execution.

Use fixed seeds for repeatability when needed. Use varied seeds when testing robustness.
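
A quick way to see the effect: run this one-liner twice. With randomized hashing (the default) the printed value usually differs between processes; with `PYTHONHASHSEED=0` it is stable:

```python
# hash_demo.py
# Default: the hash of a str varies per process due to hash randomization.
# With PYTHONHASHSEED=0, the value is the same on every run.
print(hash("benchmark-key"))
```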

## 81.23 Garbage Collection Effects

Garbage collection can affect benchmarks.

A benchmark that creates many container cycles may trigger cyclic GC.

You can inspect GC behavior:

```python
import gc

print(gc.get_count())
print(gc.get_stats())
```

Some microbenchmarks disable GC:

```python
import gc
gc.disable()
try:
    run_benchmark()
finally:
    gc.enable()
```

This is valid only if GC is not part of what you want to measure.

For allocation-heavy real workloads, disabling GC may produce misleading results.

## 81.24 Allocation Effects

Allocation-heavy benchmarks are sensitive to allocator state.

A benchmark that creates many objects may be affected by:

```text
free lists
pymalloc pools
arena reuse
system allocator behavior
memory fragmentation
```

Repeated runs may become faster or slower as allocator state changes.

Use multiple worker processes to reduce carryover effects.

This is another reason `pyperf` is preferable.

## 81.25 Cache Effects

CPU cache state affects microbenchmarks.

A small benchmark may fit entirely in cache.

A real workload may have a much larger working set.

Example:

```python
xs = list(range(100))
```

may benchmark very differently from:

```python
xs = list(range(10_000_000))
```

Both test “loop over a list,” but their memory behavior differs.

Choose input sizes that match the question.

## 81.26 Branch Prediction Effects

Stable data can make branch prediction highly effective.

Example:

```python
for x in xs:
    if x > 0:
        total += x
```

If every `x` is positive, the branch is predictable.

If signs are random, the branch may be less predictable.

CPython itself has many type and cache validation branches. Stable runtime types help both interpreter specialization and CPU branch prediction.

## 81.27 Adaptive Specialization Effects

Modern CPython rewrites bytecode execution paths based on runtime behavior.

This affects benchmarks of:

```text
attribute access
global lookup
binary operations
function calls
method calls
subscript operations
```

To inspect specialization:

```python
import dis

class C:
    pass

obj = C()
obj.x = 1

def f(obj):
    return obj.x + 1

for _ in range(10_000):
    f(obj)

dis.dis(f, adaptive=True, show_caches=True)
```

A benchmark should specify whether it measures cold or warm behavior.

Most throughput benchmarks should measure warm behavior.

Startup benchmarks may intentionally measure cold behavior.

## 81.28 Cold Benchmarks

Cold benchmarks measure first-run behavior.

Examples:

```text
interpreter startup
first import
first request after process start
first function execution
first regex compile
first template render
```

Cold benchmarks matter for:

```text
CLIs
serverless functions
short-lived scripts
developer tools
test runners
```

Adaptive specialization may not help much if code runs only once.
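
A minimal sketch of a cold measurement, assuming the cost of interest is a fresh-process import; `pyperf command` does the same job with proper statistics:

```python
import subprocess
import sys
import time

# Each iteration starts a fresh interpreter, so imports, caches,
# and bytecode specialization are cold every time.
samples = []
for _ in range(10):
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", "import json"], check=True)
    samples.append(time.perf_counter() - start)

print(f"best cold run: {min(samples) * 1000:.1f} ms")
```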

## 81.29 Warm Benchmarks

Warm benchmarks measure steady-state behavior.

Examples:

```text
long-running service
worker process
data pipeline
training loop
repeated request handler
```

Warm benchmarks should allow:

```text
bytecode specialization
cache population
allocator stabilization
import completion
```

They answer a different question from cold benchmarks.

A system can have excellent warm throughput and poor cold startup.

## 81.30 Benchmark Suites

A benchmark suite collects multiple workloads.

For CPython, suites can include:

```text
startup
regex
JSON
pickle
logging
template rendering
async workloads
numeric Python loops
object-heavy workloads
compiler workloads
```

A suite helps avoid optimizing one narrow case while slowing many others.

When interpreting a suite:

```text
look at aggregate result
inspect individual wins
inspect individual regressions
explain outliers
```

One large regression may matter more than a small aggregate improvement.

## 81.31 The `pyperformance` Suite

CPython performance work often uses the `pyperformance` benchmark suite.

It contains a collection of Python workloads intended to track interpreter-level performance over time.

A typical workflow is:

```bash
pyperformance run -o baseline.json --python=/path/to/python-main
pyperformance run -o candidate.json --python=/path/to/python-branch
pyperformance compare baseline.json candidate.json
```

This provides broader coverage than a single microbenchmark.

Use it to catch unintended regressions.

## 81.32 Benchmarking Standard Library Changes

For standard library changes, benchmark both isolated functions and realistic use.

Example change: JSON encoder optimization.

Microbenchmark:

```bash
python -m pyperf timeit \
  -s 'import json; data = {"x": list(range(1000))}' \
  'json.dumps(data)'
```

Macrobenchmark:

```text
application workload that serializes real payloads
```

The microbenchmark confirms the target improved. The macrobenchmark confirms the improvement matters.

## 81.33 Benchmarking Interpreter Changes

Interpreter changes can have broad effects.

Examples:

```text
opcode handler changes
reference counting changes
dictionary layout changes
frame layout changes
allocator changes
inline cache changes
```

These require broad benchmark coverage.

A change that improves one opcode may regress another path through instruction cache pressure, branch behavior, or larger data structures.

Use both focused benchmarks and benchmark suites.

## 81.34 Benchmarking Memory

Speed is not the only metric.

Memory benchmarks measure:

```text
peak RSS
allocated bytes
object count
arena count
working set size
GC pressure
```

A change may improve speed by using more memory.

That tradeoff may or may not be acceptable.

Use tools such as:

```text
tracemalloc
resource module
platform memory tools
external RSS measurement
heap profilers
```
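
For Python-level allocations, `tracemalloc` reports current and peak traced memory. A minimal sketch with a placeholder workload:

```python
import tracemalloc

tracemalloc.start()

# Allocation-heavy placeholder workload.
data = [{"i": i, "s": str(i)} for i in range(100_000)]

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current = {current / 1e6:.1f} MB, peak = {peak / 1e6:.1f} MB")
```

Note that `tracemalloc` tracks Python-level allocations, not total process RSS; measure RSS separately if the whole-process footprint matters.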

## 81.35 Benchmarking Startup

Startup benchmarking should isolate phases.

Useful commands:

```bash
python -S -c pass
python -c pass
python -X importtime -c "import package"
```

Questions:

```text
how much is interpreter startup?
how much is site import?
how much is application import?
how much is module-level work?
```

Startup benchmarks are sensitive to filesystem cache and environment.

Run repeatedly and compare under controlled conditions.

## 81.36 Benchmarking Imports

Import benchmarks are important for developer tools and CLIs.

Use:

```bash
python -X importtime -c "import your_package"
```

For repeated measurement:

```bash
python -m pyperf command \
  -- python -c "import your_package"
```

Import time can regress when modules add eager imports, perform runtime type work, or execute expensive module-level initialization.

## 81.37 Benchmarking C Extensions

C extension benchmarks should separate:

```text
call overhead
argument conversion
native computation
data copying
GIL behavior
result construction
```

Example:

```text
benchmark empty call
benchmark small input
benchmark large input
benchmark repeated calls
benchmark batched call
```

A C extension may be fast internally but slow overall if it copies data or creates many Python objects.

## 81.38 Benchmarking Async Code

Async benchmarks need care.

Measure:

```text
throughput
latency
event loop lag
task scheduling overhead
queue delay
I/O simulation realism
```

Avoid fake benchmarks that only await already-completed coroutines, unless that mechanism is itself what you are testing.

For networked async code, use controlled local servers or mocks to reduce external noise.
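
Event loop lag can be approximated by scheduling short sleeps and measuring how late each wakeup arrives. A minimal sketch, assuming the default interval and sample count are tuned to the workload:

```python
import asyncio
import time

async def loop_lag(samples: int = 200, interval: float = 0.005) -> float:
    # Late wakeups mean something else kept the event loop busy.
    worst = 0.0
    for _ in range(samples):
        target = time.perf_counter() + interval
        await asyncio.sleep(interval)
        worst = max(worst, time.perf_counter() - target)
    return worst

print(f"worst loop lag: {asyncio.run(loop_lag()) * 1000:.2f} ms")
```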

## 81.39 Benchmarking Threaded Code

Thread benchmarks should distinguish:

```text
CPU-bound Python code
I/O-bound waiting
native code that releases the GIL
lock contention
queue overhead
```

Traditional CPython serializes Python bytecode execution under the GIL, so CPU-bound Python threads often do not scale, as the sketch below illustrates.
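
A minimal sketch of that effect: on a GIL build, running a CPU-bound function in two threads typically takes about as long as running it twice serially:

```python
import threading
import time

def burn(n):
    # Pure-Python busy work; holds the GIL for its whole run.
    total = 0
    for i in range(n):
        total += i

N = 5_000_000

start = time.perf_counter()
burn(N)
burn(N)
serial = time.perf_counter() - start

threads = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"serial = {serial:.2f} s, two threads = {threaded:.2f} s")
```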

Free-threaded builds require different benchmarks because synchronization and reference count behavior change.

## 81.40 Benchmarking Free-Threaded CPython

Free-threaded CPython benchmarking should measure both single-thread and multi-thread behavior.

Important dimensions:

```text
single-thread overhead
multi-thread scaling
reference count contention
allocator contention
object sharing
lock granularity
C extension compatibility
```

A free-threaded build may improve parallel workloads while slowing single-thread workloads.

Benchmark both. Do not report only the favorable side.
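
When reporting such results, record which build was used. A sketch of how to check, assuming Python 3.13+ for the runtime query:

```python
import sys
import sysconfig

# 1 on free-threaded builds, 0 or None otherwise.
print(sysconfig.get_config_var("Py_GIL_DISABLED"))

# Python 3.13+ only: whether the GIL is actually enabled in this process.
if hasattr(sys, "_is_gil_enabled"):
    print(sys._is_gil_enabled())
```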

## 81.41 Benchmarking With Native Profilers

When a benchmark changes, use profiling to explain why.

Example workflow:

```text
benchmark shows 8 percent slowdown
run native profiler
inspect hot symbols
find increased dict lookup cost
inspect code change
create focused microbenchmark
fix or justify regression
```

Benchmarks detect changes. Profilers explain changes.

The two tools should be used together.

## 81.42 Reporting Results

A good benchmark report includes:

```text
summary
baseline and candidate
environment
commands
raw result files
main wins
main regressions
interpretation
known limitations
```

Avoid vague claims.

Bad:

```text
this is faster
```

Good:

```text
On this machine, candidate reduces median runtime for benchmark X from 120.4 ms to 111.8 ms across 20 pyperf runs. Benchmark Y regresses from 80.1 ms to 83.0 ms. Raw pyperf JSON files are attached.
```

## 81.43 Statistical Significance

Do not overstate tiny changes.

If a result is inside noise, say so.

Example:

```text
candidate: 1.01x faster, but run-to-run variation is 1.5 percent
```

This is weak evidence.

`pyperf compare_to` helps identify meaningful differences, but human judgment still matters.

Look for stable, explainable changes.

## 81.44 Regression Hunting

When performance regresses, reduce the search space.

Useful workflow:

```text
confirm regression
find affected benchmark
bisect commits
profile before and after
create smaller reproducer
identify mechanism
fix or document tradeoff
```

For CPython, `git bisect` plus a repeatable benchmark command is powerful.

A regression without a reproducer is hard to fix.
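
`git bisect run` needs a command that exits 0 for good commits and nonzero for bad ones. A hedged sketch of such a wrapper; `reproducer.py`, the threshold, and any per-commit rebuild step are assumptions to adapt:

```python
#!/usr/bin/env python3
# bisect_check.py -- hypothetical wrapper for `git bisect run`.
# Exit 0 = good (fast enough), exit 1 = bad (regressed).
import subprocess
import sys
import time

THRESHOLD = 0.50  # seconds; assumed budget for the reproducer

start = time.perf_counter()
subprocess.run([sys.executable, "reproducer.py"], check=True)  # hypothetical script
elapsed = time.perf_counter() - start

sys.exit(0 if elapsed < THRESHOLD else 1)
```

For CPython bisects, the wrapper would also rebuild the interpreter at each step before running the reproducer.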

## 81.45 Benchmark Reproducers

A benchmark reproducer should be:

```text
small enough to run easily
large enough to show the effect
deterministic
documented
independent of external services
```

Include setup, command, input data, and expected comparison.

A good reproducer lets another developer validate the result.

## 81.46 Benchmarking Pitfalls in Python Code

Common pitfalls:

| Pitfall | Example | Problem |
|---|---|---|
| Measuring import in setup accidentally | setup imports large package | Hides runtime cost |
| Measuring printing | `print(x)` in loop | Terminal I/O dominates |
| Measuring random data generation | generate input in timed code | Mixes setup and target |
| Too few iterations | one call | High noise |
| Overly tiny operation | `x + 1` only | Timer overhead and dispatch dominate |
| Wrong scope | globals instead of locals | Measures lookup difference accidentally |

A good benchmark has a narrow, explicit target.

## 81.47 Benchmarking Example: Attribute Access

Compare normal attributes and slots.

```python
# bench_attr.py
import pyperf

runner = pyperf.Runner()

class Normal:
    def __init__(self):
        self.x = 1

class Slotted:
    __slots__ = ("x",)

    def __init__(self):
        self.x = 1

normal = Normal()
slotted = Slotted()

def read_normal():
    return normal.x

def read_slotted():
    return slotted.x

runner.bench_func("normal_attr", read_normal)
runner.bench_func("slotted_attr", read_slotted)
```

Run:

```bash
python bench_attr.py -o attr.json
```

This tests one narrow mechanism. It does not prove that all slotted classes are better for all applications.

## 81.48 Benchmarking Example: Function Call

Compare direct expression and helper function.

```python
# bench_call.py
import pyperf

runner = pyperf.Runner()

def inc(x):
    return x + 1

def direct_loop():
    total = 0
    for i in range(10_000):
        total += i + 1
    return total

def call_loop():
    total = 0
    for i in range(10_000):
        total += inc(i)
    return total

runner.bench_func("direct_loop", direct_loop)
runner.bench_func("call_loop", call_loop)
```

This measures the cost of repeated Python calls in a loop.

It is useful for understanding call overhead. It does not mean helper functions should be avoided everywhere.

## 81.49 Benchmarking Example: Dictionary Lookup

```python
# bench_dict.py
import pyperf

runner = pyperf.Runner()

d = {str(i): i for i in range(1000)}
keys = [str(i) for i in range(1000)]

def lookup_loop():
    total = 0
    for key in keys:
        total += d[key]
    return total

runner.bench_func("dict_lookup_loop", lookup_loop)
```

This benchmark includes:

```text
string key hashing
dictionary lookup
loop overhead
integer addition
```

If you want only dictionary lookup, you need a narrower benchmark. If you want realistic dictionary use, this may be suitable.

## 81.50 Mental Model

A useful model:

```text
Benchmarking is controlled comparison.
```

The core loop is:

```text
define workload
measure baseline
measure candidate
compare distributions
explain change
confirm with profiling
```

A benchmark result is useful only when the workload, environment, and comparison are clear.

## 81.51 Chapter Summary

Benchmarking CPython requires discipline.

Use `timeit` for quick checks, `pyperf` for reliable measurement, and benchmark suites for broad coverage. Account for warmup, adaptive specialization, CPU behavior, garbage collection, allocation effects, and system noise.

Good benchmarking separates cold and warm behavior, distinguishes microbenchmarks from macrobenchmarks, records the environment, and compares distributions rather than single numbers.

For CPython work, benchmarks should be paired with profiling. Benchmarks show that performance changed. Profiling explains why.
