Using perf, Instruments, and py-spy to profile CPython; reading perf maps generated by the JIT.
Profiling CPython means measuring where time, memory, allocations, and interpreter work are spent. It is the practical bridge between a performance suspicion and a performance explanation.
A Python program can be slow for many different reasons:
too many Python function calls
slow attribute lookup
large dictionary workloads
excessive allocation
poor memory locality
I/O waiting
bad algorithmic complexity
C extension overhead
lock contention
garbage collection pauses
Profiling separates these cases.
Without profiling, performance work often becomes guesswork.
80.1 What Profiling Measures
Profiling can measure several different layers.
| Layer | Measures | Typical tools |
|---|---|---|
| Python function level | Which Python functions consume time | cProfile, profile, pstats |
| Line level | Which source lines consume time | line_profiler, tracing tools |
| Allocation level | Where Python allocations happen | tracemalloc |
| Native level | C functions, CPU samples, cache misses | perf, Instruments, VTune |
| Event level | GC events, imports, calls, exceptions | audit hooks, tracing hooks |
| Benchmark level | Stable timing across runs | pyperf, timeit |
Different tools answer different questions. A single profile rarely explains everything.
80.2 Deterministic Profiling
Deterministic profiling records events as they happen.
For Python code, this usually means tracking:
function call
function return
exception
cProfile is the standard deterministic profiler.
Example:
```python
import cProfile

def work():
    total = 0
    for i in range(1_000_000):
        total += i
    return total

cProfile.run("work()")
```

The profiler records function-level execution counts and time.
Deterministic profiling gives precise call counts, but adds overhead because it observes every function event.
80.3 cProfile
cProfile is implemented in C and has lower overhead than the pure Python profile module.
A typical workflow:
```python
import cProfile
import pstats

prof = cProfile.Profile()
prof.enable()
work()
prof.disable()

stats = pstats.Stats(prof)
stats.sort_stats("cumtime")
stats.print_stats(20)
```

Common columns:
| Column | Meaning |
|---|---|
| ncalls | Number of calls |
| tottime | Time spent in the function body, excluding subcalls |
| percall | Time per call |
| cumtime | Time spent in the function, including subcalls |
| filename:lineno(function) | Function identity |
cumtime finds high-level expensive operations. tottime finds functions whose own body is expensive.
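The distinction can be demonstrated directly. The following sketch (leaf and wrapper are illustrative names) profiles a wrapper whose cumtime is dominated by its callee:

```python
import cProfile
import io
import pstats

def leaf():
    # The actual work happens here, so leaf accumulates tottime.
    return sum(range(200_000))

def wrapper():
    # wrapper mostly delegates: high cumtime, low tottime.
    return leaf()

prof = cProfile.Profile()
prof.enable()
wrapper()
prof.disable()

out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats("cumtime").print_stats()
print(out.getvalue())
```

In the printed report, wrapper and leaf show similar cumtime, but the tottime belongs to leaf.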
80.4 Interpreting cProfile
Consider this simplified output:
```
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1000000   0.210    0.000    0.210    0.000 app.py:10(parse_item)
      1   0.020    0.020    0.310    0.310 app.py:20(load_all)
```

The function parse_item is called one million times. Even if each call is small, the aggregate cost is large.
Common interpretations:
| Symptom | Likely issue |
|---|---|
| Huge ncalls | Call overhead or too fine-grained functions |
| High tottime | Expensive function body |
| High cumtime, low tottime | Expensive callees |
| Builtins dominating | Data structure or C-level operation cost |
| Import machinery dominating | Startup or import-time work |
The profiler reports where time went, not why it went there. You still need source-level reasoning.
80.5 Sampling Profiling
Sampling profilers interrupt the program periodically and record where it is.
Instead of tracking every call, they repeatedly ask: where is the program right now?
If 40 percent of samples occur in one function, that function likely consumes about 40 percent of CPU time.
Sampling profilers have lower overhead than deterministic profilers and can observe native C frames.
Examples:
py-spy
scalene
Linux perf
macOS Instruments
Austin
Sampling is often better for production-like workloads because it perturbs the program less.
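The sampling idea can be sketched with a signal-based timer. This is a toy illustration, not a production profiler, and it assumes a Unix platform where SIGPROF and setitimer are available:

```python
import collections
import signal

samples = collections.Counter()

def take_sample(signum, frame):
    # Record which function was executing when the timer fired.
    samples[frame.f_code.co_name] += 1

def busy():
    total = 0
    for i in range(2_000_000):
        total += i
    return total

signal.signal(signal.SIGPROF, take_sample)
signal.setitimer(signal.ITIMER_PROF, 0.001, 0.001)  # sample every ~1 ms of CPU time
busy()
signal.setitimer(signal.ITIMER_PROF, 0, 0)  # stop sampling

print(samples.most_common(3))
```

Real sampling profilers such as py-spy work from outside the process and capture full stacks, but the estimation principle is the same: sample counts approximate time shares.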
80.6 Deterministic vs Sampling Profiling
| Method | Strength | Weakness |
|---|---|---|
| Deterministic profiling | Exact Python call counts | Higher overhead, mostly Python-level |
| Sampling profiling | Lower overhead, can see native stack | Approximate, depends on sample rate |
Use deterministic profiling when call counts matter.
Use sampling profiling when you need low overhead or native visibility.
For CPython internals work, sampling profilers are often more useful because many costs occur in C code inside the interpreter.
80.7 Wall Time vs CPU Time
Wall time is elapsed real time.
CPU time is processor time consumed by the program.
A program can have high wall time and low CPU time when waiting on:
network I/O
disk I/O
locks
sleep
subprocesses
database calls
A CPU-bound Python loop has high CPU time.
An I/O-bound server may spend most wall time waiting.
Choose the right metric. Optimizing Python bytecode will not fix a program that is blocked on a database query.
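The two clocks can be observed side by side with the standard library; a sleep consumes wall time but almost no CPU time:

```python
import time

start_wall = time.perf_counter()
start_cpu = time.process_time()

time.sleep(0.2)  # waiting: wall time passes, CPU time barely moves

wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

For an I/O-bound program, a large gap between wall and cpu is the signature of waiting.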
80.8 Profiling Function Calls
Function calls are a major CPython cost.
Example:
```python
def inc(x):
    return x + 1

total = 0
for i in range(1_000_000):
    total += inc(i)
```

cProfile will show inc called one million times.
If inc does little work, call overhead dominates.
Possible improvements:
inline tiny helper in hot path
batch work
move loop into C extension or vectorized library
reduce wrapper layers
The right choice depends on code clarity and measured impact.
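The overhead of a tiny helper can be measured directly; a sketch comparing a per-iteration call against the inlined arithmetic (function names are illustrative):

```python
import timeit

def inc(x):
    return x + 1

def with_helper():
    total = 0
    for i in range(100_000):
        total += inc(i)  # one Python call per iteration
    return total

def inlined():
    total = 0
    for i in range(100_000):
        total += i + 1  # same arithmetic, no call overhead
    return total

t_helper = timeit.timeit(with_helper, number=20)
t_inlined = timeit.timeit(inlined, number=20)
print(f"helper={t_helper:.3f}s inlined={t_inlined:.3f}s")
```

On typical CPython builds the inlined loop runs noticeably faster, which quantifies the call overhead before deciding whether the abstraction is worth keeping.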
80.9 Profiling Attribute Access
Attribute access usually does not appear as a separate Python function in cProfile.
Example:
```python
for item in items:
    total += item.price
```

The time may appear inside the containing function.
To inspect attribute access, use bytecode and focused microbenchmarks:
```python
import dis

def total_price(items):
    total = 0
    for item in items:
        total += item.price
    return total

dis.dis(total_price, adaptive=True, show_caches=True)
```

Then measure variants with pyperf or timeit.
Attribute access cost is often visible as high time in a loop body rather than as a named profiler entry.
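A focused microbenchmark of this kind might compare instance-dict attribute access against __slots__ access (class names here are illustrative):

```python
import timeit

class WithDict:
    def __init__(self):
        self.x = 1

class WithSlots:
    __slots__ = ("x",)
    def __init__(self):
        self.x = 1

d = WithDict()
s = WithSlots()

t_dict = timeit.timeit("d.x", globals={"d": d}, number=1_000_000)
t_slots = timeit.timeit("s.x", globals={"s": s}, number=1_000_000)
print(f"dict={t_dict:.3f}s slots={t_slots:.3f}s")
```

With adaptive specialization the gap may be small, which is itself useful information: do not assume the difference, measure it.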
80.10 Profiling Allocation
Allocation overhead can dominate Python workloads.
Example:
```python
rows = []
for item in source:
    rows.append({"id": item.id, "name": item.name})
```

This creates many dictionaries and strings.
Use tracemalloc to find allocation sites:
```python
import tracemalloc

tracemalloc.start()
run_workload()  # placeholder for the code under test
snapshot = tracemalloc.take_snapshot()

stats = snapshot.statistics("lineno")
for stat in stats[:20]:
    print(stat)
```

This reports where memory was allocated.
Allocation profiling is essential when runtime and memory grow together.
80.11 tracemalloc Snapshots
tracemalloc can compare snapshots.
```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()
run_workload()  # placeholder for the code under test
after = tracemalloc.take_snapshot()

diff = after.compare_to(before, "lineno")
for stat in diff[:20]:
    print(stat)
```

This shows allocation growth between two points.
Useful questions:
Which lines allocate the most memory?
Which allocations survive?
Did a change reduce allocation count?
Is memory retained unexpectedly?
tracemalloc tracks Python allocations, not all native memory from every C extension.
80.12 Profiling Garbage Collection
The cyclic garbage collector can affect latency and memory behavior.
You can inspect GC state:
```python
import gc

print(gc.get_count())
print(gc.get_stats())
```

You can attach callbacks:

```python
import gc
import time

def on_gc(phase, info):
    print(time.time(), phase, info)

gc.callbacks.append(on_gc)
```

This helps identify whether GC collections occur during latency-sensitive paths.
Most objects are reclaimed by reference counting. The cyclic GC matters mainly for container cycles and long-lived object graphs.
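A minimal demonstration of the callback mechanism forces a collection and records the events it produces:

```python
import gc

events = []

def on_gc(phase, info):
    # phase is "start" or "stop"; info includes the generation collected.
    events.append((phase, info["generation"]))

gc.callbacks.append(on_gc)
gc.collect()  # force a full (generation 2) collection so the callback runs
gc.callbacks.remove(on_gc)

print(events)
```

In a real service, the callback would record timestamps so collections can be correlated with latency spikes.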
80.13 Profiling Imports
Import time matters for CLIs, server cold starts, tests, and serverless functions.
CPython can report import timing:
```shell
python -X importtime -c "import your_package"
```

This prints nested import timing.
Common problems:
large imports at module import time
optional dependencies imported eagerly
expensive initialization during import
slow filesystem access
heavy type-checking imports at runtime
Improving import time often means delaying optional work until it is needed.
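The report can also be captured programmatically by running a child interpreter; -X importtime writes its output to stderr. Here the target is a stdlib module chosen for illustration:

```python
import subprocess
import sys

# Time the imports of a module in a fresh interpreter.
result = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import json"],
    capture_output=True,
    text=True,
)
print(result.stderr.splitlines()[-1])
```

Each report line shows self time and cumulative time per imported module, so the most expensive subtree is easy to spot.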
80.14 Profiling Startup
Startup includes:
interpreter initialization
site module loading
environment processing
imports
module-level code
Useful commands:

```shell
python -S script.py
python -X importtime script.py
```

-S skips automatic site import.
This helps distinguish interpreter startup cost from application import cost.
80.15 Profiling Native Code
Many CPython costs live below Python function level.
Examples:
dictionary lookup
attribute lookup
reference counting
bytecode dispatch
memory allocation
Unicode operations
regular expression engine
JSON parsing
C extension workPython-level profilers may show time inside one Python function but not expose internal C costs.
Native profilers such as Linux perf can sample C stacks:
```shell
perf record -g python script.py
perf report
```

This can reveal time in CPython functions such as dictionary lookup, the evaluation loop, Unicode operations, or allocator paths.
80.16 Profiling the Interpreter Itself
When profiling CPython internals, build choices matter.
A debug build changes performance:

```shell
./configure --with-pydebug
```

is useful for correctness debugging, but slower. A release build with symbols is usually better for profiling:

```shell
./configure CFLAGS="-O3 -g"
make
```

Then native profilers can resolve symbols while preserving realistic speed.
80.17 Debug Builds vs Release Builds
Debug builds add checks and change layout or behavior.
They are useful for:
assertions
reference leak checks
memory debugging
development diagnostics
But they distort performance.
Use debug builds to find correctness bugs.
Use release-like builds to measure speed.
80.18 Microbenchmarking
Microbenchmarks isolate a small operation.
Example:
```python
import timeit

setup = """
class C:
    def __init__(self):
        self.x = 1

obj = C()
"""

print(timeit.timeit("obj.x", setup=setup, number=10_000_000))
```

Microbenchmarks are useful for understanding mechanisms.
They are risky for product decisions because they may not represent real workload behavior.
Use them to answer narrow questions, such as:
is slot access faster here?
does this call shape allocate?
does this lookup specialize?
80.19 timeit
timeit reduces common timing mistakes.
It runs code many times and disables some noise sources.
Example:
```shell
python -m timeit 'x = 1 + 2'
```

For setup:

```shell
python -m timeit -s 'xs=list(range(1000))' 'sum(xs)'
```

timeit is good for quick local experiments.
For serious benchmarking, use pyperf.
80.20 pyperf
pyperf is designed for stable Python benchmarking.
It handles:
multiple processes
warmups
statistics
system noise reduction
JSON result storage
comparison tools
Example:

```shell
python -m pyperf timeit \
    -s 'xs=list(range(1000))' \
    'sum(xs)'
```

Compare results:

```shell
python -m pyperf compare_to before.json after.json
```

For CPython performance work, pyperf is the standard benchmark harness.
80.21 Warmup Effects
Modern CPython has adaptive specialization.
This means first execution may be slower than later execution.
A benchmark should account for warmup.
Bad benchmark:
```python
start = time.perf_counter()
f()
end = time.perf_counter()
```

Better approach:

```python
# Warm up first so adaptive specialization has settled.
for _ in range(1000):
    f()

start = time.perf_counter()
for _ in range(10_000):
    f()
end = time.perf_counter()
```

Tools such as pyperf handle warmup more carefully.
80.22 Avoiding Benchmark Noise
Performance measurements are noisy.
Sources of noise:
CPU frequency scaling
thermal throttling
background processes
memory pressure
disk cache state
network variability
debug builds
virtual machines
container limits
different Python versions
Good practice:
run multiple times
compare distributions
pin Python version
record hardware
avoid mixed workloads
prefer isolated machines
use pyperf for serious runs
Small differences require careful evidence.
80.23 Profiling Algorithmic Complexity
Profiling does not replace algorithm analysis.
A slow function may be slow because it has the wrong complexity.
Example:
```python
for x in xs:
    if x in ys:
        ...
```

If ys is a list, each membership test is a linear scan.
Changing ys to a set makes membership an average constant-time operation:

```python
ys_set = set(ys)
for x in xs:
    if x in ys_set:
        ...
```

This kind of improvement often dominates interpreter-level tuning.
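The gap is easy to measure; a sketch with sizes chosen for illustration:

```python
import timeit

ys = list(range(10_000))
ys_set = set(ys)
xs = list(range(0, 20_000, 7))  # roughly half the probes miss

# Same membership tests against a list and against a set.
t_list = timeit.timeit(lambda: [x in ys for x in xs], number=5)
t_set = timeit.timeit(lambda: [x in ys_set for x in xs], number=5)
print(f"list={t_list:.3f}s set={t_set:.3f}s")
```

As the container grows, the list version degrades linearly while the set version stays flat, which is why complexity fixes usually dwarf micro-optimizations.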
80.24 Profiling Dictionaries
Dictionary-heavy workloads show up in several forms:
high time in object construction
high time in JSON processing
high time in attribute-heavy loops
high memory use from many mappings
native profiler samples in dict lookup functions
Python-level profiling may only show the containing function.
Use focused experiments to test alternatives:
slots instead of instance dicts
tuples instead of small dict records
arrays or columns instead of row dicts
local variables for repeated lookup
80.25 Profiling Exceptions
Exceptions are expensive when raised frequently.
This pattern can be slow:
```python
try:
    value = d[key]
except KeyError:
    value = default
```

If missing keys are common, consider:

```python
value = d.get(key, default)
```

But if misses are rare, exception-based lookup may be fine.
Profile the actual distribution.
Exceptions also affect tracebacks and frame handling, so frequent exception construction can be costly.
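The miss-heavy case can be measured directly; a sketch with a key that always misses (names are illustrative):

```python
import timeit

d = {i: i for i in range(100)}
MISSING = 10_000  # a key that is never present

def via_exception(key):
    try:
        return d[key]
    except KeyError:
        return None

def via_get(key):
    return d.get(key)

# With a key that always misses, the exception path runs every time.
t_exc = timeit.timeit(lambda: via_exception(MISSING), number=100_000)
t_get = timeit.timeit(lambda: via_get(MISSING), number=100_000)
print(f"exception={t_exc:.3f}s get={t_get:.3f}s")
```

Rerunning with a key that always hits would show the try path competitive, which is why the actual hit/miss distribution should decide the pattern.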
80.26 Profiling Regular Expressions
Regular expressions execute mostly in C, so Python-level profilers may show time inside re calls.
Example:
```python
pattern.search(text)
```

If regex dominates, inspect:
pattern complexity
catastrophic backtracking
precompilation
input size
number of calls
Useful pattern:
```python
import re

pattern = re.compile(r"...")
for text in texts:
    if pattern.search(text):
        ...
```

Precompilation avoids repeated parse and compile work.
80.27 Profiling I/O
I/O-bound programs need different profiling.
Symptoms:
high wall time
low CPU time
many waits
database or network functions dominate
Use tools that show waiting, not only CPU.
For async programs, profile:
event loop latency
await points
network calls
connection pooling
DNS
serialization
backpressure
Optimizing Python bytecode will have little effect if most time is spent waiting on external systems.
80.28 Profiling Async Code
Async code spreads work across tasks.
Traditional call graphs may be harder to read because logical flow crosses await points.
Useful things to measure:
task count
slow callbacks
event loop lag
await durations
queue sizes
connection pool waits
For local inspection:
```python
import asyncio

asyncio.run(main(), debug=True)
```

Async profiling often requires application-level instrumentation in addition to profilers.
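Event loop lag, for instance, can be estimated by measuring how late sleeps complete; monitor_lag below is a made-up helper for illustration:

```python
import asyncio
import time

async def monitor_lag(interval=0.05, samples=5):
    # How much later than requested does each sleep return?
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lags.append(time.perf_counter() - start - interval)
    return lags

async def main():
    return await monitor_lag()

lags = asyncio.run(main())
print([f"{lag * 1000:.2f}ms" for lag in lags])
```

Sustained large lags indicate a callback or task is blocking the loop, which points toward CPU profiling of that code path.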
80.29 Profiling Threads
Threaded Python programs may be limited by:
I/O waits
lock contention
GIL serialization
native extension behavior
queue bottlenecks
CPU-bound Python threads usually do not run Python bytecode in parallel under the traditional GIL.
Profiling threaded code requires looking at:
per-thread stacks
wall time vs CPU time
lock wait time
native extension GIL release behavior
Sampling profilers are often better than deterministic profilers here.
80.30 Profiling Multiprocessing
Multiprocessing uses multiple processes.
Profiling must account for:
worker startup cost
serialization cost
IPC cost
data copying
process scheduling
load imbalance
A workload may be slow because it spends more time moving data than computing.
Measure workers and parent process separately when needed.
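Serialization cost can be estimated in isolation before blaming the workers; a rough sketch timing a pickle round trip of a stand-in payload:

```python
import pickle
import time

# A stand-in for data sent between the parent process and workers.
data = [{"id": i, "payload": "x" * 100} for i in range(10_000)]

start = time.perf_counter()
blob = pickle.dumps(data)
restored = pickle.loads(blob)
elapsed = time.perf_counter() - start

print(f"{len(blob)} bytes, round trip {elapsed * 1000:.1f} ms")
```

If this round trip takes a significant fraction of a task's compute time, the pool is moving data more than it is computing, and batching or shared memory may help.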
80.31 Flame Graphs
A flame graph visualizes stack samples.
Wide frames consume more sampled time.
This is useful for seeing:
hot call paths
unexpected wrappers
native extension cost
interpreter overhead
recursive behavior
A flame graph can reveal that the program spends time in places a flat profile hides.
Flat profiles answer “which functions.”
Flame graphs answer “which stacks.”
80.32 Profiling Memory Leaks
In CPython, many apparent leaks are retained references.
A true leak may be:
unreachable C memory not freed
reference cycle with finalizers
global cache growth
unbounded list or dict
C extension reference leak
Start with tracemalloc snapshots.
Then inspect object ownership:
```python
import gc

objs = gc.get_objects()
```

For C extension work, use debug builds and reference leak test tools.
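One coarse way to use gc.get_objects is to count tracked objects by type, then compare the counts before and after the suspect workload:

```python
import collections
import gc

# Count container objects currently tracked by the cyclic GC, by type name.
counts = collections.Counter(type(obj).__name__ for obj in gc.get_objects())
for name, n in counts.most_common(5):
    print(name, n)
```

A type whose count grows without bound across iterations is a strong lead for the retained-reference search.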
80.33 Reference Leak Checks
CPython development uses reference leak checks to detect incorrect reference ownership.
A reference leak means code increments a reference and fails to release it, or stores a reference unexpectedly.
In C extensions and CPython core code, this is serious.
Symptoms:
memory grows across repeated tests
objects never deallocate
reference counts increase unexpectedly
Reference leak hunting often requires debug builds and repeated test runs.
80.34 Profiling C Extensions
C extensions can dominate runtime while appearing as one Python call.
Example:
```python
result = extension.compute(data)
```

A Python-level profiler may show compute as expensive, but not the internal reason.
Use native profiling, extension-specific instrumentation, or debug builds.
Important questions:
does the extension release the GIL?
does it allocate many Python objects?
does it copy data unnecessarily?
does it use vectorcall or fastcall for hot callables?
does it convert data between formats repeatedly?
80.35 Profiling Specialized Bytecode
Modern CPython allows inspection of adaptive bytecode.
Example:

```python
import dis

class SomeObject:
    def __init__(self):
        self.x = 1

def f(obj):
    return obj.x + 1

# Run enough times for the adaptive interpreter to specialize.
for _ in range(10_000):
    f(SomeObject())

dis.dis(f, adaptive=True, show_caches=True)
```

Look for whether operations specialized.
Questions:
did LOAD_ATTR specialize?
did BINARY_OP specialize?
are caches present?
is the site unstable?
This connects bytecode-level behavior to performance observations.
80.36 Profiling With Counters
Sometimes you need explicit counters.
Example:
```python
hits = 0
misses = 0
for key in keys:  # keys and cache come from the application
    if key in cache:
        hits += 1
    else:
        misses += 1
```

Application-level counters can explain profiles better than tool output alone.
Measure domain facts:
cache hit rates
batch sizes
queue depths
number of objects
number of queries
number of retries
input sizes
Performance is often determined by workload shape.
80.37 Building a Profiling Workflow
A good workflow is:
define the workload
measure baseline
identify dominant cost
form one hypothesis
change one thing
measure again
compare results
keep or revert
Avoid changing several variables at once.
For CPython-level work, also record:
Python version
build flags
platform
CPU model
compiler
environment variables
benchmark command
Without this metadata, results are hard to reproduce.
80.38 Common Profiling Mistakes
| Mistake | Problem |
|---|---|
| Measuring one run | Too noisy |
| Ignoring warmup | Misses adaptive specialization |
| Using debug builds for speed claims | Distorts performance |
| Optimizing non-dominant code | No visible improvement |
| Confusing wall time and CPU time | Wrong diagnosis |
| Trusting microbenchmarks alone | May not match real workload |
| Ignoring allocation | Memory churn can dominate |
| Ignoring native code | Python profiler may hide C cost |
Profiling is measurement discipline, not just tool usage.
80.39 Reading CPython Profiling Paths
Relevant source areas:
| File | Purpose |
|---|---|
| Modules/_lsprof.c | cProfile implementation |
| Lib/profile.py | Pure Python profiler |
| Lib/pstats.py | Profile statistics formatting |
| Lib/tracemalloc.py | Python interface to allocation tracing |
| Python/ceval.c | Trace and profile hook integration |
| Python/sysmodule.c | sys.setprofile, sys.settrace interfaces |
| Modules/gcmodule.c | GC statistics and callbacks |
Profiling hooks are wired into interpreter execution, especially calls, returns, exceptions, and line events.
80.40 Mental Model
A useful model:
Profiling maps symptoms to mechanisms.
Slow code is the symptom.
The mechanism may be:
too many calls
too many objects
poor data layout
slow lookup
I/O waiting
bad complexity
native extension cost
GC pressure
A good profile narrows the search space.
A good benchmark verifies the fix.
80.41 Chapter Summary
Profiling CPython requires measuring at the right layer.
Use cProfile for Python call costs, sampling profilers for native and low-overhead views, tracemalloc for allocation behavior, import timing for startup, and pyperf for reliable benchmark comparisons.
Performance work should proceed from evidence:
measure baseline
identify dominant cost
explain mechanism
make targeted change
measure again
CPython performance depends on interpreter dispatch, object layout, dictionaries, attribute lookup, function calls, allocation, and memory locality. Profiling is how those mechanisms become visible.