Using perf, Instruments, and py-spy to profile CPython; reading perf maps generated by the JIT.
Profiling CPython means measuring where time, memory, allocations, and interpreter work are spent. It is the practical bridge between a performance suspicion and a performance explanation.
A Python program can be slow for many different reasons:
too many Python function calls
slow attribute lookup
large dictionary workloads
excessive allocation
poor memory locality
I/O waiting
bad algorithmic complexity
C extension overhead
lock contention
garbage collection pauses
Profiling separates these cases.
Without profiling, performance work often becomes guesswork.
80.1 What Profiling Measures
Profiling can measure several different layers.
| Layer | Measures | Typical tools |
|---|---|---|
| Python function level | Which Python functions consume time | cProfile, profile, pstats |
| Line level | Which source lines consume time | line_profiler, tracing tools |
| Allocation level | Where Python allocations happen | tracemalloc |
| Native level | C functions, CPU samples, cache misses | perf, Instruments, VTune |
| Event level | GC events, imports, calls, exceptions | audit hooks, tracing hooks |
| Benchmark level | Stable timing across runs | pyperf, timeit |
Different tools answer different questions. A single profile rarely explains everything.
80.2 Deterministic Profiling
Deterministic profiling records events as they happen.
For Python code, this usually means tracking:
function call
function return
exception
cProfile is the standard deterministic profiler.
Example:
```python
import cProfile

def work():
    total = 0
    for i in range(1_000_000):
        total += i
    return total

cProfile.run("work()")
```

The profiler records function-level execution counts and time.
Deterministic profiling gives precise call counts, but adds overhead because it observes every function event.
80.3 cProfile
cProfile is implemented in C and has lower overhead than the pure Python profile module.
A typical workflow:
```python
import cProfile
import pstats

prof = cProfile.Profile()
prof.enable()
work()
prof.disable()

stats = pstats.Stats(prof)
stats.sort_stats("cumtime")
stats.print_stats(20)
```

Common columns:
| Column | Meaning |
|---|---|
| ncalls | Number of calls |
| tottime | Time spent in the function body, excluding subcalls |
| percall | Time per call |
| cumtime | Time spent in the function, including subcalls |
| filename:lineno(function) | Function identity |
cumtime finds high-level expensive operations. tottime finds functions whose own body is expensive.
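The distinction can be demonstrated directly. The following sketch (leaf and wrapper are illustrative names) profiles a wrapper whose cumtime is dominated by its callee:

```python
import cProfile
import io
import pstats

def leaf():
    # The actual work happens here, so leaf accumulates tottime.
    return sum(range(200_000))

def wrapper():
    # wrapper mostly delegates: high cumtime, low tottime.
    return leaf()

prof = cProfile.Profile()
prof.enable()
wrapper()
prof.disable()

out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats("cumtime").print_stats()
print(out.getvalue())
```

In the printed report, wrapper and leaf show similar cumtime, but the tottime belongs to leaf.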
80.4 Interpreting cProfile
Consider this simplified output:
```
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1000000   0.210    0.000    0.210    0.000 app.py:10(parse_item)
      1   0.020    0.020    0.310    0.310 app.py:20(load_all)
```

The function parse_item is called one million times. Even if each call is small, the aggregate cost is large.
Common interpretations:
| Symptom | Likely issue |
|---|---|
| Huge ncalls | Call overhead or too fine-grained functions |
| High tottime | Expensive function body |
| High cumtime, low tottime | Expensive callees |
| Builtins dominating | Data structure or C-level operation cost |
| Import machinery dominating | Startup or import-time work |
The profiler reports where time went, not why it went there. You still need source-level reasoning.
80.5 Sampling Profiling
Sampling profilers interrupt the program periodically and record where it is.
Instead of tracking every call, they repeatedly ask: where is the program right now?
If 40 percent of samples occur in one function, that function likely consumes about 40 percent of CPU time.
Sampling profilers have lower overhead than deterministic profilers and can observe native C frames.
Examples:
py-spy
scalene
Linux perf
macOS Instruments
Austin
Sampling is often better for production-like workloads because it perturbs the program less.
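The sampling idea can be sketched with a signal-based timer. This is a toy illustration, not a production profiler, and it assumes a Unix platform where SIGPROF and setitimer are available:

```python
import collections
import signal

samples = collections.Counter()

def take_sample(signum, frame):
    # Record which function was executing when the timer fired.
    samples[frame.f_code.co_name] += 1

def busy():
    total = 0
    for i in range(2_000_000):
        total += i
    return total

signal.signal(signal.SIGPROF, take_sample)
signal.setitimer(signal.ITIMER_PROF, 0.001, 0.001)  # sample every ~1 ms of CPU time
busy()
signal.setitimer(signal.ITIMER_PROF, 0, 0)  # stop sampling

print(samples.most_common(3))
```

Real sampling profilers such as py-spy work from outside the process and capture full stacks, but the estimation principle is the same: sample counts approximate time shares.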
80.6 Deterministic vs Sampling Profiling
| Method | Strength | Weakness |
|---|---|---|
| Deterministic profiling | Exact Python call counts | Higher overhead, mostly Python-level |
| Sampling profiling | Lower overhead, can see native stack | Approximate, depends on sample rate |
Use deterministic profiling when call counts matter.
Use sampling profiling when you need low overhead or native visibility.
For CPython internals work, sampling profilers are often more useful because many costs occur in C code inside the interpreter.
80.7 Wall Time vs CPU Time
Wall time is elapsed real time.
CPU time is processor time consumed by the program.
A program can have high wall time and low CPU time when waiting on:
network I/O
disk I/O
locks
sleep
subprocesses
database calls
A CPU-bound Python loop has high CPU time.
An I/O-bound server may spend most wall time waiting.
Choose the right metric. Optimizing Python bytecode will not fix a program that is blocked on a database query.
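The two clocks can be observed side by side with the standard library; a sleep consumes wall time but almost no CPU time:

```python
import time

start_wall = time.perf_counter()
start_cpu = time.process_time()

time.sleep(0.2)  # waiting: wall time passes, CPU time barely moves

wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

For an I/O-bound program, a large gap between wall and cpu is the signature of waiting.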
80.8 Profiling Function Calls
Function calls are a major CPython cost.
Example:
```python
def inc(x):
    return x + 1

total = 0
for i in range(1_000_000):
    total += inc(i)
```

cProfile will show inc called one million times.
If inc does little work, call overhead dominates.
Possible improvements:
inline tiny helper in hot path
batch work
move loop into C extension or vectorized library
reduce wrapper layers
The right choice depends on code clarity and measured impact.
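The overhead of a tiny helper can be measured directly; a sketch comparing a per-iteration call against the inlined arithmetic (function names are illustrative):

```python
import timeit

def inc(x):
    return x + 1

def with_helper():
    total = 0
    for i in range(100_000):
        total += inc(i)  # one Python call per iteration
    return total

def inlined():
    total = 0
    for i in range(100_000):
        total += i + 1  # same arithmetic, no call overhead
    return total

t_helper = timeit.timeit(with_helper, number=20)
t_inlined = timeit.timeit(inlined, number=20)
print(f"helper={t_helper:.3f}s inlined={t_inlined:.3f}s")
```

On typical CPython builds the inlined loop runs noticeably faster, which quantifies the call overhead before deciding whether the abstraction is worth keeping.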
80.9 Profiling Attribute Access
Attribute access usually does not appear as a separate Python function in cProfile.
Example:
```python
for item in items:
    total += item.price
```

The time may appear inside the containing function.
To inspect attribute access, use bytecode and focused microbenchmarks:
```python
import dis

def total_price(items):
    total = 0
    for item in items:
        total += item.price
    return total

dis.dis(total_price, adaptive=True, show_caches=True)
```

Then measure variants with pyperf or timeit.
Attribute access cost is often visible as high time in a loop body rather than as a named profiler entry.
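A focused microbenchmark of this kind might compare instance-dict attribute access against __slots__ access (class names here are illustrative):

```python
import timeit

class WithDict:
    def __init__(self):
        self.x = 1

class WithSlots:
    __slots__ = ("x",)
    def __init__(self):
        self.x = 1

d = WithDict()
s = WithSlots()

t_dict = timeit.timeit("d.x", globals={"d": d}, number=1_000_000)
t_slots = timeit.timeit("s.x", globals={"s": s}, number=1_000_000)
print(f"dict={t_dict:.3f}s slots={t_slots:.3f}s")
```

With adaptive specialization the gap may be small, which is itself useful information: do not assume the difference, measure it.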
80.10 Profiling Allocation
Allocation overhead can dominate Python workloads.
Example:
```python
rows = []
for item in source:
    rows.append({"id": item.id, "name": item.name})
```

This creates many dictionaries and strings.
Use tracemalloc to find allocation sites:
```python
import tracemalloc

tracemalloc.start()
run_workload()  # placeholder for the code under test
snapshot = tracemalloc.take_snapshot()

stats = snapshot.statistics("lineno")
for stat in stats[:20]:
    print(stat)
```

This reports where memory was allocated.
Allocation profiling is essential when runtime and memory grow together.
80.11 tracemalloc Snapshots
tracemalloc can compare snapshots.
```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()
run_workload()  # placeholder for the code under test
after = tracemalloc.take_snapshot()

diff = after.compare_to(before, "lineno")
for stat in diff[:20]:
    print(stat)
```

This shows allocation growth between two points.
Useful questions:
Which lines allocate the most memory?
Which allocations survive?
Did a change reduce allocation count?
Is memory retained unexpectedly?
tracemalloc tracks Python allocations, not all native memory from every C extension.
80.12 Profiling Garbage Collection
The cyclic garbage collector can affect latency and memory behavior.
You can inspect GC state:
```python
import gc

print(gc.get_count())
print(gc.get_stats())
```

You can attach callbacks:

```python
import gc
import time

def on_gc(phase, info):
    print(time.time(), phase, info)

gc.callbacks.append(on_gc)
```

This helps identify whether GC collections occur during latency-sensitive paths.
Most objects are reclaimed by reference counting. The cyclic GC matters mainly for container cycles and long-lived object graphs.
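A minimal demonstration of the callback mechanism forces a collection and records the events it produces:

```python
import gc

events = []

def on_gc(phase, info):
    # phase is "start" or "stop"; info includes the generation collected.
    events.append((phase, info["generation"]))

gc.callbacks.append(on_gc)
gc.collect()  # force a full (generation 2) collection so the callback runs
gc.callbacks.remove(on_gc)

print(events)
```

In a real service, the callback would record timestamps so collections can be correlated with latency spikes.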
80.13 Profiling Imports
Import time matters for CLIs, server cold starts, tests, and serverless functions.
CPython can report import timing:
```shell
python -X importtime -c "import your_package"
```

This prints nested import timing.
Common problems:
large imports at module import time
optional dependencies imported eagerly
expensive initialization during import
slow filesystem access
heavy type-checking imports at runtime
Improving import time often means delaying optional work until it is needed.
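The report can also be captured programmatically by running a child interpreter; -X importtime writes its output to stderr. Here the target is a stdlib module chosen for illustration:

```python
import subprocess
import sys

# Time the imports of a module in a fresh interpreter.
result = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import json"],
    capture_output=True,
    text=True,
)
print(result.stderr.splitlines()[-1])
```

Each report line shows self time and cumulative time per imported module, so the most expensive subtree is easy to spot.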
80.14 Profiling Startup
Startup includes:
interpreter initialization
site module loading
environment processing
imports
module-level code
Useful commands:

```shell
python -S script.py
python -X importtime script.py
```

-S skips automatic site import.
This helps distinguish interpreter startup cost from application import cost.
80.15 Profiling Native Code
Many CPython costs live below Python function level.
Examples:
dictionary lookup
attribute lookup
reference counting
bytecode dispatch
memory allocation
Unicode operations
regular expression engine
JSON parsing
C extension workPython-level profilers may show time inside one Python function but not expose internal C costs.
Native profilers such as Linux perf can sample C stacks:
```shell
perf record -g python script.py
perf report
```

This can reveal time in CPython functions such as dictionary lookup, the evaluation loop, Unicode operations, or allocator paths.
80.16 Profiling the Interpreter Itself
When profiling CPython internals, build choices matter.
A debug build changes performance:

```shell
./configure --with-pydebug
```

is useful for correctness debugging, but slower. A release build with symbols is usually better for profiling:

```shell
./configure CFLAGS="-O3 -g"
make
```

Then native profilers can resolve symbols while preserving realistic speed.
80.17 Debug Builds vs Release Builds
Debug builds add checks and change layout or behavior.
They are useful for:
assertions
reference leak checks
memory debugging
development diagnostics
But they distort performance.
Use debug builds to find correctness bugs.
Use release-like builds to measure speed.
80.18 Microbenchmarking
Microbenchmarks isolate a small operation.
Example:
```python
import timeit

setup = """
class C:
    def __init__(self):
        self.x = 1

obj = C()
"""

print(timeit.timeit("obj.x", setup=setup, number=10_000_000))
```

Microbenchmarks are useful for understanding mechanisms.
They are risky for product decisions because they may not represent real workload behavior.
Use them to answer narrow questions, such as:
is slot access faster here?
does this call shape allocate?
does this lookup specialize?
80.19 timeit
timeit reduces common timing mistakes.
It runs code many times and disables some noise sources.
Example:
```shell
python -m timeit 'x = 1 + 2'
```

For setup:

```shell
python -m timeit -s 'xs=list(range(1000))' 'sum(xs)'
```

timeit is good for quick local experiments.
For serious benchmarking, use pyperf.
80.20 pyperf
pyperf is designed for stable Python benchmarking.
It handles:
multiple processes
warmups
statistics
system noise reduction
JSON result storage
comparison tools
Example:

```shell
python -m pyperf timeit \
    -s 'xs=list(range(1000))' \
    'sum(xs)'
```

Compare results:

```shell
python -m pyperf compare_to before.json after.json
```

For CPython performance work, pyperf is the standard benchmark harness.
80.21 Warmup Effects
Modern CPython has adaptive specialization.
This means first execution may be slower than later execution.
A benchmark should account for warmup.
Bad benchmark:
```python
start = time.perf_counter()
f()
end = time.perf_counter()
```

Better approach:

```python
# Warm up first so adaptive specialization has settled.
for _ in range(1000):
    f()

start = time.perf_counter()
for _ in range(10_000):
    f()
end = time.perf_counter()
```

Tools such as pyperf handle warmup more carefully.
80.22 Avoiding Benchmark Noise
Performance measurements are noisy.
Sources of noise:
CPU frequency scaling
thermal throttling
background processes
memory pressure
disk cache state
network variability
debug builds
virtual machines
container limits
different Python versions
Good practice:
run multiple times
compare distributions
pin Python version
record hardware
avoid mixed workloads
prefer isolated machines
use pyperf for serious runs
Small differences require careful evidence.
80.23 Profiling Algorithmic Complexity
Profiling does not replace algorithm analysis.
A slow function may be slow because it has the wrong complexity.
Example:
```python
for x in xs:
    if x in ys:
        ...
```

If ys is a list, each membership test is a linear scan.
Changing ys to a set makes membership an average constant-time operation:

```python
ys_set = set(ys)
for x in xs:
    if x in ys_set:
        ...
```

This kind of improvement often dominates interpreter-level tuning.
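The gap is easy to measure; a sketch with sizes chosen for illustration:

```python
import timeit

ys = list(range(10_000))
ys_set = set(ys)
xs = list(range(0, 20_000, 7))  # roughly half the probes miss

# Same membership tests against a list and against a set.
t_list = timeit.timeit(lambda: [x in ys for x in xs], number=5)
t_set = timeit.timeit(lambda: [x in ys_set for x in xs], number=5)
print(f"list={t_list:.3f}s set={t_set:.3f}s")
```

As the container grows, the list version degrades linearly while the set version stays flat, which is why complexity fixes usually dwarf micro-optimizations.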
80.24 Profiling Dictionaries
Dictionary-heavy workloads show up in several forms:
high time in object construction
high time in JSON processing
high time in attribute-heavy loops
high memory use from many mappings
native profiler samples in dict lookup functions
Python-level profiling may only show the containing function.
Use focused experiments to test alternatives:
slots instead of instance dicts
tuples instead of small dict records
arrays or columns instead of row dicts
local variables for repeated lookup
80.25 Profiling Exceptions
Exceptions are expensive when raised frequently.
This pattern can be slow:
```python
try:
    value = d[key]
except KeyError:
    value = default
```

If missing keys are common, consider:

```python
value = d.get(key, default)
```

But if misses are rare, exception-based lookup may be fine.
Profile the actual distribution.
Exceptions also affect tracebacks and frame handling, so frequent exception construction can be costly.
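The miss-heavy case can be measured directly; a sketch with a key that always misses (names are illustrative):

```python
import timeit

d = {i: i for i in range(100)}
MISSING = 10_000  # a key that is never present

def via_exception(key):
    try:
        return d[key]
    except KeyError:
        return None

def via_get(key):
    return d.get(key)

# With a key that always misses, the exception path runs every time.
t_exc = timeit.timeit(lambda: via_exception(MISSING), number=100_000)
t_get = timeit.timeit(lambda: via_get(MISSING), number=100_000)
print(f"exception={t_exc:.3f}s get={t_get:.3f}s")
```

Rerunning with a key that always hits would show the try path competitive, which is why the actual hit/miss distribution should decide the pattern.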
80.26 Profiling Regular Expressions
Regular expressions execute mostly in C, so Python-level profilers may show time inside re calls.
Example:
```python
pattern.search(text)
```

If regex dominates, inspect:
pattern complexity
catastrophic backtracking
precompilation
input size
number of calls
Useful pattern:
```python
import re

pattern = re.compile(r"...")
for text in texts:
    if pattern.search(text):
        ...
```

Precompilation avoids repeated parse and compile work.
80.27 Profiling I/O
I/O-bound programs need different profiling.
Symptoms:
high wall time
low CPU time
many waits
database or network functions dominate
Use tools that show waiting, not only CPU.
For async programs, profile:
event loop latency
await points
network calls
connection pooling
DNS
serialization
backpressure
Optimizing Python bytecode will have little effect if most time is spent waiting on external systems.
80.28 Profiling Async Code
Async code spreads work across tasks.
Traditional call graphs may be harder to read because logical flow crosses await points.
Useful things to measure:
task count
slow callbacks
event loop lag
await durations
queue sizes
connection pool waits
For local inspection:
```python
import asyncio

asyncio.run(main(), debug=True)
```

Async profiling often requires application-level instrumentation in addition to profilers.
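Event loop lag, for instance, can be estimated by measuring how late sleeps complete; monitor_lag below is a made-up helper for illustration:

```python
import asyncio
import time

async def monitor_lag(interval=0.05, samples=5):
    # How much later than requested does each sleep return?
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lags.append(time.perf_counter() - start - interval)
    return lags

async def main():
    return await monitor_lag()

lags = asyncio.run(main())
print([f"{lag * 1000:.2f}ms" for lag in lags])
```

Sustained large lags indicate a callback or task is blocking the loop, which points toward CPU profiling of that code path.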
80.29 Profiling Threads
Threaded Python programs may be limited by:
I/O waits
lock contention
GIL serialization
native extension behavior
queue bottlenecks
CPU-bound Python threads usually do not run Python bytecode in parallel under the traditional GIL.
Profiling threaded code requires looking at:
per-thread stacks
wall time vs CPU time
lock wait time
native extension GIL release behavior
Sampling profilers are often better than deterministic profilers here.
80.30 Profiling Multiprocessing
Multiprocessing uses multiple processes.
Profiling must account for:
worker startup cost
serialization cost
IPC cost
data copying
process scheduling
load imbalance
A workload may be slow because it spends more time moving data than computing.
Measure workers and parent process separately when needed.
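Serialization cost can be estimated in isolation before blaming the workers; a rough sketch timing a pickle round trip of a stand-in payload:

```python
import pickle
import time

# A stand-in for data sent between the parent process and workers.
data = [{"id": i, "payload": "x" * 100} for i in range(10_000)]

start = time.perf_counter()
blob = pickle.dumps(data)
restored = pickle.loads(blob)
elapsed = time.perf_counter() - start

print(f"{len(blob)} bytes, round trip {elapsed * 1000:.1f} ms")
```

If this round trip takes a significant fraction of a task's compute time, the pool is moving data more than it is computing, and batching or shared memory may help.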
80.31 Flame Graphs
A flame graph visualizes stack samples.
Wide frames consume more sampled time.
This is useful for seeing:
hot call paths
unexpected wrappers
native extension cost
interpreter overhead
recursive behavior
A flame graph can reveal that the program spends time in places a flat profile hides.
Flat profiles answer “which functions.”
Flame graphs answer “which stacks.”
80.32 Profiling Memory Leaks
In CPython, many apparent leaks are retained references.
A true leak may be:
unreachable C memory not freed
reference cycle with finalizers
global cache growth
unbounded list or dict
C extension reference leak
Start with tracemalloc snapshots.
Then inspect object ownership:
```python
import gc

objs = gc.get_objects()
```

For C extension work, use debug builds and reference leak test tools.
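One coarse way to use gc.get_objects is to count tracked objects by type, then compare the counts before and after the suspect workload:

```python
import collections
import gc

# Count container objects currently tracked by the cyclic GC, by type name.
counts = collections.Counter(type(obj).__name__ for obj in gc.get_objects())
for name, n in counts.most_common(5):
    print(name, n)
```

A type whose count grows without bound across iterations is a strong lead for the retained-reference search.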
80.33 Reference Leak Checks
CPython development uses reference leak checks to detect incorrect reference ownership.
A reference leak means code increments a reference and fails to release it, or stores a reference unexpectedly.
In C extensions and CPython core code, this is serious.
Symptoms:
memory grows across repeated tests
objects never deallocate
reference counts increase unexpectedly
Reference leak hunting often requires debug builds and repeated test runs.
80.34 Profiling C Extensions
C extensions can dominate runtime while appearing as one Python call.
Example:
```python
result = extension.compute(data)
```

A Python-level profiler may show compute as expensive, but not the internal reason.
Use native profiling, extension-specific instrumentation, or debug builds.
Important questions:
does the extension release the GIL?
does it allocate many Python objects?
does it copy data unnecessarily?
does it use vectorcall or fastcall for hot callables?
does it convert data between formats repeatedly?
80.35 Profiling Specialized Bytecode
Modern CPython allows inspection of adaptive bytecode.
Example:

```python
import dis

class SomeObject:
    def __init__(self):
        self.x = 1

def f(obj):
    return obj.x + 1

# Run enough times for the adaptive interpreter to specialize.
for _ in range(10_000):
    f(SomeObject())

dis.dis(f, adaptive=True, show_caches=True)
```

Look for whether operations specialized.
Questions:
did LOAD_ATTR specialize?
did BINARY_OP specialize?
are caches present?
is the site unstable?
This connects bytecode-level behavior to performance observations.
80.36 Profiling With Counters
Sometimes you need explicit counters.
Example:
```python
hits = 0
misses = 0
for key in keys:  # keys and cache come from the application
    if key in cache:
        hits += 1
    else:
        misses += 1
```

Application-level counters can explain profiles better than tool output alone.
Measure domain facts:
cache hit rates
batch sizes
queue depths
number of objects
number of queries
number of retries
input sizes
Performance is often determined by workload shape.
80.37 Building a Profiling Workflow
A good workflow is:
define the workload
measure baseline
identify dominant cost
form one hypothesis
change one thing
measure again
compare results
keep or revert
Avoid changing several variables at once.
For CPython-level work, also record:
Python version
build flags
platform
CPU model
compiler
environment variables
benchmark command
Without this metadata, results are hard to reproduce.
80.38 Common Profiling Mistakes
| Mistake | Problem |
|---|---|
| Measuring one run | Too noisy |
| Ignoring warmup | Misses adaptive specialization |
| Using debug builds for speed claims | Distorts performance |
| Optimizing non-dominant code | No visible improvement |
| Confusing wall time and CPU time | Wrong diagnosis |
| Trusting microbenchmarks alone | May not match real workload |
| Ignoring allocation | Memory churn can dominate |
| Ignoring native code | Python profiler may hide C cost |
Profiling is measurement discipline, not just tool usage.
80.39 Reading CPython Profiling Paths
Relevant source areas:
| File | Purpose |
|---|---|
| Modules/_lsprof.c | cProfile implementation |
| Lib/profile.py | Pure Python profiler |
| Lib/pstats.py | Profile statistics formatting |
| Lib/tracemalloc.py | Python interface to allocation tracing |
| Python/ceval.c | Trace and profile hook integration |
| Python/sysmodule.c | sys.setprofile, sys.settrace interfaces |
| Modules/gcmodule.c | GC statistics and callbacks |
Profiling hooks are wired into interpreter execution, especially calls, returns, exceptions, and line events.
80.40 Mental Model
A useful model:
Profiling maps symptoms to mechanisms.
Slow code is the symptom.
The mechanism may be:
too many calls
too many objects
poor data layout
slow lookup
I/O waiting
bad complexity
native extension cost
GC pressure
A good profile narrows the search space.
A good benchmark verifies the fix.
80.41 Chapter Summary
Profiling CPython requires measuring at the right layer.
Use cProfile for Python call costs, sampling profilers for native and low-overhead views, tracemalloc for allocation behavior, import timing for startup, and pyperf for reliable benchmark comparisons.
Performance work should proceed from evidence:
measure baseline
identify dominant cost
explain mechanism
make targeted change
measure again
CPython performance depends on interpreter dispatch, object layout, dictionaries, attribute lookup, function calls, allocation, and memory locality. Profiling is how those mechanisms become visible.