# 80. Profiling CPython

Profiling CPython means measuring where time, memory, allocations, and interpreter work are spent. It is the practical bridge between a performance suspicion and a performance explanation.

A Python program can be slow for many different reasons:

```text
too many Python function calls
slow attribute lookup
large dictionary workloads
excessive allocation
poor memory locality
I/O waiting
bad algorithmic complexity
C extension overhead
lock contention
garbage collection pauses
```

Profiling separates these cases.

Without profiling, performance work often becomes guesswork.

## 80.1 What Profiling Measures

Profiling can measure several different layers.

| Layer | Measures | Typical tools |
|---|---|---|
| Python function level | Which Python functions consume time | `cProfile`, `profile`, `pstats` |
| Line level | Which source lines consume time | `line_profiler`, tracing tools |
| Allocation level | Where Python allocations happen | `tracemalloc` |
| Native level | C functions, CPU samples, cache misses | `perf`, Instruments, VTune |
| Event level | GC events, imports, calls, exceptions | audit hooks, tracing hooks |
| Benchmark level | Stable timing across runs | `pyperf`, `timeit` |

Different tools answer different questions. A single profile rarely explains everything.

## 80.2 Deterministic Profiling

Deterministic profiling records events as they happen.

For Python code, this usually means tracking:

```text
function call
function return
exception
```

`cProfile` is the standard deterministic profiler.

Example:

```python
import cProfile

def work():
    total = 0
    for i in range(1_000_000):
        total += i
    return total

cProfile.run("work()")
```

The profiler records function-level execution counts and time.

Deterministic profiling gives precise call counts, but adds overhead because it observes every function event.

## 80.3 `cProfile`

`cProfile` is implemented in C and has lower overhead than the pure Python `profile` module.

A typical workflow:

```python
import cProfile
import pstats

prof = cProfile.Profile()
prof.enable()

work()

prof.disable()

stats = pstats.Stats(prof)
stats.sort_stats("cumtime")
stats.print_stats(20)
```

Common columns:

| Column | Meaning |
|---|---|
| `ncalls` | Number of calls |
| `tottime` | Time spent in the function body, excluding subcalls |
| `percall` | Time per call |
| `cumtime` | Time spent in the function including subcalls |
| `filename:lineno(function)` | Function identity |

`cumtime` finds high-level expensive operations. `tottime` finds functions whose own body is expensive.
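The same profiler is also available from the command line, which is often the quickest starting point (the script name is illustrative):

```bash
# Profile a whole script, sorted by cumulative time
python -m cProfile -s cumtime script.py

# Or save raw stats for later analysis with pstats
python -m cProfile -o out.prof script.py
```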

## 80.4 Interpreting `cProfile`

Consider this simplified output:

```text
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
1000000 0.210    0.000    0.210    0.000    app.py:10(parse_item)
1       0.020    0.020    0.310    0.310    app.py:20(load_all)
```

The function `parse_item` is called one million times. Even if each call is cheap, the aggregate cost is large.

Common interpretations:

| Symptom | Likely issue |
|---|---|
| Huge `ncalls` | Call overhead or too fine-grained functions |
| High `tottime` | Expensive function body |
| High `cumtime`, low `tottime` | Expensive callees |
| Builtins dominating | Data structure or C-level operation cost |
| Import machinery dominating | Startup or import-time work |

The profiler reports where time went, not why it went there. You still need source-level reasoning.

## 80.5 Sampling Profiling

Sampling profilers interrupt the program periodically and record where it is.

Instead of tracking every call, they repeatedly ask one question:

```text
where is the program right now?
```

If 40 percent of samples occur in one function, that function likely consumes about 40 percent of CPU time.

Sampling profilers have lower overhead than deterministic profilers and can observe native C frames.

Examples:

```text
py-spy
scalene
Linux perf
macOS Instruments
Austin
```

Sampling is often better for production-like workloads because it perturbs the program less.
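For example, `py-spy` can attach to a running process and show a live, sampled view (assuming `py-spy` is installed; the PID is illustrative):

```bash
# Live top-like view of where a running Python process spends time
py-spy top --pid 1234
```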

## 80.6 Deterministic vs Sampling Profiling

| Method | Strength | Weakness |
|---|---|---|
| Deterministic profiling | Exact Python call counts | Higher overhead, mostly Python-level |
| Sampling profiling | Lower overhead, can see native stack | Approximate, depends on sample rate |

Use deterministic profiling when call counts matter.

Use sampling profiling when you need low overhead or native visibility.

For CPython internals work, sampling profilers are often more useful because many costs occur in C code inside the interpreter.

## 80.7 Wall Time vs CPU Time

Wall time is elapsed real time.

CPU time is processor time consumed by the program.

A program can have high wall time and low CPU time when waiting on:

```text
network I/O
disk I/O
locks
sleep
subprocesses
database calls
```

A CPU-bound Python loop has high CPU time.

An I/O-bound server may spend most wall time waiting.

Choose the right metric. Optimizing Python bytecode will not fix a program that is blocked on a database query.
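A quick way to see the difference is to compare `time.perf_counter` (wall time) with `time.process_time` (CPU time) around a blocking call; a minimal sketch:

```python
import time

start_wall = time.perf_counter()
start_cpu = time.process_time()

time.sleep(1.0)  # blocked: wall time advances, CPU time barely does

print(f"wall: {time.perf_counter() - start_wall:.2f}s")
print(f"cpu:  {time.process_time() - start_cpu:.2f}s")
```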

## 80.8 Profiling Function Calls

Function calls are a major CPython cost.

Example:

```python
def inc(x):
    return x + 1

total = 0
for i in range(1_000_000):
    total += inc(i)
```

`cProfile` will show `inc` called one million times.

If `inc` does little work, call overhead dominates.

Possible improvements:

```text
inline tiny helper in hot path
batch work
move loop into C extension or vectorized library
reduce wrapper layers
```

The right choice depends on code clarity and measured impact.
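As a sketch of the first option, the helper from the example above can be inlined, removing one million call frames:

```python
# Inlined: the body of inc() now runs directly in the loop,
# trading a little clarity for fewer call frames.
total = 0
for i in range(1_000_000):
    total += i + 1
```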

## 80.9 Profiling Attribute Access

Attribute access usually does not appear as a separate Python function in `cProfile`.

Example:

```python
for item in items:
    total += item.price
```

The time may appear inside the containing function.

To inspect attribute access, use bytecode and focused microbenchmarks:

```python
import dis

def total_price(items):
    total = 0
    for item in items:
        total += item.price
    return total

dis.dis(total_price, adaptive=True, show_caches=True)
```

Then measure variants with `pyperf` or `timeit`.

Attribute access cost is often visible as high time in a loop body rather than as a named profiler entry.

## 80.10 Profiling Allocation

Allocation overhead can dominate Python workloads.

Example:

```python
rows = []
for item in source:
    rows.append({"id": item.id, "name": item.name})
```

This creates many dictionaries and strings.

Use `tracemalloc` to find allocation sites:

```python
import tracemalloc

tracemalloc.start()

run_workload()

snapshot = tracemalloc.take_snapshot()
stats = snapshot.statistics("lineno")

for stat in stats[:20]:
    print(stat)
```

This reports where memory was allocated.

Allocation profiling is essential when runtime and memory grow together.

## 80.11 `tracemalloc` Snapshots

`tracemalloc` can compare snapshots.

```python
import tracemalloc

tracemalloc.start()

before = tracemalloc.take_snapshot()
run_workload()
after = tracemalloc.take_snapshot()

diff = after.compare_to(before, "lineno")

for stat in diff[:20]:
    print(stat)
```

This shows allocation growth between two points.

Useful questions:

```text
Which lines allocate the most memory?
Which allocations survive?
Did a change reduce allocation count?
Is memory retained unexpectedly?
```

`tracemalloc` tracks Python allocations, not all native memory from every C extension.

## 80.12 Profiling Garbage Collection

The cyclic garbage collector can affect latency and memory behavior.

You can inspect GC state:

```python
import gc

print(gc.get_count())
print(gc.get_stats())
```

You can attach callbacks:

```python
import gc
import time

def on_gc(phase, info):
    print(time.time(), phase, info)

gc.callbacks.append(on_gc)
```

This helps identify whether GC collections occur during latency-sensitive paths.

Most objects are reclaimed by reference counting. The cyclic GC matters mainly for container cycles and long-lived object graphs.
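A minimal sketch that uses these callbacks to time each collection pause (the `info` dictionary includes the generation number):

```python
import gc
import time

_start = None

def gc_timer(phase, info):
    global _start
    if phase == "start":
        _start = time.perf_counter()
    elif phase == "stop" and _start is not None:
        pause_ms = (time.perf_counter() - _start) * 1000
        print(f"gen {info['generation']} collection: {pause_ms:.3f} ms")

gc.callbacks.append(gc_timer)
```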

## 80.13 Profiling Imports

Import time matters for CLIs, server cold starts, tests, and serverless functions.

CPython can report import timing:

```bash
python -X importtime -c "import your_package"
```

This prints nested import timing.

Common problems:

```text
large imports at module import time
optional dependencies imported eagerly
expensive initialization during import
slow filesystem access
heavy type-checking imports at runtime
```

Improving import time often means delaying optional work until it is needed.
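A common pattern is to move an optional import inside the function that needs it (here `matplotlib` stands in for any heavy, rarely used dependency):

```python
def render_chart(data):
    # Deferred: the heavy import is paid only when charting is requested,
    # not on every import of this module.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(data)
    return fig
```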

## 80.14 Profiling Startup

Startup includes:

```text
interpreter initialization
site module loading
environment processing
imports
module-level code
```

Useful commands:

```bash
python -S script.py
python -X importtime script.py
```

`-S` skips automatic `site` import.

This helps distinguish interpreter startup cost from application import cost.

## 80.15 Profiling Native Code

Many CPython costs live below Python function level.

Examples:

```text
dictionary lookup
attribute lookup
reference counting
bytecode dispatch
memory allocation
Unicode operations
regular expression engine
JSON parsing
C extension work
```

Python-level profilers may show time inside one Python function but not expose internal C costs.

Native profilers such as Linux `perf` can sample C stacks:

```bash
perf record -g python script.py
perf report
```

This can reveal time in CPython functions such as dictionary lookup, evaluation loop, Unicode operations, or allocator paths.

## 80.16 Profiling the Interpreter Itself

When profiling CPython internals, build choices matter.

A debug build changes performance. The debug configuration

```bash
./configure --with-pydebug
```

is useful for correctness debugging, but it is slower and not representative of release performance.

For performance profiling, you often want an optimized build with debug symbols:

```bash
./configure CFLAGS="-O3 -g"
make
```

Then native profilers can resolve symbols while preserving realistic speed.

## 80.17 Debug Builds vs Release Builds

Debug builds add checks and change layout or behavior.

They are useful for:

```text
assertions
reference leak checks
memory debugging
development diagnostics
```

But they distort performance.

Use debug builds to find correctness bugs.

Use release-like builds to measure speed.

## 80.18 Microbenchmarking

Microbenchmarks isolate a small operation.

Example:

```python
import timeit

setup = """
class C:
    def __init__(self):
        self.x = 1
obj = C()
"""

print(timeit.timeit("obj.x", setup=setup, number=10_000_000))
```

Microbenchmarks are useful for understanding mechanisms.

They are risky for product decisions because they may not represent real workload behavior.

Use them to answer narrow questions, such as:

```text
is slot access faster here?
does this call shape allocate?
does this lookup specialize?
```

## 80.19 `timeit`

`timeit` reduces common timing mistakes.

It runs the statement many times and disables the garbage collector during timing.

Example:

```bash
python -m timeit 'x = 1 + 2'
```

For setup:

```bash
python -m timeit -s 'xs=list(range(1000))' 'sum(xs)'
```

`timeit` is good for quick local experiments.

For serious benchmarking, use `pyperf`.

## 80.20 `pyperf`

`pyperf` is designed for stable Python benchmarking.

It handles:

```text
multiple processes
warmups
statistics
system noise reduction
JSON result storage
comparison tools
```

Example:

```bash
python -m pyperf timeit \
  -s 'xs=list(range(1000))' \
  'sum(xs)'
```

Compare results:

```bash
python -m pyperf compare_to before.json after.json
```

For CPython performance work, `pyperf` is the standard benchmark harness.

## 80.21 Warmup Effects

Modern CPython has adaptive specialization.

This means first execution may be slower than later execution.

A benchmark should account for warmup.

Bad benchmark:

```python
import time

start = time.perf_counter()
f()  # a single cold call measures warmup, not steady-state speed
end = time.perf_counter()
```

Better approach:

```python
import time

# Warm up first so adaptive specialization has settled.
for _ in range(1000):
    f()

start = time.perf_counter()
for _ in range(10_000):
    f()
end = time.perf_counter()
```

Tools such as `pyperf` handle warmup more carefully.

## 80.22 Avoiding Benchmark Noise

Performance measurements are noisy.

Sources of noise:

```text
CPU frequency scaling
thermal throttling
background processes
memory pressure
disk cache state
network variability
debug builds
virtual machines
container limits
different Python versions
```

Good practice:

```text
run multiple times
compare distributions
pin Python version
record hardware
avoid mixed workloads
prefer isolated machines
use pyperf for serious runs
```

Small differences require careful evidence.
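`pyperf` can also reduce several of these noise sources directly (on Linux this typically requires root):

```bash
# Tune CPU frequency scaling and other system settings for benchmarking
python -m pyperf system tune
```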

## 80.23 Profiling Algorithmic Complexity

Profiling does not replace algorithm analysis.

A slow function may be slow because it has the wrong complexity.

Example:

```python
for x in xs:
    if x in ys:
        ...
```

If `ys` is a list, membership is linear.

Changing `ys` to a set can change the complexity:

```python
ys_set = set(ys)

for x in xs:
    if x in ys_set:
        ...
```

This kind of improvement often dominates interpreter-level tuning.

## 80.24 Profiling Dictionaries

Dictionary-heavy workloads show up in several forms:

```text
high time in object construction
high time in JSON processing
high time in attribute-heavy loops
high memory use from many mappings
native profiler samples in dict lookup functions
```

Python-level profiling may only show the containing function.

Use focused experiments to test alternatives:

```text
slots instead of instance dicts
tuples instead of small dict records
arrays or columns instead of row dicts
local variables for repeated lookup
```
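As a sketch of the first alternative, a focused experiment might compare an ordinary class against a `__slots__` version, since slotted instances have no per-instance dict (class and module names here are illustrative):

```python
class RowDict:
    def __init__(self, id, name):
        self.id = id
        self.name = name

class RowSlots:
    __slots__ = ("id", "name")

    def __init__(self, id, name):
        self.id = id
        self.name = name

# Compare attribute access with pyperf, e.g.:
#   python -m pyperf timeit \
#     -s 'from rows import RowSlots; r = RowSlots(1, "a")' 'r.name'
```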

## 80.25 Profiling Exceptions

Exceptions are expensive when raised frequently.

This pattern can be slow:

```python
try:
    value = d[key]
except KeyError:
    value = default
```

If missing keys are common, consider:

```python
value = d.get(key, default)
```

But if misses are rare, exception-based lookup may be fine.

Profile the actual distribution.

Exceptions also affect tracebacks and frame handling, so frequent exception construction can be costly.

## 80.26 Profiling Regular Expressions

Regular expressions execute mostly in C, so Python-level profilers may show time inside `re` calls.

Example:

```python
pattern.search(text)
```

If regex dominates, inspect:

```text
pattern complexity
catastrophic backtracking
precompilation
input size
number of calls
```

Useful pattern:

```python
import re

pattern = re.compile(r"...")

for text in texts:
    if pattern.search(text):
        ...
```

Precompilation avoids repeated parse and compile work.
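Catastrophic backtracking deserves a concrete illustration: nested quantifiers on a non-matching input force exponential work. A deliberately pathological sketch (keep the input small, since runtime grows exponentially with it):

```python
import re
import time

# Nested quantifiers: (a+)+ has exponentially many ways to split a run
# of "a"s, and all of them are tried when the trailing "b" prevents a match.
bad = re.compile(r"(a+)+$")
text = "a" * 20 + "b"

start = time.perf_counter()
bad.match(text)
print(f"{time.perf_counter() - start:.3f}s for a 21-character input")
```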

## 80.27 Profiling I/O

I/O-bound programs need different profiling.

Symptoms:

```text
high wall time
low CPU time
many waits
database or network functions dominate
```

Use tools that show waiting, not only CPU.

For async programs, profile:

```text
event loop latency
await points
network calls
connection pooling
DNS
serialization
backpressure
```

Optimizing Python bytecode will have little effect if most time is spent waiting on external systems.

## 80.28 Profiling Async Code

Async code spreads work across tasks.

Traditional call graphs may be harder to read because logical flow crosses await points.

Useful things to measure:

```text
task count
slow callbacks
event loop lag
await durations
queue sizes
connection pool waits
```

For local inspection:

```python
import asyncio

asyncio.run(main(), debug=True)
```

Async profiling often requires application-level instrumentation in addition to profilers.
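Debug mode also logs callbacks that block the event loop for too long. A minimal sketch that deliberately triggers the warning, using the loop's documented `slow_callback_duration` threshold:

```python
import asyncio
import time

async def main():
    loop = asyncio.get_running_loop()
    # In debug mode, anything blocking the loop longer than this is logged.
    loop.slow_callback_duration = 0.05  # default is 0.1 seconds

    time.sleep(0.2)  # blocking call inside a coroutine: triggers the warning

asyncio.run(main(), debug=True)
```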

## 80.29 Profiling Threads

Threaded Python programs may be limited by:

```text
I/O waits
lock contention
GIL serialization
native extension behavior
queue bottlenecks
```

CPU-bound Python threads usually do not run Python bytecode in parallel under the traditional GIL.

Profiling threaded code requires looking at:

```text
per-thread stacks
wall time vs CPU time
lock wait time
native extension GIL release behavior
```

Sampling profilers are often better than deterministic profilers here.
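From inside the process, `sys._current_frames()` gives a snapshot of every thread's current Python stack (an internal but long-standing API; a minimal sketch):

```python
import sys
import threading
import traceback

def dump_all_thread_stacks():
    names = {t.ident: t.name for t in threading.enumerate()}
    for thread_id, frame in sys._current_frames().items():
        print(f"--- {names.get(thread_id, thread_id)} ---")
        traceback.print_stack(frame)

dump_all_thread_stacks()
```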

## 80.30 Profiling Multiprocessing

Multiprocessing uses multiple processes.

Profiling must account for:

```text
worker startup cost
serialization cost
IPC cost
data copying
process scheduling
load imbalance
```

A workload may be slow because it spends more time moving data than computing.

Measure workers and parent process separately when needed.

## 80.31 Flame Graphs

A flame graph visualizes stack samples.

Wide frames consume more sampled time.

This is useful for seeing:

```text
hot call paths
unexpected wrappers
native extension cost
interpreter overhead
recursive behavior
```

A flame graph can reveal that the program spends time in places a flat profile hides.

Flat profiles answer “which functions.”

Flame graphs answer “which stacks.”
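`py-spy` can produce a flame graph directly from a new or running process (assuming it is installed; file names and the PID are illustrative):

```bash
# Sample a script and write an SVG flame graph
py-spy record -o profile.svg -- python script.py

# Or attach to an already-running process
py-spy record -o profile.svg --pid 1234
```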

## 80.32 Profiling Memory Leaks

In CPython, many apparent leaks are retained references.

A true leak may be:

```text
unreachable C memory not freed
reference cycle with finalizers
global cache growth
unbounded list or dict
C extension reference leak
```

Start with `tracemalloc` snapshots.

Then inspect object ownership:

```python
import gc

objs = gc.get_objects()
```
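A common follow-up is to count live objects by type, which often points at the growing structure (a sketch):

```python
import gc
from collections import Counter

counts = Counter(type(obj).__name__ for obj in gc.get_objects())
for name, n in counts.most_common(15):
    print(f"{n:>10}  {name}")
```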

For C extension work, use debug builds and reference leak test tools.

## 80.33 Reference Leak Checks

CPython development uses reference leak checks to detect incorrect reference ownership.

A reference leak means code increments a reference and fails to release it, or stores a reference unexpectedly.

In C extensions and CPython core code, this is serious.

Symptoms:

```text
memory grows across repeated tests
objects never deallocate
reference counts increase unexpectedly
```

Reference leak hunting often requires debug builds and repeated test runs.
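CPython's own test runner automates this with the `-R` flag, which runs a test repeatedly and reports reference count growth (this requires a `--with-pydebug` build; the test name is illustrative):

```bash
# 3 warmup runs, then 3 measured runs; any growth is reported as a leak
./python -m test -R 3:3 test_dict
```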

## 80.34 Profiling C Extensions

C extensions can dominate runtime while appearing as one Python call.

Example:

```python
result = extension.compute(data)
```

Python-level profiler may show `compute` as expensive, but not the internal reason.

Use native profiling, extension-specific instrumentation, or debug builds.

Important questions:

```text
does the extension release the GIL?
does it allocate many Python objects?
does it copy data unnecessarily?
does it use vectorcall or fastcall for hot callables?
does it convert data between formats repeatedly?
```

## 80.35 Profiling Specialized Bytecode

Modern CPython allows inspection of adaptive bytecode.

Example:

```python
import dis

class SomeObject:
    def __init__(self):
        self.x = 1

def f(obj):
    return obj.x + 1

# Run the hot site enough times for the adaptive interpreter to specialize.
for _ in range(10_000):
    f(SomeObject())

dis.dis(f, adaptive=True, show_caches=True)
```

Look for whether operations specialized.

Questions:

```text
did LOAD_ATTR specialize?
did BINARY_OP specialize?
are caches present?
is the site unstable?
```

This connects bytecode-level behavior to performance observations.

## 80.36 Profiling With Counters

Sometimes you need explicit counters.

Example:

```python
hits = 0
misses = 0

for key in keys:
    if key in cache:
        hits += 1
    else:
        misses += 1
```

Application-level counters can explain profiles better than tool output alone.

Measure domain facts:

```text
cache hit rates
batch sizes
queue depths
number of objects
number of queries
number of retries
input sizes
```

Performance is often determined by workload shape.

## 80.37 Building a Profiling Workflow

A good workflow is:

```text
define the workload
measure baseline
identify dominant cost
form one hypothesis
change one thing
measure again
compare results
keep or revert
```

Avoid changing several variables at once.

For CPython-level work, also record:

```text
Python version
build flags
platform
CPU model
compiler
environment variables
benchmark command
```

Without this metadata, results are hard to reproduce.

## 80.38 Common Profiling Mistakes

| Mistake | Problem |
|---|---|
| Measuring one run | Too noisy |
| Ignoring warmup | Misses adaptive specialization |
| Using debug builds for speed claims | Distorts performance |
| Optimizing non-dominant code | No visible improvement |
| Confusing wall time and CPU time | Wrong diagnosis |
| Trusting microbenchmarks alone | May not match real workload |
| Ignoring allocation | Memory churn can dominate |
| Ignoring native code | Python profiler may hide C cost |

Profiling is measurement discipline, not just tool usage.

## 80.39 Reading CPython Profiling Paths

Relevant source areas:

| File | Purpose |
|---|---|
| `Modules/_lsprof.c` | `cProfile` implementation |
| `Lib/profile.py` | Pure Python profiler |
| `Lib/pstats.py` | Profile statistics formatting |
| `Lib/tracemalloc.py` | Python interface to allocation tracing |
| `Python/ceval.c` | Trace and profile hook integration |
| `Python/sysmodule.c` | `sys.setprofile`, `sys.settrace` interfaces |
| `Modules/gcmodule.c` | GC statistics and callbacks |

Profiling hooks are wired into interpreter execution, especially calls, returns, exceptions, and line events.

## 80.40 Mental Model

A useful model:

```text
Profiling maps symptoms to mechanisms.
```

Slow code is the symptom.

The mechanism may be:

```text
too many calls
too many objects
poor data layout
slow lookup
I/O waiting
bad complexity
native extension cost
GC pressure
```

A good profile narrows the search space.

A good benchmark verifies the fix.

## 80.41 Chapter Summary

Profiling CPython requires measuring at the right layer.

Use `cProfile` for Python call costs, sampling profilers for native and low-overhead views, `tracemalloc` for allocation behavior, import timing for startup, and `pyperf` for reliable benchmark comparisons.

Performance work should proceed from evidence:

```text
measure baseline
identify dominant cost
explain mechanism
make targeted change
measure again
```

CPython performance depends on interpreter dispatch, object layout, dictionaries, attribute lookup, function calls, allocation, and memory locality. Profiling is how those mechanisms become visible.
