
1.24 Benchmarking

Benchmarking measures how an implementation behaves on real inputs and real hardware. Complexity analysis tells you how an algorithm should scale. Benchmarking tells you what the constants, memory effects, and implementation details actually do.

A benchmark should answer a specific question. It should not merely produce a number. Useful questions include whether one implementation is faster than another, whether the running time grows as expected, whether memory usage fits the constraint, or whether a change caused a regression.

Problem

You have an implemented algorithm and need to measure its practical performance without confusing noise, setup cost, or unrelated work with the algorithm itself.

Solution

Design the benchmark around a clear hypothesis.

Use this structure:

1. Define the operation being measured.
2. Generate or load representative inputs.
3. Separate setup cost from measured cost.
4. Run enough iterations to reduce noise.
5. Report input size, time, memory, and environment.
6. Compare results against the expected complexity.

For example, do not benchmark sorting by including random array generation inside the timed section unless generation is part of the operation you want to measure.

Example: Sorting Benchmark

Poor benchmark:

start timer
A = random array of length n
sort(A)
stop timer

This measures both input generation and sorting.

Better benchmark:

A = random array of length n

start timer
B = copy(A)
sort(B)
stop timer

If copying should not be measured either, prepare all input copies before timing:

inputs = many copies of random arrays

start timer
for A in inputs:
  sort(A)
stop timer

The benchmark should match the question. If the question is “How fast is sorting?”, exclude generation. If the question is “How fast is the whole pipeline?”, include generation.
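As a concrete sketch, the third pseudocode pattern above might look like this in Python (the harness and its names are illustrative, assuming the built-in list sort as the operation under test):

```python
import random
import time

def benchmark_sort(n: int, runs: int = 5) -> float:
    """Time only the sort, excluding input generation and copying."""
    base = [random.random() for _ in range(n)]
    # Prepare all input copies before timing so copying is not measured.
    inputs = [base.copy() for _ in range(runs)]

    start = time.perf_counter()
    for arr in inputs:
        arr.sort()
    elapsed = time.perf_counter() - start
    return elapsed / runs  # mean time per sort

print(f"mean sort time: {benchmark_sort(10_000):.6f} s")
```

If the question were instead about the whole pipeline, the generation of `base` would simply move inside the timed region.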

Input Selection

Benchmark inputs should cover more than one size and more than one shape.

For sorting:

| Input Shape | Why It Matters |
| --- | --- |
| Random | Typical baseline |
| Already sorted | Best case for some algorithms, bad for poor quicksort pivots |
| Reverse sorted | Common adversarial order |
| Many duplicates | Tests partitioning and stability |
| Nearly sorted | Common real-world pattern |
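These shapes are cheap to generate. A sketch of one possible generator (the shape names and the "1% of positions" choice for nearly sorted input are illustrative assumptions):

```python
import random

def make_input(n: int, shape: str) -> list[int]:
    """Generate one of the sorting input shapes from the table above."""
    if shape == "random":
        return [random.randrange(n) for _ in range(n)]
    if shape == "sorted":
        return list(range(n))
    if shape == "reverse":
        return list(range(n, 0, -1))
    if shape == "duplicates":
        # Few distinct values: stresses partitioning and stability.
        return [random.randrange(10) for _ in range(n)]
    if shape == "nearly_sorted":
        arr = list(range(n))
        for _ in range(max(1, n // 100)):  # disturb roughly 1% of positions
            i, j = random.randrange(n), random.randrange(n)
            arr[i], arr[j] = arr[j], arr[i]
        return arr
    raise ValueError(f"unknown shape: {shape}")
```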

For graph algorithms:

| Input Shape | Why It Matters |
| --- | --- |
| Sparse graph | Tests adjacency-list behavior |
| Dense graph | Tests memory and edge scanning |
| Long chain | Tests recursion depth and path behavior |
| Star graph | Tests high-degree vertex behavior |
| Disconnected graph | Tests component handling |
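Two of these graph shapes, sketched as adjacency lists (function names are illustrative):

```python
def chain_graph(n: int) -> dict[int, list[int]]:
    """Path 0-1-2-...-(n-1): stresses recursion depth in DFS-style algorithms."""
    adj = {i: [] for i in range(n)}
    for i in range(n - 1):
        adj[i].append(i + 1)
        adj[i + 1].append(i)
    return adj

def star_graph(n: int) -> dict[int, list[int]]:
    """Vertex 0 adjacent to all others: one high-degree hub."""
    adj = {i: [] for i in range(n)}
    for i in range(1, n):
        adj[0].append(i)
        adj[i].append(0)
    return adj
```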

Scaling Tests

A benchmark should check how time grows as input size grows.

Example sizes:

n = 1_000
n = 2_000
n = 4_000
n = 8_000
n = 16_000

Doubling input size makes growth easier to inspect.

Expected patterns:

| Complexity | Doubling n Roughly Multiplies Time By |
| --- | --- |
| O(n) | 2 |
| O(n log n) | Slightly more than 2 |
| O(n^2) | 4 |
| O(n^3) | 8 |

These are rough checks, not exact laws. Cache effects, memory allocation, and constant factors can distort small inputs.
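A doubling check can be sketched directly. Here `bubble_sort` is a stand-in O(n^2) subject under test, and reverse-sorted input is used as a structured worst case; both choices are illustrative assumptions:

```python
import time

def bubble_sort(arr):
    """A deliberately O(n^2) sort, used as the subject under test."""
    a = list(arr)
    for i in range(len(a)):
        for j in range(len(a) - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

def scaling_ratios(sizes):
    """Time the sort at each size; return ratios between consecutive sizes."""
    times = []
    for n in sizes:
        data = list(range(n, 0, -1))  # reverse sorted input
        start = time.perf_counter()
        bubble_sort(data)
        times.append(time.perf_counter() - start)
    return [times[i + 1] / times[i] for i in range(len(times) - 1)]

# For an O(n^2) sort each ratio should land near 4; small n will be noisy.
print(scaling_ratios([1_000, 2_000, 4_000]))
```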

Warmup and Noise

Modern systems introduce measurement noise through caching, scheduling, dynamic frequency scaling, garbage collection, and background processes.

A practical benchmark should:

run several times
discard obvious warmup effects
report median or percentile values
avoid measuring tiny operations once
keep the machine environment stable

For very small operations, repeat the operation many times inside one timed region.

start timer
repeat 1_000_000 times:
  operation()
stop timer

Then divide by the number of repetitions if per-operation cost is needed.
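This repeat-and-divide pattern might be sketched as follows (Python's standard `timeit` module does essentially this; the helper name here is illustrative):

```python
import time

def time_per_op(operation, repetitions: int = 1_000_000) -> float:
    """Amortize timer overhead by repeating a tiny operation in one timed region."""
    start = time.perf_counter()
    for _ in range(repetitions):
        operation()
    elapsed = time.perf_counter() - start
    return elapsed / repetitions  # per-operation cost

cost = time_per_op(lambda: 2 ** 64)
print(f"{cost * 1e9:.1f} ns per operation")
```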

Memory Benchmarking

Time is not the only metric. Some algorithms fail because of memory.

Measure:

peak memory
number of allocations
temporary buffer size
recursion depth
cache locality when relevant

For example, merge sort and heap sort both run in O(n log n) time, but merge sort commonly uses O(n) auxiliary space while heap sort can sort in place. Depending on memory constraints, this difference may dominate.
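In Python, peak allocation can be observed with the standard `tracemalloc` module. A sketch, using a textbook merge sort as the subject under test:

```python
import tracemalloc

def merge_sort(arr):
    """Textbook merge sort: O(n log n) time, O(n) auxiliary space."""
    if len(arr) <= 1:
        return list(arr)
    mid = len(arr) // 2
    left, right = merge_sort(arr[:mid]), merge_sort(arr[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

data = list(range(100_000, 0, -1))
tracemalloc.start()
merge_sort(data)
_, peak = tracemalloc.get_traced_memory()  # peak bytes while tracing
tracemalloc.stop()
print(f"peak traced memory: {peak / 1_000_000:.1f} MB")
```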

Comparing Algorithms

When comparing two algorithms, keep everything else equal:

same input data
same machine
same compiler or runtime settings
same measurement method
same output requirements

If one algorithm returns only a cost and another reconstructs a witness, the comparison is unfair unless the problem requires only the cost.

Also check that both implementations are correct on the benchmark inputs. A fast wrong answer is not an algorithmic improvement.
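A minimal comparison harness that enforces these rules might look like this (names are illustrative; both functions are required to have the same output contract, returning the sorted list):

```python
import random
import time

def compare(sort_a, sort_b, n: int = 10_000) -> dict[str, float]:
    """Time two sorts on identical inputs, checking correctness first."""
    data = [random.random() for _ in range(n)]
    expected = sorted(data)

    results = {}
    for name, fn in [("a", sort_a), ("b", sort_b)]:
        copy = list(data)  # same input data for both
        start = time.perf_counter()
        out = fn(copy)
        elapsed = time.perf_counter() - start
        # A fast wrong answer is not an algorithmic improvement.
        assert out == expected, f"sort {name} produced an incorrect result"
        results[name] = elapsed
    return results

print(compare(sorted, lambda xs: sorted(xs)))
```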

Common Pitfalls

Do not benchmark only one input size. A constant-factor win at small n may disappear at large n.

Do not benchmark only random inputs. Many algorithms have structured worst cases.

Do not include setup cost unless the setup is part of the measured operation.

Do not trust a single run. Report stable summaries.

Do not compare algorithms with different output contracts.

Do not ignore memory. Time and memory often trade against each other.

Takeaway

Benchmarking connects asymptotic analysis to actual execution. Define the question, isolate the measured operation, use representative and adversarial inputs, test scaling behavior, and report enough context for the result to be interpreted.