Benchmarking measures how an implementation behaves on real inputs and real hardware. Complexity analysis tells you how an algorithm should scale. Benchmarking tells you what the constants, memory effects, and implementation details actually do.
A benchmark should answer a specific question. It should not merely produce a number. Useful questions include whether one implementation is faster than another, whether the running time grows as expected, whether memory usage fits the constraint, or whether a change caused a regression.
## Problem
You have an implemented algorithm and need to measure its practical performance without confusing noise, setup cost, or unrelated work with the algorithm itself.
## Solution
Design the benchmark around a clear hypothesis.
Use this structure:
1. Define the operation being measured.
2. Generate or load representative inputs.
3. Separate setup cost from measured cost.
4. Run enough iterations to reduce noise.
5. Report input size, time, memory, and environment.
6. Compare results against the expected complexity.

For example, do not benchmark sorting by including random array generation inside the timed section unless generation is part of the operation you want to measure.
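One way to wire these steps together is a small harness like the following Python sketch. The names `benchmark` and `make_input` are illustrative, not part of any established API.

```python
import random
import statistics
import time

def benchmark(operation, make_input, n, repeats=5):
    """Median wall-clock time of `operation` on fresh size-n inputs (illustrative helper)."""
    times = []
    for _ in range(repeats):
        data = make_input(n)                       # setup: generated outside the timed region
        start = time.perf_counter()
        operation(data)                            # the operation under measurement
        times.append(time.perf_counter() - start)
    return statistics.median(times)                # noise-resistant summary over several runs

# Example: time Python's built-in in-place sort on a random list of length 10_000.
print(benchmark(list.sort, lambda n: [random.random() for _ in range(n)], 10_000))
```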
## Example: Sorting Benchmark
Poor benchmark:
```
start timer
A = random array of length n
sort(A)
stop timer
```

This measures both input generation and sorting.
Better benchmark:
```
A = random array of length n
start timer
B = copy(A)
sort(B)
stop timer
```

If copying should not be measured either, prepare all input copies before timing:
```
inputs = many copies of random arrays
start timer
for A in inputs:
    sort(A)
stop timer
```

The benchmark should match the question. If the question is “How fast is sorting?”, exclude generation. If the question is “How fast is the whole pipeline?”, include generation.
## Input Selection
Benchmark inputs should cover more than one size and more than one shape.
For sorting:
| Input Shape | Why It Matters |
|---|---|
| Random | Typical baseline |
| Already sorted | Best case for some algorithms, bad for poor quicksort pivots |
| Reverse sorted | Common adversarial order |
| Many duplicates | Tests partitioning and stability |
| Nearly sorted | Common real-world pattern |
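A sketch of how these shapes might be generated in Python; the helper name `make_shaped_input` and the shape labels are illustrative.

```python
import random

def make_shaped_input(shape, n):
    """Return a list of length n in one of the shapes above (hypothetical helper)."""
    base = list(range(n))
    if shape == "random":
        random.shuffle(base)
    elif shape == "sorted":
        pass                                   # already ascending
    elif shape == "reversed":
        base.reverse()
    elif shape == "duplicates":
        base = [random.randrange(10) for _ in range(n)]   # only ten distinct keys
    elif shape == "nearly_sorted":
        for _ in range(n // 100):              # perturb about 1% of adjacent pairs
            i = random.randrange(n - 1)
            base[i], base[i + 1] = base[i + 1], base[i]
    return base
```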
For graph algorithms:
| Input Shape | Why It Matters |
|---|---|
| Sparse graph | Tests adjacency-list behavior |
| Dense graph | Tests memory and edge scanning |
| Long chain | Tests recursion depth and path behavior |
| Star graph | Tests high-degree vertex behavior |
| Disconnected graph | Tests component handling |
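Similar generators can be sketched for the graph shapes, here as adjacency lists in plain dictionaries; the function names are illustrative.

```python
import random

def sparse_graph(n, avg_degree=4):
    """Random sparse undirected graph as an adjacency list (may repeat edges)."""
    adj = {v: [] for v in range(n)}
    for _ in range(n * avg_degree // 2):
        u, v = random.randrange(n), random.randrange(n)
        if u != v:
            adj[u].append(v)
            adj[v].append(u)
    return adj

def chain_graph(n):
    """Long path: stresses recursion depth and path behavior."""
    return {v: [w for w in (v - 1, v + 1) if 0 <= w < n] for v in range(n)}

def star_graph(n):
    """One hub joined to all other vertices: stresses high-degree handling."""
    adj = {v: [0] for v in range(1, n)}
    adj[0] = list(range(1, n))
    return adj
```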
## Scaling Tests
A benchmark should check how time grows as input size grows.
Example sizes:
```
n = 1_000
n = 2_000
n = 4_000
n = 8_000
n = 16_000
```

Doubling input size makes growth easier to inspect.
Expected patterns:
| Complexity | Doubling \(n\) Roughly Multiplies Time By |
|---|---|
| \(O(n)\) | 2 |
| \(O(n \log n)\) | Slightly more than 2 |
| \(O(n^2)\) | 4 |
| \(O(n^3)\) | 8 |
These are rough checks, not exact laws. Cache effects, memory allocation, and constant factors can distort small inputs.
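A scaling check of this kind can be sketched in Python as follows; the `measure` helper and the sort being timed are illustrative stand-ins for the operation under test.

```python
import random
import time

def measure(n):
    """Time one sort of a random list of length n (illustrative)."""
    data = [random.random() for _ in range(n)]
    start = time.perf_counter()
    data.sort()
    return time.perf_counter() - start

prev = None
for n in (1_000, 2_000, 4_000, 8_000, 16_000):
    t = measure(n)
    ratio = t / prev if prev else float("nan")
    print(f"n={n:>6}  time={t:.4f}s  ratio={ratio:.2f}")  # expect a bit over 2 for O(n log n)
    prev = t
```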
## Warmup and Noise
Modern systems introduce measurement noise through caching, scheduling, dynamic frequency scaling, garbage collection, and background processes.
A practical benchmark should:
- run several times
- discard obvious warmup effects
- report median or percentile values
- avoid measuring tiny operations once
- keep the machine environment stable

For very small operations, repeat the operation many times inside one timed region.
```
start timer
repeat 1_000_000 times:
    operation()
stop timer
```

Then divide by the number of repetitions if per-operation cost is needed.
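Python's standard `timeit` module implements this repeat-and-divide pattern; a minimal sketch, assuming the tiny operation is a single multiplication:

```python
import timeit

# Run the operation 1_000_000 times per timed region, time five regions,
# and keep the fastest (least disturbed) one.
runs = timeit.repeat("x * x", setup="x = 12345", repeat=5, number=1_000_000)
per_op = min(runs) / 1_000_000                 # divide by the repetition count
print(f"about {per_op * 1e9:.1f} ns per operation")
```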
## Memory Benchmarking
Time is not the only metric. Some algorithms fail because of memory.
Measure:
- peak memory
- number of allocations
- temporary buffer size
- recursion depth
- cache locality when relevant

For example, merge sort and heap sort both run in \(O(n \log n)\) time, but merge sort commonly uses \(O(n)\) auxiliary space while heap sort can sort in place. Depending on memory constraints, this difference may dominate.
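In Python, peak allocation during a call can be sketched with the standard `tracemalloc` module; the input size is illustrative.

```python
import random
import tracemalloc

data = [random.random() for _ in range(100_000)]

tracemalloc.start()
result = sorted(data)                          # the operation under measurement
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak allocation during sort: {peak / 1024:.0f} KiB")
```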
## Comparing Algorithms
When comparing two algorithms, keep everything else equal:
- same input data
- same machine
- same compiler or runtime settings
- same measurement method
- same output requirements

If one algorithm returns only a cost and another reconstructs a witness, the comparison is unfair unless the problem requires only the cost.
Also check that both implementations are correct on the benchmark inputs. A fast wrong answer is not an algorithmic improvement.
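A minimal comparison harness might look like the following sketch, which assumes two hypothetical functions `sort_a` and `sort_b` that each take a list and return a new sorted list.

```python
import random
import statistics
import time

def compare(sort_a, sort_b, n=10_000, repeats=5):
    """Time two sorts on identical inputs and verify both outputs (illustrative)."""
    base = [random.random() for _ in range(n)]
    expected = sorted(base)
    medians = {}
    for name, fn in (("sort_a", sort_a), ("sort_b", sort_b)):
        times = []
        for _ in range(repeats):
            data = list(base)                  # same input data for both algorithms
            start = time.perf_counter()
            out = fn(data)                     # assumes fn returns a new sorted list
            times.append(time.perf_counter() - start)
            assert out == expected, f"{name} is incorrect"   # fast wrong answers do not count
        medians[name] = statistics.median(times)
    return medians
```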
## Common Pitfalls
Do not benchmark only one input size. A constant-factor win at small \(n\) may disappear at large \(n\).
Do not benchmark only random inputs. Many algorithms have structured worst cases.
Do not include setup cost unless the setup is part of the measured operation.
Do not trust a single run. Report stable summaries.
Do not compare algorithms with different output contracts.
Do not ignore memory. Time and memory often trade against each other.
## Takeaway
Benchmarking connects asymptotic analysis to actual execution. Define the question, isolate the measured operation, use representative and adversarial inputs, test scaling behavior, and report enough context for the result to be interpreted.