# Benchmark Methodology

Benchmarking means measuring how fast code runs.

A benchmark answers a narrow question:

How long does this operation take under these conditions?

That last part matters. A benchmark is only useful when the conditions are clear. Different inputs, build modes, machines, allocators, and operating system states can produce different results.

Good benchmarking is more than running a timer. It is designing a measurement that tells the truth.

## Benchmark the Right Thing

Before writing a benchmark, define the question.

Bad question:

Is this code fast?

Better question:

How many bytes per second can this parser process on a 100 MB JSON file in `ReleaseFast` mode?

Better question:

How many requests per second can this server handle with 1,000 concurrent connections?

Better question:

How many nanoseconds does this small function take when called 10 million times?

A vague benchmark gives vague results.

A precise benchmark gives useful results.

## Use Release Builds

Never benchmark Debug mode unless you are measuring Debug mode itself.

Debug builds include runtime safety checks and perform little optimization, so they can be many times slower than release builds.

Use:

```bash
zig build-exe main.zig -O ReleaseFast
```

or, when you want safety checks with optimization:

```bash
zig build-exe main.zig -O ReleaseSafe
```
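If the project builds through `build.zig`, the equivalent command (assuming the build script exposes the standard optimize option) is:

```bash
zig build -Doptimize=ReleaseFast
```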

Record the mode with the result.

A result without the build mode is incomplete.

## Make the Work Large Enough

A benchmark must run long enough to measure.

This is too small:

```zig
const start = std.time.nanoTimestamp();
const x = add(1, 2);
const end = std.time.nanoTimestamp();

std.debug.print("{} ns\n", .{end - start});
```

The timer overhead may be larger than the work.

Instead, repeat the work many times:

```zig
const std = @import("std");

fn add(a: u64, b: u64) u64 {
    return a + b;
}

pub fn main() void {
    const iterations = 100_000_000;

    const start = std.time.nanoTimestamp();

    var sum: u64 = 0;
    for (0..iterations) |i| {
        sum += add(i, 1);
    }

    const end = std.time.nanoTimestamp();

    std.debug.print("sum: {}\n", .{sum});
    std.debug.print("elapsed: {} ns\n", .{end - start});
}
```

The benchmark runs enough work for the timing to be meaningful.

## Prevent Dead Code Elimination

Optimizing compilers remove useless work.

This benchmark may be invalid:

```zig
for (0..1000000) |i| {
    _ = i * i;
}
```

The result is unused. The compiler may remove the loop entirely.

Use the result in a visible way:

```zig
var sum: u64 = 0;

for (0..1000000) |i| {
    sum += i * i;
}

std.debug.print("{}\n", .{sum});
```

Now the compiler must preserve the computation.
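Printing works, but it adds output you may not want. Recent Zig standard libraries also provide `std.mem.doNotOptimizeAway`, an optimizer barrier that keeps a value alive without printing it; a minimal sketch:

```zig
const std = @import("std");

pub fn main() void {
    var sum: u64 = 0;
    for (0..1_000_000) |i| {
        sum += @as(u64, i) * i;
    }
    // Acts as a barrier: the optimizer must assume `sum` is observed,
    // so it cannot delete the loop that produced it.
    std.mem.doNotOptimizeAway(sum);
}
```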

## Separate Setup from Measurement

Do not include setup time unless setup is part of what you want to measure.

Bad:

```zig
const start = std.time.nanoTimestamp();

const input = try allocator.alloc(u8, 1024 * 1024);
defer allocator.free(input);

fillInput(input);
process(input);

const end = std.time.nanoTimestamp();
```

This measures allocation, input generation, and processing together.

Better:

```zig
const input = try allocator.alloc(u8, 1024 * 1024);
defer allocator.free(input);

fillInput(input);

const start = std.time.nanoTimestamp();
process(input);
const end = std.time.nanoTimestamp();
```

Now the timed region measures only `process`.

Sometimes you do want total end-to-end time. That is fine. Just name it correctly.

## Run Benchmarks Multiple Times

One run is not enough.

Performance varies because of:

- operating system scheduling
- CPU frequency scaling
- background processes
- disk cache
- memory layout
- thermal throttling

Run multiple trials.

Example output:

```text
run 1: 104 ms
run 2: 101 ms
run 3: 103 ms
run 4: 102 ms
run 5: 101 ms
```

This is more trustworthy than one number.
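A trials loop is only a few lines of Zig. In this sketch, `workload` is a placeholder for whatever you are actually measuring:

```zig
const std = @import("std");

// Placeholder for the code under test.
fn workload() u64 {
    var sum: u64 = 0;
    for (0..10_000_000) |i| {
        sum += i;
    }
    return sum;
}

pub fn main() void {
    const trials = 5;
    var total: u64 = 0;

    for (0..trials) |run| {
        const start = std.time.nanoTimestamp();
        total += workload();
        const end = std.time.nanoTimestamp();
        std.debug.print("run {}: {} ns\n", .{ run + 1, end - start });
    }

    // Using the results keeps the compiler from discarding the work.
    std.debug.print("total: {}\n", .{total});
}
```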

## Look at Distribution, Not Only Average

The average can hide important behavior.

Suppose you measure request latency:

```text
average: 10 ms
```

That sounds good.

But the distribution may be:

```text
p50: 5 ms
p95: 40 ms
p99: 200 ms
```

For servers, games, databases, and interactive tools, tail latency matters.

A program that is usually fast but occasionally very slow may still be unacceptable in practice.
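Percentiles are easy to compute once you keep every sample: sort the measurements and index into them. A sketch using hypothetical latency samples and a simple nearest-rank approximation:

```zig
const std = @import("std");

// Nearest-rank percentile over a sorted sample set.
fn percentile(sorted: []const u64, p: usize) u64 {
    const idx = (sorted.len - 1) * p / 100;
    return sorted[idx];
}

pub fn main() void {
    // Hypothetical latency samples, in milliseconds.
    var samples = [_]u64{ 5, 4, 6, 5, 40, 5, 7, 200, 5, 6 };
    std.mem.sort(u64, &samples, {}, std.sort.asc(u64));

    std.debug.print("p50: {} ms\n", .{percentile(&samples, 50)});
    std.debug.print("p95: {} ms\n", .{percentile(&samples, 95)});
    std.debug.print("p99: {} ms\n", .{percentile(&samples, 99)});
}
```

With only ten samples the high percentiles are crude; real latency analysis needs many more measurements.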

## Compare Against a Baseline

A benchmark needs comparison.

Example:

```text
old parser: 800 MB/s
new parser: 1.2 GB/s
```

Now the result has meaning.

Without a baseline, “1.2 GB/s” may be good or bad depending on the workload and hardware.

Good comparisons include:

- old implementation vs new implementation
- scalar version vs SIMD version
- heap allocation version vs buffer reuse version
- different data layouts
- different algorithms

## Keep Inputs Realistic

Microbenchmarks are useful, but they can lie.

Example:

```zig
const input = "hello";
```

A parser that is fast on `"hello"` may be slow on real files.

Use realistic inputs:

- small input
- medium input
- large input
- common case
- worst case
- malformed input when relevant

Benchmarking only the happy path gives incomplete information.

## Control the Environment

For serious benchmarks, control as much as possible.

Useful practices:

- close unnecessary programs
- use the same machine
- use the same compiler version
- use the same build mode
- use the same input files
- avoid measuring over network when testing CPU work
- pin CPU frequency if needed
- run enough iterations

You do not need extreme rigor for every small test, but you should know what can affect the result.

## Record Hardware and Software

A benchmark result should include context.

At minimum, record:

| Field | Example |
|---|---|
| CPU | Apple M3 Pro, Ryzen 7950X, etc. |
| RAM | 32 GB |
| OS | Linux, macOS, Windows |
| Zig version | 0.16.0 |
| Build mode | ReleaseFast |
| Input | 100 MB JSON file |
| Command | `zig build-exe main.zig -O ReleaseFast` |

Without this, another person cannot reproduce the result.

## Measure Throughput and Latency

Two common performance measurements are throughput and latency.

Throughput measures amount of work per time.

```text
MB/s
requests/s
items/s
frames/s
```

Latency measures time for one operation.

```text
milliseconds per request
nanoseconds per item
seconds per file
```

A batch processor often cares about throughput.

An interactive program often cares about latency.

A server usually cares about both.
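Both numbers can be derived from the same measurement. A sketch with hypothetical values standing in for real measurements:

```zig
const std = @import("std");

pub fn main() void {
    // Hypothetical measurement results, for illustration only.
    const bytes_processed: u64 = 100 * 1024 * 1024; // 100 MiB
    const operations: u64 = 1_000_000;
    const elapsed_ns: u64 = 250_000_000; // 250 ms

    const seconds = @as(f64, @floatFromInt(elapsed_ns)) / 1e9;

    // Throughput: work per unit time.
    const mb_per_s =
        @as(f64, @floatFromInt(bytes_processed)) / (1024.0 * 1024.0) / seconds;

    // Latency: time per unit of work.
    const ns_per_op =
        @as(f64, @floatFromInt(elapsed_ns)) / @as(f64, @floatFromInt(operations));

    std.debug.print("throughput: {d:.1} MB/s\n", .{mb_per_s});
    std.debug.print("latency: {d:.1} ns/op\n", .{ns_per_op});
}
```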

## Avoid Misleading Units

Use units that match the task.

For file processing:

```text
MB/s
```

For function calls:

```text
ns/op
```

For servers:

```text
requests/s
p95 latency
p99 latency
```

For memory:

```text
bytes allocated per operation
allocations per operation
peak memory
```

Good units make results easier to understand.

## Benchmark Memory Too

Time is not the only performance metric.

A faster version may use much more memory.

Example:

| Version | Time | Peak Memory |
|---|---:|---:|
| A | 100 ms | 10 MB |
| B | 70 ms | 500 MB |

Version B is faster, but may be unacceptable.

Track memory when it matters.

Important memory metrics:

- peak memory
- allocation count
- bytes allocated
- cache misses
- working set size

## Avoid Benchmarking the Wrong Layer

Suppose you want to measure parsing speed.

Bad benchmark:

```text
read file from disk + parse + print output
```

This measures disk and printing too.

Better:

```text
load file once
then measure parser only
```

But if your real product reads files from disk, also run an end-to-end benchmark.

Use both:

- component benchmark
- end-to-end benchmark

They answer different questions.

## Beware I/O Benchmarks

I/O benchmarks are difficult.

File benchmarks are affected by:

- OS page cache
- disk type
- filesystem
- file size
- write buffering
- compression
- background disk activity

Network benchmarks are affected by:

- latency
- packet loss
- kernel tuning
- TLS
- connection reuse
- remote server behavior

When possible, isolate CPU work from I/O work. Then separately measure end-to-end behavior.

## Benchmark Algorithmic Complexity

A benchmark should test scaling.

Do not only test one input size.

Example:

| Input Size | Time |
|---:|---:|
| 1,000 | 1 ms |
| 10,000 | 10 ms |
| 100,000 | 100 ms |
| 1,000,000 | 1,000 ms |

This suggests linear behavior.

But this result is different:

| Input Size | Time |
|---:|---:|
| 1,000 | 1 ms |
| 10,000 | 100 ms |
| 100,000 | 10,000 ms |
| 1,000,000 | 1,000,000 ms |

That suggests quadratic behavior.

Scaling matters more than one isolated number.
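A scaling benchmark simply repeats the measurement at several sizes. A sketch, with `work` standing in for the code under test:

```zig
const std = @import("std");

// Stand-in for the code under test; intentionally linear in `n`.
fn work(n: usize) u64 {
    var sum: u64 = 0;
    for (0..n) |i| {
        sum += i;
    }
    return sum;
}

pub fn main() void {
    const sizes = [_]usize{ 1_000, 10_000, 100_000, 1_000_000 };

    for (sizes) |n| {
        const start = std.time.nanoTimestamp();
        const result = work(n);
        const end = std.time.nanoTimestamp();
        // Printing `result` keeps the work from being optimized away.
        std.debug.print("n = {}: {} ns (result {})\n", .{ n, end - start, result });
    }
}
```

If time grows by roughly the same factor as the input, behavior is linear; if it grows by the square of that factor, it is quadratic.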

## Do Not Trust Tiny Differences

A 1% improvement may be noise.

Example:

```text
old: 100.2 ms
new: 99.8 ms
```

That is probably not meaningful unless you have careful repeated measurements.

A 30% improvement is easier to trust.

But even then, verify.

Good benchmarking is skeptical.

## Use Profiling with Benchmarking

Benchmarking tells you whether something improved.

Profiling tells you why.

Use both.

Example:

Benchmark result:

```text
new version is 25% faster
```

Profiler result:

```text
allocation time dropped from 40% to 5%
```

Now you know the reason.

## A Simple Benchmark Harness

Here is a small pattern you can adapt:

```zig
const std = @import("std");

fn work(input: []const u8) usize {
    var count: usize = 0;

    for (input) |ch| {
        if (ch == 'x') {
            count += 1;
        }
    }

    return count;
}

pub fn main() !void {
    var data: [1024 * 1024]u8 = undefined;

    for (&data, 0..) |*byte, i| {
        byte.* = if (i % 17 == 0) 'x' else 'a';
    }

    const iterations = 1000;

    var total: usize = 0;

    const start = std.time.nanoTimestamp();

    for (0..iterations) |_| {
        total += work(data[0..]);
    }

    const end = std.time.nanoTimestamp();

    const elapsed_ns = end - start;
    const bytes_processed = data.len * iterations;

    std.debug.print("total: {}\n", .{total});
    std.debug.print("elapsed: {} ns\n", .{elapsed_ns});
    std.debug.print("bytes: {}\n", .{bytes_processed});
}
```

This benchmark:

- prepares input before timing
- repeats work many times
- uses the result
- records elapsed time
- exposes enough data to calculate throughput

## Mental Model

A benchmark is an experiment.

A good experiment has:

- a clear question
- controlled inputs
- realistic conditions
- repeated trials
- meaningful units
- a baseline
- recorded environment
- skepticism about tiny differences

In Zig, performance is visible and controllable, but you still need disciplined measurement.

Fast code starts with correct measurement.

