Performance Case Studies

Performance ideas become clearer when you see them inside real code.

This section walks through several small case studies. Each one starts with simple code, finds the likely cost, and improves the design.

Case Study 1: Reusing a Temporary Buffer

A common slow pattern is allocating temporary memory inside a loop.

fn processAll(allocator: std.mem.Allocator, items: []const Item) !void {
    for (items) |item| {
        const scratch = try allocator.alloc(u8, 4096);
        defer allocator.free(scratch);

        try processOne(item, scratch);
    }
}

This allocates and frees a fresh 4096-byte buffer once per item.

A better version allocates once:

fn processAll(allocator: std.mem.Allocator, items: []const Item) !void {
    const scratch = try allocator.alloc(u8, 4096);
    defer allocator.free(scratch);

    for (items) |item| {
        try processOne(item, scratch);
    }
}

The improvement is simple: move allocation out of the hot loop.

The important question is lifetime. If processOne only needs the scratch memory during each call, one reused buffer is enough. If it kept a pointer into the buffer past the call, reuse would corrupt data in later iterations.
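
For illustration, here is a hypothetical processOne that fits this contract: it uses scratch only during the call and keeps no pointer into it afterwards. (Item's layout and the body are assumptions, not code from above.)

fn processOne(item: Item, scratch: []u8) !void {
    // Work in the caller's scratch space; nothing here survives the call,
    // so the caller is free to reuse the same buffer for the next item.
    const n = @min(item.data.len, scratch.len);
    @memcpy(scratch[0..n], item.data[0..n]);
    // ... transform scratch[0..n] and emit the result ...
}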

Case Study 2: Avoiding Copies in a Parser

A parser often reads a large input buffer and extracts tokens.

Bad design:

const Token = struct {
    text: []u8,
};

fn makeToken(allocator: std.mem.Allocator, input: []const u8) !Token {
    return .{
        .text = try allocator.dupe(u8, input),
    };
}

This copies every token.

A better design stores a slice into the original input:

const Token = struct {
    text: []const u8,
};

fn makeToken(input: []const u8, start: usize, end: usize) Token {
    return .{
        .text = input[start..end],
    };
}

No allocation. No copy.

The cost is a lifetime rule: the original input must remain alive while the tokens are used.

That is usually acceptable for parsers. Read the file once, keep the buffer alive, and let tokens point into it.
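
A minimal sketch of that ownership pattern (the names here are illustrative; the key point is that input outlives every token that points into it):

fn parseFile(allocator: std.mem.Allocator, path: []const u8) !void {
    const input = try std.fs.cwd().readFileAlloc(allocator, path, 1 << 20);
    defer allocator.free(input); // freed only after all tokens are done

    const token = makeToken(input, 0, @min(4, input.len));
    std.debug.print("first token: {s}\n", .{token.text});
}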

Case Study 3: Reserving ArrayList Capacity

Dynamic arrays grow when they need more room.

fn collect(allocator: std.mem.Allocator, values: []const u32) !std.ArrayList(u32) {
    var list = std.ArrayList(u32).init(allocator);

    for (values) |v| {
        try list.append(v);
    }

    return list;
}

This works, but append may reallocate and copy the existing elements several times as the list grows. It also leaks the list if an append fails; the next version fixes that with errdefer.

If you know the final size, reserve capacity:

fn collect(allocator: std.mem.Allocator, values: []const u32) !std.ArrayList(u32) {
    var list = std.ArrayList(u32).init(allocator);
    errdefer list.deinit();

    try list.ensureTotalCapacity(values.len);

    for (values) |v| {
        try list.append(v);
    }

    return list;
}

Now the list has enough memory before the loop begins.

This reduces allocation count and avoids repeated copying during growth.
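
Because the capacity is guaranteed, the loop can also use appendAssumeCapacity, which skips the growth check entirely. This is safe here only because ensureTotalCapacity ran first:

    try list.ensureTotalCapacity(values.len);

    for (values) |v| {
        // Cannot fail: capacity was reserved above.
        list.appendAssumeCapacity(v);
    }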

Case Study 4: Hot and Cold Data

Suppose you have a user record:

const User = struct {
    id: u64,
    age: u8,
    name: [128]u8,
    email: [256]u8,
};

Now you often run this:

fn countAdults(users: []const User) usize {
    var count: usize = 0;

    for (users) |user| {
        if (user.age >= 18) {
            count += 1;
        }
    }

    return count;
}

The loop only needs age, but each User is large. The CPU may pull unrelated name and email data into cache.

A better layout separates frequently used fields:

const UserHot = struct {
    id: u64,
    age: u8,
};

const UserCold = struct {
    name: [128]u8,
    email: [256]u8,
};

Then keep parallel arrays or indexed records.

const Users = struct {
    hot: []UserHot,
    cold: []UserCold,
};

Now the adult-counting loop touches only compact hot data.

This can improve cache locality in large datasets.
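
Here is the counting loop rewritten against the hot array, using the Users struct above. Each iteration now reads a 16-byte UserHot record instead of a roughly 400-byte User:

fn countAdults(users: Users) usize {
    var count: usize = 0;

    // Only the compact hot records stream through the cache.
    for (users.hot) |user| {
        if (user.age >= 18) {
            count += 1;
        }
    }

    return count;
}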

Case Study 5: Removing a Branch from a Hot Loop

Suppose you update active entities:

fn updateAll(entities: []Entity) void {
    for (entities) |*entity| {
        if (entity.active) {
            update(entity);
        }
    }
}

If active and inactive entities are mixed randomly, the branch may be unpredictable.

A better design stores active entities separately:

fn updateActive(active_entities: []Entity) void {
    for (active_entities) |*entity| {
        update(entity);
    }
}

Now the branch disappears.

This also improves locality because the loop only touches entities that need work.
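
Keeping that partition intact is cheap. One common approach (a sketch, assuming the active entities live in an ArrayList) is to swap-remove an entity the moment it deactivates:

fn deactivate(active: *std.ArrayList(Entity), index: usize) void {
    // Moves the last element into `index`; order is not preserved,
    // but the list stays dense and branch-free to iterate.
    _ = active.swapRemove(index);
}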

The strongest branch optimization is often data organization.

Case Study 6: Batch Output

Printing inside a loop is often slow.

for (items) |item| {
    try stdout.print("item: {}\n", .{item.id});
}

Each print can reach the operating system; if stdout is unbuffered, every call may become a separate write system call.

A better version buffers output:

var buffer = std.ArrayList(u8).init(allocator);
defer buffer.deinit();

try buffer.ensureTotalCapacity(items.len * 16);

for (items) |item| {
    try buffer.writer().print("item: {}\n", .{item.id});
}

try stdout.writeAll(buffer.items);

The program builds output in memory and writes it in one larger operation.

This reduces I/O overhead.
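
The standard library offers the same batching without a manual buffer: std.io.bufferedWriter wraps a writer and flushes in large chunks. (A sketch against the pre-0.14 std.io API used elsewhere in this section; newer Zig versions restructure this interface.)

var buffered = std.io.bufferedWriter(stdout);
const writer = buffered.writer();

for (items) |item| {
    try writer.print("item: {}\n", .{item.id});
}

try buffered.flush(); // write out whatever is still buffered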

Case Study 7: Stack Buffer Instead of Heap Allocation

Small fixed-size temporary memory often belongs on the stack.

Heap version:

fn formatId(allocator: std.mem.Allocator, id: u64) ![]u8 {
    return try std.fmt.allocPrint(allocator, "id={}", .{id});
}

Stack-buffer version:

fn formatId(buffer: []u8, id: u64) ![]u8 {
    return try std.fmt.bufPrint(buffer, "id={}", .{id});
}

Call it like this:

var buffer: [64]u8 = undefined;
const text = try formatId(buffer[0..], 1234);
std.debug.print("{s}\n", .{text}); // text points into buffer

No heap allocation is needed.

The caller owns the storage, and the function clearly says how memory is handled.

Case Study 8: Arena for One Request

Suppose a server handles requests. Each request needs many temporary allocations.

Instead of freeing every object separately, use an arena per request.

fn handleRequest(parent_allocator: std.mem.Allocator, request: Request) !void {
    var arena = std.heap.ArenaAllocator.init(parent_allocator);
    defer arena.deinit();

    const allocator = arena.allocator();

    const parsed = try parseRequest(allocator, request.body);
    const result = try buildResponse(allocator, parsed);

    try sendResponse(result);
}

All temporary memory is freed together when the request ends.

This design is useful when many objects share the same lifetime.

The tradeoff is that memory is not released individually before the request finishes.
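
When one thread serves many requests, the arena itself can be recycled: reset frees all the per-request objects but can retain the pages already obtained. (A sketch; nextRequest and handleOne are hypothetical names.)

var arena = std.heap.ArenaAllocator.init(parent_allocator);
defer arena.deinit();

while (nextRequest()) |request| {
    // Drop all per-request objects but keep the underlying pages,
    // so the next request allocates without hitting the parent allocator.
    defer _ = arena.reset(.retain_capacity);

    try handleOne(arena.allocator(), request);
}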

Case Study 9: Compile-Time Configuration

Suppose logging is controlled by a runtime boolean:

fn process(debug: bool, value: i32) void {
    if (debug) {
        std.debug.print("value={}\n", .{value});
    }

    use(value);
}

If debug never changes during a build, make it compile-time:

fn process(comptime debug: bool, value: i32) void {
    if (debug) {
        std.debug.print("value={}\n", .{value});
    }

    use(value);
}

When called with false, the logging branch can disappear from generated code.

process(false, 42);
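
A common variant derives the flag from the build mode instead of passing it at every call site (a sketch; builtin.mode reflects the optimization mode the program was compiled with):

const builtin = @import("builtin");

// Known at compile time: release builds drop the branch entirely.
const debug_enabled = builtin.mode == .Debug;

fn process(value: i32) void {
    if (debug_enabled) {
        std.debug.print("value={}\n", .{value});
    }

    use(value);
}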

Compile-time knowledge can remove runtime cost.

Case Study 10: Choosing the Right Representation

Suppose you represent a graph with heap-allocated nodes and pointers:

const Node = struct {
    value: u32,
    edges: []*Node,
};

This is flexible, but it may scatter memory across the heap.

A more compact representation uses arrays and indices:

const Edge = struct {
    to: usize,
};

const Node = struct {
    value: u32,
    first_edge: usize,
    edge_count: usize,
};

Edges live in one dense array.

const Graph = struct {
    nodes: []Node,
    edges: []Edge,
};

This representation can be faster for traversal because memory is contiguous.
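
Traversal then becomes a tight loop over a contiguous range (a sketch using the structures above):

fn visitNeighbors(graph: Graph, node_index: usize) void {
    const node = graph.nodes[node_index];
    // All outgoing edges of this node sit side by side in one dense array.
    const edges = graph.edges[node.first_edge .. node.first_edge + node.edge_count];

    for (edges) |edge| {
        std.debug.print("neighbor value: {}\n", .{graph.nodes[edge.to].value});
    }
}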

It is also easier to serialize.

The tradeoff is that mutation needs more care: inserting or removing edges may mean rebuilding or compacting the dense arrays.

How to Think Through a Case

Most performance improvements follow the same pattern.

First, identify the hot path.

Then ask what cost appears there:

Cost                    Typical Fix
Repeated allocation     Reuse buffers, reserve capacity, use arenas
Large copies            Use slices, pointers, or caller-provided output
Cache misses            Use contiguous data and compact hot fields
Branch misprediction    Group data or remove branches from hot loops
I/O overhead            Batch reads and writes
Runtime decisions       Move stable choices to comptime

Do not apply every fix everywhere.

Apply the fix that matches the measured bottleneck.

Final Rule

Performance is not one trick.

It is a habit of seeing cost.

In Zig, the most important costs are usually visible:

Where does memory come from?

How many times is allocation called?

Is data copied or borrowed?

Is data contiguous?

Does this branch run millions of times?

Can this decision happen at compile time?

Once you can answer those questions, Zig gives you the tools to improve the program directly.