Efficient Text Processing

Efficient text processing means working with text without doing unnecessary allocation, copying, or decoding.

In Zig, text is usually a byte slice:

[]const u8

That means most text processing starts with a simple idea:

read bytes first
allocate only when needed
decode UTF-8 only when needed

This fits Zig well because Zig makes memory and ownership visible.

Prefer Slices Over Copies

A slice lets you refer to part of a string without copying it.

const text = "hello zig";

const first = text[0..5];
const second = text[6..9];

Now:

first  = hello
second = zig

No new string was created. first and second both point into text.

This is cheap because a slice is only:

pointer + length

Use this style when parsing, splitting, scanning, or inspecting text.

Example: Get the First Word

const std = @import("std");

fn firstWord(text: []const u8) []const u8 {
    for (text, 0..) |byte, index| {
        if (byte == ' ') {
            return text[0..index];
        }
    }

    return text;
}

pub fn main() void {
    const text = "hello zig";
    const word = firstWord(text);

    std.debug.print("{s}\n", .{word});
}

Output:

hello

The function returns a slice into the original text. It does not allocate.

That is efficient, but it also means the returned slice is valid only while the original text is valid.

Avoid Building Strings Too Early

Many programs waste work by creating new strings before they need to.

For example, suppose you want to check whether a file path ends with .zig.

You do not need to copy the path. You can inspect the existing bytes.

const std = @import("std");

fn isZigFile(path: []const u8) bool {
    return std.mem.endsWith(u8, path, ".zig");
}

pub fn main() void {
    std.debug.print("{}\n", .{isZigFile("main.zig")});
    std.debug.print("{}\n", .{isZigFile("main.c")});
}

Output:

true
false

This function does not allocate, does not copy, and does not decode Unicode. It only compares bytes.

Use std.mem for Byte Slice Work

The std.mem namespace contains many useful functions for byte slices.

Common examples:

Function                   Purpose
std.mem.eql                Compare two slices for equality
std.mem.startsWith         Check a prefix
std.mem.endsWith           Check a suffix
std.mem.indexOf            Find a slice inside another slice
std.mem.splitScalar        Split by one byte, keeping empty parts
std.mem.tokenizeScalar     Split by one byte, skipping empty parts
std.mem.trim               Remove bytes from both ends

Example:

const std = @import("std");

pub fn main() void {
    const text = "name=zig";

    if (std.mem.indexOf(u8, text, "=")) |index| {
        const key = text[0..index];
        const value = text[index + 1 ..];

        std.debug.print("key = {s}\n", .{key});
        std.debug.print("value = {s}\n", .{value});
    }
}

Output:

key = name
value = zig

Again, key and value are slices into text. No allocation happens.

Split Without Allocation

Splitting text does not need to create new strings.

const std = @import("std");

pub fn main() void {
    const path = "usr/local/bin";

    var it = std.mem.splitScalar(u8, path, '/');

    while (it.next()) |part| {
        std.debug.print("{s}\n", .{part});
    }
}

Output:

usr
local
bin

Each part is a slice into path.

This is efficient because the iterator only tracks positions.

splitScalar vs tokenizeScalar

splitScalar keeps empty fields.

const std = @import("std");

pub fn main() void {
    const text = "a,,b";

    var it = std.mem.splitScalar(u8, text, ',');

    while (it.next()) |part| {
        std.debug.print("[{s}]\n", .{part});
    }
}

Output:

[a]
[]
[b]

The empty part between the two commas is preserved.

tokenizeScalar skips empty fields.

const std = @import("std");

pub fn main() void {
    const text = "a,,b";

    var it = std.mem.tokenizeScalar(u8, text, ',');

    while (it.next()) |part| {
        std.debug.print("[{s}]\n", .{part});
    }
}

Output:

[a]
[b]

Use splitScalar when empty fields matter, such as CSV-like data. Use tokenizeScalar when repeated separators should be ignored, such as simple whitespace tokenization.
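As a sketch of the whitespace case, tokenizeScalar collapses runs of spaces, including leading and trailing ones:

```zig
const std = @import("std");

pub fn main() void {
    const text = "  one   two  ";

    // Repeated, leading, and trailing spaces all disappear.
    var it = std.mem.tokenizeScalar(u8, text, ' ');

    while (it.next()) |word| {
        std.debug.print("[{s}]\n", .{word});
    }
}
```

This prints [one] and [two], with no empty tokens for the extra spaces.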

Trim Without Allocation

Trimming can also return a slice.

const std = @import("std");

pub fn main() void {
    const line = "   hello zig   ";

    const trimmed = std.mem.trim(u8, line, " ");

    std.debug.print("[{s}]\n", .{trimmed});
}

Output:

[hello zig]

trimmed points into line. It does not allocate a new string.

You can trim several bytes:

const trimmed = std.mem.trim(u8, line, " \t\r\n");

This removes spaces, tabs, carriage returns, and newlines from both ends.

Scan Once When Possible

A common performance rule is: avoid reading the same text many times.

For example, this counts lines:

fn countLines(text: []const u8) usize {
    var count: usize = 0;

    for (text) |byte| {
        if (byte == '\n') {
            count += 1;
        }
    }

    return count;
}

This is efficient because it scans once from left to right.

If the input may not end with a newline, count the final unterminated line as well:

fn countLines(text: []const u8) usize {
    var count: usize = 0;

    for (text) |byte| {
        if (byte == '\n') {
            count += 1;
        }
    }

    // A non-empty final line without a trailing '\n' still counts.
    if (text.len != 0 and text[text.len - 1] != '\n') {
        count += 1;
    }

    return count;
}

This version counts "a\nb" as two lines, and still counts "a\nb\n" as two rather than three.

Parse Without Copying

Suppose you parse key-value lines:

name=zig
version=0.16
mode=debug

You can parse each line using slices.

const std = @import("std");

fn printKeyValue(line: []const u8) void {
    if (std.mem.indexOf(u8, line, "=")) |index| {
        const key = line[0..index];
        const value = line[index + 1 ..];

        std.debug.print("key={s}, value={s}\n", .{ key, value });
    }
}

pub fn main() void {
    const text =
        \\name=zig
        \\version=0.16
        \\mode=debug
    ;

    var lines = std.mem.splitScalar(u8, text, '\n');

    while (lines.next()) |line| {
        if (line.len == 0) continue;
        printKeyValue(line);
    }
}

Output:

key=name, value=zig
key=version, value=0.16
key=mode, value=debug

The parser does not allocate. Each key and value is a slice into the original text.

Allocate Only for Owned Results

Sometimes you need a result that outlives the input. Then you should allocate or copy.

For example, this function returns a slice into the input:

fn extension(path: []const u8) ?[]const u8 {
    if (std.mem.lastIndexOfScalar(u8, path, '.')) |index| {
        return path[index + 1 ..];
    }

    return null;
}

That is fine when the caller keeps path alive.

But if you need to store the extension after the original path is gone, make a copy:

fn copyExtension(
    allocator: std.mem.Allocator,
    path: []const u8,
) !?[]u8 {
    const ext = extension(path) orelse return null;

    const copy = try allocator.alloc(u8, ext.len);
    @memcpy(copy, ext);

    return copy;
}

The caller owns the returned copy and must free it.
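As a hedged sketch of the call site, assuming extension from above is in scope: the allocator can also duplicate a slice in one call with allocator.dupe, which replaces the manual alloc plus @memcpy.

```zig
const std = @import("std");

fn extension(path: []const u8) ?[]const u8 {
    if (std.mem.lastIndexOfScalar(u8, path, '.')) |index| {
        return path[index + 1 ..];
    }
    return null;
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const ext = extension("main.zig") orelse return;

    // dupe = alloc + copy in one call; the caller owns the result.
    const copy = try allocator.dupe(u8, ext);
    defer allocator.free(copy);

    std.debug.print("{s}\n", .{copy});
}
```

The defer free next to the dupe documents ownership right at the call site.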

Reuse Buffers

If you build temporary text repeatedly, reuse a buffer or ArrayList.

const std = @import("std");

pub fn main() !void {
    var buffer: [128]u8 = undefined;

    for (0..3) |i| {
        const message = try std.fmt.bufPrint(buffer[0..], "item {}", .{i});
        std.debug.print("{s}\n", .{message});
    }
}

Output:

item 0
item 1
item 2

The same stack buffer is reused for each message.

This avoids heap allocation.

For variable-size output, reuse an ArrayList:

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    const allocator = gpa.allocator();

    var text = std.ArrayList(u8).init(allocator);
    defer text.deinit();

    for (0..3) |i| {
        text.clearRetainingCapacity();

        try text.writer().print("item {}", .{i});
        std.debug.print("{s}\n", .{text.items});
    }
}

The list keeps its allocated capacity and reuses it.

Avoid Holding Slices Across Reallocation

This is a common bug:

const old = text.items;

try text.appendSlice("more data");

// old may now be invalid

An ArrayList may reallocate when it grows. If it reallocates, old slices into its storage may become invalid.

Use text.items again after operations that may grow the list.

try text.appendSlice("more data");

const current = text.items;

This rule matters for efficient code because efficient code often keeps references. Keep them only as long as the underlying memory is stable.
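One way to follow this rule, sketched here, is to remember a position instead of a slice across operations that may grow the list:

```zig
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var text = std.ArrayList(u8).init(allocator);
    defer text.deinit();

    try text.appendSlice("hello ");

    // Remember a position, not a pointer, across growth.
    const start = text.items.len;

    try text.appendSlice("more data");

    // Re-slice after the append; valid even if the list reallocated.
    const added = text.items[start..];

    std.debug.print("{s}\n", .{added});
}
```

An index stays valid across reallocation; a pointer into the old storage does not.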

Know When UTF-8 Decoding Is Needed

Many text tasks are byte tasks:

check file extension
split path by slash
parse ASCII protocol headers
find newline
compare command names
trim spaces

For these, byte operations are correct and fast.

Some tasks need Unicode-aware processing:

count user-visible characters
move cursor by character
uppercase multilingual text
validate user text
slice without breaking code points
display aligned columns with non-ASCII text

For these, use UTF-8 validation and decoding.

Do not decode Unicode when byte processing is enough. Do not use byte processing when Unicode meaning matters.

Example: Validate Before Unicode Processing

const std = @import("std");

fn printCodepoints(text: []const u8) !void {
    var view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();

    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X}\n", .{cp});
    }
}

pub fn main() !void {
    const text = "Aé你";
    try printCodepoints(text);
}

This checks that the text is valid UTF-8 before iterating over code points.
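As a related sketch, the standard library can validate and count code points in one pass with std.unicode.utf8CountCodepoints. Note that code points are still not the same as user-visible characters (grapheme clusters):

```zig
const std = @import("std");

pub fn main() !void {
    const text = "Aé你";

    // Validates the UTF-8 and counts code points in a single scan.
    const count = try std.unicode.utf8CountCodepoints(text);

    std.debug.print("{} bytes, {} code points\n", .{ text.len, count });
}
```

Here the text is 6 bytes (1 + 2 + 3) but only 3 code points.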

Use Writers for Streaming Output

If output may become large, you do not always need to build one big string first.

You can write directly to a writer.

For example, this function writes CSV-style output:

const std = @import("std");

fn writeCsvRow(writer: anytype, name: []const u8, score: u32) !void {
    try writer.print("\"{s}\",{}\n", .{ name, score });
}

You can write to an ArrayList:

try writeCsvRow(text.writer(), "Ada", 95);

Or to another writer, such as a file writer.

The function does not care where the output goes. This avoids unnecessary intermediate strings.
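For example, the same function can stream straight to standard output. This sketch uses the same std.io style as the rest of this page; the writer API has shifted across Zig releases:

```zig
const std = @import("std");

fn writeCsvRow(writer: anytype, name: []const u8, score: u32) !void {
    try writer.print("\"{s}\",{}\n", .{ name, score });
}

pub fn main() !void {
    // No intermediate string: bytes go directly to stdout.
    const stdout = std.io.getStdOut().writer();
    try writeCsvRow(stdout, "Ada", 95);
}
```

Swapping the destination requires no change to writeCsvRow itself.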

Complete Example

const std = @import("std");

fn parseLine(line: []const u8) ?struct {
    key: []const u8,
    value: []const u8,
} {
    const trimmed = std.mem.trim(u8, line, " \t\r\n");

    if (trimmed.len == 0) return null;

    const index = std.mem.indexOfScalar(u8, trimmed, '=') orelse return null;

    return .{
        .key = std.mem.trim(u8, trimmed[0..index], " \t"),
        .value = std.mem.trim(u8, trimmed[index + 1 ..], " \t"),
    };
}

pub fn main() void {
    const text =
        \\ name = zig
        \\ version = 0.16
        \\ mode = debug
    ;

    var lines = std.mem.splitScalar(u8, text, '\n');

    while (lines.next()) |line| {
        if (parseLine(line)) |entry| {
            std.debug.print("{s} -> {s}\n", .{ entry.key, entry.value });
        }
    }
}

Output:

name -> zig
version -> 0.16
mode -> debug

This example does not allocate. It uses slices into the original input.

Summary

Efficient text processing in Zig is mostly about restraint.

Use slices instead of copies. Use std.mem for byte-level text work. Allocate only when the result must be owned or must outlive the input. Reuse buffers when building temporary text. Use writers when output can be streamed.

For ASCII-like protocols and file formats, byte processing is often enough. For human language text, validate and decode UTF-8 when Unicode meaning matters.

Zig gives you the tools, but it does not hide the cost. That is the point: you can see when text is borrowed, copied, allocated, decoded, or written.