# Efficient Text Processing

Efficient text processing means working with text without doing unnecessary allocation, copying, or decoding.

In Zig, text is usually a byte slice:

```zig
[]const u8
```

That means most text processing starts with a simple idea:

```text
read bytes first
allocate only when needed
decode UTF-8 only when needed
```

This fits Zig well because Zig makes memory and ownership visible.

#### Prefer Slices Over Copies

A slice lets you refer to part of a string without copying it.

```zig
const text = "hello zig";

const first = text[0..5];
const second = text[6..9];
```

Now:

```text
first  = hello
second = zig
```

No new string was created. `first` and `second` both point into `text`.

This is cheap because a slice is only:

```text
pointer + length
```

Use this style when parsing, splitting, scanning, or inspecting text.
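To make the "pointer + length" idea concrete, the following sketch checks that a slice really does point into the original bytes. The helpers `isSubslice` and `offsetWithin` are hypothetical names introduced only for this illustration; they are not part of the standard library.

```zig
const std = @import("std");

/// Returns true when `part` lies entirely inside the memory of `whole`.
fn isSubslice(whole: []const u8, part: []const u8) bool {
    const whole_start = @intFromPtr(whole.ptr);
    const part_start = @intFromPtr(part.ptr);
    return part_start >= whole_start and
        part_start + part.len <= whole_start + whole.len;
}

/// Byte offset of `part` from the start of `whole`.
/// Only meaningful when `isSubslice(whole, part)` is true.
fn offsetWithin(whole: []const u8, part: []const u8) usize {
    return @intFromPtr(part.ptr) - @intFromPtr(whole.ptr);
}

pub fn main() void {
    const text: []const u8 = "hello zig";
    const first: []const u8 = text[0..5];
    const second: []const u8 = text[6..9];

    // Both slices borrow the bytes of `text`; nothing was copied.
    std.debug.print("{} {}\n", .{ isSubslice(text, first), isSubslice(text, second) });
    std.debug.print("{d} {d}\n", .{ offsetWithin(text, first), offsetWithin(text, second) });
}
```

Both slices report that they live inside `text`, at offsets 0 and 6. The only things that differ between them are the starting pointer and the length.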

#### Example: Get the First Word

```zig
const std = @import("std");

fn firstWord(text: []const u8) []const u8 {
    for (text, 0..) |byte, index| {
        if (byte == ' ') {
            return text[0..index];
        }
    }

    return text;
}

pub fn main() void {
    const text = "hello zig";
    const word = firstWord(text);

    std.debug.print("{s}\n", .{word});
}
```

Output:

```text
hello
```

The function returns a slice into the original text. It does not allocate.

That is efficient, but it also means the returned slice is valid only while the original text is valid.

#### Avoid Building Strings Too Early

Many programs waste work by creating new strings before they need to.

For example, suppose you want to check whether a file path ends with `.zig`.

You do not need to copy the path. You can inspect the existing bytes.

```zig
const std = @import("std");

fn isZigFile(path: []const u8) bool {
    return std.mem.endsWith(u8, path, ".zig");
}

pub fn main() void {
    std.debug.print("{}\n", .{isZigFile("main.zig")});
    std.debug.print("{}\n", .{isZigFile("main.c")});
}
```

Output:

```text
true
false
```

This function does not allocate, does not copy, and does not decode Unicode. It only compares bytes.
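The same idea extends to case-insensitive checks. A tempting approach is to lowercase a copy of the path first, but the comparison can be done in place. The sketch below assumes ASCII-only extensions and uses a hypothetical helper name, `hasExtensionIgnoreCase`.

```zig
const std = @import("std");

/// Reports whether `path` ends with `ext`, ignoring ASCII case.
/// Only the tail of `path` is inspected; nothing is lowercased or copied.
fn hasExtensionIgnoreCase(path: []const u8, ext: []const u8) bool {
    if (path.len < ext.len) return false;
    const tail = path[path.len - ext.len ..];
    return std.ascii.eqlIgnoreCase(tail, ext);
}

pub fn main() void {
    std.debug.print("{}\n", .{hasExtensionIgnoreCase("MAIN.ZIG", ".zig")});
    std.debug.print("{}\n", .{hasExtensionIgnoreCase("main.c", ".zig")});
}
```

This prints `true` and then `false`. The comparison still works on bytes, so it is only appropriate when the suffix being matched is plain ASCII.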

#### Use `std.mem` for Byte Slice Work

The `std.mem` namespace contains many useful functions for byte slices.

Common examples:

| Function | Purpose |
|---|---|
| `std.mem.eql` | Compare two slices |
| `std.mem.startsWith` | Check prefix |
| `std.mem.endsWith` | Check suffix |
| `std.mem.indexOf` | Find a slice inside another slice |
| `std.mem.splitScalar` | Split by one byte |
| `std.mem.tokenizeScalar` | Split while skipping empty parts |
| `std.mem.trim` | Strip any of a given set of bytes from both ends |

Example:

```zig
const std = @import("std");

pub fn main() void {
    const text = "name=zig";

    if (std.mem.indexOf(u8, text, "=")) |index| {
        const key = text[0..index];
        const value = text[index + 1 ..];

        std.debug.print("key = {s}\n", .{key});
        std.debug.print("value = {s}\n", .{value});
    }
}
```

Output:

```text
key = name
value = zig
```

Again, `key` and `value` are slices into `text`. No allocation happens.
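As a further illustration of the comparison functions in the table, here is a small sketch that classifies command-line style arguments with `std.mem.eql` and `std.mem.startsWith`. The `ArgKind` enum and `classifyArg` function are hypothetical names used only for this example.

```zig
const std = @import("std");

const ArgKind = enum { help, long_flag, positional };

/// Classifies an argument by comparing bytes only.
fn classifyArg(arg: []const u8) ArgKind {
    // Exact match against known spellings.
    if (std.mem.eql(u8, arg, "--help") or std.mem.eql(u8, arg, "-h")) return .help;
    // Prefix check for any other long flag.
    if (std.mem.startsWith(u8, arg, "--")) return .long_flag;
    return .positional;
}

pub fn main() void {
    const args = [_][]const u8{ "--help", "--release", "main.zig" };

    for (args) |arg| {
        std.debug.print("{s} -> {s}\n", .{ arg, @tagName(classifyArg(arg)) });
    }
}
```

The three arguments are reported as `help`, `long_flag`, and `positional`. Each check reads the argument in place and never builds a new string.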

#### Split Without Allocation

Splitting text does not need to create new strings.

```zig
const std = @import("std");

pub fn main() void {
    const path = "usr/local/bin";

    var it = std.mem.splitScalar(u8, path, '/');

    while (it.next()) |part| {
        std.debug.print("{s}\n", .{part});
    }
}
```

Output:

```text
usr
local
bin
```

Each `part` is a slice into `path`.

This is efficient because the iterator only tracks positions.

#### `splitScalar` vs `tokenizeScalar`

`splitScalar` keeps empty fields.

```zig
const std = @import("std");

pub fn main() void {
    const text = "a,,b";

    var it = std.mem.splitScalar(u8, text, ',');

    while (it.next()) |part| {
        std.debug.print("[{s}]\n", .{part});
    }
}
```

Output:

```text
[a]
[]
[b]
```

The empty part between the two commas is preserved.

`tokenizeScalar` skips empty fields.

```zig
const std = @import("std");

pub fn main() void {
    const text = "a,,b";

    var it = std.mem.tokenizeScalar(u8, text, ',');

    while (it.next()) |part| {
        std.debug.print("[{s}]\n", .{part});
    }
}
```

Output:

```text
[a]
[b]
```

Use `splitScalar` when empty fields matter, such as CSV-like data. Use `tokenizeScalar` when repeated separators should be ignored, such as simple whitespace tokenization.
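For whitespace tokenization there is usually more than one separator byte. A minimal sketch, assuming that spaces, tabs, carriage returns, and newlines all count as separators, can use `std.mem.tokenizeAny`, which accepts a set of delimiter bytes. The helper `nthWord` is a hypothetical name for this example.

```zig
const std = @import("std");

/// Returns the word at position `n` (counting from zero), or null if the
/// text has fewer words. The result is a slice into `text`.
fn nthWord(text: []const u8, n: usize) ?[]const u8 {
    var it = std.mem.tokenizeAny(u8, text, " \t\r\n");
    var i: usize = 0;

    while (it.next()) |word| : (i += 1) {
        if (i == n) return word;
    }

    return null;
}

pub fn main() void {
    const text = "  hello   zig\tworld \n";

    if (nthWord(text, 1)) |word| {
        std.debug.print("[{s}]\n", .{word});
    }
}
```

This prints `[zig]`. Runs of mixed whitespace are skipped, and the returned word still borrows from the input.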

#### Trim Without Allocation

Trimming can also return a slice.

```zig
const std = @import("std");

pub fn main() void {
    const line = "   hello zig   ";

    const trimmed = std.mem.trim(u8, line, " ");

    std.debug.print("[{s}]\n", .{trimmed});
}
```

Output:

```text
[hello zig]
```

`trimmed` points into `line`. It does not allocate a new string.

The last argument is a set of bytes to strip, so you can trim several kinds of whitespace at once:

```zig
const trimmed = std.mem.trim(u8, line, " \t\r\n");
```

This removes spaces, tabs, carriage returns, and newlines from both ends.

#### Scan Once When Possible

A common performance rule is: avoid reading the same text many times.

For example, this counts lines:

```zig
fn countLines(text: []const u8) usize {
    var count: usize = 0;

    for (text) |byte| {
        if (byte == '\n') {
            count += 1;
        }
    }

    return count;
}
```

This is efficient because it scans once from left to right.

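If the input may not end with a newline, you may want to count the final unterminated line too:

```zig
fn countLines(text: []const u8) usize {
    var count: usize = 0;

    for (text) |byte| {
        if (byte == '\n') {
            count += 1;
        }
    }

    // A last line without a trailing newline still counts as a line.
    if (text.len > 0 and text[text.len - 1] != '\n') {
        count += 1;
    }

    return count;
}
```

This version counts a last line that has no trailing newline, without counting an extra line when the text does end with one. Empty text still has zero lines.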
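When several facts about the same text are needed, they can often be gathered in the same pass instead of scanning once per fact. The following sketch collects line, word, and byte counts together. `TextStats` and `collectStats` are hypothetical names, and "word" here simply means a run of bytes that are not ASCII whitespace.

```zig
const std = @import("std");

const TextStats = struct {
    lines: usize = 0,
    words: usize = 0,
    bytes: usize = 0,
};

/// Gathers all three counts in a single left-to-right scan.
fn collectStats(text: []const u8) TextStats {
    var stats = TextStats{ .bytes = text.len };
    var in_word = false;

    for (text) |byte| {
        if (byte == '\n') stats.lines += 1;

        const is_space = byte == ' ' or byte == '\t' or byte == '\r' or byte == '\n';
        if (is_space) {
            in_word = false;
        } else if (!in_word) {
            // First non-space byte after whitespace starts a new word.
            in_word = true;
            stats.words += 1;
        }
    }

    return stats;
}

pub fn main() void {
    const stats = collectStats("hello zig\nbye\n");
    std.debug.print("lines={d} words={d} bytes={d}\n", .{ stats.lines, stats.words, stats.bytes });
}
```

This prints `lines=2 words=3 bytes=14`. Three separate functions would give the same numbers, but would read the input three times.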

#### Parse Without Copying

Suppose you parse key-value lines:

```text
name=zig
version=0.16
mode=debug
```

You can parse each line using slices.

```zig
const std = @import("std");

fn printKeyValue(line: []const u8) void {
    if (std.mem.indexOf(u8, line, "=")) |index| {
        const key = line[0..index];
        const value = line[index + 1 ..];

        std.debug.print("key={s}, value={s}\n", .{ key, value });
    }
}

pub fn main() void {
    const text =
        \\name=zig
        \\version=0.16
        \\mode=debug
    ;

    var lines = std.mem.splitScalar(u8, text, '\n');

    while (lines.next()) |line| {
        if (line.len == 0) continue;
        printKeyValue(line);
    }
}
```

Output:

```text
key=name, value=zig
key=version, value=0.16
key=mode, value=debug
```

The parser does not allocate. Each key and value is a slice into the original text.
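The same approach works for lookups. As a sketch, the hypothetical function `findValue` below searches key-value text for one key and returns the matching value as a slice into the input. It assumes keys and values carry no surrounding whitespace.

```zig
const std = @import("std");

/// Returns the value for `wanted_key`, or null if the key is absent.
/// The result borrows from `text`; nothing is copied.
fn findValue(text: []const u8, wanted_key: []const u8) ?[]const u8 {
    var lines = std.mem.splitScalar(u8, text, '\n');

    while (lines.next()) |line| {
        // Lines without '=' are skipped.
        const index = std.mem.indexOfScalar(u8, line, '=') orelse continue;

        if (std.mem.eql(u8, line[0..index], wanted_key)) {
            return line[index + 1 ..];
        }
    }

    return null;
}

pub fn main() void {
    const text =
        \\name=zig
        \\version=0.16
        \\mode=debug
    ;

    const version = findValue(text, "version") orelse "(missing)";
    std.debug.print("version = {s}\n", .{version});
}
```

This prints `version = 0.16`. Because the result is borrowed, it stays valid only as long as `text` does.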

#### Allocate Only for Owned Results

Sometimes you need a result that outlives the input. Then you should allocate or copy.

For example, this function returns a slice into the input:

```zig
const std = @import("std");

fn extension(path: []const u8) ?[]const u8 {
    if (std.mem.lastIndexOfScalar(u8, path, '.')) |index| {
        return path[index + 1 ..];
    }

    return null;
}
```

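That is fine when the caller keeps `path` alive. Note that this simple version looks only for the last `.` in the whole string, so a path such as `dir.d/file` yields `d/file`. For real file paths, `std.fs.path.extension` considers only the final path component.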

But if you need to store the extension after the original path is gone, make a copy:

```zig
fn copyExtension(
    allocator: std.mem.Allocator,
    path: []const u8,
) !?[]u8 {
    const ext = extension(path) orelse return null;

    // `dupe` allocates `ext.len` bytes and copies the slice into them.
    return try allocator.dupe(u8, ext);
}
```

The caller owns the returned copy and must free it with the same allocator that created it.
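The difference between a borrowed slice and an owned copy shows up as soon as the original storage is reused. The sketch below is one way to see it: the path lives in a stack buffer that is later overwritten. It uses `std.heap.page_allocator` purely to keep the example short; that choice is an assumption of this sketch, not a recommendation.

```zig
const std = @import("std");

fn extension(path: []const u8) ?[]const u8 {
    const index = std.mem.lastIndexOfScalar(u8, path, '.') orelse return null;
    return path[index + 1 ..];
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    var path_buffer: [32]u8 = undefined;
    const path = try std.fmt.bufPrint(&path_buffer, "main.zig", .{});

    // `borrowed` points into `path_buffer`; `owned` has its own memory.
    const borrowed = extension(path) orelse return;
    const owned = try allocator.dupe(u8, borrowed);
    defer allocator.free(owned);

    // Reuse the buffer for something else.
    @memset(&path_buffer, 'x');

    std.debug.print("borrowed = {s}\n", .{borrowed});
    std.debug.print("owned    = {s}\n", .{owned});
}
```

After the buffer is overwritten, the borrowed slice reads `xxx` while the owned copy still reads `zig`. The borrowed slice is still pointing at valid memory here, but its contents no longer mean what the program expects.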

#### Reuse Buffers

If you build temporary text repeatedly, reuse a buffer or `ArrayList`.

```zig
const std = @import("std");

pub fn main() !void {
    var buffer: [128]u8 = undefined;

    for (0..3) |i| {
        const message = try std.fmt.bufPrint(buffer[0..], "item {}", .{i});
        std.debug.print("{s}\n", .{message});
    }
}
```

Output:

```text
item 0
item 1
item 2
```

The same stack buffer is reused for each message.

This avoids heap allocation.
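A fixed buffer has a fixed limit. `std.fmt.bufPrint` reports `error.NoSpaceLeft` when the formatted text does not fit, so the caller decides what to do. The sketch below falls back to a placeholder; `formatItem` is a hypothetical helper and the placeholder text is an arbitrary choice.

```zig
const std = @import("std");

/// Formats into `buffer`, or returns a placeholder when the buffer is too
/// small (bufPrint fails with error.NoSpaceLeft in that case).
fn formatItem(buffer: []u8, index: usize) []const u8 {
    return std.fmt.bufPrint(buffer, "item {d}", .{index}) catch "(too long)";
}

pub fn main() void {
    var large: [16]u8 = undefined;
    var small: [4]u8 = undefined;

    std.debug.print("{s}\n", .{formatItem(&large, 7)});
    std.debug.print("{s}\n", .{formatItem(&small, 7)});
}
```

The first call prints `item 7`. The second prints `(too long)` because six bytes do not fit in a four-byte buffer.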

For variable-size output, reuse an `ArrayList`:

```zig
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    const allocator = gpa.allocator();

    var text = std.ArrayList(u8).init(allocator);
    defer text.deinit();

    for (0..3) |i| {
        text.clearRetainingCapacity();

        try text.writer().print("item {}", .{i});
        std.debug.print("{s}\n", .{text.items});
    }
}
```

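The list keeps its allocated capacity and reuses it. The `ArrayList` and allocator APIs have changed between Zig releases, so calls such as `init`, `deinit`, and `writer` may need adjusting for your compiler version.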

#### Avoid Holding Slices Across Reallocation

This is a common bug:

```zig
const old = text.items;

try text.appendSlice("more data");

// old may now be invalid
```

An `ArrayList` may reallocate when it grows. If it reallocates, old slices into its storage may become invalid.

Use `text.items` again after operations that may grow the list.

```zig
try text.appendSlice("more data");

const current = text.items;
```

This rule matters for efficient code because efficient code often keeps references. Keep them only as long as the underlying memory is stable.
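One way to keep a reference across a possible move is to store offsets instead of a slice, and turn them back into a slice only when needed. The sketch below simulates the move by copying between two fixed buffers rather than growing a real list. `Span` is a hypothetical helper type, not something from the standard library.

```zig
const std = @import("std");

/// A position inside some storage, recorded as offsets rather than a pointer.
const Span = struct {
    start: usize,
    len: usize,

    /// Turns the offsets back into a slice of whatever storage is current.
    fn resolve(self: Span, storage: []const u8) []const u8 {
        return storage[self.start..][0..self.len];
    }
};

pub fn main() void {
    var small: [16]u8 = undefined;
    var large: [32]u8 = undefined;

    const content = "name=zig";
    @memcpy(small[0..content.len], content);

    // Remember where the value lives by offset, not by slice.
    const value = Span{ .start = 5, .len = 3 };

    // Simulate the storage moving, as a growing list might do.
    @memcpy(large[0..content.len], small[0..content.len]);
    @memset(&small, 0);

    std.debug.print("{s}\n", .{value.resolve(large[0..content.len])});
}
```

This prints `zig`. The offsets stay meaningful after the bytes move, whereas a slice taken from the old storage would not.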

#### Know When UTF-8 Decoding Is Needed

Many text tasks are byte tasks:

```text
check file extension
split path by slash
parse ASCII protocol headers
find newline
compare command names
trim spaces
```

For these, byte operations are correct and fast.

Some tasks need Unicode-aware processing:

```text
count user-visible characters
move cursor by character
uppercase multilingual text
validate user text
slice without breaking code points
display aligned columns with non-ASCII text
```

For these, use UTF-8 validation and decoding.

Do not decode Unicode when byte processing is enough. Do not use byte processing when Unicode meaning matters.
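A quick way to see the gap between bytes and Unicode is to compare a string's byte length with its code point count. This sketch uses `std.unicode.utf8CountCodepoints` and `std.unicode.utf8ValidateSlice`. The string is written with `\u{...}` escapes so the byte count is unambiguous. Keep in mind that a code point is still not the same as a user-visible character, which can span several code points.

```zig
const std = @import("std");

pub fn main() !void {
    // "A", "é" (U+00E9), and "你" (U+4F60): 1 + 2 + 3 bytes in UTF-8.
    const text = "A\u{E9}\u{4F60}";

    // Decoding is needed to count code points; it fails on invalid UTF-8.
    const codepoints = try std.unicode.utf8CountCodepoints(text);
    std.debug.print("bytes = {d}, code points = {d}\n", .{ text.len, codepoints });

    // A lone 0xFF byte can never appear in valid UTF-8.
    std.debug.print("{}\n", .{std.unicode.utf8ValidateSlice("\xff")});
}
```

This prints `bytes = 6, code points = 3`, followed by `false`. Indexing or slicing this text by byte position could land in the middle of a multi-byte sequence, which is why tasks in the second list need decoding.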

#### Example: Validate Before Unicode Processing

```zig
const std = @import("std");

fn printCodepoints(text: []const u8) !void {
    const view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();

    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X}\n", .{cp});
    }
}

pub fn main() !void {
    const text = "Aé你";
    try printCodepoints(text);
}
```

This checks that the text is valid UTF-8 before iterating over code points. If the text is invalid, `Utf8View.init` returns an error, which `try` passes on to the caller.

#### Use Writers for Streaming Output

If output may become large, you do not always need to build one big string first.

You can write directly to a writer.

For example, this function writes CSV-style output:

```zig
const std = @import("std");

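// Writes one row to any writer that provides `print`.
// Note: a double quote inside `name` is not escaped here; real CSV output needs that.
fn writeCsvRow(writer: anytype, name: []const u8, score: u32) !void {
    try writer.print("\"{s}\",{}\n", .{ name, score });
}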
```

You can write to an `ArrayList`:

```zig
try writeCsvRow(text.writer(), "Ada", 95);
```

Or to another writer, such as a file writer.

The function does not care where the output goes. This avoids unnecessary intermediate strings.

#### Complete Example

```zig
const std = @import("std");

fn parseLine(line: []const u8) ?struct {
    key: []const u8,
    value: []const u8,
} {
    const trimmed = std.mem.trim(u8, line, " \t\r\n");

    if (trimmed.len == 0) return null;

    const index = std.mem.indexOfScalar(u8, trimmed, '=') orelse return null;

    return .{
        .key = std.mem.trim(u8, trimmed[0..index], " \t"),
        .value = std.mem.trim(u8, trimmed[index + 1 ..], " \t"),
    };
}

pub fn main() void {
    const text =
        \\ name = zig
        \\ version = 0.16
        \\ mode = debug
    ;

    var lines = std.mem.splitScalar(u8, text, '\n');

    while (lines.next()) |line| {
        if (parseLine(line)) |entry| {
            std.debug.print("{s} -> {s}\n", .{ entry.key, entry.value });
        }
    }
}
```

Output:

```text
name -> zig
version -> 0.16
mode -> debug
```

This example does not allocate. It uses slices into the original input.

#### Summary

Efficient text processing in Zig is mostly about restraint.

Use slices instead of copies. Use `std.mem` for byte-level text work. Allocate only when the result must be owned or must outlive the input. Reuse buffers when building temporary text. Use writers when output can be streamed.

For ASCII-like protocols and file formats, byte processing is often enough. For human language text, validate and decode UTF-8 when Unicode meaning matters.

Zig gives you the tools, but it does not hide the cost. That is the point: you can see when text is borrowed, copied, allocated, decoded, or written.