UTF-8 Handling

Zig strings are bytes. UTF-8 is one way to interpret those bytes as text.

The string "hello" is simple. Each character uses one byte.

const s = "hello";

The byte length is five.

const std = @import("std");

pub fn main() void {
    const s = "hello";

    std.debug.print("{d}\n", .{s.len});
}

The output is:

5

A non-ASCII character may use more than one byte.

const std = @import("std");

pub fn main() void {
    const s = "é";

    std.debug.print("{d}\n", .{s.len});
}

The output is:

2

The value s.len is a byte count. It is not a character count.

This matters when indexing:

const std = @import("std");

pub fn main() void {
    const s = "é";

    std.debug.print("{d}\n", .{s[0]});
    std.debug.print("{d}\n", .{s[1]});
}

The output is:

195
169

The expression s[0] does not mean the first character. It means the first byte.

For pure ASCII text, byte indexing and character indexing give the same result, because every ASCII character is encoded as exactly one byte. For UTF-8 text in general, they do not.

UTF-8 encodes each Unicode code point as a sequence of one to four bytes. Zig does not decode that sequence automatically when you index a string. You must ask for that work.
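To make the decoding step concrete, here is a minimal sketch of what that work involves for a two-byte sequence. The helpers sequenceLength and decodeTwoBytes are hypothetical names written for this illustration, not standard library functions, and the sketch does not reject overlong or out-of-range encodings.

```zig
const std = @import("std");

// Hypothetical helper, not part of std: the sequence length implied by a
// leading byte, or null if the byte cannot start a sequence.
// A sketch only: it does not reject overlong or out-of-range encodings.
fn sequenceLength(first: u8) ?usize {
    if ((first & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((first & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((first & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((first & 0xF8) == 0xF0) return 4; // 11110xxx
    return null; // continuation byte (10xxxxxx) or invalid leading byte
}

// Hypothetical helper: combines the payload bits of a two-byte sequence.
fn decodeTwoBytes(b0: u8, b1: u8) u21 {
    const hi: u21 = b0 & 0x1F; // low five bits of the leading byte
    const lo: u21 = b1 & 0x3F; // low six bits of the continuation byte
    return (hi << 6) | lo;
}

pub fn main() void {
    const s = "\u{E9}"; // é, stored as the bytes 195 and 169

    if (sequenceLength(s[0])) |n| {
        std.debug.print("{d}\n", .{n});
    }
    std.debug.print("U+{X}\n", .{decodeTwoBytes(s[0], s[1])});
}
```

Under these assumptions the program prints 2 and then U+E9. Real code should prefer a validated decoder over hand-written bit manipulation.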

The standard library provides UTF-8 helpers in std.unicode.

const std = @import("std");

pub fn main() void {
    const s = "hé";

    const view = std.unicode.Utf8View.init(s) catch {
        std.debug.print("invalid utf-8\n", .{});
        return;
    };

    var it = view.iterator();

    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X}\n", .{cp});
    }
}

The output is:

U+68
U+E9

The first code point is h. The second is é.

The call:

std.unicode.Utf8View.init(s)

checks that the byte slice is valid UTF-8. It can fail, so the example uses catch.

After the view is created, the iterator walks through code points, not raw bytes.

This is different from:

for (s) |b| {
    ...
}

That loop walks through bytes.

UTF-8 validity is a property of the byte sequence. A []const u8 may contain valid UTF-8, or it may contain arbitrary bytes. The type does not say which. If a function requires text, validate it or document that the caller must pass valid UTF-8.
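One way to validate at the boundary is sketched below. It uses std.unicode.utf8ValidateSlice, which reports whether a byte slice is valid UTF-8. The wrapper requireUtf8 and its error name are hypothetical, chosen only for this example.

```zig
const std = @import("std");

// Hypothetical helper: accepts a byte slice only if it is valid UTF-8.
fn requireUtf8(s: []const u8) error{InvalidUtf8}![]const u8 {
    if (!std.unicode.utf8ValidateSlice(s)) return error.InvalidUtf8;
    return s;
}

pub fn main() void {
    const inputs = [_][]const u8{
        "h\u{E9}", // valid: h followed by é
        "\xC3", // invalid: leading byte with no continuation byte
        "\xFF", // invalid: 0xFF never appears in UTF-8
    };

    for (inputs) |input| {
        if (requireUtf8(input)) |text| {
            std.debug.print("ok, {d} bytes\n", .{text.len});
        } else |err| {
            std.debug.print("rejected: {s}\n", .{@errorName(err)});
        }
    }
}
```

The first input is accepted and the other two are rejected. The second input shows that slicing a string in the middle of a multi-byte sequence produces bytes that are no longer valid UTF-8.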

A small function can count code points:

const std = @import("std");

fn countCodepoints(s: []const u8) !usize {
    const view = try std.unicode.Utf8View.init(s);
    var it = view.iterator();

    var n: usize = 0;
    while (it.nextCodepoint()) |_| {
        n += 1;
    }

    return n;
}

pub fn main() !void {
    const s = "hé";

    const n = try countCodepoints(s);

    std.debug.print("{d}\n", .{n});
}

The output is:

2

The return type:

!usize

means the function returns either an error or a usize.

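In countCodepoints, the try expression returns early with the error if UTF-8 validation fails. Otherwise it yields the Utf8View. In main, try does the same for the call to countCodepoints: it propagates the error, or it unwraps the usize.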

Do not confuse code points with what a person sees as one character. Some visible characters are made from several code points. For example, a letter plus a combining mark may display as one glyph. Emoji sequences can also contain several code points.
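The combining-mark case can be shown in a short sketch. The string below is the letter e followed by U+0301, the combining acute accent. It uses std.unicode.utf8CountCodepoints to count code points, and the variable names are chosen for this example.

```zig
const std = @import("std");

pub fn main() !void {
    // The letter e followed by U+0301 COMBINING ACUTE ACCENT.
    // Most renderers draw this pair as a single accented letter.
    const s = "e\u{301}";

    const code_points = try std.unicode.utf8CountCodepoints(s);

    std.debug.print("bytes: {d}\n", .{s.len});
    std.debug.print("code points: {d}\n", .{code_points});
}
```

This prints 3 bytes and 2 code points for text that a reader would count as one character. Neither number matches what the reader sees.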

For many systems programs, code points are enough. For full human text handling, use a Unicode library that understands grapheme clusters, normalization, case mapping, and locale rules.

Zig keeps the base language simple. A string is bytes. UTF-8 handling is explicit.

Exercises.

Exercise 6-21. Print the byte length of "zig" and "日本".

Exercise 6-22. Print each byte of "é" in hexadecimal.

Exercise 6-23. Use std.unicode.Utf8View to print the code points of "hé".

Exercise 6-24. Write a function that returns the number of UTF-8 code points in a []const u8.