Zig strings are bytes. UTF-8 is one way to interpret those bytes as text.
The string "hello" is simple. Each character uses one byte.
    const s = "hello";
The byte length is five.
    const std = @import("std");
    pub fn main() void {
        const s = "hello";
        std.debug.print("{d}\n", .{s.len});
    }
The output is:
    5
A non-ASCII character may use more than one byte.
    const std = @import("std");
    pub fn main() void {
        const s = "é";
        std.debug.print("{d}\n", .{s.len});
    }
The output is:
    2
The value s.len is a byte count. It is not a character count.
This matters when indexing:
    const std = @import("std");
    pub fn main() void {
        const s = "é";
        std.debug.print("{d}\n", .{s[0]});
        std.debug.print("{d}\n", .{s[1]});
    }
The output is:
    195
    169
The expression s[0] does not mean the first character. It means the first byte.
For pure ASCII text, byte indexing and character indexing give the same result, because every character is one byte. For UTF-8 text in general, they differ.
A Unicode code point is encoded in UTF-8 as one to four bytes. Zig does not decode it automatically when you index a string. You must ask for that work.
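As a small illustration of asking for that work, the sketch below walks a string by reading each lead byte and asking std.unicode.utf8ByteSequenceLength how many bytes the sequence occupies. It assumes the input is already valid UTF-8. It only inspects lead bytes and does not check the continuation bytes.

```zig
const std = @import("std");

pub fn main() !void {
    const s = "hé";
    var i: usize = 0;
    while (i < s.len) {
        // The lead byte alone tells how long the encoded sequence is.
        const n = try std.unicode.utf8ByteSequenceLength(s[i]);
        std.debug.print("offset {d}: {d}-byte sequence\n", .{ i, n });
        // Skip to the lead byte of the next code point.
        i += n;
    }
}
```

The letter h starts at offset 0 and uses one byte. The letter é starts at offset 1 and uses two.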
The standard library provides UTF-8 helpers in std.unicode.
    const std = @import("std");
    pub fn main() void {
        const s = "hé";
        // The view is never modified, so it is declared const.
        const view = std.unicode.Utf8View.init(s) catch {
            std.debug.print("invalid utf-8\n", .{});
            return;
        };
        var it = view.iterator();
        while (it.nextCodepoint()) |cp| {
            std.debug.print("U+{X}\n", .{cp});
        }
    }
The output is:
    U+68
    U+E9
The first code point is h. The second is é.
The call:
    std.unicode.Utf8View.init(s)
checks that the byte slice is valid UTF-8. It can fail, so the example uses catch.
After the view is created, the iterator walks through code points, not raw bytes.
This is different from:
    for (s) |b| {
        ...
    }
That loop walks through bytes.
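To see the two views side by side, the following sketch uses the iterator's nextCodepointSlice method, which yields the bytes that make up each code point rather than the decoded value. It is meant as an illustration only, and the output format is an arbitrary choice.

```zig
const std = @import("std");

pub fn main() !void {
    const s = "hé";
    const view = try std.unicode.Utf8View.init(s);
    var it = view.iterator();
    // Each slice covers exactly the bytes of one code point.
    while (it.nextCodepointSlice()) |bytes| {
        std.debug.print("{s}: {d} byte(s)\n", .{ bytes, bytes.len });
    }
}
```

The iterator takes two steps over "hé", while a byte loop over the same string would take three.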
UTF-8 validity is a property of the byte sequence. A []const u8 may contain valid UTF-8, or it may contain arbitrary bytes. The type does not say which. If a function requires text, validate it or document that the caller must pass valid UTF-8.
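One way to validate at a boundary is std.unicode.utf8ValidateSlice, which returns a bool. The sketch below is a minimal example. The helper name describe is hypothetical and exists only for this illustration.

```zig
const std = @import("std");

// Hypothetical helper: reports whether a byte slice can be treated as text.
fn describe(bytes: []const u8) []const u8 {
    return if (std.unicode.utf8ValidateSlice(bytes)) "valid UTF-8" else "not valid UTF-8";
}

pub fn main() void {
    const good = "hé";
    // 0xC3 starts a two-byte sequence, but no continuation byte follows.
    const bad = [_]u8{ 'h', 0xC3 };
    std.debug.print("{s}\n", .{describe(good)});
    std.debug.print("{s}\n", .{describe(&bad)});
}
```

Both arguments have the same slice type once passed to describe. Only the check distinguishes them.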
A small function can count code points:
    const std = @import("std");
    fn countCodepoints(s: []const u8) !usize {
        // The view is never modified, so it is declared const.
        const view = try std.unicode.Utf8View.init(s);
        var it = view.iterator();
        var n: usize = 0;
        while (it.nextCodepoint()) |_| {
            n += 1;
        }
        return n;
    }
    pub fn main() !void {
        const s = "hé";
        const n = try countCodepoints(s);
        std.debug.print("{d}\n", .{n});
    }
The output is:
    2
The return type:
    !usize
means the function returns either an error or a usize.
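In countCodepoints, the try expression returns the error to the caller if UTF-8 validation fails. Otherwise it yields the view. In main, try does the same for the call to countCodepoints and unwraps the usize.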
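A caller that wants to handle the failure itself can use catch instead of try. The sketch below repeats countCodepoints and passes it bytes that are not valid UTF-8. The error name printed comes from the standard library. In the versions this sketch assumes, it is InvalidUtf8.

```zig
const std = @import("std");

fn countCodepoints(s: []const u8) !usize {
    const view = try std.unicode.Utf8View.init(s);
    var it = view.iterator();
    var n: usize = 0;
    while (it.nextCodepoint()) |_| {
        n += 1;
    }
    return n;
}

pub fn main() void {
    // The byte 0xFF never appears in UTF-8, so validation fails.
    const bad = [_]u8{ 'h', 0xFF };
    const n = countCodepoints(&bad) catch |err| {
        std.debug.print("error: {s}\n", .{@errorName(err)});
        return;
    };
    std.debug.print("{d}\n", .{n});
}
```

Here main returns void rather than !void, because every error is handled inside it.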
Do not confuse code points with what a person sees as one character. Some visible characters are made from several code points. For example, a letter plus a combining mark may display as one glyph. Emoji sequences can also contain several code points.
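The combining-mark case can be checked directly. This sketch compares a precomposed é (U+00E9) with the letter e followed by U+0301, the combining acute accent. It uses std.unicode.utf8CountCodepoints from the standard library. The two strings usually render the same, though that depends on the terminal or font.

```zig
const std = @import("std");

pub fn main() !void {
    // One code point: U+00E9.
    const precomposed = "\u{E9}";
    // Two code points: "e" followed by U+0301 COMBINING ACUTE ACCENT.
    const combined = "e\u{301}";
    const a = try std.unicode.utf8CountCodepoints(precomposed);
    const b = try std.unicode.utf8CountCodepoints(combined);
    std.debug.print("bytes: {d}, code points: {d}\n", .{ precomposed.len, a });
    std.debug.print("bytes: {d}, code points: {d}\n", .{ combined.len, b });
}
```

The first string is two bytes and one code point. The second is three bytes and two code points. A reader sees one character in both cases.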
For many systems programs, code points are enough. For full human text handling, use a Unicode library that understands grapheme clusters, normalization, case mapping, and locale rules.
Zig keeps the base language simple. A string is bytes. UTF-8 handling is explicit.
Exercises.
Exercise 6-21. Print the byte length of "zig" and "日本".
Exercise 6-22. Print each byte of "é" in hexadecimal.
Exercise 6-23. Use std.unicode.Utf8View to print the code points of "hé".
Exercise 6-24. Write a function that returns the number of UTF-8 code points in a []const u8.