# UTF-8 Handling

Zig strings are bytes. UTF-8 is one way to interpret those bytes as text.

The string `"hello"` is simple. Each character uses one byte.

```zig
const s = "hello";
```

The byte length is five.

```zig
const std = @import("std");

pub fn main() void {
    const s = "hello";

    std.debug.print("{d}\n", .{s.len});
}
```

The output is:

```text
5
```

A non-ASCII character may use more than one byte.

```zig
const std = @import("std");

pub fn main() void {
    const s = "é";

    std.debug.print("{d}\n", .{s.len});
}
```

The output is:

```text
2
```

The value `s.len` is a byte count. It is not a character count.

This matters when indexing:

```zig
const std = @import("std");

pub fn main() void {
    const s = "é";

    std.debug.print("{d}\n", .{s[0]});
    std.debug.print("{d}\n", .{s[1]});
}
```

The output is:

```text
195
169
```

The expression `s[0]` does not mean the first character. It means the first byte.
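
The two bytes above are the UTF-8 encoding of one code point, U+00E9. As a rough sketch of the reverse direction, the following encodes that code point with `std.unicode.utf8Encode` and prints the resulting bytes. The helper name `encodeCodepoint` is made up for this illustration.

```zig
const std = @import("std");

// Hypothetical helper for this sketch: encodes one code point as UTF-8
// into `out` and returns how many bytes were written.
fn encodeCodepoint(cp: u21, out: []u8) !usize {
    const n = try std.unicode.utf8Encode(cp, out);
    return n;
}

pub fn main() !void {
    var buf: [4]u8 = undefined; // a UTF-8 sequence is at most four bytes

    const n = try encodeCodepoint(0xE9, &buf); // U+00E9 is é

    for (buf[0..n]) |b| {
        std.debug.print("{d}\n", .{b});
    }
}
```

The output is:

```text
195
169
```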

For pure ASCII text, byte indexing and character indexing give the same result, because every ASCII character is one byte. For text that contains multi-byte sequences, they differ.

A code point is encoded in UTF-8 as one to four bytes. Zig does not decode it automatically when you index a string. You must ask for that work.
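
The first byte of each sequence says how many bytes the code point uses. The sketch below steps through a string one sequence at a time with `std.unicode.utf8ByteSequenceLength`. It assumes the input is valid UTF-8, since it inspects only the first byte of each sequence and does not check the bytes that follow.

```zig
const std = @import("std");

pub fn main() !void {
    const s = "hé日";

    var i: usize = 0;
    while (i < s.len) {
        // The lead byte determines the length of the whole sequence.
        const len = try std.unicode.utf8ByteSequenceLength(s[i]);

        std.debug.print("byte {d}: {d}-byte sequence\n", .{ i, len });

        // Skip to the lead byte of the next code point.
        i += len;
    }
}
```

The output is:

```text
byte 0: 1-byte sequence
byte 1: 2-byte sequence
byte 3: 3-byte sequence
```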

The standard library provides UTF-8 helpers in `std.unicode`.

```zig
const std = @import("std");

pub fn main() void {
    const s = "hé";

    const view = std.unicode.Utf8View.init(s) catch {
        std.debug.print("invalid utf-8\n", .{});
        return;
    };

    var it = view.iterator();

    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X}\n", .{cp});
    }
}
```

The output is:

```text
U+68
U+E9
```

The first code point is `h`. The second is `é`.

The call:

```zig
std.unicode.Utf8View.init(s)
```

checks that the byte slice is valid UTF-8. It can fail, so the example uses `catch`.

After the view is created, the iterator walks through code points, not raw bytes.

This is different from:

```zig
for (s) |b| {
    std.debug.print("{d}\n", .{b}); // b is one byte (u8)
}
```

That loop walks through bytes.
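
To make the difference concrete, here is a small sketch that runs both loops over the same string. The second loop uses the iterator's `nextCodepointSlice`, which yields the bytes of each code point rather than its decoded value.

```zig
const std = @import("std");

pub fn main() !void {
    const s = "hé";

    // Byte loop: three iterations, one per byte.
    for (s) |b| {
        std.debug.print("byte 0x{X}\n", .{b});
    }

    // Code point loop: two iterations, one per code point.
    const view = try std.unicode.Utf8View.init(s);
    var it = view.iterator();

    while (it.nextCodepointSlice()) |bytes| {
        std.debug.print("code point \"{s}\" uses {d} byte(s)\n", .{ bytes, bytes.len });
    }
}
```

The output is:

```text
byte 0x68
byte 0xC3
byte 0xA9
code point "h" uses 1 byte(s)
code point "é" uses 2 byte(s)
```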

UTF-8 validity is a property of the byte sequence. A `[]const u8` may contain valid UTF-8, or it may contain arbitrary bytes. The type does not say which. If a function requires text, validate it or document that the caller must pass valid UTF-8.
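
When only a yes-or-no answer is needed, `std.unicode.utf8ValidateSlice` checks a slice without building a view. The sketch below applies it to one valid and one invalid byte sequence. The helper name `describe` is made up for this illustration.

```zig
const std = @import("std");

// Hypothetical helper for this sketch: reports whether the bytes are valid UTF-8.
fn describe(bytes: []const u8) []const u8 {
    return if (std.unicode.utf8ValidateSlice(bytes)) "valid UTF-8" else "not UTF-8";
}

pub fn main() void {
    // 0xC3 starts a two-byte sequence, but no second byte follows.
    const bad = [_]u8{ 'h', 0xC3 };

    std.debug.print("{s}\n", .{describe("hé")});
    std.debug.print("{s}\n", .{describe(&bad)});
}
```

The output is:

```text
valid UTF-8
not UTF-8
```

Both arguments have the same slice type, `[]const u8`. Only the check tells them apart.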

A small function can count code points:

```zig
const std = @import("std");

fn countCodepoints(s: []const u8) !usize {
    const view = try std.unicode.Utf8View.init(s);
    var it = view.iterator();

    var n: usize = 0;
    while (it.nextCodepoint()) |_| {
        n += 1;
    }

    return n;
}

pub fn main() !void {
    const s = "hé";

    const n = try countCodepoints(s);

    std.debug.print("{d}\n", .{n});
}
```

The output is:

```text
2
```

The return type:

```zig
!usize
```

means the function returns either an error or a `usize`.

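Inside `countCodepoints`, the `try` expression returns early with the error if UTF-8 validation fails. Otherwise it unwraps the `Utf8View`. In `main`, `try` does the same for the result of `countCodepoints`: it returns the error, or unwraps the `usize`.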
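
The standard library also has a function for this count, `std.unicode.utf8CountCodepoints`. As a brief sketch, it can be called the same way as the hand-written version, and it also returns an error for invalid UTF-8.

```zig
const std = @import("std");

pub fn main() !void {
    // Returns an error if the bytes are not valid UTF-8.
    const n = try std.unicode.utf8CountCodepoints("hé");

    std.debug.print("{d}\n", .{n});
}
```

The output is:

```text
2
```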

Do not confuse code points with what a person sees as one character. Some visible characters are made from several code points. For example, a letter plus a combining mark may display as one glyph. Emoji sequences can also contain several code points.
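
A short sketch of the combining-mark case: the string below is the letter `e` followed by U+0301, the combining acute accent. Most renderers show the pair as a single accented letter, but it is two code points and three bytes.

```zig
const std = @import("std");

pub fn main() !void {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT.
    const s = "e\u{301}";

    const n = try std.unicode.utf8CountCodepoints(s);

    std.debug.print("bytes: {d}\n", .{s.len});
    std.debug.print("code points: {d}\n", .{n});
}
```

The output is:

```text
bytes: 3
code points: 2
```

Neither number matches the single character a reader sees.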

For many systems programs, code points are enough. For full human text handling, use a Unicode library that understands grapheme clusters, normalization, case mapping, and locale rules.

Zig keeps the base language simple. A string is bytes. UTF-8 handling is explicit.

Exercises.

Exercise 6-21. Print the byte length of `"zig"` and `"日本"`.

Exercise 6-22. Print each byte of `"é"` in hexadecimal.

Exercise 6-23. Use `std.unicode.Utf8View` to print the code points of `"hé"`.

Exercise 6-24. Write a function that returns the number of UTF-8 code points in a `[]const u8`.