# UTF-8 Processing

Zig string data is usually stored as UTF-8 bytes.

That means a string is not a list of “characters” in the simple beginner sense. It is a sequence of bytes, and those bytes encode text.

```zig
const text = "Zig";
```

This string has 3 bytes:

```text
Z i g
```

But this string:

```zig
const text = "é";
```

has 2 bytes, even though it looks like 1 character.

That is the first rule of UTF-8 in Zig:

```text
text.len counts bytes, not human-visible characters
```

#### ASCII Is the Simple Case

ASCII text uses one byte per character.

```zig
const text = "hello";
```

The length is 5 bytes:

```zig
std.debug.print("{}\n", .{text.len});
```

Output:

```text
5
```

For plain English letters, digits, spaces, and many punctuation marks, byte length and character count match.

```text
h e l l o
1 1 1 1 1 bytes
```

That is why simple examples often look easy.

#### Non-ASCII Text Uses More Bytes

UTF-8 uses more than one byte for many characters.

```zig
const text = "é";
```

The visible text has 1 character, but the byte length is 2.

```zig
const std = @import("std");

pub fn main() void {
    const text = "é";
    std.debug.print("{}\n", .{text.len});
}
```

Output:

```text
2
```

Another example:

```zig
const text = "你好";
```

Each of these two Chinese characters uses 3 bytes in UTF-8.

```zig
std.debug.print("{}\n", .{text.len});
```

Output:

```text
6
```

So this string has 2 visible characters but 6 bytes.

#### Iterating Over Bytes

A normal `for` loop over a string gives you bytes.

```zig
const std = @import("std");

pub fn main() void {
    const text = "é";

    for (text) |byte| {
        std.debug.print("{}\n", .{byte});
    }
}
```

Output:

```text
195
169
```

Those two bytes together encode `é`.

The loop does not give you a character. It gives you one `u8` at a time.

This is correct and useful when you are processing bytes, file formats, protocols, ASCII text, or raw input.

But it is not enough for full Unicode text processing.

#### Printing Bytes as Hex

When learning UTF-8, it helps to print bytes in hexadecimal.

```zig
const std = @import("std");

pub fn main() void {
    const text = "é";

    for (text) |byte| {
        // Zero-pad to two digits so small bytes such as 0x0a print as "0a".
        std.debug.print("{x:0>2} ", .{byte});
    }

    std.debug.print("\n", .{});
}
```

Output:

```text
c3 a9
```

The UTF-8 encoding of `é` is:

```text
c3 a9
```

For ASCII:

```zig
const text = "A";
```

the byte is:

```text
41
```

The visible character `A` is one byte in UTF-8.

#### Code Points

Unicode assigns numbers to characters. These numbers are called code points.

For example:

```text
A   U+0041
é   U+00E9
你  U+4F60
```

UTF-8 is one way to encode those code points as bytes.

A code point may use:

```text
1 byte
2 bytes
3 bytes
4 bytes
```

in UTF-8.

So there are three related ideas:

| Idea | Meaning |
|---|---|
| Byte | A raw `u8` value |
| Code point | A Unicode number such as `U+00E9` |
| UTF-8 | Encoding that stores code points as bytes |

Zig strings are byte slices. If you need Unicode meaning, you decode the bytes.
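The byte counts listed above can be checked directly. The following is a small sketch, assuming your Zig version provides `std.unicode.utf8CodepointSequenceLength` and `std.unicode.utf8Encode`. The helper name `encodeCodepoint` is hypothetical and used only for illustration.

```zig
const std = @import("std");

// Hypothetical helper: encode one code point into `buf` and return the
// bytes that were written. Four bytes is the UTF-8 maximum.
fn encodeCodepoint(cp: u21, buf: *[4]u8) ![]const u8 {
    const len = try std.unicode.utf8Encode(cp, buf);
    return buf[0..len];
}

pub fn main() !void {
    const code_points = [_]u21{ 0x41, 0xE9, 0x4F60, 0x1F600 };

    for (code_points) |cp| {
        // How many UTF-8 bytes this code point needs (1 to 4).
        const needed = try std.unicode.utf8CodepointSequenceLength(cp);

        var buf: [4]u8 = undefined;
        const bytes = try encodeCodepoint(cp, &buf);

        std.debug.print("U+{X}: {} byte(s):", .{ cp, needed });
        for (bytes) |byte| {
            std.debug.print(" {x:0>2}", .{byte});
        }
        std.debug.print("\n", .{});
    }
}
```

The four sample code points need 1, 2, 3, and 4 bytes respectively. The last one, `U+1F600`, is an emoji and encodes as `f0 9f 98 80`.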

#### Decoding UTF-8

Zig’s standard library provides Unicode helpers.

A common operation is decoding one code point from UTF-8.

```zig
const std = @import("std");

pub fn main() !void {
    const text = "é";

    const cp = try std.unicode.utf8Decode(text);

    std.debug.print("U+{X}\n", .{cp});
}
```

Output:

```text
U+E9
```

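This decodes the byte slice `"é"` into one Unicode code point. Note that `utf8Decode` expects the slice to hold exactly one encoded code point, not a whole string.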

The function can fail because not every byte sequence is valid UTF-8. That is why the code uses `try`.
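When the input is longer than one code point, you need to find where the first code point ends before decoding it. Here is one possible sketch, assuming `std.unicode.utf8ByteSequenceLength` and `std.unicode.utf8Decode` are available in your Zig version. The helper `decodeFirst` and the error names `EmptyInput` and `TruncatedInput` are hypothetical.

```zig
const std = @import("std");

// Hypothetical helper: decode the first code point of a longer byte slice.
fn decodeFirst(text: []const u8) !u21 {
    if (text.len == 0) return error.EmptyInput;

    // The first byte says how long the encoded sequence should be.
    const len = try std.unicode.utf8ByteSequenceLength(text[0]);
    if (len > text.len) return error.TruncatedInput;

    // Hand utf8Decode exactly the bytes of one code point.
    return std.unicode.utf8Decode(text[0..len]);
}

pub fn main() void {
    // One valid input, one truncated sequence, one invalid start byte.
    const inputs = [_][]const u8{ "éa", "\xc3", "\xff" };

    for (inputs) |input| {
        if (decodeFirst(input)) |cp| {
            std.debug.print("U+{X}\n", .{cp});
        } else |err| {
            std.debug.print("error: {s}\n", .{@errorName(err)});
        }
    }
}
```

Handling the error with `if ... else |err|` instead of `try` lets the program keep going after bad input.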

#### Iterating Over UTF-8 Code Points

For real text, you often want to walk code point by code point.

One simple approach is to use a UTF-8 view.

```zig
const std = @import("std");

pub fn main() !void {
    const text = "Aé你";

    // `view` is never mutated, so it must be `const`.
    const view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();

    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X}\n", .{cp});
    }
}
```

Output:

```text
U+41
U+E9
U+4F60
```

The string has 3 code points:

```text
A
é
你
```

But its byte length is larger than 3.

```zig
std.debug.print("{}\n", .{text.len});
```

The byte length is:

```text
6
```

because:

```text
A   1 byte
é   2 bytes
你  3 bytes
```

#### Validating UTF-8

Not every byte slice is valid UTF-8.

A byte slice may come from a file, network socket, database, or binary protocol. It may contain invalid text.

You can validate it:

```zig
const std = @import("std");

pub fn main() void {
    const data = [_]u8{ 0xff, 0xfe, 0xfd };

    const ok = std.unicode.utf8ValidateSlice(data[0..]);

    std.debug.print("{}\n", .{ok});
}
```

Output:

```text
false
```

For a string literal:

```zig
const text = "hello";
```

validation succeeds:

```zig
const ok = std.unicode.utf8ValidateSlice(text);
```

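because this literal contains only valid UTF-8. Zig source files are UTF-8, so literals written as plain text are valid. A literal that uses byte escapes such as `\xff` can still contain invalid UTF-8.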
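For data that arrives at run time, one common pattern is to validate once at the boundary where bytes enter the program. The sketch below assumes that approach. The helper name `asUtf8` is hypothetical, and the `"\xff\xfe"` literal only stands in for bad external input.

```zig
const std = @import("std");

// Hypothetical helper: return the bytes as text only if they are valid UTF-8.
fn asUtf8(bytes: []const u8) ?[]const u8 {
    return if (std.unicode.utf8ValidateSlice(bytes)) bytes else null;
}

pub fn main() void {
    const inputs = [_][]const u8{ "héllo", "\xff\xfe" };

    for (inputs) |input| {
        if (asUtf8(input)) |text| {
            // Safe to treat as text from here on.
            std.debug.print("text: {s}\n", .{text});
        } else {
            // Still usable as raw bytes, just not as UTF-8 text.
            std.debug.print("not UTF-8 ({} bytes)\n", .{input.len});
        }
    }
}
```

Returning an optional makes the caller decide what to do with invalid input instead of silently printing broken text.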

#### Slicing UTF-8 Text

Be careful when slicing text.

This is safe:

```zig
const text = "hello";
const part = text[0..2];
```

The result is:

```text
he
```

ASCII characters are one byte each.

But with non-ASCII text:

```zig
const text = "é";
```

The string has two bytes.

This is valid:

```zig
const whole = text[0..2];
```

This is not valid UTF-8 text:

```zig
const broken = text[0..1];
```

The slice `broken` contains only the first byte of a two-byte character.

It is still a valid byte slice. But it is not valid UTF-8.

That distinction matters:

```text
[]const u8 can hold any bytes
valid UTF-8 is a rule about what those bytes mean
```

#### Byte Indexes vs Text Positions

When you search or slice Zig strings, indexes are byte indexes, not character positions.

For example:

```zig
const text = "AéZ";
```

The bytes are:

```text
A      é        Z
41     c3 a9    5a
```

The byte indexes are:

| Byte Index | Byte |
|---:|---|
| 0 | `41` |
| 1 | `c3` |
| 2 | `a9` |
| 3 | `5a` |

So:

```zig
text[0..1]
```

is `"A"`.

```zig
text[1..3]
```

is `"é"`.

```zig
text[3..4]
```

is `"Z"`.

But:

```zig
text[1..2]
```

splits `é` and gives invalid UTF-8.
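One way to check whether a byte index is safe to slice at is to look at the byte stored there. In UTF-8, continuation bytes always have the bit pattern `10xxxxxx`, so an index that points at a continuation byte is in the middle of a code point. The helper `isCodepointBoundary` below is a hypothetical sketch and assumes the input is already valid UTF-8.

```zig
const std = @import("std");

// Hypothetical helper. Assumes `text` is valid UTF-8.
// An index is a code point boundary when it is the start or end of the
// slice, or when the byte at that index is not a continuation byte.
fn isCodepointBoundary(text: []const u8, index: usize) bool {
    if (index > text.len) return false;
    if (index == 0 or index == text.len) return true;
    return (text[index] & 0b1100_0000) != 0b1000_0000;
}

pub fn main() void {
    const text = "AéZ";

    var i: usize = 0;
    while (i <= text.len) : (i += 1) {
        std.debug.print("{} -> {}\n", .{ i, isCodepointBoundary(text, i) });
    }
}
```

For `"AéZ"`, only index 2 is reported as not a boundary, because it points at `a9`, the second byte of `é`.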

#### Counting Code Points

You can count code points by decoding UTF-8.

```zig
const std = @import("std");

fn countCodepoints(text: []const u8) !usize {
    // `view` is never mutated, so it must be `const`.
    const view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();

    var count: usize = 0;
    while (it.nextCodepoint()) |_| {
        count += 1;
    }

    return count;
}

pub fn main() !void {
    const text = "Aé你";

    std.debug.print("bytes = {}\n", .{text.len});
    std.debug.print("code points = {}\n", .{try countCodepoints(text)});
}
```

Output:

```text
bytes = 6
code points = 3
```

This is often closer to what beginners mean by “characters,” but even code points are not always the same as visible characters.

#### Grapheme Clusters

A visible character can be made of multiple code points.

For example, a letter plus an accent can be represented as separate code points. Some emoji are also made from several code points joined together.

These user-visible units are called grapheme clusters.
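The letter-plus-accent case can be seen with two spellings of `é`. This sketch assumes `std.unicode.utf8CountCodepoints` is available in your Zig version. The `Lengths` struct and the `measure` helper are hypothetical. The code only counts bytes and code points and does not attempt grapheme segmentation.

```zig
const std = @import("std");

const Lengths = struct { bytes: usize, code_points: usize };

// Hypothetical helper: report the byte length and the code point count.
fn measure(text: []const u8) !Lengths {
    return Lengths{
        .bytes = text.len,
        .code_points = try std.unicode.utf8CountCodepoints(text),
    };
}

pub fn main() !void {
    // Precomposed: the single code point U+00E9.
    const precomposed = "\u{E9}";
    // Decomposed: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    const decomposed = "e\u{301}";

    const a = try measure(precomposed);
    const b = try measure(decomposed);

    std.debug.print("precomposed: {} bytes, {} code points\n", .{ a.bytes, a.code_points });
    std.debug.print("decomposed:  {} bytes, {} code points\n", .{ b.bytes, b.code_points });
}
```

Both strings normally display as one visible character, yet the first is 2 bytes and 1 code point while the second is 3 bytes and 2 code points.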

This means there are several possible “lengths” for text:

| Question | What It Measures |
|---|---|
| How many bytes? | Storage size |
| How many code points? | Unicode scalar values |
| How many grapheme clusters? | User-visible characters |

Zig’s basic string type does not hide this complexity. It gives you bytes. You choose the level of text processing you need.

For many systems programs, byte processing is enough. For a text editor, terminal UI, search engine, or multilingual application, you need deeper Unicode handling.

#### ASCII-Only Processing

Many programs intentionally process only ASCII.

For example, HTTP headers, many config keys, file extensions, identifiers, and protocol tokens often use ASCII rules.

ASCII uppercase conversion:

```zig
fn asciiToUpper(byte: u8) u8 {
    if (byte >= 'a' and byte <= 'z') {
        return byte - ('a' - 'A');
    }

    return byte;
}
```

Apply it to a mutable slice:

```zig
fn uppercaseAscii(text: []u8) void {
    for (text) |*byte| {
        byte.* = asciiToUpper(byte.*);
    }
}
```

This is fine when the input is known to be ASCII.

Do not use this as full Unicode uppercase conversion. Unicode case conversion is more complex.
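The standard library also ships ASCII helpers, so you often do not need to write these by hand. The sketch below assumes `std.ascii.toUpper` and `std.ascii.eqlIgnoreCase` are available in your Zig version. The helper name `uppercaseAsciiInPlace` is hypothetical.

```zig
const std = @import("std");

// Hypothetical helper: uppercase ASCII letters in place. Bytes outside
// 'a'..'z', including multi-byte UTF-8 sequences, are left untouched.
fn uppercaseAsciiInPlace(text: []u8) void {
    for (text) |*byte| {
        byte.* = std.ascii.toUpper(byte.*);
    }
}

pub fn main() void {
    // Protocol tokens such as header names are compared ignoring ASCII case.
    const same = std.ascii.eqlIgnoreCase("Content-Type", "content-type");
    std.debug.print("{}\n", .{same});

    // Copy the literal into a mutable array so it can be modified.
    var buf = "héllo".*;
    uppercaseAsciiInPlace(&buf);

    const out: []const u8 = &buf;
    std.debug.print("{s}\n", .{out});
}
```

The second line printed is `HéLLO`. The ASCII letters change, but the two bytes of `é` are not in the `a` to `z` range, so they stay as they are. The text is still valid UTF-8, just not fully uppercased.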

#### Complete Example

```zig
const std = @import("std");

fn printBytes(text: []const u8) void {
    std.debug.print("bytes: ", .{});

    for (text) |byte| {
        // Zero-pad to two digits so small bytes such as 0x0a print as "0a".
        std.debug.print("{x:0>2} ", .{byte});
    }

    std.debug.print("\n", .{});
}

fn printCodepoints(text: []const u8) !void {
    // `view` is never mutated, so it must be `const`.
    const view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();

    std.debug.print("code points: ", .{});

    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X} ", .{cp});
    }

    std.debug.print("\n", .{});
}

pub fn main() !void {
    const text = "Aé你";

    std.debug.print("text = {s}\n", .{text});
    std.debug.print("byte length = {}\n", .{text.len});

    printBytes(text);
    try printCodepoints(text);
}
```

Output:

```text
text = Aé你
byte length = 6
bytes: 41 c3 a9 e4 bd a0
code points: U+41 U+E9 U+4F60
```

This example shows the central idea:

```text
Zig stores text as bytes
UTF-8 explains how those bytes represent Unicode code points
```

#### Common Mistake: Assuming `.len` Counts Characters

This code:

```zig
const text = "你好";
std.debug.print("{}\n", .{text.len});
```

prints:

```text
6
```

not:

```text
2
```

The result is 6 because `.len` counts bytes.

#### Common Mistake: Slicing in the Middle of a Code Point

This is wrong if you expect valid UTF-8:

```zig
const text = "é";
const broken = text[0..1];
```

The slice exists, but it does not contain valid UTF-8.

Use UTF-8-aware logic when slicing human text.
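One possible form of UTF-8-aware slicing is to cut by code points instead of by bytes. This is a minimal sketch, assuming `Utf8View` and its iterator's `nextCodepointSlice` method are available in your Zig version. The helper name `prefixByCodepoints` is hypothetical.

```zig
const std = @import("std");

// Hypothetical helper: return the longest prefix of `text` that holds at
// most `max_codepoints` code points, never cutting a code point in half.
fn prefixByCodepoints(text: []const u8, max_codepoints: usize) ![]const u8 {
    const view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();

    var end: usize = 0; // byte offset just past the last whole code point
    var taken: usize = 0;
    while (taken < max_codepoints) : (taken += 1) {
        const slice = it.nextCodepointSlice() orelse break;
        end += slice.len;
    }

    return text[0..end];
}

pub fn main() !void {
    const text = "é你A";

    std.debug.print("{s}\n", .{try prefixByCodepoints(text, 1)});
    std.debug.print("{s}\n", .{try prefixByCodepoints(text, 2)});
}
```

This keeps every code point whole, but it can still split a grapheme cluster, such as a letter followed by a combining accent.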

#### Common Mistake: Treating ASCII Helpers as Unicode Helpers

This function:

```zig
fn uppercaseAscii(text: []u8) void {
    for (text) |*byte| {
        if (byte.* >= 'a' and byte.* <= 'z') {
            byte.* -= 32;
        }
    }
}
```

only handles ASCII letters.

It does not correctly uppercase all Unicode text.

Name functions honestly. `uppercaseAscii` is a better name than `uppercase`.

#### Summary

Zig strings are byte slices.

Most text is represented as:

```zig
[]const u8
```

The bytes are usually UTF-8, but the type itself is still a byte slice.

`.len` counts bytes. A normal `for` loop gives bytes. Slicing uses byte indexes.

For Unicode-aware processing, validate and decode UTF-8 using `std.unicode`.

The beginner rule is simple: treat strings as bytes first, and decode them only when you need Unicode meaning.