UTF-8 Processing

Zig string data is usually stored as UTF-8 bytes.

That means a string is not a list of “characters” in the simple beginner sense. It is a sequence of bytes, and those bytes encode text.

const text = "Zig";

This string has 3 bytes:

Z i g

But this string:

const text = "é";

has 2 bytes, even though it looks like 1 character.

That is the first rule of UTF-8 in Zig:

text.len counts bytes, not human-visible characters

ASCII Is the Simple Case

ASCII text uses one byte per character.

const text = "hello";

The length is 5 bytes:

std.debug.print("{}\n", .{text.len});

Output:

5

For plain English letters, digits, spaces, and many punctuation marks, byte length and character count match.

h e l l o
1 1 1 1 1 bytes

That is why simple examples often look easy.

Non-ASCII Text Uses More Bytes

UTF-8 uses more than one byte for many characters.

const text = "é";

The visible text has 1 character, but the byte length is 2.

const std = @import("std");

pub fn main() void {
    const text = "é";
    std.debug.print("{}\n", .{text.len});
}

Output:

2

Another example:

const text = "你好";

Each Chinese character uses 3 bytes in UTF-8.

std.debug.print("{}\n", .{text.len});

Output:

6

So this string has 2 visible characters but 6 bytes.

Iterating Over Bytes

A normal for loop over a string gives you bytes.

const std = @import("std");

pub fn main() void {
    const text = "é";

    for (text) |byte| {
        std.debug.print("{}\n", .{byte});
    }
}

Output:

195
169

Those two bytes together encode é.

The loop does not give you a character. It gives you one u8 at a time.

This is correct and useful when you are processing bytes, file formats, protocols, ASCII text, or raw input.

But it is not enough for full Unicode text processing.

Printing Bytes as Hex

When learning UTF-8, it helps to print bytes in hexadecimal.

const std = @import("std");

pub fn main() void {
    const text = "é";

    for (text) |byte| {
        std.debug.print("{x} ", .{byte});
    }

    std.debug.print("\n", .{});
}

Output:

c3 a9

The UTF-8 encoding of é is:

c3 a9

For ASCII:

const text = "A";

the byte is:

41

The visible character A is one byte in UTF-8.

Code Points

Unicode assigns numbers to characters. These numbers are called code points.

For example:

A   U+0041
é   U+00E9
你  U+4F60

UTF-8 is one way to encode those code points as bytes.

A code point may use:

1 byte
2 bytes
3 bytes
4 bytes

in UTF-8.
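The standard library can tell you how many bytes a given code point needs. A small sketch using std.unicode.utf8CodepointSequenceLength (it returns an error for values beyond the Unicode range, hence try):

```zig
const std = @import("std");

pub fn main() !void {
    // Number of UTF-8 bytes needed for each code point.
    std.debug.print("{}\n", .{try std.unicode.utf8CodepointSequenceLength('A')}); // 1
    std.debug.print("{}\n", .{try std.unicode.utf8CodepointSequenceLength(0xE9)}); // 2
    std.debug.print("{}\n", .{try std.unicode.utf8CodepointSequenceLength(0x4F60)}); // 3
    std.debug.print("{}\n", .{try std.unicode.utf8CodepointSequenceLength(0x1F600)}); // 4
}
```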

So there are three related ideas:

Idea         Meaning
Byte         A raw u8 value
Code point   A Unicode number such as U+00E9
UTF-8        An encoding that stores code points as bytes

Zig strings are byte slices. If you need Unicode meaning, you decode the bytes.

Decoding UTF-8

Zig’s standard library provides Unicode helpers.

A common operation is decoding one code point from UTF-8.

const std = @import("std");

pub fn main() !void {
    const text = "é";

    const cp = try std.unicode.utf8Decode(text);

    std.debug.print("U+{X}\n", .{cp});
}

Output:

U+E9

This decodes the byte slice "é" into one Unicode code point.

The function can fail because not every byte sequence is valid UTF-8. That is why the code uses try.
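To see the failure path, here is a sketch that decodes an invalid byte sequence and handles the error with catch (the specific bytes are an arbitrary invalid example):

```zig
const std = @import("std");

pub fn main() void {
    // 0xc3 starts a two-byte sequence, but 0x28 is not a valid continuation byte.
    const bad = [_]u8{ 0xc3, 0x28 };

    const cp = std.unicode.utf8Decode(bad[0..]) catch |err| {
        std.debug.print("decode failed: {}\n", .{err});
        return;
    };

    std.debug.print("U+{X}\n", .{cp});
}
```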

Iterating Over UTF-8 Code Points

For real text, you often want to walk code point by code point.

One simple approach is to use a UTF-8 view.

const std = @import("std");

pub fn main() !void {
    const text = "Aé你";

    var view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();

    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X}\n", .{cp});
    }
}

Output:

U+41
U+E9
U+4F60

The string has 3 code points:

A
é
你

But its byte length is larger than 3.

std.debug.print("{}\n", .{text.len});

The byte length is:

6

because:

A   1 byte
é   2 bytes
你  3 bytes

Validating UTF-8

Not every []u8 is valid UTF-8.

A byte slice may come from a file, network socket, database, or binary protocol. It may contain invalid text.

You can validate it:

const std = @import("std");

pub fn main() void {
    const data = [_]u8{ 0xff, 0xfe, 0xfd };

    const ok = std.unicode.utf8ValidateSlice(data[0..]);

    std.debug.print("{}\n", .{ok});
}

Output:

false

For a string literal:

const text = "hello";

validation succeeds:

const ok = std.unicode.utf8ValidateSlice(text);

because Zig string literals are valid UTF-8.

Slicing UTF-8 Text

Be careful when slicing text.

This is safe:

const text = "hello";
const part = text[0..2];

The result is:

he

ASCII characters are one byte each.

But with non-ASCII text:

const text = "é";

The string has two bytes.

This is valid:

const whole = text[0..2];

This is not valid UTF-8 text:

const broken = text[0..1];

The slice broken contains only the first byte of a two-byte character.

It is still a valid byte slice. But it is not valid UTF-8.

That distinction matters:

[]const u8 can hold any bytes
valid UTF-8 is a rule about what those bytes mean
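You can check this distinction with the validator from earlier. The whole string passes, but a slice that cuts through the two-byte character does not:

```zig
const std = @import("std");

pub fn main() void {
    const text = "é";

    // The whole string is valid UTF-8...
    std.debug.print("{}\n", .{std.unicode.utf8ValidateSlice(text)}); // true

    // ...but its first byte alone is not.
    std.debug.print("{}\n", .{std.unicode.utf8ValidateSlice(text[0..1])}); // false
}
```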

Byte Indexes vs Text Positions

When you search or slice Zig strings, indexes are usually byte indexes.

For example:

const text = "AéZ";

The bytes are:

A      é        Z
41     c3 a9    5a

The byte indexes are:

Byte Index   Byte
0            41
1            c3
2            a9
3            5a

So:

text[0..1]

is "A".

text[1..3]

is "é".

text[3..4]

is "Z".

But:

text[1..2]

splits é and gives invalid UTF-8.
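If you need to split text at safe boundaries, the UTF-8 iterator can hand you one whole code point at a time as a slice. A sketch using nextCodepointSlice:

```zig
const std = @import("std");

pub fn main() !void {
    const text = "AéZ";

    var view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();

    // Each slice covers exactly one code point, so it is always valid UTF-8.
    while (it.nextCodepointSlice()) |slice| {
        std.debug.print("{s} {}\n", .{ slice, slice.len });
    }
}
```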

Counting Code Points

You can count code points by decoding UTF-8.

const std = @import("std");

fn countCodepoints(text: []const u8) !usize {
    var view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();

    var count: usize = 0;
    while (it.nextCodepoint()) |_| {
        count += 1;
    }

    return count;
}

pub fn main() !void {
    const text = "Aé你";

    std.debug.print("bytes = {}\n", .{text.len});
    std.debug.print("code points = {}\n", .{try countCodepoints(text)});
}

Output:

bytes = 6
code points = 3

This is often closer to what beginners mean by “characters,” but even code points are not always the same as visible characters.

Grapheme Clusters

A visible character can be made of multiple code points.

For example, a letter plus an accent can be represented as separate code points. Some emoji are also made from several code points joined together.

These user-visible units are called grapheme clusters.

This means there are several possible “lengths” for text:

Question                       Meaning
How many bytes?                Storage size
How many code points?          Unicode scalar values
How many grapheme clusters?    User-visible characters

Zig’s basic string type does not hide this complexity. It gives you bytes. You choose the level of text processing you need.

For many systems programs, byte processing is enough. For a text editor, terminal UI, search engine, or multilingual application, you need deeper Unicode handling.
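As a concrete case, an accented letter can also be written as a base letter plus a combining accent. Counting code points shows that one visible character can be more than one code point:

```zig
const std = @import("std");

pub fn main() !void {
    // One visible character: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    const text = "e\u{0301}";

    var view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();

    var count: usize = 0;
    while (it.nextCodepoint()) |_| count += 1;

    std.debug.print("bytes = {}\n", .{text.len}); // 3
    std.debug.print("code points = {}\n", .{count}); // 2
}
```

One grapheme cluster, two code points, three bytes.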

ASCII-Only Processing

Many programs intentionally process only ASCII.

For example, HTTP headers, many config keys, file extensions, identifiers, and protocol tokens often use ASCII rules.

ASCII uppercase conversion:

fn asciiToUpper(byte: u8) u8 {
    if (byte >= 'a' and byte <= 'z') {
        return byte - ('a' - 'A');
    }

    return byte;
}

Apply it to a mutable slice:

fn uppercaseAscii(text: []u8) void {
    for (text) |*byte| {
        byte.* = asciiToUpper(byte.*);
    }
}

This is fine when the input is known to be ASCII.

Do not use this as full Unicode uppercase conversion. Unicode case conversion is more complex.
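String literals are immutable, so to use uppercaseAscii you need a mutable copy. A sketch that copies a literal into a buffer and uppercases it in place:

```zig
const std = @import("std");

fn asciiToUpper(byte: u8) u8 {
    if (byte >= 'a' and byte <= 'z') {
        return byte - ('a' - 'A');
    }
    return byte;
}

fn uppercaseAscii(text: []u8) void {
    for (text) |*byte| {
        byte.* = asciiToUpper(byte.*);
    }
}

pub fn main() void {
    // Copy the literal into a mutable buffer, then uppercase in place.
    var buffer = "hello".*;
    uppercaseAscii(&buffer);
    std.debug.print("{s}\n", .{&buffer}); // HELLO
}
```

The standard library also offers std.ascii.toUpper for single bytes, with the same ASCII-only behavior.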

Complete Example

const std = @import("std");

fn printBytes(text: []const u8) void {
    std.debug.print("bytes: ", .{});

    for (text) |byte| {
        std.debug.print("{x} ", .{byte});
    }

    std.debug.print("\n", .{});
}

fn printCodepoints(text: []const u8) !void {
    var view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();

    std.debug.print("code points: ", .{});

    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X} ", .{cp});
    }

    std.debug.print("\n", .{});
}

pub fn main() !void {
    const text = "Aé你";

    std.debug.print("text = {s}\n", .{text});
    std.debug.print("byte length = {}\n", .{text.len});

    printBytes(text);
    try printCodepoints(text);
}

Output:

text = Aé你
byte length = 6
bytes: 41 c3 a9 e4 bd a0
code points: U+41 U+E9 U+4F60

This example shows the central idea:

Zig stores text as bytes
UTF-8 explains how those bytes represent Unicode code points

Common Mistake: Assuming .len Counts Characters

This code:

const text = "你好";
std.debug.print("{}\n", .{text.len});

prints:

6

not:

2

The result is 6 because .len counts bytes.

Common Mistake: Slicing in the Middle of a Code Point

This is wrong if you expect valid UTF-8:

const text = "é";
const broken = text[0..1];

The slice exists, but it does not contain valid UTF-8.

Use UTF-8-aware logic when slicing human text.

Common Mistake: Treating ASCII Helpers as Unicode Helpers

This function:

fn uppercaseAscii(text: []u8) void {
    for (text) |*byte| {
        if (byte.* >= 'a' and byte.* <= 'z') {
            byte.* -= 32;
        }
    }
}

only handles ASCII letters.

It does not correctly uppercase all Unicode text.

Name functions honestly. uppercaseAscii is a better name than uppercase.

Summary

Zig strings are byte slices.

Most text is represented as:

[]const u8

The bytes are usually UTF-8, but the type itself is still a byte slice.

.len counts bytes. A normal for loop gives bytes. Slicing uses byte indexes.

For Unicode-aware processing, validate and decode UTF-8 using std.unicode.

The beginner rule is simple: treat strings as bytes first, and decode them only when you need Unicode meaning.