Zig string data is usually stored as UTF-8 bytes.
That means a string is not a list of “characters” in the simple beginner sense. It is a sequence of bytes, and those bytes encode text.
const text = "Zig";

This string has 3 bytes:

Z i g

But this string:

const text = "é";

has 2 bytes, even though it looks like 1 character.

That is the first rule of UTF-8 in Zig:

text.len counts bytes, not human-visible characters

ASCII Is the Simple Case
ASCII text uses one byte per character.
const text = "hello";

The length is 5 bytes:

std.debug.print("{}\n", .{text.len});

Output:

5

For plain English letters, digits, spaces, and many punctuation marks, byte length and character count match.

h e l l o
1 1 1 1 1 bytes

That is why simple examples often look easy.
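To check whether the simple case applies, you can test that every byte is below 0x80. The sketch below assumes a hypothetical helper named isAllAscii; it is an illustration, not a standard library function.

```zig
const std = @import("std");

// Hypothetical helper: true when every byte is below 0x80, which means
// each byte is one ASCII character and text.len equals the character count.
fn isAllAscii(text: []const u8) bool {
    for (text) |byte| {
        if (byte >= 0x80) return false;
    }
    return true;
}

pub fn main() void {
    std.debug.print("{}\n", .{isAllAscii("hello")});
    std.debug.print("{}\n", .{isAllAscii("héllo")});
}
```

For "hello" the check passes; for "héllo" it fails, because é is stored as two bytes that are both 0x80 or higher.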
Non-ASCII Text Uses More Bytes
UTF-8 uses more than one byte for many characters.
const text = "é";

The visible text has 1 character, but the byte length is 2.

const std = @import("std");

pub fn main() void {
    const text = "é";
    std.debug.print("{}\n", .{text.len});
}

Output:

2

Another example:

const text = "你好";

Each of these two Chinese characters uses 3 bytes in UTF-8.

std.debug.print("{}\n", .{text.len});

Output:

6

So this string has 2 visible characters but 6 bytes.
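The first byte of an encoded character tells you how many bytes the whole character uses. A minimal sketch, assuming std.unicode.utf8ByteSequenceLength and a hypothetical wrapper named firstSequenceLength:

```zig
const std = @import("std");

// Hypothetical helper: byte length of the first encoded code point in `text`.
fn firstSequenceLength(text: []const u8) !u3 {
    if (text.len == 0) return error.EmptyInput;
    // Only the first byte is inspected; the following bytes are not validated.
    return std.unicode.utf8ByteSequenceLength(text[0]);
}

pub fn main() !void {
    const samples = [_][]const u8{ "A", "é", "你", "😀" };
    for (samples) |text| {
        const len = try firstSequenceLength(text);
        std.debug.print("{s}: first byte 0x{x}, {d} byte(s)\n", .{ text, text[0], len });
    }
}
```

The four samples cover the 1-byte, 2-byte, 3-byte, and 4-byte cases.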
Iterating Over Bytes
A normal for loop over a string gives you bytes.
const std = @import("std");

pub fn main() void {
    const text = "é";
    for (text) |byte| {
        std.debug.print("{}\n", .{byte});
    }
}

Output:

195
169

Those two bytes together encode é.
The loop does not give you a character. It gives you one u8 at a time.
This is correct and useful when you are processing bytes, file formats, protocols, ASCII text, or raw input.
But it is not enough for full Unicode text processing.
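When you do walk raw bytes, the high bits of each byte tell you its role in UTF-8. The following is a sketch with a hypothetical classify helper and ByteKind enum; both names are assumptions made for illustration.

```zig
const std = @import("std");

const ByteKind = enum { ascii, lead, continuation, invalid };

// Hypothetical helper: classify one byte by its high bits.
fn classify(byte: u8) ByteKind {
    if (byte < 0x80) return .ascii; // 0xxxxxxx
    if ((byte & 0xC0) == 0x80) return .continuation; // 10xxxxxx
    if (byte >= 0xC2 and byte <= 0xF4) return .lead; // starts a 2 to 4 byte sequence
    return .invalid; // 0xC0, 0xC1, and 0xF5..0xFF never appear in valid UTF-8
}

pub fn main() void {
    const text = "Aé";
    for (text) |byte| {
        std.debug.print("0x{x:0>2} {s}\n", .{ byte, @tagName(classify(byte)) });
    }
}
```

For "Aé" this reports one ASCII byte, then a lead byte and a continuation byte that together form é.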
Printing Bytes as Hex
When learning UTF-8, it helps to print bytes in hexadecimal.
const std = @import("std");

pub fn main() void {
    const text = "é";
    for (text) |byte| {
        std.debug.print("{x} ", .{byte});
    }
    std.debug.print("\n", .{});
}

Output:

c3 a9

The UTF-8 encoding of é is:

c3 a9

For ASCII:

const text = "A";

the byte is:

41

The visible character A is one byte in UTF-8.
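One detail to watch: {x} prints bytes below 0x10 as a single digit, so a tab byte appears as 9 rather than 09. A sketch of a zero-padded variant, assuming a hypothetical hexDump helper built on std.fmt.bufPrint:

```zig
const std = @import("std");

// Hypothetical helper: write `text` into `buf` as space-separated,
// two-digit hex bytes and return the part of `buf` that was used.
fn hexDump(buf: []u8, text: []const u8) ![]const u8 {
    var used: usize = 0;
    for (text) |byte| {
        // {x:0>2} pads with '0' on the left to a width of 2.
        const piece = try std.fmt.bufPrint(buf[used..], "{x:0>2} ", .{byte});
        used += piece.len;
    }
    return buf[0..used];
}

pub fn main() !void {
    var buf: [64]u8 = undefined;
    std.debug.print("{s}\n", .{try hexDump(&buf, "\tA")});
}
```

The fixed-size buffer keeps the sketch free of allocation; if the buffer is too small, bufPrint returns an error instead of writing past the end.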
Code Points
Unicode assigns numbers to characters. These numbers are called code points.
For example:
A   U+0041
é   U+00E9
你  U+4F60

UTF-8 is one way to encode those code points as bytes.

A code point may use 1, 2, 3, or 4 bytes in UTF-8.
So there are three related ideas:
| Idea | Meaning |
|---|---|
| Byte | A raw u8 value |
| Code point | A Unicode number such as U+00E9 |
| UTF-8 | Encoding that stores code points as bytes |
Zig strings are byte slices. If you need Unicode meaning, you decode the bytes.
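Going the other direction, from a code point to bytes, can be sketched with std.unicode.utf8Encode. The encodeCodepoint wrapper below is a hypothetical helper added for illustration.

```zig
const std = @import("std");

// Hypothetical helper: encode one code point into `buf` and return the bytes used.
fn encodeCodepoint(cp: u21, buf: *[4]u8) ![]const u8 {
    const len = try std.unicode.utf8Encode(cp, buf);
    return buf[0..len];
}

pub fn main() !void {
    const code_points = [_]u21{ 0x41, 0xE9, 0x4F60 };
    var buf: [4]u8 = undefined; // UTF-8 needs at most 4 bytes per code point
    for (code_points) |cp| {
        const bytes = try encodeCodepoint(cp, &buf);
        std.debug.print("U+{X:0>4} ->", .{cp});
        for (bytes) |byte| std.debug.print(" {x:0>2}", .{byte});
        std.debug.print("\n", .{});
    }
}
```

The three code points from the list above come out as 1, 2, and 3 bytes respectively.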
Decoding UTF-8
Zig’s standard library provides Unicode helpers.
A common operation is decoding one code point from UTF-8.
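const std = @import("std");

pub fn main() !void {
    const text = "é";
    // The slice passed to utf8Decode must hold exactly one encoded code point.
    const cp = try std.unicode.utf8Decode(text);
    std.debug.print("U+{X}\n", .{cp});
}

Output:

U+E9

This decodes the byte slice "é" into one Unicode code point. utf8Decode expects the slice to contain exactly one encoded code point, which is the case here.
The function can fail because not every byte sequence is valid UTF-8. That is why the code uses try.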
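For a longer string, you first need to know how many bytes the leading code point occupies before decoding it. A minimal sketch, assuming a hypothetical decodeFirst helper and Decoded struct that combine utf8ByteSequenceLength with utf8Decode:

```zig
const std = @import("std");

const Decoded = struct { cp: u21, len: usize };

// Hypothetical helper: decode the first code point of `text` and report
// how many bytes it occupied.
fn decodeFirst(text: []const u8) !Decoded {
    if (text.len == 0) return error.EmptyInput;
    const len = try std.unicode.utf8ByteSequenceLength(text[0]);
    if (text.len < len) return error.TruncatedInput;
    // Pass exactly the bytes of one code point to utf8Decode.
    const cp = try std.unicode.utf8Decode(text[0..len]);
    return Decoded{ .cp = cp, .len = len };
}

pub fn main() !void {
    const d = try decodeFirst("é!");
    std.debug.print("U+{X} uses {d} byte(s)\n", .{ d.cp, d.len });
}
```

Returning the length alongside the code point lets a caller advance through the string one code point at a time.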
Iterating Over UTF-8 Code Points
For real text, you often want to walk code point by code point.
One simple approach is to use a UTF-8 view.
const std = @import("std");

pub fn main() !void {
    const text = "Aé你";
    // The view is never mutated, so it is const; only the iterator advances.
    const view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();
    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X}\n", .{cp});
    }
}

Output:

U+41
U+E9
U+4F60

The string has 3 code points:

A
é
你

But its byte length is larger than 3.

std.debug.print("{}\n", .{text.len});

The byte length is:

6

because:

A 1 byte
é 2 bytes
你 3 bytes

Validating UTF-8
Not every []u8 is valid UTF-8.
A byte slice may come from a file, network socket, database, or binary protocol. It may contain invalid text.
You can validate it:
const std = @import("std");

pub fn main() void {
    const data = [_]u8{ 0xff, 0xfe, 0xfd };
    const ok = std.unicode.utf8ValidateSlice(data[0..]);
    std.debug.print("{}\n", .{ok});
}

Output:

false

For a string literal:

const text = "hello";

validation succeeds:

const ok = std.unicode.utf8ValidateSlice(text);

because Zig source files are UTF-8, so a literal written as plain text is valid UTF-8. A literal can still hold arbitrary bytes through escapes such as \xff.
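In practice, validation usually sits at the boundary where untrusted bytes enter the program. A sketch of that pattern, assuming a hypothetical asUtf8 helper that returns the slice only when it validates:

```zig
const std = @import("std");

// Hypothetical helper: return `data` as text only if it is valid UTF-8.
fn asUtf8(data: []const u8) ?[]const u8 {
    return if (std.unicode.utf8ValidateSlice(data)) data else null;
}

pub fn main() void {
    const good = "héllo";
    const bad = [_]u8{ 'h', 0xff, 'i' };

    if (asUtf8(good)) |text| std.debug.print("valid: {s}\n", .{text});
    if (asUtf8(&bad) == null) std.debug.print("rejected invalid bytes\n", .{});
}
```

Using an optional forces the caller to handle the invalid case before treating the bytes as text.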
Slicing UTF-8 Text
Be careful when slicing text.
This is safe:
const text = "hello";
const part = text[0..2];

The result is:

he

ASCII characters are one byte each.

But with non-ASCII text:

const text = "é";

The string has two bytes.

This is valid:

const whole = text[0..2];

This is not valid UTF-8 text:

const broken = text[0..1];

The slice broken contains only the first byte of a two-byte character.
It is still a valid byte slice. But it is not valid UTF-8.
That distinction matters:

[]const u8 can hold any bytes
valid UTF-8 is a rule about what those bytes mean

Byte Indexes vs Text Positions
When you search or slice Zig strings, indexes are usually byte indexes.
For example:
const text = "AéZ";

The bytes are:

A  é     Z
41 c3 a9 5a

The byte indexes are:
| Byte Index | Byte |
|---|---|
| 0 | 41 |
| 1 | c3 |
| 2 | a9 |
| 3 | 5a |
So:
text[0..1] is "A".
text[1..3] is "é".
text[3..4] is "Z".
But:
text[1..2] splits é and gives invalid UTF-8.
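A byte index is a safe place to slice only if it does not land on a continuation byte. The sketch below assumes the input is already valid UTF-8 and uses a hypothetical isCodepointBoundary helper.

```zig
const std = @import("std");

// Hypothetical helper: assuming `text` is valid UTF-8, report whether a
// slice may start or end at `index` without splitting a code point.
fn isCodepointBoundary(text: []const u8, index: usize) bool {
    if (index > text.len) return false;
    if (index == text.len) return true;
    // Continuation bytes look like 10xxxxxx; a boundary never lands on one.
    return (text[index] & 0xC0) != 0x80;
}

pub fn main() void {
    const text = "AéZ";
    var i: usize = 0;
    while (i <= text.len) : (i += 1) {
        std.debug.print("index {d}: {}\n", .{ i, isCodepointBoundary(text, i) });
    }
}
```

For "AéZ", indexes 0, 1, 3, and 4 are boundaries, while index 2 falls inside é.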
Counting Code Points
You can count code points by decoding UTF-8.
const std = @import("std");

fn countCodepoints(text: []const u8) !usize {
    // The view is never mutated, so it is const; only the iterator advances.
    const view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();
    var count: usize = 0;
    while (it.nextCodepoint()) |_| {
        count += 1;
    }
    return count;
}

pub fn main() !void {
    const text = "Aé你";
    std.debug.print("bytes = {}\n", .{text.len});
    std.debug.print("code points = {}\n", .{try countCodepoints(text)});
}

Output:

bytes = 6
code points = 3

This is often closer to what beginners mean by “characters,” but even code points are not always the same as visible characters.
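The standard library also has a counting function, so the hand-written loop is mainly useful for learning. A short sketch, assuming std.unicode.utf8CountCodepoints is available in your Zig version:

```zig
const std = @import("std");

pub fn main() !void {
    const text = "Aé你";
    // utf8CountCodepoints validates while counting and fails on invalid UTF-8.
    const count = try std.unicode.utf8CountCodepoints(text);
    std.debug.print("bytes = {d}, code points = {d}\n", .{ text.len, count });
}
```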
Grapheme Clusters
A visible character can be made of multiple code points.
For example, a letter plus an accent can be represented as separate code points. Some emoji are also made from several code points joined together.
These user-visible units are called grapheme clusters.
This means there are several possible “lengths” for text:
| Question | What It Measures |
|---|---|
| How many bytes? | Storage size |
| How many code points? | Number of Unicode code points |
| How many grapheme clusters? | User-visible characters |
Zig’s basic string type does not hide this complexity. It gives you bytes. You choose the level of text processing you need.
For many systems programs, byte processing is enough. For a text editor, terminal UI, search engine, or multilingual application, you need deeper Unicode handling.
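To make the difference concrete, the sketch below writes é in two ways: as one precomposed code point, and as e followed by a combining accent. Both normally display as one visible character. The variable names are illustrative.

```zig
const std = @import("std");

pub fn main() !void {
    const precomposed = "\u{E9}"; // é as a single code point
    const decomposed = "e\u{301}"; // e followed by U+0301 COMBINING ACUTE ACCENT

    std.debug.print("precomposed: {d} bytes, {d} code points\n", .{
        precomposed.len,
        try std.unicode.utf8CountCodepoints(precomposed),
    });
    std.debug.print("decomposed:  {d} bytes, {d} code points\n", .{
        decomposed.len,
        try std.unicode.utf8CountCodepoints(decomposed),
    });
    // The two spellings are different byte sequences, so a byte comparison fails.
    std.debug.print("equal bytes: {}\n", .{std.mem.eql(u8, precomposed, decomposed)});
}
```

The first form is 2 bytes and 1 code point; the second is 3 bytes and 2 code points. Counting them both as one character requires grapheme cluster segmentation, which this sketch does not attempt.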
ASCII-Only Processing
Many programs intentionally process only ASCII.
For example, HTTP headers, many config keys, file extensions, identifiers, and protocol tokens often use ASCII rules.
ASCII uppercase conversion:

fn asciiToUpper(byte: u8) u8 {
    if (byte >= 'a' and byte <= 'z') {
        // 'a' - 'A' is 32, the distance between lowercase and uppercase ASCII letters.
        return byte - ('a' - 'A');
    }
    return byte;
}

Apply it to a mutable slice:

fn uppercaseAscii(text: []u8) void {
    for (text) |*byte| {
        byte.* = asciiToUpper(byte.*);
    }
}

This is fine when the input is known to be ASCII.
Do not use this as full Unicode uppercase conversion. Unicode case conversion is more complex.
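The standard library's std.ascii namespace covers the common ASCII-only operations, so you rarely need to write them by hand. A sketch, assuming std.ascii.eqlIgnoreCase and std.ascii.toUpper; the isContentType helper is hypothetical.

```zig
const std = @import("std");

// Hypothetical helper: compare a header name using ASCII case rules only.
fn isContentType(name: []const u8) bool {
    return std.ascii.eqlIgnoreCase(name, "content-type");
}

pub fn main() void {
    std.debug.print("{}\n", .{isContentType("Content-Type")});
    // toUpper changes only ASCII letters; other bytes pass through unchanged.
    std.debug.print("{c}\n", .{std.ascii.toUpper('z')});
}
```

Because non-ASCII bytes pass through unchanged, these helpers never break a multi-byte sequence, but they also never change the case of characters such as é.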
Complete Example
const std = @import("std");

fn printBytes(text: []const u8) void {
    std.debug.print("bytes: ", .{});
    for (text) |byte| {
        std.debug.print("{x} ", .{byte});
    }
    std.debug.print("\n", .{});
}

fn printCodepoints(text: []const u8) !void {
    // The view is never mutated, so it is const; only the iterator advances.
    const view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();
    std.debug.print("code points: ", .{});
    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X} ", .{cp});
    }
    std.debug.print("\n", .{});
}

pub fn main() !void {
    const text = "Aé你";
    std.debug.print("text = {s}\n", .{text});
    std.debug.print("byte length = {}\n", .{text.len});
    printBytes(text);
    try printCodepoints(text);
}

Output:

text = Aé你
byte length = 6
bytes: 41 c3 a9 e4 bd a0
code points: U+41 U+E9 U+4F60

This example shows the central idea:

Zig stores text as bytes
UTF-8 explains how those bytes represent Unicode code points

Common Mistake: Assuming .len Counts Characters
This code:
const text = "你好";
std.debug.print("{}\n", .{text.len});

prints:

6

not:

2

The result is 6 because .len counts bytes.
Common Mistake: Slicing in the Middle of a Code Point
This is wrong if you expect valid UTF-8:
const text = "é";
const broken = text[0..1];

The slice exists, but it does not contain valid UTF-8.
Use UTF-8-aware logic when slicing human text.
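One UTF-8-aware approach is to cut by code points instead of bytes. The sketch below assumes a hypothetical prefixCodepoints helper built on Utf8View and nextCodepointSlice; note that it still cuts between code points, not grapheme clusters.

```zig
const std = @import("std");

// Hypothetical helper: return the prefix of `text` that holds at most
// `max` code points, always ending on a code point boundary.
fn prefixCodepoints(text: []const u8, max: usize) ![]const u8 {
    const view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();
    var end: usize = 0;
    var taken: usize = 0;
    while (taken < max) : (taken += 1) {
        // nextCodepointSlice yields the bytes of exactly one code point.
        const slice = it.nextCodepointSlice() orelse break;
        end += slice.len;
    }
    return text[0..end];
}

pub fn main() !void {
    const text = "é你好";
    std.debug.print("{s}\n", .{try prefixCodepoints(text, 2)});
}
```

Asking for 2 code points from "é你好" returns the first 5 bytes, which is "é你", instead of a slice that ends in the middle of a character.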
Common Mistake: Treating ASCII Helpers as Unicode Helpers
This function:
fn uppercaseAscii(text: []u8) void {
    for (text) |*byte| {
        if (byte.* >= 'a' and byte.* <= 'z') {
            byte.* -= 32;
        }
    }
}

only handles ASCII letters.
It does not correctly uppercase all Unicode text.
Name functions honestly. uppercaseAscii is a better name than uppercase.
Summary
Zig strings are byte slices.
Most text is represented as:

[]const u8

The bytes are usually UTF-8, but the type itself is still a byte slice.
.len counts bytes. A normal for loop gives bytes. Slicing uses byte indexes.
For Unicode-aware processing, validate and decode UTF-8 using std.unicode.
The beginner rule is simple: treat strings as bytes first, and decode them only when you need Unicode meaning.