# Binary File Formats

A binary file format stores data as bytes with a specific structure.

Text files store data as readable characters:

```text
name: Alice
age: 30
```

Binary files store data in compact byte layouts:

```text
41 6C 69 63 65 1E
```

Those bytes may represent a name, an integer, a timestamp, an image, a database page, an executable header, or anything else. The bytes only make sense if you know the format.

#### Text vs Binary

A text file is designed to be read by humans.

A binary file is designed to be read by programs.

That does not mean binary files are mysterious. They are just more strict. Instead of reading lines and words, you read exact byte positions.

For example, a simple binary format might say:

```text
bytes 0..4    magic number
bytes 4..8    version
bytes 8..16   record count
bytes 16..    records
```

Your program must follow that layout exactly.
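
As a minimal sketch of following such a layout, the function below decodes the three header fields at their fixed positions. The names `Header` and `parseHeader`, the little-endian byte order, and the integer widths (`u32` version, `u64` record count) are assumptions made for illustration; the layout above only fixes the byte ranges.

```zig
const std = @import("std");

// Hypothetical in-memory view of the header described above.
const Header = struct {
    magic: [4]u8,
    version: u32,
    record_count: u64,
};

// Decodes the fixed 16-byte header. Byte order is assumed little-endian.
fn parseHeader(bytes: []const u8) error{Truncated}!Header {
    // All three fields must be present before any of them is read.
    if (bytes.len < 16) return error.Truncated;

    return Header{
        .magic = bytes[0..4].*,
        .version = std.mem.readInt(u32, bytes[4..8], .little),
        .record_count = std.mem.readInt(u64, bytes[8..16], .little),
    };
}
```

The records themselves would start at `bytes[16..]`; how they are laid out depends on the rest of the format.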

#### Magic Numbers

Many binary formats start with a magic number.

A magic number is a short byte sequence that identifies the file type.

For example, a custom format might begin with:

```text
ZDB1
```

In Zig:

```zig
const magic = "ZDB1";
```

When reading the file, check the first bytes:

```zig
// Check the length first: slicing past the end of `bytes` is a bounds error.
if (bytes.len < 4 or !std.mem.eql(u8, bytes[0..4], "ZDB1")) {
    return error.BadMagic;
}
```

This prevents your parser from treating the wrong file as valid data.

#### A Tiny Binary Format

Let’s design a small file format for storing unsigned 32-bit numbers.

The file layout:

```text
bytes 0..4    magic: "NUMS"
bytes 4..8    count: u32 little-endian
bytes 8..     count numbers, each u32 little-endian
```

A file with three numbers:

```text
magic = "NUMS"
count = 3
numbers = 10, 20, 30
```

The byte layout is:

```text
4E 55 4D 53  03 00 00 00  0A 00 00 00  14 00 00 00  1E 00 00 00
```

Each number uses 4 bytes.
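
One way to check a layout like this is to encode the example in memory and compare the result with the hex dump. The sketch below assumes a hypothetical helper, `encodeThree`, that is not part of the format itself; it relies only on `std.mem.writeInt` from the standard library.

```zig
const std = @import("std");

// Hypothetical helper: encodes exactly three numbers in the NUMS layout.
// 4 bytes of magic + 4 bytes of count + 3 * 4 bytes of numbers = 20 bytes.
fn encodeThree(numbers: [3]u32) [20]u8 {
    var out: [20]u8 = undefined;

    @memcpy(out[0..4], "NUMS");
    std.mem.writeInt(u32, out[4..8], 3, .little);

    for (numbers, 0..) |n, i| {
        const start = 8 + i * 4;
        std.mem.writeInt(u32, out[start..][0..4], n, .little);
    }
    return out;
}

test "encoding matches the hex dump" {
    const expected = [_]u8{
        0x4E, 0x55, 0x4D, 0x53, // "NUMS"
        0x03, 0x00, 0x00, 0x00, // count = 3
        0x0A, 0x00, 0x00, 0x00, // 10
        0x14, 0x00, 0x00, 0x00, // 20
        0x1E, 0x00, 0x00, 0x00, // 30
    };
    const actual = encodeThree(.{ 10, 20, 30 });
    try std.testing.expectEqualSlices(u8, &expected, &actual);
}
```

Running the file with `zig test` compares the encoded bytes against the dump byte by byte.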

#### Writing the File

```zig
const std = @import("std");

pub fn main() !void {
    // `const` is enough: the file handle itself is never reassigned.
    const file = try std.fs.cwd().createFile("numbers.bin", .{});
    defer file.close();

    const numbers = [_]u32{ 10, 20, 30 };

    try file.writeAll("NUMS");

    var buffer: [4]u8 = undefined;

    std.mem.writeInt(u32, &buffer, numbers.len, .little);
    try file.writeAll(&buffer);

    for (numbers) |n| {
        std.mem.writeInt(u32, &buffer, n, .little);
        try file.writeAll(&buffer);
    }
}
```

The key function is:

```zig
std.mem.writeInt(u32, &buffer, n, .little);
```

It encodes the integer into the 4-byte buffer in little-endian order, least significant byte first.

#### Reading the File

```zig
const std = @import("std");

const ParseError = error{
    BadMagic,
    Truncated,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    const allocator = gpa.allocator();

    const bytes = try std.fs.cwd().readFileAlloc(
        allocator,
        "numbers.bin",
        1024 * 1024,
    );
    defer allocator.free(bytes);

    const numbers = try parseNumbers(bytes);

    for (numbers) |n| {
        std.debug.print("{}\n", .{n});
    }
}

fn parseNumbers(bytes: []const u8) ParseError![]const u32 {
    if (bytes.len < 8) {
        return error.Truncated;
    }

    if (!std.mem.eql(u8, bytes[0..4], "NUMS")) {
        return error.BadMagic;
    }

    const count = std.mem.readInt(u32, bytes[4..8], .little);

    const needed = 8 + @as(usize, count) * 4;
    if (bytes.len < needed) {
        return error.Truncated;
    }

    // The bytes are validated, but there is still no `[]const u32` to return:
    // the payload is raw little-endian bytes that may not be aligned for u32.
    // Placeholder so the sketch compiles; the next version decodes each number.
    return error.Truncated;
}
```

This version shows validation, but the return type is not the right design. The file contains bytes, not a native `[]const u32` slice. You should not pretend those bytes are already a safe Zig `u32` array.

A better parser reads each integer from the byte slice.

```zig
const std = @import("std");

const ParseError = error{
    BadMagic,
    Truncated,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    const allocator = gpa.allocator();

    const bytes = try std.fs.cwd().readFileAlloc(
        allocator,
        "numbers.bin",
        1024 * 1024,
    );
    defer allocator.free(bytes);

    try printNumbers(bytes);
}

fn printNumbers(bytes: []const u8) ParseError!void {
    if (bytes.len < 8) {
        return error.Truncated;
    }

    if (!std.mem.eql(u8, bytes[0..4], "NUMS")) {
        return error.BadMagic;
    }

    const count = std.mem.readInt(u32, bytes[4..8], .little);

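    // Compare against the bytes actually available. Unlike `8 + count * 4`,
    // this cannot overflow `usize` on 32-bit targets when `count` is huge.
    if ((bytes.len - 8) / 4 < count) {
        return error.Truncated;
    }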

    var offset: usize = 8;
    var i: u32 = 0;

    while (i < count) : (i += 1) {
        const n = std.mem.readInt(u32, bytes[offset..][0..4], .little);
        offset += 4;

        std.debug.print("{}\n", .{n});
    }
}
```

This is safer. It treats the file as bytes and converts bytes into integers deliberately.
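
If the caller needs the numbers as a slice rather than as printed output, one option is to decode them into freshly allocated memory. This is a sketch, assuming a hypothetical function name `decodeNumbers` and that the caller owns and frees the result:

```zig
const std = @import("std");

const DecodeError = error{ BadMagic, Truncated, OutOfMemory };

// Hypothetical helper: validates a NUMS file image and returns its numbers
// as a newly allocated slice. The caller must free the result.
fn decodeNumbers(allocator: std.mem.Allocator, bytes: []const u8) DecodeError![]u32 {
    if (bytes.len < 8) return error.Truncated;
    if (!std.mem.eql(u8, bytes[0..4], "NUMS")) return error.BadMagic;

    const count = std.mem.readInt(u32, bytes[4..8], .little);

    // Check the payload size before allocating, so a huge count in a
    // short file cannot trigger a huge allocation.
    const payload = bytes[8..];
    if (payload.len / 4 < count) return error.Truncated;

    const numbers = try allocator.alloc(u32, count);
    for (numbers, 0..) |*slot, i| {
        slot.* = std.mem.readInt(u32, payload[i * 4 ..][0..4], .little);
    }
    return numbers;
}
```

The returned slice is a real `[]u32` in native byte order, independent of the file buffer, so the buffer can be freed as soon as decoding finishes.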

#### Endianness

Endianness means byte order.

The integer `0x12345678` can be stored in memory as:

```text
big-endian:    12 34 56 78
little-endian: 78 56 34 12
```

Many modern machines are little-endian, but file formats should not depend on the current machine unless they are explicitly machine-local.

A good binary format states its byte order.

Example:

```text
All integers are little-endian.
```

Then every reader and writer must follow that rule.

In Zig, make the byte order explicit:

```zig
std.mem.writeInt(u32, &buffer, value, .little);
const decoded = std.mem.readInt(u32, &buffer, .little);
```

This makes the file format portable across machines.
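
To make the two byte orders concrete, the following sketch encodes `0x12345678` both ways and then shows what happens when a reader uses the wrong order. It uses only `std.mem.writeInt`, `std.mem.readInt`, and `std.testing`; the test name is arbitrary.

```zig
const std = @import("std");

test "same integer, two byte orders" {
    var buffer: [4]u8 = undefined;

    std.mem.writeInt(u32, &buffer, 0x12345678, .big);
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0x12, 0x34, 0x56, 0x78 }, &buffer);

    std.mem.writeInt(u32, &buffer, 0x12345678, .little);
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0x78, 0x56, 0x34, 0x12 }, &buffer);

    // Reading with the wrong byte order does not fail; it silently
    // produces a different number.
    const wrong = std.mem.readInt(u32, &buffer, .big);
    try std.testing.expectEqual(@as(u32, 0x78563412), wrong);
}
```

The last assertion is the reason a format should state its byte order: a mismatch is not detected by the read itself.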

#### Alignment

A binary file is a sequence of bytes. It does not automatically obey the alignment rules of your CPU.

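Reinterpreting the bytes as a native integer pointer is dangerous:

```zig
// Needs @alignCast to compile; that only asserts alignment, it does not fix it.
const value: *const u32 = @ptrCast(@alignCast(bytes.ptr));
```

Zig rejects a bare `@ptrCast` here because it would increase pointer alignment. With `@alignCast` the code compiles, but the pointer may still not be aligned for `u32`, which fails a safety check at run time in safe builds. The file may use a different endianness than the machine. The layout may not match Zig’s in-memory layout.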

Prefer this:

```zig
const value = std.mem.readInt(u32, bytes[0..4], .little);
```

Parsing through bytes is clearer and safer.

#### Struct Layout Is Not a File Format

A common beginner mistake is to write a struct directly to disk and treat that as a file format.

```zig
const Header = struct {
    version: u32,
    count: u32,
};
```

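Zig does not guarantee the in-memory layout of a plain `struct`: the compiler may reorder fields and insert padding, and the result can differ between targets. `extern struct` and `packed struct` have defined layouts, but they still store integers in the machine's native byte order, and the layout changes whenever fields change.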

For file formats, define bytes, not structs.

Better:

```text
bytes 0..4    version, u32 little-endian
bytes 4..8    count, u32 little-endian
```

Then write parsing code that follows the byte layout.

You may use structs internally after parsing, but the file format itself should be described as bytes.
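
A minimal sketch of that separation: the struct stays an ordinary in-memory type, and two small functions define how it maps to bytes. The names `header_size`, `encodeHeader`, and `decodeHeader` are hypothetical.

```zig
const std = @import("std");

const Header = struct {
    version: u32,
    count: u32,
};

// The on-disk size is defined by the format, not by @sizeOf(Header).
const header_size = 8;

// Writes each field at its documented offset, little-endian.
fn encodeHeader(h: Header) [header_size]u8 {
    var out: [header_size]u8 = undefined;
    std.mem.writeInt(u32, out[0..4], h.version, .little);
    std.mem.writeInt(u32, out[4..8], h.count, .little);
    return out;
}

// Reads each field back from its documented offset.
fn decodeHeader(bytes: *const [header_size]u8) Header {
    return .{
        .version = std.mem.readInt(u32, bytes[0..4], .little),
        .count = std.mem.readInt(u32, bytes[4..8], .little),
    };
}
```

Because the offsets and byte order live in these two functions, reordering or adding fields in `Header` does not silently change the file format.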

#### Offsets

Binary parsing is mostly offset management.

You keep track of where you are in the byte slice.

```zig
var offset: usize = 0;

const magic = bytes[offset..][0..4];
offset += 4;

const version = std.mem.readInt(u32, bytes[offset..][0..4], .little);
offset += 4;
```

This pattern appears everywhere in parsers.

For larger formats, it is useful to create a small reader.

```zig
const ByteReader = struct {
    bytes: []const u8,
    offset: usize = 0,

    fn readBytes(self: *ByteReader, n: usize) ![]const u8 {
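        // `offset` never exceeds `bytes.len`, so this subtraction cannot
        // overflow, unlike `self.offset + n` with a huge `n`.
        if (n > self.bytes.len - self.offset) {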
            return error.Truncated;
        }

        const out = self.bytes[self.offset..][0..n];
        self.offset += n;
        return out;
    }

    fn readU32(self: *ByteReader) !u32 {
        const b = try self.readBytes(4);
        // readInt takes a pointer to exactly 4 bytes, so narrow the slice first.
        return std.mem.readInt(u32, b[0..4], .little);
    }
};
```

Now the parser is cleaner:

```zig
var reader = ByteReader{ .bytes = bytes };

const magic = try reader.readBytes(4);
const count = try reader.readU32();
```

#### Versioning

Binary formats should include a version field.

Example:

```text
bytes 0..4    magic: "NUMS"
bytes 4..8    version: u32 little-endian
bytes 8..12   count: u32 little-endian
bytes 12..    numbers
```

Versioning lets your format evolve.

Version 1 might store only numbers.

Version 2 might add timestamps.

Version 3 might add compression.

Without a version field, future readers must guess which layout the file uses. Guessing is fragile.
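
A reader can then branch on the version before touching the rest of the file. The sketch below assumes the versioned NUMS layout above, a reader that only understands version 1, and hypothetical names `readCountV1` and `UnsupportedVersion`:

```zig
const std = @import("std");

const VersionError = error{ BadMagic, Truncated, UnsupportedVersion };

// Hypothetical helper: checks magic and version, then returns the count
// field. Only version 1 is accepted in this sketch.
fn readCountV1(bytes: []const u8) VersionError!u32 {
    if (bytes.len < 12) return error.Truncated;
    if (!std.mem.eql(u8, bytes[0..4], "NUMS")) return error.BadMagic;

    const version = std.mem.readInt(u32, bytes[4..8], .little);
    switch (version) {
        1 => {},
        // Refuse layouts this reader does not know instead of guessing.
        else => return error.UnsupportedVersion,
    }

    return std.mem.readInt(u32, bytes[8..12], .little);
}
```

Adding support for a later version would mean adding a new `switch` arm with its own parsing path, while older files keep working.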

#### Length Fields

Binary formats often use length fields.

Example:

```text
bytes 0..4      name length, u32 little-endian
next N bytes    UTF-8 name bytes
```

When parsing length fields, always check bounds.

Bad:

```zig
const name = bytes[offset .. offset + name_len];
```

Good:

```zig
if (offset + name_len > bytes.len) {
    return error.Truncated;
}
const name = bytes[offset .. offset + name_len];
```

Also watch for integer overflow when computing sizes.

```zig
const end = std.math.add(usize, offset, name_len) catch {
    return error.Truncated;
};
```

For parsers that read untrusted files, these checks are not optional.
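
Putting the bounds check and the overflow check together, a length-prefixed read might look like the sketch below. `readName` is a hypothetical helper; it assumes the length prefix is a little-endian `u32` as in the layout above.

```zig
const std = @import("std");

// Hypothetical helper: reads a u32 length prefix at `offset`, then that
// many bytes. Returns a view into `bytes`; nothing is copied.
fn readName(bytes: []const u8, offset: usize) error{Truncated}![]const u8 {
    // End of the 4-byte length field, computed with overflow checking.
    const len_end = std.math.add(usize, offset, 4) catch return error.Truncated;
    if (len_end > bytes.len) return error.Truncated;

    const name_len = std.mem.readInt(u32, bytes[offset..][0..4], .little);

    // End of the name bytes, again with overflow checking.
    const name_end = std.math.add(usize, len_end, name_len) catch return error.Truncated;
    if (name_end > bytes.len) return error.Truncated;

    return bytes[len_end..name_end];
}
```

Both slices are taken only after the corresponding end position has been checked against `bytes.len`, so a malformed length produces `error.Truncated` rather than an out-of-bounds access.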

#### Checksums

Some binary formats include checksums.

A checksum is a value computed from bytes to detect corruption.

Example layout:

```text
bytes 0..4      magic
bytes 4..8      payload length
bytes 8..12     checksum
bytes 12..      payload
```

When reading, the parser recomputes the checksum and compares it with the stored checksum.

Checksums do not prove that data is safe or authentic. They mainly detect accidental corruption. For security, use cryptographic authentication such as MACs or signatures.
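
As an illustration, the sketch below verifies a payload against a stored checksum. The layout above does not name an algorithm, so CRC-32 (via `std.hash.Crc32`) and little-endian fields are assumptions here, and `verifiedPayload` is a hypothetical name. Magic validation is left out to keep the example short.

```zig
const std = @import("std");

const ChecksumError = error{ Truncated, BadChecksum };

// Sketch: validates the example layout and returns a view of the payload.
// Assumes CRC-32 over the payload bytes only, stored little-endian.
fn verifiedPayload(bytes: []const u8) ChecksumError![]const u8 {
    if (bytes.len < 12) return error.Truncated;

    const payload_len = std.mem.readInt(u32, bytes[4..8], .little);
    const stored = std.mem.readInt(u32, bytes[8..12], .little);

    // Make sure the declared payload is actually present.
    const rest = bytes[12..];
    if (rest.len < payload_len) return error.Truncated;
    const payload = rest[0..payload_len];

    // Recompute and compare with the stored value.
    if (std.hash.Crc32.hash(payload) != stored) return error.BadChecksum;
    return payload;
}
```

A real format would also state exactly which bytes the checksum covers, for example whether the header is included.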

#### Binary Formats Must Be Defensive

A binary parser should assume the input may be invalid.

The file may be too short.

The magic number may be wrong.

The version may be unsupported.

A length field may point past the end.

A count may be huge.

Offsets may overflow.

Data may be compressed incorrectly.

Strings may not be valid UTF-8.

Your parser should reject bad data cleanly instead of crashing or reading outside the buffer.

Zig helps because slices carry lengths and integer conversions are explicit, but you still need to write the checks.
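
Two of the checks in this list are easy to overlook, so here is a small sketch of them: capping a count before it is used for allocation, and validating UTF-8 before treating bytes as text. The limit `max_count`, the helper names, and the error names are all hypothetical; `std.unicode.utf8ValidateSlice` is the standard library call doing the validation.

```zig
const std = @import("std");

// Hypothetical limit; a real format would document its own maximum.
const max_count: u32 = 1_000_000;

// Rejects counts above the limit before they reach an allocator or a loop.
fn checkCount(count: u32) error{CountTooLarge}!u32 {
    if (count > max_count) return error.CountTooLarge;
    return count;
}

// Rejects byte strings that are not well-formed UTF-8.
fn checkName(name: []const u8) error{InvalidUtf8}![]const u8 {
    if (!std.unicode.utf8ValidateSlice(name)) return error.InvalidUtf8;
    return name;
}
```

Each check turns one kind of malformed input into a named error that the caller can report.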

#### When Binary Formats Are Useful

Binary formats are useful when you care about compact size, fast parsing, exact layout, or compatibility with existing systems.

Common examples:

- image files
- audio files
- video files
- database files
- index files
- network packets
- executables
- object files
- archives
- game assets
- compiler caches

Text formats are often better for configuration, logs, and simple data exchange. Binary formats are better when layout, speed, and size matter more.

#### Mental Model

A binary file is a contract.

The contract says what each byte means.

Your Zig code should follow that contract exactly: check the magic number, read integers with explicit endianness, validate lengths, manage offsets carefully, and reject malformed data.

Do not treat file bytes as native structs too early. Parse bytes first. Build structured values after validation.

