# Writing a Compiler in Zig

### Writing a Compiler in Zig

A compiler translates one form of code into another form.

For this chapter, we will use a small meaning:

```text
source code -> tokens -> AST -> bytecode
```

The source code is what the user writes. The bytecode is what our virtual machine runs.

A full production compiler can generate machine code, optimize programs, perform type checking, emit debug information, and link object files. We do not need all of that yet. The first compiler should be small enough to understand.

#### The Compiler Pipeline

A simple compiler has stages.

```text
source text
    -> tokenizer
    -> parser
    -> AST
    -> compiler
    -> bytecode
    -> VM
```

Each stage has a clear job.

The tokenizer reads characters and creates tokens.

The parser reads tokens and creates a syntax tree.

The compiler reads the syntax tree and creates instructions.

The VM executes the instructions.

Keeping these stages separate makes the system easier to test.

#### A Small Source Language

Suppose our source language only supports integer expressions:

```text
1 + 2 * 3
```

From the parser section, this becomes an AST:

```text
+
  1
  *
    2
    3
```

The compiler should turn that tree into bytecode:

```text
push_i64 1
push_i64 2
push_i64 3
mul
add
halt
```

The VM then runs the bytecode and computes `7`.

#### AST Input

We can reuse the expression type from the parser:

```zig
const Expr = union(enum) {
    integer: i64,
    add: Binary,
    sub: Binary,
    multiply: Binary,
    divide: Binary,

    const Binary = struct {
        left: *Expr,
        right: *Expr,
    };
};
```

This AST is already structured. The compiler does not need to rediscover precedence. The parser already handled that.

That is why the pipeline matters. Each stage should do its own job once.

#### Bytecode Output

Use a small instruction format:

```zig
const OpCode = enum(u8) {
    push_i64,
    add,
    sub,
    mul,
    div,
    halt,
};

const Instruction = union(OpCode) {
    push_i64: i64,
    add: void,
    sub: void,
    mul: void,
    div: void,
    halt: void,
};
```

The compiler will produce a list of `Instruction` values.

```zig
instructions: std.ArrayList(Instruction)
```

#### The Compiler Type

A compiler needs an allocator and an output buffer.

```zig
const std = @import("std");

const Compiler = struct {
    instructions: std.ArrayList(Instruction),

    pub fn init(allocator: std.mem.Allocator) Compiler {
        return .{
            .instructions = std.ArrayList(Instruction).init(allocator),
        };
    }

    pub fn deinit(self: *Compiler) void {
        self.instructions.deinit();
    }

    fn emit(self: *Compiler, instruction: Instruction) !void {
        try self.instructions.append(instruction);
    }
};
```

The `emit` function appends one instruction to the output.

This is the basic compiler action:

```text
look at AST node
emit instruction
```

#### Compiling an Integer

An integer expression compiles to one `push_i64` instruction.

```zig
fn compileExpr(self: *Compiler, expr: *const Expr) !void {
    switch (expr.*) {
        .integer => |value| {
            try self.emit(.{ .push_i64 = value });
        },
        else => {
            return error.UnsupportedExpression;
        },
    }
}
```

For this AST:

```text
integer 42
```

the compiler emits:

```text
push_i64 42
```

#### Compiling Addition

For binary operators, compile the left expression first, then the right expression, then emit the operator.

```zig
fn compileExpr(self: *Compiler, expr: *const Expr) !void {
    switch (expr.*) {
        .integer => |value| {
            try self.emit(.{ .push_i64 = value });
        },
        .add => |node| {
            try self.compileExpr(node.left);
            try self.compileExpr(node.right);
            try self.emit(.add);
        },
        else => {
            return error.UnsupportedExpression;
        },
    }
}
```

For:

```text
1 + 2
```

the compiler emits:

```text
push_i64 1
push_i64 2
add
```

This works because our VM is stack-based.

#### Why Left Then Right

The stack machine expects operands to be pushed before the operator runs.

For:

```text
1 + 2
```

the execution is:

```text
push 1       stack: [1]
push 2       stack: [1, 2]
add          stack: [3]
```

For subtraction, order matters.

```text
10 - 3
```

must emit:

```text
push_i64 10
push_i64 3
sub
```

The VM pops `3` as the right operand and `10` as the left operand.

#### Compiling More Operators

Now add subtraction, multiplication, and division.

```zig
fn compileExpr(self: *Compiler, expr: *const Expr) !void {
    switch (expr.*) {
        .integer => |value| {
            try self.emit(.{ .push_i64 = value });
        },
        .add => |node| {
            try self.compileExpr(node.left);
            try self.compileExpr(node.right);
            try self.emit(.add);
        },
        .sub => |node| {
            try self.compileExpr(node.left);
            try self.compileExpr(node.right);
            try self.emit(.sub);
        },
        .multiply => |node| {
            try self.compileExpr(node.left);
            try self.compileExpr(node.right);
            try self.emit(.mul);
        },
        .divide => |node| {
            try self.compileExpr(node.left);
            try self.compileExpr(node.right);
            try self.emit(.div);
        },
    }
}
```

The shape is repetitive, but clear.

Later, you can factor it. At first, clarity is more important.

#### Finishing the Program

After compiling the expression, emit `halt`.

```zig
pub fn compile(self: *Compiler, expr: *const Expr) ![]Instruction {
    try self.compileExpr(expr);
    try self.emit(.halt);

    return self.instructions.items;
}
```

This returns the instruction slice.

Be careful with ownership. The returned slice belongs to the compiler’s `ArrayList`. It stays valid until the compiler is deinitialized or the list reallocates.

#### A Small Manual AST

Before connecting the parser, test the compiler with a hand-built AST.

```zig
test "compile addition" {
    var compiler = Compiler.init(std.testing.allocator);
    defer compiler.deinit();

    var left = Expr{ .integer = 1 };
    var right = Expr{ .integer = 2 };
    var root = Expr{
        .add = .{
            .left = &left,
            .right = &right,
        },
    };

    const instructions = try compiler.compile(&root);

    try std.testing.expectEqual(@as(usize, 4), instructions.len);
    try std.testing.expectEqual(Instruction{ .push_i64 = 1 }, instructions[0]);
    try std.testing.expectEqual(Instruction{ .push_i64 = 2 }, instructions[1]);
    try std.testing.expectEqual(Instruction.add, instructions[2]);
    try std.testing.expectEqual(Instruction.halt, instructions[3]);
}
```

This test checks only the compiler. It does not require a tokenizer, parser, or VM.

That separation is valuable.

#### Compiler Errors

A compiler should report structured errors.

Examples:

```text
unsupported expression
too many constants
variable not found
invalid assignment target
type mismatch
jump target too large
```

In Zig, start with an error set:

```zig
const CompileError = error{
    UnsupportedExpression,
    VariableNotFound,
    TypeMismatch,
};
```

Then return errors from compiler functions:

```zig
fn compileExpr(self: *Compiler, expr: *const Expr) CompileError!void {
    // ...
}
```

For a real compiler, an error message should also include source location. The parser can attach locations to AST nodes.

#### Adding Statements

Expressions compute values.

Statements perform actions.

Examples:

```text
print 1 + 2;
let x = 10;
x = x + 1;
```

A small statement AST might be:

```zig
const Stmt = union(enum) {
    print: *Expr,
    expr: *Expr,
};
```

Add a `print` instruction to the VM:

```zig
const OpCode = enum(u8) {
    push_i64,
    add,
    sub,
    mul,
    div,
    print,
    halt,
};
```

Compile a print statement:

```zig
fn compileStmt(self: *Compiler, stmt: *const Stmt) !void {
    switch (stmt.*) {
        .print => |expr| {
            try self.compileExpr(expr);
            try self.emit(.print);
        },
        .expr => |expr| {
            try self.compileExpr(expr);
        },
    }
}
```

Now:

```text
print 1 + 2;
```

emits:

```text
push_i64 1
push_i64 2
add
print
```

#### Adding Variables

Variables need storage.

The VM can use local slots:

```text
slot 0
slot 1
slot 2
```

Add bytecode instructions:

```zig
const OpCode = enum(u8) {
    push_i64,
    load_local,
    store_local,
    add,
    sub,
    mul,
    div,
    print,
    halt,
};

const Instruction = union(OpCode) {
    push_i64: i64,
    load_local: usize,
    store_local: usize,
    add: void,
    sub: void,
    mul: void,
    div: void,
    print: void,
    halt: void,
};
```

A source statement:

```text
let x = 10;
```

can emit:

```text
push_i64 10
store_local 0
```

Using the variable:

```text
print x;
```

can emit:

```text
load_local 0
print
```

The compiler maps variable names to local slots.

#### Symbol Table

A symbol table stores names known to the compiler.

```zig
const Symbol = struct {
    name: []const u8,
    slot: usize,
};

const Compiler = struct {
    instructions: std.ArrayList(Instruction),
    symbols: std.StringHashMap(usize),
};
```

When compiling:

```text
let x = 10;
```

the compiler assigns a slot:

```text
x -> 0
```

When compiling:

```text
x + 1
```

the compiler looks up `x` and emits:

```text
load_local 0
push_i64 1
add
```

A missing name is a compile error.

```zig
const slot = self.symbols.get(name) orelse {
    return error.VariableNotFound;
};
```

#### Control Flow

Control flow needs jumps.

For an `if` expression:

```text
if condition {
    then_branch
} else {
    else_branch
}
```

the compiler emits something like:

```text
compile condition
jump_if_zero else_start
compile then_branch
jump end
else_start:
compile else_branch
end:
```

The hard part is that the compiler may not know jump targets until later.

So it emits placeholder jumps, then patches them.

#### Patching Jumps

A compiler can remember the instruction index where a jump was emitted.

```zig
fn emitJump(self: *Compiler, instruction: Instruction) !usize {
    const index = self.instructions.items.len;
    try self.emit(instruction);
    return index;
}
```

Later it patches the target:

```zig
fn patchJump(self: *Compiler, index: usize, target: usize) void {
    switch (self.instructions.items[index]) {
        .jump => {
            self.instructions.items[index] = .{ .jump = target };
        },
        .jump_if_zero => {
            self.instructions.items[index] = .{ .jump_if_zero = target };
        },
        else => unreachable,
    }
}
```

This pattern appears in many bytecode compilers.

#### Compiler Design Rule

A compiler should not execute the program.

It should translate the program.

Bad compiler behavior:

```text
while compiling, run user code accidentally
while compiling, depend on runtime variable values
while compiling, mutate VM state
```

Good compiler behavior:

```text
read AST
check rules
emit bytecode
report errors
```

This keeps compile time and runtime separate.

#### Optimization

Do not start with optimization.

First make the compiler correct.

Then add small optimizations.

Example: constant folding.

Source:

```text
1 + 2
```

Instead of emitting:

```text
push_i64 1
push_i64 2
add
```

the compiler can emit:

```text
push_i64 3
```

But only do this after the ordinary compiler works.

A simple optimizer should preserve meaning. Speed is useless if the compiler becomes wrong.

#### Testing the Whole Pipeline

After testing each stage separately, test the full pipeline.

```text
source -> parse -> compile -> run -> result
```

Example test cases:

```text
1 + 2 -> 3
1 + 2 * 3 -> 7
(1 + 2) * 3 -> 9
10 - 3 - 2 -> 5
8 / 2 -> 4
```

Whole-pipeline tests catch integration bugs.

Separate stage tests tell you where the bug probably lives.

#### Common Mistakes

A common mistake is mixing parsing and compilation too early. Keep the AST stage until you understand the language.

Another mistake is adding variables, functions, types, and optimization all at once. Add one feature at a time.

A third mistake is producing bytecode without a disassembler. You need to see what the compiler emitted.

A fourth mistake is ignoring source locations. Good errors need locations.

A fifth mistake is treating compiler bugs as VM bugs. If the VM runs the emitted bytecode correctly, the compiler may be the broken part.

#### A Practical Build Order

Build a small compiler in this order:

```text
integer expressions
binary arithmetic
print statements
local variables
assignment
if statements
while loops
function calls
type checking
simple optimization
```

After each step, add tests.

Each feature should change the AST, compiler, VM, and tests in a controlled way.

#### The Main Idea

A compiler is a translator.

For a small Zig project, the compiler can translate an AST into bytecode for your VM.

Use clear stages. Keep the tokenizer, parser, compiler, and VM separate. Emit simple instructions. Add a disassembler. Test each stage alone, then test the full pipeline.

