A compiler translates one form of code into another form.
For this chapter, we will use a small meaning:
source code -> tokens -> AST -> bytecodeThe source code is what the user writes. The bytecode is what our virtual machine runs.
A full production compiler can generate machine code, optimize programs, perform type checking, emit debug information, and link object files. We do not need all of that yet. The first compiler should be small enough to understand.
The Compiler Pipeline
A simple compiler has stages.
source text
-> tokenizer
-> parser
-> AST
-> compiler
-> bytecode
-> VMEach stage has a clear job.
The tokenizer reads characters and creates tokens.
The parser reads tokens and creates a syntax tree.
The compiler reads the syntax tree and creates instructions.
The VM executes the instructions.
Keeping these stages separate makes the system easier to test.
A Small Source Language
Suppose our source language only supports integer expressions:
1 + 2 * 3From the parser section, this becomes an AST:
+
1
*
2
3The compiler should turn that tree into bytecode:
push_i64 1
push_i64 2
push_i64 3
mul
add
haltThe VM then runs the bytecode and computes 7.
AST Input
We can reuse the expression type from the parser:
const Expr = union(enum) {
integer: i64,
add: Binary,
sub: Binary,
multiply: Binary,
divide: Binary,
const Binary = struct {
left: *Expr,
right: *Expr,
};
};This AST is already structured. The compiler does not need to rediscover precedence. The parser already handled that.
That is why the pipeline matters. Each stage should do its own job once.
Bytecode Output
Use a small instruction format:
const OpCode = enum(u8) {
push_i64,
add,
sub,
mul,
div,
halt,
};
const Instruction = union(OpCode) {
push_i64: i64,
add: void,
sub: void,
mul: void,
div: void,
halt: void,
};The compiler will produce a list of Instruction values.
instructions: std.ArrayList(Instruction)The Compiler Type
A compiler needs an allocator and an output buffer.
const std = @import("std");
const Compiler = struct {
instructions: std.ArrayList(Instruction),
pub fn init(allocator: std.mem.Allocator) Compiler {
return .{
.instructions = std.ArrayList(Instruction).init(allocator),
};
}
pub fn deinit(self: *Compiler) void {
self.instructions.deinit();
}
fn emit(self: *Compiler, instruction: Instruction) !void {
try self.instructions.append(instruction);
}
};The emit function appends one instruction to the output.
This is the basic compiler action:
look at AST node
emit instructionCompiling an Integer
An integer expression compiles to one push_i64 instruction.
fn compileExpr(self: *Compiler, expr: *const Expr) !void {
switch (expr.*) {
.integer => |value| {
try self.emit(.{ .push_i64 = value });
},
else => {
return error.UnsupportedExpression;
},
}
}For this AST:
integer 42the compiler emits:
push_i64 42Compiling Addition
For binary operators, compile the left expression first, then the right expression, then emit the operator.
fn compileExpr(self: *Compiler, expr: *const Expr) !void {
switch (expr.*) {
.integer => |value| {
try self.emit(.{ .push_i64 = value });
},
.add => |node| {
try self.compileExpr(node.left);
try self.compileExpr(node.right);
try self.emit(.add);
},
else => {
return error.UnsupportedExpression;
},
}
}For:
1 + 2the compiler emits:
push_i64 1
push_i64 2
addThis works because our VM is stack-based.
Why Left Then Right
The stack machine expects operands to be pushed before the operator runs.
For:
1 + 2the execution is:
push 1 stack: [1]
push 2 stack: [1, 2]
add stack: [3]For subtraction, order matters.
10 - 3must emit:
push_i64 10
push_i64 3
subThe VM pops 3 as the right operand and 10 as the left operand.
Compiling More Operators
Now add subtraction, multiplication, and division.
fn compileExpr(self: *Compiler, expr: *const Expr) !void {
switch (expr.*) {
.integer => |value| {
try self.emit(.{ .push_i64 = value });
},
.add => |node| {
try self.compileExpr(node.left);
try self.compileExpr(node.right);
try self.emit(.add);
},
.sub => |node| {
try self.compileExpr(node.left);
try self.compileExpr(node.right);
try self.emit(.sub);
},
.multiply => |node| {
try self.compileExpr(node.left);
try self.compileExpr(node.right);
try self.emit(.mul);
},
.divide => |node| {
try self.compileExpr(node.left);
try self.compileExpr(node.right);
try self.emit(.div);
},
}
}The shape is repetitive, but clear.
Later, you can factor it. At first, clarity is more important.
Finishing the Program
After compiling the expression, emit halt.
pub fn compile(self: *Compiler, expr: *const Expr) ![]Instruction {
try self.compileExpr(expr);
try self.emit(.halt);
return self.instructions.items;
}This returns the instruction slice.
Be careful with ownership. The returned slice belongs to the compiler’s ArrayList. It stays valid until the compiler is deinitialized or the list reallocates.
A Small Manual AST
Before connecting the parser, test the compiler with a hand-built AST.
test "compile addition" {
var compiler = Compiler.init(std.testing.allocator);
defer compiler.deinit();
var left = Expr{ .integer = 1 };
var right = Expr{ .integer = 2 };
var root = Expr{
.add = .{
.left = &left,
.right = &right,
},
};
const instructions = try compiler.compile(&root);
try std.testing.expectEqual(@as(usize, 4), instructions.len);
try std.testing.expectEqual(Instruction{ .push_i64 = 1 }, instructions[0]);
try std.testing.expectEqual(Instruction{ .push_i64 = 2 }, instructions[1]);
try std.testing.expectEqual(Instruction.add, instructions[2]);
try std.testing.expectEqual(Instruction.halt, instructions[3]);
}This test checks only the compiler. It does not require a tokenizer, parser, or VM.
That separation is valuable.
Compiler Errors
A compiler should report structured errors.
Examples:
unsupported expression
too many constants
variable not found
invalid assignment target
type mismatch
jump target too largeIn Zig, start with an error set:
const CompileError = error{
UnsupportedExpression,
VariableNotFound,
TypeMismatch,
};Then return errors from compiler functions:
fn compileExpr(self: *Compiler, expr: *const Expr) CompileError!void {
// ...
}For a real compiler, an error message should also include source location. The parser can attach locations to AST nodes.
Adding Statements
Expressions compute values.
Statements perform actions.
Examples:
print 1 + 2;
let x = 10;
x = x + 1;A small statement AST might be:
const Stmt = union(enum) {
print: *Expr,
expr: *Expr,
};Add a print instruction to the VM:
const OpCode = enum(u8) {
push_i64,
add,
sub,
mul,
div,
print,
halt,
};Compile a print statement:
fn compileStmt(self: *Compiler, stmt: *const Stmt) !void {
switch (stmt.*) {
.print => |expr| {
try self.compileExpr(expr);
try self.emit(.print);
},
.expr => |expr| {
try self.compileExpr(expr);
},
}
}Now:
print 1 + 2;emits:
push_i64 1
push_i64 2
add
printAdding Variables
Variables need storage.
The VM can use local slots:
slot 0
slot 1
slot 2Add bytecode instructions:
const OpCode = enum(u8) {
push_i64,
load_local,
store_local,
add,
sub,
mul,
div,
print,
halt,
};
const Instruction = union(OpCode) {
push_i64: i64,
load_local: usize,
store_local: usize,
add: void,
sub: void,
mul: void,
div: void,
print: void,
halt: void,
};A source statement:
let x = 10;can emit:
push_i64 10
store_local 0Using the variable:
print x;can emit:
load_local 0
printThe compiler maps variable names to local slots.
Symbol Table
A symbol table stores names known to the compiler.
const Symbol = struct {
name: []const u8,
slot: usize,
};
const Compiler = struct {
instructions: std.ArrayList(Instruction),
symbols: std.StringHashMap(usize),
};When compiling:
let x = 10;the compiler assigns a slot:
x -> 0When compiling:
x + 1the compiler looks up x and emits:
load_local 0
push_i64 1
addA missing name is a compile error.
const slot = self.symbols.get(name) orelse {
return error.VariableNotFound;
};Control Flow
Control flow needs jumps.
For an if expression:
if condition {
then_branch
} else {
else_branch
}the compiler emits something like:
compile condition
jump_if_zero else_start
compile then_branch
jump end
else_start:
compile else_branch
end:The hard part is that the compiler may not know jump targets until later.
So it emits placeholder jumps, then patches them.
Patching Jumps
A compiler can remember the instruction index where a jump was emitted.
fn emitJump(self: *Compiler, instruction: Instruction) !usize {
const index = self.instructions.items.len;
try self.emit(instruction);
return index;
}Later it patches the target:
fn patchJump(self: *Compiler, index: usize, target: usize) void {
switch (self.instructions.items[index]) {
.jump => {
self.instructions.items[index] = .{ .jump = target };
},
.jump_if_zero => {
self.instructions.items[index] = .{ .jump_if_zero = target };
},
else => unreachable,
}
}This pattern appears in many bytecode compilers.
Compiler Design Rule
A compiler should not execute the program.
It should translate the program.
Bad compiler behavior:
while compiling, run user code accidentally
while compiling, depend on runtime variable values
while compiling, mutate VM stateGood compiler behavior:
read AST
check rules
emit bytecode
report errorsThis keeps compile time and runtime separate.
Optimization
Do not start with optimization.
First make the compiler correct.
Then add small optimizations.
Example: constant folding.
Source:
1 + 2Instead of emitting:
push_i64 1
push_i64 2
addthe compiler can emit:
push_i64 3But only do this after the ordinary compiler works.
A simple optimizer should preserve meaning. Speed is useless if the compiler becomes wrong.
Testing the Whole Pipeline
After testing each stage separately, test the full pipeline.
source -> parse -> compile -> run -> resultExample test cases:
1 + 2 -> 3
1 + 2 * 3 -> 7
(1 + 2) * 3 -> 9
10 - 3 - 2 -> 5
8 / 2 -> 4Whole-pipeline tests catch integration bugs.
Separate stage tests tell you where the bug probably lives.
Common Mistakes
A common mistake is mixing parsing and compilation too early. Keep the AST stage until you understand the language.
Another mistake is adding variables, functions, types, and optimization all at once. Add one feature at a time.
A third mistake is producing bytecode without a disassembler. You need to see what the compiler emitted.
A fourth mistake is ignoring source locations. Good errors need locations.
A fifth mistake is treating compiler bugs as VM bugs. If the VM runs the emitted bytecode correctly, the compiler may be the broken part.
A Practical Build Order
Build a small compiler in this order:
integer expressions
binary arithmetic
print statements
local variables
assignment
if statements
while loops
function calls
type checking
simple optimizationAfter each step, add tests.
Each feature should change the AST, compiler, VM, and tests in a controlled way.
The Main Idea
A compiler is a translator.
For a small Zig project, the compiler can translate an AST into bytecode for your VM.
Use clear stages. Keep the tokenizer, parser, compiler, and VM separate. Emit simple instructions. Add a disassembler. Test each stage alone, then test the full pipeline.