A parser reads text and turns it into structure.
For example, this text:
1 + 2 * 3
can become a tree:
+
1
*
2
3
That tree tells the program what the text means: the multiplication binds tighter, so it is grouped before the addition.
A parser is useful when writing:
programming languages
configuration formats
query languages
template engines
data formats
command-line expressions
A parser usually has two stages:
source text -> tokens -> syntax tree
The tokenizer reads characters and produces tokens.
The parser reads tokens and produces structure.
Tokens
A token is a small meaningful piece of text.
For this input:
1 + 2
The tokens are:
integer 1
plus sign
integer 2
end of file
Let us define token types:
const TokenTag = enum {
    integer,
    plus,
    star,
    left_paren,
    right_paren,
    eof,
    invalid,
};

const Token = struct {
    tag: TokenTag,
    lexeme: []const u8,
};
tag says what kind of token it is.
lexeme is the slice of source text that produced the token.
For example:
Token{
    .tag = .integer,
    .lexeme = "123",
}
A Simple Tokenizer
The tokenizer walks through bytes.
const std = @import("std");

const TokenTag = enum {
    integer,
    plus,
    star,
    left_paren,
    right_paren,
    eof,
    invalid,
};

const Token = struct {
    tag: TokenTag,
    lexeme: []const u8,
};

const Tokenizer = struct {
    source: []const u8,
    index: usize,

    pub fn init(source: []const u8) Tokenizer {
        return .{
            .source = source,
            .index = 0,
        };
    }

    fn skipWhitespace(self: *Tokenizer) void {
        while (self.index < self.source.len) {
            const c = self.source[self.index];
            if (c == ' ' or c == '\n' or c == '\t' or c == '\r') {
                self.index += 1;
            } else {
                break;
            }
        }
    }

    pub fn next(self: *Tokenizer) Token {
        self.skipWhitespace();
        if (self.index >= self.source.len) {
            return .{
                .tag = .eof,
                .lexeme = "",
            };
        }
        const start = self.index;
        const c = self.source[self.index];
        self.index += 1;
        return switch (c) {
            '+' => .{ .tag = .plus, .lexeme = self.source[start..self.index] },
            '*' => .{ .tag = .star, .lexeme = self.source[start..self.index] },
            '(' => .{ .tag = .left_paren, .lexeme = self.source[start..self.index] },
            ')' => .{ .tag = .right_paren, .lexeme = self.source[start..self.index] },
            '0'...'9' => blk: {
                while (self.index < self.source.len) {
                    const next_c = self.source[self.index];
                    if (next_c >= '0' and next_c <= '9') {
                        self.index += 1;
                    } else {
                        break;
                    }
                }
                break :blk .{
                    .tag = .integer,
                    .lexeme = self.source[start..self.index],
                };
            },
            else => .{
                .tag = .invalid,
                .lexeme = self.source[start..self.index],
            },
        };
    }
};
This tokenizer recognizes:
integers
+
*
(
)
end of file
invalid characters
It is small, but it already shows the basic tokenizer pattern.
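One way to see the pattern in action is a loop that pulls tokens until eof. This is a sketch that assumes the Tokenizer and Token types from the listing above are in scope; printTokens is a hypothetical helper, not part of the tokenizer itself:

```zig
// Walk an expression and print each token's tag and lexeme,
// one token per line, stopping at end of file.
pub fn printTokens(source: []const u8) void {
    var tokenizer = Tokenizer.init(source);
    while (true) {
        const token = tokenizer.next();
        std.debug.print("{s} '{s}'\n", .{ @tagName(token.tag), token.lexeme });
        if (token.tag == .eof) break;
    }
}
```

Dumping the token stream like this is often the fastest way to debug a tokenizer before any parser code exists.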
Testing the Tokenizer
A tokenizer should be easy to test.
test "tokenizer reads simple expression" {
    var tokenizer = Tokenizer.init("1 + 23");

    const t1 = tokenizer.next();
    try std.testing.expectEqual(TokenTag.integer, t1.tag);
    try std.testing.expectEqualStrings("1", t1.lexeme);

    const t2 = tokenizer.next();
    try std.testing.expectEqual(TokenTag.plus, t2.tag);

    const t3 = tokenizer.next();
    try std.testing.expectEqual(TokenTag.integer, t3.tag);
    try std.testing.expectEqualStrings("23", t3.lexeme);

    const t4 = tokenizer.next();
    try std.testing.expectEqual(TokenTag.eof, t4.tag);
}
Tests like this catch many parser bugs early.
Syntax Trees
A syntax tree represents structure.
For arithmetic expressions, we need nodes such as:
integer literal
addition
multiplication
A simple AST node can use a tagged union.
const Expr = union(enum) {
    integer: i64,
    add: Binary,
    multiply: Binary,

    const Binary = struct {
        left: *Expr,
        right: *Expr,
    };
};
This says an expression is one of:
an integer
an addition node
a multiplication node
The binary nodes point to child expressions.
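To make the layout concrete, the tree for 1 + 2 can be built by hand. This small test is a sketch that assumes the Expr type above:

```zig
test "the tree for 1 + 2, built by hand" {
    var one = Expr{ .integer = 1 };
    var two = Expr{ .integer = 2 };
    const sum = Expr{ .add = .{ .left = &one, .right = &two } };
    // The add node points at its two children.
    try std.testing.expectEqual(@as(i64, 1), sum.add.left.integer);
    try std.testing.expectEqual(@as(i64, 2), sum.add.right.integer);
}
```

The parser below builds the same shapes, except the nodes live on the heap so they can outlive the function that created them.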
Parser State
The parser reads tokens one by one.
const Parser = struct {
    allocator: std.mem.Allocator,
    tokenizer: Tokenizer,
    current: Token,

    pub fn init(allocator: std.mem.Allocator, source: []const u8) Parser {
        var tokenizer = Tokenizer.init(source);
        // Read the first token before copying the tokenizer into the
        // struct; otherwise the copy keeps the old position and the
        // first token would be produced twice.
        const first = tokenizer.next();
        return .{
            .allocator = allocator,
            .tokenizer = tokenizer,
            .current = first,
        };
    }

    fn advance(self: *Parser) void {
        self.current = self.tokenizer.next();
    }
};
The parser keeps the current token in self.current.
When it consumes a token, it calls advance.
Allocating AST Nodes
The parser needs memory for AST nodes.
fn createExpr(self: *Parser, expr: Expr) !*Expr {
    const node = try self.allocator.create(Expr);
    node.* = expr;
    return node;
}
This allocates one expression node and stores the given value.
A parser often uses an arena allocator for AST nodes. The whole tree can then be freed at once after compilation or evaluation.
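In code, the arena pattern looks like this fragment from a function such as main (assuming the Parser defined below); every node comes from the arena's allocator, and a single deinit frees the entire tree:

```zig
// All AST nodes are allocated from the arena, so no per-node
// frees are needed anywhere in the parser.
var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
defer arena.deinit(); // frees the whole tree at once

var parser = Parser.init(arena.allocator(), "1 + 2 * 3");
const tree = try parser.parse();
_ = tree; // use the tree, then let the arena reclaim it
```

This is why createExpr never needs a matching destroy: ownership of every node belongs to the arena, not to individual parse functions.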
Parsing Integer Literals
First parse the simplest expression: an integer.
fn parsePrimary(self: *Parser) !*Expr {
    switch (self.current.tag) {
        .integer => {
            const value = try std.fmt.parseInt(i64, self.current.lexeme, 10);
            self.advance();
            return try self.createExpr(.{
                .integer = value,
            });
        },
        else => return error.ExpectedExpression,
    }
}
This reads an integer token and turns it into an AST node.
Input:
123
AST:
integer 123
Parsing Parentheses
Parentheses contain another expression.
One detail first: parsePrimary will now call parseExpression, making the parse functions mutually recursive. Zig cannot infer an error set across that cycle, so we declare one explicitly:
const ParseError = error{
    ExpectedExpression,
    ExpectedRightParen,
    Overflow,
    InvalidCharacter,
    OutOfMemory,
};

fn parsePrimary(self: *Parser) ParseError!*Expr {
    switch (self.current.tag) {
        .integer => {
            const value = try std.fmt.parseInt(i64, self.current.lexeme, 10);
            self.advance();
            return try self.createExpr(.{
                .integer = value,
            });
        },
        .left_paren => {
            self.advance();
            const expr = try self.parseExpression();
            if (self.current.tag != .right_paren) {
                return error.ExpectedRightParen;
            }
            self.advance();
            return expr;
        },
        else => return error.ExpectedExpression,
    }
}
Now this works:
(1 + 2)
The parser sees (, parses an expression inside it, then requires ).
Operator Precedence
This is where parsing becomes interesting.
The expression:
1 + 2 * 3
should mean:
1 + (2 * 3)
not:
(1 + 2) * 3
Multiplication has higher precedence than addition.
One simple way to handle this is to use separate parser functions:
parseExpression -> parseAddition
parseAddition -> parseMultiplication
parseMultiplication -> parsePrimary
parsePrimary -> integers and parentheses
Higher-precedence functions are lower in the call chain.
Parsing Multiplication
fn parseMultiplication(self: *Parser) !*Expr {
    var left = try self.parsePrimary();
    while (self.current.tag == .star) {
        self.advance();
        const right = try self.parsePrimary();
        left = try self.createExpr(.{
            .multiply = .{
                .left = left,
                .right = right,
            },
        });
    }
    return left;
}
This parses:
2 * 3 * 4
as:
*
*
2
3
4
That is left-associative multiplication.
Parsing Addition
Addition has lower precedence, so it calls parseMultiplication.
fn parseAddition(self: *Parser) !*Expr {
    var left = try self.parseMultiplication();
    while (self.current.tag == .plus) {
        self.advance();
        const right = try self.parseMultiplication();
        left = try self.createExpr(.{
            .add = .{
                .left = left,
                .right = right,
            },
        });
    }
    return left;
}
Now:
1 + 2 * 3
becomes:
+
1
*
2
3
because the right side of + is parsed by parseMultiplication.
The Top-Level Parse Function
fn parseExpression(self: *Parser) !*Expr {
    return self.parseAddition();
}

pub fn parse(self: *Parser) !*Expr {
    const expr = try self.parseExpression();
    if (self.current.tag != .eof) {
        return error.ExpectedEndOfFile;
    }
    return expr;
}
The parser accepts one full expression and then requires the end of the input.
This prevents partial parses.
Input:
1 + 2 abc
should fail instead of silently accepting only 1 + 2.
Evaluating the AST
To check that the parser works, we can evaluate the tree.
fn eval(expr: *const Expr) i64 {
    return switch (expr.*) {
        .integer => |value| value,
        .add => |node| eval(node.left) + eval(node.right),
        .multiply => |node| eval(node.left) * eval(node.right),
    };
}
Example:
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    var arena = std.heap.ArenaAllocator.init(gpa.allocator());
    defer arena.deinit();

    var parser = Parser.init(arena.allocator(), "1 + 2 * 3");
    const expr = try parser.parse();
    std.debug.print("{}\n", .{eval(expr)});
}
Output:
7
The AST gives the expression its correct meaning.
Error Messages
A parser should not only reject bad input. It should explain what went wrong.
This input:
1 + )
should produce something like:
expected expression, found ')'
For better errors, tokens should store source positions.
const Token = struct {
    tag: TokenTag,
    lexeme: []const u8,
    start: usize,
    end: usize,
};
Then the parser can report where the error happened.
1 + )
^
expected expression
Good parser errors save an enormous amount of debugging time.
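A caret report like the one above can be derived from the token's start position. This reportError helper is a hypothetical sketch, not part of the parser shown, and it assumes a single-line source:

```zig
// Print the source line, a caret under the offending position,
// and the error message. Assumes `source` contains no newlines.
fn reportError(source: []const u8, start: usize, message: []const u8) void {
    std.debug.print("{s}\n", .{source});
    var i: usize = 0;
    while (i < start) : (i += 1) {
        std.debug.print(" ", .{});
    }
    std.debug.print("^\n{s}\n", .{message});
}
```

For multi-line sources, the parser would first scan for the line containing start and print only that line, adjusting the caret offset accordingly.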
Recursive Descent
The parser style shown here is called recursive descent.
Each grammar rule becomes one function.
expression -> addition
addition -> multiplication ("+" multiplication)*
multiplication -> primary ("*" primary)*
primary -> integer | "(" expression ")"
This maps directly to code:
parseExpression
parseAddition
parseMultiplication
parsePrimary
Recursive descent is a good first parser style because it is easy to read and debug.
Parser Design Rules
Keep the tokenizer simple.
Keep parser functions small.
Use one function per grammar level.
Handle precedence explicitly.
Store source positions early.
Return errors instead of panicking.
Use an arena for AST nodes.
Test each layer separately.
The Main Idea
A parser turns text into structure.
In Zig, a simple parser is built from plain structs, enums, tagged unions, slices, and allocator-aware functions.
The tokenizer produces tokens. The parser consumes tokens. The AST stores meaning.
Start with a tiny grammar, test it carefully, then add one feature at a time.