A parser reads text and turns it into structure.
For example, this text:
1 + 2 * 3
can become a tree:
+
1
*
2
3
That tree tells the program what the text means: the multiplication binds tighter, so it is grouped before the addition.
A parser is useful when writing:
programming languages
configuration formats
query languages
template engines
data formats
command-line expressions
A parser usually has two stages:
source text -> tokens -> syntax tree
The tokenizer reads characters and produces tokens.
The parser reads tokens and produces structure.
Tokens
A token is a small meaningful piece of text.
For this input:
1 + 2
The tokens are:
integer 1
plus sign
integer 2
end of file
Let us define token types:
const TokenTag = enum {
    integer,
    plus,
    star,
    left_paren,
    right_paren,
    eof,
    invalid,
};

const Token = struct {
    tag: TokenTag,
    lexeme: []const u8,
};
tag says what kind of token it is.
lexeme is the slice of source text that produced the token.
For example:
Token{
    .tag = .integer,
    .lexeme = "123",
}
A Simple Tokenizer
The tokenizer walks through bytes.
const std = @import("std");

const TokenTag = enum {
    integer,
    plus,
    star,
    left_paren,
    right_paren,
    eof,
    invalid,
};

const Token = struct {
    tag: TokenTag,
    lexeme: []const u8,
};

const Tokenizer = struct {
    source: []const u8,
    index: usize,

    pub fn init(source: []const u8) Tokenizer {
        return .{
            .source = source,
            .index = 0,
        };
    }

    fn skipWhitespace(self: *Tokenizer) void {
        while (self.index < self.source.len) {
            const c = self.source[self.index];
            if (c == ' ' or c == '\n' or c == '\t' or c == '\r') {
                self.index += 1;
            } else {
                break;
            }
        }
    }

    pub fn next(self: *Tokenizer) Token {
        self.skipWhitespace();
        if (self.index >= self.source.len) {
            return .{
                .tag = .eof,
                .lexeme = "",
            };
        }
        const start = self.index;
        const c = self.source[self.index];
        self.index += 1;
        return switch (c) {
            '+' => .{ .tag = .plus, .lexeme = self.source[start..self.index] },
            '*' => .{ .tag = .star, .lexeme = self.source[start..self.index] },
            '(' => .{ .tag = .left_paren, .lexeme = self.source[start..self.index] },
            ')' => .{ .tag = .right_paren, .lexeme = self.source[start..self.index] },
            '0'...'9' => blk: {
                while (self.index < self.source.len) {
                    const next_c = self.source[self.index];
                    if (next_c >= '0' and next_c <= '9') {
                        self.index += 1;
                    } else {
                        break;
                    }
                }
                break :blk .{
                    .tag = .integer,
                    .lexeme = self.source[start..self.index],
                };
            },
            else => .{
                .tag = .invalid,
                .lexeme = self.source[start..self.index],
            },
        };
    }
};
This tokenizer recognizes:
integers
+
*
(
)
end of file
invalid characters
It is small, but it already shows the basic tokenizer pattern.
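One way to see the pattern in action is a loop that pulls tokens until eof. This is a sketch that assumes the Tokenizer and Token types from the listing above are in scope; printTokens is a hypothetical helper, not part of the tokenizer itself:

```zig
// Walk an expression and print each token's tag and lexeme,
// one token per line, stopping at end of file.
pub fn printTokens(source: []const u8) void {
    var tokenizer = Tokenizer.init(source);
    while (true) {
        const token = tokenizer.next();
        std.debug.print("{s} '{s}'\n", .{ @tagName(token.tag), token.lexeme });
        if (token.tag == .eof) break;
    }
}
```

Dumping the token stream like this is often the fastest way to debug a tokenizer before any parser code exists.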
Testing the Tokenizer
A tokenizer should be easy to test.
test "tokenizer reads simple expression" {
    var tokenizer = Tokenizer.init("1 + 23");

    const t1 = tokenizer.next();
    try std.testing.expectEqual(TokenTag.integer, t1.tag);
    try std.testing.expectEqualStrings("1", t1.lexeme);

    const t2 = tokenizer.next();
    try std.testing.expectEqual(TokenTag.plus, t2.tag);

    const t3 = tokenizer.next();
    try std.testing.expectEqual(TokenTag.integer, t3.tag);
    try std.testing.expectEqualStrings("23", t3.lexeme);

    const t4 = tokenizer.next();
    try std.testing.expectEqual(TokenTag.eof, t4.tag);
}
Tests like this catch many parser bugs early.
Syntax Trees
A syntax tree represents structure.
For arithmetic expressions, we need nodes such as:
integer literal
addition
multiplication
A simple AST node can use a tagged union.
const Expr = union(enum) {
    integer: i64,
    add: Binary,
    multiply: Binary,

    const Binary = struct {
        left: *Expr,
        right: *Expr,
    };
};
This says an expression is one of:
an integer
an addition node
a multiplication node
The binary nodes point to child expressions.
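To make the layout concrete, the tree for 1 + 2 can be built by hand. This small test is a sketch that assumes the Expr type above:

```zig
test "the tree for 1 + 2, built by hand" {
    var one = Expr{ .integer = 1 };
    var two = Expr{ .integer = 2 };
    const sum = Expr{ .add = .{ .left = &one, .right = &two } };
    // The add node points at its two children.
    try std.testing.expectEqual(@as(i64, 1), sum.add.left.integer);
    try std.testing.expectEqual(@as(i64, 2), sum.add.right.integer);
}
```

The parser below builds the same shapes, except the nodes live on the heap so they can outlive the function that created them.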
Parser State
The parser reads tokens one by one.
const Parser = struct {
    allocator: std.mem.Allocator,
    tokenizer: Tokenizer,
    current: Token,

    pub fn init(allocator: std.mem.Allocator, source: []const u8) Parser {
        var tokenizer = Tokenizer.init(source);
        // Read the first token before copying the tokenizer into the
        // struct; otherwise the copy keeps the old position and the
        // first token would be produced twice.
        const first = tokenizer.next();
        return .{
            .allocator = allocator,
            .tokenizer = tokenizer,
            .current = first,
        };
    }

    fn advance(self: *Parser) void {
        self.current = self.tokenizer.next();
    }
};
The parser keeps the current token in self.current.
When it consumes a token, it calls advance.
Allocating AST Nodes
The parser needs memory for AST nodes.
fn createExpr(self: *Parser, expr: Expr) !*Expr {
    const node = try self.allocator.create(Expr);
    node.* = expr;
    return node;
}
This allocates one expression node and stores the given value.
A parser often uses an arena allocator for AST nodes. The whole tree can then be freed at once after compilation or evaluation.
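In code, the arena pattern looks like this fragment from a function such as main (assuming the Parser defined below); every node comes from the arena's allocator, and a single deinit frees the entire tree:

```zig
// All AST nodes are allocated from the arena, so no per-node
// frees are needed anywhere in the parser.
var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
defer arena.deinit(); // frees the whole tree at once

var parser = Parser.init(arena.allocator(), "1 + 2 * 3");
const tree = try parser.parse();
_ = tree; // use the tree, then let the arena reclaim it
```

This is why createExpr never needs a matching destroy: ownership of every node belongs to the arena, not to individual parse functions.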
Parsing Integer Literals
First parse the simplest expression: an integer.
fn parsePrimary(self: *Parser) !*Expr {
    switch (self.current.tag) {
        .integer => {
            const value = try std.fmt.parseInt(i64, self.current.lexeme, 10);
            self.advance();
            return try self.createExpr(.{
                .integer = value,
            });
        },
        else => return error.ExpectedExpression,
    }
}
This reads an integer token and turns it into an AST node.
Input:
123
AST:
integer 123
Parsing Parentheses
Parentheses contain another expression.
One detail first: parsePrimary will now call parseExpression, making the parse functions mutually recursive. Zig cannot infer an error set across that cycle, so we declare one explicitly:
const ParseError = error{
    ExpectedExpression,
    ExpectedRightParen,
    Overflow,
    InvalidCharacter,
    OutOfMemory,
};

fn parsePrimary(self: *Parser) ParseError!*Expr {
    switch (self.current.tag) {
        .integer => {
            const value = try std.fmt.parseInt(i64, self.current.lexeme, 10);
            self.advance();
            return try self.createExpr(.{
                .integer = value,
            });
        },
        .left_paren => {
            self.advance();
            const expr = try self.parseExpression();
            if (self.current.tag != .right_paren) {
                return error.ExpectedRightParen;
            }
            self.advance();
            return expr;
        },
        else => return error.ExpectedExpression,
    }
}
Now this works:
(1 + 2)
The parser sees (, parses an expression inside it, then requires ).
Operator Precedence
This is where parsing becomes interesting.
The expression:
1 + 2 * 3
should mean:
1 + (2 * 3)
not:
(1 + 2) * 3
Multiplication has higher precedence than addition.
One simple way to handle this is to use separate parser functions:
parseExpression -> parseAddition
parseAddition -> parseMultiplication
parseMultiplication -> parsePrimary
parsePrimary -> integers and parentheses
Higher-precedence functions are lower in the call chain.
Parsing Multiplication
fn parseMultiplication(self: *Parser) !*Expr {
    var left = try self.parsePrimary();
    while (self.current.tag == .star) {
        self.advance();
        const right = try self.parsePrimary();
        left = try self.createExpr(.{
            .multiply = .{
                .left = left,
                .right = right,
            },
        });
    }
    return left;
}
This parses:
2 * 3 * 4
as:
*
*
2
3
4
That is left-associative multiplication.
Parsing Addition
Addition has lower precedence, so it calls parseMultiplication.
fn parseAddition(self: *Parser) !*Expr {
    var left = try self.parseMultiplication();
    while (self.current.tag == .plus) {
        self.advance();
        const right = try self.parseMultiplication();
        left = try self.createExpr(.{
            .add = .{
                .left = left,
                .right = right,
            },
        });
    }
    return left;
}
Now:
1 + 2 * 3
becomes:
+
1
*
2
3
because the right side of + is parsed by parseMultiplication.
The Top-Level Parse Function
fn parseExpression(self: *Parser) !*Expr {
    return self.parseAddition();
}

pub fn parse(self: *Parser) !*Expr {
    const expr = try self.parseExpression();
    if (self.current.tag != .eof) {
        return error.ExpectedEndOfFile;
    }
    return expr;
}
The parser accepts one full expression and then requires the end of the input.
This prevents partial parses.
Input:
1 + 2 abc
should fail instead of silently accepting only 1 + 2.
Evaluating the AST
To check that the parser works, we can evaluate the tree.
fn eval(expr: *const Expr) i64 {
    return switch (expr.*) {
        .integer => |value| value,
        .add => |node| eval(node.left) + eval(node.right),
        .multiply => |node| eval(node.left) * eval(node.right),
    };
}
Example:
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    var arena = std.heap.ArenaAllocator.init(gpa.allocator());
    defer arena.deinit();

    var parser = Parser.init(arena.allocator(), "1 + 2 * 3");
    const expr = try parser.parse();
    std.debug.print("{}\n", .{eval(expr)});
}
Output:
7
The AST gives the expression its correct meaning.
Error Messages
A parser should not only reject bad input. It should explain what went wrong.
This input:
1 + )
should produce something like:
expected expression, found ')'
For better errors, tokens should store source positions.
const Token = struct {
    tag: TokenTag,
    lexeme: []const u8,
    start: usize,
    end: usize,
};
Then the parser can report where the error happened.
1 + )
^
expected expression
Good parser errors save an enormous amount of debugging time.
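A caret report like the one above can be derived from the token's start position. This reportError helper is a hypothetical sketch, not part of the parser shown, and it assumes a single-line source:

```zig
// Print the source line, a caret under the offending position,
// and the error message. Assumes `source` contains no newlines.
fn reportError(source: []const u8, start: usize, message: []const u8) void {
    std.debug.print("{s}\n", .{source});
    var i: usize = 0;
    while (i < start) : (i += 1) {
        std.debug.print(" ", .{});
    }
    std.debug.print("^\n{s}\n", .{message});
}
```

For multi-line sources, the parser would first scan for the line containing start and print only that line, adjusting the caret offset accordingly.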
Recursive Descent
The parser style shown here is called recursive descent.
Each grammar rule becomes one function.
expression -> addition
addition -> multiplication ("+" multiplication)*
multiplication -> primary ("*" primary)*
primary -> integer | "(" expression ")"
This maps directly to code:
parseExpression
parseAddition
parseMultiplication
parsePrimary
Recursive descent is a good first parser style because it is easy to read and debug.
Parser Design Rules
Keep the tokenizer simple.
Keep parser functions small.
Use one function per grammar level.
Handle precedence explicitly.
Store source positions early.
Return errors instead of panicking.
Use an arena for AST nodes.
Test each layer separately.
The Main Idea
A parser turns text into structure.
In Zig, a simple parser is built from plain structs, enums, tagged unions, slices, and allocator-aware functions.
The tokenizer produces tokens. The parser consumes tokens. The AST stores meaning.
Start with a tiny grammar, test it carefully, then add one feature at a time.