# Writing a Parser
A parser reads text and turns it into structure.

For example, this text:

```text
1 + 2 * 3
```

can become a tree:

```text
+
  1
  *
    2
    3
```

That tree tells the program what the text means: the multiplication groups together before the addition, so `2 * 3` is evaluated first.

A parser is useful when writing:

```text
programming languages
configuration formats
query languages
template engines
data formats
command-line expressions
```

A parser usually has two stages:

```text
source text -> tokens -> syntax tree
```

The tokenizer reads characters and produces tokens.

The parser reads tokens and produces structure.

#### Tokens

A token is a small meaningful piece of text.

For this input:

```text
1 + 2
```

The tokens are:

```text
integer 1
plus sign
integer 2
end of file
```

Let us define token types:

```zig
const TokenTag = enum {
    integer,
    plus,
    star,
    left_paren,
    right_paren,
    eof,
    invalid,
};

const Token = struct {
    tag: TokenTag,
    lexeme: []const u8,
};
```

`tag` says what kind of token it is.

`lexeme` is the slice of source text that produced the token.

For example:

```zig
Token{
    .tag = .integer,
    .lexeme = "123",
}
```

#### A Simple Tokenizer

The tokenizer walks through bytes.

```zig
const std = @import("std");

const TokenTag = enum {
    integer,
    plus,
    star,
    left_paren,
    right_paren,
    eof,
    invalid,
};

const Token = struct {
    tag: TokenTag,
    lexeme: []const u8,
};

const Tokenizer = struct {
    source: []const u8,
    index: usize,

    pub fn init(source: []const u8) Tokenizer {
        return .{
            .source = source,
            .index = 0,
        };
    }

    fn skipWhitespace(self: *Tokenizer) void {
        while (self.index < self.source.len) {
            const c = self.source[self.index];

            if (c == ' ' or c == '\n' or c == '\t' or c == '\r') {
                self.index += 1;
            } else {
                break;
            }
        }
    }

    pub fn next(self: *Tokenizer) Token {
        self.skipWhitespace();

        if (self.index >= self.source.len) {
            return .{
                .tag = .eof,
                .lexeme = "",
            };
        }

        const start = self.index;
        const c = self.source[self.index];
        self.index += 1;

        return switch (c) {
            '+' => .{ .tag = .plus, .lexeme = self.source[start..self.index] },
            '*' => .{ .tag = .star, .lexeme = self.source[start..self.index] },
            '(' => .{ .tag = .left_paren, .lexeme = self.source[start..self.index] },
            ')' => .{ .tag = .right_paren, .lexeme = self.source[start..self.index] },
            '0'...'9' => blk: {
                while (self.index < self.source.len) {
                    const next_c = self.source[self.index];

                    if (next_c >= '0' and next_c <= '9') {
                        self.index += 1;
                    } else {
                        break;
                    }
                }

                break :blk .{
                    .tag = .integer,
                    .lexeme = self.source[start..self.index],
                };
            },
            else => .{
                .tag = .invalid,
                .lexeme = self.source[start..self.index],
            },
        };
    }
};
```

This tokenizer recognizes:

```text
integers
+
*
(
)
end of file
invalid characters
```

It is small, but it already shows the basic tokenizer pattern.
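
To see the tokenizer in action, a small driver loop can print every token. This is a sketch that assumes the `Tokenizer` and `Token` definitions above are in the same file:

```zig
// Assumes `std`, `Tokenizer`, and `Token` from the listing above are in scope.
pub fn main() !void {
    var tokenizer = Tokenizer.init("(1 + 2) * 3");

    while (true) {
        const token = tokenizer.next();
        std.debug.print("{s} '{s}'\n", .{ @tagName(token.tag), token.lexeme });

        if (token.tag == .eof) break;
    }
}
```

Printing the `lexeme` next to the tag makes tokenizer mistakes visible immediately.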

#### Testing the Tokenizer

A tokenizer should be easy to test.

```zig
test "tokenizer reads simple expression" {
    var tokenizer = Tokenizer.init("1 + 23");

    const t1 = tokenizer.next();
    try std.testing.expectEqual(TokenTag.integer, t1.tag);
    try std.testing.expectEqualStrings("1", t1.lexeme);

    const t2 = tokenizer.next();
    try std.testing.expectEqual(TokenTag.plus, t2.tag);

    const t3 = tokenizer.next();
    try std.testing.expectEqual(TokenTag.integer, t3.tag);
    try std.testing.expectEqualStrings("23", t3.lexeme);

    const t4 = tokenizer.next();
    try std.testing.expectEqual(TokenTag.eof, t4.tag);
}
```

Tests like this catch many parser bugs early.

#### Syntax Trees

A syntax tree represents structure.

For arithmetic expressions, we need nodes such as:

```text
integer literal
addition
multiplication
```

A simple AST node can use a tagged union.

```zig
const Expr = union(enum) {
    integer: i64,
    add: Binary,
    multiply: Binary,

    const Binary = struct {
        left: *Expr,
        right: *Expr,
    };
};
```

This says an expression is one of:

```text
an integer
an addition node
a multiplication node
```

The binary nodes point to child expressions.
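
Before writing the parser, it can help to build one tree by hand. This sketch constructs the AST for `1 + 2 * 3` directly, assuming the `Expr` definition above is in scope:

```zig
test "build 1 + 2 * 3 by hand" {
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    defer arena.deinit();
    const a = arena.allocator();

    const one = try a.create(Expr);
    one.* = .{ .integer = 1 };

    const two = try a.create(Expr);
    two.* = .{ .integer = 2 };

    const three = try a.create(Expr);
    three.* = .{ .integer = 3 };

    // The multiplication node groups 2 and 3 together.
    const product = try a.create(Expr);
    product.* = .{ .multiply = .{ .left = two, .right = three } };

    // The addition node sits above it, matching the tree diagram earlier.
    const sum = try a.create(Expr);
    sum.* = .{ .add = .{ .left = one, .right = product } };

    try std.testing.expectEqual(@as(i64, 2), sum.add.right.multiply.left.integer);
}
```

The parser's job is to build exactly this shape automatically.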

#### Parser State

The parser reads tokens one by one.

```zig
const Parser = struct {
    allocator: std.mem.Allocator,
    tokenizer: Tokenizer,
    current: Token,

    pub fn init(allocator: std.mem.Allocator, source: []const u8) Parser {
        var tokenizer = Tokenizer.init(source);
        const first = tokenizer.next();

        return .{
            .allocator = allocator,
            .tokenizer = tokenizer,
            .current = first,
        };
    }

    fn advance(self: *Parser) void {
        self.current = self.tokenizer.next();
    }
};
```

The parser keeps the current token in `self.current`.

When it consumes a token, it calls `advance`.
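
Many parsers also add a small convenience on top of `advance`: check the current token's tag, then consume it. The listings below check tags inline instead, but a helper like this (the name `expect` and the error name are my own) is a common pattern:

```zig
fn expect(self: *Parser, tag: TokenTag) !void {
    // Fail if the current token is not what the grammar requires.
    if (self.current.tag != tag) {
        return error.UnexpectedToken;
    }
    self.advance();
}
```

With it, requiring a closing parenthesis becomes `try self.expect(.right_paren);`.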

#### Allocating AST Nodes

The parser needs memory for AST nodes.

```zig
fn createExpr(self: *Parser, expr: Expr) !*Expr {
    const node = try self.allocator.create(Expr);
    node.* = expr;
    return node;
}
```

This allocates one expression node and stores the given value.

A parser often uses an arena allocator for AST nodes. The whole tree can then be freed at once after compilation or evaluation.

#### Parsing Integer Literals

First parse the simplest expression: an integer.

```zig
fn parsePrimary(self: *Parser) !*Expr {
    switch (self.current.tag) {
        .integer => {
            const value = try std.fmt.parseInt(i64, self.current.lexeme, 10);
            self.advance();

            return try self.createExpr(.{
                .integer = value,
            });
        },
        else => return error.ExpectedExpression,
    }
}
```

This reads an integer token and turns it into an AST node.

Input:

```text
123
```

AST:

```text
integer 123
```

#### Parsing Parentheses

Parentheses contain another expression.

```zig
fn parsePrimary(self: *Parser) !*Expr {
    switch (self.current.tag) {
        .integer => {
            const value = try std.fmt.parseInt(i64, self.current.lexeme, 10);
            self.advance();

            return try self.createExpr(.{
                .integer = value,
            });
        },
        .left_paren => {
            self.advance();

            const expr = try self.parseExpression();

            if (self.current.tag != .right_paren) {
                return error.ExpectedRightParen;
            }

            self.advance();
            return expr;
        },
        else => return error.ExpectedExpression,
    }
}
```

Now this works:

```text
(1 + 2)
```

The parser sees `(`, parses an expression inside it, then requires `)`.

#### Operator Precedence

This is where parsing becomes interesting.

The expression:

```text
1 + 2 * 3
```

should mean:

```text
1 + (2 * 3)
```

not:

```text
(1 + 2) * 3
```

Multiplication has higher precedence than addition.

One simple way to handle this is to use separate parser functions:

```text
parseExpression -> parseAddition
parseAddition -> parseMultiplication
parseMultiplication -> parsePrimary
parsePrimary -> integers and parentheses
```

Higher-precedence functions are lower in the call chain.

#### Parsing Multiplication

```zig
fn parseMultiplication(self: *Parser) !*Expr {
    var left = try self.parsePrimary();

    while (self.current.tag == .star) {
        self.advance();

        const right = try self.parsePrimary();

        left = try self.createExpr(.{
            .multiply = .{
                .left = left,
                .right = right,
            },
        });
    }

    return left;
}
```

This parses:

```text
2 * 3 * 4
```

as:

```text
*
  *
    2
    3
  4
```

That is left-associative multiplication.

#### Parsing Addition

Addition has lower precedence, so it calls `parseMultiplication`.

```zig
fn parseAddition(self: *Parser) !*Expr {
    var left = try self.parseMultiplication();

    while (self.current.tag == .plus) {
        self.advance();

        const right = try self.parseMultiplication();

        left = try self.createExpr(.{
            .add = .{
                .left = left,
                .right = right,
            },
        });
    }

    return left;
}
```

Now:

```text
1 + 2 * 3
```

becomes:

```text
+
  1
  *
    2
    3
```

because the right side of `+` is parsed by `parseMultiplication`.

#### The Top-Level Parse Function

Because `parsePrimary`, `parseExpression`, `parseAddition`, and `parseMultiplication` call each other recursively, Zig cannot infer their error sets. Giving one function in the cycle an explicit error set resolves this:

```zig
const ParseError = error{
    ExpectedExpression,
    ExpectedRightParen,
    ExpectedEndOfFile,
} || std.fmt.ParseIntError || std.mem.Allocator.Error;

fn parseExpression(self: *Parser) ParseError!*Expr {
    return self.parseAddition();
}

pub fn parse(self: *Parser) !*Expr {
    const expr = try self.parseExpression();

    if (self.current.tag != .eof) {
        return error.ExpectedEndOfFile;
    }

    return expr;
}
```

The parser accepts one full expression and then requires the end of the input.

This prevents partial parses.

Input:

```text
1 + 2 abc
```

should fail instead of silently accepting only `1 + 2`.
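
This behavior is easy to pin down with a test, assuming the `Parser` struct from the previous sections:

```zig
test "trailing input is rejected" {
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    defer arena.deinit();

    // "abc" follows a complete expression, so parse() must fail.
    var parser = Parser.init(arena.allocator(), "1 + 2 abc");
    try std.testing.expectError(error.ExpectedEndOfFile, parser.parse());
}
```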

#### Evaluating the AST

To check that the parser works, we can evaluate the tree.

```zig
fn eval(expr: *const Expr) i64 {
    return switch (expr.*) {
        .integer => |value| value,
        .add => |node| eval(node.left) + eval(node.right),
        .multiply => |node| eval(node.left) * eval(node.right),
    };
}
```

Example:

```zig
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    var arena = std.heap.ArenaAllocator.init(gpa.allocator());
    defer arena.deinit();

    var parser = Parser.init(arena.allocator(), "1 + 2 * 3");

    const expr = try parser.parse();

    std.debug.print("{}\n", .{eval(expr)});
}
```

Output:

```text
7
```

The AST gives the expression its correct meaning.

#### Error Messages

A parser should not only reject bad input. It should explain what went wrong.

This input:

```text
1 + )
```

should produce something like:

```text
expected expression, found ')'
```

For better errors, tokens should store source positions.

```zig
const Token = struct {
    tag: TokenTag,
    lexeme: []const u8,
    start: usize,
    end: usize,
};
```

Then the parser can report where the error happened.

```text
1 + )
    ^
expected expression
```
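
A minimal reporting helper could render that caret line. This is a sketch for single-line input only; the function name is my own, and `start` would come from the token's stored position:

```zig
fn reportError(source: []const u8, start: usize, message: []const u8) void {
    // Print the offending line, then a caret under the error position.
    std.debug.print("{s}\n", .{source});

    var i: usize = 0;
    while (i < start) : (i += 1) {
        std.debug.print(" ", .{});
    }

    std.debug.print("^\n{s}\n", .{message});
}
```

Multi-line input would additionally need line and column numbers derived from `start`.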

Good parser errors save enormous debugging time.

#### Recursive Descent

The parser style shown here is called recursive descent.

Each grammar rule becomes one function.

```text
expression      -> addition
addition        -> multiplication ("+" multiplication)*
multiplication  -> primary ("*" primary)*
primary         -> integer | "(" expression ")"
```

This maps directly to code:

```text
parseExpression
parseAddition
parseMultiplication
parsePrimary
```

Recursive descent is a good first parser style because it is easy to read and debug.

#### Parser Design Rules

Keep the tokenizer simple.

Keep parser functions small.

Use one function per grammar level.

Handle precedence explicitly.

Store source positions early.

Return errors instead of panicking.

Use an arena for AST nodes.

Test each layer separately.

#### The Main Idea

A parser turns text into structure.

In Zig, a simple parser is built from plain structs, enums, tagged unions, slices, and allocator-aware functions.

The tokenizer produces tokens. The parser consumes tokens. The AST stores meaning.

Start with a tiny grammar, test it carefully, then add one feature at a time.

