# Build a Programming Language Lexer

A lexer is the first stage of many programming language tools.

Its job is simple:

```text
input text
    ↓
sequence of tokens
```

For example, this source code:

```text
let x = 42 + y;
```

becomes:

```text
KeywordLet
Identifier("x")
Equal
Number("42")
Plus
Identifier("y")
Semicolon
```

A lexer does not understand meaning yet. It only splits text into pieces.

Parsers, compilers, interpreters, formatters, syntax highlighters, and linters usually begin with lexing.

#### The Goal

We will build a lexer for a tiny language.

It will support:

```text
identifiers
numbers
keywords
operators
punctuation
strings
whitespace skipping
comments
```

Input:

```text
let answer = 42;
print(answer);
```

Output:

```text
KeywordLet
Identifier(answer)
Equal
Number(42)
Semicolon
Identifier(print)
LeftParen
Identifier(answer)
RightParen
Semicolon
EOF
```

#### What Is a Token

A token has:

```text
kind
text
position
```

Example:

```text
Identifier("answer")
```

The token kind is:

```text
Identifier
```

The text is:

```text
answer
```

The lexer keeps the original text slice because later stages may need it.

#### Token Types

Start with an enum:

```zig
const TokenKind = enum {
    eof,

    identifier,
    number,
    string,

    keyword_let,
    keyword_if,
    keyword_else,
    keyword_return,

    plus,
    minus,
    star,
    slash,
    equal,

    left_paren,
    right_paren,
    left_brace,
    right_brace,

    comma,
    semicolon,
};
```

This defines every token the lexer can produce.

#### Token Struct

Now define the token itself:

```zig
const Token = struct {
    kind: TokenKind,
    text: []const u8,
    line: usize,
    column: usize,
};
```

The lexer stores line and column numbers for error reporting.

Example:

```text
unexpected token at line 3 column 14
```

Without positions, compiler errors become hard to understand.
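With line and column stored on every token, producing such a message is a single print call. A minimal sketch, assuming `token` is any `Token` value the lexer produced:

```zig
// Report a token's position; `token` is assumed to be any Token value.
std.debug.print(
    "unexpected token '{s}' at line {d} column {d}\n",
    .{ token.text, token.line, token.column },
);
```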

#### The Lexer State

A lexer walks through text one character at a time.

```zig
const Lexer = struct {
    input: []const u8,
    index: usize,
    line: usize,
    column: usize,
};
```

The lexer tracks:

```text
current byte index
current line
current column
```

#### Initialize the Lexer

Add:

```zig
fn init(input: []const u8) Lexer {
    return .{
        .input = input,
        .index = 0,
        .line = 1,
        .column = 1,
    };
}
```

We start at line 1, column 1.

#### Peek and Advance

The lexer needs helper functions.

```zig
fn peek(self: *Lexer) ?u8 {
    if (self.index >= self.input.len) {
        return null;
    }

    return self.input[self.index];
}
```

`peek` looks at the current character without moving.

Now add `advance`:

```zig
fn advance(self: *Lexer) ?u8 {
    const ch = self.peek() orelse return null;

    self.index += 1;

    if (ch == '\n') {
        self.line += 1;
        self.column = 1;
    } else {
        self.column += 1;
    }

    return ch;
}
```

This moves the lexer forward.

Notice:

```zig
if (ch == '\n')
```

Newlines update both line and column counters.

#### Skip Whitespace

Programming languages usually ignore spaces and tabs between tokens.

```zig
fn skipWhitespace(self: *Lexer) void {
    while (self.peek()) |ch| {
        switch (ch) {
            ' ', '\t', '\r', '\n' => _ = self.advance(),
            else => return,
        }
    }
}
```

This consumes whitespace until a non-whitespace character appears.

#### Create Tokens

Add a helper:

```zig
fn makeToken(
    self: *Lexer,
    kind: TokenKind,
    start: usize,
    end: usize,
    line: usize,
    column: usize,
) Token {
    return Token{
        .kind = kind,
        .text = self.input[start..end],
        .line = line,
        .column = column,
    };
}
```

A token references a slice of the original input.

The lexer does not allocate new strings.

The complete program at the end constructs `Token` values inline rather than calling this helper, but the pattern is the same.

#### Lexing Identifiers

Identifiers look like:

```text
x
answer
my_variable
```

Keywords look like identifiers, so the lexer scans them as identifiers first and reclassifies them afterward.

Add helpers:

```zig
fn isIdentifierStart(ch: u8) bool {
    return std.ascii.isAlphabetic(ch) or ch == '_';
}

fn isIdentifierContinue(ch: u8) bool {
    return std.ascii.isAlphanumeric(ch) or ch == '_';
}
```

Now add identifier lexing:

```zig
fn lexIdentifier(self: *Lexer) Token {
    const start = self.index;
    const line = self.line;
    const column = self.column;

    _ = self.advance();

    while (self.peek()) |ch| {
        if (!isIdentifierContinue(ch)) {
            break;
        }

        _ = self.advance();
    }

    const text = self.input[start..self.index];

    const kind = keywordKind(text) orelse .identifier;

    return Token{
        .kind = kind,
        .text = text,
        .line = line,
        .column = column,
    };
}
```

#### Recognizing Keywords

Keywords are reserved words:

```text
let
if
else
return
```

Add:

```zig
fn keywordKind(text: []const u8) ?TokenKind {
    if (std.mem.eql(u8, text, "let")) return .keyword_let;
    if (std.mem.eql(u8, text, "if")) return .keyword_if;
    if (std.mem.eql(u8, text, "else")) return .keyword_else;
    if (std.mem.eql(u8, text, "return")) return .keyword_return;

    return null;
}
```

If the text matches a keyword, the token becomes a keyword token.

Otherwise it stays an identifier.
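The chain of `std.mem.eql` calls is fine for four keywords. For larger keyword sets, the standard library offers a compile-time string map. A sketch, assuming a recent Zig where the type is named `std.StaticStringMap` (older releases call it `std.ComptimeStringMap`):

```zig
// Compile-time keyword table; assumes Zig 0.12+ naming.
const keywords = std.StaticStringMap(TokenKind).initComptime(.{
    .{ "let", .keyword_let },
    .{ "if", .keyword_if },
    .{ "else", .keyword_else },
    .{ "return", .keyword_return },
});

fn keywordKind(text: []const u8) ?TokenKind {
    // get returns null when text is not a keyword.
    return keywords.get(text);
}
```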

#### Lexing Numbers

Numbers are sequences of digits.

```zig
fn lexNumber(self: *Lexer) Token {
    const start = self.index;
    const line = self.line;
    const column = self.column;

    while (self.peek()) |ch| {
        if (!std.ascii.isDigit(ch)) {
            break;
        }

        _ = self.advance();
    }

    return Token{
        .kind = .number,
        .text = self.input[start..self.index],
        .line = line,
        .column = column,
    };
}
```

This accepts:

```text
0
123
9999
```

This first lexer only supports integers.
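Decimal numbers would be a small extension. A sketch of the extra code that could follow the digit loop inside `lexNumber` (caveat: as written it also accepts a trailing dot like `3.`, which a stricter lexer would peek ahead to reject):

```zig
// Hypothetical fractional part: 3.14, 0.5, and so on.
if (self.peek() == '.') {
    _ = self.advance(); // consume '.'

    while (self.peek()) |ch| {
        if (!std.ascii.isDigit(ch)) break;
        _ = self.advance();
    }
}
```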

#### Lexing Strings

Strings begin and end with quotes.

```zig
fn lexString(self: *Lexer) !Token {
    const start = self.index;
    const line = self.line;
    const column = self.column;

    _ = self.advance();

    while (self.peek()) |ch| {
        if (ch == '"') {
            _ = self.advance();

            return Token{
                .kind = .string,
                .text = self.input[start..self.index],
                .line = line,
                .column = column,
            };
        }

        if (ch == '\n') {
            return error.UnterminatedString;
        }

        _ = self.advance();
    }

    return error.UnterminatedString;
}
```

This lexer keeps the quotes inside the token text:

```text
"hello"
```

Later stages can remove them if needed.
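Removing the quotes is just another slice. A sketch of a helper a later stage might use, assuming the token kind is `.string` so the text is at least two characters:

```zig
// Returns the characters between the quotes; no copy is made.
fn stringContents(token: Token) []const u8 {
    return token.text[1 .. token.text.len - 1];
}
```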

#### Single Character Tokens

Operators and punctuation are simpler.

```text
'+' -> plus
'-' -> minus
'(' -> left_paren
';' -> semicolon
```

We can lex them directly.
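Since every punctuation case produces the same shape of token, the repetition in the main lex function could be collapsed with a small lookup helper. A refactoring sketch, not part of the final listing:

```zig
fn singleCharKind(ch: u8) ?TokenKind {
    return switch (ch) {
        '+' => .plus,
        '-' => .minus,
        '*' => .star,
        '=' => .equal,
        '(' => .left_paren,
        ')' => .right_paren,
        '{' => .left_brace,
        '}' => .right_brace,
        ',' => .comma,
        ';' => .semicolon,
        else => null,
    };
}
```

The main loop could then handle all ten characters in one branch instead of ten.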

#### Lexing Comments

Add support for line comments:

```text
// this is a comment
```

Add:

```zig
fn skipComment(self: *Lexer) void {
    while (self.peek()) |ch| {
        if (ch == '\n') {
            return;
        }

        _ = self.advance();
    }
}
```
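Block comments extend the same idea. A sketch of a hypothetical `skipBlockComment`, not part of the final program (as written, it silently tolerates an unterminated `/*` at end of input):

```zig
fn skipBlockComment(self: *Lexer) void {
    // Consume until the closing */ or the end of input.
    while (self.advance()) |ch| {
        if (ch == '*' and self.peek() == '/') {
            _ = self.advance();
            return;
        }
    }
}
```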

#### The Main Lex Function

Now combine everything.

```zig
fn nextToken(self: *Lexer) !Token {
    while (true) {
        self.skipWhitespace();

        const start = self.index;
        const line = self.line;
        const column = self.column;

        const ch = self.peek() orelse {
            return Token{
                .kind = .eof,
                .text = "",
                .line = line,
                .column = column,
            };
        };

        if (isIdentifierStart(ch)) {
            return self.lexIdentifier();
        }

        if (std.ascii.isDigit(ch)) {
            return self.lexNumber();
        }

        switch (ch) {
            '"' => return try self.lexString(),

            '/' => {
                _ = self.advance();

                if (self.peek() == '/') {
                    _ = self.advance();
                    self.skipComment();
                    continue;
                }

                return Token{
                    .kind = .slash,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            '+' => {
                _ = self.advance();

                return Token{
                    .kind = .plus,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            '-' => {
                _ = self.advance();

                return Token{
                    .kind = .minus,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            '*' => {
                _ = self.advance();

                return Token{
                    .kind = .star,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            '=' => {
                _ = self.advance();

                return Token{
                    .kind = .equal,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            '(' => {
                _ = self.advance();

                return Token{
                    .kind = .left_paren,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            ')' => {
                _ = self.advance();

                return Token{
                    .kind = .right_paren,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            '{' => {
                _ = self.advance();

                return Token{
                    .kind = .left_brace,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            '}' => {
                _ = self.advance();

                return Token{
                    .kind = .right_brace,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            ',' => {
                _ = self.advance();

                return Token{
                    .kind = .comma,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            ';' => {
                _ = self.advance();

                return Token{
                    .kind = .semicolon,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            },

            else => return error.InvalidCharacter,
        }
    }
}
```
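This tiny language has only single-character operators, so one `peek` is always enough. Two-character operators such as `==` follow the same pattern with one more lookahead. A sketch of how the `'='` case would change, assuming a hypothetical `equal_equal` variant were added to `TokenKind`:

```zig
'=' => {
    _ = self.advance();

    // One more peek decides between '=' and '=='.
    if (self.peek() == '=') {
        _ = self.advance();

        return Token{
            .kind = .equal_equal, // hypothetical variant
            .text = self.input[start..self.index],
            .line = line,
            .column = column,
        };
    }

    return Token{
        .kind = .equal,
        .text = self.input[start..self.index],
        .line = line,
        .column = column,
    };
},
```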

#### Complete Program

Put this in `src/main.zig`:

```zig
const std = @import("std");

const TokenKind = enum {
    eof,

    identifier,
    number,
    string,

    keyword_let,
    keyword_if,
    keyword_else,
    keyword_return,

    plus,
    minus,
    star,
    slash,
    equal,

    left_paren,
    right_paren,
    left_brace,
    right_brace,

    comma,
    semicolon,
};

const Token = struct {
    kind: TokenKind,
    text: []const u8,
    line: usize,
    column: usize,
};

const Lexer = struct {
    input: []const u8,
    index: usize,
    line: usize,
    column: usize,

    fn init(input: []const u8) Lexer {
        return .{
            .input = input,
            .index = 0,
            .line = 1,
            .column = 1,
        };
    }

    fn peek(self: *Lexer) ?u8 {
        if (self.index >= self.input.len) {
            return null;
        }

        return self.input[self.index];
    }

    fn advance(self: *Lexer) ?u8 {
        const ch = self.peek() orelse return null;

        self.index += 1;

        if (ch == '\n') {
            self.line += 1;
            self.column = 1;
        } else {
            self.column += 1;
        }

        return ch;
    }

    fn skipWhitespace(self: *Lexer) void {
        while (self.peek()) |ch| {
            switch (ch) {
                ' ', '\t', '\r', '\n' => _ = self.advance(),
                else => return,
            }
        }
    }

    fn skipComment(self: *Lexer) void {
        while (self.peek()) |ch| {
            if (ch == '\n') {
                return;
            }

            _ = self.advance();
        }
    }

    fn lexIdentifier(self: *Lexer) Token {
        const start = self.index;
        const line = self.line;
        const column = self.column;

        _ = self.advance();

        while (self.peek()) |ch| {
            if (!isIdentifierContinue(ch)) {
                break;
            }

            _ = self.advance();
        }

        const text = self.input[start..self.index];

        return Token{
            .kind = keywordKind(text) orelse .identifier,
            .text = text,
            .line = line,
            .column = column,
        };
    }

    fn lexNumber(self: *Lexer) Token {
        const start = self.index;
        const line = self.line;
        const column = self.column;

        while (self.peek()) |ch| {
            if (!std.ascii.isDigit(ch)) {
                break;
            }

            _ = self.advance();
        }

        return Token{
            .kind = .number,
            .text = self.input[start..self.index],
            .line = line,
            .column = column,
        };
    }

    fn lexString(self: *Lexer) !Token {
        const start = self.index;
        const line = self.line;
        const column = self.column;

        _ = self.advance();

        while (self.peek()) |ch| {
            if (ch == '"') {
                _ = self.advance();

                return Token{
                    .kind = .string,
                    .text = self.input[start..self.index],
                    .line = line,
                    .column = column,
                };
            }

            if (ch == '\n') {
                return error.UnterminatedString;
            }

            _ = self.advance();
        }

        return error.UnterminatedString;
    }

    fn nextToken(self: *Lexer) !Token {
        while (true) {
            self.skipWhitespace();

            const start = self.index;
            const line = self.line;
            const column = self.column;

            const ch = self.peek() orelse {
                return Token{
                    .kind = .eof,
                    .text = "",
                    .line = line,
                    .column = column,
                };
            };

            if (isIdentifierStart(ch)) {
                return self.lexIdentifier();
            }

            if (std.ascii.isDigit(ch)) {
                return self.lexNumber();
            }

            switch (ch) {
                '"' => return try self.lexString(),

                '/' => {
                    _ = self.advance();

                    if (self.peek() == '/') {
                        _ = self.advance();
                        self.skipComment();
                        continue;
                    }

                    return Token{
                        .kind = .slash,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                '+' => {
                    _ = self.advance();
                    return Token{
                        .kind = .plus,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                '-' => {
                    _ = self.advance();
                    return Token{
                        .kind = .minus,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                '*' => {
                    _ = self.advance();
                    return Token{
                        .kind = .star,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                '=' => {
                    _ = self.advance();
                    return Token{
                        .kind = .equal,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                '(' => {
                    _ = self.advance();
                    return Token{
                        .kind = .left_paren,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                ')' => {
                    _ = self.advance();
                    return Token{
                        .kind = .right_paren,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                '{' => {
                    _ = self.advance();
                    return Token{
                        .kind = .left_brace,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                '}' => {
                    _ = self.advance();
                    return Token{
                        .kind = .right_brace,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                ',' => {
                    _ = self.advance();
                    return Token{
                        .kind = .comma,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                ';' => {
                    _ = self.advance();
                    return Token{
                        .kind = .semicolon,
                        .text = self.input[start..self.index],
                        .line = line,
                        .column = column,
                    };
                },

                else => return error.InvalidCharacter,
            }
        }
    }
};

fn isIdentifierStart(ch: u8) bool {
    return std.ascii.isAlphabetic(ch) or ch == '_';
}

fn isIdentifierContinue(ch: u8) bool {
    return std.ascii.isAlphanumeric(ch) or ch == '_';
}

fn keywordKind(text: []const u8) ?TokenKind {
    if (std.mem.eql(u8, text, "let")) return .keyword_let;
    if (std.mem.eql(u8, text, "if")) return .keyword_if;
    if (std.mem.eql(u8, text, "else")) return .keyword_else;
    if (std.mem.eql(u8, text, "return")) return .keyword_return;

    return null;
}

pub fn main() !void {
    const source =
        \\let answer = 42;
        \\print(answer);
        \\
        \\// this is a comment
        \\if answer {
        \\    return "done";
        \\}
    ;

    var lexer = Lexer.init(source);

    while (true) {
        const token = try lexer.nextToken();

        std.debug.print(
            "{s:<20} text={s:<12} line={d} column={d}\n",
            .{
                @tagName(token.kind),
                token.text,
                token.line,
                token.column,
            },
        );

        if (token.kind == .eof) {
            break;
        }
    }
}
```

Run it from a project created with `zig init`:

```bash
zig build run
```

Or run the file directly:

```bash
zig run src/main.zig
```

Example output:

```text
keyword_let         text=let          line=1 column=1
identifier          text=answer       line=1 column=5
equal               text==            line=1 column=12
number              text=42           line=1 column=14
semicolon           text=;            line=1 column=16
identifier          text=print        line=2 column=1
left_paren          text=(            line=2 column=6
identifier          text=answer       line=2 column=7
right_paren         text=)            line=2 column=13
semicolon           text=;            line=2 column=14
keyword_if          text=if           line=5 column=1
identifier          text=answer       line=5 column=4
left_brace          text={            line=5 column=11
keyword_return      text=return       line=6 column=5
string              text="done"       line=6 column=12
semicolon           text=;            line=6 column=18
right_brace         text=}            line=7 column=1
eof                 text=             line=7 column=2
```

#### Why Lexers Usually Use Slices

Notice that tokens store:

```zig
text: []const u8
```

The lexer does not allocate new strings for every token.

Instead, each token references the original source text.

That is efficient because lexers create many tokens.

A lexer for a large file might produce hundreds of thousands of tokens.

Avoiding allocations matters.
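This zero-copy property is easy to check. A sketch of a test that could be appended to `src/main.zig` and run with `zig test src/main.zig`:

```zig
test "tokens are slices into the source" {
    const source: []const u8 = "let x = 1;";
    var lexer = Lexer.init(source);

    const tok = try lexer.nextToken();

    try std.testing.expectEqualStrings("let", tok.text);

    // The token text points into the source buffer itself; nothing was copied.
    try std.testing.expectEqual(@intFromPtr(source.ptr), @intFromPtr(tok.text.ptr));
}
```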

#### Why Comments Are Removed Here

The lexer skips comments:

```zig
self.skipComment();
continue;
```

So comments never become tokens.

That is common in compilers because comments usually do not affect execution.

Some tools preserve comments:

```text
formatters
documentation generators
IDEs
```

Those tools often emit comment tokens too.

#### Why Line and Column Tracking Matters

Without positions, compiler errors become frustrating.

This is bad:

```text
syntax error
```

This is much better:

```text
syntax error at line 12 column 7
```

Position tracking begins in the lexer because the lexer sees every character.

#### What You Learned

You built a lexer for a small programming language.

You scanned identifiers, keywords, numbers, strings, operators, and punctuation.

You skipped whitespace and comments.

You tracked line and column positions.

You returned tokens as slices into the original source.

This is the first stage of many language tools. Parsers, compilers, interpreters, linters, and syntax highlighters usually start with lexing.

