A lexer is the first stage of many programming language tools.
Its job is simple:
input text
↓
sequence of tokens
For example, this source code:
let x = 42 + y;
becomes:
KeywordLet
Identifier("x")
Equal
Number("42")
Plus
Identifier("y")
Semicolon
A lexer does not understand meaning yet. It only splits text into pieces.
Parsers, compilers, interpreters, formatters, syntax highlighters, and linters usually begin with lexing.
The Goal
We will build a lexer for a tiny language.
It will support:
identifiers
numbers
keywords
operators
punctuation
strings
whitespace skipping
comments
Input:
let answer = 42;
print(answer);
Output:
KeywordLet
Identifier(answer)
Equal
Number(42)
Semicolon
Identifier(print)
LeftParen
Identifier(answer)
RightParen
Semicolon
EOF
What Is a Token
A token has:
kind
text
position
Example:
Identifier("answer")
The token kind is:
Identifier
The text is:
answer
The lexer keeps the original text slice because later stages may need it.
Token Types
Start with an enum:
const TokenKind = enum {
eof,
identifier,
number,
string,
keyword_let,
keyword_if,
keyword_else,
keyword_return,
plus,
minus,
star,
slash,
equal,
left_paren,
right_paren,
left_brace,
right_brace,
comma,
semicolon,
};
This defines every token the lexer can produce.
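Enum members carry their names at compile time, so a kind can be printed by name with @tagName (the main program at the end of this article does exactly that). A small test sketch, assuming TokenKind sits in the same file as std:
test "token kinds have printable names" {
    const kind: TokenKind = .keyword_let;
    try std.testing.expectEqualStrings("keyword_let", @tagName(kind));
}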
Token Struct
Now define the token itself:
const Token = struct {
kind: TokenKind,
text: []const u8,
line: usize,
column: usize,
};
The lexer stores line and column numbers for error reporting.
Example:
unexpected token at line 3 column 14
Without positions, compiler errors become hard to understand.
The Lexer State
A lexer walks through text one character at a time.
const Lexer = struct {
input: []const u8,
index: usize,
line: usize,
column: usize,
};
The lexer tracks:
current byte index
current line
current column
Initialize the Lexer
Add:
fn init(input: []const u8) Lexer {
return .{
.input = input,
.index = 0,
.line = 1,
.column = 1,
};
}
We start at line 1, column 1.
Peek and Advance
The lexer needs helper functions.
fn peek(self: *Lexer) ?u8 {
if (self.index >= self.input.len) {
return null;
}
return self.input[self.index];
}
peek looks at the current character without moving.
Now add advance:
fn advance(self: *Lexer) ?u8 {
const ch = self.peek() orelse return null;
self.index += 1;
if (ch == '\n') {
self.line += 1;
self.column = 1;
} else {
self.column += 1;
}
return ch;
}
This moves the lexer forward.
Notice:
if (ch == '\n')
Newlines update both line and column counters.
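A minimal test sketch (assuming the Lexer above lives in the same file) shows the counters after crossing a newline:
test "advance tracks line and column" {
    var lexer = Lexer.init("a\nb");
    _ = lexer.advance(); // consumes 'a': column becomes 2
    _ = lexer.advance(); // consumes '\n': line becomes 2, column resets to 1
    try std.testing.expectEqual(@as(usize, 2), lexer.line);
    try std.testing.expectEqual(@as(usize, 1), lexer.column);
}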
Skip Whitespace
Programming languages usually ignore spaces and tabs between tokens.
fn skipWhitespace(self: *Lexer) void {
while (self.peek()) |ch| {
switch (ch) {
' ', '\t', '\r', '\n' => _ = self.advance(),
else => return,
}
}
}
This consumes whitespace until a non-whitespace character appears.
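A quick test sketch (again assuming the Lexer is in the same file) confirms that the first visible character is left for the next step:
test "skipWhitespace stops at the first visible character" {
    var lexer = Lexer.init("  \t\n  x");
    lexer.skipWhitespace();
    try std.testing.expectEqual(@as(u8, 'x'), lexer.peek() orelse 0);
}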
Create Tokens
Add a helper:
fn makeToken(
self: *Lexer,
kind: TokenKind,
start: usize,
end: usize,
line: usize,
column: usize,
) Token {
return Token{
.kind = kind,
.text = self.input[start..end],
.line = line,
.column = column,
};
}
A token references a slice of the original input.
The lexer does not allocate new strings. (The lexing functions later in this article build Token values directly, so this helper is optional.)
Lexing Identifiers
Identifiers look like:
x
answer
my_variable
Keywords also start as identifiers.
Add helpers:
fn isIdentifierStart(ch: u8) bool {
return std.ascii.isAlphabetic(ch) or ch == '_';
}
fn isIdentifierContinue(ch: u8) bool {
return std.ascii.isAlphanumeric(ch) or ch == '_';
}
Now add identifier lexing:
fn lexIdentifier(self: *Lexer) Token {
const start = self.index;
const line = self.line;
const column = self.column;
_ = self.advance();
while (self.peek()) |ch| {
if (!isIdentifierContinue(ch)) {
break;
}
_ = self.advance();
}
const text = self.input[start..self.index];
const kind = keywordKind(text) orelse .identifier;
return Token{
.kind = kind,
.text = text,
.line = line,
.column = column,
};
}
Recognizing Keywords
Keywords are reserved words:
let
if
else
return
Add:
fn keywordKind(text: []const u8) ?TokenKind {
if (std.mem.eql(u8, text, "let")) return .keyword_let;
if (std.mem.eql(u8, text, "if")) return .keyword_if;
if (std.mem.eql(u8, text, "else")) return .keyword_else;
if (std.mem.eql(u8, text, "return")) return .keyword_return;
return null;
}
If the text matches a keyword, the token becomes a keyword token.
Otherwise it stays an identifier.
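The chain of comparisons is fine for four keywords. If your Zig version ships std.StaticStringMap (the successor to std.ComptimeStringMap), a lookup table is a possible alternative; this is a sketch under that assumption, and keywordKindViaMap is a made-up name:
const keywords = std.StaticStringMap(TokenKind).initComptime(.{
    .{ "let", .keyword_let },
    .{ "if", .keyword_if },
    .{ "else", .keyword_else },
    .{ "return", .keyword_return },
});

fn keywordKindViaMap(text: []const u8) ?TokenKind {
    // get returns null when the text is not a keyword.
    return keywords.get(text);
}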
Lexing Numbers
Numbers are sequences of digits.
fn lexNumber(self: *Lexer) Token {
const start = self.index;
const line = self.line;
const column = self.column;
while (self.peek()) |ch| {
if (!std.ascii.isDigit(ch)) {
break;
}
_ = self.advance();
}
return Token{
.kind = .number,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
}
This accepts:
0
123
9999
This first lexer only supports integers.
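If you later want decimals such as 3.14, one possible extension is to accept an optional fractional part. This is a sketch of what a replacement for lexNumber inside the Lexer struct could look like; lexNumberWithFraction is a made-up name, and it happily accepts a trailing dot, so treat it as a starting point:
fn lexNumberWithFraction(self: *Lexer) Token {
    const start = self.index;
    const line = self.line;
    const column = self.column;
    while (self.peek()) |ch| {
        if (!std.ascii.isDigit(ch)) break;
        _ = self.advance();
    }
    // Optionally consume a '.' followed by more digits.
    if ((self.peek() orelse 0) == '.') {
        _ = self.advance();
        while (self.peek()) |ch| {
            if (!std.ascii.isDigit(ch)) break;
            _ = self.advance();
        }
    }
    return Token{
        .kind = .number,
        .text = self.input[start..self.index],
        .line = line,
        .column = column,
    };
}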
Lexing Strings
Strings begin and end with double quotes.
fn lexString(self: *Lexer) !Token {
const start = self.index;
const line = self.line;
const column = self.column;
_ = self.advance();
while (self.peek()) |ch| {
if (ch == '"') {
_ = self.advance();
return Token{
.kind = .string,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
}
if (ch == '\n') {
return error.UnterminatedString;
}
_ = self.advance();
}
return error.UnterminatedString;
}
This lexer keeps the quotes inside the token text:
"hello"
Later stages can remove them if needed.
Single Character Tokens
Operators and punctuation are simpler.
'+' -> plus
'-' -> minus
'(' -> left_paren
';' -> semicolon
We can lex them directly.
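The main lex function below gives each of these its own switch arm. One possible way to cut the repetition is a small mapping helper; singleCharKind is a made-up name and the rest of this article does not use it:
fn singleCharKind(ch: u8) ?TokenKind {
    return switch (ch) {
        '+' => .plus,
        '-' => .minus,
        '*' => .star,
        '=' => .equal,
        '(' => .left_paren,
        ')' => .right_paren,
        '{' => .left_brace,
        '}' => .right_brace,
        ',' => .comma,
        ';' => .semicolon,
        // '/' is left out because it may start a comment instead.
        else => null,
    };
}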
Lexing Comments
Add support for line comments:
// this is a comment
Add:
fn skipComment(self: *Lexer) void {
while (self.peek()) |ch| {
if (ch == '\n') {
return;
}
_ = self.advance();
}
}
The Main Lex Function
Now combine everything.
fn nextToken(self: *Lexer) !Token {
while (true) {
self.skipWhitespace();
const start = self.index;
const line = self.line;
const column = self.column;
const ch = self.peek() orelse {
return Token{
.kind = .eof,
.text = "",
.line = line,
.column = column,
};
};
if (isIdentifierStart(ch)) {
return self.lexIdentifier();
}
if (std.ascii.isDigit(ch)) {
return self.lexNumber();
}
switch (ch) {
'"' => return try self.lexString(),
'/' => {
_ = self.advance();
if ((self.peek() orelse 0) == '/') {
_ = self.advance();
self.skipComment();
continue;
}
return Token{
.kind = .slash,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'+' => {
_ = self.advance();
return Token{
.kind = .plus,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'-' => {
_ = self.advance();
return Token{
.kind = .minus,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'*' => {
_ = self.advance();
return Token{
.kind = .star,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'=' => {
_ = self.advance();
return Token{
.kind = .equal,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'(' => {
_ = self.advance();
return Token{
.kind = .left_paren,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
')' => {
_ = self.advance();
return Token{
.kind = .right_paren,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'{' => {
_ = self.advance();
return Token{
.kind = .left_brace,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'}' => {
_ = self.advance();
return Token{
.kind = .right_brace,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
',' => {
_ = self.advance();
return Token{
.kind = .comma,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
';' => {
_ = self.advance();
return Token{
.kind = .semicolon,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
else => return error.InvalidCharacter,
}
}
}
Complete Program
Put this in src/main.zig:
const std = @import("std");
const TokenKind = enum {
eof,
identifier,
number,
string,
keyword_let,
keyword_if,
keyword_else,
keyword_return,
plus,
minus,
star,
slash,
equal,
left_paren,
right_paren,
left_brace,
right_brace,
comma,
semicolon,
};
const Token = struct {
kind: TokenKind,
text: []const u8,
line: usize,
column: usize,
};
const Lexer = struct {
input: []const u8,
index: usize,
line: usize,
column: usize,
fn init(input: []const u8) Lexer {
return .{
.input = input,
.index = 0,
.line = 1,
.column = 1,
};
}
fn peek(self: *Lexer) ?u8 {
if (self.index >= self.input.len) {
return null;
}
return self.input[self.index];
}
fn advance(self: *Lexer) ?u8 {
const ch = self.peek() orelse return null;
self.index += 1;
if (ch == '\n') {
self.line += 1;
self.column = 1;
} else {
self.column += 1;
}
return ch;
}
fn skipWhitespace(self: *Lexer) void {
while (self.peek()) |ch| {
switch (ch) {
' ', '\t', '\r', '\n' => _ = self.advance(),
else => return,
}
}
}
fn skipComment(self: *Lexer) void {
while (self.peek()) |ch| {
if (ch == '\n') {
return;
}
_ = self.advance();
}
}
fn lexIdentifier(self: *Lexer) Token {
const start = self.index;
const line = self.line;
const column = self.column;
_ = self.advance();
while (self.peek()) |ch| {
if (!isIdentifierContinue(ch)) {
break;
}
_ = self.advance();
}
const text = self.input[start..self.index];
return Token{
.kind = keywordKind(text) orelse .identifier,
.text = text,
.line = line,
.column = column,
};
}
fn lexNumber(self: *Lexer) Token {
const start = self.index;
const line = self.line;
const column = self.column;
while (self.peek()) |ch| {
if (!std.ascii.isDigit(ch)) {
break;
}
_ = self.advance();
}
return Token{
.kind = .number,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
}
fn lexString(self: *Lexer) !Token {
const start = self.index;
const line = self.line;
const column = self.column;
_ = self.advance();
while (self.peek()) |ch| {
if (ch == '"') {
_ = self.advance();
return Token{
.kind = .string,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
}
if (ch == '\n') {
return error.UnterminatedString;
}
_ = self.advance();
}
return error.UnterminatedString;
}
fn nextToken(self: *Lexer) !Token {
while (true) {
self.skipWhitespace();
const start = self.index;
const line = self.line;
const column = self.column;
const ch = self.peek() orelse {
return Token{
.kind = .eof,
.text = "",
.line = line,
.column = column,
};
};
if (isIdentifierStart(ch)) {
return self.lexIdentifier();
}
if (std.ascii.isDigit(ch)) {
return self.lexNumber();
}
switch (ch) {
'"' => return try self.lexString(),
'/' => {
_ = self.advance();
if ((self.peek() orelse 0) == '/') {
_ = self.advance();
self.skipComment();
continue;
}
return Token{
.kind = .slash,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'+' => {
_ = self.advance();
return Token{
.kind = .plus,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'-' => {
_ = self.advance();
return Token{
.kind = .minus,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'*' => {
_ = self.advance();
return Token{
.kind = .star,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'=' => {
_ = self.advance();
return Token{
.kind = .equal,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'(' => {
_ = self.advance();
return Token{
.kind = .left_paren,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
')' => {
_ = self.advance();
return Token{
.kind = .right_paren,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'{' => {
_ = self.advance();
return Token{
.kind = .left_brace,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
'}' => {
_ = self.advance();
return Token{
.kind = .right_brace,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
',' => {
_ = self.advance();
return Token{
.kind = .comma,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
';' => {
_ = self.advance();
return Token{
.kind = .semicolon,
.text = self.input[start..self.index],
.line = line,
.column = column,
};
},
else => return error.InvalidCharacter,
}
}
}
};
fn isIdentifierStart(ch: u8) bool {
return std.ascii.isAlphabetic(ch) or ch == '_';
}
fn isIdentifierContinue(ch: u8) bool {
return std.ascii.isAlphanumeric(ch) or ch == '_';
}
fn keywordKind(text: []const u8) ?TokenKind {
if (std.mem.eql(u8, text, "let")) return .keyword_let;
if (std.mem.eql(u8, text, "if")) return .keyword_if;
if (std.mem.eql(u8, text, "else")) return .keyword_else;
if (std.mem.eql(u8, text, "return")) return .keyword_return;
return null;
}
pub fn main() !void {
const source =
\\let answer = 42;
\\print(answer);
\\
\\// this is a comment
\\if answer {
\\ return "done";
\\}
;
var lexer = Lexer.init(source);
while (true) {
const token = try lexer.nextToken();
std.debug.print(
"{s:<20} text={s:<12} line={d} column={d}\n",
.{
@tagName(token.kind),
token.text,
token.line,
token.column,
},
);
if (token.kind == .eof) {
break;
}
}
}
Run:
zig build run
Example output:
keyword_let text=let line=1 column=1
identifier text=answer line=1 column=5
equal text== line=1 column=12
number text=42 line=1 column=14
semicolon text=; line=1 column=16
identifier text=print line=2 column=1
left_paren text=( line=2 column=6
identifier text=answer line=2 column=7
right_paren text=) line=2 column=13
semicolon text=; line=2 column=14
Why Lexers Usually Use Slices
Notice that tokens store:
text: []const u8
The lexer does not allocate new strings for every token.
Instead, each token references the original source text.
That is efficient because lexers create many tokens.
Lexing a large file might produce hundreds of thousands of tokens.
Avoiding allocations matters.
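A small test sketch (assuming the complete program above) can confirm that a token borrows the source buffer rather than copying it:
test "token text aliases the source buffer" {
    const source: []const u8 = "answer";
    var lexer = Lexer.init(source);
    const token = try lexer.nextToken();
    try std.testing.expectEqualStrings("answer", token.text);
    // Same address: the token points into the input, it does not copy it.
    try std.testing.expect(token.text.ptr == source.ptr);
}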
Why Comments Are Removed Here
The lexer skips comments:
self.skipComment();
continue;
So comments never become tokens.
That is common in compilers because comments usually do not affect execution.
Some tools preserve comments:
formatters
documentation generators
IDEs
Those tools often emit comment tokens too.
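If you wanted comments to survive lexing, one option is to add a comment member to TokenKind and return a token from the '/' case instead of calling skipComment. Here is a sketch of such a helper; lexComment is a made-up name, and start, line, and column are the values captured at the top of nextToken:
fn lexComment(self: *Lexer, start: usize, line: usize, column: usize) Token {
    // Consume up to (but not including) the newline; the text keeps the "//".
    while (self.peek()) |ch| {
        if (ch == '\n') break;
        _ = self.advance();
    }
    return Token{
        .kind = .comment, // assumes a new `comment` member in TokenKind
        .text = self.input[start..self.index],
        .line = line,
        .column = column,
    };
}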
Why Line and Column Tracking Matters
Without positions, compiler errors become frustrating.
This is bad:
syntax error
This is much better:
syntax error at line 12 column 7
Position tracking begins in the lexer because the lexer sees every character.
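Because every token carries a position, later stages can produce messages like the one above almost for free. A sketch of such a report; reportError is a hypothetical helper, not part of the lexer:
fn reportError(token: Token, message: []const u8) void {
    std.debug.print("{s} at line {d} column {d}\n", .{
        message,
        token.line,
        token.column,
    });
}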
What You Learned
You built a lexer for a small programming language.
You scanned identifiers, keywords, numbers, strings, operators, and punctuation.
You skipped whitespace and comments.
You tracked line and column positions.
You returned tokens as slices into the original source.
This is the first stage of many language tools. Parsers, compilers, interpreters, linters, and syntax highlighters usually start with lexing.