When people talk about Zig compiler internals, they often mention stage2.
The name can be confusing at first because it sounds like a single feature. It is better to understand it as part of Zig’s compiler history.
Zig started with an original compiler written in C++. Over time, the project moved toward a newer compiler written mostly in Zig itself. That newer self-hosted compiler work was commonly called stage2.
So, in simple terms:
stage1 = older compiler path
stage2 = newer self-hosted compiler path

The goal of stage2 was not just to rewrite the old compiler line by line. The goal was to build the compiler architecture Zig needed for the long term.
Why Stage2 Exists
A programming language compiler is one of the hardest programs to write.
It must parse code, check types, report errors, run compile-time code, generate machine code, link programs, support many platforms, and stay fast enough for daily use.
The early Zig compiler was good enough to grow the language, but Zig needed a stronger foundation.
Stage2 exists because Zig needed:
a compiler written in Zig
better compile-time execution
better error messages
better incremental compilation support
better cross-compilation support
better control over code generation
less dependence on older compiler architecture

The most important idea is self-hosting.
A self-hosted compiler is a compiler for a language that is written in that same language.
For Zig, this means:
Zig compiler written in Zig

This matters because the compiler itself becomes a large real-world test of the language.
If Zig can implement Zig, then Zig is capable of building large systems software.
What “Stage” Means
The word stage comes from compiler bootstrapping.
Bootstrapping means building a compiler using an existing compiler.
Imagine you are creating a new language called X.
At first, there is no X compiler written in X. So you might write the first compiler in C.
Then, after the language is strong enough, you write a new compiler in X itself.
The process looks like this:
old compiler builds new compiler
new compiler builds user programs
new compiler eventually builds itself

That is why people use words like stage1, stage2, and sometimes stage3.
A simplified view:
stage1 compiler
↓ builds
stage2 compiler
↓ builds
Zig programs

Later, when the new compiler can compile itself reliably, the project can depend less on the older stage.
Stage2 and Self-Hosting
Self-hosting is not only symbolic. It has practical value.
When the compiler is written in Zig, compiler developers use Zig every day to build Zig itself.
That creates pressure to improve the language in real ways:
better compile times
better standard library APIs
better memory management patterns
better debugging tools
better build system behavior
better error reporting

A language improves differently when its own compiler depends on it.
Tiny annoyances become obvious. Missing features become painful. Slow paths become expensive.
Self-hosting forces the language to face its own design.
The Main Compiler Pipeline
Stage2 follows the same broad compiler pipeline you saw earlier:
source code
↓
tokenizer
↓
parser
↓
AST
↓
ZIR
↓
semantic analysis
↓
AIR
↓
code generation
↓
linking

Each stage changes the program into a form that is easier for the compiler to work with.
The source code is for humans.
The AST describes the syntax.
ZIR is a lowered representation of Zig code.
Semantic analysis checks meaning.
AIR represents analyzed code.
Code generation turns the analyzed program into target-specific output.
Linking produces the final artifact.
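As a rough illustration, here is a tiny Zig program annotated with what each stage does with it. The comments are a simplification for orientation, not actual compiler output.

```zig
const std = @import("std");

pub fn main() void {
    const x: u32 = 2 + 3; // tokenizer: 'const', 'x', ':', 'u32', '=', '2', '+', '3', ';'
    // parser/AST: a declaration whose initializer is an add expression
    // ZIR: lowered instructions; types are not yet fully resolved
    // semantic analysis: '2 + 3' is comptime-known, so x becomes the constant 5
    // AIR: analyzed instructions with concrete, resolved types
    // codegen + linking: machine code inside the final executable
    std.debug.print("{}\n", .{x});
}
```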
ZIR in Stage2
ZIR means Zig Intermediate Representation.
You can think of ZIR as a simplified internal version of Zig code.
The parser produces an AST. The AST is still shaped like the source file. It remembers many source-level details.
ZIR is lower-level. It is easier for the compiler to analyze.
For example, source code may contain convenient syntax:
const x = if (flag) 10 else 20;

The AST records that this came from an if expression.
ZIR represents it in a form that the compiler can process more systematically.
You do not need to read ZIR manually as a beginner. But you should know why it exists.
The compiler does not want to repeatedly reason about every surface syntax detail. It lowers code into a simpler representation, then analyzes that.
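A rough analogy for that lowering, written as ordinary Zig rather than real ZIR text: the convenient if expression behaves like a labeled block with explicit branches and an explicit result.

```zig
const flag = true;

// Surface form: a compact if expression.
const x = if (flag) @as(u32, 10) else 20;

// A lowered view (still Zig, for illustration only, not actual ZIR):
// a block with explicit branches that each produce the block's result.
const y = blk: {
    if (flag) break :blk @as(u32, 10);
    break :blk 20;
};
```

Real ZIR is an instruction-based format, but the idea is the same: one uniform shape the compiler can analyze instead of many surface syntaxes.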
Semantic Analysis in Stage2
Semantic analysis is one of the largest and most complex parts of the compiler.
It answers questions such as:
What type is this expression?
Does this function call match the function type?
Can this integer fit into the destination type?
Is this value known at compile time?
Does this branch return correctly?
Is this pointer alignment valid?
Can this code be evaluated at comptime?

Example:
const x: u8 = 300;

The parser can parse this. The AST is valid.
But semantic analysis rejects it because 300 does not fit in u8, which can only hold values from 0 to 255.
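A sketch of how that plays out, with the rejected line commented out. The error wording shown is approximate and may differ between compiler versions; the single-argument @truncate builtin here assumes a newer Zig version.

```zig
// Rejected by semantic analysis: 300 cannot fit in u8 (0..255).
// const bad: u8 = 300; // error: type 'u8' cannot represent integer value '300'

// Accepted alternatives:
const wide: u16 = 300; // choose a type wide enough for the value
const low: u8 = @truncate(@as(u16, 300)); // explicitly keep the low 8 bits (44)
```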
Another example:
fn add(a: i32, b: i32) i32 {
return a + b;
}
const x = add("hello", "world");

Again, the parser can parse this. The syntax has the right shape.
Semantic analysis rejects it because string literals are pointers to byte arrays, not i32 values.
This is why semantic analysis is where many useful compiler errors come from.
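For contrast, the same function compiles cleanly once the argument types match; comptime integer literals coerce to i32 automatically.

```zig
fn add(a: i32, b: i32) i32 {
    return a + b;
}

// const bad = add("hello", "world"); // rejected: a string literal is not an i32
const ok = add(2, 3); // accepted: comptime integers coerce to i32
```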
AIR in Stage2
AIR means Analyzed Intermediate Representation.
By the time code reaches AIR, the compiler knows much more about it.
Types have been resolved. Many compile-time decisions have already happened. Invalid code has been rejected.
A simple mental model:
ZIR = code before full meaning is known
AIR = code after semantic meaning is known

AIR is useful because code generation should not need to solve all language-level questions again.
The backend wants a cleaner form:
operations
types
control flow
memory behavior
target requirements

AIR helps provide that.
Compile-Time Execution
Stage2 must support Zig’s compile-time execution model.
This is a major reason the compiler architecture is complex.
In Zig, the compiler may need to execute real Zig code while compiling.
Example:
fn makeValue(comptime n: usize) usize {
return n * 2;
}
const x = makeValue(21);

The compiler can compute x while compiling the program.
This gets more powerful with types:
fn Pair(comptime T: type) type {
return struct {
first: T,
second: T,
};
}
const IntPair = Pair(i32);

Here, a function returns a type.
That means the compiler must execute Pair(i32) during compilation and create the resulting struct type.
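Continuing the Pair example, the type the compiler creates behaves like any hand-written struct, and each distinct argument produces a distinct type.

```zig
fn Pair(comptime T: type) type {
    return struct {
        first: T,
        second: T,
    };
}

const IntPair = Pair(i32);

// The generated type is a normal struct:
const p = IntPair{ .first = 1, .second = 2 };

// A different T produces a different, unrelated type:
const FloatPair = Pair(f64);
```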
So Stage2 needs more than parsing and type checking. It needs an interpreter for compile-time Zig.
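A small sketch of that interpreter at work: the recursive calls below run entirely inside the compiler, and a comptime block can assert the result before any executable is produced.

```zig
fn fib(comptime n: u64) u64 {
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

// Forced to run in the compiler's interpreter; no fib code exists at runtime.
const f10 = comptime fib(10);

comptime {
    // fib(10) == 55; a wrong result would fail the build, not the program.
    if (f10 != 55) @compileError("unexpected comptime result");
}
```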
Code Generation Backends
After semantic analysis, the compiler must produce target code.
Zig has used LLVM as an important backend. LLVM can optimize code and emit machine code for many architectures.
But Zig has also worked on its own backends.
Why?
Because depending entirely on LLVM has tradeoffs:
LLVM is powerful but large
LLVM increases compiler build complexity
LLVM can slow down simple debug builds
LLVM behavior is not always easy to control
LLVM does not cover every desired compiler use case

Zig's own backend work can help with faster compilation, better control, and simpler bootstrap paths.
A practical view:
LLVM backend = strong optimization and broad target support
Zig native backends = more control and potential speed for some workflows

You do not need to choose between them as a beginner. Just know that Stage2 was designed to support a cleaner path from Zig source code to different backend strategies.
Incremental Compilation
One long-term goal of the newer compiler architecture is better incremental compilation.
Incremental compilation means the compiler should avoid rebuilding everything when only a small part of the program changes.
For example, if you edit one function, the compiler should ideally reuse previous work for the rest of the program.
That requires careful tracking:
which files changed
which declarations changed
which types depend on which declarations
which compile-time values must be recomputed
which generated code is still valid

This is hard in any language.
It is especially hard in Zig because compile-time code can inspect and generate types.
Stage2’s internal design is intended to make this kind of tracking more manageable.
Error Messages
A compiler is also a user interface.
When the compiler rejects a program, it must explain why.
Bad error messages make a language painful.
Stage2 work also matters because better internal representations can support better diagnostics.
For example, when the compiler understands the path from source code to semantic failure, it can show:
where the error happened
what type was expected
what type was found
which call caused the problem
which compile-time branch led here

Good diagnostics are not added at the end. They depend on architecture.
The compiler must keep enough source location and context information through each internal step.
Why Beginners Should Care
You do not need to understand Stage2 deeply to write Zig programs.
But you should know what it means because you will see it in discussions, issues, release notes, and compiler internals.
When someone says:
this changed in stage2

they usually mean:
this behavior belongs to the newer self-hosted compiler architecture

When someone says:
stage2 caught this differently

they may be talking about improved semantic analysis, different diagnostics, or changed compiler behavior.
When someone says:
stage2 backend

they may be talking about Zig's newer code generation path rather than the older LLVM-centered path.
A Safe Mental Model
Use this model:
Stage2 is the newer self-hosted Zig compiler architecture.
It parses Zig, lowers it into internal representations, analyzes it, runs compile-time code, generates target code, and links the final output.
Its purpose is to make Zig’s compiler faster, cleaner, more self-reliant, and easier to evolve.

That is enough for now.
Later, when you read the compiler source, you can connect the names to files and systems:
AST
ZIR
Sema
AIR
codegen
linker

Stage2 is where these pieces come together.