20.14 Building a Static Analyzer

Design a static analyzer that scans source code and reports likely defects without running the program.

20.14 Building a Static Analyzer

Problem

Design a static analyzer that scans source code and reports likely defects without running the program.

Static analyzers appear in compilers, linters, security tools, code review systems, IDEs, CI pipelines, and migration tools. They answer questions such as:

Is this variable used before assignment?
Is this branch unreachable?
Can this value be null?
Does this function violate a style rule?
Does this code contain a suspicious pattern?

A static analyzer works over program text, but good analyzers do not operate directly on raw strings. They parse source code into structure, inspect that structure, and report diagnostics with useful locations.

The core algorithmic problem is graph and tree analysis. The engineering problem is building an analyzer that produces useful warnings without overwhelming users with false positives.

Solution

Build the analyzer as a pipeline.

source text
  -> lexer
  -> parser
  -> abstract syntax tree
  -> semantic model
  -> analysis passes
  -> diagnostics

A minimal analyzer can skip full language coverage and analyze a small expression language. The same architecture scales to real programming languages.

A Tiny Language

Use a small statement language:

x = 1
y = x
print y
print z

The final line should produce a diagnostic:

undefined variable: z

The language has three statement forms:

name = integer
name = name
print name

This is enough to demonstrate tokenization, parsing, AST construction, and analysis.

Tokenizing Source Text

A lexer converts source text into tokens.

import re
from dataclasses import dataclass

TOKEN_RE = re.compile(
    r"""
    (?P<NUMBER>[0-9]+)
  | (?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<EQUAL>=)
  | (?P<NEWLINE>\n)
  | (?P<SPACE>[ \t]+)
  | (?P<MISMATCH>.)
    """,
    re.VERBOSE,
)

@dataclass(frozen=True)
class Token:
    kind: str
    value: str
    line: int
    column: int

Lexer:

def tokenize(source):
    tokens = []
    line = 1
    line_start = 0

    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        value = match.group()
        column = match.start() - line_start + 1

        if kind == "SPACE":
            continue

        if kind == "NEWLINE":
            tokens.append(Token(kind, value, line, column))
            line += 1
            line_start = match.end()
            continue

        if kind == "MISMATCH":
            raise SyntaxError(
                f"unexpected character {value!r} at {line}:{column}"
            )

        tokens.append(Token(kind, value, line, column))

    tokens.append(Token("EOF", "", line, 1))

    return tokens

The lexer records locations. Diagnostics without locations are much less useful.

Defining the AST

The abstract syntax tree stores program structure.

@dataclass(frozen=True)
class Program:
    statements: list

@dataclass(frozen=True)
class Assign:
    name: str
    expr: object
    line: int
    column: int

@dataclass(frozen=True)
class Print:
    name: str
    line: int
    column: int

@dataclass(frozen=True)
class Number:
    value: int
    line: int
    column: int

@dataclass(frozen=True)
class Variable:
    name: str
    line: int
    column: int

The AST avoids raw text concerns. Assignment, printing, literals, and variable references become explicit objects.

Parsing Statements

A parser consumes tokens and builds the AST.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.position = 0

    def current(self):
        return self.tokens[self.position]

    def advance(self):
        token = self.current()
        self.position += 1
        return token

    def expect(self, kind):
        token = self.current()

        if token.kind != kind:
            raise SyntaxError(
                f"expected {kind}, got {token.kind} "
                f"at {token.line}:{token.column}"
            )

        return self.advance()

    def parse_program(self):
        statements = []

        while self.current().kind != "EOF":
            if self.current().kind == "NEWLINE":
                self.advance()
                continue

            statements.append(self.parse_statement())

            if self.current().kind == "NEWLINE":
                self.advance()

        return Program(statements)

Parse each statement:

class Parser(Parser):
    def parse_statement(self):
        token = self.expect("IDENT")

        if token.value == "print":
            name = self.expect("IDENT")

            return Print(
                name=name.value,
                line=token.line,
                column=token.column,
            )

        self.expect("EQUAL")
        expr = self.parse_expr()

        return Assign(
            name=token.value,
            expr=expr,
            line=token.line,
            column=token.column,
        )

    def parse_expr(self):
        token = self.current()

        if token.kind == "NUMBER":
            token = self.advance()

            return Number(
                value=int(token.value),
                line=token.line,
                column=token.column,
            )

        if token.kind == "IDENT":
            token = self.advance()

            return Variable(
                name=token.value,
                line=token.line,
                column=token.column,
            )

        raise SyntaxError(
            f"expected expression at {token.line}:{token.column}"
        )

Use it:

source = """x = 1
y = x
print y
print z
"""

tokens = tokenize(source)
program = Parser(tokens).parse_program()

print(program)

The result is a structured representation suitable for analysis.

Diagnostics

A diagnostic should contain a message, a severity, and a source location.

@dataclass(frozen=True)
class Diagnostic:
    message: str
    line: int
    column: int
    severity: str = "warning"

Format diagnostics:

def format_diagnostic(diagnostic):
    return (
        f"{diagnostic.severity}: "
        f"{diagnostic.line}:{diagnostic.column}: "
        f"{diagnostic.message}"
    )

Example:

warning: 4:1: undefined variable: z

Location quality directly affects whether users trust the analyzer.

Undefined Variable Analysis

Track the set of variables that have been assigned.

def analyze_undefined_variables(program):
    defined = set()
    diagnostics = []

    for statement in program.statements:
        if isinstance(statement, Assign):
            if isinstance(statement.expr, Variable):
                if statement.expr.name not in defined:
                    diagnostics.append(
                        Diagnostic(
                            message=f"undefined variable: {statement.expr.name}",
                            line=statement.expr.line,
                            column=statement.expr.column,
                        )
                    )

            defined.add(statement.name)

        elif isinstance(statement, Print):
            if statement.name not in defined:
                diagnostics.append(
                    Diagnostic(
                        message=f"undefined variable: {statement.name}",
                        line=statement.line,
                        column=statement.column,
                    )
                )

    return diagnostics

Run it:

diagnostics = analyze_undefined_variables(program)

for diagnostic in diagnostics:
    print(format_diagnostic(diagnostic))

Output:

warning: 4:1: undefined variable: z

This is a dataflow analysis in its simplest form. The analyzer maintains facts while walking the program.

Unused Variable Analysis

Track assignments and uses.

def analyze_unused_variables(program):
    assigned = {}
    used = set()

    for statement in program.statements:
        if isinstance(statement, Assign):
            assigned[statement.name] = statement

            if isinstance(statement.expr, Variable):
                used.add(statement.expr.name)

        elif isinstance(statement, Print):
            used.add(statement.name)

    diagnostics = []

    for name, statement in assigned.items():
        if name not in used:
            diagnostics.append(
                Diagnostic(
                    message=f"unused variable: {name}",
                    line=statement.line,
                    column=statement.column,
                )
            )

    return diagnostics

Example:

x = 1
y = 2
print x

Diagnostic:

warning: 2:1: unused variable: y

This analysis is simple because the language has no branches. In a real language, usage may depend on control flow.

Control-Flow Graphs

For statements with branches and loops, a linear scan is not enough. Build a control-flow graph, or CFG.

A CFG has:

  • Nodes representing program points or basic blocks.
  • Edges representing possible execution flow.

Example:

if condition:
    x = 1
else:
    y = 2
print x

The analyzer must know that x may be undefined when execution comes from the else branch.

A simplified CFG:

entry
  |
condition
 / \\nx=1 y=2
 \ /
print x
  |
exit

Static analysis over a CFG usually computes facts until they stop changing.

Dataflow Analysis Pattern

Many static analyses share the same structure.

initialize facts
repeat:
    propagate facts across edges
    merge facts at joins
until no facts change

Examples:

Analysis Fact
Definite assignment Variables definitely assigned
Reaching definitions Assignments that may reach a point
Live variables Variables that may be used later
Constant propagation Variables with known constant values
Nullability Values that may be null
Taint analysis Values influenced by untrusted input

The difference between analyses lies in the facts, transfer functions, and merge rules.

Definite Assignment

Definite assignment asks whether a variable is assigned on all paths before use.

At a control-flow join, use intersection.

facts from branch 1: {x}
facts from branch 2: {y}
intersection: {}

So after the if, neither x nor y is definitely assigned.

By contrast, possible assignment uses union.

union: {x, y}

Choosing union or intersection changes the meaning of the analysis.

Constant Propagation

Constant propagation tracks known values.

Example:

x = 10
y = x
print y

The analyzer can infer:

x = 10
y = 10

A simple linear version:

def analyze_constants(program):
    constants = {}
    diagnostics = []

    for statement in program.statements:
        if isinstance(statement, Assign):
            if isinstance(statement.expr, Number):
                constants[statement.name] = statement.expr.value

            elif isinstance(statement.expr, Variable):
                if statement.expr.name in constants:
                    constants[statement.name] = constants[statement.expr.name]
                else:
                    constants.pop(statement.name, None)

        elif isinstance(statement, Print):
            pass

    return constants, diagnostics

At branches, a variable remains constant only if both paths assign the same constant.

Rule-Based Checks

Not every analyzer needs full dataflow. Many useful checks are tree patterns.

Example rule:

Do not print directly from library code.

A rule can traverse the AST:

def analyze_no_print(program):
    diagnostics = []

    for statement in program.statements:
        if isinstance(statement, Print):
            diagnostics.append(
                Diagnostic(
                    message="avoid direct print statements",
                    line=statement.line,
                    column=statement.column,
                    severity="info",
                )
            )

    return diagnostics

Rule-based checks are easy to add and test. They are the basis of many linters.

Organizing Analysis Passes

Use a registry of analysis passes.

ANALYSIS_PASSES = [
    analyze_undefined_variables,
    analyze_unused_variables,
    analyze_no_print,
]

def analyze(program):
    diagnostics = []

    for analysis_pass in ANALYSIS_PASSES:
        diagnostics.extend(analysis_pass(program))

    return diagnostics

This keeps each analysis small. A failed or noisy rule can be disabled without changing the rest of the system.

Severity and Codes

Diagnostics should have stable codes.

@dataclass(frozen=True)
class Diagnostic:
    code: str
    message: str
    line: int
    column: int
    severity: str = "warning"

Example:

ALG001 undefined variable
ALG002 unused variable
ALG003 direct print statement

Stable codes let users suppress specific warnings.

# analyzer-disable ALG002

They also make CI configuration and documentation manageable.

Suppression

Users need a way to suppress intentional warnings.

A simple line-based suppression model:

# analyzer-disable-next-line ALG002
x = 1

The analyzer can collect suppression comments during lexing or parse them as trivia.

Suppression should be precise. A global “ignore all warnings” option reduces analyzer value. Prefer suppressing one rule at one location with a reason.

False Positives and False Negatives

A false positive reports a problem that is not actually a problem.

A false negative misses a real problem.

Static analysis always balances these.

For style rules, false positives may be tolerable. For security tools, false negatives may be more dangerous. For CI enforcement, false positives are costly because they block legitimate changes.

A practical analyzer should classify rules:

Rule type Tolerance
Formatting Low cost, easy autofix
Style Moderate tolerance
Bug risk Needs high precision
Security Needs careful triage
Migration Often project-specific

Autofix

Some diagnostics can be fixed automatically.

Example:

unused import

The analyzer can remove the import safely if no references exist.

Represent fixes:

@dataclass(frozen=True)
class TextEdit:
    start_line: int
    start_column: int
    end_line: int
    end_column: int
    replacement: str

Attach fixes to diagnostics:

@dataclass(frozen=True)
class Diagnostic:
    code: str
    message: str
    line: int
    column: int
    severity: str = "warning"
    fix: object = None

Autofix requires exact source ranges. The lexer and parser must preserve enough position information.

Performance

Static analyzers often run in editors and CI systems, so performance matters.

Common strategies:

Strategy Effect
Incremental parsing Reanalyze only changed files
File-level caching Reuse previous results
Rule filtering Run only enabled rules
Syntax-first checks Avoid expensive semantic passes
Parallel files Analyze independent files concurrently
Early bailout Stop after severe syntax errors

A small analyzer can scan files from scratch. A production analyzer usually needs incremental architecture.

Testing

Test parsing and analysis separately.

Parser test:

def test_parse_assignment():
    program = Parser(tokenize("x = 1")).parse_program()

    assert len(program.statements) == 1
    assert isinstance(program.statements[0], Assign)
    assert program.statements[0].name == "x"

Undefined variable test:

def test_undefined_variable():
    source = """x = 1
print y
"""

    program = Parser(tokenize(source)).parse_program()
    diagnostics = analyze_undefined_variables(program)

    assert len(diagnostics) == 1
    assert diagnostics[0].message == "undefined variable: y"

Unused variable test:

def test_unused_variable():
    source = """x = 1
y = 2
print x
"""

    program = Parser(tokenize(source)).parse_program()
    diagnostics = analyze_unused_variables(program)

    assert len(diagnostics) == 1
    assert diagnostics[0].message == "unused variable: y"

Analyzer tests should use small programs where the expected diagnostic set is clear.

Common Bugs

The most common static analyzer bug is reporting diagnostics without precise source locations. Users cannot act on vague warnings.

Another common bug is analyzing raw text instead of parsed structure. This creates false positives in comments, strings, and unusual formatting.

Dataflow analyses often use the wrong merge operation. Union and intersection answer different questions.

Unused variable analysis can be wrong when variables are used indirectly, exported, reflected, or referenced from generated code.

Suppression systems can accidentally suppress too much. Suppression should be narrow and visible.

Autofix can corrupt source code if ranges are imprecise or edits overlap.

Performance problems appear when every rule repeatedly walks the full tree. Share semantic information between passes when possible.

Recipe

Build the analyzer in layers.

Start with a lexer that records source locations. Parse source into an AST. Define diagnostics with stable codes and precise locations. Implement simple tree-based checks first. Add linear dataflow checks such as undefined and unused variables. Introduce control-flow graphs when branches and loops matter. Organize checks as separate passes. Add suppression and autofix only after diagnostics are stable. Test each rule with small source examples. Track false-positive rate before enforcing rules in CI.

The main lesson is that static analysis is structured program inspection. The tree gives syntax, the control-flow graph gives execution paths, and dataflow facts give semantic evidence. A useful analyzer combines these structures while keeping diagnostics precise, explainable, and configurable.