20.14 Building a Static Analyzer
Design a static analyzer that scans source code and reports likely defects without running the program.
20.14 Building a Static Analyzer
Problem
Design a static analyzer that scans source code and reports likely defects without running the program.
Static analyzers appear in compilers, linters, security tools, code review systems, IDEs, CI pipelines, and migration tools. They answer questions such as:
Is this variable used before assignment?
Is this branch unreachable?
Can this value be null?
Does this function violate a style rule?
Does this code contain a suspicious pattern?
A static analyzer works over program text, but good analyzers do not operate directly on raw strings. They parse source code into structure, inspect that structure, and report diagnostics with useful locations.
The core algorithmic problem is graph and tree analysis. The engineering problem is building an analyzer that produces useful warnings without overwhelming users with false positives.
Solution
Build the analyzer as a pipeline.
source text
-> lexer
-> parser
-> abstract syntax tree
-> semantic model
-> analysis passes
-> diagnostics
A minimal analyzer can skip full language coverage and analyze a small expression language. The same architecture scales to real programming languages.
A Tiny Language
Use a small statement language:
x = 1
y = x
print y
print z
The final line should produce a diagnostic:
undefined variable: z
The language has three statement forms:
name = integer
name = name
print name
This is enough to demonstrate tokenization, parsing, AST construction, and analysis.
Tokenizing Source Text
A lexer converts source text into tokens.
import re
from dataclasses import dataclass
TOKEN_RE = re.compile(
r"""
(?P<NUMBER>[0-9]+)
| (?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)
| (?P<EQUAL>=)
| (?P<NEWLINE>\n)
| (?P<SPACE>[ \t]+)
| (?P<MISMATCH>.)
""",
re.VERBOSE,
)
@dataclass(frozen=True)
class Token:
kind: str
value: str
line: int
column: int
Lexer:
def tokenize(source):
tokens = []
line = 1
line_start = 0
for match in TOKEN_RE.finditer(source):
kind = match.lastgroup
value = match.group()
column = match.start() - line_start + 1
if kind == "SPACE":
continue
if kind == "NEWLINE":
tokens.append(Token(kind, value, line, column))
line += 1
line_start = match.end()
continue
if kind == "MISMATCH":
raise SyntaxError(
f"unexpected character {value!r} at {line}:{column}"
)
tokens.append(Token(kind, value, line, column))
tokens.append(Token("EOF", "", line, 1))
return tokens
The lexer records locations. Diagnostics without locations are much less useful.
Defining the AST
The abstract syntax tree stores program structure.
@dataclass(frozen=True)
class Program:
statements: list
@dataclass(frozen=True)
class Assign:
name: str
expr: object
line: int
column: int
@dataclass(frozen=True)
class Print:
name: str
line: int
column: int
@dataclass(frozen=True)
class Number:
value: int
line: int
column: int
@dataclass(frozen=True)
class Variable:
name: str
line: int
column: int
The AST avoids raw text concerns. Assignment, printing, literals, and variable references become explicit objects.
Parsing Statements
A parser consumes tokens and builds the AST.
class Parser:
def __init__(self, tokens):
self.tokens = tokens
self.position = 0
def current(self):
return self.tokens[self.position]
def advance(self):
token = self.current()
self.position += 1
return token
def expect(self, kind):
token = self.current()
if token.kind != kind:
raise SyntaxError(
f"expected {kind}, got {token.kind} "
f"at {token.line}:{token.column}"
)
return self.advance()
def parse_program(self):
statements = []
while self.current().kind != "EOF":
if self.current().kind == "NEWLINE":
self.advance()
continue
statements.append(self.parse_statement())
if self.current().kind == "NEWLINE":
self.advance()
return Program(statements)
Parse each statement:
class Parser(Parser):
def parse_statement(self):
token = self.expect("IDENT")
if token.value == "print":
name = self.expect("IDENT")
return Print(
name=name.value,
line=token.line,
column=token.column,
)
self.expect("EQUAL")
expr = self.parse_expr()
return Assign(
name=token.value,
expr=expr,
line=token.line,
column=token.column,
)
def parse_expr(self):
token = self.current()
if token.kind == "NUMBER":
token = self.advance()
return Number(
value=int(token.value),
line=token.line,
column=token.column,
)
if token.kind == "IDENT":
token = self.advance()
return Variable(
name=token.value,
line=token.line,
column=token.column,
)
raise SyntaxError(
f"expected expression at {token.line}:{token.column}"
)
Use it:
source = """x = 1
y = x
print y
print z
"""
tokens = tokenize(source)
program = Parser(tokens).parse_program()
print(program)
The result is a structured representation suitable for analysis.
Diagnostics
A diagnostic should contain a message, a severity, and a source location.
@dataclass(frozen=True)
class Diagnostic:
message: str
line: int
column: int
severity: str = "warning"
Format diagnostics:
def format_diagnostic(diagnostic):
return (
f"{diagnostic.severity}: "
f"{diagnostic.line}:{diagnostic.column}: "
f"{diagnostic.message}"
)
Example:
warning: 4:1: undefined variable: z
Location quality directly affects whether users trust the analyzer.
Undefined Variable Analysis
Track the set of variables that have been assigned.
def analyze_undefined_variables(program):
defined = set()
diagnostics = []
for statement in program.statements:
if isinstance(statement, Assign):
if isinstance(statement.expr, Variable):
if statement.expr.name not in defined:
diagnostics.append(
Diagnostic(
message=f"undefined variable: {statement.expr.name}",
line=statement.expr.line,
column=statement.expr.column,
)
)
defined.add(statement.name)
elif isinstance(statement, Print):
if statement.name not in defined:
diagnostics.append(
Diagnostic(
message=f"undefined variable: {statement.name}",
line=statement.line,
column=statement.column,
)
)
return diagnostics
Run it:
diagnostics = analyze_undefined_variables(program)
for diagnostic in diagnostics:
print(format_diagnostic(diagnostic))
Output:
warning: 4:1: undefined variable: z
This is a dataflow analysis in its simplest form. The analyzer maintains facts while walking the program.
Unused Variable Analysis
Track assignments and uses.
def analyze_unused_variables(program):
assigned = {}
used = set()
for statement in program.statements:
if isinstance(statement, Assign):
assigned[statement.name] = statement
if isinstance(statement.expr, Variable):
used.add(statement.expr.name)
elif isinstance(statement, Print):
used.add(statement.name)
diagnostics = []
for name, statement in assigned.items():
if name not in used:
diagnostics.append(
Diagnostic(
message=f"unused variable: {name}",
line=statement.line,
column=statement.column,
)
)
return diagnostics
Example:
x = 1
y = 2
print x
Diagnostic:
warning: 2:1: unused variable: y
This analysis is simple because the language has no branches. In a real language, usage may depend on control flow.
Control-Flow Graphs
For statements with branches and loops, a linear scan is not enough. Build a control-flow graph, or CFG.
A CFG has:
- Nodes representing program points or basic blocks.
- Edges representing possible execution flow.
Example:
if condition:
x = 1
else:
y = 2
print x
The analyzer must know that x may be undefined when execution comes from the else branch.
A simplified CFG:
entry
|
condition
/ \\nx=1 y=2
\ /
print x
|
exit
Static analysis over a CFG usually computes facts until they stop changing.
Dataflow Analysis Pattern
Many static analyses share the same structure.
initialize facts
repeat:
propagate facts across edges
merge facts at joins
until no facts change
Examples:
| Analysis | Fact |
|---|---|
| Definite assignment | Variables definitely assigned |
| Reaching definitions | Assignments that may reach a point |
| Live variables | Variables that may be used later |
| Constant propagation | Variables with known constant values |
| Nullability | Values that may be null |
| Taint analysis | Values influenced by untrusted input |
The difference between analyses lies in the facts, transfer functions, and merge rules.
Definite Assignment
Definite assignment asks whether a variable is assigned on all paths before use.
At a control-flow join, use intersection.
facts from branch 1: {x}
facts from branch 2: {y}
intersection: {}
So after the if, neither x nor y is definitely assigned.
By contrast, possible assignment uses union.
union: {x, y}
Choosing union or intersection changes the meaning of the analysis.
Constant Propagation
Constant propagation tracks known values.
Example:
x = 10
y = x
print y
The analyzer can infer:
x = 10
y = 10
A simple linear version:
def analyze_constants(program):
constants = {}
diagnostics = []
for statement in program.statements:
if isinstance(statement, Assign):
if isinstance(statement.expr, Number):
constants[statement.name] = statement.expr.value
elif isinstance(statement.expr, Variable):
if statement.expr.name in constants:
constants[statement.name] = constants[statement.expr.name]
else:
constants.pop(statement.name, None)
elif isinstance(statement, Print):
pass
return constants, diagnostics
At branches, a variable remains constant only if both paths assign the same constant.
Rule-Based Checks
Not every analyzer needs full dataflow. Many useful checks are tree patterns.
Example rule:
Do not print directly from library code.
A rule can traverse the AST:
def analyze_no_print(program):
diagnostics = []
for statement in program.statements:
if isinstance(statement, Print):
diagnostics.append(
Diagnostic(
message="avoid direct print statements",
line=statement.line,
column=statement.column,
severity="info",
)
)
return diagnostics
Rule-based checks are easy to add and test. They are the basis of many linters.
Organizing Analysis Passes
Use a registry of analysis passes.
ANALYSIS_PASSES = [
analyze_undefined_variables,
analyze_unused_variables,
analyze_no_print,
]
def analyze(program):
diagnostics = []
for analysis_pass in ANALYSIS_PASSES:
diagnostics.extend(analysis_pass(program))
return diagnostics
This keeps each analysis small. A failed or noisy rule can be disabled without changing the rest of the system.
Severity and Codes
Diagnostics should have stable codes.
@dataclass(frozen=True)
class Diagnostic:
code: str
message: str
line: int
column: int
severity: str = "warning"
Example:
ALG001 undefined variable
ALG002 unused variable
ALG003 direct print statement
Stable codes let users suppress specific warnings.
# analyzer-disable ALG002
They also make CI configuration and documentation manageable.
Suppression
Users need a way to suppress intentional warnings.
A simple line-based suppression model:
# analyzer-disable-next-line ALG002
x = 1
The analyzer can collect suppression comments during lexing or parse them as trivia.
Suppression should be precise. A global “ignore all warnings” option reduces analyzer value. Prefer suppressing one rule at one location with a reason.
False Positives and False Negatives
A false positive reports a problem that is not actually a problem.
A false negative misses a real problem.
Static analysis always balances these.
For style rules, false positives may be tolerable. For security tools, false negatives may be more dangerous. For CI enforcement, false positives are costly because they block legitimate changes.
A practical analyzer should classify rules:
| Rule type | Tolerance |
|---|---|
| Formatting | Low cost, easy autofix |
| Style | Moderate tolerance |
| Bug risk | Needs high precision |
| Security | Needs careful triage |
| Migration | Often project-specific |
Autofix
Some diagnostics can be fixed automatically.
Example:
unused import
The analyzer can remove the import safely if no references exist.
Represent fixes:
@dataclass(frozen=True)
class TextEdit:
start_line: int
start_column: int
end_line: int
end_column: int
replacement: str
Attach fixes to diagnostics:
@dataclass(frozen=True)
class Diagnostic:
code: str
message: str
line: int
column: int
severity: str = "warning"
fix: object = None
Autofix requires exact source ranges. The lexer and parser must preserve enough position information.
Performance
Static analyzers often run in editors and CI systems, so performance matters.
Common strategies:
| Strategy | Effect |
|---|---|
| Incremental parsing | Reanalyze only changed files |
| File-level caching | Reuse previous results |
| Rule filtering | Run only enabled rules |
| Syntax-first checks | Avoid expensive semantic passes |
| Parallel files | Analyze independent files concurrently |
| Early bailout | Stop after severe syntax errors |
A small analyzer can scan files from scratch. A production analyzer usually needs incremental architecture.
Testing
Test parsing and analysis separately.
Parser test:
def test_parse_assignment():
program = Parser(tokenize("x = 1")).parse_program()
assert len(program.statements) == 1
assert isinstance(program.statements[0], Assign)
assert program.statements[0].name == "x"
Undefined variable test:
def test_undefined_variable():
source = """x = 1
print y
"""
program = Parser(tokenize(source)).parse_program()
diagnostics = analyze_undefined_variables(program)
assert len(diagnostics) == 1
assert diagnostics[0].message == "undefined variable: y"
Unused variable test:
def test_unused_variable():
source = """x = 1
y = 2
print x
"""
program = Parser(tokenize(source)).parse_program()
diagnostics = analyze_unused_variables(program)
assert len(diagnostics) == 1
assert diagnostics[0].message == "unused variable: y"
Analyzer tests should use small programs where the expected diagnostic set is clear.
Common Bugs
The most common static analyzer bug is reporting diagnostics without precise source locations. Users cannot act on vague warnings.
Another common bug is analyzing raw text instead of parsed structure. This creates false positives in comments, strings, and unusual formatting.
Dataflow analyses often use the wrong merge operation. Union and intersection answer different questions.
Unused variable analysis can be wrong when variables are used indirectly, exported, reflected, or referenced from generated code.
Suppression systems can accidentally suppress too much. Suppression should be narrow and visible.
Autofix can corrupt source code if ranges are imprecise or edits overlap.
Performance problems appear when every rule repeatedly walks the full tree. Share semantic information between passes when possible.
Recipe
Build the analyzer in layers.
Start with a lexer that records source locations. Parse source into an AST. Define diagnostics with stable codes and precise locations. Implement simple tree-based checks first. Add linear dataflow checks such as undefined and unused variables. Introduce control-flow graphs when branches and loops matter. Organize checks as separate passes. Add suppression and autofix only after diagnostics are stable. Test each rule with small source examples. Track false-positive rate before enforcing rules in CI.
The main lesson is that static analysis is structured program inspection. The tree gives syntax, the control-flow graph gives execution paths, and dataflow facts give semantic evidence. A useful analyzer combines these structures while keeping diagnostics precise, explainable, and configurable.