
Ahead-of-Time vs Just-in-Time Differentiation

Automatic differentiation can be performed before a program runs, while it runs, or in a staged phase between the two.

The two main compiler-oriented models are ahead-of-time differentiation and just-in-time differentiation.

Ahead-of-time differentiation, or AOT differentiation, generates derivative code before execution. Just-in-time differentiation, or JIT differentiation, generates derivative code during execution, usually after seeing concrete input types, shapes, or runtime configuration.

Both models treat AD as program generation. They differ in when the derivative program is produced.

Ahead-of-Time Differentiation

In AOT differentiation, the AD system transforms a known program into derivative code before the program is run.

A typical pipeline is:

source program
    -> parse
    -> lower to IR
    -> differentiate
    -> optimize
    -> compile
    -> ship executable

The derivative function becomes part of the compiled artifact.

For example:

f(x) -> y

may be compiled into:

df(x, dx) -> (y, dy)

for forward mode, or:

f_primal(x) -> (y, residuals)
f_pullback(residuals, bar_y) -> bar_x

for reverse mode.
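The two calling conventions above can be sketched in plain Python. This is a minimal illustration, not the output of any particular AD system; the function names mirror the signatures shown.

```python
def f(x):
    return x * x + 3.0 * x            # primal: y = x^2 + 3x

# Forward mode: df(x, dx) -> (y, dy), propagating a tangent with the value.
def df(x, dx):
    y = x * x + 3.0 * x
    dy = (2.0 * x + 3.0) * dx         # dy/dx = 2x + 3
    return y, dy

# Reverse mode: a primal pass that saves residuals, and a pullback that
# consumes the residuals together with the output adjoint bar_y.
def f_primal(x):
    y = x * x + 3.0 * x
    residuals = (x,)                  # values the pullback will need later
    return y, residuals

def f_pullback(residuals, bar_y):
    (x,) = residuals
    bar_x = (2.0 * x + 3.0) * bar_y
    return bar_x

y, dy = df(2.0, 1.0)                  # y = 10.0, dy = 7.0
_, res = f_primal(2.0)
bar_x = f_pullback(res, 1.0)          # bar_x = 7.0
```

In an AOT system, `df`, `f_primal`, and `f_pullback` would be generated and compiled before the program ever runs.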

AOT differentiation is common in scientific computing, embedded systems, high-performance simulation, and languages with strong compiler infrastructure.

Strengths of AOT Differentiation

AOT differentiation gives the compiler maximum time to analyze and optimize the derivative program.

It can perform:

whole-program analysis
activity analysis
alias analysis
memory planning
checkpoint placement
loop optimization
vectorization
static scheduling
code generation for fixed targets

Because the derivative code is produced before deployment, runtime behavior is predictable. There is no first-call compilation pause. This matters in production services, real-time control, embedded systems, and batch jobs with strict scheduling.

AOT also fits environments where dynamic code generation is unavailable or undesirable.

Limits of AOT Differentiation

AOT differentiation needs the program structure early.

This is easy for static languages and fixed simulations. It is harder for dynamic programs where control flow, tensor shapes, or operator choices depend on runtime values.

If the compiler must handle all possible shapes and branches, the derivative program may become generic and slower. If it specializes too much, it may need many compiled variants.

AOT systems also face a deployment problem: users must decide in advance which functions and derivative modes to compile.

Just-in-Time Differentiation

In JIT differentiation, the system waits until runtime to capture or specialize the program.

A typical pipeline is:

first call
    -> inspect inputs
    -> trace or lower program
    -> specialize
    -> differentiate
    -> optimize
    -> compile
    -> execute

later calls
    -> reuse cached compiled derivative

The first call pays compilation cost. Later calls can be fast.
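The first-call/cached-call pattern can be sketched as a cache keyed on input structure. This is illustrative only: `jit_grad`, the key, and the "compile" step are stand-ins, and real systems key on far more than input length.

```python
_cache = {}
compile_count = 0                     # counts "compilations" for illustration

def jit_grad(f, grad_f):
    def wrapper(xs):
        global compile_count
        key = (id(f), len(xs))        # specialize on input length ("shape")
        if key not in _cache:
            compile_count += 1        # first call pays compilation cost
            _cache[key] = grad_f      # stand-in for trace + differentiate + compile
        return _cache[key](xs)
    return wrapper

square_sum = lambda xs: sum(x * x for x in xs)
grad_square_sum = jit_grad(square_sum, lambda xs: [2.0 * x for x in xs])

grad_square_sum([1.0, 2.0])           # compiles for length 2
grad_square_sum([3.0, 4.0])           # cache hit: same length, no compile
grad_square_sum([1.0, 2.0, 3.0])      # new length: compiles again
```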

JIT differentiation is common in array systems, machine learning frameworks, and interactive numerical programming.

Why JIT Works Well for Tensor Programs

Tensor programs often have stable structure across many calls.

A training step may run thousands or millions of times with the same shapes and dtypes:

parameters: fixed shapes
batch: usually fixed shape
loss function: fixed graph
optimizer step: fixed update structure

JIT can specialize once, then reuse the compiled derivative program.

The compiler can exploit concrete information:

dtype = float32
shape = [1024, 4096]
device = GPU
layout = row-major or tiled

This enables kernel fusion, static memory planning, and backend-specific code generation.

Specialization and Recompilation

JIT systems rely on specialization.

A derivative compiled for one shape may not apply to another shape.

compiled for tensor<f32, [32, 768]>
not necessarily valid for tensor<f32, [64, 768]>

The system therefore uses cache keys.

A cache key may include:

function identity
input dtypes
input shapes
static argument values
device
compiler flags
AD mode
transformation stack

When the key changes, the system retraces or recompiles.

This is powerful, but users must understand which arguments are static and which are dynamic.
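A cache key along these lines can be sketched as a tuple built from the static parts of a call. The `cache_key` helper and its argument encoding are hypothetical; they only illustrate that shapes and static arguments enter the key while runtime values do not.

```python
def cache_key(fn_name, args, static_args):
    # Shapes and dtypes are static: they enter the key.
    # Tensor *values* are dynamic: they do not.
    shapes = tuple((a["dtype"], a["shape"]) for a in args)
    return (fn_name, shapes, tuple(sorted(static_args.items())))

k1 = cache_key("train_step",
               [{"dtype": "f32", "shape": (32, 768)}],
               {"mode": "reverse"})
k2 = cache_key("train_step",
               [{"dtype": "f32", "shape": (64, 768)}],
               {"mode": "reverse"})
# Different batch size -> different key -> retrace or recompile.
```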

AOT vs JIT

Dimension                    | AOT differentiation                   | JIT differentiation
-----------------------------|---------------------------------------|-----------------------------------
When derivative is generated | Before execution                      | During execution
Runtime startup              | Predictable                           | First call may compile
Optimization knowledge       | Static program knowledge              | Concrete runtime shapes and types
Dynamic control flow         | Harder unless represented explicitly  | Natural if traced from execution
Deployment                   | Compiled artifact                     | Runtime compiler/cache needed
Interactivity                | Lower                                 | Higher
Embedded use                 | Strong fit                            | Often poor fit
ML workloads                 | Useful but less flexible              | Strong fit
Debugging                    | Easier to inspect generated code      | Harder due to staging and cache
Shape specialization         | Must be planned early                 | Natural

Neither model is universally better. The correct choice depends on workload and deployment constraints.

Interaction with Source Transformation

AOT differentiation often uses source transformation.

source f
    -> derivative source df
    -> ordinary compiler

This gives visible derivative code and integrates well with conventional compiler pipelines.

JIT differentiation often uses tracing.

run f with tracer values
    -> graph
    -> derivative graph
    -> optimized executable

But these are tendencies, not rules. A JIT system can use source transformation internally. An AOT system can compile traced examples into deployable artifacts.

The deeper distinction is timing, not technique.
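The source-transformation path can be made concrete with a toy transformer. Here a polynomial "source" is differentiated term by term, and Python's `eval` stands in for the ordinary compiler; `derivative_source` and the term encoding are hypothetical.

```python
def derivative_source(terms):
    # terms: list of (coefficient, power) pairs representing sum(c * x**p).
    # Differentiate term by term: d/dx (c * x**p) = c*p * x**(p-1).
    d = [(c * p, p - 1) for c, p in terms if p != 0]
    return " + ".join(f"{c}*x**{p}" for c, p in d) or "0"

f_terms = [(1.0, 2), (3.0, 1)]        # f(x) = x**2 + 3*x
df_src = derivative_source(f_terms)   # derivative *source*, not a value
df = eval(f"lambda x: {df_src}")      # hand the source to the ordinary compiler
df(2.0)                               # evaluates 2x + 3 at x = 2
```

The key property is that `df_src` is ordinary code: it can be inspected, optimized, and compiled before execution, which is exactly what the AOT pipeline exploits.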

AOT Reverse Mode

AOT reverse mode usually produces two compiled routines.

forward routine:
    compute primal outputs
    save residuals

reverse routine:
    consume residuals and output adjoints
    produce input adjoints

This split is useful because the reverse routine may run later or multiple times.

For example, in an optimization solver, one forward computation may be paired with several adjoint computations under different seeds.

The compiler must define the residual format carefully. It becomes part of the internal ABI between the forward and reverse routines.
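The forward/reverse split and the residual ABI can be sketched as two routines sharing a residual tuple. The names and the toy function are illustrative; the point is that `forward` runs once while `reverse` reuses the residuals under different seeds.

```python
def forward(x):
    y = x ** 3                        # toy primal: y = x^3
    residuals = (x,)                  # residual format: the internal ABI
    return y, residuals

def reverse(residuals, seed):
    (x,) = residuals                  # must match forward's residual layout
    return 3.0 * x ** 2 * seed        # bar_x = f'(x) * seed

y, res = forward(2.0)                 # one forward computation...
g1 = reverse(res, 1.0)                # ...paired with several adjoint
g2 = reverse(res, -0.5)               # computations under different seeds
```

If the residual layout changes in `forward` without a matching change in `reverse`, the pair silently breaks, which is why the format must be defined carefully.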

JIT Reverse Mode

JIT reverse mode may build the backward graph from one execution.

A dynamic system records:

executed operations
input and output references
saved primal values
local backward rules

Then backward() traverses the recorded graph.

A staged JIT system may instead compile a reusable backward function:

trace primal
build backward graph
compile forward plus backward
cache executable

The first model prioritizes flexibility. The second model prioritizes repeated performance.
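The dynamic, tape-based model above can be sketched in a few lines: each executed operation records its inputs, output, and a local backward rule, and `backward` traverses the tape in reverse. `Var` and the tape layout are illustrative, not any framework's API.

```python
tape = []  # records (output_id, input_ids, backward_rule) per executed op

class Var:
    _next = 0
    def __init__(self, value):
        self.value = value
        self.id = Var._next
        Var._next += 1
    def __mul__(self, other):
        out = Var(self.value * other.value)
        # Local backward rule for multiply: d(a*b) -> (b*g, a*g).
        tape.append((out.id, (self.id, other.id),
                     lambda g, a=self.value, b=other.value: (b * g, a * g)))
        return out

def backward(out):
    grads = {out.id: 1.0}             # seed the output adjoint
    for out_id, in_ids, rule in reversed(tape):
        if out_id in grads:
            for i, g in zip(in_ids, rule(grads[out_id])):
                grads[i] = grads.get(i, 0.0) + g
    return grads

a, b = Var(3.0), Var(4.0)
y = a * b
grads = backward(y)                   # grads[a.id] = b, grads[b.id] = a
```

A staged system would instead compile this traversal into a reusable backward function and cache it, trading per-call flexibility for repeated performance.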

Compilation Latency

JIT differentiation introduces compilation latency.

This matters when:

the function runs only once
input shapes change often
interactive calls are small
control flow creates many traces
compilation is expensive

For small computations, compilation time can exceed execution time.

AOT avoids this latency during execution. It pays compilation cost earlier.

A practical JIT system needs caching, incremental compilation, and heuristics that avoid compiling tiny computations where interpretation is faster.

Code Size

AOT can create large derivative binaries if many specialized variants are generated ahead of time.

JIT can create large caches if many shapes or static arguments occur at runtime.

Both systems must manage code size.

Common strategies include:

generic fallback code
shape polymorphism
limited specialization
cache eviction
separate compilation
lazy derivative generation

The tradeoff is specialization versus reuse.

Shape Polymorphism

Shape polymorphism narrows the gap between AOT and JIT.

Instead of compiling for a fixed shape:

tensor<f32, [32, 768]>

the compiler may compile for:

tensor<f32, [B, 768]>

This allows one derivative program to serve many batch sizes.

AOT systems gain flexibility. JIT systems reduce recompilation.

The cost is more complex generated code and weaker optimization when dimensions remain symbolic.
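A shape-polymorphic derivative can be sketched as a routine that fixes one dimension and leaves the batch dimension symbolic until call time. Everything here is illustrative: `grad_mean_square` computes the gradient of the mean of squares over a `[B, 768]`-style input represented as nested lists.

```python
FEATURES = 768   # fixed dimension: the compiler can optimize against it

def grad_mean_square(batch):
    # Works for any batch size B; only the feature width is fixed.
    assert all(len(row) == FEATURES for row in batch)
    B = len(batch)                    # symbolic dim, bound at call time
    # d/dv of mean(v^2 over B*FEATURES entries) = 2v / (B*FEATURES)
    return [[2.0 * v / (B * FEATURES) for v in row] for row in batch]

row = [1.0] * FEATURES
g32 = grad_mean_square([row] * 32)    # serves B = 32 ...
g64 = grad_mean_square([row] * 64)    # ... and B = 64 with the same code
```

Because `B` stays symbolic, optimizations that depend on its concrete value (unrolling, exact memory planning) are unavailable, which is the cost noted above.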

Debugging

AOT derivative code is often easier to inspect.

A user or compiler engineer may look at generated source, IR, or assembly before execution.

JIT derivative code is harder to inspect because it is generated during execution. The user may need tools to dump traces, graphs, lowered IR, optimized IR, and generated kernels.

Good JIT systems provide diagnostics:

why did this recompile?
what shapes were specialized?
what graph was captured?
which operations were fused?
which values were saved for backward?

Without these tools, performance behavior can feel unpredictable.

Safety and Reproducibility

AOT differentiation has simpler reproducibility properties.

The derivative program is fixed before execution. Runtime behavior depends mostly on inputs and environment.

JIT differentiation may depend on runtime cache state, compiler versions, device availability, and specialization choices. Two runs may compile at different times or choose different optimized forms.

For high-assurance computing, AOT is often easier to validate.

For exploratory machine learning, JIT flexibility is usually worth the added complexity.

Hybrid Systems

Many modern AD systems are hybrid.

They may:

trace dynamically
compile just in time
export ahead-of-time artifacts
cache compiled executables
support static subgraphs inside dynamic programs
fall back to eager execution for unsupported regions

A hybrid pipeline may look like:

eager user code
    -> capture hot region
    -> JIT compile derivative
    -> optionally export AOT executable

This approach gives interactive development and production deployment from the same program model.

Choosing AOT or JIT

AOT differentiation is usually better when:

program structure is fixed
deployment must be predictable
runtime compilation is unavailable
latency constraints are strict
certification or audit matters
embedded or HPC batch execution dominates

JIT differentiation is usually better when:

programs are interactive
input shapes and types are discovered at runtime
the same specialized computation repeats many times
dynamic language usability matters
backend-specific optimization is important

The decision is architectural. It affects language design, compiler design, runtime design, debugging tools, and deployment model.

Summary

Ahead-of-time and just-in-time differentiation both generate derivative programs.

AOT performs the transformation before execution, producing predictable compiled artifacts and enabling deep static analysis. JIT performs the transformation during execution, using concrete runtime information to specialize and optimize derivative code.

AOT favors predictability, auditability, and deployment simplicity. JIT favors interactivity, specialization, and high performance for repeated tensor workloads.

Most serious AD systems eventually blend both models: dynamic capture for usability, compiler IR for optimization, caching for repeated execution, and export paths for production use.