Automatic differentiation can be performed before a program runs, while it runs, or in a staged phase between the two.
The two main compiler-oriented models are ahead-of-time differentiation and just-in-time differentiation.
Ahead-of-time differentiation, or AOT differentiation, generates derivative code before execution. Just-in-time differentiation, or JIT differentiation, generates derivative code during execution, usually after seeing concrete input types, shapes, or runtime configuration.
Both models treat AD as program generation. They differ in when the derivative program is produced.
Ahead-of-Time Differentiation
In AOT differentiation, the AD system transforms a known program into derivative code before the program is run.
A typical pipeline is:
```
source program
-> parse
-> lower to IR
-> differentiate
-> optimize
-> compile
-> ship executable
```

The derivative function becomes part of the compiled artifact.
For example:
```
f(x) -> y
```

may be compiled into:

```
df(x, dx) -> (y, dy)
```

for forward mode, or:

```
f_primal(x) -> (y, residuals)
f_pullback(residuals, bar_y) -> bar_x
```

for reverse mode.
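The two signatures can be illustrated with a hand-written sketch for a toy function, here f(x) = x² + 3x (the function and all names are illustrative, not from any particular AD system):

```python
# Hand-written sketch of the two derivative signatures for the toy
# function f(x) = x * x + 3 * x.

def f(x):
    return x * x + 3 * x

# Forward mode: df(x, dx) -> (y, dy) propagates a tangent alongside the primal.
def df(x, dx):
    y = x * x + 3 * x
    dy = (2 * x + 3) * dx        # directional derivative f'(x) * dx
    return y, dy

# Reverse mode: a primal pass that saves residuals, and a pullback that
# maps an output adjoint bar_y to an input adjoint bar_x.
def f_primal(x):
    y = x * x + 3 * x
    residuals = (x,)             # values the pullback will need
    return y, residuals

def f_pullback(residuals, bar_y):
    (x,) = residuals
    bar_x = (2 * x + 3) * bar_y  # transpose of the linearized map
    return bar_x
```

An AOT system emits code like this before the program ever runs, then compiles it alongside the rest of the artifact.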
AOT differentiation is common in scientific computing, embedded systems, high-performance simulation, and languages with strong compiler infrastructure.
Strengths of AOT Differentiation
AOT differentiation gives the compiler maximum time to analyze and optimize the derivative program.
It can perform:
- whole-program analysis
- activity analysis
- alias analysis
- memory planning
- checkpoint placement
- loop optimization
- vectorization
- static scheduling
- code generation for fixed targets

Because the derivative code is produced before deployment, runtime behavior is predictable. There is no first-call compilation pause. This matters in production services, real-time control, embedded systems, and batch jobs with strict scheduling.
AOT also fits environments where dynamic code generation is unavailable or undesirable.
Limits of AOT Differentiation
AOT differentiation needs the program structure early.
This is easy for static languages and fixed simulations. It is harder for dynamic programs where control flow, tensor shapes, or operator choices depend on runtime values.
If the compiler must handle all possible shapes and branches, the derivative program may become generic and slower. If it specializes too much, it may need many compiled variants.
AOT systems also face a deployment problem: users must decide in advance which functions and derivative modes to compile.
Just-in-Time Differentiation
In JIT differentiation, the system waits until runtime to capture or specialize the program.
A typical pipeline is:
```
first call
-> inspect inputs
-> trace or lower program
-> specialize
-> differentiate
-> optimize
-> compile
-> execute

later calls
-> reuse cached compiled derivative
```

The first call pays compilation cost. Later calls can be fast.
JIT differentiation is common in array systems, machine learning frameworks, and interactive numerical programming.
Why JIT Works Well for Tensor Programs
Tensor programs often have stable structure across many calls.
A training step may run thousands or millions of times with the same shapes and dtypes:
- parameters: fixed shapes
- batch: usually fixed shape
- loss function: fixed graph
- optimizer step: fixed update structure

JIT can specialize once, then reuse the compiled derivative program.
The compiler can exploit concrete information:
```
dtype = float32
shape = [1024, 4096]
device = GPU
layout = row-major or tiled
```

This enables kernel fusion, static memory planning, and backend-specific code generation.
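To make "specialize on concrete shapes" concrete, here is a toy code generator, assuming a gradient-of-sum kernel; the generated source stands in for the backend IR a real compiler would emit once the shape is known:

```python
def specialize_sum_grad(shape):
    # With the shape concrete at runtime, emit source specialized to
    # exactly that size. Real systems emit backend IR, not Python source.
    n = shape[0] * shape[1]
    src = f"def kernel(bar_y):\n    return [bar_y] * {n}\n"
    ns = {}
    exec(src, ns)          # "compile" the generated source
    return ns["kernel"]

# The gradient of sum() broadcasts the output adjoint to every input slot.
grad_sum = specialize_sum_grad((2, 3))   # specialized for a 2x3 input
```

Because `n` is baked in, the generated kernel has no shape checks or dynamic allocation logic left to run.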
Specialization and Recompilation
JIT systems rely on specialization.
A derivative compiled for one shape may not apply to another shape.
```
compiled for tensor<f32, [32, 768]>
not necessarily valid for tensor<f32, [64, 768]>
```

The system therefore uses cache keys.
A cache key may include:
- function identity
- input dtypes
- input shapes
- static argument values
- device
- compiler flags
- AD mode
- transformation stack

When the key changes, the system retraces or recompiles.
This is powerful, but users must understand which arguments are static and which are dynamic.
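A cache key along these lines can be sketched as a tuple. Static arguments contribute their concrete values (changing one forces recompilation); dynamic arguments contribute only dtype and shape. The field choices mirror the list above and are illustrative, not any specific framework's scheme:

```python
def cache_key(fn, args, static_argnums, device="gpu:0", ad_mode="reverse"):
    parts = [fn.__name__, device, ad_mode]
    for i, a in enumerate(args):
        if i in static_argnums:
            # Static argument: its value is baked into the key.
            parts.append(("static", a))
        else:
            # Dynamic argument: only its abstract signature matters.
            parts.append(("dyn", a["dtype"], tuple(a["shape"])))
    return tuple(parts)

k1 = cache_key(abs, [{"dtype": "f32", "shape": [32, 768]}, True], {1})
k2 = cache_key(abs, [{"dtype": "f32", "shape": [64, 768]}, True], {1})
# k1 != k2: a new batch size changes the key, triggering retrace/recompile.
```

Misclassifying an argument cuts both ways: marking a frequently changing value static causes recompilation churn, while treating a structural flag as dynamic forces one over-general compiled program.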
AOT vs JIT
| Dimension | AOT differentiation | JIT differentiation |
|---|---|---|
| When derivative is generated | Before execution | During execution |
| Runtime startup | Predictable | First call may compile |
| Optimization knowledge | Static program knowledge | Concrete runtime shapes and types |
| Dynamic control flow | Harder unless represented explicitly | Natural if traced from execution |
| Deployment | Compiled artifact | Runtime compiler/cache needed |
| Interactivity | Lower | Higher |
| Embedded use | Strong fit | Often poor fit |
| ML workloads | Useful but less flexible | Strong fit |
| Debugging | Easier to inspect generated code | Harder due to staging and cache |
| Shape specialization | Must be planned early | Natural |
Neither model is universally better. The correct choice depends on workload and deployment constraints.
Interaction with Source Transformation
AOT differentiation often uses source transformation.
```
source f
-> derivative source df
-> ordinary compiler
```

This gives visible derivative code and integrates well with conventional compiler pipelines.
JIT differentiation often uses tracing.
```
run f with tracer values
-> graph
-> derivative graph
-> optimized executable
```

But these are tendencies, not rules. A JIT system can use source transformation internally. An AOT system can compile traced examples into deployable artifacts.
The deeper distinction is timing, not technique.
AOT Reverse Mode
AOT reverse mode usually produces two compiled routines.
```
forward routine:
  compute primal outputs
  save residuals

reverse routine:
  consume residuals and output adjoints
  produce input adjoints
```

This split is useful because the reverse routine may run later or multiple times.
For example, in an optimization solver, one forward computation may be paired with several adjoint computations under different seeds.
The compiler must define the residual format carefully. It becomes part of the internal ABI between the forward and reverse routines.
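A minimal sketch of this split, for the toy function f(a, b) = a·b, shows the residual tuple acting as the ABI both routines must agree on (the layout and names are illustrative):

```python
def forward(a, b):
    y = a * b
    residuals = (a, b)             # ABI: (left operand, right operand)
    return y, residuals

def reverse(residuals, bar_y):
    a, b = residuals               # must match the layout forward produced
    return (bar_y * b, bar_y * a)  # input adjoints (bar_a, bar_b)

# One forward computation paired with several adjoint computations
# under different seeds, as in an optimization solver:
y, res = forward(3.0, 4.0)
adjoints = [reverse(res, seed) for seed in (1.0, 2.0)]
```

If the compiler reorders or prunes residuals during optimization, both routines must be regenerated together, which is why the residual format is an internal contract rather than a public interface.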
JIT Reverse Mode
JIT reverse mode may build the backward graph from one execution.
A dynamic system records:
- executed operations
- input and output references
- saved primal values
- local backward rules

Then backward() traverses the recorded graph.
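A minimal dynamic tape can be sketched as follows. Each operation records its output and local backward rules as it executes, and `backward()` replays the records in reverse; this is an illustration of the recorded-graph model, not any framework's actual tape:

```python
class Var:
    """A value participating in differentiation."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

tape = []  # records: (output var, [(input var, local backward rule), ...])

def add(a, b):
    out = Var(a.value + b.value)
    tape.append((out, [(a, lambda bar: bar), (b, lambda bar: bar)]))
    return out

def mul(a, b):
    out = Var(a.value * b.value)
    tape.append((out, [(a, lambda bar: bar * b.value),
                       (b, lambda bar: bar * a.value)]))
    return out

def backward(out, seed=1.0):
    out.grad = seed
    for node, pulls in reversed(tape):        # traverse recorded graph
        for inp, rule in pulls:
            inp.grad += rule(node.grad)       # accumulate input adjoints

x = Var(2.0)
y = Var(3.0)
z = mul(add(x, y), x)   # z = (x + y) * x
backward(z)             # x.grad = 2x + y, y.grad = x
```

Because the tape is rebuilt on every call, arbitrary Python control flow works, at the cost of re-recording overhead each iteration.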
A staged JIT system may instead compile a reusable backward function:
```
trace primal
-> build backward graph
-> compile forward plus backward
-> cache executable
```

The first model prioritizes flexibility. The second model prioritizes repeated performance.
Compilation Latency
JIT differentiation introduces compilation latency.
This matters when:
- the function runs only once
- input shapes change often
- interactive calls are small
- control flow creates many traces
- compilation is expensive

For small computations, compilation time can exceed execution time.
AOT avoids this latency during execution. It pays compilation cost earlier.
A practical JIT system needs caching, incremental compilation, and heuristics that avoid compiling tiny computations where interpretation is faster.
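One such heuristic is a simple amortization check: compile only when the expected call count pays back the compilation latency. The cost figures below are made-up placeholders, not measurements:

```python
def should_compile(est_compile_ms, est_interp_ms, est_compiled_ms,
                   expected_calls):
    # Total cost if we keep interpreting vs. compile once then run fast.
    interp_total = est_interp_ms * expected_calls
    compiled_total = est_compile_ms + est_compiled_ms * expected_calls
    return compiled_total < interp_total

once = should_compile(500.0, 0.2, 0.01, 1)         # tiny one-off: interpret
steps = should_compile(500.0, 0.2, 0.01, 100_000)  # training loop: compile
```

Real systems rarely know `expected_calls` in advance, so they approximate it with call counters, tiering (interpret first, compile after N calls), or user annotations.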
Code Size
AOT can create large derivative binaries if many specialized variants are generated ahead of time.
JIT can create large caches if many shapes or static arguments occur at runtime.
Both systems must manage code size.
Common strategies include:
- generic fallback code
- shape polymorphism
- limited specialization
- cache eviction
- separate compilation
- lazy derivative generation

The tradeoff is specialization versus reuse.
Shape Polymorphism
Shape polymorphism narrows the gap between AOT and JIT.
Instead of compiling for a fixed shape:
```
tensor<f32, [32, 768]>
```

the compiler may compile for:

```
tensor<f32, [B, 768]>
```

This allows one derivative program to serve many batch sizes.
AOT systems gain flexibility. JIT systems reduce recompilation.
The cost is more complex generated code and weaker optimization when dimensions remain symbolic.
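In cache-key terms, shape polymorphism means the symbolic dimension is erased from the key, so distinct batch sizes map to the same compiled program. A sketch, with the `"B"` marker as an illustrative convention rather than a real API:

```python
def poly_key(dtype, shape, symbolic_dims=(0,)):
    # Replace symbolic dimensions with a placeholder so they do not
    # participate in cache lookup.
    dims = tuple("B" if i in symbolic_dims else d
                 for i, d in enumerate(shape))
    return (dtype, dims)

k32 = poly_key("f32", (32, 768))   # ("f32", ("B", 768))
k64 = poly_key("f32", (64, 768))   # ("f32", ("B", 768))
# Same key: batch sizes 32 and 64 reuse one compiled derivative program.
```

The generated code must then treat `B` as a runtime value, which is exactly where the weaker optimization mentioned above comes from: loop bounds and buffer sizes can no longer be constant-folded.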
Debugging
AOT derivative code is often easier to inspect.
A user or compiler engineer may look at generated source, IR, or assembly before execution.
JIT derivative code is harder to inspect because it is generated during execution. The user may need tools to dump traces, graphs, lowered IR, optimized IR, and generated kernels.
Good JIT systems provide diagnostics:
- why did this recompile?
- what shapes were specialized?
- what graph was captured?
- which operations were fused?
- which values were saved for backward?

Without these tools, performance behavior can feel unpredictable.
Safety and Reproducibility
AOT differentiation has simpler reproducibility properties.
The derivative program is fixed before execution. Runtime behavior depends mostly on inputs and environment.
JIT differentiation may depend on runtime cache state, compiler versions, device availability, and specialization choices. Two runs may compile at different times or choose different optimized forms.
For high-assurance computing, AOT is often easier to validate.
For exploratory machine learning, JIT flexibility is usually worth the added complexity.
Hybrid Systems
Many modern AD systems are hybrid.
They may:
- trace dynamically
- compile just in time
- export ahead-of-time artifacts
- cache compiled executables
- support static subgraphs inside dynamic programs
- fall back to eager execution for unsupported regions

A hybrid pipeline may look like:
```
eager user code
-> capture hot region
-> JIT compile derivative
-> optionally export AOT executable
```

This approach gives interactive development and production deployment from the same program model.
Choosing AOT or JIT
AOT differentiation is usually better when:
- program structure is fixed
- deployment must be predictable
- runtime compilation is unavailable
- latency constraints are strict
- certification or audit matters
- embedded or HPC batch execution dominates

JIT differentiation is usually better when:
- programs are interactive
- input shapes and types are discovered at runtime
- the same specialized computation repeats many times
- dynamic language usability matters
- backend-specific optimization is important

The decision is architectural. It affects language design, compiler design, runtime design, debugging tools, and deployment model.
Summary
Ahead-of-time and just-in-time differentiation both generate derivative programs.
AOT performs the transformation before execution, producing predictable compiled artifacts and enabling deep static analysis. JIT performs the transformation during execution, using concrete runtime information to specialize and optimize derivative code.
AOT favors predictability, auditability, and deployment simplicity. JIT favors interactivity, specialization, and high performance for repeated tensor workloads.
Most serious AD systems eventually blend both models: dynamic capture for usability, compiler IR for optimization, caching for repeated execution, and export paths for production use.