
AD in Julia


Julia was designed for high-performance technical computing. It combines interactive syntax with a compiler capable of specializing code aggressively based on types. This makes it a strong environment for automatic differentiation because AD systems can operate close to ordinary mathematical code while still generating optimized machine code.

Unlike Python, where tensor operations are often dispatched into external runtimes, Julia programs are commonly written directly in the host language itself. Numerical kernels, array operations, loops, and scientific algorithms are all expressed in Julia. As a result, Julia AD systems frequently differentiate ordinary language-level programs rather than only tensor graphs.

Multiple Dispatch and Generic Programming

Julia’s core abstraction mechanism is multiple dispatch.

Functions specialize on argument types:

f(x) = (x + 1) * sin(x)

The same function can run on:

  • Float64
  • Complex numbers
  • Dual numbers
  • Static arrays
  • GPU tensors
  • Reverse-mode tracked values

This makes AD integration natural. An AD library introduces new numeric types, and Julia dispatches specialized methods automatically.
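For example, the single definition of f above runs unchanged on several numeric types:

f(2.0)            # Float64
f(1.0 + 2.0im)    # Complex{Float64}
f(big"2.0")       # BigFloat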

Forward Mode with Dual Numbers

Forward mode in Julia is commonly implemented with dual numbers.

A simplified representation:

struct Dual{T}
    x::T     # primal value
    dx::T    # tangent (derivative) value
end

Arithmetic propagates tangent values:

Base.:+(a::Dual, b::Dual) =
    Dual(a.x + b.x, a.dx + b.dx)

Base.:*(a::Dual, b::Dual) =
    Dual(
        a.x * b.x,
        a.dx * b.x + a.x * b.dx
    )

Elementary functions define derivative propagation:

Base.sin(a::Dual) =
    Dual(sin(a.x), cos(a.x) * a.dx)
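For mixed expressions such as x + 1, where one operand is an ordinary number, the simplified Dual type above also needs promotion methods. A minimal sketch:

Base.:+(a::Dual, b::Real) = Dual(a.x + b, a.dx)
Base.:+(a::Real, b::Dual) = Dual(a + b.x, b.dx)
Base.:*(a::Dual, b::Real) = Dual(a.x * b, a.dx * b)
Base.:*(a::Real, b::Dual) = Dual(a * b.x, a * b.dx)
# A full implementation would also promote the field types when they differ.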

A generic Julia function automatically becomes differentiable:

f(x) = (x + 1) * sin(x)

Evaluating:

f(Dual(2.0, 1.0))

returns both the primal value and derivative.

This works because Julia numerical code is usually written generically rather than hardcoded to concrete scalar types.

Type Specialization

Julia compiles methods specialized to concrete argument types.

If:

f(x::Float64)

and:

f(x::Dual{Float64})

are called, the compiler generates separate optimized machine code for each.

This gives several advantages for AD:

  • Static specialization: removes dynamic dispatch overhead
  • Inlining: derivative propagation becomes efficient
  • SIMD/vectorization: works on differentiated code
  • Loop optimization: AD inside loops remains fast
  • Generic reuse: the same source works for many numeric domains

Forward-mode dual arithmetic therefore performs well in Julia compared to many dynamic languages.

Reverse Mode in Julia

Reverse mode is essential for machine learning and optimization workloads with many parameters.

Julia reverse-mode systems usually construct pullbacks or tapes representing adjoint propagation.

For:

y = f(x)

the AD system computes:

  1. The primal output
  2. A backward propagation function

Conceptually:

(y, back) = pullback(f, x)

where:

dx = back(dy)

propagates output adjoints backward into input adjoints.

This functional pullback representation has become central to modern Julia AD systems.
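Zygote.jl, for instance, provides this interface directly; a small usage example (assuming Zygote is installed):

using Zygote

y, back = Zygote.pullback(sin, 0.5)   # y == sin(0.5)
dx, = back(1.0)                       # dx == cos(0.5)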

Source-to-Source AD

One influential Julia approach is source-to-source differentiation.

Instead of tracing runtime execution, the AD system transforms Julia intermediate representations into derivative code.

A simplified primal function:

f(x) = x * sin(x)

may conceptually transform into:

function f_pullback(x)
    y1 = sin(x)
    y2 = x * y1

    function back(dy)
        dx = dy * y1 + dy * x * cos(x)
        return dx
    end

    return y2, back
end

The derivative is represented directly as Julia code.

Advantages include:

  • Native language semantics: AD works on ordinary Julia code
  • Compiler visibility: the optimizer sees derivative code
  • Control-flow support: loops and branches differentiate naturally
  • Higher-order support: derivative code is itself Julia
  • Composability: AD interacts with generic dispatch

This approach differs from tensor-only graph systems.

Julia Intermediate Representations

Julia exposes multiple compiler IR stages.

  • AST: parsed syntax
  • Lowered IR: simplified control flow
  • Typed IR: type-inferred representation
  • LLVM IR: low-level optimized representation
  • Machine code: final compiled execution

AD systems can operate at different levels.

  • AST transformation: simple, but limited semantic information
  • Lowered IR: explicit control flow
  • Typed IR: full type information
  • LLVM IR: hardware-level optimization

Many Julia AD tools operate around typed IR because it preserves language semantics while exposing compiler information.
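These stages can be inspected from the REPL with Julia's reflection macros, which also shows what an AD transform operating at each level gets to see (illustrative snippet):

using InteractiveUtils

g(x) = x * sin(x)

@code_lowered g(1.0)   # lowered IR
@code_typed g(1.0)     # typed, type-inferred IR
@code_llvm g(1.0)      # LLVM IR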

Control Flow

Julia AD systems typically support ordinary control flow directly.

Example:

function f(x)
    y = 0.0

    for i in eachindex(x)
        if x[i] > 0
            y += x[i]^2
        else
            y -= x[i]
        end
    end

    return y
end

The differentiated program follows the executed path.
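Forward mode, for instance, simply evaluates the executed branch with dual numbers; a sketch using ForwardDiff.jl on the function above:

using ForwardDiff

ForwardDiff.gradient(f, [1.0, -2.0, 3.0])
# [2.0, -1.0, 6.0]: 2*x[i] on the positive branch, -1.0 on the other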

This is important because Julia scientific code frequently contains:

  • Adaptive solvers
  • Iterative algorithms
  • Dynamic branching
  • Recursive methods
  • Simulation loops

The AD system must therefore differentiate general programs, not only static computation graphs.

Mutation and Arrays

Mutation is one of the difficult areas for Julia AD.

Julia arrays are mutable:

x[1] = 5

Reverse mode must define how adjoints behave through updates.

Problems include:

  • Aliasing: multiple references to the same array
  • In-place updates: overwriting primal values that the backward pass still needs
  • Broadcasting mutation: elementwise adjoint accumulation
  • Views/slices: shared memory regions
  • Sparse updates: scatter-style reverse rules

Some Julia AD systems restrict mutation or transform mutating code into functional equivalents internally.

Others provide specialized adjoint rules for common mutation patterns.
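As a sketch of the functional-rewrite strategy, an in-place accumulation can often be re-expressed without mutation so that reverse mode applies directly:

# Mutating version: writes into y element by element
function loss_mut(x)
    y = zeros(length(x))
    for i in eachindex(x)
        y[i] = x[i]^2
    end
    return sum(y)
end

# Non-mutating equivalent: broadcasting, no in-place writes
loss_fn(x) = sum(x .^ 2)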

Broadcast Fusion

Julia supports fused broadcasting:

y .= sin.(x) .+ x .* x

This lowers into a fused loop rather than allocating intermediates.

For AD, this matters because naive differentiation may accidentally break fusion and introduce temporary arrays.
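For example, if differentiation falls back to materializing each step of the fused expression above, the single loop becomes several allocating passes:

# Unfused equivalent: each line allocates a temporary array
t1 = sin.(x)
t2 = x .* x
y = t1 .+ t2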

A high-performance Julia AD system should preserve:

  • Loop fusion
  • SIMD vectorization
  • Allocation elimination
  • Static dispatch

Otherwise the derivative program may be much slower than the primal program.

Higher-Order Differentiation

Julia’s compositional design makes nested AD natural.

Example:

gradient(x -> gradient(f, x)[1], x)

This computes second derivatives.

Because derivative transformations themselves produce Julia functions, higher-order differentiation becomes recursive transformation over ordinary code.
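For example, ForwardDiff composes with itself, so a Hessian can be requested directly (illustrative, assuming ForwardDiff.jl):

using ForwardDiff

g(x) = sum(x .^ 2) + x[1] * x[2]

ForwardDiff.hessian(g, [1.0, 2.0])
# 2×2 result: [2.0 1.0; 1.0 2.0]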

This supports:

  • Hessians: curvature methods
  • Hessian-vector products: large-scale optimization
  • Meta-gradients: differentiating optimizers
  • Implicit differentiation: solver sensitivities
  • Differential equations: sensitivity analysis

The main challenges are perturbation confusion and tape interaction across nested passes.

AD for Scientific Computing

Julia is heavily used in scientific computing, so its AD ecosystem emphasizes more than neural networks.

Typical workloads include:

  • Differential equations: sensitivity analysis
  • Optimization: gradients and Hessians
  • PDE solvers: parameter inference
  • Probabilistic programming: gradient-based inference
  • Control systems: trajectory optimization
  • Physics simulation: differentiable dynamics
  • Computational biology: parameter fitting

This shaped Julia AD systems toward differentiating general numerical programs rather than only tensor layers.

Differentiating Through Solvers

Scientific code often calls iterative solvers.

Example:

x = solve(A, b)

Naively differentiating the implementation may unroll every iteration. This can be expensive and numerically unstable.

Julia AD frameworks increasingly support implicit differentiation. For the linear system

A x = b

differentiating both sides gives

dA x + A dx = db

which can be solved directly for dx = A \ (db - dA x).

This avoids differentiating every internal solver iteration.
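In the Julia ecosystem, one way to encode this is a custom reverse rule through ChainRulesCore, so AD systems that consume those rules never enter the solver. A hedged sketch for a hypothetical mysolve wrapper (not a library function):

using ChainRulesCore, LinearAlgebra

mysolve(A, b) = A \ b   # hypothetical wrapper around a direct solve

function ChainRulesCore.rrule(::typeof(mysolve), A, b)
    x = A \ b
    function mysolve_pullback(x̄)
        x̄v = unthunk(x̄)
        b̄ = A' \ x̄v      # adjoint with respect to b
        Ā = -b̄ * x'      # adjoint with respect to A
        return NoTangent(), Ā, b̄
    end
    return x, mysolve_pullback
end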

GPU Support

Julia supports GPU execution through native language tooling.

A tensor operation may compile directly into GPU kernels.

AD systems therefore must support:

  • GPU-compatible pullbacks: reverse mode on device
  • Kernel differentiation: gradients through GPU code
  • Memory synchronization: host-device consistency
  • Broadcast lowering: efficient fused kernels
  • Mixed precision: accelerator efficiency

Since Julia kernels are written in Julia itself, AD can sometimes operate directly on GPU-level code.
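As a rough illustration (assuming CUDA.jl and Zygote.jl with an NVIDIA GPU available), the same reverse-mode call works on device arrays:

using CUDA, Zygote

x = CUDA.rand(Float32, 1024)
loss(v) = sum(sin.(v) .+ v .* v)

g, = Zygote.gradient(loss, x)   # g is a GPU array with the same shape as x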

Compiler Integration

Julia’s compiler visibility enables unusually deep AD integration.

The compiler can optimize differentiated code using ordinary optimization passes:

  • Dead code elimination
  • Inlining
  • Constant propagation
  • Loop optimization
  • Escape analysis
  • Allocation removal

This is a major advantage over black-box runtime tracing systems.

The derivative program becomes an ordinary compilable Julia program.

Major Julia AD Systems

Representative Julia AD systems include:

  • ForwardDiff.jl: dual-number forward mode
  • ReverseDiff.jl: tape-based reverse mode
  • Zygote.jl: source-to-source reverse mode
  • Tracker.jl: dynamic graph reverse mode
  • Enzyme.jl: LLVM IR differentiation
  • Diffractor.jl: compiler-integrated experimentation
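Their user-facing entry points look similar; for a scalar function, forward and reverse mode might be invoked like this (illustrative, assuming ForwardDiff.jl and Zygote.jl):

using ForwardDiff, Zygote

h(x) = (x + 1) * sin(x)

ForwardDiff.derivative(h, 2.0)   # forward mode via dual numbers
Zygote.gradient(h, 2.0)[1]       # source-to-source reverse mode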

These systems explore different tradeoffs between:

  • Performance
  • Mutation support
  • Compiler integration
  • GPU compatibility
  • Higher-order differentiation
  • Language coverage

Performance Characteristics

Julia AD systems often achieve strong performance because:

  • Specialization: efficient derivative kernels
  • Generic numerical code: AD integrates naturally
  • Compiler optimization: derivative code is optimized normally
  • Multiple dispatch: clean derivative extensibility
  • Low-level access: efficient tensor and solver integration

However, performance still depends heavily on allocation patterns, mutation handling, and compiler inference quality.

Design Philosophy

Julia AD systems tend to treat differentiation as a compiler and language problem rather than only a runtime graph problem.

The important idea is that numerical programs themselves should remain ordinary Julia programs.

A user writes:

loss(params)

and the AD system transforms, specializes, optimizes, and compiles the derivative computation automatically.

This aligns closely with the broader goal of differentiable programming: gradients should emerge from ordinary programs rather than requiring a separate graph language or restricted execution environment.