# AD in Julia

Julia was designed for high-performance technical computing. It combines the interactivity of a dynamic language with a compiler that specializes code aggressively based on argument types. This makes it a strong environment for automatic differentiation because AD systems can operate close to ordinary mathematical code while still generating optimized machine code.

Unlike Python, where tensor operations are often dispatched into external runtimes, Julia programs are commonly written directly in the host language itself. Numerical kernels, array operations, loops, and scientific algorithms are all expressed in Julia. As a result, Julia AD systems frequently differentiate ordinary language-level programs rather than only tensor graphs.

### Multiple Dispatch and Generic Programming

Julia’s core abstraction mechanism is multiple dispatch.

Functions specialize on argument types:

```julia
f(x) = (x + 1) * sin(x)
```

The same function can run on:

- `Float64`
- Complex numbers
- Dual numbers
- Static arrays
- GPU tensors
- Reverse-mode tracked values

This makes AD integration natural. An AD library introduces new numeric types, and Julia dispatches specialized methods automatically.
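As a minimal sketch, the same generic definition yields a specialized method instance for each argument type it is called with:

```julia
f(x) = (x + 1) * sin(x)

f(2.0)           # compiled and specialized for Float64
f(2.0 + 1.0im)   # same source, specialized for ComplexF64
```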

### Forward Mode with Dual Numbers

Forward mode in Julia is commonly implemented with dual numbers.

A simplified representation:

```julia
struct Dual{T}
    x::T
    dx::T
end
```

Arithmetic propagates tangent values:

```julia
Base.:+(a::Dual, b::Dual) =
    Dual(a.x + b.x, a.dx + b.dx)

# needed so expressions like `x + 1` work: constants carry zero tangent
Base.:+(a::Dual, b::Number) =
    Dual(a.x + b, a.dx)

Base.:*(a::Dual, b::Dual) =
    Dual(
        a.x * b.x,
        a.dx * b.x + a.x * b.dx
    )
```

Elementary functions define derivative propagation:

```julia
Base.sin(a::Dual) =
    Dual(sin(a.x), cos(a.x) * a.dx)
```

A generic Julia function automatically becomes differentiable:

```julia
f(x) = (x + 1) * sin(x)
```

Evaluating:

```julia
f(Dual(2.0, 1.0))
```

returns both the primal value and derivative.

This works because Julia numerical code is usually written generically rather than hardcoding concrete scalar types.
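Putting the pieces together, a self-contained sketch (the type and rules are repeated here for completeness, including the `Dual + Number` method that makes `x + 1` work):

```julia
struct Dual{T}
    x::T   # primal value
    dx::T  # tangent value
end

Base.:+(a::Dual, b::Dual) = Dual(a.x + b.x, a.dx + b.dx)
Base.:+(a::Dual, b::Number) = Dual(a.x + b, a.dx)  # constants carry zero tangent
Base.:*(a::Dual, b::Dual) = Dual(a.x * b.x, a.dx * b.x + a.x * b.dx)
Base.sin(a::Dual) = Dual(sin(a.x), cos(a.x) * a.dx)

f(x) = (x + 1) * sin(x)

d = f(Dual(2.0, 1.0))
d.x   # primal:     (2 + 1) * sin(2)
d.dx  # derivative: sin(2) + (2 + 1) * cos(2)
```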

### Type Specialization

Julia compiles methods specialized to concrete argument types.

If `f` is called with both argument types:

```julia
f(1.0)            # Float64
f(Dual(1.0, 1.0)) # Dual{Float64}
```

the compiler generates separate optimized machine code for each concrete signature.

This gives several advantages for AD:

| Benefit | Explanation |
|---|---|
| Static specialization | Removes dynamic dispatch overhead |
| Inlining | Derivative propagation becomes efficient |
| SIMD/vectorization | Works on differentiated code |
| Loop optimization | AD inside loops remains fast |
| Generic reuse | Same source works for many numeric domains |

Forward-mode dual arithmetic therefore performs well in Julia compared to many dynamic languages.

### Reverse Mode in Julia

Reverse mode is essential for machine learning and optimization workloads with many parameters.

Julia reverse-mode systems usually construct pullbacks or tapes representing adjoint propagation.

For:

```julia
y = f(x)
```

the AD system computes:

1. The primal output
2. A backward propagation function

Conceptually:

```julia
(y, back) = pullback(f, x)
```

where:

```julia
dx = back(dy)
```

propagates output adjoints backward into input adjoints.

This functional pullback representation became central in modern Julia AD systems.
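A hand-written pullback for the running example illustrates the contract; this is a sketch of what a reverse-mode system would construct automatically, not any particular library's output:

```julia
# f(x) = (x + 1) * sin(x), written in pullback form
function pullback_f(x)
    s = sin(x)
    y = (x + 1) * s
    # back maps an output adjoint dy to the input adjoint dx
    back(dy) = dy * (s + (x + 1) * cos(x))
    return y, back
end

y, back = pullback_f(2.0)
dx = back(1.0)   # derivative of f at 2.0
```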

### Source-to-Source AD

One influential Julia approach is source-to-source differentiation.

Instead of tracing runtime execution, the AD system transforms Julia intermediate representations into derivative code.

A simplified primal function:

```julia
f(x) = x * sin(x)
```

may conceptually transform into:

```julia
function f_pullback(x)
    y1 = sin(x)
    y2 = x * y1

    function back(dy)
        dx = dy * y1 + dy * x * cos(x)
        return dx
    end

    return y2, back
end
```

The derivative is represented directly as Julia code.

Advantages include:

| Property | Benefit |
|---|---|
| Native language semantics | AD works on ordinary Julia code |
| Compiler visibility | Optimizer sees derivative code |
| Control-flow support | Loops and branches differentiate naturally |
| Higher-order support | Derivative code itself is Julia |
| Composability | AD interacts with generic dispatch |

This approach differs from tensor-only graph systems.

### Julia Intermediate Representations

Julia exposes multiple compiler IR stages.

| IR stage | Role |
|---|---|
| AST | Parsed syntax |
| Lowered IR | Simplified control flow |
| Typed IR | Type-inferred representation |
| LLVM IR | Low-level optimized representation |
| Machine code | Final compiled execution |

AD systems can operate at different levels.

| Level | Characteristics |
|---|---|
| AST transformation | Simple but limited semantic information |
| Lowered IR | Explicit control flow |
| Typed IR | Full type information |
| LLVM IR | Hardware-level optimization |

Many Julia AD tools operate around typed IR because it preserves language semantics while exposing compiler information.

### Control Flow

Julia AD systems typically support ordinary control flow directly.

Example:

```julia
function f(x)
    y = 0.0

    for i in eachindex(x)
        if x[i] > 0
            y += x[i]^2
        else
            y -= x[i]
        end
    end

    return y
end
```

The differentiated program follows the executed path.

This is important because Julia scientific code frequently contains:

- Adaptive solvers
- Iterative algorithms
- Dynamic branching
- Recursive methods
- Simulation loops

The AD system must therefore differentiate general programs, not only static computation graphs.
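As a sketch of path-following differentiation, the reverse pass for the function above can accumulate gradients along the branch actually taken. Here `f_and_grad` is a hypothetical hand-written illustration of what an AD system produces, not a library API:

```julia
function f_and_grad(x)
    y = 0.0
    g = zero(x)
    for i in eachindex(x)
        if x[i] > 0
            y += x[i]^2
            g[i] = 2 * x[i]   # adjoint along the branch that executed
        else
            y -= x[i]
            g[i] = -1.0
        end
    end
    return y, g
end

f_and_grad([1.0, -2.0])   # (3.0, [2.0, -1.0])
```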

### Mutation and Arrays

Mutation is one of the difficult areas for Julia AD.

Julia arrays are mutable:

```julia
x[1] = 5
```

Reverse mode must define how adjoints behave through updates.

Problems include:

| Problem | Example |
|---|---|
| Aliasing | Multiple references to same array |
| In-place updates | Overwriting needed primal values |
| Broadcasting mutation | Elementwise adjoint accumulation |
| Views/slices | Shared memory regions |
| Sparse updates | Scatter-style reverse rules |

Some Julia AD systems restrict mutation or transform mutating code into functional equivalents internally.

Others provide specialized adjoint rules for common mutation patterns.
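As one sketch of a scatter-style reverse rule: the forward pass of indexing gathers one element, so its adjoint scatters the output adjoint back into that position (`getindex_pullback` is a hypothetical helper for illustration):

```julia
# pullback for y = x[i]
function getindex_pullback(x, i)
    y = x[i]
    function back(dy)
        dx = zero(x)   # adjoint array, zero everywhere else
        dx[i] = dy     # scatter the adjoint into position i
        return dx
    end
    return y, back
end

y, back = getindex_pullback([1.0, 2.0, 3.0], 2)
back(1.0)   # [0.0, 1.0, 0.0]
```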

### Broadcast Fusion

Julia supports fused broadcasting:

```julia
y .= sin.(x) .+ x .* x
```

This lowers into a fused loop rather than allocating intermediates.

For AD, this matters because naive differentiation may accidentally break fusion and introduce temporary arrays.

A high-performance Julia AD system should preserve:

- Loop fusion
- SIMD vectorization
- Allocation elimination
- Static dispatch

Otherwise the derivative program may be much slower than the primal program.
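For the broadcast above, a fusion-preserving forward-mode derivative is itself a single fused broadcast. A sketch, with `dy` holding elementwise tangents for unit input tangents:

```julia
x  = [0.0, 1.0, 2.0]
y  = similar(x)
dy = similar(x)

y  .= sin.(x) .+ x .* x   # primal: one fused loop, no temporaries
dy .= cos.(x) .+ 2 .* x   # derivative: also one fused loop
```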

### Higher-Order Differentiation

Julia’s compositional design makes nested AD natural.

Example:

```julia
gradient(x -> gradient(f, x)[1], x)
```

This computes second derivatives.

Because derivative transformations themselves produce Julia functions, higher-order differentiation becomes recursive transformation over ordinary code.
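A minimal sketch of this recursion: a dual type with generic components nests on itself, so a second derivative is one forward pass run through another. (This simple case avoids perturbation confusion only because the nesting is trivial; real systems tag perturbation levels.)

```julia
struct D{T}
    x::T
    dx::T
end

Base.:+(a::D, b::D) = D(a.x + b.x, a.dx + b.dx)
Base.:*(a::D, b::D) = D(a.x * b.x, a.dx * b.x + a.x * b.dx)
Base.sin(a::D) = D(sin(a.x), cos(a.x) * a.dx)
Base.cos(a::D) = D(cos(a.x), -sin(a.x) * a.dx)
Base.one(a::D) = D(one(a.x), zero(a.x))
Base.zero(a::D) = D(zero(a.x), zero(a.x))

derivative(f, x) = f(D(x, one(x))).dx

# nesting: values of type D{D{Float64}} flow through the same generic code
d2 = derivative(x -> derivative(sin, x), 1.0)   # -sin(1.0)
```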

This supports:

| Operation | Use |
|---|---|
| Hessians | Curvature methods |
| Hessian-vector products | Large-scale optimization |
| Meta-gradients | Differentiating optimizers |
| Implicit differentiation | Solver sensitivities |
| Differential equations | Sensitivity analysis |

The challenge is perturbation confusion and tape interaction across nested passes.

### AD for Scientific Computing

Julia is heavily used in scientific computing, so its AD ecosystem emphasizes workloads well beyond neural networks.

Typical workloads include:

| Domain | AD use |
|---|---|
| Differential equations | Sensitivity analysis |
| Optimization | Gradients and Hessians |
| PDE solvers | Parameter inference |
| Probabilistic programming | Gradient-based inference |
| Control systems | Trajectory optimization |
| Physics simulation | Differentiable dynamics |
| Computational biology | Parameter fitting |

This shaped Julia AD systems toward differentiating general numerical programs rather than only tensor layers.

### Differentiating Through Solvers

Scientific code often calls iterative solvers.

Example:

```julia
x = solve(A, b)
```

Naively differentiating the implementation may unroll every iteration. This can be expensive and numerically unstable.

Julia AD frameworks increasingly support implicit differentiation:

$$
A x = b
$$

Differentiate both sides:

$$
dA \, x + A \, dx = db
$$

and solve directly for `dx`.

$$
A x = b
$$

This avoids differentiating every internal solver iteration.
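A worked sketch for the linear case, using the stdlib `LinearAlgebra` backslash solve; the perturbations `dA` and `db` are hypothetical values chosen for illustration:

```julia
using LinearAlgebra

A = [2.0 0.0; 0.0 4.0]
b = [2.0, 4.0]
x = A \ b                  # primal solve

# implicit rule: A dx = db - dA * x  =>  dx = A \ (db - dA * x)
dA = [0.0 0.0; 0.0 0.0]
db = [1.0, 0.0]
dx = A \ (db - dA * x)     # one extra solve, no iteration unrolling
```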

### GPU Support

Julia supports GPU execution through native language tooling.

A tensor operation may compile directly into GPU kernels.

AD systems therefore must support:

| Requirement | Purpose |
|---|---|
| GPU-compatible pullbacks | Reverse mode on device |
| Kernel differentiation | Gradients through GPU code |
| Memory synchronization | Host-device consistency |
| Broadcast lowering | Efficient fused kernels |
| Mixed precision | Accelerator efficiency |

Since Julia kernels are written in Julia itself, AD can sometimes operate directly on GPU-level code.

### Compiler Integration

Julia’s compiler visibility enables unusually deep AD integration.

The compiler can optimize differentiated code using ordinary optimization passes:

- Dead code elimination
- Inlining
- Constant propagation
- Loop optimization
- Escape analysis
- Allocation removal

This is a major advantage over black-box runtime tracing systems.

The derivative program becomes an ordinary compilable Julia program.

### Major Julia AD Systems

Representative Julia AD systems include:

| System | Main style |
|---|---|
| ForwardDiff.jl | Dual-number forward mode |
| ReverseDiff.jl | Tape-based reverse mode |
| Zygote.jl | Source-to-source reverse mode |
| Tracker.jl | Dynamic graph reverse mode |
| Enzyme.jl | LLVM IR differentiation |
| Diffractor.jl | Compiler-integrated experimentation |

These systems explore different tradeoffs between:

- Performance
- Mutation support
- Compiler integration
- GPU compatibility
- Higher-order differentiation
- Language coverage

### Performance Characteristics

Julia AD systems often achieve strong performance because:

| Feature | Effect |
|---|---|
| Specialization | Efficient derivative kernels |
| Generic numerical code | AD integrates naturally |
| Compiler optimization | Derivative code optimized normally |
| Multiple dispatch | Clean derivative extensibility |
| Low-level access | Efficient tensor and solver integration |

However, performance still depends heavily on allocation patterns, mutation handling, and compiler inference quality.

### Design Philosophy

Julia AD systems tend to treat differentiation as a compiler and language problem rather than only a runtime graph problem.

The important idea is that numerical programs themselves should remain ordinary Julia programs.

A user writes:

```julia
loss(params)
```

and the AD system transforms, specializes, optimizes, and compiles the derivative computation automatically.

This aligns closely with the broader goal of differentiable programming: gradients should emerge from ordinary programs rather than requiring a separate graph language or restricted execution environment.

