# AD in Swift

Swift became an important experiment in language-integrated automatic differentiation because it attempted to make differentiation a core compiler feature rather than a library layered on top of the language. The central idea was that differentiable functions should participate in the type system, compilation pipeline, optimization passes, and language semantics directly.

This differs from many runtime-based systems. Instead of tracing operations dynamically or overloading tensor types externally, the compiler itself understands derivative transformations.

### Differentiation as a Language Feature

Swift introduced differentiable function types: a function could be declared differentiable with an attribute:

```swift
@differentiable
func f(_ x: Float) -> Float {
    x * sin(x)
}
```

The annotation tells the compiler:

1. The function participates in AD.
2. Its operations must support derivatives.
3. The compiler may synthesize derivative code automatically.

Differentiation becomes part of semantic analysis rather than a runtime convention.

### Differentiable Function Types

Swift modeled differentiable functions explicitly in the type system.

Conceptually:

$$
f : X \rightarrow Y
$$

becomes a differentiable mapping equipped with derivative structure.

A derivative operator extracts gradients from such functions:

```swift
gradient(at: x) { x in
    f(x)
}
```

returns the gradient of the closure.

The compiler understands this transformation statically.

This allows differentiation to interact cleanly with:

| Language feature | Benefit |
|---|---|
| Generics | Generic differentiable code |
| Protocols | Abstract differentiable interfaces |
| Type checking | Static derivative validation |
| Optimization | Compiler-level derivative optimization |
| Ownership analysis | Efficient memory behavior |

### Pullbacks and Reverse Mode

Swift’s AD design centered on pullbacks.

Given:

$$
f : X \rightarrow Y
$$

reverse mode constructs:

$$
f^* : T_Y \rightarrow T_X
$$

where the pullback maps output cotangents back into input cotangents.

The compiler synthesizes a pullback automatically.

Conceptually:

```swift
let (y, pullback) = valueWithPullback(at: x, in: f)
let dx = pullback(dy)
```

This separates:

| Component | Role |
|---|---|
| Primal computation | Forward evaluation |
| Pullback | Reverse derivative propagation |

This representation is mathematically clean and composes naturally.
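This split can be modeled outside the compiler. Below is a minimal Python sketch of what a synthesized primal/pullback pair looks like for `f(x) = x * sin(x)`; the name `value_with_pullback` and its hand-written closure are illustrative, not part of any Swift API.

```python
import math

def value_with_pullback(x):
    # Primal computation: y = x * sin(x).
    s = math.sin(x)
    y = x * s

    # Pullback: maps an output cotangent dy to an input cotangent dx.
    # The captured values x and s play the role of saved intermediates.
    def pullback(dy):
        # d/dx (x * sin(x)) = sin(x) + x * cos(x)
        return dy * (s + x * math.cos(x))

    return y, pullback

y, pullback = value_with_pullback(2.0)
dx = pullback(1.0)  # gradient of f at x = 2
```

Running the primal once and then calling the pullback with a cotangent of 1 recovers the ordinary gradient, which is exactly how `gradient(at:)` is built on top of `valueWithPullback`.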

### Compiler-Level Transformation

Swift AD operated inside the compiler pipeline.

A simplified flow:

```text
Swift source
→ AST
→ typed intermediate representation
→ AD transformation
→ optimized derivative IR
→ LLVM lowering
→ machine code
```

The AD pass rewrites functions into derivative-producing versions.

For example:

```swift
func f(_ x: Float) -> Float {
    x * x
}
```

may conceptually become:

```swift
func f_with_pullback(_ x: Float)
    -> (Float, (Float) -> Float)
{
    let y = x * x

    func pullback(_ dy: Float) -> Float {
        dy * 2 * x
    }

    return (y, pullback)
}
```

The derivative becomes ordinary compiler IR.
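One consequence of this rewriting is that the chain rule becomes ordinary function composition: the pullback of `g ∘ f` is the pullback of `f` applied after the pullback of `g`. A hedged Python sketch of that composition (all function names hypothetical):

```python
def square_with_pullback(x):
    # f(x) = x^2, pullback dy -> dy * 2x
    return x * x, lambda dy: dy * 2 * x

def triple_with_pullback(x):
    # g(y) = 3y, pullback dy -> dy * 3
    return 3 * x, lambda dy: dy * 3

def composed_with_pullback(x):
    # Forward pass runs the primals and keeps each pullback.
    y1, pb_f = square_with_pullback(x)
    y2, pb_g = triple_with_pullback(y1)
    # Reverse pass applies the pullbacks in the opposite order.
    return y2, lambda dy: pb_f(pb_g(dy))

y, pullback = composed_with_pullback(4.0)  # y = 3 * 4^2 = 48.0
dx = pullback(1.0)                         # d/dx 3x^2 = 6x = 24.0
```

The compiler performs the same bookkeeping mechanically when it rewrites a call chain into derivative-producing IR.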

### Static Differentiation

One major difference from dynamic tracing systems is that Swift differentiation is static.

The compiler knows:

- Function signatures
- Types
- Control flow structure
- Mutation behavior
- Ownership rules

before generating derivative code.

Advantages include:

| Advantage | Explanation |
|---|---|
| Early validation | Invalid differentiation rejected at compile time |
| Optimization visibility | Compiler optimizes derivative code directly |
| No runtime tracing overhead | Graph construction unnecessary |
| Better memory planning | Lifetimes known statically |
| Strong composability | Derivatives behave like ordinary functions |

This resembles traditional compiler transformations more than runtime graph execution.

### Differentiable Protocols

Swift protocols allow abstraction over differentiable structures.

A type can conform to a differentiability interface:

```swift
// Simplified sketch: the real protocol also constrains TangentVector
// (to AdditiveArithmetic and Differentiable) and requires a move
// operation that applies a tangent offset.
protocol Differentiable {
    associatedtype TangentVector
}
```

A tensor, vector, or model parameter type can define its tangent representation.

This supports generalized differentiation across many structures.

For example:

| Type | Tangent representation |
|---|---|
| Scalar | Scalar |
| Vector | Vector |
| Matrix | Matrix |
| Struct | Struct of tangents |
| Neural network layer | Parameter tangent structure |

This gives differentiation a structural interpretation.

### Tangent Vectors

Swift modeled tangent spaces explicitly.

A differentiable type defines a tangent vector type:

```swift
struct Point: Differentiable {
    var x: Float
    var y: Float

    // A flat vector-like type can serve as its own tangent;
    // the compiler can also synthesize a TangentVector automatically.
    typealias TangentVector = Point
}
```
```

For more complex structures:

```swift
struct Model: Differentiable {
    var weights: Tensor<Float>
    var bias: Tensor<Float>
}
```

the tangent vector has the same structural shape.

This matches the mathematical view:

$$
T(X \times Y) = TX \times TY
$$

The tangent of a product type is the product of tangents.
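The same identity can be written down structurally: the tangent of a composite value is a composite of tangents, combined field-wise. A small Python sketch (the `PointTangent` type is illustrative, standing in for a compiler-synthesized `TangentVector`):

```python
from dataclasses import dataclass

@dataclass
class PointTangent:
    # Tangent of a product type is the product of tangents:
    # T(X × Y) = TX × TY, so addition is field-wise.
    dx: float
    dy: float

    def __add__(self, other):
        return PointTangent(self.dx + other.dx, self.dy + other.dy)

a = PointTangent(1.0, 2.0)
b = PointTangent(0.5, -1.0)
c = a + b  # PointTangent(dx=1.5, dy=1.0)
```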

### Mutation and Inout Parameters

Swift supports controlled mutation through `inout` parameters:

```swift
func update(_ x: inout Float) {
    x *= 2
}
```

Mutation complicates reverse mode because overwritten values may be needed later during adjoint propagation.

Compiler-integrated AD can analyze mutation statically.

Possible strategies include:

| Strategy | Purpose |
|---|---|
| Save old values | Needed for reverse reconstruction |
| Activity analysis | Ignore inactive mutations |
| Functionalization | Rewrite mutation into immutable updates |
| Ownership tracking | Avoid unnecessary copies |

Because the compiler already performs ownership analysis, AD can reuse this information.
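The "save old values" strategy is the easiest to demonstrate. The Python sketch below simulates differentiating through a destructive update like `x *= x`: the overwritten value is snapshotted before the mutation so the reverse pass can still use it (names are illustrative):

```python
def square_in_place_with_pullback(x):
    # The update below overwrites x, so the reverse pass saves
    # the pre-mutation value first ("save old values").
    saved = x
    x = x * x  # simulates the destructive update `x *= x`

    def pullback(dy):
        # d/dx (x * x) = 2x, evaluated at the saved old value.
        return dy * 2 * saved

    return x, pullback

y, pullback = square_in_place_with_pullback(3.0)  # y = 9.0
dx = pullback(1.0)                                # 2 * 3 = 6.0
```

Functionalization reaches the same result by rewriting the mutation into the immutable binding shown here, which is why the two strategies are closely related.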

### Control Flow

Swift AD supports ordinary control flow:

```swift
func f(_ x: Float) -> Float {
    if x > 0 {
        return x * x
    } else {
        return -x
    }
}
```

The derivative follows the executed branch.

Loops are similarly differentiated by transforming the loop body and propagating adjoints through iterations.

This allows AD over general programs rather than only tensor graphs.
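This branch-following behavior can be modeled by having each executed path return its own pullback; only the branch that actually ran contributes derivative code. A minimal Python sketch of the function above (hand-written, not compiler output):

```python
def f_with_pullback(x):
    if x > 0:
        # Branch taken for positive x: derivative of x^2 is 2x.
        return x * x, lambda dy: dy * 2 * x
    else:
        # Branch taken otherwise: derivative of -x is -1.
        return -x, lambda dy: -dy

y_pos, pb_pos = f_with_pullback(3.0)   # y = 9.0,  pb_pos(1.0) = 6.0
y_neg, pb_neg = f_with_pullback(-3.0)  # y = 3.0,  pb_neg(1.0) = -1.0
```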

### Generic Differentiable Programming

Swift aimed to support differentiable programming broadly.

Differentiation should apply to:

- Neural networks
- Numerical simulations
- Optimization routines
- Geometry code
- Physics systems
- Data structures

This required AD to integrate with ordinary language semantics.

A differentiable function was not a special runtime object. It was an ordinary typed function with additional compiler-known structure.

### Custom Derivative Definitions

Some functions require manually defined derivatives.

Swift exposed APIs for custom derivative rules.

Conceptually:

```swift
@derivative(of: f)
func fDerivative(_ x: Float)
    -> (value: Float, pullback: (Float) -> Float)
{
    ...
}
```

This lets library authors provide:

- Numerically stable derivatives
- Efficient adjoints
- Implicit derivatives
- Specialized tensor rules

Custom rules are essential for practical systems because naive differentiation is often inefficient or unstable.
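A standard example is softplus, `log(1 + exp(x))`: the textbook derivative `exp(x) / (1 + exp(x))` overflows for large inputs, so a custom rule substitutes a numerically stable sigmoid. A Python sketch of such a hand-written derivative rule (names illustrative):

```python
import math

def softplus_with_pullback(x):
    # Stable primal: log(1 + exp(x)) without overflowing for large x.
    y = max(x, 0.0) + math.log1p(math.exp(-abs(x)))

    def pullback(dy):
        # Custom derivative rule: softplus'(x) = sigmoid(x), written so
        # exp is never taken of a large positive argument.
        if x >= 0:
            sig = 1.0 / (1.0 + math.exp(-x))
        else:
            e = math.exp(x)
            sig = e / (1.0 + e)
        return dy * sig

    return y, pullback

y, pullback = softplus_with_pullback(1000.0)  # naive exp(1000) would overflow
```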

### Higher-Order Differentiation

Because derivatives are represented as ordinary functions, higher-order differentiation becomes recursive transformation.

Example:

```swift
gradient(at: x) { x in
    gradient(at: x, in: f)
}
```

This computes second derivatives.

The compiler must avoid:

| Problem | Meaning |
|---|---|
| Perturbation confusion | Mixing derivative levels |
| Excessive code expansion | Nested transformations explode |
| Pullback duplication | Repeated reverse structures |
| Memory blowup | Saved intermediates accumulate |

Compiler-level visibility helps manage these issues.
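The recursive-transformation idea is easiest to demonstrate with forward-mode dual numbers, since reverse-over-reverse is intricate: differentiating the derivative function simply reruns the transformation. A hedged Python sketch, not Swift's mechanism; note that naive nesting like this is exactly where perturbation confusion can arise in general, which is why real systems tag each derivative level.

```python
class Dual:
    """Forward-mode dual number val + eps, tracking one derivative."""
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps

    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.val + other.val, self.eps + other.eps)
    __radd__ = __add__

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        other = self._lift(other)
        return Dual(self.val * other.val,
                    self.eps * other.val + self.val * other.eps)
    __rmul__ = __mul__

def derivative(f, x):
    # Seed the input with derivative 1 and read off the output's eps.
    # x may itself be a Dual, which is what makes nesting possible.
    return f(Dual(x, 1.0)).eps

f = lambda x: x * x * x                               # f(x)  = x^3
first = derivative(f, 3.0)                            # f'(3)  = 3x^2 = 27
second = derivative(lambda t: derivative(f, t), 3.0)  # f''(3) = 6x   = 18
```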

### SIL and Intermediate Representations

Swift lowers source code into SIL (Swift Intermediate Language).

SIL is:

- Typed
- Ownership-aware
- Explicit about control flow
- Suitable for optimization

AD transformations operate at the SIL level.

This is important because SIL preserves language semantics while exposing compiler structure.

Compared with source-level rewriting, SIL-based AD avoids many ambiguities from parsing and overload resolution.

Compared with LLVM-level AD, SIL retains higher-level semantic information.

### Ownership and Memory

Swift’s ownership system was important for efficient reverse mode.

Reverse mode often extends value lifetimes because intermediates must survive until the backward pass.

Ownership analysis helps determine:

| Question | Importance |
|---|---|
| When can values be freed? | Memory optimization |
| Which values need saving? | Reverse correctness |
| Can buffers be reused? | Performance |
| Is copying necessary? | Avoid allocation overhead |

Compiler-integrated AD can coordinate these analyses directly with ordinary optimization passes.

### Tensor Systems and Swift for TensorFlow

Swift AD became widely known through the Swift for TensorFlow project.

The goal was not only tensor computation. The project attempted to build a language-native differentiable programming model.

Key ideas included:

| Idea | Meaning |
|---|---|
| Differentiable types | AD integrated with type system |
| Compiler transformations | Static derivative generation |
| Python interoperability | Access ML ecosystem |
| Staged compilation | Accelerator optimization |
| First-class gradients | Derivatives as ordinary functions |

Although Swift for TensorFlow was discontinued as a product direction, many of its ideas strongly influenced later differentiable programming research.

### Advantages of Swift for AD

| Feature | Benefit |
|---|---|
| Strong typing | Static derivative correctness |
| Compiler integration | Efficient derivative generation |
| Ownership model | Better memory optimization |
| Protocol system | Generic differentiable abstractions |
| High-level syntax | Productive numerical programming |
| LLVM backend | Access to optimized compilation |

Swift demonstrated that AD can be integrated deeply into a modern compiled language.

### Limitations

Several challenges emerged.

| Challenge | Explanation |
|---|---|
| Compiler complexity | AD touched many compiler stages |
| Language coverage | Some constructs difficult to differentiate |
| Mutation semantics | Reverse-mode state management hard |
| Ecosystem size | Smaller scientific ecosystem than Python |
| Compile times | Differentiated code increases compilation work |
| GPU integration | Accelerator support required large infrastructure |

Deep compiler integration gives power but increases implementation complexity substantially.

### Influence on Differentiable Programming

Swift helped formalize several important ideas:

| Idea | Long-term importance |
|---|---|
| Differentiable functions as types | AD integrated into language semantics |
| Pullbacks as compiler objects | Structured reverse mode |
| Tangent-vector protocols | Structural differentiation |
| Compiler-generated derivatives | Static transformation model |
| Ownership-aware AD | Memory-efficient reverse mode |

These ideas continue to influence compiler-level AD systems and differentiable programming language research.

### Broader Significance

Swift demonstrated that automatic differentiation does not need to be a runtime library layered over opaque tensor objects. It can instead become part of the language itself.

In this model:

- Gradients are typed program transformations.
- Pullbacks are compiler-generated functions.
- Differentiable structures participate in semantic analysis.
- Optimization and differentiation occur in the same compilation pipeline.

This reframed AD as a core compiler capability rather than only a machine learning runtime feature.

