# Operator Libraries

An automatic differentiation engine becomes useful only after it supports a sufficiently rich set of primitive operations. The collection of these primitives is the operator library.

The operator library defines:
- which computations are differentiable
- how derivatives propagate
- what tensor semantics exist
- how shapes and dtypes behave
- which kernels execute on which devices

In a minimal engine, the operator library may contain fewer than ten scalar operations. In a production tensor system, it may contain thousands of operators.

## Primitive Operations

A primitive operation is an operation whose derivative rule is implemented directly.

Examples:
- addition
- multiplication
- sine
- exponential
- matrix multiplication
- convolution
- reduction
- reshape

Everything else can be expressed as compositions of primitives.

For example:

$$
f(x) = x^3
$$

does not require a dedicated cube operator:

```text
x2 = mul(x, x)
x3 = mul(x2, x)
```

The derivative emerges automatically from composition.

The operator library therefore defines the algebraic basis of the AD system.

## Minimal Scalar Operator Set

A small scalar engine can begin with:

| Operator | Meaning |
|---|---|
| add | Addition |
| sub | Subtraction |
| mul | Multiplication |
| div | Division |
| sin | Sine |
| cos | Cosine |
| exp | Exponential |
| log | Natural logarithm |

A tape representation:

```go
type OpKind uint8

const (
    OpAdd OpKind = iota
    OpSub
    OpMul
    OpDiv
    OpSin
    OpCos
    OpExp
    OpLog
)
```

Each operation needs:
1. forward evaluation
2. backward rule

Addition:

```go
func (t *Tape) Add(a, b Slot) Slot {
    out := t.alloc(
        t.Values[a] + t.Values[b],
        t.RequiresGrad[a] || t.RequiresGrad[b],
    )

    if t.RequiresGrad[out] {
        t.Instrs = append(t.Instrs, Instr{
            Op:  OpAdd,
            Out: out,
            A:   a,
            B:   b,
        })
    }

    return out
}
```

Backward:

```go
case OpAdd:
    t.Grads[ins.A] += t.Grads[ins.Out]
    t.Grads[ins.B] += t.Grads[ins.Out]
```

This pattern repeats for every primitive.
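
Multiplication, for instance, follows the same shape as `Add` (a sketch against the same hypothetical `Tape` API):

```go
// Mul mirrors Add: evaluate eagerly, record the instruction only
// when a gradient will be needed. (Sketch; assumes the same Tape,
// Slot, and Instr types as the Add example.)
func (t *Tape) Mul(a, b Slot) Slot {
    out := t.alloc(
        t.Values[a]*t.Values[b],
        t.RequiresGrad[a] || t.RequiresGrad[b],
    )

    if t.RequiresGrad[out] {
        t.Instrs = append(t.Instrs, Instr{
            Op:  OpMul,
            Out: out,
            A:   a,
            B:   b,
        })
    }

    return out
}
```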

## Local Derivative Rules

Every operator implements a local Jacobian action.

For scalar operations:

$$
y = g(x_1, \dots, x_n)
$$

the backward rule computes:

$$
\bar{x_i}
\mathrel{+}=
\bar{y}
\frac{\partial g}{\partial x_i}
$$

The operator library therefore contains executable forms of derivative identities.

Multiplication:

$$
y = ab
$$

gives:

$$
\bar{a} \mathrel{+}= b\bar{y}
$$

$$
\bar{b} \mathrel{+}= a\bar{y}
$$

Division:

$$
y = \frac{a}{b}
$$

gives:

$$
\bar{a} \mathrel{+}= \frac{1}{b}\bar{y}
$$

$$
\bar{b}
\mathrel{+}=
-\frac{a}{b^2}\bar{y}
$$

The engine itself does not understand calculus globally. Each primitive contributes one local rule.
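
In a tape-based backward pass these identities become additional switch cases, mirroring the `OpAdd` case above (a sketch):

```go
// Backward cases for multiplication and division; each accumulates
// the local derivative times the incoming adjoint.
case OpMul:
    t.Grads[ins.A] += t.Values[ins.B] * t.Grads[ins.Out]
    t.Grads[ins.B] += t.Values[ins.A] * t.Grads[ins.Out]

case OpDiv:
    t.Grads[ins.A] += t.Grads[ins.Out] / t.Values[ins.B]
    t.Grads[ins.B] -= t.Values[ins.A] / (t.Values[ins.B] * t.Values[ins.B]) * t.Grads[ins.Out]
```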

## Forward and Backward Registration

A useful operator abstraction separates operation metadata from execution.

```go
type Operator struct {
    Name string

    Forward  func(*Tape, []Slot) Slot
    Backward func(*Tape, Instr)
}
```

Registration:

```go
var Ops = map[OpKind]Operator{}
```

Example:

```go
Ops[OpAdd] = Operator{
    Name: "add",

    Backward: func(t *Tape, ins Instr) {
        t.Grads[ins.A] += t.Grads[ins.Out]
        t.Grads[ins.B] += t.Grads[ins.Out]
    },
}
```
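
With the table in place, the backward pass becomes a single dispatch loop rather than a growing switch statement (a sketch, assuming the `Tape` fields used earlier):

```go
// Backward seeds the output adjoint, then walks the tape in reverse
// and dispatches each instruction to its registered backward rule.
func (t *Tape) Backward(output Slot) {
    t.Grads[output] = 1

    for i := len(t.Instrs) - 1; i >= 0; i-- {
        ins := t.Instrs[i]
        Ops[ins.Op].Backward(t, ins)
    }
}
```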

This structure:
- centralizes derivative rules
- simplifies debugging
- supports reflection and serialization
- decouples graph representation from derivative logic

A closure-per-node design is simpler for teaching. An operator table scales better.

## Tensor Operators

Scalar operators are easy because shapes are trivial. Tensor operators require shape semantics.

Tensor addition:

```text
[a, b, c] + [x, y, z]
```

propagates gradients elementwise.

Matrix multiplication:

$$
C = AB
$$

has backward rules:

$$
\bar{A} \mathrel{+}= \bar{C}B^T
$$

$$
\bar{B} \mathrel{+}= A^T\bar{C}
$$
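
A self-contained sketch of these two rules on plain row-major slices; the `matmul` and `transpose` helpers are illustrative, not part of any engine API:

```go
// matmul returns A·B for row-major [][]float64 matrices.
func matmul(a, b [][]float64) [][]float64 {
    n, k, m := len(a), len(b), len(b[0])
    out := make([][]float64, n)
    for i := range out {
        out[i] = make([]float64, m)
        for p := 0; p < k; p++ {
            for j := 0; j < m; j++ {
                out[i][j] += a[i][p] * b[p][j]
            }
        }
    }
    return out
}

// transpose returns the transpose of a.
func transpose(a [][]float64) [][]float64 {
    out := make([][]float64, len(a[0]))
    for j := range out {
        out[j] = make([]float64, len(a))
        for i := range a {
            out[j][i] = a[i][j]
        }
    }
    return out
}

// matmulBackward applies the two local rules:
// gradA = gradC·Bᵀ and gradB = Aᵀ·gradC.
func matmulBackward(a, b, gradC [][]float64) (gradA, gradB [][]float64) {
    return matmul(gradC, transpose(b)), matmul(transpose(a), gradC)
}
```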

A tensor operator therefore needs:
- shape information
- layout information
- dtype information
- broadcasting rules
- reduction semantics

A tensor node:

```go
type TensorNode struct {
    Value Tensor
    Grad  Tensor

    Shape []int
    DType DType

    Prev []*TensorNode
    Op   OpKind
}
```

The operator library becomes partly a tensor algebra system.

## Broadcasting Semantics

Broadcasting complicates backward propagation.

Example:

```text
x shape: [batch, features]
b shape: [features]

y = x + b
```

Forward broadcasting duplicates `b` conceptually across the batch dimension.

Backward propagation must reverse that duplication:

```text
grad_b = reduce_sum(grad_y, axis=batch)
```

The backward rule therefore depends on:
- original input shapes
- broadcasted output shape

The operator library must preserve this metadata.

A minimal tensor engine should store original operand shapes inside the instruction.

```go
type Instr struct {
    Op OpKind

    Out Slot
    A   Slot
    B   Slot

    ShapeA []int
    ShapeB []int
}
```
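
A sketch of the bias case above on plain slices, with the batch dimension outermost (nothing here is engine API):

```go
// broadcastAddBackward reverses y = x + b where x is [batch][features]
// and b is [features]: gradX passes through unchanged, while gradB
// sums the incoming gradient over the batch axis.
func broadcastAddBackward(gradY [][]float64) (gradX [][]float64, gradB []float64) {
    gradX = gradY
    gradB = make([]float64, len(gradY[0]))
    for _, row := range gradY {
        for j, g := range row {
            gradB[j] += g
        }
    }
    return gradX, gradB
}
```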

## Shape Transformations

Shape-changing operators often have mathematically trivial derivatives but nontrivial systems behavior.

Reshape:

```text
y = reshape(x)
```

Backward:

```text
grad_x = reshape(grad_y, original_shape)
```

Transpose:

```text
y = transpose(x)
```

Backward:

```text
grad_x = transpose(grad_y)
```

Slicing:

```text
y = x[2:5]
```

Backward scatters the gradient into the original tensor shape.
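
A minimal sketch of that scatter for a one-dimensional slice (illustrative helper, not engine API):

```go
// sliceBackward scatters the gradient of y = x[lo:hi] back into a
// zero gradient of x's original length.
func sliceBackward(gradY []float64, lo int, xLen int) []float64 {
    gradX := make([]float64, xLen) // zero everywhere outside the slice
    copy(gradX[lo:], gradY)
    return gradX
}
```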

These operators demonstrate an important principle:

```text
The backward rule must reverse the data movement semantics of the forward rule.
```

## Reduction Operators

Reductions shrink dimensions.

Example:

```text
y = sum(x)
```

Backward:

```text
grad_x = broadcast(grad_y)
```

Mean:

```text
y = mean(x)
```

Backward:

```text
grad_x = broadcast(grad_y / n)
```

Maximum is harder.

```text
y = max(x)
```

Backward must identify which elements achieved the maximum.

The forward pass therefore often stores an argmax mask or index set.
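
A sketch for a one-dimensional max reduction that saves the argmax during the forward pass so the backward pass can route the gradient to it (ties go to the first occurrence; illustrative only):

```go
// maxForward returns the maximum and the index where it occurred.
func maxForward(x []float64) (y float64, argmax int) {
    y, argmax = x[0], 0
    for i, v := range x[1:] {
        if v > y {
            y, argmax = v, i+1
        }
    }
    return y, argmax
}

// maxBackward sends the whole incoming gradient to the saved argmax.
func maxBackward(gradY float64, argmax, xLen int) []float64 {
    gradX := make([]float64, xLen)
    gradX[argmax] = gradY
    return gradX
}
```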

Operator design is tightly connected to saved intermediate state.

## Numerical Stability

The operator library must consider stable derivatives.

Naive softmax:

$$
\text{softmax}(x_i) =
\frac{e^{x_i}}{\sum_j e^{x_j}}
$$

is numerically unstable for large values.

Stable implementation:

$$
x'_i = x_i - \max_j x_j
$$

then:

$$
\text{softmax}(x_i) =
\frac{e^{x'_i}}{\sum_j e^{x'_j}}
$$

The backward rule should use the stable forward result rather than recomputing unstable exponentials.
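
A sketch of the stable forward pass together with a backward rule that reuses the saved outputs. Expanding the softmax Jacobian gives

$$
\bar{x}_i \mathrel{+}= y_i \left( \bar{y}_i - \sum_j \bar{y}_j y_j \right)
$$

which needs only the saved outputs, never the raw exponentials (the helpers below are illustrative, not engine API):

```go
// softmaxForward subtracts the maximum before exponentiating,
// which keeps the exponentials in a safe range.
func softmaxForward(x []float64) []float64 {
    m := x[0]
    for _, v := range x[1:] {
        if v > m {
            m = v
        }
    }
    y := make([]float64, len(x))
    var sum float64
    for i, v := range x {
        y[i] = math.Exp(v - m)
        sum += y[i]
    }
    for i := range y {
        y[i] /= sum
    }
    return y
}

// softmaxBackward reuses the saved outputs y:
// gradX_i = y_i * (gradY_i - Σ_j gradY_j * y_j).
func softmaxBackward(y, gradY []float64) []float64 {
    var dot float64
    for i := range y {
        dot += gradY[i] * y[i]
    }
    gradX := make([]float64, len(y))
    for i := range y {
        gradX[i] = y[i] * (gradY[i] - dot)
    }
    return gradX
}
```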

Many practical AD bugs are numerical rather than symbolic.

## Fused Operators

A fused operator combines several operations into one primitive.

Example:

```text
y = relu(matmul(x, w) + b)
```

Possible representations:

| Strategy | Representation |
|---|---|
| Unfused | matmul → add → relu |
| Fused | single kernel |

Fusion can reduce:
- memory traffic
- temporary allocations
- kernel launch overhead

But fusion complicates:
- debugging
- intermediate inspection
- operator reuse

A minimal engine should start unfused. Fusion belongs in a compiler or optimization layer.

## Operator Purity

Differentiable operators should ideally behave as pure functions:

```text
same inputs -> same outputs
```

Purity simplifies:
- caching
- graph replay
- checkpointing
- common subexpression elimination
- compiler optimization

Stateful operators complicate reverse mode.

Random number generation is a common example.

```text
y = dropout(x)
```

Backward requires replaying the same random mask used during forward.

The operator must therefore:
- save the mask
- or save the RNG state
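
A sketch of the mask-saving approach (the rate handling and helpers are illustrative, not any particular framework's API):

```go
// dropoutForward zeroes each element with probability p, scales the
// survivors by 1/(1-p), and returns the mask for the backward pass.
func dropoutForward(x []float64, p float64, rng *rand.Rand) (y []float64, mask []bool) {
    y = make([]float64, len(x))
    mask = make([]bool, len(x))
    scale := 1 / (1 - p)
    for i, v := range x {
        if rng.Float64() >= p {
            mask[i] = true
            y[i] = v * scale
        }
    }
    return y, mask
}

// dropoutBackward replays the saved mask on the incoming gradient.
func dropoutBackward(gradY []float64, p float64, mask []bool) []float64 {
    gradX := make([]float64, len(gradY))
    scale := 1 / (1 - p)
    for i, g := range gradY {
        if mask[i] {
            gradX[i] = g * scale
        }
    }
    return gradX
}
```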

The operator library defines the system's state semantics as much as its derivative semantics.

## Custom Operators

Users often need operators outside the built-in library.

A useful engine exposes registration:

```go
func Register(
    name string,
    forward func(...Tensor) Tensor,
    backward func(...Tensor) []Tensor,
)
```

The user supplies both:
- primal computation
- gradient computation

This is important for:
- optimized kernels
- domain-specific operations
- interoperability
- experimental layers

A custom operator is essentially a manually supplied local derivative rule.
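
As a usage sketch against the `Register` signature above, a hypothetical square operator; the `Tensor` methods `Mul` and `Scale` are assumptions, as is the convention that the backward callback receives the primal inputs followed by the upstream gradient:

```go
// Register a "square" operator: forward is x*x, backward is 2*x*grad.
// (Hypothetical API; adjust to the engine's actual Tensor methods.)
Register(
    "square",
    func(in ...Tensor) Tensor {
        x := in[0]
        return x.Mul(x)
    },
    func(in ...Tensor) []Tensor {
        x, grad := in[0], in[1]
        return []Tensor{grad.Mul(x).Scale(2)}
    },
)
```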

## Operator Validation

Every operator should be testable independently.

A common strategy:
1. choose random inputs
2. compute AD gradient
3. compare against finite differences

Finite difference approximation:

$$
\frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}
$$

Example test:

```go
// CheckGrad compares the AD gradient g against a central finite
// difference of f at x and reports any mismatch.
func CheckGrad(
    f func(float64) float64,
    g func(float64) float64,
    x float64,
) error {
    eps := 1e-6
    tol := 1e-4

    fd := (f(x+eps) - f(x-eps)) / (2 * eps)
    ad := g(x)

    if math.Abs(fd-ad) > tol*(1+math.Abs(fd)) {
        return fmt.Errorf("gradient mismatch: fd=%g ad=%g", fd, ad)
    }
    return nil
}
```
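
A usage sketch, assuming the error-returning `CheckGrad` above: the gradient rule for sine should match its analytic derivative, cosine.

```go
// The derivative of sin is cos, so the finite-difference and
// analytic gradients should agree at any test point.
if err := CheckGrad(math.Sin, math.Cos, 0.7); err != nil {
    panic(err)
}
```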

This catches:
- sign errors
- missing reductions
- broadcasting mistakes
- shape mismatches
- unstable formulas

Operator testing is one of the highest leverage activities in AD implementation.

## Minimal Operator Interface

A compact operator interface for a small tensor engine:

```go
type Operator interface {
    Forward(*Tape, []Slot) Slot
    Backward(*Tape, Instr)
}
```

Or more explicit:

```go
type Operator struct {
    Name string

    Arity int

    Forward func(*Tape, []Slot) Slot
    Backward func(*Tape, Instr)
}
```

This separates:
- graph structure
- storage
- execution
- differentiation logic

The operator library becomes the semantic core of the AD engine.

## Minimal Correctness Invariant

Every operator must satisfy:

```text
Its backward rule computes the action of the transposed Jacobian of its forward rule.
```

If:

$$
y = g(x)
$$

then the backward rule computes:

$$
\bar{x}
\mathrel{+}=
J_g(x)^T \bar{y}
$$

This is the central invariant of reverse mode.

Everything else in the operator library:
- tensor shapes
- broadcasting
- memory reuse
- device kernels
- fusion
- dtype promotion

exists to make this invariant efficient and practical at scale.

