
Dual Numbers

Dual numbers give forward mode automatic differentiation a compact algebraic form. Instead of storing a value and a tangent as two unrelated fields, we package them into one object:

x + \epsilon \dot{x}

where \epsilon is a formal symbol satisfying

\epsilon^2 = 0.

The number x is the primal value. The number \dot{x} is the tangent. The symbol \epsilon marks the tangent part.

A dual number is therefore a first-order approximation stored as an algebraic value:

x + \epsilon \dot{x}.

It behaves like an ordinary number, except that all terms involving \epsilon^2 vanish.

Why \epsilon^2 = 0

The rule \epsilon^2 = 0 means dual numbers keep only first-order information. This mirrors the first-order Taylor expansion:

f(x + h) = f(x) + f'(x)h + O(h^2).

Dual numbers replace the small perturbation h with \epsilon \dot{x}. Since \epsilon^2 = 0, every second-order and higher-order term disappears exactly.

So

f(x + \epsilon \dot{x}) = f(x) + \epsilon f'(x)\dot{x}.

This is the central identity behind forward mode AD.

Arithmetic with dual numbers

Let

a = x + \epsilon \dot{x}, \qquad b = y + \epsilon \dot{y}.

Addition is componentwise:

a + b = (x + y) + \epsilon(\dot{x} + \dot{y}).

Multiplication follows ordinary algebra, then removes the \epsilon^2 term:

ab = (x + \epsilon \dot{x})(y + \epsilon \dot{y}) = xy + \epsilon x\dot{y} + \epsilon y\dot{x} + \epsilon^2\dot{x}\dot{y}.

Since \epsilon^2 = 0,

ab = xy + \epsilon(x\dot{y} + y\dot{x}).

The tangent part is exactly the product rule.

Division works similarly. For

z = \frac{x}{y},

the dual result is

\frac{x + \epsilon\dot{x}}{y + \epsilon\dot{y}} = \frac{x}{y} + \epsilon \frac{\dot{x}y - x\dot{y}}{y^2}.

The tangent part is exactly the quotient rule.
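This rule translates directly into a division primitive. Here is a minimal sketch in Go, using the same pair representation as the `Dual` struct defined later in the implementation section:

```go
package main

import "fmt"

// Dual holds a primal value and its tangent.
type Dual struct {
	Value   float64
	Tangent float64
}

// Div applies the quotient rule: (x/y)' = (x'y - xy') / y^2.
func Div(a, b Dual) Dual {
	return Dual{
		Value:   a.Value / b.Value,
		Tangent: (a.Tangent*b.Value - a.Value*b.Tangent) / (b.Value * b.Value),
	}
}

func main() {
	// d/dx (x / 2) at x = 3: value 1.5, derivative 0.5.
	fmt.Println(Div(Dual{3, 1}, Dual{2, 0})) // {1.5 0.5}
}
```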

Elementary functions

Dual numbers extend ordinary elementary functions by Taylor expansion.

For a smooth scalar function f,

f(x + \epsilon \dot{x}) = f(x) + \epsilon f'(x)\dot{x}.

For example:

\sin(x + \epsilon\dot{x}) = \sin x + \epsilon \cos x \, \dot{x}.

\exp(x + \epsilon\dot{x}) = \exp x + \epsilon \exp x \, \dot{x}.

\log(x + \epsilon\dot{x}) = \log x + \epsilon \frac{\dot{x}}{x}.

\sqrt{x + \epsilon\dot{x}} = \sqrt{x} + \epsilon \frac{\dot{x}}{2\sqrt{x}}.

Every primitive operation exposes both its value rule and its derivative rule.

Example

Let

f(x) = x^2 + 3x.

Evaluate it on the dual input

x + \epsilon.

This corresponds to primal input x and tangent seed \dot{x} = 1.

Now compute:

f(x + \epsilon) = (x + \epsilon)^2 + 3(x + \epsilon).

Expand:

(x + \epsilon)^2 = x^2 + 2\epsilon x + \epsilon^2.

Since \epsilon^2 = 0,

(x + \epsilon)^2 = x^2 + 2\epsilon x.

Then

f(x + \epsilon) = x^2 + 2\epsilon x + 3x + 3\epsilon.

Collect primal and tangent parts:

f(x + \epsilon) = (x^2 + 3x) + \epsilon(2x + 3).

The primal part is f(x). The tangent part is f'(x).

At x = 5,

f(5 + \epsilon) = 40 + 13\epsilon.

So the function value is 40, and the derivative is 13.
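The same computation can be carried out mechanically in code. The following Go sketch repeats the pair representation and the Add/Mul rules described below, then evaluates f on the seeded input:

```go
package main

import "fmt"

// Dual is a (value, tangent) pair with epsilon^2 = 0.
type Dual struct {
	Value   float64
	Tangent float64
}

// Add is componentwise; Mul applies the product rule.
func Add(a, b Dual) Dual {
	return Dual{a.Value + b.Value, a.Tangent + b.Tangent}
}

func Mul(a, b Dual) Dual {
	return Dual{a.Value * b.Value, a.Tangent*b.Value + a.Value*b.Tangent}
}

// f(x) = x^2 + 3x, written against dual arithmetic.
// Constants like 3 carry a zero tangent.
func f(x Dual) Dual {
	return Add(Mul(x, x), Mul(Dual{3, 0}, x))
}

func main() {
	// Seed x = 5 with tangent 1 to get f(5) and f'(5) in one pass.
	y := f(Dual{5, 1})
	fmt.Println(y.Value, y.Tangent) // 40 13
}
```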

Directional derivatives with dual numbers

For a function

f : \mathbb{R}^n \to \mathbb{R}^m,

we seed each input variable with a tangent component:

x_i \mapsto x_i + \epsilon \dot{x}_i.

The program then computes

f(x + \epsilon \dot{x}) = f(x) + \epsilon J_f(x)\dot{x}.

The coefficient of \epsilon is the Jacobian-vector product.

For example, let

f(x, y) = xy + \sin x.

Use the seeded inputs

x + \epsilon \dot{x}, \qquad y + \epsilon \dot{y}.

Then

f(x + \epsilon\dot{x}, y + \epsilon\dot{y}) = (x + \epsilon\dot{x})(y + \epsilon\dot{y}) + \sin(x + \epsilon\dot{x}).

The product term gives

xy + \epsilon(x\dot{y} + y\dot{x}).

The sine term gives

\sin x + \epsilon \cos x \, \dot{x}.

So

f(x + \epsilon\dot{x}, y + \epsilon\dot{y}) = xy + \sin x + \epsilon(x\dot{y} + y\dot{x} + \cos x \, \dot{x}).

The tangent is

\dot{f} = (y + \cos x)\dot{x} + x\dot{y}.

This equals

J_f(x, y) \begin{bmatrix} \dot{x} \\ \dot{y} \end{bmatrix}.

Implementation form

A dual number can be represented as a pair:

type Dual struct {
    Value   float64
    Tangent float64
}

Addition:

func Add(a, b Dual) Dual {
    return Dual{
        Value:   a.Value + b.Value,
        Tangent: a.Tangent + b.Tangent,
    }
}

Multiplication:

func Mul(a, b Dual) Dual {
    return Dual{
        Value:   a.Value * b.Value,
        Tangent: a.Tangent*b.Value + a.Value*b.Tangent,
    }
}

Sine:

func Sin(a Dual) Dual {
    return Dual{
        Value:   math.Sin(a.Value),
        Tangent: math.Cos(a.Value) * a.Tangent,
    }
}

Exponentiation:

func Exp(a Dual) Dual {
    v := math.Exp(a.Value)
    return Dual{
        Value:   v,
        Tangent: v * a.Tangent,
    }
}

This representation is enough to build a small forward mode AD system. A user writes ordinary numerical code, but the inputs are dual numbers instead of plain floating point numbers. The overloaded operations then propagate derivatives automatically.

Multiple tangent directions

A scalar dual number stores one tangent direction. To propagate several directions in one pass, replace the scalar tangent with a vector:

type DualVec struct {
    Value   float64
    Tangent []float64
}

Now the value is still scalar, but the tangent records several directional derivatives at once.

If the tangent vector has length k, one execution computes k Jacobian-vector products. This is often called vector forward mode.

For example, to compute the full gradient of a scalar function

f : \mathbb{R}^n \to \mathbb{R},

one can seed all n basis directions at once by giving each input a tangent vector:

x_1 \mapsto x_1 + \epsilon e_1, \quad x_2 \mapsto x_2 + \epsilon e_2, \quad \dots, \quad x_n \mapsto x_n + \epsilon e_n.

The output tangent vector then contains the gradient components.

This is practical when n is small or moderate. For very large n, reverse mode is usually preferred for scalar outputs.

Dual numbers versus finite differences

Dual numbers may look similar to finite differences because both involve perturbing the input. The difference is fundamental.

Finite differences evaluate

\frac{f(x + h) - f(x)}{h}

for a small floating point number h. The result depends on the choice of h. If h is too large, truncation error dominates. If h is too small, roundoff error dominates.

Dual numbers use a formal perturbation \epsilon with \epsilon^2 = 0. There is no small numerical step. The derivative is carried exactly through the arithmetic rules of the program, subject only to the normal floating point errors of the primal and tangent computations.

So dual numbers avoid the step-size problem of finite differences.

Dual numbers versus symbolic differentiation

Dual numbers also differ from symbolic differentiation. Symbolic differentiation constructs an expression for the derivative. This expression may become large and difficult to simplify.

Dual numbers execute the original program once with extended arithmetic. They compute derivative values, not derivative formulas. The derivative computation follows the same structure as the primal computation.

This is why dual numbers are well suited to program differentiation. They do not require the whole program to be converted into a symbolic expression.

Algebraic meaning

The dual numbers form the algebra

\mathbb{R}[\epsilon] / (\epsilon^2).

This means polynomials in \epsilon, but with the relation \epsilon^2 = 0. Every element reduces to the form

a + \epsilon b.

The primal value a is the constant coefficient. The tangent value b is the first-order coefficient.

Forward mode AD can be seen as evaluating a program over this algebra instead of over ordinary real numbers. If the original program computes over \mathbb{R}, the differentiated program computes over dual numbers.

This view explains why ordinary arithmetic rules automatically become derivative propagation rules. The chain rule is built into composition over the dual number algebra.

Practical limitations

Dual numbers work cleanly for smooth operations. Care is needed for operations that are discontinuous, non-smooth, or undefined at some inputs.

For example:

|x|

has no derivative at x = 0. A dual-number implementation must choose what to do at that point. It may return an error, return a conventional subgradient, or follow the derivative of the branch taken by the program.
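One possible convention is to follow the branch the primal value selects, picking the right-hand derivative at exactly zero. A hedged Go sketch (other implementations may return 0 at zero or signal an error instead):

```go
package main

import "fmt"

type Dual struct {
	Value   float64
	Tangent float64
}

// Abs follows the derivative of the branch the primal value selects:
// tangent is negated for x < 0, passed through for x >= 0.
// At exactly 0 this picks the right-hand derivative +1 by convention.
func Abs(a Dual) Dual {
	if a.Value < 0 {
		return Dual{-a.Value, -a.Tangent}
	}
	return Dual{a.Value, a.Tangent}
}

func main() {
	fmt.Println(Abs(Dual{-3, 1})) // {3 -1}
}
```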

Conditionals introduce similar issues. If a program contains

if x > 0 {
    y = x
} else {
    y = 0
}

then the derivative follows the executed branch. At the boundary x = 0, the mathematical derivative may be undefined even though the program still returns a value.

Thus dual numbers provide exact first-order propagation through the executed operations, but they do not remove the mathematical difficulties of non-smooth programs.

Summary

Dual numbers are the algebraic core of forward mode automatic differentiation. A value

x + \epsilon\dot{x}

stores both a primal value and a tangent value. The rule

\epsilon^2 = 0

removes all higher-order terms, leaving exactly the first-order derivative information.

Evaluating a program on dual numbers computes both the original output and the directional derivative in one execution. This makes dual numbers one of the simplest and most precise ways to implement forward mode AD.