Geometric Interpretation

Dual numbers provide an algebraic mechanism for differentiation, but they also have a precise geometric meaning. A dual number represents a point together with an infinitesimal displacement. Forward mode automatic differentiation can therefore be interpreted as the propagation of tangent vectors through a computation.

This geometric viewpoint explains why the chain rule appears naturally, why derivatives are linear maps, and why forward mode computes directional derivatives.

Points and Infinitesimal Motion

Consider a smooth function

f : \mathbb{R} \to \mathbb{R}.

An ordinary real number x identifies a point on the real line.

A dual number

x + v\varepsilon

adds an infinitesimal displacement v attached to the point x.

Geometrically:

  • x is the base point
  • v is an infinitesimal velocity or tangent direction
  • ε marks the infinitesimal scale

The dual number therefore describes a first-order motion through space.

Instead of representing only location, it represents location plus local movement.
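
In code, this is nothing more than a (value, tangent) pair. A minimal Python sketch (the Dual name and fields are illustrative, not from any particular library):

```python
from dataclasses import dataclass

@dataclass
class Dual:
    """A point on the real line together with an infinitesimal displacement."""
    value: float    # the base point x
    tangent: float  # the infinitesimal velocity v attached at x
```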

Tangent Vectors

In differential geometry, a tangent vector describes how a curve moves through a point.

Let

\gamma(t)

be a smooth curve with

\gamma(0) = x.

Its tangent vector at x is

\gamma'(0).

Now consider the infinitesimal expression

\gamma(0) + \gamma'(0)\varepsilon.

This has exactly the structure of a dual number.

Dual numbers therefore encode tangent vectors directly.

The coefficient of ε is not merely a derivative value. It is a geometric tangent direction attached to a point.
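
As a small illustration (reusing the Dual pair sketched above; the curve is arbitrary), a curve's position and velocity at t = 0 can be packaged into a single dual number, with the velocity estimated by finite differences:

```python
import math

def gamma(t: float) -> float:
    """A smooth curve with gamma(0) = 1 and gamma'(0) = 1."""
    return math.exp(t)

# Finite-difference estimate of the tangent gamma'(0).
h = 1e-6
velocity = (gamma(h) - gamma(-h)) / (2 * h)

# The dual number gamma(0) + gamma'(0)*eps as a (value, tangent) pair.
curve_at_zero = Dual(gamma(0.0), velocity)
print(curve_at_zero)  # Dual(value=1.0, tangent=1.0...) up to rounding
```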

Tangent Propagation Through Functions

Suppose

f : \mathbb{R} \to \mathbb{R}

and consider a tangent vector

x + v\varepsilon.

Evaluating f gives

f(x + v\varepsilon) = f(x) + f'(x)v\varepsilon.

Geometrically:

  • f(x) is the transformed point
  • f'(x)v is the transformed tangent vector

The derivative acts by transporting infinitesimal motion.

The map

v \mapsto f'(x)v

is the tangent map of f at x.

Forward mode AD computes this tangent transport automatically.
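
To make the transport executable, the Dual pair above can be extended with arithmetic in which every primitive updates the value by f and the tangent by f'(x)v. This is a minimal sketch, not a library API:

```python
import math
from dataclasses import dataclass

@dataclass
class Dual:
    value: float
    tangent: float

    def _lift(self, other):
        # Ordinary numbers are points carrying a zero tangent.
        return other if isinstance(other, Dual) else Dual(other, 0.0)

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        # (x + v*eps)(y + w*eps) = xy + (xw + yv)*eps, since eps^2 = 0.
        other = self._lift(other)
        return Dual(self.value * other.value,
                    self.value * other.tangent + self.tangent * other.value)

    __rmul__ = __mul__

def sin(d: Dual) -> Dual:
    # sin(x + v*eps) = sin x + (cos x)*v*eps: transport by the local derivative.
    return Dual(math.sin(d.value), math.cos(d.value) * d.tangent)
```

Every primitive follows the same pattern: compute the new point, then apply the local derivative to the incoming tangent.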

The Tangent Bundle

For a smooth manifold M, the tangent bundle TM contains:

  • all points x ∈ M
  • all tangent vectors attached at each point

For Euclidean space:

T\mathbb{R}^n = \mathbb{R}^n \times \mathbb{R}^n.

Each element has two components:

(x, v).

This matches exactly the structure used in forward mode AD:

(\text{value}, \text{tangent}).

Dual numbers are therefore a coordinate representation of tangent bundle computation.

Forward mode AD evaluates functions on tangent bundles.

Functions Lift to Tangent Maps

Given

f : \mathbb{R}^n \to \mathbb{R}^m,

there exists an induced tangent map

Tf : T\mathbb{R}^n \to T\mathbb{R}^m.

The tangent map acts as

Tf(x, v) = (f(x), Df_x(v)).

This is exactly the operational semantics of forward mode automatic differentiation.

Every variable carries:

  • a primal value
  • a tangent vector

Each operation updates both simultaneously.

AD therefore computes lifted functions on tangent spaces.
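
With the Dual sketch above in scope, the lifted map can be written directly; tangent_map is a hypothetical helper name:

```python
def tangent_map(f, x: float, v: float) -> tuple[float, float]:
    """Tf(x, v) = (f(x), Df_x(v)), computed by evaluating f on a dual number."""
    out = f(Dual(x, v))  # assumes the Dual class sketched earlier
    return out.value, out.tangent

# Lifting f(x) = x*x + 3*x at x = 2 with v = 1: f(2) = 10, f'(2) = 7.
primal, tangent = tangent_map(lambda x: x * x + 3 * x, 2.0, 1.0)
print(primal, tangent)  # 10.0 7.0
```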

Example: A Curve Through a Function

Consider

f(x) = x^2.

Take a curve

\gamma(t) = x + tv.

Applying f:

f(\gamma(t)) = (x + tv)^2.

Expanding:

= x^2 + 2xvt + v^2 t^2.

Near t = 0, only the linear term matters:

f(\gamma(t)) \approx x^2 + 2xvt.

Thus the tangent vector transforms as

v \mapsto 2xv.

Now compare with dual numbers:

f(x + v\varepsilon) = x^2 + 2xv\varepsilon.

The coefficient of ε gives the same tangent transformation.

Dual-number arithmetic therefore reproduces first-order geometry exactly.
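
Both computations can be checked numerically (a sketch reusing the Dual class above, with illustrative values):

```python
x, v = 3.0, 0.5

# Dual-number evaluation of f(x) = x^2: the tangent should be 2*x*v = 3.0.
out = Dual(x, v) * Dual(x, v)
print(out.tangent)  # 3.0

# Curve-based check: the slope of f(gamma(t)) = (x + t*v)^2 at t = 0 is 2*x*v.
h = 1e-6
slope = ((x + h * v) ** 2 - (x - h * v) ** 2) / (2 * h)
print(slope)        # ~3.0
```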

Directional Derivatives

A tangent vector specifies a direction.

For

f : \mathbb{R}^n \to \mathbb{R},

the directional derivative at x in direction v is

Df_x(v).

Forward mode computes this naturally.

Evaluate:

x + v\varepsilon.

Then:

f(x + v\varepsilon) = f(x) + Df_x(v)\varepsilon.

The infinitesimal coefficient is the directional derivative.

Thus forward mode computes:

Jv,

where J is the Jacobian matrix.

Geometrically, this measures how the function changes when moving infinitesimally in direction v.
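
Concretely, seeding each input with the corresponding component of v yields Df_x(v) in a single pass. A sketch with the Dual class and sin helper above (the function and direction are illustrative):

```python
import math

def f(x, y):
    # f(x, y) = x*y + sin(x), written with Dual-aware operations.
    return x * y + sin(x)

x, y = 1.0, 2.0
vx, vy = 1.0, -1.0  # the direction v

out = f(Dual(x, vx), Dual(y, vy))

# Analytic check: Df_x(v) = (y + cos x)*v_x + x*v_y.
print(out.tangent, (y + math.cos(x)) * vx + x * vy)
```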

Jacobians as Local Linear Maps

Near a point x, a smooth function behaves approximately linearly:

f(x+h) \approx f(x) + J_x h.

The Jacobian is the best local linear approximation.

Dual numbers isolate exactly this linear part.

Because

\varepsilon^2 = 0,

all nonlinear higher-order terms vanish.

Only the local linear transformation survives.

Forward mode therefore computes the local geometry of the function.
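
The nilpotency is visible directly in the Dual arithmetic sketched earlier:

```python
eps = Dual(0.0, 1.0)  # a pure infinitesimal
print(eps * eps)      # Dual(value=0.0, tangent=0.0): eps^2 vanishes

# Consequence: in (x + eps)^3 the quadratic and cubic terms in eps drop out,
# leaving only the linear coefficient 3*x^2 in the tangent slot.
x = Dual(2.0, 1.0)
print(x * x * x)      # Dual(value=8.0, tangent=12.0), and 3*2^2 = 12
```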

Geometric Meaning of the Chain Rule

Suppose

f : \mathbb{R}^n \to \mathbb{R}^m

and

g : \mathbb{R}^m \to \mathbb{R}^k.

Then

g \circ f

induces tangent propagation:

(x, v) \mapsto (f(x), Df_x(v)) \mapsto (g(f(x)), Dg_{f(x)}(Df_x(v))).

Thus:

D(g \circ f)_x(v) = Dg_{f(x)}(Df_x(v)).

This is the chain rule.

Geometrically, tangent vectors move through successive local linearizations.

Forward mode AD performs exactly this process during execution.
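
The same composition, sketched with the Dual class and sin helper above:

```python
import math

def f(x):
    return x * x   # f'(x) = 2x

def g(y):
    return sin(y)  # g'(y) = cos y (the Dual-aware sin from earlier)

x, v = 1.5, 1.0
out = g(f(Dual(x, v)))  # the tangent flows through both local linearizations

# Chain rule by hand: (g o f)'(x) = cos(x^2) * 2x.
print(out.tangent, math.cos(x * x) * 2 * x)
```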

Pushforwards

In geometry, transporting tangent vectors through a map is called the pushforward.

For a smooth map f : M → N between manifolds, the pushforward at x is

f_* : T_x M \to T_{f(x)} N.

It maps tangent vectors from one space into another.

Automatic differentiation computes pushforwards mechanically.

Each program instruction pushes tangent information forward through the computational graph.

This is why forward mode is sometimes called tangent-linear propagation.

Example in Two Dimensions

Consider:

f(x, y) = (xy, \sin x).

Take a tangent vector:

(x, y, v_x, v_y).

The Jacobian is:

J = \begin{bmatrix} y & x \\ \cos x & 0 \end{bmatrix}.

Applying the tangent map:

J \begin{bmatrix} v_x \\ v_y \end{bmatrix} = \begin{bmatrix} y v_x + x v_y \\ (\cos x) v_x \end{bmatrix}.

Using dual numbers:

x \mapsto x + v_x\varepsilon, \quad y \mapsto y + v_y\varepsilon.

Then:

xy = xy + (y v_x + x v_y)\varepsilon

and

\sin x = \sin x + (\cos x) v_x \varepsilon.

The coefficients match exactly.

Forward mode therefore computes tangent-vector transport through multivariate functions.
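
A numerical check (reusing the Dual sketch above, with illustrative inputs): seed x and y with v_x and v_y, then compare the output tangents against the Jacobian-vector product computed by hand.

```python
import math

x, y = 1.2, 0.7
vx, vy = 1.0, 2.0

dx, dy = Dual(x, vx), Dual(y, vy)
out1 = dx * dy  # first component,  x*y
out2 = sin(dx)  # second component, sin(x)

print(out1.tangent, y * vx + x * vy)   # both print y*v_x + x*v_y
print(out2.tangent, math.cos(x) * vx)  # both print (cos x)*v_x
```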

Infinitesimal Geometry

Dual numbers provide a first-order approximation to local geometry.

At infinitesimal scale:

  • curves become tangent vectors
  • smooth maps become linear maps
  • nonlinear structure disappears

The nilpotent condition

\varepsilon^2 = 0

removes curvature terms and retains only first-order behavior.

This is why dual numbers naturally encode differentiation.

Tangent Spaces as Linearization

At every point x, the tangent space provides the best linear approximation to the surrounding geometry.

Automatic differentiation operates entirely within this linearized world.

Forward mode propagates tangent vectors:

v \mapsto Jv.

Reverse mode propagates cotangent vectors:

u \mapsto J^T u.

These correspond to two dual geometric viewpoints:

Mode           Geometric Object     Operation
Forward mode   Tangent vectors      Pushforward
Reverse mode   Cotangent vectors    Pullback

This distinction becomes central in large-scale optimization and deep learning.
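
For the two-dimensional example above, both products can be written out explicitly (a plain-Python sketch with illustrative seed vectors):

```python
import math

x, y = 1.2, 0.7
J = [[y, x],
     [math.cos(x), 0.0]]  # Jacobian of f(x, y) = (xy, sin x)

v = [1.0, 2.0]   # tangent vector: forward-mode seed
u = [1.0, -1.0]  # cotangent vector: reverse-mode seed

Jv  = [sum(J[i][k] * v[k] for k in range(2)) for i in range(2)]  # pushforward
JTu = [sum(J[k][i] * u[k] for k in range(2)) for i in range(2)]  # pullback

print(Jv)   # one forward pass combines Jacobian columns
print(JTu)  # one reverse pass combines Jacobian rows
```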

Relation to Differential Forms

Reverse mode AD is naturally connected to cotangent geometry.

A gradient is not fundamentally a tangent vector. It is a covector:

df_x.

It acts on tangent vectors:

df_x(v) = Df_x(v).

Forward mode pushes tangent vectors forward.

Reverse mode pulls covectors backward.

The two modes correspond to dual geometric structures.

Computational Graph Geometry

A computational graph can be viewed geometrically as a composition of local maps.

Each node defines a local transformation:

x \to y.

Forward mode propagates tangent information along graph edges.

At each node:

  1. Compute the primal output
  2. Apply the local Jacobian to the incoming tangent

The global derivative emerges from repeated local linear transport.

This interpretation explains why AD scales compositionally.
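
Unrolled node by node, the two steps above look like this for f(x) = sin(x) * x (a hand-written sketch; forward-mode AD automates exactly these assignments):

```python
import math

# Forward propagation through the graph of f(x) = sin(x) * x at x = 2, seed v = 1.
x, dx = 2.0, 1.0

# Node 1: s = sin(x). Primal output, then the local Jacobian cos(x) applied to dx.
s, ds = math.sin(x), math.cos(x) * dx

# Node 2: out = s * x. The local Jacobian [x, s] applied to the incoming (ds, dx).
out, dout = s * x, x * ds + s * dx

print(out, dout)  # f(2) and f'(2) = sin(2) + 2*cos(2)
```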

Geometric Summary

Dual numbers are not merely algebraic tricks. They model infinitesimal geometry.

A dual number

x + v\varepsilon

represents:

  • a point x
  • an attached tangent vector v

Applying a smooth function transports both:

f(x + v\varepsilon) = f(x) + Df_x(v)\varepsilon.

Forward mode automatic differentiation is therefore:

  • tangent bundle evaluation
  • infinitesimal motion propagation
  • local linear transport through programs

The chain rule becomes geometric composition of tangent maps.