Chapter 12. AD in Modern Programming Languages

Lisp Systems

Lisp is one of the natural homes of automatic differentiation. It treats programs as data, has a simple expression syntax, and supports macro systems that can transform code before execution. These properties make it possible to implement automatic differentiation as a direct program transformation rather than as an external compiler pass.

A Lisp system represents code as symbolic expressions. The expression

(* (+ x 1) (sin x))

is both a program and a tree-shaped data structure. A differentiation system can inspect this tree, rewrite it, attach derivative rules, and produce another program. This gives Lisp a clean route to source-to-source automatic differentiation.
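This code-as-data property can be demonstrated directly at the REPL. A minimal sketch (the names *expr* and swap-args are only illustrative):

(defparameter *expr* '(* (+ x 1) (sin x)))

;; The quoted expression is an ordinary list that can be inspected.
(first *expr*)    ; => *
(second *expr*)   ; => (+ X 1)
(third *expr*)    ; => (SIN X)

;; Rewriting it produces another program.
(defun swap-args (expr)
  "Return EXPR with its two arguments exchanged."
  (list (first expr) (third expr) (second expr)))

(swap-args *expr*)   ; => (* (SIN X) (+ X 1))

Nothing distinguishes the rewritten list from hand-written source; it can be compiled or evaluated like any other code.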

For a function

(defun f (x)
  (* (+ x 1) (sin x)))

an AD system may generate a new function that computes both the primal value and its derivative:

(defun f-forward (x dx)
  (let* ((v1 (+ x 1))
         (dv1 (+ dx 0))

         (v2 (sin x))
         (dv2 (* (cos x) dx))

         (v3 (* v1 v2))
         (dv3 (+ (* dv1 v2)
                 (* v1 dv2))))
    (values v3 dv3)))

This is forward mode. Each intermediate variable has a primal part and a tangent part. The primal part computes the original program. The tangent part computes the directional derivative.
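As a quick sanity check, the generated function can be compared against the analytic derivative of f, which is sin x + (x + 1) cos x. An evaluation sketch using the f-forward definition above:

(multiple-value-bind (y dy) (f-forward 2.0 1.0)
  ;; y  = (2 + 1) * (sin 2)
  ;; dy = (sin 2) + (2 + 1) * (cos 2)
  (list y dy))

Seeding dx with 1.0 makes the tangent output exactly f'(x); other seed values give directional derivatives scaled accordingly.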

The same idea can be expressed with dual numbers:

(defstruct dual
  primal
  tangent)

Then arithmetic operations are lifted:

(defun dual+ (a b)
  (make-dual
   :primal (+ (dual-primal a) (dual-primal b))
   :tangent (+ (dual-tangent a) (dual-tangent b))))

(defun dual* (a b)
  (make-dual
   :primal (* (dual-primal a) (dual-primal b))
   :tangent (+ (* (dual-tangent a) (dual-primal b))
               (* (dual-primal a) (dual-tangent b)))))

The user writes an ordinary-looking program, but the numbers flowing through it carry derivative information. This style is easy to prototype in Lisp because operators can be wrapped, redefined, or dispatched dynamically.
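Unary primitives are lifted the same way, and the original function can then be written against the lifted operators. A sketch building on the dual struct above (the names dual-sin and f-dual are assumptions; note that the constant 1 is injected with a zero tangent):

(defun dual-sin (a)
  (make-dual
   :primal (sin (dual-primal a))
   :tangent (* (cos (dual-primal a)) (dual-tangent a))))

;; The original function f, expressed with lifted operators.
(defun f-dual (x)
  (dual* (dual+ x (make-dual :primal 1 :tangent 0))
         (dual-sin x)))

;; Seeding the tangent with 1 yields f and f' together.
(let ((result (f-dual (make-dual :primal 2.0 :tangent 1.0))))
  (list (dual-primal result) (dual-tangent result)))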

Macros and Source Transformation

Macros are the central advantage of Lisp for AD implementation. A macro receives code before evaluation and returns transformed code. This lets the AD system rewrite a function into a derivative-producing function.

A simple forward-mode transformation follows this pattern:

(defmacro with-forward-ad ((x dx) body)
  ;; Expand BODY into code that propagates primal and tangent values.
  ...)

Conceptually, the macro transforms each expression according to local derivative rules:

Source expression   Primal    Tangent
(+ a b)             (+ a b)   (+ da db)
(- a b)             (- a b)   (- da db)
(* a b)             (* a b)   (+ (* da b) (* a db))
(/ a b)             (/ a b)   (/ (- (* da b) (* a db)) (* b b))
(sin a)             (sin a)   (* (cos a) da)
(exp a)             (exp a)   (* (exp a) da)

The transformation is local. Each node in the expression tree is rewritten using the chain rule. The compiler then receives ordinary Lisp code.
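Such a local rewriter can be prototyped as an ordinary recursive function over expression trees. A minimal sketch (the names forward-transform and tangent-name are assumptions, and only +, *, and sin are covered):

(defun tangent-name (sym)
  "Name of the tangent variable paired with SYM, e.g. X -> DX."
  (intern (concatenate 'string "D" (symbol-name sym))))

(defun forward-transform (expr)
  "Return (values primal-code tangent-code) for EXPR."
  (cond
    ((numberp expr) (values expr 0))          ; constants have zero tangent
    ((symbolp expr) (values expr (tangent-name expr)))
    (t (destructuring-bind (op a &optional b) expr
         (multiple-value-bind (pa ta) (forward-transform a)
           (case op
             (sin (values `(sin ,pa) `(* (cos ,pa) ,ta)))
             (t (multiple-value-bind (pb tb) (forward-transform b)
                  (case op
                    (+ (values `(+ ,pa ,pb) `(+ ,ta ,tb)))
                    (* (values `(* ,pa ,pb)
                               `(+ (* ,ta ,pb) (* ,pa ,tb)))))))))))))

;; (forward-transform '(* (+ x 1) (sin x)))
;; => (* (+ X 1) (SIN X))
;;    (+ (* (+ DX 0) (SIN X)) (* (+ X 1) (* (COS X) DX)))

A macro such as with-forward-ad would call a rewriter like this at expansion time and splice the resulting primal and tangent code into a let* form.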

This is a major design difference from systems based only on operator overloading. Operator overloading changes the meaning of values. Source transformation changes the program itself. Lisp supports both styles.

Reverse Mode in Lisp

Reverse mode is more complex because the derivative flows backward through the computation. The system must record intermediate values, allocate adjoints, and then run a reverse pass.

For the same function

(defun f (x)
  (* (+ x 1) (sin x)))

a reverse-mode transformation produces code shaped like this:

(defun f-reverse (x)
  (let* ((v1 (+ x 1))
         (v2 (sin x))
         (v3 (* v1 v2))

         (av3 1)
         (av2 (* av3 v1))
         (av1 (* av3 v2))
         (ax (+ av1 (* av2 (cos x)))))
    (values v3 ax)))

The forward pass computes and stores v1, v2, and v3. The reverse pass starts from the output adjoint av3 = 1 and propagates sensitivity backward.
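As with forward mode, the generated code can be checked against the analytic gradient sin x + (x + 1) cos x. An evaluation sketch using the f-reverse definition above:

(multiple-value-bind (y grad) (f-reverse 2.0)
  ;; grad = (sin 2) + (2 + 1) * (cos 2)
  (list y grad))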

For a scalar-output function with many inputs, reverse mode is efficient: it computes the full gradient at a cost that is a small constant multiple of the cost of one evaluation of the original function, regardless of the number of inputs. That is why reverse mode became the dominant form in neural network training.

Tapes and Wengert Lists

A Lisp reverse-mode AD system can use a tape. The tape is a list of primitive operations executed during the forward pass. Each entry stores enough information to run the corresponding derivative rule later.

A tape entry may contain:

Field      Meaning
op         Primitive operation, such as +, *, or sin
inputs     References to input variables
output     Reference to output variable
primal     Computed primal value
pullback   Function that propagates adjoints backward

In Lisp, the pullback can be represented as a closure:

(lambda (bar-output)
  ;; Add contributions to input adjoints.
  ...)

This representation is flexible. It works well for prototyping and supports dynamic control flow. Its cost is allocation overhead. A production compiler may instead lower the tape into arrays or static instructions.
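The tape-plus-closure design can be sketched in a few lines. In this minimal version (the names *tape*, var, tape*, and backward are assumptions), each recorded operation closes over its own inputs and output:

(defvar *tape* '())   ; list of pullback closures, most recent first

(defstruct var value (adjoint 0))

(defun tape* (a b)
  "Multiply two VAR nodes, recording a pullback on the tape."
  (let ((out (make-var :value (* (var-value a) (var-value b)))))
    (push (lambda ()
            ;; Chain rule for multiplication: accumulate into inputs.
            (incf (var-adjoint a) (* (var-adjoint out) (var-value b)))
            (incf (var-adjoint b) (* (var-adjoint out) (var-value a))))
          *tape*)
    out))

(defun backward (out)
  "Seed the output adjoint and run the pullbacks in reverse order."
  (setf (var-adjoint out) 1)
  (mapc #'funcall *tape*)   ; PUSH prepends, so *tape* is newest-first
  (setf *tape* '()))

;; d/dx (x * x) at x = 3:
(let* ((x (make-var :value 3))
       (y (tape* x x)))
  (backward y)
  (var-adjoint x))   ; => 6

Because x appears twice in the product, its adjoint receives two contributions of 3, illustrating why pullbacks accumulate with incf rather than overwrite.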

Control Flow

Lisp programs commonly use conditionals, recursion, higher-order functions, and macros. AD must define how each of these behaves.

A conditional differentiates only the branch taken:

(if (> x 0)
    (* x x)
    (- x))

At x > 0, the derivative is 2x. At x < 0, the derivative is -1. At x = 0, the program sits on a branch boundary: the right branch gives a one-sided derivative of 0 and the left branch gives -1, so the derivative is undefined even though the program still returns a value.

Loops and recursion are handled by differentiating the executed computation. In reverse mode, this means the system must replay or store the sequence of operations actually performed. This makes dynamic control flow natural but increases pressure on memory.

Higher-order functions introduce another issue. Consider:

(mapcar #'sin xs)

The AD system needs a derivative rule not only for sin, but also for mapcar as a control structure over many elements. A source transformer can expand or specialize such calls when enough structure is known.
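One way to see what such a rule must do: in an overloading style, the lifted primitive is simply mapped over derivative-carrying values. A self-contained sketch using (value . tangent) pairs in place of the dual struct (pair-sin is an illustrative name):

;; Elementwise forward-mode sin: each pair is (value . tangent).
(defun pair-sin (p)
  (cons (sin (car p)) (* (cos (car p)) (cdr p))))

(mapcar #'pair-sin
        '((0.0 . 1.0) (0.5 . 1.0) (1.0 . 1.0)))
;; Each result pair holds the element's sin value and its tangent
;; (cos x * dx).

A source transformer can reach the same result by rewriting the mapped function itself, leaving mapcar untouched as a control structure.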

Strengths of Lisp for AD

Lisp systems have several clear strengths for automatic differentiation.

Strength                Why it matters
Code as data            AD can inspect and rewrite programs directly
Macros                  Differentiation can be implemented as a language extension
Dynamic typing          Fast experimentation with derivative-carrying values
Closures                Pullbacks and custom derivative rules are easy to represent
REPL workflow           AD systems can be developed interactively
Symbolic manipulation   Symbolic and automatic methods can be mixed

These properties made Lisp attractive for early AI, symbolic mathematics, and program transformation research. They also make it a useful laboratory for AD design.

Limitations

The same features that make Lisp flexible can make high-performance AD difficult.

Dynamic dispatch can obscure types. Generic arithmetic can allocate heavily. Closures used as pullbacks are elegant but expensive. Macro expansion may produce large code if the source program is deeply nested. Mutation and side effects require careful treatment, especially in reverse mode.

A practical Lisp AD system therefore needs a boundary between flexible front-end representation and efficient execution. One common design is:

Layer            Role
User language    Ordinary Lisp functions and macros
AD transformer   Rewrites expressions into derivative code
IR layer         Normalized computation graph or SSA-like form
Optimizer        Removes redundant primal and tangent work
Runtime          Executes dense numeric kernels efficiently

This mirrors the broader architecture of modern AD compilers. Lisp provides an expressive front end, but the differentiated program often needs a lower-level representation for speed.

Lisp as a Model for Differentiable Programming

Lisp shows that automatic differentiation can be understood as a language feature, not only as a numeric library. The AD system can participate in compilation. It can transform source code, introduce new variables, generate reverse passes, and attach custom derivative rules.

This viewpoint remains important in modern differentiable programming. Systems such as JAX, Julia AD tools, Swift for TensorFlow, and compiler-level AD frameworks all use some form of program transformation. Lisp reached this idea early because its syntax and macro system made program transformation ordinary.

The key lesson is simple: when a language exposes programs as transformable objects, automatic differentiation becomes a systematic rewrite of computation. The derivative program is not external to the original program. It is another program generated from it by applying the chain rule to the structure of evaluation.