Differentiable Optimization Layers

An optimization layer is a program component whose output is the solution of an optimization problem. Instead of computing

$$ z = f(x) $$

by a fixed formula, it computes

$$ z^*(\theta) = \arg\min_z L(z,\theta). $$

The parameters θ may come from a neural network, a simulator, a control system, or a statistical model. The output z* is then passed to another computation, often a scalar loss.

This pattern appears in structured prediction, control, planning, portfolio optimization, inverse problems, and constrained machine learning.

Optimization as a Layer

An ordinary layer applies an explicit function:

$$ h_{k+1} = \sigma(W h_k + b). $$

An optimization layer applies an implicit function:

$$ z^* = \operatorname{solve}(\theta). $$

For example, a quadratic programming layer may solve

$$ \begin{aligned} z^* = \arg\min_z \quad & \frac{1}{2}z^TQz + q^Tz \\ \text{subject to} \quad & Az \leq b. \end{aligned} $$

Here the input may be Q, q, A, or b. The output is the optimizer z*.

The surrounding model treats this optimizer as a differentiable value. During the backward pass, the AD system must compute how z* changes when the inputs change.

Unconstrained Optimization Layers

Start with the smooth unconstrained case:

$$ z^*(\theta)=\arg\min_z L(z,\theta). $$

At a local optimum,

$$ \nabla_z L(z^*,\theta)=0. $$

Define

$$ F(z,\theta)=\nabla_z L(z,\theta). $$

Then the implicit derivative follows from

$$ F(z^*,\theta)=0. $$

Differentiating with respect to θ gives

$$ \nabla^2_{zz}L \frac{\partial z^*}{\partial \theta} + \nabla^2_{z\theta}L = 0. $$

Therefore,

$$ \frac{\partial z^*}{\partial \theta} = - \left(\nabla^2_{zz}L\right)^{-1} \nabla^2_{z\theta}L. $$

This formula says that sensitivity of the optimum is controlled by the curvature of the objective. If the Hessian is well conditioned, small parameter changes cause controlled changes in the optimizer. If the Hessian is nearly singular, the optimizer may be highly sensitive.
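
To make the formula concrete, here is a minimal NumPy sketch (an illustration, not from any particular library) for a separable objective whose optimum has a closed form, checking the implicit derivative against the known answer:

```python
import numpy as np

# Inner objective: L(z, theta) = sum(cosh(z) - theta * z)  (elementwise).
# Stationarity: sinh(z*) = theta, so z* = arcsinh(theta) in closed form.
theta = np.array([0.3, -1.2, 2.0])
z_star = np.arcsinh(theta)

# Implicit-function formula: dz*/dtheta = -(H_zz)^{-1} H_ztheta,
# with H_zz = diag(cosh(z*)) and H_ztheta = -I for this objective.
H_zz = np.diag(np.cosh(z_star))
H_ztheta = -np.eye(len(theta))
dz_dtheta = -np.linalg.solve(H_zz, H_ztheta)

# Check against the known derivative of arcsinh: 1 / sqrt(1 + theta^2).
expected = np.diag(1.0 / np.sqrt(1.0 + theta**2))
assert np.allclose(dz_dtheta, expected)
```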

Reverse Mode for Optimization Layers

Suppose a scalar outer loss depends on the optimizer:

$$ \ell = R(z^*,\theta). $$

Let

$$ \bar{z} = \frac{\partial \ell}{\partial z^*}. $$

Reverse mode avoids forming the full Jacobian of z* with respect to θ.

Solve the adjoint system

$$ \left(\nabla^2_{zz}L\right)^T \lambda = \bar{z}^T. $$

For a twice continuously differentiable objective, the Hessian is symmetric, so this reduces to

$$ \nabla^2_{zz}L \, \lambda = \bar{z}^T. $$

Then the gradient contribution to θ is

$$ \bar{\theta} = - \lambda^T \nabla^2_{z\theta}L. $$

If the outer loss also depends directly on θ, add that term:

$$ \frac{d\ell}{d\theta} = \frac{\partial R}{\partial \theta} - \lambda^T \nabla^2_{z\theta}L. $$

This is the standard reverse-mode rule for a smooth unconstrained optimization layer.
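
Continuing the toy objective from the sketch above, the reverse-mode recipe amounts to one adjoint solve followed by a cheap product. Again a hedged NumPy illustration, verified against finite differences:

```python
import numpy as np

theta = np.array([0.3, -1.2, 2.0])
z_star = np.arcsinh(theta)           # inner optimum, as before

# Outer loss: l = sum(z*), so zbar = dl/dz* is a vector of ones.
zbar = np.ones_like(z_star)

# Adjoint solve: H_zz lambda = zbar (H_zz is symmetric here).
H_zz = np.diag(np.cosh(z_star))
lam = np.linalg.solve(H_zz, zbar)

# thetabar = -lambda^T H_ztheta, and H_ztheta = -I for this objective.
thetabar = lam

# Finite-difference check of dl/dtheta.
eps = 1e-6
fd = np.array([
    (np.sum(np.arcsinh(theta + eps * e)) - np.sum(np.arcsinh(theta - eps * e))) / (2 * eps)
    for e in np.eye(len(theta))
])
assert np.allclose(thetabar, fd, atol=1e-6)
```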

Quadratic Objectives

Consider

$$ L(z,\theta)=\frac{1}{2}z^TQz - \theta^Tz, $$

where Q is symmetric positive definite.

The optimum satisfies

$$ Qz^*-\theta=0. $$

Thus,

$$ z^* = Q^{-1}\theta. $$

The derivative is

$$ \frac{\partial z^*}{\partial \theta}=Q^{-1}. $$

In reverse mode, given $\bar{z}$, solve

$$ Q\lambda=\bar{z}^T. $$

Then

$$ \bar{\theta}=\lambda. $$

This simple example shows the relation between optimization layers and linear solvers. A quadratic minimization layer is a linear solve layer.
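
A short NumPy sketch of this equivalence; in a real implementation the factorization of Q computed in the forward solve would typically be cached and reused in the backward solve:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)          # symmetric positive definite
theta = rng.standard_normal(n)

# Forward: the minimization is exactly a linear solve.
z_star = np.linalg.solve(Q, theta)

# Backward: given zbar, one more solve with the same matrix.
zbar = rng.standard_normal(n)
thetabar = np.linalg.solve(Q, zbar)  # Q is symmetric, so Q^T = Q

# Equivalent dense computation: thetabar = (dz*/dtheta)^T zbar = Q^{-1} zbar.
assert np.allclose(thetabar, np.linalg.inv(Q) @ zbar)
```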

Constrained Optimization Layers

Many useful optimization layers include constraints:

$$ \begin{aligned} z^*(\theta)=\arg\min_z \quad & L(z,\theta) \\ \text{subject to} \quad & c(z,\theta)=0, \\ & g(z,\theta)\leq 0. \end{aligned} $$

The solution is characterized by the Karush-Kuhn-Tucker (KKT) conditions, under suitable regularity assumptions.

For equality constraints, define the Lagrangian

$$ \mathcal{L}(z,\nu,\theta) = L(z,\theta)+\nu^Tc(z,\theta). $$

At a solution,

$$ \nabla_z \mathcal{L}(z,\nu,\theta)=0, \qquad c(z,\theta)=0. $$

These equations form an implicit system in the primal variable z and the multiplier ν.

Let

$$ u = \begin{bmatrix} z \\ \nu \end{bmatrix}. $$

Then define

$$ F(u,\theta)= \begin{bmatrix} \nabla_z \mathcal{L}(z,\nu,\theta) \\ c(z,\theta) \end{bmatrix}. $$

The derivative of the optimizer follows from

$$ F(u,\theta)=0. $$

The backward pass solves a linear system involving the KKT matrix.

KKT Linear System

For equality-constrained problems, the linearized KKT system has the form

$$ \begin{bmatrix} \nabla^2_{zz}\mathcal{L} & c_z^T \\ c_z & 0 \end{bmatrix} \begin{bmatrix} dz \\ d\nu \end{bmatrix} = - \begin{bmatrix} F_\theta^{(1)} \\ F_\theta^{(2)} \end{bmatrix} d\theta. $$

The matrix

$$ \begin{bmatrix} \nabla^2_{zz}\mathcal{L} & c_z^T \\ c_z & 0 \end{bmatrix} $$

is the KKT matrix.

In reverse mode, the transpose KKT system is solved against the upstream adjoint. The result is then used to compute gradients with respect to the problem parameters.

The structure matters. KKT matrices are often sparse, symmetric, and indefinite. Efficient optimization layers exploit this structure instead of treating the system as a dense matrix.
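
The following hedged NumPy sketch differentiates an equality-constrained quadratic program through its KKT system. It forms the KKT matrix densely for clarity, which a serious implementation would avoid:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 2
G = rng.standard_normal((n, n))
Q = G @ G.T + n * np.eye(n)          # symmetric positive definite
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
theta = rng.standard_normal(n)

# Forward: min 0.5 z'Qz - theta'z  s.t.  Az = b, via the KKT system.
K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([theta, b]))
z_star, nu = sol[:n], sol[n:]

# Backward: transpose KKT solve against the upstream adjoint zbar.
zbar = rng.standard_normal(n)
adj = np.linalg.solve(K.T, np.concatenate([zbar, np.zeros(m)]))
thetabar = adj[:n]                   # dF/dtheta = [-I; 0] for this problem

# Finite-difference check on l(theta) = zbar . z*(theta).
eps = 1e-6
fd = np.empty(n)
for i in range(n):
    tp, tm = theta.copy(), theta.copy()
    tp[i] += eps; tm[i] -= eps
    zp = np.linalg.solve(K, np.concatenate([tp, b]))[:n]
    zm = np.linalg.solve(K, np.concatenate([tm, b]))[:n]
    fd[i] = zbar @ (zp - zm) / (2 * eps)
assert np.allclose(thetabar, fd, atol=1e-5)
```

Here the KKT matrix is symmetric, so the transpose solve coincides with the forward solve; the two differ only in their right-hand sides.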

Inequality Constraints

Inequality constraints add complementarity:

$$ g(z,\theta)\leq 0, \qquad \mu \geq 0, \qquad \mu_i g_i(z,\theta)=0. $$

The active constraints are those with

$$ g_i(z^*,\theta)=0. $$

Locally, if the active set is stable, active inequalities behave like equality constraints. Inactive inequalities have zero multiplier and do not enter the local equality system.

This gives a piecewise smooth solution map. The derivative is valid as long as the active set does not change.

At active-set boundaries, the optimizer may fail to be differentiable. The gradient may jump, become undefined, or require a generalized derivative.

Convex Optimization Layers

Convex optimization layers are common because convex problems have strong stability properties.

A typical layer solves

$$ \begin{aligned} \min_z \quad & L(z,\theta) \\ \text{subject to} \quad & z \in C(\theta), \end{aligned} $$

where L is convex in z and C is a convex feasible set.

If the solution is unique and regularity conditions hold, the solution map is differentiable almost everywhere. This makes convex layers practical inside larger differentiable models.

Examples include:

| Layer | Typical use |
| --- | --- |
| quadratic program | control, structured prediction |
| cone program | differentiable convex modeling |
| projection layer | constraints, normalization |
| optimal transport | matching, distribution alignment |
| sparse coding | learned representations |
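
As one concrete route to such layers, the cvxpylayers library builds a differentiable layer directly from a CVXPY problem description. A hedged sketch, assuming cvxpylayers and its dependencies are installed; the projection-onto-simplex problem here is illustrative:

```python
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

n = 3
y = cp.Parameter(n)
z = cp.Variable(n)
# Euclidean projection of y onto the probability simplex, as a convex layer.
problem = cp.Problem(
    cp.Minimize(cp.sum_squares(z - y)),
    [cp.sum(z) == 1, z >= 0],
)
layer = CvxpyLayer(problem, parameters=[y], variables=[z])

y_t = torch.randn(n, requires_grad=True)
(z_star,) = layer(y_t)        # forward: solve the projection problem
z_star.sum().backward()       # backward: implicit differentiation of the KKT system
```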

Projection Layers

A projection layer computes

$$ z^* = \Pi_C(y) = \arg\min_{z\in C} \frac{1}{2}\|z-y\|^2. $$

If C is a linear subspace, the projection is linear and easy to differentiate.

If C is a convex set with boundaries, the projection is piecewise smooth. The derivative depends on the active constraints at the projected point.

For example, projection onto the nonnegative orthant is

$$ z_i^* = \max(y_i,0). $$

Its derivative is

$$ \frac{\partial z_i^*}{\partial y_i} = \begin{cases} 1, & y_i>0, \\ 0, & y_i<0. \end{cases} $$

At y_i = 0, the classical derivative is undefined. AD systems usually choose a convention, often one-sided or subgradient-based.
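
A hedged sketch of one such convention, mirroring the common AD choice of a zero derivative at the kink (the helper names are illustrative):

```python
import numpy as np

def proj_nonneg(y):
    # Projection onto the nonnegative orthant (elementwise relu).
    return np.maximum(y, 0.0)

def proj_nonneg_jvp(y, dy):
    # Convention: derivative 0 at y_i == 0, matching the subgradient
    # choice most AD systems make for max(y, 0).
    return np.where(y > 0.0, dy, 0.0)
```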

Optimization Layers vs Loss Functions

An optimization problem can appear in two places.

First, it can be the training objective:

minθR(θ). \min_\theta R(\theta).

Second, it can be part of the model:

$$ z^*(\theta)=\arg\min_z L(z,\theta), \qquad \min_\theta R(z^*(\theta),\theta). $$

The second case is bilevel optimization. The inner problem defines a layer. The outer problem trains parameters that affect the inner solution.

This pattern appears in meta-learning, hyperparameter optimization, differentiable architecture search, and inverse design.

Implementation Pattern

A differentiable optimization layer usually has the following structure:

```
forward(theta):
    z, solver_state = solve_optimization_problem(theta)
    save z, theta, solver_state if needed
    return z

backward(zbar):
    define the optimality residual F(z, theta)
    solve the transposed linearized optimality system against zbar
    compute the gradient with respect to theta
    return thetabar
```

The residual is the key contract. It defines what mathematical problem the layer claims to solve.

The solver may use interior-point methods, active-set methods, projected gradient, ADMM, or domain-specific algorithms. The backward rule should correspond to the optimality conditions, not necessarily to the internal iteration trace.
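
As one concrete instantiation of this pattern, here is a hedged PyTorch sketch of the quadratic layer from earlier with an implicit backward pass; the class name and structure are illustrative, not a library API:

```python
import torch

class QPLayer(torch.autograd.Function):
    """Solves min_z 0.5 z^T Q z - theta^T z, whose optimum satisfies Q z* = theta."""

    @staticmethod
    def forward(ctx, Q, theta):
        z = torch.linalg.solve(Q, theta)
        ctx.save_for_backward(Q, z)
        return z

    @staticmethod
    def backward(ctx, zbar):
        Q, z = ctx.saved_tensors
        # Adjoint solve against the optimality residual F(z, Q, theta) = Qz - theta.
        lam = torch.linalg.solve(Q.T, zbar)
        # Implicit differentiation of F = 0 gives Qbar = -lam z^T, thetabar = lam.
        return -torch.outer(lam, z), lam

Q = (2.0 * torch.eye(3)).requires_grad_()
theta = torch.randn(3, requires_grad=True)
z_star = QPLayer.apply(Q, theta)
z_star.sum().backward()        # populates Q.grad and theta.grad via the implicit rule
```

Note that the backward pass differentiates the optimality residual, not the iterations of whatever solver produced z; this is exactly the contract the pseudocode above describes.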

Approximate Solves

In practice, optimization layers rarely solve exactly. They stop at a tolerance.

This creates a modeling choice.

One option is to differentiate the approximate computation by unrolling the solver. This gives the exact derivative of the finite algorithm.

Another option is to apply implicit differentiation to the approximate solution. This gives an approximate derivative of the exact solution.

A third option is to define a surrogate backward pass that trades mathematical precision for stability or speed.

The choice should be explicit. Hidden mismatches between forward solve accuracy and backward derivative assumptions are a common source of training instability.
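
The difference is easy to see in one dimension. A hedged sketch comparing the unrolled derivative of K gradient-descent steps with the implicit derivative at the fixed point:

```python
# 1-D inner problem: min_z 0.5 * q * z^2 - theta * z, so z* = theta / q.
q, theta, alpha, K = 4.0, 1.5, 0.1, 25

# Option 1: unroll K gradient-descent steps and differentiate the unrolled map.
# z_{k+1} = z_k - alpha * (q * z_k - theta); dz/dtheta follows by the chain rule.
z, dz = 0.0, 0.0
for _ in range(K):
    z = z - alpha * (q * z - theta)
    dz = dz * (1 - alpha * q) + alpha

# Option 2: implicit differentiation at the (approximate) solution: dz*/dtheta = 1/q.
implicit = 1.0 / q

print(dz, implicit)   # the unrolled derivative approaches 1/q as K grows
```

The gap between the two is (1 - alpha*q)^K / q, so a loose iteration budget in the forward pass translates directly into a biased unrolled gradient.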

Regularization for Differentiability

Optimization layers often need regularization to produce stable gradients.

For example, adding

$$ \frac{\rho}{2}\|z\|^2 $$

makes a convex objective strongly convex when ρ > 0. Strong convexity improves uniqueness and conditioning.

Entropy regularization plays a similar role in optimal transport and assignment problems. It smooths a non-smooth combinatorial problem into a differentiable approximation.

Regularization changes the mathematical layer. It should be treated as part of the model, not only as a numerical trick.
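
For concreteness, a minimal NumPy Sinkhorn sketch for entropy-regularized optimal transport; every step is smooth, so the resulting plan is differentiable in the cost matrix and the marginals:

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.5, iters=200):
    """Entropy-regularized optimal transport between marginals a and b."""
    K = np.exp(-C / eps)                 # Gibbs kernel; smooths the underlying LP
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)                # alternate matching of column marginals
        u = a / (K @ v)                  # ... and row marginals
    return u[:, None] * K * v[None, :]   # transport plan, smooth in C, a, b

rng = np.random.default_rng(2)
C = rng.random((4, 5))
a = np.full(4, 1 / 4)
b = np.full(5, 1 / 5)
P = sinkhorn(C, a, b)
assert np.allclose(P.sum(axis=1), a) and np.allclose(P.sum(axis=0), b, atol=1e-6)
```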

Failure Modes

Optimization layers can fail for several reasons.

The problem may be infeasible. Then z* does not exist.

The solution may be non-unique. Then the solution map may depend on arbitrary solver choices.

The KKT matrix may be singular or ill-conditioned. Then gradients may explode.

The active set may change. Then derivatives may be discontinuous.

The solver may stop early. Then the backward pass may differentiate a solution that was not actually reached.

A robust implementation reports primal residuals, dual residuals, complementarity gaps, active-set status, and backward linear-solve residuals.

Design Rule

A differentiable optimization layer should make four contracts explicit:

| Contract | Question |
| --- | --- |
| primal problem | What optimization problem is being solved? |
| solution convention | What happens if multiple solutions exist? |
| derivative rule | Is the backward pass unrolled, implicit, or custom? |
| numerical tolerance | How accurate must forward and backward solves be? |

Without these contracts, the layer may appear differentiable while producing gradients with unclear meaning.

Role in Differentiable Systems

Optimization layers connect learning systems with structured numerical reasoning.

A neural network can predict the parameters of a constrained problem. The optimization layer enforces physical, logical, or economic structure. The outer loss trains the whole pipeline end to end.

This gives a useful division of labor. Neural components handle approximation from data. Optimization components enforce precise constraints.

Automatic differentiation supplies the interface between them.