Implicit Layers

An implicit layer defines its output as the solution of an equation, not as a fixed sequence of explicit operations. Instead of computing

z = f_\theta(x)

by applying a known list of layers, an implicit layer defines z through a condition:

F(z, x, \theta) = 0.

The value z is found by a solver. The solver may use Newton iterations, fixed-point iteration, conjugate gradients, a root finder, or an optimization algorithm. The layer output is whatever value satisfies the equation to within the solver's tolerance.

This changes the role of automatic differentiation. The forward pass may contain many solver iterations, but the backward pass does not always need to differentiate through every iteration. In many cases, it can differentiate the equation that the solution satisfies.

Explicit vs Implicit Layers

An explicit layer gives a direct formula:

z = f_\theta(x).

A residual block, convolution, normalization layer, or MLP layer has this form.

An implicit layer gives a defining equation:

F(z, x, \theta) = 0.

Examples include equilibrium models, differentiable optimization layers, physics solvers, constrained systems, and layers defined by fixed points.

A fixed-point layer has the form:

z = g_\theta(z, x).

This can be rewritten as:

F(z, x, \theta) = z - g_\theta(z, x) = 0.

The forward computation solves for z. The model then uses z as an activation.
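
As a minimal sketch of the forward computation, a contractive map can be solved by plain iteration. The update g below is a toy choice, not taken from any particular model or library:

import jax.numpy as jnp

def fixed_point_solve(g, x, theta, z0, tol=1e-6, max_iter=100):
    # Iterate z <- g(z, x, theta) until the update stops changing.
    z = z0
    for _ in range(max_iter):
        z_next = g(z, x, theta)
        if jnp.max(jnp.abs(z_next - z)) < tol:
            break
        z = z_next
    return z_next

def g(z, x, theta):
    return jnp.tanh(theta * z + x)   # contractive for |theta| < 1

z_star = fixed_point_solve(g, x=jnp.array(0.5), theta=0.3, z0=jnp.array(0.0))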

Differentiating the Solution

Assume z satisfies:

F(z, x, \theta) = 0.

We want the derivative of z with respect to θ. Differentiate both sides:

\frac{\partial F}{\partial z} \frac{\partial z}{\partial \theta} + \frac{\partial F}{\partial \theta} = 0.

Solving for the sensitivity gives:

\frac{\partial z}{\partial \theta} = - \left( \frac{\partial F}{\partial z} \right)^{-1} \frac{\partial F}{\partial \theta}.

This is the implicit function theorem in computational form. It says that we can obtain derivatives of the solution by solving a linear system involving the Jacobian of the defining equation.

The inverse is usually not formed explicitly. Implementations solve linear systems or compute vector-Jacobian products.
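
A small worked example, with the input x omitted for brevity: take

F(z, \theta) = z^3 + z - \theta = 0.

Then ∂F/∂z = 3z^2 + 1 and ∂F/∂θ = -1, so the rule above gives

\frac{\partial z}{\partial \theta} = \frac{1}{3 z^2 + 1},

evaluated at the solved z, with no reference to whichever root finder produced it.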

Reverse Mode Form

In neural network training, the loss is scalar. Suppose the backward pass receives an upstream adjoint:

\bar{z} = \frac{\partial L}{\partial z}.

We need gradients with respect to x and θ. Rather than forming ∂z/∂θ explicitly, reverse mode solves an adjoint linear system.

Let

J_z = \frac{\partial F}{\partial z}.

Find u such that:

J_z^\top u = \bar{z}.

Then:

\bar{\theta} = - \left( \frac{\partial F}{\partial \theta} \right)^\top u, \qquad \bar{x} = - \left( \frac{\partial F}{\partial x} \right)^\top u.

This is usually the practical backward rule for an implicit layer. The AD system is used to compute products with J_z^T, (∂F/∂θ)^T, and (∂F/∂x)^T, while an iterative solver handles the linear system.
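
A minimal JAX sketch of this rule, assuming F is an ordinary differentiable function of (z, x, theta) that returns the residual and that all arguments are plain arrays. The helper name is illustrative, and GMRES serves only as a stand-in iterative linear solver:

import jax
from jax.scipy.sparse.linalg import gmres

def implicit_backward(F, z, x, theta, z_bar):
    # Vector-Jacobian products of F at the solved point, supplied by AD.
    _, vjp_F = jax.vjp(F, z, x, theta)

    # Solve Jz^T u = z_bar; the matvec only needs the z-component of the VJP.
    u, _ = gmres(lambda v: vjp_F(v)[0], z_bar)

    # theta_bar = -(dF/dtheta)^T u and x_bar = -(dF/dx)^T u, via the same VJP.
    _, x_cot, theta_cot = vjp_F(u)
    return -x_cot, -theta_cot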

Why Not Differentiate Through the Solver?

One direct method is to unroll the solver and apply ordinary reverse mode AD through all iterations.

z = initial_guess(x)

for k in range(K):
    z = solver_step(z, x, theta)   # each of the K steps is recorded by AD

loss = downstream_loss(z)
backward(loss)                     # reverse mode walks back through all K steps

This works when K is small and fixed. It also gives the derivative of the finite computation that was actually executed.

But it has drawbacks. It stores many intermediate solver states. It ties the gradient to the chosen number of iterations. It can be unstable for long or adaptive solvers. It may also compute a derivative of the algorithm rather than the derivative of the converged solution.

Implicit differentiation instead differentiates the solved equation. It treats the solver as a method for finding zz, not as the mathematical definition of the layer.

Fixed-Point Layers

A fixed-point layer solves:

z = g_\theta(z, x).

Define:

F(z, x, \theta) = z - g_\theta(z, x).

Then:

J_z = I - \frac{\partial g_\theta}{\partial z}.

The reverse-mode adjoint solve becomes:

\left( I - \frac{\partial g_\theta}{\partial z} \right)^\top u = \bar{z}.

Once u is found:

\bar{\theta} = \left( \frac{\partial g_\theta}{\partial \theta} \right)^\top u, \qquad \bar{x} = \left( \frac{\partial g_\theta}{\partial x} \right)^\top u.

The signs change because F = z - g_θ(z, x): the leading minus sign in the general rule cancels against ∂F/∂θ = -∂g_θ/∂θ, and likewise for x.

This structure appears in deep equilibrium models, where a repeated transformation is run until it reaches an equilibrium state.
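
Written in terms of vector-Jacobian products of g, the adjoint solve and sign pattern above can be sketched as follows. GMRES again stands in for whatever linear solver a real implementation would use, and z, x, theta are assumed to be plain arrays:

import jax
from jax.scipy.sparse.linalg import gmres

def fixed_point_backward(g, z, x, theta, z_bar):
    # Local vector-Jacobian products of g at the equilibrium point.
    _, vjp_g = jax.vjp(g, z, x, theta)

    # Solve (I - dg/dz)^T u = z_bar.
    u, _ = gmres(lambda v: v - vjp_g(v)[0], z_bar)

    # theta_bar = (dg/dtheta)^T u and x_bar = (dg/dx)^T u; no minus signs here.
    _, x_bar, theta_bar = vjp_g(u)
    return x_bar, theta_bar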

Optimization Layers

An optimization layer defines its output as the solution to an optimization problem:

z^\star(x, \theta) = \arg\min_z \phi(z, x, \theta).

The first-order optimality condition is:

\nabla_z \phi(z^\star, x, \theta) = 0.

This is an implicit equation. Let

F(z, x, \theta) = \nabla_z \phi(z, x, \theta).

Then implicit differentiation can be applied to the optimality condition.

For constrained optimization, the defining equation usually includes Karush-Kuhn-Tucker conditions. The layer output and multipliers are differentiated together.

This is how differentiable quadratic programs, convex optimization layers, and some control layers expose gradients to a larger neural network.
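
As a small worked case, take a quadratic objective with a fixed positive definite matrix Q (an illustrative choice, with the input x omitted for brevity):

\phi(z, \theta) = \tfrac{1}{2} z^\top Q z - \theta^\top z.

Its optimality condition is

F(z, \theta) = \nabla_z \phi = Q z - \theta = 0,

so J_z = Q, and the general sensitivity rule gives

\frac{\partial z^\star}{\partial \theta} = Q^{-1}.

In practice this is applied through linear solves with Q, not by differentiating the iterations of the inner optimizer.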

Solvers in the Backward Pass

Implicit layers often require a solver during the backward pass. The forward pass solves a nonlinear or optimization problem. The backward pass solves a linear system.

This introduces a second source of numerical approximation. The gradient depends on the tolerance of the backward solve. If the linear system is solved poorly, gradients may be noisy or biased.

A practical implementation must choose:

Design choice         Consequence
Forward tolerance     Accuracy of layer output
Backward tolerance    Accuracy of gradient
Solver type           Speed and stability
Preconditioner        Linear solve efficiency
Maximum iterations    Runtime bound
Stopping rule         Determinism and accuracy

The AD engine supplies Jacobian-vector or vector-Jacobian products. The solver controls convergence.
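
These choices typically surface as explicit parameters. The sketch below is hypothetical; the field names are illustrative rather than taken from any particular library:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ImplicitLayerConfig:
    forward_tol: float = 1e-6              # accuracy of the layer output
    forward_max_iter: int = 50             # runtime bound on the nonlinear solve
    backward_tol: float = 1e-8             # accuracy of the gradient
    backward_max_iter: int = 100           # runtime bound on the adjoint solve
    forward_solver: str = "newton"         # speed and stability trade-off
    backward_solver: str = "gmres"         # linear solver for the adjoint system
    preconditioner: Optional[str] = None   # linear solve efficiency
    # Stopping rule: fixed iteration counts are deterministic; residual-based
    # stopping adapts accuracy but can make runs less reproducible.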

Memory Benefits

Implicit differentiation can reduce memory use. If the backward rule differentiates the defining equation, it does not need to store every forward solver iterate.

The forward pass may store only:

solution z
input x
parameters theta
solver metadata

The backward pass recomputes local derivatives around the solution.

This is useful for deep equilibrium models and differentiable solvers where unrolling would create very long computation graphs.

The memory saving is not free. Backward computation may require additional iterative solves and repeated evaluations of local derivative products.

Accuracy and Convergence

Implicit differentiation assumes that the forward solution is sufficiently close to a true solution and that the relevant Jacobian is nonsingular or suitably regularized.

If the solver stops far from a solution, then the condition

F(z, x, \theta) = 0

does not really hold. The implicit gradient may then describe the derivative of a nearby idealized solution, not the derivative of the actual finite computation.

If J_z is ill-conditioned, the adjoint solve may amplify numerical error. Small perturbations in the upstream gradient can produce large changes in u.

These issues are not AD bugs. They are numerical properties of the implicit problem.

Interaction With Automatic Differentiation

Implicit layers usually combine custom backward rules with ordinary AD.

The forward layer calls a solver:

z = solve(lambda z: F(z, x, theta) == 0)

The backward rule defines an adjoint solve:

u = solve_linear(Jz_T_v, z_bar)      # solve Jz^T u = z_bar

theta_bar = -vjp_theta(F, u)
x_bar = -vjp_x(F, u)

The functions Jz_T_v, vjp_theta, and vjp_x are commonly implemented using the host AD system. Thus AD still performs local differentiation, but the layer-level derivative is specified analytically.

This is a useful pattern: use mathematics to define the global derivative, then use AD to compute the local products needed by that derivative.
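
In JAX, this pattern can be registered with jax.custom_vjp so that ordinary reverse mode sees only the analytical rule. The condensed sketch below uses a toy elementwise equation and a toy Newton solve; the equation, solver, and iteration counts are illustrative choices, not a recommended implementation:

import jax
import jax.numpy as jnp

def F(z, x, theta):
    # Defining equation in residual form: F(z, x, theta) = 0 at the solution.
    return z**3 + theta * z - x

@jax.custom_vjp
def implicit_layer(x, theta):
    # Forward solve by a few Newton steps on F; dF/dz = 3 z^2 + theta.
    z = jnp.zeros_like(x)
    for _ in range(20):
        z = z - F(z, x, theta) / (3 * z**2 + theta)
    return z

def implicit_layer_fwd(x, theta):
    z = implicit_layer(x, theta)
    return z, (z, x, theta)          # store only the solution and inputs

def implicit_layer_bwd(res, z_bar):
    z, x, theta = res
    _, vjp_F = jax.vjp(F, z, x, theta)
    # The equation acts elementwise, so Jz is diagonal and the adjoint
    # solve Jz^T u = z_bar reduces to a division.
    u = z_bar / (3 * z**2 + theta)
    _, x_cot, theta_cot = vjp_F(u)
    return -x_cot, -theta_cot        # minus signs from the general rule

implicit_layer.defvjp(implicit_layer_fwd, implicit_layer_bwd)

x_bar = jax.grad(lambda x, t: implicit_layer(x, t).sum())(jnp.ones(3), jnp.asarray(1.0))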

Failure Modes

Implicit layers can fail in several ways.

The forward solver may not converge. Then the layer output is poorly defined.

The backward linear solve may not converge. Then the gradient is unreliable.

The implicit Jacobian may be singular or nearly singular. Then the derivative may be undefined or numerically unstable.

The forward and backward tolerances may be mismatched. A coarse forward solution with a strict backward solve can give misleading gradients.

The solver may use discontinuous control flow, such as adaptive stopping, branching, or active-set changes. The implicit derivative may ignore some algorithmic discontinuities.

The defining equation may have multiple solutions. The solver selects one branch, and the derivative is local to that branch.

Interface Design

A clean implicit layer exposes a small interface:

forward:
    solve F(z, x, theta) = 0
    return z

backward:
    receive z_bar
    solve Jz.T u = z_bar
    x_bar = -(dF/dx).T u
    theta_bar = -(dF/dtheta).T u
    return x_bar, theta_bar

The layer should document which equation is being differentiated, which solver tolerances are used, and whether the gradient corresponds to the converged solution or the finite unrolled computation.

This distinction matters. Differentiating through 20 iterations of a solver and differentiating the equation solved by that solver are related, but not identical. An AD system can support both. The model designer must choose the semantics.