Meta-learning studies systems that improve how they learn. Instead of only optimizing model parameters for one task, a meta-learning method optimizes some part of the learning process itself: an initialization, an optimizer, a loss function, a data-selection rule, a representation, or an adaptation procedure.
A standard supervised learning problem learns parameters for one dataset. A meta-learning problem considers a distribution over tasks:

$$\mathcal{T}_i \sim p(\mathcal{T})$$
Each task has its own data, loss, and adaptation objective. The meta-learner tries to find parameters or procedures that perform well after task-specific adaptation.
A common structure is:

$$\min_\theta \; \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \left[ \mathcal{L}_i^{\text{query}}\big(\mathcal{A}(\theta, D_i^{\text{support}})\big) \right]$$

Here, $\mathcal{A}$ is an adaptation algorithm. It may be one gradient step, several gradient steps, a learned optimizer, or another differentiable procedure. Automatic differentiation is central because the meta-objective often requires gradients through the adaptation process.
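This structure can be sketched as a plain-Python toy: quadratic losses, with one explicit gradient step standing in for the adaptation algorithm. All names and constants here are illustrative, not part of any library.

```python
# Toy illustration of the meta-learning structure:
# adapt on support data, then evaluate on query data.
# Each task fits a scalar theta; the loss is squared error.

def query_loss(theta, task):
    return (theta - task["query_target"]) ** 2

def adapt(theta, task, alpha=0.1):
    # One explicit gradient step on the support loss
    # d/dtheta (theta - t)^2 = 2 * (theta - t)
    grad = 2.0 * (theta - task["support_target"])
    return theta - alpha * grad

def meta_objective(theta, tasks, alpha=0.1):
    # Average query loss after task-specific adaptation.
    return sum(query_loss(adapt(theta, t, alpha), t) for t in tasks) / len(tasks)

tasks = [
    {"support_target": 1.0, "query_target": 1.2},
    {"support_target": -0.5, "query_target": -0.4},
]
value = meta_objective(0.0, tasks)
```

Minimizing `meta_objective` over `theta` rewards an initialization that adapts well, which is exactly the structure the formula above describes.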
Tasks and Episodes
Meta-learning usually organizes data into episodes. Each episode samples a task and splits its data into two parts.
| Component | Role |
|---|---|
| Support set | Used for task adaptation |
| Query set | Used for meta-objective |
| Inner loop | Learns or adapts on the support set |
| Outer loop | Updates meta-parameters using query loss |
For few-shot classification, a task may contain a small number of labeled examples per class. The model adapts using the support set, then is evaluated on the query set.
The important distinction is that training and evaluation happen inside each episode. The outer objective rewards parameters that adapt well, not merely parameters that fit the support examples.
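As a sketch, episode construction is just a disjoint support/query split per sampled task. The helper below is hypothetical and uses only the standard library:

```python
import random

def make_episode(examples, n_support, n_query, rng):
    # Sample a disjoint support/query split for one episode.
    sample = rng.sample(examples, n_support + n_query)
    return sample[:n_support], sample[n_support:]

rng = random.Random(0)
data = list(range(20))
support, query = make_episode(data, n_support=5, n_query=10, rng=rng)
```

The support examples feed the inner loop; the query examples feed the outer objective.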
Inner and Outer Optimization
Meta-learning often has two nested optimization loops.
The inner loop adapts parameters for one task:

$$\theta_i^{(k+1)} = \theta_i^{(k)} - \alpha \nabla_\theta \mathcal{L}_i^{\text{support}}(\theta_i^{(k)})$$

After $K$ inner steps, the adapted parameters are $\theta_i' = \theta_i^{(K)}$. The outer loop updates the initial parameters using the query loss:

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_i \mathcal{L}_i^{\text{query}}(\theta_i')$$

The meta-gradient is:

$$\nabla_\theta \mathcal{L}_i^{\text{query}}(\theta_i') = \left(\frac{\partial \theta_i'}{\partial \theta}\right)^{\!\top} \nabla_{\theta_i'} \mathcal{L}_i^{\text{query}}(\theta_i')$$

This gradient depends on how $\theta_i'$ changes with $\theta$. Therefore, the outer backward pass must differentiate through the inner optimization steps.
Model-Agnostic Meta-Learning
Model-Agnostic Meta-Learning, commonly called MAML, learns an initialization that can adapt quickly to new tasks.
For one inner step:

$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_i^{\text{support}}(\theta)$$

The outer objective is:

$$\min_\theta \sum_i \mathcal{L}_i^{\text{query}}(\theta_i')$$

The meta-gradient is:

$$\nabla_\theta \sum_i \mathcal{L}_i^{\text{query}}\big(\theta - \alpha \nabla_\theta \mathcal{L}_i^{\text{support}}(\theta)\big)$$

This expression contains a gradient inside another gradient. Differentiating it requires second-order information because $\theta_i'$ depends on $\nabla_\theta \mathcal{L}_i^{\text{support}}(\theta)$.
In AD terms, MAML is nested differentiation. The inner gradient creates an adapted parameter value. The outer gradient differentiates through that construction.
Second-Order Terms
For one inner step, let:

$$g(\theta) = \nabla_\theta \mathcal{L}^{\text{support}}(\theta), \qquad \theta' = \theta - \alpha\, g(\theta)$$

The outer loss is:

$$\mathcal{L}^{\text{query}}(\theta')$$

By the chain rule:

$$\nabla_\theta \mathcal{L}^{\text{query}}(\theta') = \left(\frac{\partial \theta'}{\partial \theta}\right)^{\!\top} \nabla_{\theta'} \mathcal{L}^{\text{query}}(\theta')$$

Since

$$\frac{\partial \theta'}{\partial \theta} = I - \alpha H(\theta),$$

the meta-gradient is:

$$\nabla_\theta \mathcal{L}^{\text{query}}(\theta') = \big(I - \alpha H(\theta)\big)^{\!\top} \nabla_{\theta'} \mathcal{L}^{\text{query}}(\theta')$$

where $H(\theta) = \nabla_\theta^2 \mathcal{L}^{\text{support}}(\theta)$ is the Hessian of the support loss.
This does not require forming the full Hessian explicitly. Reverse mode AD can compute Hessian-vector products through nested differentiation.
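One way to sanity-check the algebra is a 1-D toy where every quantity is analytic. The sketch below, assuming quadratic support and query losses with illustrative constants, compares the closed-form one-step meta-gradient (1 − αH) · ∇query against a finite difference of the unrolled objective:

```python
# Check the one-step meta-gradient against a finite difference
# of the unrolled objective, on 1-D quadratic losses.

a, c = 1.5, 1.0   # support loss: a*(theta - c)^2, so H = 2a
b, d = 0.7, -0.5  # query loss:   b*(theta - d)^2
alpha = 0.1

def adapted(theta):
    grad_support = 2 * a * (theta - c)
    return theta - alpha * grad_support

def unrolled(theta):
    # Query loss evaluated at the adapted parameters.
    return b * (adapted(theta) - d) ** 2

def meta_grad(theta):
    theta_prime = adapted(theta)
    grad_query = 2 * b * (theta_prime - d)
    hessian = 2 * a                      # Hessian of the support loss
    return (1 - alpha * hessian) * grad_query

theta = 0.3
eps = 1e-6
fd = (unrolled(theta + eps) - unrolled(theta - eps)) / (2 * eps)
assert abs(meta_grad(theta) - fd) < 1e-6
```

In higher dimensions the same check works with Hessian-vector products instead of the explicit scalar Hessian.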
First-Order Approximations
Full MAML can be expensive because it differentiates through inner gradients. First-order variants ignore the Hessian term and approximate:

$$\frac{\partial \theta'}{\partial \theta} \approx I$$

Then:

$$\nabla_\theta \mathcal{L}^{\text{query}}(\theta') \approx \nabla_{\theta'} \mathcal{L}^{\text{query}}(\theta')$$
This reduces memory and computation. It also changes the optimization problem. The approximation may work well when inner steps are small or when second-order effects are not critical.
From an AD-system perspective, first-order MAML is usually implemented by detaching the inner gradient or adapted parameters at selected points. This cuts part of the derivative graph.
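The effect of the dropped Hessian term can be seen numerically in a 1-D setting with assumed quadratic losses (illustrative constants only):

```python
# First-order MAML drops the (1 - alpha*H) factor: its meta-gradient
# is just the query gradient evaluated at the adapted point.

a, c = 1.5, 1.0   # support loss: a*(theta - c)^2
b, d = 0.7, -0.5  # query loss:   b*(theta - d)^2
alpha = 0.1
theta = 0.3

theta_adapted = theta - alpha * 2 * a * (theta - c)
grad_query = 2 * b * (theta_adapted - d)

full_meta_grad = (1 - alpha * 2 * a) * grad_query  # exact one-step gradient
first_order = grad_query                           # "detached" approximation
```

Here the two differ by exactly the factor (1 − 2aα); the approximation is closer when α or the support curvature is small.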
Differentiating Through Optimizers
Meta-learning often treats the optimizer as part of the differentiable program.
A simple SGD inner update is:

```python
grad = gradient(support_loss, theta)
theta = theta - alpha * grad
```

If $\alpha$ is fixed, only $\theta$ is meta-learned. If $\alpha$ is trainable, AD can compute:

$$\frac{\partial \mathcal{L}^{\text{query}}}{\partial \alpha}$$
For vector-valued or layer-wise learning rates, the update becomes:

$$\theta' = \theta - \boldsymbol{\alpha} \odot \nabla_\theta \mathcal{L}^{\text{support}}(\theta)$$

Then $\boldsymbol{\alpha}$ is a meta-parameter. The outer loss can train it.
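As a sketch, the derivative of the query loss with respect to a trainable learning rate can be checked on a 1-D quadratic (the setup and constants are illustrative):

```python
# Chain rule for a trainable alpha: theta' = theta - alpha * g, so
# dL_query/dalpha = -g * grad_query(theta'), checked by finite difference.

a, c = 2.0, 0.5   # support loss: a*(theta - c)^2
b, d = 1.0, -1.0  # query loss:   b*(theta - d)^2
theta = 0.0

def unrolled(alpha):
    g = 2 * a * (theta - c)
    theta_prime = theta - alpha * g
    return b * (theta_prime - d) ** 2

alpha = 0.05
g = 2 * a * (theta - c)
theta_prime = theta - alpha * g
analytic = -g * 2 * b * (theta_prime - d)

eps = 1e-6
fd = (unrolled(alpha + eps) - unrolled(alpha - eps)) / (2 * eps)
assert abs(analytic - fd) < 1e-5
```

This is the same nested-differentiation machinery as before, just with α in the role of the meta-parameter.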
More complex learned optimizers replace the update rule itself:

$$(\theta_{t+1}, s_{t+1}) = U_\phi\big(\theta_t, \nabla_\theta \mathcal{L}^{\text{support}}(\theta_t), s_t\big)$$

where $U_\phi$ is a learned update function with parameters $\phi$, and $s_t$ is optimizer state. The outer loop differentiates through this optimizer to update $\phi$.
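A minimal sketch of such an update rule, with φ = (log learning rate, momentum coefficient) and a momentum buffer as the state — a toy for illustration, not any particular published optimizer:

```python
import math

def learned_update(theta, grad, state, phi):
    # U_phi: a parameterized update rule with its own state.
    log_lr, mu = phi
    state = mu * state + grad            # momentum accumulation
    theta = theta - math.exp(log_lr) * state
    return theta, state

theta, state = 1.0, 0.0
phi = (math.log(0.1), 0.9)               # phi is what the outer loop trains
for _ in range(3):
    grad = 2 * theta                      # gradient of theta**2
    theta, state = learned_update(theta, grad, state, phi)
```

In a real system the loop above would be part of the differentiable graph, so that gradients of the query loss flow back into `phi`.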
Memory Cost of Nested Differentiation
Meta-learning can be memory-intensive. The outer backward pass may need all intermediate states from the inner loop.
For $K$ inner steps, the graph contains:

```
theta_0 -> grad_0 -> theta_1 -> grad_1 -> theta_2 -> ... -> theta_K -> query_loss
```

Reverse mode through this graph stores activations for each support loss and each update step. Memory grows with the number of inner steps, model size, and task batch size.
Common mitigations include:
| Technique | Effect |
|---|---|
| First-order approximation | Drops second-order graph |
| Truncated inner loop | Limits differentiation depth |
| Checkpointing | Recomputes inner states during backward |
| Implicit differentiation | Differentiates optimality conditions |
| Smaller task batches | Reduces episode memory |
The mathematical objective and the executable training program should be kept distinct. Changing the truncation or detach behavior changes the meta-gradient.
Implicit Meta-Gradients
When the inner loop approximately solves an optimization problem, we can sometimes use implicit differentiation instead of unrolling.
Suppose task adaptation finds:

$$\theta^*(\phi) = \arg\min_\theta \mathcal{L}^{\text{support}}(\theta, \phi)$$

where $\phi$ contains meta-parameters such as regularization weights or representation parameters. The optimum satisfies:

$$\nabla_\theta \mathcal{L}^{\text{support}}(\theta^*, \phi) = 0$$

The query loss is:

$$\mathcal{L}^{\text{query}}\big(\theta^*(\phi)\big)$$

Implicit differentiation gives a meta-gradient without backpropagating through every inner optimization step. It requires solving a linear system involving the Hessian of the support objective.
This is useful when the inner optimization runs for many steps or uses a solver whose iterations are not convenient to store.
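A 1-D sketch shows the mechanics, assuming a ridge-regularized inner problem with a closed-form solution: differentiate the stationarity condition instead of the solver.

```python
# Inner problem: theta*(lam) = argmin_theta (theta - y)^2 + lam * theta^2.
# Stationarity: F(theta, lam) = 2*(theta - y) + 2*lam*theta = 0.
# Implicit function theorem: dtheta*/dlam = -(dF/dlam) / (dF/dtheta),
# with no unrolling of inner optimization steps.

y, t, lam = 2.0, 1.0, 0.5

theta_star = y / (1 + lam)          # closed-form inner solution
dF_dtheta = 2 + 2 * lam             # Hessian of the inner objective
dF_dlam = 2 * theta_star
dtheta_dlam = -dF_dlam / dF_dtheta

# Meta-gradient of the query loss (theta* - t)^2 w.r.t. lam:
implicit_meta_grad = 2 * (theta_star - t) * dtheta_dlam

# Reference: differentiate theta*(lam) = y / (1 + lam) directly.
direct = 2 * (y / (1 + lam) - t) * (-y / (1 + lam) ** 2)
assert abs(implicit_meta_grad - direct) < 1e-12
```

In higher dimensions, "dividing by dF/dtheta" becomes solving a linear system with the support Hessian, as noted above.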
Meta-Learning Representations
Not all meta-learning differentiates through parameter updates. Another common approach learns a representation that supports fast task-specific heads.
Let:

$$z = f_\theta(x)$$

be a shared representation. For each task, a small classifier is fitted on support examples. The meta-objective updates $\theta$ so that the representation works well across tasks.
If the task head is trained by gradient descent, AD may differentiate through the head training. If the head has a closed-form solution, such as ridge regression, AD may differentiate through the linear solve.
This connects meta-learning with implicit layers and differentiable optimization.
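For instance, a 1-D ridge head has a closed-form fit that AD could differentiate through. The code below is a toy with scalar features and hypothetical names:

```python
# Closed-form 1-D ridge-regression head fitted on support features:
#   w = sum(z*y) / (sum(z^2) + lam)
# Differentiating through this solve is how gradients would reach
# the shared representation that produced the features z.

def ridge_head(zs, ys, lam):
    num = sum(z * y for z, y in zip(zs, ys))
    den = sum(z * z for z in zs) + lam
    return num / den

zs = [1.0, 2.0, 3.0]   # support features from the shared representation
ys = [2.0, 4.1, 5.9]   # support targets
w = ridge_head(zs, ys, lam=0.1)
```

Because the fit is a smooth function of the features, the whole head is just another differentiable node in the meta-objective.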
Non-Differentiable Adaptation
Some adaptation procedures include discrete choices: selecting data, modifying architectures, choosing prompts, or applying rules. Ordinary AD cannot directly differentiate through these choices.
There are several options.
A differentiable relaxation replaces a hard choice with a soft one. A score-function estimator estimates gradients through sampling. A straight-through estimator uses a discrete forward pass and a surrogate backward pass. Reinforcement learning treats adaptation as a policy optimization problem.
Each option changes the meaning and quality of the gradient. AD supplies derivatives where the computation is differentiable. It does not make arbitrary discrete learning procedures differentiable by itself.
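As a sketch of the straight-through idea, written framework-free so the surrogate "backward" is spelled out by hand:

```python
# Straight-through estimator: the forward pass makes a hard,
# non-differentiable choice; the backward pass pretends the
# choice was the identity and passes the gradient through.

def ste_forward(x):
    return 1.0 if x > 0 else 0.0   # hard threshold

def ste_backward(upstream_grad):
    # Surrogate gradient: identity, ignoring the threshold.
    return upstream_grad

x = 0.3
y = ste_forward(x)            # discrete forward value
grad_x = ste_backward(1.0)    # surrogate gradient w.r.t. x
```

The mismatch between forward and backward is deliberate, and it is exactly the sense in which the resulting "gradient" is a biased surrogate rather than a true derivative.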
Practical Implementation Pattern
A MAML-style training step can be written as:
```python
meta_loss = 0
for task in sample_tasks():
    theta_task = theta
    for inner_step in range(K):
        support_loss = loss(model(theta_task), task.support)
        support_grad = gradient(support_loss, theta_task)
        theta_task = theta_task - alpha * support_grad
    query_loss = loss(model(theta_task), task.query)
    meta_loss += query_loss
meta_loss = meta_loss / num_tasks
meta_grad = gradient(meta_loss, theta)
theta = outer_optimizer.update(theta, meta_grad)
```

The critical part is that theta_task must remain connected to theta if full meta-gradients are desired. If it is detached, the implementation becomes a first-order or partially stopped-gradient method.
AD Pitfalls
Meta-learning stresses AD systems in several ways.
Nested gradients require careful control over graph retention. Many frameworks free backward graphs after one use unless told otherwise.
In-place parameter updates can corrupt the graph. Inner-loop parameters are often represented functionally rather than by mutating the original model weights.
Gradient buffers can mix inner-loop and outer-loop gradients if they are not isolated.
Second-order gradients may fail for operations with custom backward rules that do not themselves support differentiation.
Randomness inside tasks can make meta-gradients noisy unless episode sampling is controlled.
Memory may grow unexpectedly if graphs are retained across episodes or logging keeps references to tensors.
These are systems issues around nested differentiation. The chain rule remains simple. The implementation discipline is difficult.
Interface to AD Systems
A meta-learning system benefits from a functional parameter interface:
```python
loss = model_loss(params, batch)
grads = grad(loss, params)
new_params = update(params, grads)
```

This style treats parameters as explicit values. It avoids mutating global model state during inner adaptation. It also makes it easier to differentiate through updates.
The AD engine must support gradients of gradients, vector-Jacobian products through update rules, and careful graph lifetime management.
Meta-learning is therefore a direct test of an AD system’s composability. Basic training needs one derivative of one loss. Meta-learning needs derivatives through derivative-producing programs.