
Sparse Expert Architectures

Dense transformers activate every parameter for every token. As models become larger, this approach becomes increasingly expensive. A trillion-parameter dense model would require enormous compute for every forward pass, even if only part of the model is needed for a given token.

Sparse expert architectures address this problem by activating only a subset of parameters for each token. The most common form is the Mixture-of-Experts transformer, usually abbreviated as MoE.

In an MoE transformer, most transformer layers remain shared, but some feedforward layers are replaced by collections of specialized subnetworks called experts. A routing mechanism selects which experts process each token.

This allows total parameter count to grow much larger than active compute per token.

Dense Versus Sparse Computation

In a standard transformer feedforward layer:

$$\text{FFN}(x) = W_2\,\phi(W_1 x + b_1) + b_2.$$

Every token uses the same parameters.

In a sparse expert layer, we instead have multiple feedforward networks:

$$E_1, E_2, \ldots, E_N.$$

A router chooses a subset of experts for each token.

If token $x_t$ uses experts $i$ and $j$, the computation becomes

$$y_t = g_i E_i(x_t) + g_j E_j(x_t),$$

where $g_i$ and $g_j$ are routing weights.

Only the selected experts are evaluated.
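
As a rough sketch of this per-token computation (the sizes, selected experts, and routing weights below are illustrative rather than produced by a real router):

```python
import torch
from torch import nn

# Illustrative sizes; in a real model these come from the architecture config.
d_model, d_ff, n_experts = 16, 64, 4

# Each expert is an ordinary two-layer feedforward network.
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
    for _ in range(n_experts)
])

x_t = torch.randn(d_model)   # one token representation
i, j = 1, 3                  # experts picked by the router (hard-coded here)
g_i, g_j = 0.7, 0.3          # routing weights, g_i + g_j = 1

# Only the two selected experts run; the remaining experts are skipped entirely.
y_t = g_i * experts[i](x_t) + g_j * experts[j](x_t)
```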

Why Sparse Experts Matter

Sparse experts separate total model capacity from active compute.

Suppose:

| Model type        | Parameters used per token  |
|-------------------|----------------------------|
| Dense transformer | All parameters             |
| Sparse MoE        | Small subset of parameters |

An MoE model may contain hundreds of billions of parameters while activating only a small fraction for each token.

This gives several advantages:

| Advantage                 | Explanation                                |
|---------------------------|--------------------------------------------|
| Larger total capacity     | More parameters overall                    |
| Lower active compute      | Only selected experts run                  |
| Expert specialization     | Different experts learn different patterns |
| Better scaling efficiency | More capacity per FLOP                     |

The key idea is conditional computation. Different inputs activate different pathways.

Structure of an MoE Layer

A standard MoE transformer layer contains:

  1. Shared attention layer.
  2. Router network.
  3. Multiple expert feedforward networks.
  4. Aggregation mechanism.

The attention layer is usually dense and shared across all tokens. The feedforward block becomes sparse.

The structure is:

$$x \rightarrow \text{Attention} \rightarrow \text{Router} \rightarrow \text{Selected Experts} \rightarrow \text{Aggregation}.$$

Each expert is usually a standard feedforward network:

$$E_i(x) = W_{2,i}\,\phi(W_{1,i} x + b_{1,i}) + b_{2,i}.$$

Router Networks

The router decides which experts process each token.

For token representation $x_t$, the router computes expert scores:

$$r_t = W_r x_t,$$

where

$$r_t \in \mathbb{R}^{N},$$

and $N$ is the number of experts.

A softmax converts scores into routing probabilities:

$$p_t = \text{softmax}(r_t).$$

The router then selects the top $k$ experts.

For top-1 routing:

$$i^* = \arg\max_i \, p_{t,i}.$$

For top-2 routing:

$$i_1, i_2 = \text{TopK}(p_t, 2).$$

The selected experts receive the token.
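
A minimal sketch of these routing steps for a batch of tokens, using a learned linear projection as the router; the sizes are illustrative:

```python
import torch
from torch import nn

n_tokens, d_model, n_experts = 8, 16, 4

router = nn.Linear(d_model, n_experts)       # learned router projection
x = torch.randn(n_tokens, d_model)           # token representations x_t

scores = router(x)                           # expert scores r_t, shape [n_tokens, n_experts]
probs = torch.softmax(scores, dim=-1)        # routing probabilities p_t

top1 = probs.argmax(dim=-1)                  # top-1 routing: i* for each token
top2_probs, top2_idx = probs.topk(2, dim=-1) # top-2 routing: (i_1, i_2) for each token

# In top-k systems the selected probabilities are usually renormalized
# so that the routing weights for each token sum to 1.
gates = top2_probs / top2_probs.sum(dim=-1, keepdim=True)
```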

Top-k Routing

Most MoE systems use top-$k$ routing.

| Routing type | Experts per token |
|--------------|-------------------|
| Top-1        | 1                 |
| Top-2        | 2                 |
| Top-4        | 4                 |

Top-1 routing is efficient because each token activates only one expert. Top-2 routing improves stability and quality because tokens can combine multiple expert outputs.

Suppose token $x_t$ selects experts $i$ and $j$. Then:

$$y_t = g_i E_i(x_t) + g_j E_j(x_t),$$

where

$$g_i + g_j = 1.$$

The router weights determine how strongly each expert contributes.

Token Dispatch

Once routing decisions are made, tokens must be grouped by expert.

Suppose:

| Token   | Assigned expert |
|---------|-----------------|
| $x_1$   | $E_2$           |
| $x_2$   | $E_5$           |
| $x_3$   | $E_2$           |
| $x_4$   | $E_1$           |

The system gathers tokens for each expert:

Expert 1: [x4]
Expert 2: [x1, x3]
Expert 5: [x2]

Each expert processes its assigned tokens independently.

After computation, outputs are scattered back to their original token positions.

This gather-scatter operation is one of the main systems challenges in MoE training.
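
A naive version of this dispatch, using boolean masks instead of optimized gather-scatter kernels; single linear layers stand in for full expert networks, and the assignments are hard-coded for illustration:

```python
import torch
from torch import nn

d_model, n_experts, n_tokens = 16, 4, 6
experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])

x = torch.randn(n_tokens, d_model)            # token representations
assigned = torch.tensor([1, 3, 1, 0, 2, 3])   # expert id per token (would come from the router)

out = torch.zeros_like(x)
for expert_id, expert in enumerate(experts):
    # Gather: positions of all tokens routed to this expert.
    idx = (assigned == expert_id).nonzero(as_tuple=True)[0]
    if idx.numel() > 0:
        # Each expert processes its group of tokens independently,
        # and results are scattered back to the original positions.
        out[idx] = expert(x[idx])
```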

Expert Specialization

Experts often develop specialized behavior.

Different experts may focus on:

| Specialization | Example                     |
|----------------|-----------------------------|
| Syntax         | Grammar and structure       |
| Mathematics    | Numerical reasoning         |
| Code           | Programming tokens          |
| Languages      | Different natural languages |
| Retrieval      | Citation or memory tokens   |
| Vision regions | Different spatial patterns  |

Specialization is not manually assigned. It emerges from routing and optimization dynamics.

However, specialization is imperfect. Experts may overlap substantially or collapse into similar behavior if routing is poorly balanced.

Load Balancing

A major problem in MoE systems is expert imbalance.

Suppose one expert receives most tokens:

Expert 1: 90% of tokens
Others: very few tokens

Then:

| Problem                 | Consequence                 |
|-------------------------|-----------------------------|
| Hot experts             | Compute bottlenecks         |
| Idle experts            | Wasted parameters           |
| Poor specialization     | Reduced model diversity     |
| Communication imbalance | Slower distributed training |

To avoid this, MoE systems use load-balancing losses.

A simplified balancing objective encourages:

  1. Similar routing probability across experts.
  2. Similar token counts across experts.

The total loss becomes

$$L = L_{\text{task}} + \lambda L_{\text{balance}}.$$

Here $\lambda$ controls the balancing strength.
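
One common form of the balancing term, in the spirit of the Switch Transformer auxiliary loss, multiplies each expert's fraction of dispatched tokens by its mean routing probability. A sketch:

```python
import torch

def load_balance_loss(probs: torch.Tensor, top1: torch.Tensor) -> torch.Tensor:
    """Auxiliary balancing loss sketch (Switch-Transformer style).

    probs: [n_tokens, n_experts] routing probabilities.
    top1:  [n_tokens] expert index each token was dispatched to.
    """
    n_experts = probs.shape[-1]
    # Fraction of tokens dispatched to each expert.
    tokens_per_expert = torch.bincount(top1, minlength=n_experts).float() / top1.numel()
    # Mean routing probability assigned to each expert.
    prob_per_expert = probs.mean(dim=0)
    # Small when both quantities are close to uniform (1 / n_experts each).
    return n_experts * torch.sum(tokens_per_expert * prob_per_expert)

# Total loss: L = L_task + lambda_balance * load_balance_loss(probs, top1)
```

The term is minimized when token counts and routing probabilities are both spread evenly across experts, which is exactly the balanced state the loss encourages.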

Capacity Limits

Experts usually have a maximum token capacity per batch.

Suppose:

| Expert | Maximum capacity |
|--------|------------------|
| $E_i$  | 128 tokens       |

If too many tokens are routed to one expert, extra tokens may be:

| Strategy            | Description            |
|---------------------|------------------------|
| Dropped             | Ignore excess tokens   |
| Re-routed           | Send to another expert |
| Buffered            | Process later          |
| Dynamically resized | Expand expert capacity |

Capacity limits prevent single experts from becoming overloaded.

The capacity factor controls allowed overflow:

$$\text{capacity} = \text{capacity factor} \times \frac{\text{tokens}}{\text{experts}}.$$
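
A small sketch of this capacity computation, together with the simplest overflow strategy (dropping excess tokens); the sizes and capacity factor are illustrative:

```python
import math
import torch

def expert_capacity(n_tokens: int, n_experts: int, capacity_factor: float) -> int:
    # capacity = capacity factor * tokens / experts, rounded up to an integer.
    return math.ceil(capacity_factor * n_tokens / n_experts)

cap = expert_capacity(n_tokens=1024, n_experts=8, capacity_factor=1.25)  # 160 tokens per expert

# Simplest overflow strategy: drop tokens beyond an expert's capacity.
assigned = torch.randint(0, 8, (1024,))            # expert id per token (random here)
idx = (assigned == 3).nonzero(as_tuple=True)[0]    # tokens routed to expert 3
kept, dropped = idx[:cap], idx[cap:]               # excess tokens are simply not processed
```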

Distributed Expert Parallelism

MoE systems are naturally distributed.

Different experts can be placed on different accelerators:

GPU 1: Experts 1-4
GPU 2: Experts 5-8
GPU 3: Experts 9-12

Tokens are routed across devices.

This is called expert parallelism.

Compared with dense tensor parallelism, expert parallelism has different tradeoffs:

| Dense parallelism                    | Expert parallelism                     |
|--------------------------------------|----------------------------------------|
| Split matrix computation             | Split experts                          |
| Every device active for every token  | Devices active only for routed tokens  |
| Regular communication                | Sparse communication                   |
| Predictable load                     | Routing imbalance possible             |

Communication becomes a central challenge. Tokens must move between devices efficiently.

Sparse Scaling Laws

MoE models often achieve better quality for a given active compute budget.

A sparse model may have:

| Metric                      | Value      |
|-----------------------------|------------|
| Total parameters            | 1 trillion |
| Active parameters per token | 50 billion |

The model behaves like a very large network in terms of capacity, but like a smaller network in terms of per-token compute.
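
A back-of-the-envelope sketch of this decoupling, with hypothetical layer sizes and top-2 routing (the numbers are illustrative and not taken from any particular model); only the expert feedforward weights are counted:

```python
# Hypothetical MoE configuration: 48 MoE layers, 64 experts each, top-2 routing.
d_model, d_ff, n_experts, top_k, n_layers = 8192, 32768, 64, 2, 48

params_per_expert = 2 * d_model * d_ff                           # two weight matrices, biases ignored
total_expert_params = n_layers * n_experts * params_per_expert   # parameters that exist
active_expert_params = n_layers * top_k * params_per_expert      # parameters used per token

print(f"total expert parameters: {total_expert_params / 1e12:.2f}T")
print(f"active per token:        {active_expert_params / 1e9:.1f}B")
```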

This changes the scaling tradeoff:

| Dense scaling                           | Sparse scaling                     |
|-----------------------------------------|------------------------------------|
| Capacity tied to compute                | Capacity partially decoupled       |
| More parameters always cost more FLOPs  | Extra inactive experts are cheap   |
| Compute grows with total size           | Compute grows with active experts  |

Sparse scaling is therefore attractive when parameter memory is cheaper than compute.

Routing Instability

Routers can become unstable during training.

Common issues include:

| Problem               | Description                   |
|-----------------------|-------------------------------|
| Expert collapse       | Few experts dominate          |
| Routing oscillation   | Tokens rapidly switch experts |
| Dead experts          | Some experts unused           |
| Noisy specialization  | Experts fail to stabilize     |
| Overconfident routing | Router entropy collapses      |

Several stabilization techniques are common:

| Technique                | Purpose                      |
|--------------------------|------------------------------|
| Auxiliary balancing loss | Spread tokens evenly         |
| Noisy routing            | Encourage exploration        |
| Temperature scaling      | Smooth routing probabilities |
| Capacity constraints     | Prevent overload             |
| Top-2 routing            | Improve robustness           |

Stable routing is essential for good expert utilization.
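
A sketch of two of these stabilization techniques applied to router logits; the noise scale and temperature below are illustrative choices, not recommended values:

```python
import torch
from torch import nn

d_model, n_experts = 16, 4
router = nn.Linear(d_model, n_experts)
x = torch.randn(8, d_model)           # a batch of token representations

logits = router(x)

# Noisy routing: perturb the logits during training to encourage exploration.
training, noise_std = True, 1.0       # illustrative noise scale
if training:
    logits = logits + noise_std * torch.randn_like(logits)

# Temperature scaling: temperatures above 1 smooth the routing distribution.
temperature = 1.5                     # illustrative value
probs = torch.softmax(logits / temperature, dim=-1)

# Top-2 routing gives each token a fallback expert, improving robustness.
top2_probs, top2_idx = probs.topk(2, dim=-1)
```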

Switch Transformers

Switch Transformers simplified MoE by using top-1 routing.

Instead of combining multiple experts, each token selects only one expert:

$$y_t = E_{i^*}(x_t),$$

where $i^*$ is the expert with the highest routing probability.

This reduces communication and computation.

Advantages:

| Benefit             | Explanation                 |
|---------------------|-----------------------------|
| Simpler routing     | One expert per token        |
| Lower communication | Fewer transfers             |
| Lower memory        | Fewer active expert outputs |
| Faster training     | Less aggregation overhead   |

The tradeoff is lower routing flexibility.

Switch-style routing showed that very large sparse models could scale efficiently with simpler systems design.

Shared and Specialized Layers

Most MoE transformers are hybrid architectures.

| Component          | Usually dense or sparse |
|--------------------|-------------------------|
| Attention layers   | Dense                   |
| Embeddings         | Dense                   |
| Output head        | Dense                   |
| Feedforward layers | Sparse experts          |

Attention layers remain shared because every token must exchange information globally. Sparse experts mainly replace the computationally expensive feedforward blocks.

This hybrid design balances communication cost and specialization.
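
A sketch of this hybrid layout: a dense, shared self-attention sublayer followed by a sparse expert feedforward sublayer. The `moe_ffn` argument is assumed to be any module mapping `[B, T, D]` to `[B, T, D]`, for example the `SimpleMoE` layer defined later in this article.

```python
import torch
from torch import nn

class HybridMoEBlock(nn.Module):
    """Transformer block: dense shared attention + sparse expert feedforward."""

    def __init__(self, d_model: int, n_heads: int, moe_ffn: nn.Module):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # dense, shared
        self.moe_ffn = moe_ffn                                                  # sparse experts
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Dense attention: every token exchanges information with every other token.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Sparse feedforward: each token runs only through its routed experts.
        return x + self.moe_ffn(self.norm2(x))
```

For example, `HybridMoEBlock(512, 8, SimpleMoE(512, 2048, 8))` would assemble one such block using the educational MoE layer shown below.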

MoE During Inference

Inference introduces additional challenges.

The model must:

  1. Run the router.
  2. Dispatch tokens to experts.
  3. Gather outputs.
  4. Manage KV cache.
  5. Coordinate devices.

Latency can increase if routing creates communication bottlenecks.

Inference optimization techniques include:

| Technique                     | Purpose                     |
|-------------------------------|-----------------------------|
| Expert caching                | Reuse loaded expert weights |
| Expert placement optimization | Reduce communication        |
| Token batching                | Improve utilization         |
| Routing locality              | Keep tokens near experts    |
| Quantized experts             | Reduce memory bandwidth     |

Sparse inference efficiency depends heavily on systems engineering.

Sparse Experts Beyond Language

MoE ideas are also used in:

| Domain             | Application                |
|--------------------|----------------------------|
| Vision             | Specialized image experts  |
| Multimodal systems | Experts per modality       |
| Speech             | Acoustic specialization    |
| Robotics           | Task-conditioned policies  |
| Retrieval systems  | Memory-aware routing       |

The general principle is conditional computation: activate only the parts of the model needed for the current input.

MoE Versus Dense Models

| Property             | Dense transformer     | Sparse MoE transformer     |
|----------------------|-----------------------|----------------------------|
| Active parameters    | All                   | Small subset               |
| Compute scaling      | Grows with total size | Grows with active experts  |
| Communication        | Simpler               | More complex               |
| Parameter efficiency | Lower                 | Higher                     |
| Systems complexity   | Lower                 | Higher                     |
| Specialization       | Shared representation | Expert specialization      |

Dense models are simpler and often more stable. Sparse models offer better scaling efficiency at very large sizes.

A Minimal MoE Layer in PyTorch

A simplified educational MoE layer:

```python
import torch
from torch import nn

class Expert(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)

class SimpleMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()

        self.router = nn.Linear(d_model, n_experts)

        self.experts = nn.ModuleList([
            Expert(d_model, d_ff)
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: [B, T, D]

        B, T, D = x.shape

        scores = self.router(x)
        probs = torch.softmax(scores, dim=-1)

        top_expert = probs.argmax(dim=-1)

        out = torch.zeros_like(x)

        # Dispatch loop: send each token to its single selected expert (top-1 routing).
        for expert_id, expert in enumerate(self.experts):
            mask = (top_expert == expert_id)

            if mask.any():
                selected = x[mask]
                result = expert(selected)
                out[mask] = result

        return out
```

This implementation is intentionally simple and inefficient. Real MoE systems use optimized grouped dispatch kernels and distributed expert execution.
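
A quick shape check of the layer above (the sizes are arbitrary):

```python
moe = SimpleMoE(d_model=64, d_ff=256, n_experts=8)
x = torch.randn(2, 10, 64)    # [batch, tokens, d_model]
y = moe(x)
print(y.shape)                # torch.Size([2, 10, 64])
```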

Summary

Sparse expert architectures replace dense feedforward computation with conditionally activated experts. A router selects which experts process each token, allowing total parameter count to grow much larger than active compute.

MoE systems improve scaling efficiency by separating model capacity from per-token FLOPs. They introduce new challenges including routing stability, load balancing, communication overhead, capacity limits, and distributed dispatch.

Modern sparse transformers combine dense shared attention with sparse expert feedforward layers. This architecture has become an important approach for scaling very large foundation models efficiently.