Dense transformers activate every parameter for every token. As models become larger, this approach becomes increasingly expensive. A trillion-parameter dense model would require enormous compute for every forward pass, even if only part of the model is needed for a given token.
Sparse expert architectures address this problem by activating only a subset of parameters for each token. The most common form is the Mixture-of-Experts transformer, usually abbreviated as MoE.
In an MoE transformer, most transformer layers remain shared, but some feedforward layers are replaced by collections of specialized subnetworks called experts. A routing mechanism selects which experts process each token.
This allows total parameter count to grow much larger than active compute per token.
Dense Versus Sparse Computation
In a standard transformer feedforward layer, every token is processed by the same network:

$$\mathrm{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2$$

Every token uses the same parameters $W_1$ and $W_2$.
In a sparse expert layer, we instead have multiple feedforward networks, the experts $E_1, E_2, \dots, E_N$.

A router chooses a subset of experts for each token.

If token $x$ uses experts $i$ and $j$, the computation becomes

$$y = g_i(x)\,E_i(x) + g_j(x)\,E_j(x),$$

where $g_i(x)$ and $g_j(x)$ are routing weights.
Only the selected experts are evaluated.
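A minimal sketch of the difference, assuming the router has already chosen experts 2 and 5 with weights 0.7 and 0.3 (all names and numbers here are illustrative):

```python
import torch
from torch import nn

d_model, d_ff, n_experts = 16, 64, 8

# Dense: a single feedforward network applied to every token.
dense_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

# Sparse: a pool of expert feedforward networks; each token uses only a few of them.
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
    for _ in range(n_experts)
)

x = torch.randn(d_model)                              # one token representation
y_dense = dense_ffn(x)                                # all FFN parameters are used

# Hypothetical routing decision: experts 2 and 5 with weights 0.7 and 0.3.
y_sparse = 0.7 * experts[2](x) + 0.3 * experts[5](x)  # only two experts are evaluated
```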
Why Sparse Experts Matter
Sparse experts separate total model capacity from active compute.
Compare:
| Model type | Parameters used per token |
|---|---|
| Dense transformer | All parameters |
| Sparse MoE | Small subset of parameters |
An MoE model may contain hundreds of billions of parameters while activating only a small fraction for each token.
This gives several advantages:
| Advantage | Explanation |
|---|---|
| Larger total capacity | More parameters overall |
| Lower active compute | Only selected experts run |
| Expert specialization | Different experts learn different patterns |
| Better scaling efficiency | More capacity per FLOP |
The key idea is conditional computation. Different inputs activate different pathways.
Structure of an MoE Layer
A standard MoE transformer layer contains:
- Shared attention layer.
- Router network.
- Multiple expert feedforward networks.
- Aggregation mechanism.
The attention layer is usually dense and shared across all tokens. The feedforward block becomes sparse.
Within the layer, a token first passes through the shared attention block, is then scored by the router, is dispatched to its selected experts, and the expert outputs are combined back into a single representation.

Each expert is usually a standard feedforward network:

$$E_i(x) = W_2^{(i)}\,\sigma\!\left(W_1^{(i)} x + b_1^{(i)}\right) + b_2^{(i)}$$
Router Networks
The router decides which experts process each token.
For token representation $x$, the router computes expert scores

$$s = W_r x,$$

where $W_r \in \mathbb{R}^{N \times d}$ is the router weight matrix and $N$ is the number of experts.

A softmax converts scores into routing probabilities:

$$p_i(x) = \frac{\exp(s_i)}{\sum_{j=1}^{N} \exp(s_j)}.$$

The router then selects the top-$k$ experts.

For top-1 routing, only the single highest-probability expert is chosen:

$$i^{*} = \arg\max_i \, p_i(x).$$

For top-2 routing, the two highest-probability experts are chosen.
The selected experts receive the token.
Top-k Routing
Most MoE systems use top-$k$ routing.
| Routing type | Experts per token |
|---|---|
| Top-1 | 1 |
| Top-2 | 2 |
| Top-4 | 4 |
Top-1 routing is efficient because each token activates only one expert. Top-2 routing improves stability and quality because tokens can combine multiple expert outputs.
Suppose token $x$ selects experts $i$ and $j$. Then

$$y = g_i\,E_i(x) + g_j\,E_j(x),$$

where the gate values are the routing probabilities renormalized over the selected experts:

$$g_i = \frac{p_i(x)}{p_i(x) + p_j(x)}, \qquad g_j = \frac{p_j(x)}{p_i(x) + p_j(x)}.$$
The router weights determine how strongly each expert contributes.
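A minimal sketch of top-2 routing and the weighted combination, assuming a linear router and a small pool of experts (all names and sizes are illustrative):

```python
import torch
from torch import nn

d_model, d_ff, n_experts, k = 16, 64, 8, 2
router = nn.Linear(d_model, n_experts)        # produces the scores s = W_r x
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
    for _ in range(n_experts)
)

x = torch.randn(4, d_model)                   # 4 token representations
probs = torch.softmax(router(x), dim=-1)      # routing probabilities p_i(x)
topk_probs, topk_idx = probs.topk(k, dim=-1)  # top-2 experts per token
gates = topk_probs / topk_probs.sum(dim=-1, keepdim=True)  # renormalized gate values

# y = g_i * E_i(x) + g_j * E_j(x), evaluated per token
y = torch.zeros_like(x)
for t in range(x.shape[0]):
    for slot in range(k):
        e = int(topk_idx[t, slot])
        y[t] = y[t] + gates[t, slot] * experts[e](x[t])
```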
Token Dispatch
Once routing decisions are made, tokens must be grouped by expert.
Suppose:
| Token | Assigned expert |
|---|---|
| $x_1$ | 2 |
| $x_2$ | 5 |
| $x_3$ | 2 |
| $x_4$ | 1 |

The system gathers tokens for each expert:

```
Expert 1: [x4]
Expert 2: [x1, x3]
Expert 5: [x2]
```

Each expert processes its assigned tokens independently.
After computation, outputs are scattered back to their original token positions.
This gather-scatter operation is one of the main systems challenges in MoE training.
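A minimal sketch of the gather-scatter pattern, assuming top-1 assignments are already known (the expert computation below is a stand-in):

```python
import torch

n_experts = 6
x = torch.randn(4, 16)                    # token representations x1..x4
assigned = torch.tensor([1, 4, 1, 0])     # expert index per token (0-based)

out = torch.zeros_like(x)
for e in range(n_experts):
    idx = (assigned == e).nonzero(as_tuple=True)[0]   # positions routed to expert e
    if idx.numel() == 0:
        continue                                      # this expert is idle for the batch
    gathered = x[idx]                                 # gather this expert's tokens
    processed = gathered * (e + 1)                    # stand-in for the expert's FFN
    out[idx] = processed                              # scatter back to original positions
```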
Expert Specialization
Experts often develop specialized behavior.
Different experts may focus on:
| Specialization | Example |
|---|---|
| Syntax | Grammar and structure |
| Mathematics | Numerical reasoning |
| Code | Programming tokens |
| Languages | Different natural languages |
| Retrieval | Citation or memory tokens |
| Vision regions | Different spatial patterns |
Specialization is not manually assigned. It emerges from routing and optimization dynamics.
However, specialization is imperfect. Experts may overlap substantially or collapse into similar behavior if routing is poorly balanced.
Load Balancing
A major problem in MoE systems is expert imbalance.
Suppose one expert receives most tokens:
```
Expert 1: 90% of tokens
Others:   very few tokens
```

Then:
| Problem | Consequence |
|---|---|
| Hot experts | Compute bottlenecks |
| Idle experts | Wasted parameters |
| Poor specialization | Reduced model diversity |
| Communication imbalance | Slower distributed training |
To avoid this, MoE systems use load-balancing losses.
A simplified balancing objective encourages:
- Similar routing probability across experts.
- Similar token counts across experts.
The total loss becomes

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda\,\mathcal{L}_{\text{balance}}.$$

Here $\lambda$ controls the balancing strength.
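A minimal sketch of one widely used form, the auxiliary loss from Switch Transformers, which multiplies each expert's dispatched-token fraction by its mean routing probability (the 0.01 coefficient below is illustrative):

```python
import torch

def load_balancing_loss(probs: torch.Tensor, top_expert: torch.Tensor) -> torch.Tensor:
    """Switch-style auxiliary balancing loss.

    probs:      [num_tokens, n_experts] routing probabilities
    top_expert: [num_tokens] expert index each token was dispatched to
    """
    n_experts = probs.shape[-1]
    # f_i: fraction of tokens dispatched to expert i.
    f = torch.bincount(top_expert, minlength=n_experts).float() / top_expert.numel()
    # P_i: mean routing probability assigned to expert i.
    P = probs.mean(dim=0)
    return n_experts * torch.sum(f * P)

# loss = task_loss + 0.01 * load_balancing_loss(probs, top_expert)
```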
Capacity Limits
Experts usually have a maximum token capacity per batch.
Suppose:
| Expert | Maximum capacity |
|---|---|
| Each expert | 128 tokens |
If too many tokens are routed to one expert, extra tokens may be:
| Strategy | Description |
|---|---|
| Dropped | Ignore excess tokens |
| Re-routed | Send to another expert |
| Buffered | Process later |
| Dynamically resized | Expand expert capacity |
Capacity limits prevent single experts from becoming overloaded.
The capacity factor controls allowed overflow:

$$\text{expert capacity} = \text{capacity factor} \times \frac{\text{tokens per batch}}{\text{number of experts}}.$$
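A minimal sketch of applying this limit, with dropping as the overflow strategy (the capacity factor value and function name are illustrative):

```python
import math
import torch

def apply_capacity(top_expert: torch.Tensor, n_experts: int, capacity_factor: float = 1.25):
    """Mark which tokens fit within their assigned expert's capacity."""
    num_tokens = top_expert.numel()
    capacity = math.ceil(capacity_factor * num_tokens / n_experts)

    keep = torch.zeros(num_tokens, dtype=torch.bool)
    counts = [0] * n_experts
    for t in range(num_tokens):            # tokens fill experts in order until capacity is reached
        e = int(top_expert[t])
        if counts[e] < capacity:
            counts[e] += 1
            keep[t] = True                 # overflow tokens stay False, i.e. are dropped
    return keep, capacity
```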
Distributed Expert Parallelism
MoE systems are naturally distributed.
Different experts can be placed on different accelerators:
```
GPU 1: Experts 1-4
GPU 2: Experts 5-8
GPU 3: Experts 9-12
```

Tokens are routed across devices.
This is called expert parallelism.
Compared with dense tensor parallelism, expert parallelism has different tradeoffs:
| Dense parallelism | Expert parallelism |
|---|---|
| Split matrix computation | Split experts |
| Every device active for every token | Devices active only for routed tokens |
| Regular communication | Sparse communication |
| Predictable load | Routing imbalance possible |
Communication becomes a central challenge. Tokens must move between devices efficiently.
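As a rough illustration of why routing imbalance translates into compute and communication hotspots, this sketch assigns experts to hypothetical devices and counts the tokens each device would receive (all numbers are illustrative):

```python
import torch

n_experts, n_devices = 12, 3
experts_per_device = n_experts // n_devices
device_of_expert = [e // experts_per_device for e in range(n_experts)]   # experts 0-3 -> device 0, ...

top_expert = torch.randint(0, n_experts, (1024,))    # routing decisions for 1024 tokens

tokens_per_device = [0] * n_devices
for e in range(n_experts):
    tokens_per_device[device_of_expert[e]] += int((top_expert == e).sum())

print(tokens_per_device)   # uneven counts mean some devices do more work and receive more traffic
```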
Sparse Scaling Laws
MoE models often achieve better quality for a given active compute budget.
A sparse model may have:
| Metric | Value |
|---|---|
| Total parameters | 1 trillion |
| Active parameters per token | 50 billion |
The model behaves like a very large network in terms of capacity, but like a smaller network in terms of per-token compute.
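For the numbers above, only $50/1000 = 5\%$ of the parameters are active for any given token.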
This changes the scaling tradeoff:
| Dense scaling | Sparse scaling |
|---|---|
| Capacity tied to compute | Capacity partially decoupled |
| More parameters always cost more FLOPs | Extra inactive experts are cheap |
| Compute grows with total size | Compute grows with active experts |
Sparse scaling is therefore attractive when parameter memory is cheaper than compute.
Routing Instability
Routers can become unstable during training.
Common issues include:
| Problem | Description |
|---|---|
| Expert collapse | Few experts dominate |
| Routing oscillation | Tokens rapidly switch experts |
| Dead experts | Some experts unused |
| Noisy specialization | Experts fail to stabilize |
| Overconfident routing | Router entropy collapses |
Several stabilization techniques are common:
| Technique | Purpose |
|---|---|
| Auxiliary balancing loss | Spread tokens evenly |
| Noisy routing | Encourage exploration |
| Temperature scaling | Smooth routing probabilities |
| Capacity constraints | Prevent overload |
| Top-2 routing | Improve robustness |
Stable routing is essential for good expert utilization.
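A minimal sketch of two of these stabilizers, noisy routing and temperature scaling, applied to the router logits (the noise scale and temperature values are illustrative):

```python
import torch

def route(scores: torch.Tensor, temperature: float = 1.0,
          noise_std: float = 0.0, training: bool = True) -> torch.Tensor:
    """Turn router logits into routing probabilities with optional stabilizers."""
    if training and noise_std > 0:
        # Noisy routing: jitter the logits so the router keeps exploring experts.
        scores = scores + noise_std * torch.randn_like(scores)
    # Temperature scaling: temperature > 1 smooths the probabilities and
    # counteracts overconfident, low-entropy routing.
    return torch.softmax(scores / temperature, dim=-1)

scores = torch.randn(4, 8)                       # logits for 4 tokens over 8 experts
probs = route(scores, temperature=1.5, noise_std=0.3)
```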
Switch Transformers
Switch Transformers simplified MoE by using top-1 routing.

Instead of combining multiple experts, each token selects only one expert:

$$y = p_{i^{*}}(x)\,E_{i^{*}}(x), \qquad i^{*} = \arg\max_i\, p_i(x).$$
This reduces communication and computation.
Advantages:
| Benefit | Explanation |
|---|---|
| Simpler routing | One expert per token |
| Lower communication | Fewer transfers |
| Lower memory | Fewer active expert outputs |
| Faster training | Less aggregation overhead |
The tradeoff is lower routing flexibility.
Switch-style routing showed that very large sparse models could scale efficiently with simpler systems design.
Shared and Specialized Layers
Most MoE transformers are hybrid architectures.
| Component | Usually dense or sparse |
|---|---|
| Attention layers | Dense |
| Embeddings | Dense |
| Output head | Dense |
| Feedforward layers | Sparse experts |
Attention layers remain shared because every token must exchange information globally. Sparse experts mainly replace the computationally expensive feedforward blocks.
This hybrid design balances communication cost and specialization.
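A sketch of this hybrid layout, assuming a pre-norm block and a generic sparse feedforward module passed in (for example, the SimpleMoE layer defined later in this section); the class and parameter names are illustrative:

```python
import torch
from torch import nn

class HybridBlock(nn.Module):
    """Transformer block with dense shared attention and a sparse expert feedforward."""

    def __init__(self, d_model: int, n_heads: int, sparse_ffn: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # dense, shared
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = sparse_ffn                                                   # MoE feedforward

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)          # every token exchanges information globally
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))           # only the routed experts run for each token
        return x
```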
MoE During Inference
Inference introduces additional challenges.
The model must:
- Run the router.
- Dispatch tokens to experts.
- Gather outputs.
- Manage KV cache.
- Coordinate devices.
Latency can increase if routing creates communication bottlenecks.
Inference optimization techniques include:
| Technique | Purpose |
|---|---|
| Expert caching | Reuse loaded expert weights |
| Expert placement optimization | Reduce communication |
| Token batching | Improve utilization |
| Routing locality | Keep tokens near experts |
| Quantized experts | Reduce memory bandwidth |
Sparse inference efficiency depends heavily on systems engineering.
Sparse Experts Beyond Language
MoE ideas are also used in:
| Domain | Application |
|---|---|
| Vision | Specialized image experts |
| Multimodal systems | Experts per modality |
| Speech | Acoustic specialization |
| Robotics | Task-conditioned policies |
| Retrieval systems | Memory-aware routing |
The general principle is conditional computation: activate only the parts of the model needed for the current input.
MoE Versus Dense Models
| Property | Dense transformer | Sparse MoE transformer |
|---|---|---|
| Active parameters | All | Small subset |
| Compute scaling | Grows with total size | Grows with active experts |
| Communication | Simpler | More complex |
| Parameter efficiency | Lower | Higher |
| Systems complexity | Lower | Higher |
| Specialization | Shared representation | Expert specialization |
Dense models are simpler and often more stable. Sparse models offer better scaling efficiency at very large sizes.
A Minimal MoE Layer in PyTorch
A simplified educational MoE layer:
```python
import torch
from torch import nn


class Expert(nn.Module):
    """A standard two-layer feedforward network."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)


class SimpleMoE(nn.Module):
    """Top-1 (Switch-style) mixture-of-experts layer."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            Expert(d_model, d_ff)
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: [B, T, D]
        B, T, D = x.shape
        scores = self.router(x)                  # [B, T, n_experts]
        probs = torch.softmax(scores, dim=-1)
        gate, top_expert = probs.max(dim=-1)     # top-1 gate value and expert index per token
        out = torch.zeros_like(x)
        for expert_id, expert in enumerate(self.experts):
            mask = (top_expert == expert_id)     # tokens routed to this expert
            if mask.any():
                selected = x[mask]               # gather this expert's tokens
                result = expert(selected)
                # Scale by the gate value so the router receives gradients.
                out[mask] = gate[mask].unsqueeze(-1) * result
        return out
```

This implementation is intentionally simple and inefficient. Real MoE systems use optimized grouped dispatch kernels and distributed expert execution.
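A quick usage check of the layer above (the sizes are illustrative):

```python
moe = SimpleMoE(d_model=512, d_ff=2048, n_experts=8)
x = torch.randn(2, 16, 512)    # [batch, tokens, d_model]
y = moe(x)
print(y.shape)                 # torch.Size([2, 16, 512])
```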
Summary
Sparse expert architectures replace dense feedforward computation with conditionally activated experts. A router selects which experts process each token, allowing total parameter count to grow much larger than active compute.
MoE systems improve scaling efficiency by separating model capacity from per-token FLOPs. They introduce new challenges including routing stability, load balancing, communication overhead, capacity limits, and distributed dispatch.
Modern sparse transformers combine dense shared attention with sparse expert feedforward layers. This architecture has become an important approach for scaling very large foundation models efficiently.