# Sparse Expert Architectures

Dense transformers activate every parameter for every token. As models become larger, this approach becomes increasingly expensive. A trillion-parameter dense model would require enormous compute for every forward pass, even if only part of the model is needed for a given token.

Sparse expert architectures address this problem by activating only a subset of parameters for each token. The most common form is the Mixture-of-Experts transformer, usually abbreviated as MoE.

In an MoE transformer, most transformer layers remain shared, but some feedforward layers are replaced by collections of specialized subnetworks called experts. A routing mechanism selects which experts process each token.

This decouples the two: the total parameter count can grow far beyond what the per-token compute budget would permit in a dense model.

### Dense Versus Sparse Computation

In a standard transformer feedforward layer:

$$
\text{FFN}(x)=W_2\phi(W_1x+b_1)+b_2.
$$

Every token uses the same parameters.

In a sparse expert layer, we instead have multiple feedforward networks:

$$
E_1,E_2,\ldots,E_N.
$$

A router chooses a subset of experts for each token.

If token $x_t$ uses experts $i$ and $j$, the computation becomes

$$
y_t = g_iE_i(x_t)+g_jE_j(x_t),
$$

where $g_i$ and $g_j$ are routing weights.

Only the selected experts are evaluated.
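As a toy sketch of the gated combination above (the tiny linear experts and fixed gate values here are illustrative, not prescribed by the text):

```python
import torch
from torch import nn

torch.manual_seed(0)
d = 4

# Two hypothetical experts; real experts are full feedforward networks.
E_i = nn.Linear(d, d)
E_j = nn.Linear(d, d)

x_t = torch.randn(d)      # one token representation
g_i, g_j = 0.7, 0.3       # routing weights produced by the router

# y_t = g_i * E_i(x_t) + g_j * E_j(x_t); the other N - 2 experts are never run.
y_t = g_i * E_i(x_t) + g_j * E_j(x_t)
print(y_t.shape)  # torch.Size([4])
```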

### Why Sparse Experts Matter

Sparse experts separate total model capacity from active compute.

Suppose:

| Model type | Parameters used per token |
|---|---:|
| Dense transformer | All parameters |
| Sparse MoE | Small subset of parameters |

An MoE model may contain hundreds of billions of parameters while activating only a small fraction for each token.

This gives several advantages:

| Advantage | Explanation |
|---|---|
| Larger total capacity | More parameters overall |
| Lower active compute | Only selected experts run |
| Expert specialization | Different experts learn different patterns |
| Better scaling efficiency | More capacity per FLOP |

The key idea is conditional computation. Different inputs activate different pathways.

### Structure of an MoE Layer

A standard MoE transformer layer contains:

1. Shared attention layer.
2. Router network.
3. Multiple expert feedforward networks.
4. Aggregation mechanism.

The attention layer is usually dense and shared across all tokens. The feedforward block becomes sparse.

The structure is:

$$
x
\rightarrow
\text{Attention}
\rightarrow
\text{Router}
\rightarrow
\text{Selected Experts}
\rightarrow
\text{Aggregation}.
$$

Each expert is usually a standard feedforward network:

$$
E_i(x)=W_{2,i}\phi(W_{1,i}x+b_{1,i})+b_{2,i}.
$$

### Router Networks

The router decides which experts process each token.

For token representation $x_t$, the router computes expert scores:

$$
r_t = W_rx_t,
$$

where

$$
r_t\in\mathbb{R}^{N},
$$

and $N$ is the number of experts.

A softmax converts scores into routing probabilities:

$$
p_t = \text{softmax}(r_t).
$$

The router then selects the top $k$ experts.

For top-1 routing:

$$
i^* = \arg\max_i p_{t,i}.
$$

For top-2 routing:

$$
i_1,i_2 = \text{TopK}(p_t,2).
$$

The selected experts receive the token.
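The score, softmax, and selection steps above can be sketched for a single token (dimensions are illustrative):

```python
import torch

torch.manual_seed(0)
d_model, n_experts = 8, 4

W_r = torch.randn(n_experts, d_model)  # router weight matrix
x_t = torch.randn(d_model)             # one token representation

r_t = W_r @ x_t                        # expert scores, shape [n_experts]
p_t = torch.softmax(r_t, dim=-1)       # routing probabilities, sum to 1

top1 = p_t.argmax()                           # top-1 routing
top2_vals, top2_idx = torch.topk(p_t, k=2)    # top-2 routing
```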

### Top-k Routing

Most MoE systems use top-$k$ routing.

| Routing type | Experts per token |
|---|---:|
| Top-1 | 1 |
| Top-2 | 2 |
| Top-4 | 4 |

Top-1 routing is efficient because each token activates only one expert. Top-2 routing improves stability and quality because tokens can combine multiple expert outputs.

Suppose token $x_t$ selects experts $i$ and $j$. Then:

$$
y_t =
g_iE_i(x_t)
+
g_jE_j(x_t),
$$

where

$$
g_i + g_j = 1.
$$

The router weights determine how strongly each expert contributes.
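One common convention (a sketch; the text above does not fix the details) is to take the two selected softmax probabilities and renormalize them so the gates sum to one:

```python
import torch

p_t = torch.tensor([0.50, 0.30, 0.15, 0.05])  # routing probabilities

top2_vals, top2_idx = torch.topk(p_t, k=2)

# Renormalize the selected probabilities so that g_i + g_j = 1.
gates = top2_vals / top2_vals.sum()
print(gates)  # tensor([0.6250, 0.3750])
```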

### Token Dispatch

Once routing decisions are made, tokens must be grouped by expert.

Suppose:

| Token | Assigned expert |
|---|---|
| $x_1$ | $E_2$ |
| $x_2$ | $E_5$ |
| $x_3$ | $E_2$ |
| $x_4$ | $E_1$ |

The system gathers tokens for each expert:

```text id="c4n8a5"
Expert 1: [x4]
Expert 2: [x1, x3]
Expert 5: [x2]
```

Each expert processes its assigned tokens independently.

After computation, outputs are scattered back to their original token positions.

This gather-scatter operation is one of the main systems challenges in MoE training.
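A minimal sketch of the gather-scatter pattern (using sorting to make each expert's tokens contiguous; real systems use fused dispatch kernels):

```python
import torch

torch.manual_seed(0)
n_tokens, d = 6, 4

x = torch.randn(n_tokens, d)
assigned = torch.tensor([1, 2, 1, 0, 2, 0])  # expert id per token

# Gather: sort tokens so each expert's tokens form a contiguous slice.
order = torch.argsort(assigned)
grouped = x[order]

# ... each expert would process its contiguous slice here ...

# Scatter: invert the permutation to restore original token positions.
inverse = torch.empty_like(order)
inverse[order] = torch.arange(n_tokens)
restored = grouped[inverse]
```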

### Expert Specialization

Experts often develop specialized behavior.

Different experts may focus on:

| Specialization | Example |
|---|---|
| Syntax | Grammar and structure |
| Mathematics | Numerical reasoning |
| Code | Programming tokens |
| Languages | Different natural languages |
| Retrieval | Citation or memory tokens |
| Vision regions | Different spatial patterns |

Specialization is not manually assigned. It emerges from routing and optimization dynamics.

However, specialization is imperfect. Experts may overlap substantially or collapse into similar behavior if routing is poorly balanced.

### Load Balancing

A major problem in MoE systems is expert imbalance.

Suppose one expert receives most tokens:

```text id="x8br4e"
Expert 1: 90% of tokens
Others: very few tokens
```

Then:

| Problem | Consequence |
|---|---|
| Hot experts | Compute bottlenecks |
| Idle experts | Wasted parameters |
| Poor specialization | Reduced model diversity |
| Communication imbalance | Slower distributed training |

To avoid this, MoE systems use load-balancing losses.

A simplified balancing objective encourages:

1. Similar routing probability across experts.
2. Similar token counts across experts.

The total loss becomes

$$
L = L_{\text{task}} + \lambda L_{\text{balance}}.
$$

Here $\lambda$ controls the balancing strength.
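One widely used form of $L_{\text{balance}}$ (following the Switch Transformer formulation, which the text above does not spell out) multiplies, per expert, the fraction of tokens dispatched to it by its mean routing probability:

```python
import torch

torch.manual_seed(0)
n_tokens, n_experts = 16, 4

probs = torch.softmax(torch.randn(n_tokens, n_experts), dim=-1)
assigned = probs.argmax(dim=-1)  # top-1 assignments

# f_i: fraction of tokens dispatched to expert i.
f = torch.bincount(assigned, minlength=n_experts).float() / n_tokens
# P_i: mean routing probability assigned to expert i.
P = probs.mean(dim=0)

# N * sum_i f_i * P_i; equals 1.0 under a perfectly uniform distribution.
L_balance = n_experts * torch.sum(f * P)
```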

### Capacity Limits

Experts usually have a maximum token capacity per batch.

Suppose:

| Expert | Maximum capacity |
|---|---:|
| $E_i$ | 128 tokens |

If too many tokens are routed to one expert, extra tokens may be:

| Strategy | Description |
|---|---|
| Dropped | Ignore excess tokens |
| Re-routed | Send to another expert |
| Buffered | Process later |
| Dynamically resized | Expand expert capacity |

Capacity limits prevent single experts from becoming overloaded.

The capacity factor controls allowed overflow:

$$
\text{capacity} =
\text{capacity factor}
\times
\frac{\text{tokens}}{\text{experts}}.
$$
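A worked example of the capacity formula (the batch sizes are illustrative; rounding up to an integer is an assumption):

```python
import math

def expert_capacity(n_tokens: int, n_experts: int, capacity_factor: float) -> int:
    # capacity = capacity_factor * (tokens / experts), rounded up to an integer.
    return math.ceil(capacity_factor * n_tokens / n_experts)

# Hypothetical batch: 1024 tokens routed across 8 experts.
print(expert_capacity(1024, 8, 1.0))   # 128
print(expert_capacity(1024, 8, 1.25))  # 160
```

A capacity factor above 1.0 leaves headroom for mild routing imbalance before tokens overflow.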

### Distributed Expert Parallelism

MoE systems are naturally distributed.

Different experts can be placed on different accelerators:

```text id="t3r0zb"
GPU 1: Experts 1-4
GPU 2: Experts 5-8
GPU 3: Experts 9-12
```

Tokens are routed across devices.

This is called expert parallelism.

Compared with dense tensor parallelism, expert parallelism has different tradeoffs:

| Dense parallelism | Expert parallelism |
|---|---|
| Split matrix computation | Split experts |
| Every device active for every token | Devices active only for routed tokens |
| Regular communication | Sparse communication |
| Predictable load | Routing imbalance possible |

Communication becomes a central challenge. Tokens must move between devices efficiently.

### Sparse Scaling Laws

MoE models often achieve better quality for a given active compute budget.

A sparse model may have:

| Metric | Value |
|---|---:|
| Total parameters | 1 trillion |
| Active parameters per token | 50 billion |

The model behaves like a very large network in terms of capacity, but like a smaller network in terms of per-token compute.

This changes the scaling tradeoff:

| Dense scaling | Sparse scaling |
|---|---|
| Capacity tied to compute | Capacity partially decoupled |
| More parameters always cost more FLOPs | Extra inactive experts are cheap |
| Compute grows with total size | Compute grows with active experts |

Sparse scaling is therefore attractive when parameter memory is cheaper than compute.
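The table above implies a small active fraction; as simple arithmetic:

```python
total_params = 1_000_000_000_000   # 1 trillion total parameters
active_params = 50_000_000_000     # 50 billion active per token

# Only 5% of the model's parameters participate in any one forward pass.
active_fraction = active_params / total_params
print(active_fraction)  # 0.05
```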

### Routing Instability

Routers can become unstable during training.

Common issues include:

| Problem | Description |
|---|---|
| Expert collapse | Few experts dominate |
| Routing oscillation | Tokens rapidly switch experts |
| Dead experts | Some experts unused |
| Noisy specialization | Experts fail to stabilize |
| Overconfident routing | Router entropy collapses |

Several stabilization techniques are common:

| Technique | Purpose |
|---|---|
| Auxiliary balancing loss | Spread tokens evenly |
| Noisy routing | Encourage exploration |
| Temperature scaling | Smooth routing probabilities |
| Capacity constraints | Prevent overload |
| Top-2 routing | Improve robustness |

Stable routing is essential for good expert utilization.
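Two of the stabilization techniques in the table, noisy routing and temperature scaling, can be sketched as follows (the noise scale and temperature are illustrative hyperparameters):

```python
import torch

torch.manual_seed(0)
n_tokens, n_experts = 4, 8

logits = torch.randn(n_tokens, n_experts)  # raw router scores

# Noisy routing: perturb logits during training to encourage exploration.
noisy_logits = logits + 0.1 * torch.randn_like(logits)

# Temperature scaling: temperatures above 1 smooth the routing distribution.
temperature = 2.0
probs = torch.softmax(noisy_logits / temperature, dim=-1)
```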

### Switch Transformers

Switch Transformers simplified MoE routing using top-1 routing.

Instead of combining multiple experts, each token is processed by a single expert, with the output scaled by that expert's routing weight:

$$
y_t = g_{i^*}E_{i^*}(x_t), \qquad i^* = \arg\max_i p_{t,i}.
$$

This reduces communication and computation.

Advantages:

| Benefit | Explanation |
|---|---|
| Simpler routing | One expert per token |
| Lower communication | Fewer transfers |
| Lower memory | Fewer active expert outputs |
| Faster training | Less aggregation overhead |

The tradeoff is lower routing flexibility.

Switch-style routing showed that very large sparse models could scale efficiently with simpler systems design.

### Shared and Specialized Layers

Most MoE transformers are hybrid architectures.

| Component | Usually dense or sparse |
|---|---|
| Attention layers | Dense |
| Embeddings | Dense |
| Output head | Dense |
| Feedforward layers | Sparse experts |

Attention layers remain shared because every token must exchange information globally. Sparse experts mainly replace the computationally expensive feedforward blocks.

This hybrid design balances communication cost and specialization.

### MoE During Inference

Inference introduces additional challenges.

The model must:

1. Run the router.
2. Dispatch tokens to experts.
3. Gather outputs.
4. Manage KV cache.
5. Coordinate devices.

Latency can increase if routing creates communication bottlenecks.

Inference optimization techniques include:

| Technique | Purpose |
|---|---|
| Expert caching | Reuse loaded expert weights |
| Expert placement optimization | Reduce communication |
| Token batching | Improve utilization |
| Routing locality | Keep tokens near experts |
| Quantized experts | Reduce memory bandwidth |

Sparse inference efficiency depends heavily on systems engineering.

### Sparse Experts Beyond Language

MoE ideas are also used in:

| Domain | Application |
|---|---|
| Vision | Specialized image experts |
| Multimodal systems | Experts per modality |
| Speech | Acoustic specialization |
| Robotics | Task-conditioned policies |
| Retrieval systems | Memory-aware routing |

The general principle is conditional computation: activate only the parts of the model needed for the current input.

### MoE Versus Dense Models

| Property | Dense transformer | Sparse MoE transformer |
|---|---|---|
| Active parameters | All | Small subset |
| Compute scaling | Grows with total size | Grows with active experts |
| Communication | Simpler | More complex |
| Parameter efficiency | Lower | Higher |
| Systems complexity | Lower | Higher |
| Specialization | Shared representation | Expert specialization |

Dense models are simpler and often more stable. Sparse models offer better scaling efficiency at very large sizes.

### A Minimal MoE Layer in PyTorch

A simplified educational MoE layer:

```python id="2vf9wy"
import torch
from torch import nn

class Expert(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)

class SimpleMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()

        self.router = nn.Linear(d_model, n_experts)

        self.experts = nn.ModuleList([
            Expert(d_model, d_ff)
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: [B, T, D]

        B, T, D = x.shape

        scores = self.router(x)                # [B, T, n_experts]
        probs = torch.softmax(scores, dim=-1)  # routing probabilities

        # Top-1 routing: each token goes to its highest-probability expert.
        gate, top_expert = probs.max(dim=-1)   # both [B, T]

        out = torch.zeros_like(x)

        for expert_id, expert in enumerate(self.experts):
            mask = (top_expert == expert_id)

            if mask.any():
                selected = x[mask]             # gather this expert's tokens
                result = expert(selected)
                # Scale by the gate value so routing stays differentiable.
                out[mask] = result * gate[mask].unsqueeze(-1)

        return out
```

This implementation is intentionally simple and inefficient. Real MoE systems use optimized grouped dispatch kernels and distributed expert execution.

### Summary

Sparse expert architectures replace dense feedforward computation with conditionally activated experts. A router selects which experts process each token, allowing the total parameter count to grow far beyond what the per-token compute budget activates.

MoE systems improve scaling efficiency by separating model capacity from per-token FLOPs. They introduce new challenges including routing stability, load balancing, communication overhead, capacity limits, and distributed dispatch.

Modern sparse transformers combine dense shared attention with sparse expert feedforward layers. This architecture has become an important approach for scaling very large foundation models efficiently.

