
Sparse Expert Architectures

Dense transformers activate every parameter for every token. As models become larger, this approach becomes increasingly expensive. A trillion-parameter dense model would require enormous compute for every forward pass, even if only part of the model is needed for a given token.

Sparse expert architectures address this problem by activating only a subset of parameters for each token. The most common form is the Mixture-of-Experts transformer, usually abbreviated as MoE.

In an MoE transformer, most transformer layers remain shared, but some feedforward layers are replaced by collections of specialized subnetworks called experts. A routing mechanism selects which experts process each token.

This allows total parameter count to grow much larger than active compute per token.

Dense Versus Sparse Computation

In a standard transformer feedforward layer:

$$\text{FFN}(x) = W_2\,\phi(W_1 x + b_1) + b_2.$$

Every token uses the same parameters.

In a sparse expert layer, we instead have multiple feedforward networks:

$$E_1, E_2, \ldots, E_N.$$

A router chooses a subset of experts for each token.

If token $x_t$ uses experts $i$ and $j$, the computation becomes

$$y_t = g_i E_i(x_t) + g_j E_j(x_t),$$

where $g_i$ and $g_j$ are routing weights.

Only the selected experts are evaluated.
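
As a rough sketch of this per-token computation (the sizes, selected experts, and routing weights below are illustrative rather than produced by a real router):

```python
import torch
from torch import nn

# Illustrative sizes; in a real model these come from the architecture config.
d_model, d_ff, n_experts = 16, 64, 4

# Each expert is an ordinary two-layer feedforward network.
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
    for _ in range(n_experts)
])

x_t = torch.randn(d_model)   # one token representation
i, j = 1, 3                  # experts picked by the router (hard-coded here)
g_i, g_j = 0.7, 0.3          # routing weights, g_i + g_j = 1

# Only the two selected experts run; the remaining experts are skipped entirely.
y_t = g_i * experts[i](x_t) + g_j * experts[j](x_t)
```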

Why Sparse Experts Matter

Sparse experts separate total model capacity from active compute.

Suppose:

| Model type        | Parameters used per token  |
|-------------------|----------------------------|
| Dense transformer | All parameters             |
| Sparse MoE        | Small subset of parameters |

An MoE model may contain hundreds of billions of parameters while activating only a small fraction for each token.

This gives several advantages:

| Advantage                 | Explanation                                |
|---------------------------|--------------------------------------------|
| Larger total capacity     | More parameters overall                    |
| Lower active compute      | Only selected experts run                  |
| Expert specialization     | Different experts learn different patterns |
| Better scaling efficiency | More capacity per FLOP                     |

The key idea is conditional computation. Different inputs activate different pathways.

Structure of an MoE Layer

A standard MoE transformer layer contains:

  1. Shared attention layer.
  2. Router network.
  3. Multiple expert feedforward networks.
  4. Aggregation mechanism.

The attention layer is usually dense and shared across all tokens. The feedforward block becomes sparse.

The structure is:

$$x \rightarrow \text{Attention} \rightarrow \text{Router} \rightarrow \text{Selected Experts} \rightarrow \text{Aggregation}.$$

Each expert is usually a standard feedforward network:

$$E_i(x) = W_{2,i}\,\phi(W_{1,i} x + b_{1,i}) + b_{2,i}.$$

Router Networks

The router decides which experts process each token.

For token representation $x_t$, the router computes expert scores:

$$r_t = W_r x_t,$$

where

$$r_t \in \mathbb{R}^{N},$$

and $N$ is the number of experts.

A softmax converts scores into routing probabilities:

$$p_t = \text{softmax}(r_t).$$

The router then selects the top $k$ experts.

For top-1 routing:

$$i^* = \arg\max_i \, p_{t,i}.$$

For top-2 routing:

$$i_1, i_2 = \text{TopK}(p_t, 2).$$

The selected experts receive the token.
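
A minimal sketch of these routing steps for a batch of tokens, using a learned linear projection as the router; the sizes are illustrative:

```python
import torch
from torch import nn

n_tokens, d_model, n_experts = 8, 16, 4

router = nn.Linear(d_model, n_experts)       # learned router projection
x = torch.randn(n_tokens, d_model)           # token representations x_t

scores = router(x)                           # expert scores r_t, shape [n_tokens, n_experts]
probs = torch.softmax(scores, dim=-1)        # routing probabilities p_t

top1 = probs.argmax(dim=-1)                  # top-1 routing: i* for each token
top2_probs, top2_idx = probs.topk(2, dim=-1) # top-2 routing: (i_1, i_2) for each token

# In top-k systems the selected probabilities are usually renormalized
# so that the routing weights for each token sum to 1.
gates = top2_probs / top2_probs.sum(dim=-1, keepdim=True)
```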

Top-k Routing

Most MoE systems use top-$k$ routing.

| Routing type | Experts per token |
|--------------|-------------------|
| Top-1        | 1                 |
| Top-2        | 2                 |
| Top-4        | 4                 |

Top-1 routing is efficient because each token activates only one expert. Top-2 routing improves stability and quality because tokens can combine multiple expert outputs.

Suppose token $x_t$ selects experts $i$ and $j$. Then:

$$y_t = g_i E_i(x_t) + g_j E_j(x_t),$$

where

$$g_i + g_j = 1.$$

The router weights determine how strongly each expert contributes.

Token Dispatch

Once routing decisions are made, tokens must be grouped by expert.

Suppose:

| Token   | Assigned expert |
|---------|-----------------|
| $x_1$   | $E_2$           |
| $x_2$   | $E_5$           |
| $x_3$   | $E_2$           |
| $x_4$   | $E_1$           |

The system gathers tokens for each expert:

Expert 1: [x4]
Expert 2: [x1, x3]
Expert 5: [x2]

Each expert processes its assigned tokens independently.

After computation, outputs are scattered back to their original token positions.

This gather-scatter operation is one of the main systems challenges in MoE training.
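
A naive version of this dispatch, using boolean masks instead of optimized gather-scatter kernels; single linear layers stand in for full expert networks, and the assignments are hard-coded for illustration:

```python
import torch
from torch import nn

d_model, n_experts, n_tokens = 16, 4, 6
experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_experts)])

x = torch.randn(n_tokens, d_model)            # token representations
assigned = torch.tensor([1, 3, 1, 0, 2, 3])   # expert id per token (would come from the router)

out = torch.zeros_like(x)
for expert_id, expert in enumerate(experts):
    # Gather: positions of all tokens routed to this expert.
    idx = (assigned == expert_id).nonzero(as_tuple=True)[0]
    if idx.numel() > 0:
        # Each expert processes its group of tokens independently,
        # and results are scattered back to the original positions.
        out[idx] = expert(x[idx])
```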

Expert Specialization

Experts often develop specialized behavior.

Different experts may focus on:

| Specialization | Example                     |
|----------------|-----------------------------|
| Syntax         | Grammar and structure       |
| Mathematics    | Numerical reasoning         |
| Code           | Programming tokens          |
| Languages      | Different natural languages |
| Retrieval      | Citation or memory tokens   |
| Vision regions | Different spatial patterns  |

Specialization is not manually assigned. It emerges from routing and optimization dynamics.

However, specialization is imperfect. Experts may overlap substantially or collapse into similar behavior if routing is poorly balanced.

Load Balancing

A major problem in MoE systems is expert imbalance.

Suppose one expert receives most tokens:

Expert 1: 90% of tokens
Others: very few tokens

Then:

| Problem                 | Consequence                 |
|-------------------------|-----------------------------|
| Hot experts             | Compute bottlenecks         |
| Idle experts            | Wasted parameters           |
| Poor specialization     | Reduced model diversity     |
| Communication imbalance | Slower distributed training |

To avoid this, MoE systems use load-balancing losses.

A simplified balancing objective encourages:

  1. Similar routing probability across experts.
  2. Similar token counts across experts.

The total loss becomes

$$L = L_{\text{task}} + \lambda L_{\text{balance}}.$$

Here $\lambda$ controls the balancing strength.
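
One common form of the balancing term, in the spirit of the Switch Transformer auxiliary loss, multiplies each expert's fraction of dispatched tokens by its mean routing probability. A sketch:

```python
import torch

def load_balance_loss(probs: torch.Tensor, top1: torch.Tensor) -> torch.Tensor:
    """Auxiliary balancing loss sketch (Switch-Transformer style).

    probs: [n_tokens, n_experts] routing probabilities.
    top1:  [n_tokens] expert index each token was dispatched to.
    """
    n_experts = probs.shape[-1]
    # Fraction of tokens dispatched to each expert.
    tokens_per_expert = torch.bincount(top1, minlength=n_experts).float() / top1.numel()
    # Mean routing probability assigned to each expert.
    prob_per_expert = probs.mean(dim=0)
    # Small when both quantities are close to uniform (1 / n_experts each).
    return n_experts * torch.sum(tokens_per_expert * prob_per_expert)

# Total loss: L = L_task + lambda_balance * load_balance_loss(probs, top1)
```

The term is minimized when token counts and routing probabilities are both spread evenly across experts, which is exactly the balanced state the loss encourages.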

Capacity Limits

Experts usually have a maximum token capacity per batch.

Suppose:

| Expert | Maximum capacity |
|--------|------------------|
| $E_i$  | 128 tokens       |

If too many tokens are routed to one expert, extra tokens may be:

| Strategy            | Description            |
|---------------------|------------------------|
| Dropped             | Ignore excess tokens   |
| Re-routed           | Send to another expert |
| Buffered            | Process later          |
| Dynamically resized | Expand expert capacity |

Capacity limits prevent single experts from becoming overloaded.

The capacity factor controls allowed overflow:

$$\text{capacity} = \text{capacity factor} \times \frac{\text{tokens}}{\text{experts}}.$$
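
A small sketch of this capacity computation, together with the simplest overflow strategy (dropping excess tokens); the sizes and capacity factor are illustrative:

```python
import math
import torch

def expert_capacity(n_tokens: int, n_experts: int, capacity_factor: float) -> int:
    # capacity = capacity factor * tokens / experts, rounded up to an integer.
    return math.ceil(capacity_factor * n_tokens / n_experts)

cap = expert_capacity(n_tokens=1024, n_experts=8, capacity_factor=1.25)  # 160 tokens per expert

# Simplest overflow strategy: drop tokens beyond an expert's capacity.
assigned = torch.randint(0, 8, (1024,))            # expert id per token (random here)
idx = (assigned == 3).nonzero(as_tuple=True)[0]    # tokens routed to expert 3
kept, dropped = idx[:cap], idx[cap:]               # excess tokens are simply not processed
```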

Distributed Expert Parallelism

MoE systems are naturally distributed.

Different experts can be placed on different accelerators:

GPU 1: Experts 1-4
GPU 2: Experts 5-8
GPU 3: Experts 9-12

Tokens are routed across devices.

This is called expert parallelism.

Compared with dense tensor parallelism, expert parallelism has different tradeoffs:

| Dense parallelism                    | Expert parallelism                     |
|--------------------------------------|----------------------------------------|
| Split matrix computation             | Split experts                          |
| Every device active for every token  | Devices active only for routed tokens  |
| Regular communication                | Sparse communication                   |
| Predictable load                     | Routing imbalance possible             |

Communication becomes a central challenge. Tokens must move between devices efficiently.

Sparse Scaling Laws

MoE models often achieve better quality for a given active compute budget.

A sparse model may have:

| Metric                      | Value      |
|-----------------------------|------------|
| Total parameters            | 1 trillion |
| Active parameters per token | 50 billion |

The model behaves like a very large network in terms of capacity, but like a smaller network in terms of per-token compute.
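
A back-of-the-envelope sketch of this decoupling, with hypothetical layer sizes and top-2 routing (the numbers are illustrative and not taken from any particular model); only the expert feedforward weights are counted:

```python
# Hypothetical MoE configuration: 48 MoE layers, 64 experts each, top-2 routing.
d_model, d_ff, n_experts, top_k, n_layers = 8192, 32768, 64, 2, 48

params_per_expert = 2 * d_model * d_ff                           # two weight matrices, biases ignored
total_expert_params = n_layers * n_experts * params_per_expert   # parameters that exist
active_expert_params = n_layers * top_k * params_per_expert      # parameters used per token

print(f"total expert parameters: {total_expert_params / 1e12:.2f}T")
print(f"active per token:        {active_expert_params / 1e9:.1f}B")
```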

This changes the scaling tradeoff:

| Dense scaling                           | Sparse scaling                     |
|-----------------------------------------|------------------------------------|
| Capacity tied to compute                | Capacity partially decoupled       |
| More parameters always cost more FLOPs  | Extra inactive experts are cheap   |
| Compute grows with total size           | Compute grows with active experts  |

Sparse scaling is therefore attractive when parameter memory is cheaper than compute.

Routing Instability

Routers can become unstable during training.

Common issues include:

| Problem               | Description                   |
|-----------------------|-------------------------------|
| Expert collapse       | Few experts dominate          |
| Routing oscillation   | Tokens rapidly switch experts |
| Dead experts          | Some experts unused           |
| Noisy specialization  | Experts fail to stabilize     |
| Overconfident routing | Router entropy collapses      |

Several stabilization techniques are common:

| Technique                | Purpose                      |
|--------------------------|------------------------------|
| Auxiliary balancing loss | Spread tokens evenly         |
| Noisy routing            | Encourage exploration        |
| Temperature scaling      | Smooth routing probabilities |
| Capacity constraints     | Prevent overload             |
| Top-2 routing            | Improve robustness           |

Stable routing is essential for good expert utilization.
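
A sketch of two of these stabilization techniques applied to router logits; the noise scale and temperature below are illustrative choices, not recommended values:

```python
import torch
from torch import nn

d_model, n_experts = 16, 4
router = nn.Linear(d_model, n_experts)
x = torch.randn(8, d_model)           # a batch of token representations

logits = router(x)

# Noisy routing: perturb the logits during training to encourage exploration.
training, noise_std = True, 1.0       # illustrative noise scale
if training:
    logits = logits + noise_std * torch.randn_like(logits)

# Temperature scaling: temperatures above 1 smooth the routing distribution.
temperature = 1.5                     # illustrative value
probs = torch.softmax(logits / temperature, dim=-1)

# Top-2 routing gives each token a fallback expert, improving robustness.
top2_probs, top2_idx = probs.topk(2, dim=-1)
```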

Switch Transformers

Switch Transformers simplified MoE by using top-1 routing.

Instead of combining multiple experts, each token selects only one expert:

$$y_t = E_{i^*}(x_t),$$

where $i^*$ is the expert with the highest routing probability.

This reduces communication and computation.

Advantages:

| Benefit             | Explanation                 |
|---------------------|-----------------------------|
| Simpler routing     | One expert per token        |
| Lower communication | Fewer transfers             |
| Lower memory        | Fewer active expert outputs |
| Faster training     | Less aggregation overhead   |

The tradeoff is lower routing flexibility.

Switch-style routing showed that very large sparse models could scale efficiently with simpler systems design.

Shared and Specialized Layers

Most MoE transformers are hybrid architectures.

| Component          | Usually dense or sparse |
|--------------------|-------------------------|
| Attention layers   | Dense                   |
| Embeddings         | Dense                   |
| Output head        | Dense                   |
| Feedforward layers | Sparse experts          |

Attention layers remain shared because every token must exchange information globally. Sparse experts mainly replace the computationally expensive feedforward blocks.

This hybrid design balances communication cost and specialization.
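
A sketch of this hybrid layout: a dense, shared self-attention sublayer followed by a sparse expert feedforward sublayer. The `moe_ffn` argument is assumed to be any module mapping `[B, T, D]` to `[B, T, D]`, for example the `SimpleMoE` layer defined later in this article.

```python
import torch
from torch import nn

class HybridMoEBlock(nn.Module):
    """Transformer block: dense shared attention + sparse expert feedforward."""

    def __init__(self, d_model: int, n_heads: int, moe_ffn: nn.Module):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # dense, shared
        self.moe_ffn = moe_ffn                                                  # sparse experts
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Dense attention: every token exchanges information with every other token.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Sparse feedforward: each token runs only through its routed experts.
        return x + self.moe_ffn(self.norm2(x))
```

For example, `HybridMoEBlock(512, 8, SimpleMoE(512, 2048, 8))` would assemble one such block using the educational MoE layer shown below.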

MoE During Inference

Inference introduces additional challenges.

The model must:

  1. Run the router.
  2. Dispatch tokens to experts.
  3. Gather outputs.
  4. Manage KV cache.
  5. Coordinate devices.

Latency can increase if routing creates communication bottlenecks.

Inference optimization techniques include:

| Technique                     | Purpose                     |
|-------------------------------|-----------------------------|
| Expert caching                | Reuse loaded expert weights |
| Expert placement optimization | Reduce communication        |
| Token batching                | Improve utilization         |
| Routing locality              | Keep tokens near experts    |
| Quantized experts             | Reduce memory bandwidth     |

Sparse inference efficiency depends heavily on systems engineering.

Sparse Experts Beyond Language

MoE ideas are also used in:

| Domain             | Application                |
|--------------------|----------------------------|
| Vision             | Specialized image experts  |
| Multimodal systems | Experts per modality       |
| Speech             | Acoustic specialization    |
| Robotics           | Task-conditioned policies  |
| Retrieval systems  | Memory-aware routing       |

The general principle is conditional computation: activate only the parts of the model needed for the current input.

MoE Versus Dense Models

| Property             | Dense transformer     | Sparse MoE transformer     |
|----------------------|-----------------------|----------------------------|
| Active parameters    | All                   | Small subset               |
| Compute scaling      | Grows with total size | Grows with active experts  |
| Communication        | Simpler               | More complex               |
| Parameter efficiency | Lower                 | Higher                     |
| Systems complexity   | Lower                 | Higher                     |
| Specialization       | Shared representation | Expert specialization      |

Dense models are simpler and often more stable. Sparse models offer better scaling efficiency at very large sizes.

A Minimal MoE Layer in PyTorch

A simplified educational MoE layer:

```python
import torch
from torch import nn

class Expert(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)

class SimpleMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()

        self.router = nn.Linear(d_model, n_experts)

        self.experts = nn.ModuleList([
            Expert(d_model, d_ff)
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: [B, T, D]

        B, T, D = x.shape

        scores = self.router(x)
        probs = torch.softmax(scores, dim=-1)

        top_expert = probs.argmax(dim=-1)

        out = torch.zeros_like(x)

        # Dispatch loop: send each token to its single selected expert (top-1 routing).
        for expert_id, expert in enumerate(self.experts):
            mask = (top_expert == expert_id)

            if mask.any():
                selected = x[mask]
                result = expert(selected)
                out[mask] = result

        return out
```

This implementation is intentionally simple and inefficient. Real MoE systems use optimized grouped dispatch kernels and distributed expert execution.
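
A quick shape check of the layer above (the sizes are arbitrary):

```python
moe = SimpleMoE(d_model=64, d_ff=256, n_experts=8)
x = torch.randn(2, 10, 64)    # [batch, tokens, d_model]
y = moe(x)
print(y.shape)                # torch.Size([2, 10, 64])
```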

Summary

Sparse expert architectures replace dense feedforward computation with conditionally activated experts. A router selects which experts process each token, allowing total parameter count to grow much larger than active compute.

MoE systems improve scaling efficiency by separating model capacity from per-token FLOPs. They introduce new challenges including routing stability, load balancing, communication overhead, capacity limits, and distributed dispatch.

Modern sparse transformers combine dense shared attention with sparse expert feedforward layers. This architecture has become an important approach for scaling very large foundation models efficiently.