TL;DR: The MoE Inference Penalty

Mixture-of-Experts (MoE) architectures are widely credited with making the training of massive LLMs compute-efficient. When serving these models in production, however, throughput is not dictated by the FLOP reduction. It depends instead on how effectively the weights fetched from high-bandwidth memory (HBM) are reused across tokens.
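
To see why, consider arithmetic intensity: FLOPs performed per byte fetched from memory. A minimal sketch (the function and all layer sizes here are illustrative assumptions, not figures from the paper):

```python
# Arithmetic intensity of a weight-bound matmul: each weight byte fetched
# from high-bandwidth memory is reused once per token in the batch.
# Hypothetical layer sizes, fp16 weights (2 bytes/param), 2 FLOPs per
# multiply-accumulate.

def arithmetic_intensity(batch_tokens, d_in, d_out, bytes_per_param=2):
    flops = 2 * batch_tokens * d_in * d_out        # matmul FLOPs
    weight_bytes = bytes_per_param * d_in * d_out  # each weight read once
    return flops / weight_bytes                    # FLOPs per weight byte

print(arithmetic_intensity(1, 4096, 4096))    # 1.0: batch-1 decode, memory-bound
print(arithmetic_intensity(256, 4096, 4096))  # 256.0: batching amortizes each fetch
```

The larger the batch sharing a weight matrix, the more compute each memory fetch buys.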

The Problem: A dense model reuses every loaded weight across all tokens in a batch. MoE routing squanders this reuse: because tokens are scattered across different experts, the batch fragments, and the hardware must fetch more weights from memory to serve fewer tokens per expert.
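
The fragmentation can be captured with a back-of-envelope function. Uniform routing is assumed and the configuration is illustrative, not taken from the paper:

```python
# Average number of tokens that share each expert's weights, assuming
# tokens route uniformly to `experts_activated` of `total_experts`
# experts. Illustrative configuration, not from the paper.

def tokens_per_weight_load(batch_tokens, experts_activated, total_experts):
    return batch_tokens * experts_activated / total_experts

batch = 256
print(tokens_per_weight_load(batch, 1, 1))   # 256.0: dense FFN, full reuse
print(tokens_per_weight_load(batch, 2, 64))  # 8.0: top-2 of 64 experts
```

A 256-token batch that fully amortizes a dense FFN's weights leaves each expert of a top-2-of-64 MoE with only 8 tokens on average, a 32x drop in weight reuse.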

The KV Cache Squeeze: Furthermore, maintaining a massive pool of resident experts consumes valuable high-bandwidth memory, leaving significantly less headroom for the KV cache. This severely restricts admissible batch sizes, especially during long-context serving, which compounds the loss of weight reuse.
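
The squeeze amounts to a memory-budget calculation. All capacities and per-token KV sizes below are hypothetical, chosen only to make the effect concrete:

```python
# How many long-context sequences fit in a batch once resident weights
# are subtracted from HBM. All sizes are illustrative assumptions, not
# measurements from the paper.

GiB = 1024**3

def max_batch(hbm_gib, resident_weight_gib, kv_bytes_per_token, context_len):
    kv_budget = (hbm_gib - resident_weight_gib) * GiB  # headroom left for KV cache
    return int(kv_budget // kv_bytes_per_token // context_len)

kv_per_token = 320 * 1024  # assume ~320 KiB of fp16 K/V per token across all layers
print(max_batch(80, 30, kv_per_token, 32_768))  # 5: dense-sized weight footprint
print(max_batch(80, 60, kv_per_token, 32_768))  # 2: MoE with many resident experts
```

Doubling the resident weight footprint more than halves the admissible long-context batch, which in turn worsens the weight-reuse problem above.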

The Takeaway: To quantify this, we introduce the $qs$ inequality, a predictive criterion that identifies exactly when an MoE architecture becomes structurally disadvantaged compared to a quality-matched dense baseline.

Link to the paper: https://arxiv.org/abs/2603.08960