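Mixture-of-Experts (MoE) is how the industry is scaling LLMs beyond dense transformer limits: each token activates only a few expert sub-networks, so parameter count can grow much faster than per-token compute. These notes cover the key architectures and implementation details.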
Reading List
- Switch Transformers (Fedus et al., 2021)
- ST-MoE (Zoph et al., 2022)
- Mixtral (Jiang et al., 2024)
- DeepSeekMoE and fine-grained expert design (Dai et al., 2024)
Key Insights
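- Expert load balancing is the critical implementation challenge: if the router sends most tokens to a few experts, those experts overflow their capacity while the rest sit idle
- Token-choice routing (each token picks its top-k experts) and expert-choice routing (each expert picks its top tokens) have major throughput implications, because only expert-choice fixes the per-expert load in advance
- Sparse models need different serving infrastructure than dense models: every expert's weights must be resident in memory even though only a few experts run per token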
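
A minimal sketch of the routing and load-balancing points above, in plain Python with no framework. The function names (`token_choice_route`, `expert_choice_route`, `switch_aux_loss`) and the toy logits are illustrative assumptions, not code from any of the papers. The auxiliary loss follows the form described in Switch Transformers, N * sum_i(f_i * P_i), with the scaling coefficient left out; it equals 1.0 under perfectly uniform routing and grows as routing becomes skewed.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's router logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def token_choice_route(probs, k):
    """Token-choice: each token picks its k highest-probability experts."""
    return [
        sorted(range(len(p)), key=lambda e: p[e], reverse=True)[:k]
        for p in probs
    ]


def expert_choice_route(probs, capacity):
    """Expert-choice: each expert picks the `capacity` tokens scoring highest for it."""
    num_experts = len(probs[0])
    return [
        sorted(range(len(probs)), key=lambda t: probs[t][e], reverse=True)[:capacity]
        for e in range(num_experts)
    ]


def expert_load(assignments, num_experts):
    """Count how many tokens each expert receives under token-choice routing."""
    load = [0] * num_experts
    for experts in assignments:
        for e in experts:
            load[e] += 1
    return load


def switch_aux_loss(probs, top1, num_experts):
    """Switch-style balancing loss: N * sum_i f_i * P_i (coefficient omitted).

    f_i is the fraction of tokens whose top-1 expert is i;
    P_i is the mean router probability assigned to expert i.
    """
    n_tokens = len(probs)
    f = [sum(1 for e in top1 if e == i) / n_tokens for i in range(num_experts)]
    p = [sum(row[i] for row in probs) / n_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Toy example (assumed values): 4 tokens, 2 experts, router skewed toward expert 0.
logits = [[2.0, 0.0], [1.5, 0.0], [1.0, 0.0], [0.0, 1.0]]
probs = [softmax(row) for row in logits]

token_assign = token_choice_route(probs, k=1)
token_load = expert_load(token_assign, num_experts=2)

# Expert-choice with capacity 2 gives every expert exactly 2 tokens by construction.
expert_assign = expert_choice_route(probs, capacity=2)

top1 = [experts[0] for experts in token_assign]
skewed_loss = switch_aux_loss(probs, top1, num_experts=2)

# Reference point: uniform probabilities and an even split.
balanced_loss = switch_aux_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)

print("token-choice load per expert:", token_load)
print("expert-choice tokens per expert:", [len(t) for t in expert_assign])
print("aux loss (skewed vs balanced):", skewed_loss, balanced_loss)
```

In this toy setup token-choice leaves one expert with three of the four tokens, while expert-choice hands each expert the same number. The trade-off is that under expert-choice a token may be picked by several experts or by none, which is why the two schemes behave differently at serving time.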