Related papers: Cross-token Modeling with Conditional Computation

XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection

Sparse models, including sparse Mixture-of-Experts (MoE) models, have emerged as an effective approach for scaling Transformer models. However, they often suffer from computational inefficiency since a significant number of parameters are…

Machine Learning · Computer Science 2024-05-27 Yuanhang Yang , Shiyi Qi , Wenchao Gu , Chaozheng Wang , Cuiyun Gao , Zenglin Xu

Efficient Language Modeling with Sparse all-MLP

All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work like gMLP shows that all-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.…

Computation and Language · Computer Science 2022-06-02 Ping Yu , Mikel Artetxe , Myle Ott , Sam Shleifer , Hongyu Gong , Ves Stoyanov , Xian Li

Mixture of Tokens: Continuous MoE through Cross-Example Aggregation

Mixture of Experts (MoE) models based on Transformer architecture are pushing the boundaries of language and vision tasks. The allure of these models lies in their ability to substantially increase the parameter count without a…

Computation and Language · Computer Science 2024-09-25 Szymon Antoniak , Michał Krutul , Maciej Pióro , Jakub Krajewski , Jan Ludziejewski , Kamil Ciebiera , Krystian Król , Tomasz Odrzygóźdź , Marek Cygan , Sebastian Jaszczur

Scaling Vision with Sparse Mixture of Experts

Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every…

Computer Vision and Pattern Recognition · Computer Science 2021-06-14 Carlos Riquelme , Joan Puigcerver , Basil Mustafa , Maxim Neumann , Rodolphe Jenatton , André Susano Pinto , Daniel Keysers , Neil Houlsby

M6-T: Exploring Sparse Expert Models and Beyond

Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost, and thus it has become a trend in model scaling. Still, it is a mystery how MoE layers bring quality…

Machine Learning · Computer Science 2021-08-10 An Yang , Junyang Lin , Rui Men , Chang Zhou , Le Jiang , Xianyan Jia , Ang Wang , Jie Zhang , Jiamang Wang , Yong Li , Di Zhang , Wei Lin , Lin Qu , Jingren Zhou , Hongxia Yang

Task-Specific Expert Pruning for Sparse Mixture-of-Experts

The sparse Mixture-of-Experts (MoE) model is powerful for large-scale pre-training and has achieved promising results due to its model capacity. However, with trillions of parameters, MoE is hard to deploy on cloud or mobile…

Machine Learning · Computer Science 2022-06-03 Tianyu Chen , Shaohan Huang , Yuan Xie , Binxing Jiao , Daxin Jiang , Haoyi Zhou , Jianxin Li , Furu Wei

Multi-Head Mixture-of-Experts

Sparse Mixtures of Experts (SMoE) scales model capacity without significant increases in training and inference costs, but exhibits the following two issues: (1) Low expert activation, where only a small subset of experts are activated for…

Computation and Language · Computer Science 2024-04-24 Xun Wu , Shaohan Huang , Wenhui Wang , Furu Wei

Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models

Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to…

Computation and Language · Computer Science 2023-07-06 Sheng Shen , Le Hou , Yanqi Zhou , Nan Du , Shayne Longpre , Jason Wei , Hyung Won Chung , Barret Zoph , William Fedus , Xinyun Chen , Tu Vu , Yuexin Wu , Wuyang Chen , Albert Webson , Yunxuan Li , Vincent Zhao , Hongkun Yu , Kurt Keutzer , Trevor Darrell , Denny Zhou

Sparse Crosscoders for diffing MoEs and Dense models

Mixture of Experts (MoE) models achieve parameter-efficient scaling through sparse expert routing, yet their internal representations remain poorly understood compared to dense models. We present a systematic comparison of MoE and dense model…

Machine Learning · Computer Science 2026-03-09 Marmik Chaudhari , Nishkal Hundia , Idhant Gulati

Path-Constrained Mixture-of-Experts

Sparse Mixture-of-Experts (MoE) architectures route each token through a subset of experts at each layer independently. We propose viewing MoE computation through the lens of expert paths, the sequence of expert selections a token…

Machine Learning · Computer Science 2026-04-07 Zijin Gu , Tatiana Likhomanenko , Vimal Thilak , Jason Ramapuram , Navdeep Jaitly

Hierarchical Mixture-of-Experts with Two-Stage Optimization

Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive…

Machine Learning · Computer Science 2026-05-12 Gleb Molodtsov , Alexander Miasnikov , Aleksandr Beznosikov

Scalable Training of Mixture-of-Experts Models with Megatron Core

Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-11 Zijie Yan , Hongxiao Bai , Xin Yao , Dennis Liu , Tong Liu , Hongbin Liu , Pingtian Li , Evan Wu , Shiqing Fan , Li Tao , Robin Zhang , Yuzhong Wang , Shifang Xu , Jack Chang , Xuwen Chen , Kunlun Li , Yan Bai , Gao Deng , Nan Zheng , Vijay Anand Korthikanti , Abhinav Khattar , Ethan He , Soham Govande , Sangkug Lym , Zhongbo Zhu , Qi Zhang , Haochen Yuan , Xiaowei Ren , Deyu Fu , Tailai Ma , Shunkang Zhang , Jiang Shao , Ray Wang , Vasudevan Rengasamy , Rachit Garg , Santosh Bhavani , Xipeng Li , Chandler Zhou , David Wu , Yingcan Wei , Ashwath Aithal , Michael Andersch , Mohammad Shoeybi , Jiajie Yao , June Yang

Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4× compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally…

Machine Learning · Computer Science 2024-04-09 Bowen Pan , Yikang Shen , Haokun Liu , Mayank Mishra , Gaoyuan Zhang , Aude Oliva , Colin Raffel , Rameswar Panda

From Sparse to Soft Mixtures of Experts

Sparse mixture of expert architectures (MoEs) scale model capacity without significant increases in training or inference costs. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to…

Machine Learning · Computer Science 2024-05-28 Joan Puigcerver , Carlos Riquelme , Basil Mustafa , Neil Houlsby

The Rise of Sparse Mixture-of-Experts: A Survey from Algorithmic Foundations to Decentralized Architectures and Vertical Domain Applications

The sparse Mixture of Experts (MoE) architecture has evolved as a powerful approach for scaling deep learning models to more parameters with comparable computation cost. As an important branch of large language models (LLMs), the MoE model only…

Machine Learning · Computer Science 2026-02-10 Dong Pan , Bingtao Li , Yongsheng Zheng , Jiren Ma , Victor Fei

FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning

Real-world model deployment across multiple domains requires multimodal models to operate under two complementary regimes: (1) multi-task pretraining, tasks are co-available at design time where related tasks could borrow representational…

Machine Learning · Computer Science 2026-05-12 Xing Han , Shravan Chaudhari , Tanvi Ranade , Rama Chellappa , Suchi Saria

Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts

Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due to their ability to decouple model size from inference efficiency by only activating a small subset of the model parameters for any given input token. As such,…

Computer Vision and Pattern Recognition · Computer Science 2023-09-11 Erik Daxberger , Floris Weers , Bowen Zhang , Tom Gunter , Ruoming Pang , Marcin Eichner , Michael Emmersberger , Yinfei Yang , Alexander Toshev , Xianzhi Du

Secret mixtures of experts inside your LLM

Despite being one of the earliest neural network layers, the Multilayer Perceptron (MLP) is arguably one of the least understood parts of the transformer architecture due to its dense computation and lack of easy visualization. This paper…

Machine Learning · Computer Science 2025-12-23 Enric Boix-Adsera

Mixture-of-Experts Models in Vision: Routing, Optimization, and Generalization

Mixture-of-Experts (MoE) architectures enable conditional computation by routing inputs to multiple expert subnetworks and are often motivated as a mechanism for scaling large language models. In this project, we instead study MoE behavior…

Machine Learning · Computer Science 2026-01-22 Adam Rokah , Daniel Veress , Caleb Caulk , Sourav Sharan

When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing

Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-k routing with hard…

Computer Vision and Pattern Recognition · Computer Science 2026-05-18 Libo Sun , Po-wei Harn , Peixiong He , Xiao Qin