Related papers: Cross-token Modeling with Conditional Computation

200 papers

Sparse models, including sparse Mixture-of-Experts (MoE) models, have emerged as an effective approach for scaling Transformer models. However, they often suffer from computational inefficiency since a significant number of parameters are…

Machine Learning · Computer Science · 2024-05-27 · Yuanhang Yang, Shiyi Qi, Wenchao Gu, Chaozheng Wang, Cuiyun Gao, Zenglin Xu

All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work like gMLP shows that all-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.…

Computation and Language · Computer Science · 2022-06-02 · Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li

Mixture of Experts (MoE) models based on Transformer architecture are pushing the boundaries of language and vision tasks. The allure of these models lies in their ability to substantially increase the parameter count without a…

Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every…

Computer Vision and Pattern Recognition · Computer Science · 2021-06-14 · Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, Neil Houlsby

Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost, and have thus become a trend in model scaling. Still, it is a mystery how MoE layers bring quality…

Machine Learning · Computer Science · 2021-08-10 · An Yang, Junyang Lin, Rui Men, Chang Zhou, Le Jiang, Xianyan Jia, Ang Wang, Jie Zhang, Jiamang Wang, Yong Li, Di Zhang, Wei Lin, Lin Qu, Jingren Zhou, Hongxia Yang

The sparse Mixture-of-Experts (MoE) model is powerful for large-scale pre-training and has achieved promising results due to its model capacity. However, with trillions of parameters, MoE is hard to deploy on cloud or mobile…

Machine Learning · Computer Science · 2022-06-03 · Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, Furu Wei

Sparse Mixtures of Experts (SMoE) scales model capacity without significant increases in training and inference costs, but exhibits the following two issues: (1) Low expert activation, where only a small subset of experts are activated for…

Computation and Language · Computer Science · 2024-04-24 · Xun Wu, Shaohan Huang, Wenhui Wang, Furu Wei

Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to…

Mixture of Experts (MoE) models achieve parameter-efficient scaling through sparse expert routing, yet their internal representations remain poorly understood compared to dense models. We present a systematic comparison of MoE and dense model…

Machine Learning · Computer Science · 2026-03-09 · Marmik Chaudhari, Nishkal Hundia, Idhant Gulati

Sparse Mixture-of-Experts (MoE) architectures route each token through a subset of experts at each layer independently. We propose viewing MoE computation through the lens of expert paths: the sequence of expert selections a token…

Machine Learning · Computer Science · 2026-04-07 · Zijin Gu, Tatiana Likhomanenko, Vimal Thilak, Jason Ramapuram, Navdeep Jaitly

Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive…

Machine Learning · Computer Science · 2026-05-12 · Gleb Molodtsov, Alexander Miasnikov, Aleksandr Beznosikov

Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation,…

Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4× compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally…

Machine Learning · Computer Science · 2024-04-09 · Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, Rameswar Panda

Sparse mixture of expert architectures (MoEs) scale model capacity without significant increases in training or inference costs. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to…

Machine Learning · Computer Science · 2024-05-28 · Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Neil Houlsby

The sparse Mixture of Experts (MoE) architecture has evolved as a powerful approach for scaling deep learning models to more parameters with comparable computation cost. As an important branch of large language models (LLMs), MoE model only…

Machine Learning · Computer Science · 2026-02-10 · Dong Pan, Bingtao Li, Yongsheng Zheng, Jiren Ma, Victor Fei

Real-world model deployment across multiple domains requires multimodal models to operate under two complementary regimes: (1) multi-task pretraining, tasks are co-available at design time where related tasks could borrow representational…

Machine Learning · Computer Science · 2026-05-12 · Xing Han, Shravan Chaudhari, Tanvi Ranade, Rama Chellappa, Suchi Saria

Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due to their ability to decouple model size from inference efficiency by only activating a small subset of the model parameters for any given input token. As such,…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-11 · Erik Daxberger, Floris Weers, Bowen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du

Despite being one of the earliest neural network layers, the Multilayer Perceptron (MLP) is arguably one of the least understood parts of the transformer architecture due to its dense computation and lack of easy visualization. This paper…

Machine Learning · Computer Science · 2025-12-23 · Enric Boix-Adsera

Mixture-of-Experts (MoE) architectures enable conditional computation by routing inputs to multiple expert subnetworks and are often motivated as a mechanism for scaling large language models. In this project, we instead study MoE behavior…

Machine Learning · Computer Science · 2026-01-22 · Adam Rokah, Daniel Veress, Caleb Caulk, Sourav Sharan

Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-k routing with hard…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-18 · Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin