Related papers: Efficient Sparsely Activated Transformers

XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection

Sparse models, including sparse Mixture-of-Experts (MoE) models, have emerged as an effective approach for scaling Transformer models. However, they often suffer from computational inefficiency since a significant number of parameters are…

Machine Learning · Computer Science 2024-05-27 Yuanhang Yang, Shiyi Qi, Wenchao Gu, Chaozheng Wang, Cuiyun Gao, Zenglin Xu

Sparsely Activated Mixture-of-Experts are Robust Multi-Task Learners

Traditional multi-task learning (MTL) methods use dense networks that use the same set of shared weights across several different tasks. This often creates interference where two or more tasks compete to pull model parameters in different…

Machine Learning · Computer Science 2022-04-19 Shashank Gupta, Subhabrata Mukherjee, Krishan Subudhi, Eduardo Gonzalez, Damien Jose, Ahmed H. Awadallah, Jianfeng Gao

Approximating Two-Layer Feedforward Networks for Efficient Transformers

How to reduce compute and memory requirements of neural networks (NNs) without sacrificing performance? Many recent works use sparse Mixtures of Experts (MoEs) to build resource-efficient large language models (LMs). Here we introduce…

Machine Learning · Computer Science 2023-11-22 Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber

Secret mixtures of experts inside your LLM

Despite being one of the earliest neural network layers, the Multilayer Perceptron (MLP) is arguably one of the least understood parts of the transformer architecture due to its dense computation and lack of easy visualization. This paper…

Machine Learning · Computer Science 2025-12-23 Enric Boix-Adsera

Mixture of Experts Made Intrinsically Interpretable

Neurons in large language models often exhibit polysemanticity, simultaneously encoding multiple unrelated concepts and obscuring interpretability. Instead of relying on post-hoc methods, we present MoE-X, a…

Machine Learning · Computer Science 2025-03-12 Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, Puneet K. Dokania, Adel Bibi, Philip Torr

Ultra-Sparse Memory Network

It is widely acknowledged that the performance of Transformer models is logarithmically related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from…

Machine Learning · Computer Science 2025-02-07 Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou

Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation

Sparse mixture-of-experts (MoE) layers have been shown to substantially increase model capacity without a proportional increase in computational cost and are widely used in transformer architectures, where they typically replace…

Computer Vision and Pattern Recognition · Computer Science 2026-04-16 Svetlana Pavlitska, Haixi Fan, Konstantin Ditschuneit, J. Marius Zöllner

SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

Recently, Mixture of Experts (MoE) based Transformer has shown promising results in many domains. This is largely due to the following advantages of this architecture: firstly, MoE based Transformer can increase model capacity without…

Sound · Computer Science 2021-05-10 Zhao You, Shulin Feng, Dan Su, Dong Yu

Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks

Vision Transformers have emerged as the state-of-the-art models in various Computer Vision tasks, but their high computational and resource demands pose significant challenges. While Mixture-of-Experts (MoE) can make these models more…

Computer Vision and Pattern Recognition · Computer Science 2025-05-22 Uranik Berisha, Jens Mehnert, Alexandru Paul Condurache

Seg-MoE: Multi-Resolution Segment-wise Mixture-of-Experts for Time Series Forecasting Transformers

Transformer-based models have recently made significant advances in accurate time-series forecasting, but even these architectures struggle to scale efficiently while capturing long-term temporal dynamics. Mixture-of-Experts (MoE) layers…

Machine Learning · Computer Science 2026-03-17 Evandro S. Ortigossa, Eran Segal

Mixture-of-Expert Conformer for Streaming Multilingual ASR

End-to-end models with large capacity have significantly improved multilingual automatic speech recognition, but their computation cost poses challenges for on-device applications. We propose a streaming truly multilingual Conformer…

Computation and Language · Computer Science 2023-05-26 Ke Hu, Bo Li, Tara N. Sainath, Yu Zhang, Francoise Beaufays

Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals…

Computation and Language · Computer Science 2025-11-06 Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly

Faster MoE LLM Inference for Extremely Large Models

Sparse Mixture of Experts (MoE) large language models (LLMs) are gradually becoming the mainstream approach for ultra-large-scale models. Existing optimization efforts for MoE models have focused primarily on coarse-grained MoE…

Computation and Language · Computer Science 2025-05-07 Haoqi Yang, Luohe Shi, Qiwei Li, Zuchao Li, Ping Wang, Bo Du, Mengjia Shen, Hai Zhao

Mixture of A Million Experts

The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable…

Machine Learning · Computer Science 2024-07-08 Xu Owen He

Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production

Mixture of Experts (MoE) models with conditional execution of sparsely activated layers have enabled training models with a much larger number of parameters. As a result, these models have achieved significantly better quality on various…

Computation and Language · Computer Science 2022-11-21 Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion

Transformer models can face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts…

Machine Learning · Computer Science 2024-11-13 Filip Szatkowski, Bartosz Wójcik, Mikołaj Piórczyński, Simone Scardapane

UMoE: Unifying Attention and FFN with Shared Experts

Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending…

Machine Learning · Computer Science 2025-10-24 Yuanhang Yang, Chaozheng Wang, Jing Li

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with…

Machine Learning · Computer Science 2022-06-20 William Fedus, Barret Zoph, Noam Shazeer

LExI: Layer-Adaptive Active Experts for Efficient MoE Model Inference

Mixture-of-Experts (MoE) models scale efficiently by activating only a subset of experts per token, offering a computationally sparse alternative to dense architectures. While prior post-training optimizations, such as inter- and…

Machine Learning · Computer Science 2025-09-04 Krishna Teja Chitty-Venkata, Sandeep Madireddy, Murali Emani, Venkatram Vishwanath

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically…

Machine Learning · Computer Science 2017-01-24 Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean