Related papers: Efficient Sparsely Activated Transformers

200 papers

Sparse models, including sparse Mixture-of-Experts (MoE) models, have emerged as an effective approach for scaling Transformer models. However, they often suffer from computational inefficiency since a significant number of parameters are…

Machine Learning · Computer Science 2024-05-27 Yuanhang Yang, Shiyi Qi, Wenchao Gu, Chaozheng Wang, Cuiyun Gao, Zenglin Xu

Traditional multi-task learning (MTL) methods use dense networks that use the same set of shared weights across several different tasks. This often creates interference where two or more tasks compete to pull model parameters in different…

How to reduce compute and memory requirements of neural networks (NNs) without sacrificing performance? Many recent works use sparse Mixtures of Experts (MoEs) to build resource-efficient large language models (LMs). Here we introduce…

Machine Learning · Computer Science 2023-11-22 Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber

Despite being one of the earliest neural network layers, the Multilayer Perceptron (MLP) is arguably one of the least understood parts of the transformer architecture due to its dense computation and lack of easy visualization. This paper…

Machine Learning · Computer Science 2025-12-23 Enric Boix-Adsera

Neurons in large language models often exhibit polysemanticity, simultaneously encoding multiple unrelated concepts and obscuring interpretability. Instead of relying on post-hoc methods, we present MoE-X, a…

It is widely acknowledged that the performance of Transformer models is logarithmically related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from…

Machine Learning · Computer Science 2025-02-07 Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou

Sparse mixture-of-experts (MoE) layers have been shown to substantially increase model capacity without a proportional increase in computational cost and are widely used in transformer architectures, where they typically replace…

Computer Vision and Pattern Recognition · Computer Science 2026-04-16 Svetlana Pavlitska, Haixi Fan, Konstantin Ditschuneit, J. Marius Zöllner

Recently, Mixture of Experts (MoE) based Transformer has shown promising results in many domains. This is largely due to the following advantages of this architecture: firstly, MoE based Transformer can increase model capacity without…

Sound · Computer Science 2021-05-10 Zhao You, Shulin Feng, Dan Su, Dong Yu

Vision Transformers have emerged as the state-of-the-art models in various Computer Vision tasks, but their high computational and resource demands pose significant challenges. While Mixture-of-Experts (MoE) can make these models more…

Computer Vision and Pattern Recognition · Computer Science 2025-05-22 Uranik Berisha, Jens Mehnert, Alexandru Paul Condurache

Transformer-based models have recently made significant advances in accurate time-series forecasting, but even these architectures struggle to scale efficiently while capturing long-term temporal dynamics. Mixture-of-Experts (MoE) layers…

Machine Learning · Computer Science 2026-03-17 Evandro S. Ortigossa, Eran Segal

End-to-end models with large capacity have significantly improved multilingual automatic speech recognition, but their computation cost poses challenges for on-device applications. We propose a streaming truly multilingual Conformer…

Computation and Language · Computer Science 2023-05-26 Ke Hu, Bo Li, Tara N. Sainath, Yu Zhang, Francoise Beaufays

Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals…

Computation and Language · Computer Science 2025-11-06 Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly

Sparse Mixture of Experts (MoE) large language models (LLMs) are gradually becoming the mainstream approach for ultra-large-scale models. Existing optimization efforts for MoE models have focused primarily on coarse-grained MoE…

Computation and Language · Computer Science 2025-05-07 Haoqi Yang, Luohe Shi, Qiwei Li, Zuchao Li, Ping Wang, Bo Du, Mengjia Shen, Hai Zhao

The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable…

Machine Learning · Computer Science 2024-07-08 Xu Owen He

Mixture of Experts (MoE) models with conditional execution of sparsely activated layers have enabled training models with a much larger number of parameters. As a result, these models have achieved significantly better quality on various…

Computation and Language · Computer Science 2022-11-21 Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

Transformer models can face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts…

Machine Learning · Computer Science 2024-11-13 Filip Szatkowski, Bartosz Wójcik, Mikołaj Piórczyński, Simone Scardapane

Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending…

Machine Learning · Computer Science 2025-10-24 Yuanhang Yang, Chaozheng Wang, Jing Li

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with…

Machine Learning · Computer Science 2022-06-20 William Fedus, Barret Zoph, Noam Shazeer

Mixture-of-Experts (MoE) models scale efficiently by activating only a subset of experts per token, offering a computationally sparse alternative to dense architectures. While prior post-training optimizations, such as inter- and…

Machine Learning · Computer Science 2025-09-04 Krishna Teja Chitty-Venkata, Sandeep Madireddy, Murali Emani, Venkatram Vishwanath

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically…

Machine Learning · Computer Science 2017-01-24 Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean