Related papers: Sparse Universal Transformer

MoEUT: Mixture-of-Experts Universal Transformers

Previous work on Universal Transformers (UTs) has demonstrated the importance of parameter sharing across layers. By allowing recurrence in depth, UTs have advantages over standard Transformers in learning compositional generalizations, but…

Machine Learning · Computer Science · 2024-10-15 · Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, Christopher D. Manning

Universal Transformers

Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them…

Computation and Language · Computer Science · 2019-03-06 · Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Łukasz Kaiser

Adaptivity and Modularity for Efficient Generalization Over Task Complexity

Can transformers generalize efficiently on problems that require dealing with examples with different levels of difficulty? We introduce a new task tailored to assess generalization over different complexities and present results that…

Machine Learning · Computer Science · 2023-10-16 · Samira Abnar, Omid Saremi, Laurent Dinh, Shantel Wilson, Miguel Angel Bautista, Chen Huang, Vimal Thilak, Etai Littwin, Jiatao Gu, Josh Susskind, Samy Bengio

Sparse is Enough in Scaling Transformers

Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study…

Machine Learning · Computer Science · 2021-11-29 · Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva

Sparse Fusion for Multimodal Transformers

Multimodal classification is a core task in human-centric machine learning. We observe that information is highly complementary across modalities, thus unimodal information can be drastically sparsified prior to multimodal fusion without…

Computer Vision and Pattern Recognition · Computer Science · 2021-11-29 · Yi Ding, Alex Rich, Mason Wang, Noah Stier, Matthew Turk, Pradeep Sen, Tobias Höllerer

Multi-Unit Transformers for Neural Machine Translation

Transformer models achieve remarkable success in Neural Machine Translation. Many efforts have been devoted to deepening the Transformer by stacking several units (i.e., a combination of Multihead Attentions and FFN) in a cascade, while the…

Computation and Language · Computer Science · 2020-10-26 · Jianhao Yan, Fandong Meng, Jie Zhou

SPT: Fine-Tuning Transformer-based Language Models Efficiently with Sparsification

Transformer-based large language models (e.g., BERT and GPT) achieve great success, and fine-tuning, which tunes a pre-trained model on a task-specific dataset, is the standard practice to utilize these models for downstream tasks. However,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-12-19 · Yuntao Gui, Xiao Yan, Peiqi Yin, Han Yang, James Cheng

Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers

Transformers have shown improved performance when compared to previous architectures for sequence processing such as RNNs. Despite their sizeable performance gains, as recently suggested, the model is computationally expensive to train and…

Computation and Language · Computer Science · 2021-09-09 · Machel Reid, Edison Marrese-Taylor, Yutaka Matsuo

Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts

Recent successes suggest that parameter-efficient fine-tuning of foundation models is the state-of-the-art method for transfer learning in vision, replacing the rich literature of alternatives such as meta-learning. In trying to harness the…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-02 · Shengzhuang Chen, Jihoon Tack, Yunqiao Yang, Yee Whye Teh, Jonathan Richard Schwarz, Ying Wei

SaiT: Sparse Vision Transformers through Adaptive Token Pruning

While vision transformers have achieved impressive results, effectively and efficiently accelerating these models can further boost performances. In this work, we propose a dense/sparse training framework to obtain a unified model, enabling…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-13 · Ling Li, David Thorsley, Joseph Hassoun

Scaling Probabilistic Transformer via Efficient Cross-Scale Hyperparameter Transfer

Probabilistic Transformer (PT), a white-box probabilistic model for contextual word representation, has demonstrated substantial similarity to standard Transformers in both computational structure and downstream task performance on small…

Computation and Language · Computer Science · 2026-04-29 · Penghao Kuang, Haoyi Wu, Kewei Tu

SBAT: Video Captioning with Sparse Boundary-Aware Transformer

In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer is proposed for uni-modal language generation tasks such as machine translation. However, video captioning…

Computer Vision and Pattern Recognition · Computer Science · 2020-07-24 · Tao Jin, Siyu Huang, Ming Chen, Yingming Li, Zhongfei Zhang

Do Efficient Transformers Really Save Computation?

As transformer-based language models are trained on increasingly large datasets and with vast numbers of parameters, finding more efficient alternatives to the standard Transformer has become very valuable. While many efficient Transformers…

Machine Learning · Computer Science · 2024-11-12 · Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang

Characterization of Dielectric Materials by Sparse Signal Processing with Iterative Dictionary Updates

Estimating parameters and properties of various materials without causing damage to the material under test (MUT) is important in many applications. Thus, in this letter, we address this by wireless sensing. Here, the accuracy of the…

Signal Processing · Electrical Eng. & Systems · 2020-10-29 · Udaya S. K. P. Miriya Thanthrige, Jan Barowski, Ilona Rolfes, Daniel Erni, Thomas Kaiser, Aydin Sezgin

ST-MoE: Designing Stable and Transferable Sparse Expert Models

Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language…

Computation and Language · Computer Science · 2022-05-03 · Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus

From Sparse to Soft Mixtures of Experts

Sparse mixture of expert architectures (MoEs) scale model capacity without significant increases in training or inference costs. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to…

Machine Learning · Computer Science · 2024-05-28 · Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Neil Houlsby

Pretrained Transformers as Universal Computation Engines

We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning -- in particular, without finetuning of the self-attention and feedforward layers of the residual…

Machine Learning · Computer Science · 2021-07-01 · Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with…

Machine Learning · Computer Science · 2022-06-20 · William Fedus, Barret Zoph, Noam Shazeer

Multivariate Unified Skew-t Distributions And Their Properties

The unified skew-t (SUT) is a flexible parametric multivariate distribution that accounts for skewness and heavy tails in the data. A few of its properties can be found scattered in the literature or in a parameterization that does not…

Methodology · Statistics · 2023-12-01 · Kesen Wang, Maicon J. Karling, Reinaldo B. Arellano-Valle, Marc G. Genton

Ultra-Sparse Memory Network

It is widely acknowledged that the performance of Transformer models is logarithmically related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from…

Machine Learning · Computer Science · 2025-02-07 · Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou