Related papers: Sparse Universal Transformer

200 papers

Previous work on Universal Transformers (UTs) has demonstrated the importance of parameter sharing across layers. By allowing recurrence in depth, UTs have advantages over standard Transformers in learning compositional generalizations, but…

Machine Learning · Computer Science 2024-10-15 Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, Christopher D. Manning

Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them…

Computation and Language · Computer Science 2019-03-06 Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Łukasz Kaiser

Can transformers generalize efficiently on problems that require dealing with examples with different levels of difficulty? We introduce a new task tailored to assess generalization over different complexities and present results that…

Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study…

Multimodal classification is a core task in human-centric machine learning. We observe that information is highly complementary across modalities, thus unimodal information can be drastically sparsified prior to multimodal fusion without…

Computer Vision and Pattern Recognition · Computer Science 2021-11-29 Yi Ding, Alex Rich, Mason Wang, Noah Stier, Matthew Turk, Pradeep Sen, Tobias Höllerer

Transformer models achieve remarkable success in Neural Machine Translation. Many efforts have been devoted to deepening the Transformer by stacking several units (i.e., a combination of Multihead Attentions and FFN) in a cascade, while the…

Computation and Language · Computer Science 2020-10-26 Jianhao Yan, Fandong Meng, Jie Zhou

Transformer-based large language models (e.g., BERT and GPT) achieve great success, and fine-tuning, which tunes a pre-trained model on a task-specific dataset, is the standard practice to utilize these models for downstream tasks. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-19 Yuntao Gui, Xiao Yan, Peiqi Yin, Han Yang, James Cheng

Transformers have shown improved performance when compared to previous architectures for sequence processing such as RNNs. Despite their sizeable performance gains, as recently suggested, the model is computationally expensive to train and…

Computation and Language · Computer Science 2021-09-09 Machel Reid, Edison Marrese-Taylor, Yutaka Matsuo

Recent successes suggest that parameter-efficient fine-tuning of foundation models is the state-of-the-art method for transfer learning in vision, replacing the rich literature of alternatives such as meta-learning. In trying to harness the…

Computer Vision and Pattern Recognition · Computer Science 2024-07-02 Shengzhuang Chen, Jihoon Tack, Yunqiao Yang, Yee Whye Teh, Jonathan Richard Schwarz, Ying Wei

While vision transformers have achieved impressive results, effectively and efficiently accelerating these models can further boost performances. In this work, we propose a dense/sparse training framework to obtain a unified model, enabling…

Computer Vision and Pattern Recognition · Computer Science 2022-10-13 Ling Li, David Thorsley, Joseph Hassoun

Probabilistic Transformer (PT), a white-box probabilistic model for contextual word representation, has demonstrated substantial similarity to standard Transformers in both computational structure and downstream task performance on small…

Computation and Language · Computer Science 2026-04-29 Penghao Kuang, Haoyi Wu, Kewei Tu

In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation. However, video captioning…

Computer Vision and Pattern Recognition · Computer Science 2020-07-24 Tao Jin, Siyu Huang, Ming Chen, Yingming Li, Zhongfei Zhang

As transformer-based language models are trained on increasingly large datasets and with vast numbers of parameters, finding more efficient alternatives to the standard Transformer has become very valuable. While many efficient Transformers…

Machine Learning · Computer Science 2024-11-12 Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang

Estimating parameters and properties of various materials without causing damage to the material under test (MUT) is important in many applications. Thus, in this letter, we address this by wireless sensing. Here, the accuracy of the…

Signal Processing · Electrical Eng. & Systems 2020-10-29 Udaya S. K. P. Miriya Thanthrige, Jan Barowski, Ilona Rolfes, Daniel Erni, Thomas Kaiser, Aydin Sezgin

Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language…

Computation and Language · Computer Science 2022-05-03 Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus

Sparse mixture of expert architectures (MoEs) scale model capacity without significant increases in training or inference costs. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to…

Machine Learning · Computer Science 2024-05-28 Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Neil Houlsby

We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning -- in particular, without finetuning of the self-attention and feedforward layers of the residual…

Machine Learning · Computer Science 2021-07-01 Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with…

Machine Learning · Computer Science 2022-06-20 William Fedus, Barret Zoph, Noam Shazeer

The unified skew-t (SUT) is a flexible parametric multivariate distribution that accounts for skewness and heavy tails in the data. A few of its properties can be found scattered in the literature or in a parameterization that does not…

Methodology · Statistics 2023-12-01 Kesen Wang, Maicon J. Karling, Reinaldo B. Arellano-Valle, Marc G. Genton

It is widely acknowledged that the performance of Transformer models is logarithmically related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from…

Machine Learning · Computer Science 2025-02-07 Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou