Related papers: Encoder-Agnostic Adaptation for Conditional Langua…

Sentence Bottleneck Autoencoders from Transformer Language Models

Representation learning for text via pretraining a language model on a large corpus has become a standard starting point for building NLP systems. This approach stands in contrast to autoencoders, also trained on raw text, but with the…

Computation and Language · Computer Science 2021-09-14 Ivan Montero, Nikolaos Pappas, Noah A. Smith

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

In this paper, we show that a simple self-supervised pre-trained audio model can achieve comparable inference efficiency to more complicated pre-trained models with speech transformer encoders. These speech transformers rely on mixing…

Sound · Computer Science 2024-02-09 Sungho Jeon, Ching-Feng Yeh, Hakan Inan, Wei-Ning Hsu, Rashi Rungta, Yashar Mehdad, Daniel Bikel

Adversarial Self-Attention for Language Understanding

Deep neural models (e.g. Transformer) naturally learn spurious features, which create a "shortcut" between the labels and inputs, thus impairing the generalization and robustness. This paper advances the self-attention mechanism to its…

Computation and Language · Computer Science 2023-02-09 Hongqiu Wu, Ruixue Ding, Hai Zhao, Pengjun Xie, Fei Huang, Min Zhang

Controlling the Focus of Pretrained Language Generation Models

The finetuning of pretrained transformer-based language generation models is typically conducted in an end-to-end manner, where the model learns to attend to relevant parts of the input by itself. However, there does not exist a mechanism…

Artificial Intelligence · Computer Science 2022-03-03 Jiabao Ji, Yoon Kim, James Glass, Tianxing He

VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation

Existing work in multilingual pretraining has demonstrated the potential of cross-lingual transferability by training a unified Transformer encoder for multiple languages. However, much of this work only relies on the shared vocabulary and…

Computation and Language · Computer Science 2021-06-03 Fuli Luo, Wei Wang, Jiahao Liu, Yijia Liu, Bin Bi, Songfang Huang, Fei Huang, Luo Si

Preconditioned Attention: Enhancing Efficiency in Transformers

Central to the success of Transformers is the attention block, which effectively models global dependencies among input tokens associated to a dataset. However, we theoretically demonstrate that standard attention mechanisms in transformers…

Machine Learning · Computer Science 2026-03-31 Hemanth Saratchandran

Relaxed Attention for Transformer Models

The powerful modeling capabilities of all-attention-based transformer architectures often cause overfitting and - for natural language processing tasks - lead to an implicitly learned internal language model in the autoregressive…

Machine Learning · Computer Science 2022-09-21 Timo Lohrenz, Björn Möller, Zhengyang Li, Tim Fingscheidt

Temporal Attention for Language Models

Pretrained language models based on the transformer architecture have shown great success in NLP. Textual training data often comes from the web and is thus tagged with time-specific information, but most language models ignore this…

Computation and Language · Computer Science 2022-05-05 Guy D. Rosin, Kira Radinsky

Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection

Self-attention based Transformer has demonstrated the state-of-the-art performances in a number of natural language processing tasks. Self-attention is able to model long-term dependencies, but it may suffer from the extraction of…

Computation and Language · Computer Science 2019-12-30 Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun

Guiding Attention for Self-Supervised Learning with Transformers

In this paper, we propose a simple and effective technique to allow for efficient self-supervised learning with bi-directional Transformers. Our approach is motivated by recent studies demonstrating that self-attention patterns in trained…

Computation and Language · Computer Science 2020-10-07 Ameet Deshpande, Karthik Narasimhan

Technical Report: Auxiliary Tuning and its Application to Conditional Text Generation

We introduce a simple and efficient method, called Auxiliary Tuning, for adapting a pre-trained Language Model to a novel task; we demonstrate this approach on the task of conditional text generation. Our approach supplements the original…

Computation and Language · Computer Science 2020-07-01 Yoel Zeldes, Dan Padnos, Or Sharir, Barak Peleg

Conditioned Natural Language Generation using only Unconditioned Language Model: An Exploration

Transformer-based language models have been shown to be very powerful for natural language generation (NLG). However, text generation conditioned on some user inputs, such as topics or attributes, is non-trivial. Past approach relies on either…

Computation and Language · Computer Science 2020-11-17 Fan-Keng Sun, Cheng-I Lai

Discrete Variational Attention Models for Language Generation

Variational autoencoders have been widely applied for natural language generation, however, there are two long-standing problems: information under-representation and posterior collapse. The former arises from the fact that only the last…

Machine Learning · Computer Science 2021-06-17 Xianghong Fang, Haoli Bai, Zenglin Xu, Michael Lyu, Irwin King

Contextually Structured Token Dependency Encoding for Large Language Models

Token representation strategies within large-scale neural architectures often rely on contextually refined embeddings, yet conventional approaches seldom encode structured relationships explicitly within token interactions. Self-attention…

Computation and Language · Computer Science 2025-03-27 James Blades, Frederick Somerfield, William Langley, Susan Everingham, Maurice Witherington

Discovering Useful Sentence Representations from Large Pretrained Language Models

Despite the extensive success of pretrained language models as encoders for building NLP systems, they haven't seen prominence as decoders for sequence generation tasks. We explore the question of whether these models can be adapted to be…

Computation and Language · Computer Science 2020-08-21 Nishant Subramani, Nivedita Suresh

Masked Mixers for Language Generation and Retrieval

Attention mechanisms that confer selective focus on a strict subset of input elements are nearly ubiquitous in language models today. We posit there to be downside to the use of attention: most input information is lost. In support of this…

Computation and Language · Computer Science 2025-03-21 Benjamin L. Badger

Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition

Recently, attention-based encoder-decoder (AED) models have shown high performance for end-to-end automatic speech recognition (ASR) across several tasks. Addressing overconfidence in such models, in this paper we introduce the concept of…

Audio and Speech Processing · Electrical Eng. & Systems 2021-12-16 Timo Lohrenz, Patrick Schwarz, Zhengyang Li, Tim Fingscheidt

Latent Diffusion for Language Generation

Diffusion models have achieved great success in modeling continuous data modalities such as images, audio, and video, but have seen limited use in discrete domains such as language. Recent attempts to adapt diffusion to language have…

Computation and Language · Computer Science 2023-11-08 Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Shekhtman, Kilian Q. Weinberger

Multilingual Transformer Encoders: a Word-Level Task-Agnostic Evaluation

Some Transformer-based models can perform cross-lingual transfer learning: those models can be trained on a specific task in one language and give relatively good results on the same task in another language, despite having been pre-trained…

Computation and Language · Computer Science 2022-07-20 Félix Gaschi, François Plesse, Parisa Rastin, Yannick Toussaint

Attention-Guided Adaptation for Code-Switching Speech Recognition

The prevalence of powerful multilingual models, such as Whisper, has significantly advanced research on speech recognition. However, these models often struggle with handling the code-switching setting, which is essential in…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-15 Bobbi Aditya, Mahdin Rohmatillah, Liang-Hsuan Tai, Jen-Tzung Chien