Related papers: Learning Multiscale Transformer Models for Sequenc…

Multi-Scale Self-Attention for Text Classification

In this paper, we introduce the prior knowledge, multi-scale structure, into self-attention modules. We propose a Multi-Scale Transformer which uses multi-scale multi-head self-attention to capture features from different scales. Based on…

Computation and Language · Computer Science 2019-12-03 Qipeng Guo, Xipeng Qiu, Pengfei Liu, Xiangyang Xue, Zheng Zhang

MCSD: An Efficient Language Model with Diverse Fusion

Transformers excel in Natural Language Processing (NLP) due to their prowess in capturing long-term dependencies but suffer from exponential resource consumption with increasing sequence lengths. To address these challenges, we propose MCSD…

Computation and Language · Computer Science 2024-07-12 Hua Yang, Duohai Li, Shiman Li

Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation

Transformer-based models have been achieving state-of-the-art results in several fields of Natural Language Processing. However, their direct application to speech tasks is not trivial. The nature of these sequences carries problems such as…

Computation and Language · Computer Science 2022-05-17 Gerard Sant, Gerard I. Gállego, Belen Alastruey, Marta R. Costa-Jussà

Multi-Level Attention and Contrastive Learning for Enhanced Text Classification with an Optimized Transformer

This paper studies a text classification algorithm based on an improved Transformer to improve the performance and efficiency of the model in text classification tasks. Aiming at the shortcomings of the traditional Transformer model in…

Computation and Language · Computer Science 2025-01-24 Jia Gao, Guiran Liu, Binrong Zhu, Shicheng Zhou, Hongye Zheng, Xiaoxuan Liao

Head Pursuit: Probing Attention Specialization in Multimodal Transformers

Language and vision-language models have shown impressive performance across a wide range of tasks, but their internal mechanisms remain only partly understood. In this work, we study how individual attention heads in text-generative models…

Computer Vision and Pattern Recognition · Computer Science 2026-01-15 Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga

Memory Transformer

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows a transformer to combine information from all elements of a sequence into context-aware…

Computation and Language · Computer Science 2021-02-17 Mikhail S. Burtsev, Yuri Kuratov, Anton Peganov, Grigory V. Sapunov

Point Cloud Learning with Transformer

Remarkable performance from Transformer networks in Natural Language Processing promotes the development of these models in dealing with computer vision tasks such as image recognition and segmentation. In this paper, we introduce a novel…

Computer Vision and Pattern Recognition · Computer Science 2022-10-26 Qi Zhong, Xian-Feng Han

A Hierarchical Transformer for Unsupervised Parsing

The underlying structure of natural language is hierarchical; words combine into phrases, which in turn form clauses. An awareness of this hierarchical structure can aid machine learning models in performing many linguistic tasks. However,…

Machine Learning · Computer Science 2020-04-01 Ashok Thillaisundaram

Transformer++

Recent advancements in attention mechanisms have replaced recurrent neural networks and their variants for machine translation tasks. A Transformer using solely the attention mechanism achieved state-of-the-art results in sequence modeling. Neural…

Computation and Language · Computer Science 2020-04-02 Prakhar Thapak, Prodip Hore

Pre-Training a Graph Recurrent Network for Language Representation

Transformer-based pre-trained models have advanced considerably in recent years, becoming one of the most important backbones in natural language processing. Recent work shows that the attention mechanism inside Transformer may not be…

Computation and Language · Computer Science 2022-10-27 Yile Wang, Linyi Yang, Zhiyang Teng, Ming Zhou, Yue Zhang

Learning Language-Specific Layers for Multilingual Machine Translation

Multilingual Machine Translation promises to improve translation quality between non-English languages. This is advantageous for several reasons, namely lower latency (no need to translate twice), and reduced error cascades (e.g., avoiding…

Computation and Language · Computer Science 2023-05-05 Telmo Pessoa Pires, Robin M. Schmidt, Yi-Hsiu Liao, Stephan Peitz

Multi-View Self-Attention Based Transformer for Speaker Recognition

Initially developed for natural language processing (NLP), the Transformer model is now widely used for speech processing tasks such as speaker recognition, due to its powerful sequence modeling capabilities. However, conventional…

Audio and Speech Processing · Electrical Eng. & Systems 2022-01-28 Rui Wang, Junyi Ao, Long Zhou, Shujie Liu, Zhihua Wei, Tom Ko, Qing Li, Yu Zhang

Multi-scale Transformer Language Models

We investigate multi-scale transformer language models that learn representations of text at multiple scales, and present three different architectures that have an inductive bias to handle the hierarchical nature of language. Experiments…

Computation and Language · Computer Science 2020-05-05 Sandeep Subramanian, Ronan Collobert, Marc'Aurelio Ranzato, Y-Lan Boureau

Improving Transformer Models by Reordering their Sublayers

Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with…

Computation and Language · Computer Science 2020-04-24 Ofir Press, Noah A. Smith, Omer Levy

Differentiable Subset Pruning of Transformer Heads

Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in…

Computation and Language · Computer Science 2023-07-28 Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan

Multimodal Transformer for Unaligned Multimodal Language Sequences

Human language is often multimodal, which comprehends a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges in modeling such multimodal human language time-series data exist: 1) inherent data…

Computation and Language · Computer Science 2019-06-04 Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov

Evolving Attention with Residual Convolutions

Transformer is a ubiquitous model for natural language processing and has attracted wide attention in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong

Learning Source Phrase Representations for Neural Machine Translation

The Transformer translation model (Vaswani et al., 2017) based on a multi-head attention mechanism can be computed effectively in parallel and has significantly pushed forward the performance of Neural Machine Translation (NMT). Though…

Computation and Language · Computer Science 2020-06-26 Hongfei Xu, Josef van Genabith, Deyi Xiong, Qiuhui Liu, Jingyi Zhang

A Transformer with Stack Attention

Natural languages are believed to be (mildly) context-sensitive. Despite underpinning remarkably capable large language models, transformers are unable to model many context-free language tasks. In an attempt to address this limitation in…

Computation and Language · Computer Science 2024-05-15 Jiaoda Li, Jennifer C. White, Mrinmaya Sachan, Ryan Cotterell

Scalable Transformers for Neural Machine Translation

Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation. However, the deployment of Transformer is challenging because different scenarios require…

Computation and Language · Computer Science 2021-06-21 Peng Gao, Shijie Geng, Yu Qiao, Xiaogang Wang, Jifeng Dai, Hongsheng Li