English
Related papers

Related papers: Linear Self-Attention Approximation via Trainable …

200 papers

The massive amount of trainable parameters in the pre-trained language models (PLMs) makes them hard to be deployed to multiple downstream tasks. To address this issue, parameter-efficient transfer learning methods have been proposed to…

Computation and Language · Computer Science 2022-10-27 Yifan Chen , Devamanyu Hazarika , Mahdi Namazifar , Yang Liu , Di Jin , Dilek Hakkani-Tur

Linearization of attention using various kernel approximation and kernel learning techniques has shown promise. Past methods used a subset of combinations of component functions and weight matrices within the random feature paradigm. We…

Machine Learning · Computer Science 2025-09-24 Duke Nguyen , Du Yin , Aditya Joshi , Flora Salim

Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the…

Machine Learning · Computer Science 2019-11-13 Yao-Hung Hubert Tsai , Shaojie Bai , Makoto Yamada , Louis-Philippe Morency , Ruslan Salakhutdinov

Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address…

Computation and Language · Computer Science 2026-02-10 Yutao Sun , Zhenyu Li , Yike Zhang , Tengyu Pan , Bowen Dong , Yuyi Guo , Jianyong Wang

Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To…

Computation and Language · Computer Science 2025-10-27 Mutian He , Philip N. Garner

Despite their power, Transformers face challenges with long sequences due to the quadratic complexity of self-attention. To address this limitation, methods like $k$-Nearest-Neighbor ($k$NN) attention have been introduced [Roy, Saffar,…

Machine Learning · Computer Science 2024-11-11 Themistoklis Haris

The attention mechanism is the computational core of modern Transformer architectures, but its quadratic complexity in the input sequence length is the bottleneck for large-scale inference. This has motivated a rapidly growing body of work…

The Transformer architecture has become a cornerstone of modern artificial intelligence, but its core self-attention mechanism suffers from a complexity bottleneck that scales quadratically with sequence length, severely limiting its…

Machine Learning · Computer Science 2025-08-29 Zhongpan Tang

Recent advances in Transformer architectures have empowered their empirical success in a variety of tasks across different domains. However, existing works mainly focus on predictive accuracy and computational cost, without considering…

Machine Learning · Computer Science 2023-11-09 Xing Han , Tongzheng Ren , Tan Minh Nguyen , Khai Nguyen , Joydeep Ghosh , Nhat Ho

Sequential self-attention models usually rely on additive positional embeddings, which inject positional information into item representations at the input. In the absence of positional signals, the attention block is…

Information Retrieval · Computer Science 2026-02-25 Timur Nabiev , Evgeny Frolov

Recently Transformers have provided state-of-the-art performance in sparse matching, crucial to realize high-performance 3D vision applications. Yet, these Transformers lack efficiency due to the quadratic computational complexity of their…

Computer Vision and Pattern Recognition · Computer Science 2022-04-25 Suwichaya Suwanwimolkul , Satoshi Komorita

Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences,…

Machine Learning · Computer Science 2020-06-16 Sinong Wang , Belinda Z. Li , Madian Khabsa , Han Fang , Hao Ma

Transformer models have achieved remarkable results in a wide range of applications. However, their scalability is hampered by the quadratic time and memory complexity of the self-attention mechanism concerning the sequence length. This…

Machine Learning · Computer Science 2024-02-27 Yury Nahshan , Joseph Kampeas , Emir Haleva

Transformer has achieved great success in NLP. However, the quadratic complexity of the self-attention mechanism in Transformer makes it inefficient in handling long sequences. Many existing works explore to accelerate Transformers by…

Computation and Language · Computer Science 2021-09-03 Chuhan Wu , Fangzhao Wu , Tao Qi , Binxing Jiao , Daxin Jiang , Yongfeng Huang , Xing Xie

The quadratic computation complexity of self-attention has been a persistent challenge when applying Transformer models to vision tasks. Linear attention, on the other hand, offers a much more efficient alternative with its linear…

Computer Vision and Pattern Recognition · Computer Science 2023-09-04 Dongchen Han , Xuran Pan , Yizeng Han , Shiji Song , Gao Huang

The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2D)$.…

Machine Learning · Computer Science 2025-10-28 Armin Gerami , Ramani Duraiswami

Transformer architectures have led to remarkable progress in many state-of-art applications. However, despite their successes, modern transformers rely on the self-attention mechanism, whose time- and space-complexity is quadratic in the…

Machine Learning · Computer Science 2022-09-13 Feyza Duman Keles , Pruthuvi Mahesakya Wijewardena , Chinmay Hegde

Transformer architectures are now central to sequence modeling tasks. At its heart is the attention mechanism, which enables effective modeling of long-term dependencies in a sequence. Recently, transformers have been successfully applied…

Computer Vision and Pattern Recognition · Computer Science 2022-06-16 Lin Zheng , Huijie Pan , Lingpeng Kong

Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques include sparse and linear…

Machine Learning · Computer Science 2022-08-02 Tan Nguyen , Richard G. Baraniuk , Robert M. Kirby , Stanley J. Osher , Bao Wang

Standard inference and training with transformer based architectures scale quadratically with input sequence length. This is prohibitively large for a variety of applications especially in web-page translation, query-answering etc.…

Computation and Language · Computer Science 2023-03-20 Lovish Madaan , Srinadh Bhojanapalli , Himanshu Jain , Prateek Jain
‹ Prev 1 2 3 10 Next ›