Related papers: Linearizing Transformer with Key-Value Memory

200 papers

Transformers have reached remarkable success in sequence modeling. However, these models have efficiency issues as they need to store all the history token-level representations as memory. We present Memformer, an efficient neural network…

Computation and Language · Computer Science 2022-04-14 Qingyang Wu, Zhenzhong Lan, Kun Qian, Jing Gu, Alborz Geramifard, Zhou Yu

The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring…

Machine Learning · Computer Science 2020-12-22 Valerii Likhosherstov, Krzysztof Choromanski, Jared Davis, Xingyou Song, Adrian Weller

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model…

Machine Learning · Computer Science 2023-03-28 Quentin Fournier, Gaétan Marceau Caron, Daniel Aloise

In order to reduce the computational complexity of large language models, great efforts have been made to improve the efficiency of transformer models such as linear attention and flash-attention. However, the model size and…

Computation and Language · Computer Science 2026-02-04 Ning Ding, Yehui Tang, Haochen Qin, Zhenli Zhou, Chao Xu, Lin Li, Kai Han, Heng Liao, Yunhe Wang

Transformer models have demonstrated high accuracy in numerous applications but have high complexity and lack sequential processing capability making them ill-suited for many streaming applications at the edge where devices are heavily…

Neural and Evolutionary Computing · Computer Science 2024-02-08 Zeyu Liu, Gourav Datta, Anni Li, Peter Anthony Beerel

Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques include sparse and linear…

Machine Learning · Computer Science 2022-08-02 Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of…

Machine Learning · Computer Science 2020-02-19 Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya

Transformer models gain popularity because of their superior inference accuracy and inference throughput. However, the transformer is computation-intensive, causing a long inference time. The existing works on transformer inference…

Performance · Computer Science 2023-04-19 Yuan Feng, Hyeran Jeon, Filip Blagojevic, Cyril Guyot, Qing Li, Dong Li

Transformers have excelled in many tasks including vision. However, efficient deployment of transformer models in low-latency or high-throughput applications is hindered by the computation in the attention mechanism which involves expensive…

Computer Vision and Pattern Recognition · Computer Science 2024-06-12 John Yang, Le An, Su Inn Park

The Transformer architecture has led to significant gains in machine translation. However, most studies focus on only sentence-level translation without considering the context dependency within documents, leading to the inadequacy of…

Artificial Intelligence · Computer Science 2022-10-21 Yukun Feng, Feng Li, Ziang Song, Boyuan Zheng, Philipp Koehn

Sequence classification is essential in NLP for understanding and categorizing language patterns in tasks like sentiment analysis, intent detection, and topic classification. Transformer-based models, despite achieving state-of-the-art…

Computation and Language · Computer Science 2025-09-30 Hongbo Liu, Jia Xu

Memory constraint of always-on devices is one of the major concerns when deploying speech processing models on these devices. While larger models trained with sufficiently large amount of data generally perform better, making them fit in…

Computation and Language · Computer Science 2024-01-09 Yiming Wang, Jinyu Li

This paper proposes an efficient memory transformer Emformer for low latency streaming speech recognition. In Emformer, the long-range history context is distilled into an augmented memory bank to reduce self-attention's computation…

The Transformer architecture has become a cornerstone of modern artificial intelligence, but its core self-attention mechanism suffers from a complexity bottleneck that scales quadratically with sequence length, severely limiting its…

Machine Learning · Computer Science 2025-08-29 Zhongpan Tang

Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers. However, they usually suffer from degraded performances on various tasks and corpus. In this paper, we examine existing kernel-based linear…

Computation and Language · Computer Science 2022-10-20 Zhen Qin, XiaoDong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, Yiran Zhong

Transformer-based embedding models suffer from quadratic computational and linear memory complexity, limiting their utility for long sequences. We propose recurrent architectures as an efficient alternative, introducing a vertically chunked…

Computation and Language · Computer Science 2026-04-21 Tobias Grantner, Emanuel Sallinger, Martin Flechl

The quadratic complexity of self-attention prevents transformers from scaling effectively to long input sequences. On the other hand, modern GPUs and other specialized hardware accelerators are well-optimized for processing small input…

Machine Learning · Computer Science 2025-06-23 Hantao Yu, Josh Alman

One limitation of existing Transformer-based models is that they cannot handle very long sequences as input since their self-attention operations exhibit quadratic time and space complexity. This problem becomes especially acute when…

Machine Learning · Computer Science 2024-05-07 Yuzhen Mao, Martin Ester, Ke Li

Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large…

Computation and Language · Computer Science 2026-05-19 Xuan Zhang, Fengzhuo Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin

The proliferation of Transformer models is often constrained by the significant computational and memory bandwidth demands of deployment. To address this, we present MXFormer, a novel, hybrid, weight-stationary Compute-in-Memory (CIM)…

Hardware Architecture · Computer Science 2026-02-16 George Karfakis, Samyak Chakrabarty, Vinod Kurian Jacob, Siyun Qiao, Subramanian S. Iyer, Sudhakar Pamarti, Puneet Gupta