Related papers: Linearizing Transformer with Key-Value Memory

Memformer: A Memory-Augmented Transformer for Sequence Modeling

Transformers have reached remarkable success in sequence modeling. However, these models have efficiency issues as they need to store all the history token-level representations as memory. We present Memformer, an efficient neural network…

Computation and Language · Computer Science 2022-04-14 Qingyang Wu, Zhenzhong Lan, Kun Qian, Jing Gu, Alborz Geramifard, Zhou Yu

Sub-Linear Memory: How to Make Performers SLiM

The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring…

Machine Learning · Computer Science 2020-12-22 Valerii Likhosherstov, Krzysztof Choromanski, Jared Davis, Xingyou Song, Adrian Weller

A Practical Survey on Faster and Lighter Transformers

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model…

Machine Learning · Computer Science 2023-03-28 Quentin Fournier, Gaétan Marceau Caron, Daniel Aloise

MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers

In order to reduce the computational complexity of large language models, great efforts have been made to improve the efficiency of transformer models such as linear attention and flash-attention. However, the model size and…

Computation and Language · Computer Science 2026-02-04 Ning Ding, Yehui Tang, Haochen Qin, Zhenli Zhou, Chao Xu, Lin Li, Kai Han, Heng Liao, Yunhe Wang

LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre Memory Units

Transformer models have demonstrated high accuracy in numerous applications but have high complexity and lack sequential processing capability making them ill-suited for many streaming applications at the edge where devices are heavily…

Neural and Evolutionary Computing · Computer Science 2024-02-08 Zeyu Liu, Gourav Datta, Anni Li, Peter Anthony Beerel

Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques include sparse and linear…

Machine Learning · Computer Science 2022-08-02 Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang

Reformer: The Efficient Transformer

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of…

Machine Learning · Computer Science 2020-02-19 Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya

AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems

Transformer models gain popularity because of their superior inference accuracy and inference throughput. However, the transformer is computation-intensive, causing a long inference time. The existing works on transformer inference…

Performance · Computer Science 2023-04-19 Yuan Feng, Hyeran Jeon, Filip Blagojevic, Cyril Guyot, Qing Li, Dong Li

ReduceFormer: Attention with Tensor Reduction by Summation

Transformers have excelled in many tasks including vision. However, efficient deployment of transformer models in low-latency or high-throughput applications is hindered by the computation in the attention mechanism which involves expensive…

Computer Vision and Pattern Recognition · Computer Science 2024-06-12 John Yang, Le An, Su Inn Park

Learn To Remember: Transformer with Recurrent Memory for Document-Level Machine Translation

The Transformer architecture has led to significant gains in machine translation. However, most studies focus on only sentence-level translation without considering the context dependency within documents, leading to the inadequacy of…

Artificial Intelligence · Computer Science 2022-10-21 Yukun Feng, Feng Li, Ziang Song, Boyuan Zheng, Philipp Koehn

ResFormer: All-Time Reservoir Memory for Long Sequence Classification

Sequence classification is essential in NLP for understanding and categorizing language patterns in tasks like sentiment analysis, intent detection, and topic classification. Transformer-based models, despite achieving state-of-the-art…

Computation and Language · Computer Science 2025-09-30 Hongbo Liu, Jia Xu

ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers

Memory constraint of always-on devices is one of the major concerns when deploying speech processing models on these devices. While larger models trained with sufficiently large amount of data generally perform better, making them fit in…

Computation and Language · Computer Science 2024-01-09 Yiming Wang, Jinyu Li

Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition

This paper proposes an efficient memory transformer Emformer for low latency streaming speech recognition. In Emformer, the long-range history context is distilled into an augmented memory bank to reduce self-attention's computation…

Sound · Computer Science 2021-01-01 Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, Mike Seltzer

Rethinking Transformer Connectivity: TLinFormer, A Path to Exact, Full Context-Aware Linear Attention

The Transformer architecture has become a cornerstone of modern artificial intelligence, but its core self-attention mechanism suffers from a complexity bottleneck that scales quadratically with sequence length, severely limiting its…

Machine Learning · Computer Science 2025-08-29 Zhongpan Tang

The Devil in Linear Transformer

Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers. However, they usually suffer from degraded performances on various tasks and corpus. In this paper, we examine existing kernel-based linear…

Computation and Language · Computer Science 2022-10-20 Zhen Qin, XiaoDong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, Yiran Zhong

Linear-Time and Constant-Memory Text Embeddings Based on Recurrent Language Models

Transformer-based embedding models suffer from quadratic computational and linear memory complexity, limiting their utility for long sequences. We propose recurrent architectures as an efficient alternative, introducing a vertically chunked…

Computation and Language · Computer Science 2026-04-21 Tobias Grantner, Emanuel Sallinger, Martin Flechl

Two Heads Are Better than One: Simulating Large Transformers with Small Ones

The quadratic complexity of self-attention prevents transformers from scaling effectively to long input sequences. On the other hand, modern GPUs and other specialized hardware accelerators are well-optimized for processing small input…

Machine Learning · Computer Science 2025-06-23 Hantao Yu, Josh Alman

IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs

One limitation of existing Transformer-based models is that they cannot handle very long sequences as input since their self-attention operations exhibit quadratic time and space complexity. This problem becomes especially acute when…

Machine Learning · Computer Science 2024-05-07 Yuzhen Mao, Martin Ester, Ke Li

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large…

Computation and Language · Computer Science 2026-05-19 Xuan Zhang, Fengzhuo Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin

MXFormer: A Microscaling Floating-Point Charge-Trap Transistor Compute-in-Memory Transformer Accelerator

The proliferation of Transformer models is often constrained by the significant computational and memory bandwidth demands of deployment. To address this, we present MXFormer, a novel, hybrid, weight-stationary Compute-in-Memory (CIM)…

Hardware Architecture · Computer Science 2026-02-16 George Karfakis, Samyak Chakrabarty, Vinod Kurian Jacob, Siyun Qiao, Subramanian S. Iyer, Sudhakar Pamarti, Puneet Gupta