Related papers: Efficient Attention using a Fixed-Size Memory Repr…

A Cheap Linear Attention Mechanism with Fast Lookups and Fixed-Size Representations

The softmax content-based attention mechanism has proven to be very beneficial in many applications of recurrent neural networks. Nevertheless it suffers from two major computational limitations. First, its computations for an attention…

Machine Learning · Computer Science 2016-09-20 Alexandre de Brébisson, Pascal Vincent

Evaluating Sequence-to-Sequence Models for Handwritten Text Recognition

Encoder-decoder models have become an effective approach for sequence learning tasks like machine translation, image captioning and speech recognition, but have yet to show competitive results for handwritten text recognition. To this end,…

Computer Vision and Pattern Recognition · Computer Science 2019-07-16 Johannes Michael, Roger Labahn, Tobias Grüning, Jochen Zöllner

Efficient Attention Mechanisms for Large Language Models: A Survey

Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address…

Computation and Language · Computer Science 2026-02-10 Yutao Sun, Zhenyu Li, Yike Zhang, Tengyu Pan, Bowen Dong, Yuyi Guo, Jianyong Wang

Efficient Attention: Attention with Linear Complexities

Dot-product attention has wide applications in computer vision and natural language processing. However, its memory and computational costs grow quadratically with the input size. Such growth prohibits its application on high-resolution…

Computer Vision and Pattern Recognition · Computer Science 2024-01-22 Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, Hongsheng Li

ABC: Attention with Bounded-memory Control

Transformer architectures have achieved state-of-the-art results on a variety of sequence modeling tasks. However, their attention mechanism comes with a quadratic complexity in sequence lengths, making the computational overhead…

Computation and Language · Computer Science 2022-06-03 Hao Peng, Jungo Kasai, Nikolaos Pappas, Dani Yogatama, Zhaofeng Wu, Lingpeng Kong, Roy Schwartz, Noah A. Smith

Temporal Attention Model for Neural Machine Translation

Attention-based Neural Machine Translation (NMT) models suffer from attention deficiency issues as has been observed in recent research. We propose a novel mechanism to address some of these limitations and improve the NMT attention.…

Computation and Language · Computer Science 2016-08-10 Baskaran Sankaran, Haitao Mi, Yaser Al-Onaizan, Abe Ittycheriah

Harnessing Attention Mechanisms: Efficient Sequence Reduction using Attention-based Autoencoders

Many machine learning models use the manipulation of dimensions as a driving force to enable models to identify and learn important features in data. In the case of sequential data this manipulation usually happens on the token dimension…

Machine Learning · Computer Science 2023-10-24 Daniel Biermann, Fabrizio Palumbo, Morten Goodwin, Ole-Christoffer Granmo

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of the state-of-the-art models has increased steadily reaching…

Hardware Architecture · Computer Science 2025-01-15 Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Rühle, Saravan Rajmohan

Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding

Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference time efficiency remains a significant challenge due to the memory overhead during decoding, especially when the query and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Tiansheng Huang, Zachary Yahn, Yichang Xu, Ling Liu

Local Monotonic Attention Mechanism for End-to-End Speech and Language Processing

Recently, encoder-decoder neural networks have shown impressive performance on many sequence-related tasks. The architecture commonly uses an attentional mechanism which allows the model to learn alignments between the source and the target…

Computation and Language · Computer Science 2017-11-06 Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

Power Law Guided Dynamic Sifting for Efficient Attention

Efficient inference on GPUs using large language models remains challenging due to memory bandwidth limitations, particularly during data transfers between High Bandwidth Memory (HBM) and SRAM in attention computations. Approximate…

Machine Learning · Computer Science 2025-06-06 Nirav Koley, Prajwal Singhania, Abhinav Bhatele

Fair Comparison between Efficient Attentions

Transformers have been successfully used in various fields and are becoming the standard tools in computer vision. However, self-attention, a core component of transformers, has a quadratic complexity problem, which limits the use of…

Computer Vision and Pattern Recognition · Computer Science 2022-06-02 Jiuk Hong, Chaehyeon Lee, Soyoun Bang, Heechul Jung

Long Sequence Modeling with Attention Tensorization: From Sequence to Tensor Learning

As the demand for processing extended textual data grows, the ability to handle long-range dependencies and maintain computational efficiency is more critical than ever. One of the key issues for long-sequence modeling using attention-based…

Computation and Language · Computer Science 2025-05-26 Aosong Feng, Rex Ying, Leandros Tassiulas

Contextually Structured Token Dependency Encoding for Large Language Models

Token representation strategies within large-scale neural architectures often rely on contextually refined embeddings, yet conventional approaches seldom encode structured relationships explicitly within token interactions. Self-attention…

Computation and Language · Computer Science 2025-03-27 James Blades, Frederick Somerfield, William Langley, Susan Everingham, Maurice Witherington

Monotonic segmental attention for automatic speech recognition

We introduce a novel segmental-attention model for automatic speech recognition. We restrict the decoder attention to segments to avoid quadratic runtime of global attention, better generalize to long sequences, and eventually enable…

Computation and Language · Computer Science 2022-10-27 Albert Zeyer, Robin Schmitt, Wei Zhou, Ralf Schlüter, Hermann Ney

Slim attention: cut your context memory in half without loss -- K-cache is all you need for MHA

Slim attention shrinks the context memory size by 2x for transformer models with MHA (multi-head attention), which can speed up inference by up to 2x for large context windows. Slim attention is an exact, mathematically identical…

Machine Learning · Computer Science 2025-06-04 Nils Graef, Andrew Wasielewski

Sparse Sinkhorn Attention

We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. Our method is based on differentiable sorting of internal representations. Concretely, we introduce a meta sorting network that learns to…

Machine Learning · Computer Science 2020-02-27 Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, Da-Cheng Juan

Simple linear attention language models balance the recall-throughput tradeoff

Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is bottle-necked during inference by the…

Computation and Language · Computer Science 2025-03-10 Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Christopher Ré

Efficient Attentions for Long Document Summarization

The quadratic computational and memory complexities of large Transformers have limited their scalability for long document summarization. In this paper, we propose Hepos, a novel efficient encoder-decoder attention with head-wise positional…

Computation and Language · Computer Science 2021-04-13 Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, Lu Wang

Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation

Transformer-based models have brought a radical change to neural machine translation. A key feature of the Transformer architecture is the so-called multi-head attention mechanism, which allows the model to focus simultaneously on different…

Computation and Language · Computer Science 2020-10-06 Alessandro Raganato, Yves Scherrer, Jörg Tiedemann