Related papers: Online Vector Quantized Attention

200 papers

Transformer models have been successful in various sequence processing tasks, but the self-attention mechanism's computational cost limits its practicality for long sequences. Although there are existing attention variants that improve…

Machine Learning · Computer Science 2024-04-19 Zicheng Liu, Li Wang, Siyuan Li, Zedong Wang, Haitao Lin, Stan Z. Li

Vision-language models (VLMs) show remarkable performance in multimodal tasks. However, excessively long multimodal inputs lead to oversized Key-Value (KV) caches, resulting in significant memory consumption and I/O bottlenecks. Previous KV…

Computation and Language · Computer Science 2025-01-28 Zunhai Su, Wang Shen, Linge Li, Zhe Chen, Hanyu Wei, Huangqi Yu, Kehong Yuan

The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of…

Machine Learning · Computer Science 2025-11-11 Myunghyun Rhee, Sookyung Choi, Euiseok Kim, Joonseop Sim, Youngpyo Joo, Hoshik Kim

Multimodal large language models (MLLMs) are plagued by exorbitant inference costs attributable to the profusion of visual tokens within the vision encoder. The redundant visual tokens engender a substantial computational load and…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu

In large-scale recommender systems, ultra-long user behavior sequences encode rich signals of evolving interests. Extending sequence length generally improves accuracy, but directly modeling such sequences in production is infeasible due to…

Information Retrieval · Computer Science 2025-08-26 Kaiyuan Li, Yongxiang Tang, Yanhua Cheng, Yong Bai, Yanxiang Zeng, Chao Wang, Xialong Liu, Peng Jiang

As the length of input text increases, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource-constrained devices. Existing approaches, such as KV quantization and pruning, reduce…

Machine Learning · Computer Science 2025-12-24 Tenghui Li, Guoxu Zhou, Xuyang Zhao, Yuning Qiu, Qibin Zhao

Transformer-based large language models (LLMs) have achieved remarkable success, yet their standard attention mechanism incurs quadratic computation and memory costs with respect to sequence length, posing a major bottleneck for…

Machine Learning · Computer Science 2025-10-22 Tao Bu, Qiangang Wang, Bowen Zeng, Hanwen Sun, Yunpeng Huang, Chun Cao, Jingwei Xu

The size of the key-value (KV) cache plays a critical role in determining both the maximum context length and the number of concurrent requests supported during inference in modern language models. The KV cache size grows proportionally…

Computation and Language · Computer Science 2024-09-24 Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley

The efficiency of large language models (LLMs) remains a critical challenge, particularly in contexts where computational resources are limited. Traditional attention mechanisms in these models, while powerful, require significant…

Computation and Language · Computer Science 2024-07-19 Bingli Liao, Danilo Vasconcellos Vargas

Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of the VLM is too long, which may cause intolerable inference latency and GPU memory usage. Existing methods propose…

Computer Vision and Pattern Recognition · Computer Science 2026-05-05 Wei Tao, Xiaoyang Qu, Peiqiang Wang, Guokuan Li, Jiguang Wan, Kai Lu, Jianzong Wang

This study introduces bifurcated attention, a method designed to enhance language model inference in shared-context batch decoding scenarios. Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing…

Large language models (LLMs) have achieved impressive success, but their high memory requirements present challenges for long-context token generation. In this paper we study the streaming complexity of attention approximation, a key…

Machine Learning · Computer Science 2026-03-25 Ekaterina Kochetkova, Kshiteej Sheth, Insu Han, Amir Zandieh, Michael Kapralov

Long-context ability has become one of the most important iteration directions for next-generation Large Language Models, particularly in semantic understanding/reasoning, code agentic intelligence, and recommendation systems. However, the…

Self-attention is a method of encoding sequences of vectors by relating these vectors to each other based on pairwise similarities. These models have recently shown promising results for modeling discrete sequences, but they are non-trivial…

Computation and Language · Computer Science 2018-06-19 Matthias Sperber, Jan Niehues, Graham Neubig, Sebastian Stüker, Alex Waibel

Long-context LLMs have enabled numerous downstream applications but also introduced significant challenges related to computational and memory efficiency. To address these challenges, optimizations for long-context inference have been…

Computation and Language · Computer Science 2025-03-12 Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu

The success of the self-attention mechanism in classical machine learning models has inspired the development of quantum analogs aimed at reducing computational overhead. Self-attention integrates learnable query and key matrices to…

The emergence of LLMs has ignited a fresh surge of breakthroughs in NLP applications, particularly in domains such as question-answering systems and text generation. As the need for longer context grows, a significant bottleneck in model…

Computation and Language · Computer Science 2024-04-15 Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang

Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-07 Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun

The self-attention mechanism has significantly advanced the field of natural language processing, facilitating the development of advanced language-learning machines. Although its utility is widely acknowledged, the precise mechanisms of…

Computation and Language · Computer Science 2026-02-04 Tal Halevi, Yarden Tzach, Ronit D. Gross, Shalom Rosner, Ido Kanter

Self-attention models have shown flexibility in parallel computation and effectiveness in modeling both long- and short-term dependencies. However, they calculate the dependencies between representations without considering the…

Computation and Language · Computer Science 2019-02-18 Baosong Yang, Jian Li, Derek Wong, Lidia S. Chao, Xing Wang, Zhaopeng Tu