Related papers: Online Vector Quantized Attention

LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory

Transformer models have been successful in various sequence processing tasks, but the self-attention mechanism's computational cost limits its practicality for long sequences. Although there are existing attention variants that improve…

Machine Learning · Computer Science · 2024-04-19 · Zicheng Liu, Li Wang, Siyuan Li, Zedong Wang, Haitao Lin, Stan Z. Li

AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models

Vision-language models (VLMs) show remarkable performance in multimodal tasks. However, excessively long multimodal inputs lead to oversized Key-Value (KV) caches, resulting in significant memory consumption and I/O bottlenecks. Previous KV…

Computation and Language · Computer Science · 2025-01-28 · Zunhai Su, Wang Shen, Linge Li, Zhe Chen, Hanyu Wei, Huangqi Yu, Kehong Yuan

MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference

The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of…

Machine Learning · Computer Science · 2025-11-11 · Myunghyun Rhee, Sookyung Choi, Euiseok Kim, Joonseop Sim, Youngpyo Joo, Hoshik Kim

Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model

Multimodal large language models (MLLMs) are plagued by exorbitant inference costs attributable to the profusion of visual tokens within the vision encoder. The redundant visual tokens engender a substantial computational load and…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-03 · Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu

VQL: An End-to-End Context-Aware Vector Quantization Attention for Ultra-Long User Behavior Modeling

In large-scale recommender systems, ultra-long user behavior sequences encode rich signals of evolving interests. Extending sequence length generally improves accuracy, but directly modeling such sequences in production is infeasible due to…

Information Retrieval · Computer Science · 2025-08-26 · Kaiyuan Li, Yongxiang Tang, Yanhua Cheng, Yong Bai, Yanxiang Zeng, Chao Wang, Xialong Liu, Peng Jiang

Efficient Low Rank Attention for Long-Context Inference in Large Language Models

As the length of input text increases, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource constrained devices. Existing approaches, such as KV quantization and pruning, reduce…

Machine Learning · Computer Science · 2025-12-24 · Tenghui Li, Guoxu Zhou, Xuyang Zhao, Yuning Qiu, Qibin Zhao

Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism

Transformer-based large language models (LLMs) have achieved remarkable success, yet their standard attention mechanism incurs quadratic computation and memory costs with respect to sequence length, posing a major bottleneck for…

Machine Learning · Computer Science · 2025-10-22 · Tao Bu, Qiangang Wang, Bowen Zeng, Hanwen Sun, Yunpeng Huang, Chun Cao, Jingwei Xu

Inference-Friendly Models With MixAttention

The size of the key-value (KV) cache plays a critical role in determining both the maximum context length and the number of concurrent requests supported during inference in modern language models. The KV cache size grows proportionally…

Computation and Language · Computer Science · 2024-09-24 · Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley

Beyond KV Caching: Shared Attention for Efficient LLMs

The efficiency of large language models (LLMs) remains a critical challenge, particularly in contexts where computational resources are limited. Traditional attention mechanisms in these models, while powerful, require significant…

Computation and Language · Computer Science · 2024-07-19 · Bingli Liao, Danilo Vasconcellos Vargas

WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of the VLM is too long, which may cause intolerable inference latency and GPU memory usage. Existing methods propose…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-05 · Wei Tao, Xiaoyang Qu, Peiqiang Wang, Guokuan Li, Jiguang Wan, Kai Lu, Jianzong Wang

Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs

This study introduces bifurcated attention, a method designed to enhance language model inference in shared-context batch decoding scenarios. Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing…

Machine Learning · Computer Science · 2024-07-15 · Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, Bing Xiang

Streaming Attention Approximation via Discrepancy Theory

Large language models (LLMs) have achieved impressive success, but their high memory requirements present challenges for long-context token generation. In this paper we study the streaming complexity of attention approximation, a key…

Machine Learning · Computer Science · 2026-03-25 · Ekaterina Kochetkova, Kshiteej Sheth, Insu Han, Amir Zandieh, Michael Kapralov

Kwai Summary Attention Technical Report

Long-context ability has become one of the most important iteration directions for next-generation Large Language Models, particularly in semantic understanding/reasoning, code agentic intelligence, and recommendation systems. However, the…

Computation and Language · Computer Science · 2026-04-28 · Chenglong Chu, Guorui Zhou, Guowang Zhang, Han Li, Hao Peng, Hongtao Cheng, Jian Liang, Jiangxia Cao, Kun Gai, Lingzhi Zhou, Lu Ren, Qi Zhang, Ruiming Tang, Ruitao Wang, Xinchen Luo, Yi Su, Zhiyuan Liang, Ziqi Wang, Boyang Ding, Chengru Song, Dunju Zang, Hui Wang, Jiao Ou, Jiaxin Deng, Jijun Shi, Jinghao Zhang, Junmin Chen, Lejian Ren, Minxuan Lv, Qianqian Wang, Qigen Hu, Shiyao Wang, Siyang Mao, Tao Wang, Xingmei Wang, Zhixin Ling, Ziming Li, Zixing Zhang

Self-Attentional Acoustic Models

Self-attention is a method of encoding sequences of vectors by relating these vectors to each other based on pairwise similarities. These models have recently shown promising results for modeling discrete sequences, but they are non-trivial…

Computation and Language · Computer Science · 2018-06-19 · Matthias Sperber, Jan Niehues, Graham Neubig, Sebastian Stüker, Alex Waibel

SCBench: A KV Cache-Centric Analysis of Long-Context Methods

Long-context LLMs have enabled numerous downstream applications but also introduced significant challenges related to computational and memory efficiency. To address these challenges, optimizations for long-context inference have been…

Computation and Language · Computer Science · 2025-03-12 · Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu

A Hybrid Transformer Architecture with a Quantized Self-Attention Mechanism Applied to Molecular Generation

The success of the self-attention mechanism in classical machine learning models has inspired the development of quantum analogs aimed at reducing computational overhead. Self-attention integrates learnable query and key matrices to…

Quantum Physics · Physics · 2025-08-05 · Anthony M. Smaldone, Yu Shee, Gregory W. Kyro, Marwa H. Farag, Zohim Chandani, Elica Kyoseva, Victor S. Batista

QAQ: Quality Adaptive Quantization for LLM KV Cache

The emergence of LLMs has ignited a fresh surge of breakthroughs in NLP applications, particularly in domains such as question-answering systems and text generation. As the need for longer context grows, a significant bottleneck in model…

Computation and Language · Computer Science · 2024-04-15 · Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang

BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences

Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-06-07 · Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun

Self-attention vector output similarities reveal how machines pay attention

The self-attention mechanism has significantly advanced the field of natural language processing, facilitating the development of advanced language-learning machines. Although its utility is widely acknowledged, the precise mechanisms of…

Computation and Language · Computer Science · 2026-02-04 · Tal Halevi, Yarden Tzach, Ronit D. Gross, Shalom Rosner, Ido Kanter

Context-Aware Self-Attention Networks

Self-attention models have shown their flexibility in parallel computation and their effectiveness in modeling both long- and short-term dependencies. However, they calculate the dependencies between representations without considering the…

Computation and Language · Computer Science · 2019-02-18 · Baosong Yang, Jian Li, Derek Wong, Lidia S. Chao, Xing Wang, Zhaopeng Tu