Related papers: Multi-matrix Factorization Attention

MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning

As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly, becoming a major bottleneck in both training and inference. While prior works such as…

Machine Learning · Computer Science 2026-03-25 Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu

Sparse Query Attention (SQA): A Computationally Efficient Attention Mechanism with Query Heads Reduction

The Transformer architecture, underpinned by the Multi-Head Attention (MHA) mechanism, has become the de facto standard for state-of-the-art models in artificial intelligence. However, the quadratic computational complexity of MHA with…

Machine Learning · Computer Science 2025-10-03 Adam Filipek

Multi-head Temporal Latent Attention

While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the…

Machine Learning · Computer Science 2025-11-04 Keqi Deng, Philip C. Woodland

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence…

Machine Learning · Computer Science 2024-05-22 William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly

CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many…

Machine Learning · Computer Science 2026-03-19 Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, Shuaiwen Leon Song

Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space

Multi-headed Attention's (MHA) quadratic compute and linearly growing KV-cache make long-context transformers expensive to train and serve. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) shrink the cache,…

Computation and Language · Computer Science 2026-03-18 Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, Beren Millidge

Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs

Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA,…

Computation and Language · Computer Science 2025-10-06 Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui

Mixture of Attention Schemes (MoAS): Learning to Route Between MHA, GQA, and MQA

The choice of attention mechanism in Transformer models involves a critical trade-off between modeling quality and inference efficiency. Multi-Head Attention (MHA) offers the best quality but suffers from large Key-Value (KV) cache memory…

Artificial Intelligence · Computer Science 2025-12-25 Esmail Gumaan

Tucker Attention: A generalization of approximate attention mechanisms

The pursuit of reducing the memory footprint of the self-attention mechanism in multi-headed self attention (MHA) spawned a rich portfolio of methods, e.g., group-query attention (GQA) and multi-head latent attention (MLA). The methods…

Machine Learning · Computer Science 2026-04-01 Timon Klein, Jonas Kusch, Sebastian Sager, Stefan Schnake, Steffen Schotthöfer

Multi-Head Low-Rank Attention

Long-context inference in large language models is bottlenecked by Key-Value (KV) cache loading during the decoding stage, where the sequential nature of generation requires repeatedly transferring the KV cache from off-chip High-Bandwidth…

Machine Learning · Computer Science 2026-03-03 Songtao Liu, Hongwu Peng, Zhiwei Zhang, Zhengyu Chen, Yue Guo

Knocking-Heads Attention

Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens individual head…

Computation and Language · Computer Science 2025-10-28 Zhanchao Zhou, Xiaodong Chen, Haoxing Chen, Zhenzhong Lan, Jianguo Li

Effectively Compress KV Heads for LLM

The advent of pre-trained large language models (LLMs) has revolutionized various natural language processing tasks. These models predominantly employ an auto-regressive decoding mechanism that utilizes Key-Value (KV) caches to eliminate…

Computation and Language · Computer Science 2024-06-12 Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu

Hardware-Centric Analysis of DeepSeek's Multi-Head Latent Attention

Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, improves the efficiency of large language models by projecting query, key, and value tensors into a compact latent space. This architectural change reduces the KV-cache size and…

Hardware Architecture · Computer Science 2026-04-10 Robin Geens, Marian Verhelst

Lossless KV Cache Compression to 2%

Large language models have revolutionized data processing in numerous domains, with their ability to handle extended context reasoning receiving notable recognition. To speed up inference, maintaining a key-value (KV) cache memory is…

Computation and Language · Computer Science 2024-10-22 Zhen Yang, J. N. Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang

Hardware-Efficient Attention for Fast Decoding

LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decoding limits parallelism. We analyze the…

Machine Learning · Computer Science 2025-05-28 Ted Zadouri, Hubert Strauss, Tri Dao

Low-Rank Key Value Attention

The key-value (KV) cache is a primary memory bottleneck in Transformers. We propose Low-Rank Key-Value (LRKV) attention, which reduces KV cache memory by exploiting redundancy across attention heads, while being compute efficient. Each…

Machine Learning · Computer Science 2026-04-09 James O'Neill, Robert Clancy, Mariia Matskevichus, Fergal Reid

X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression

Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression. Rather than caching keys and values separately, MLA stores their compressed latent representations, reducing memory…

Computation and Language · Computer Science 2025-09-09 Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, Emad Barsoum

QCQA: Quality and Capacity-aware grouped Query Attention

Excessive memory requirements of key and value features (KV-cache) present significant challenges in the autoregressive inference of large language models (LLMs), restricting both the speed and length of text generation. Approaches such as…

Computation and Language · Computer Science 2024-06-18 Vinay Joshi, Prashant Laddha, Shambhavi Sinha, Om Ji Omer, Sreenivas Subramoney

EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs

Reducing the key-value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs), especially under latency and memory constraints. While Multi-Head Attention (MHA) offers strong representational…

Computation and Language · Computer Science 2025-09-23 Zhengge Cai, Haowen Hou

MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models

As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA)…

Computer Vision and Pattern Recognition · Computer Science 2026-01-19 Xiaoran Fan, Zhichao Sun, Tao Ji, Lixing Shen, Tao Gui