Related papers: Multi-matrix Factorization Attention

200 papers

As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly, becoming a major bottleneck in both training and inference. While prior works such as…

Machine Learning · Computer Science 2026-03-25 Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu

The Transformer architecture, underpinned by the Multi-Head Attention (MHA) mechanism, has become the de facto standard for state-of-the-art models in artificial intelligence. However, the quadratic computational complexity of MHA with…

Machine Learning · Computer Science 2025-10-03 Adam Filipek

While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the…

Machine Learning · Computer Science 2025-11-04 Keqi Deng, Philip C. Woodland

Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence…

Machine Learning · Computer Science 2024-05-22 William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly

Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many…

Multi-headed Attention's (MHA) quadratic compute and linearly growing KV-cache make long-context transformers expensive to train and serve. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) shrink the cache,…

Computation and Language · Computer Science 2026-03-18 Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, Beren Millidge

Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA,…

Computation and Language · Computer Science 2025-10-06 Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui

The choice of attention mechanism in Transformer models involves a critical trade-off between modeling quality and inference efficiency. Multi-Head Attention (MHA) offers the best quality but suffers from large Key-Value (KV) cache memory…

Artificial Intelligence · Computer Science 2025-12-25 Esmail Gumaan

The pursuit of reducing the memory footprint of the self-attention mechanism in multi-headed self attention (MHA) spawned a rich portfolio of methods, e.g., group-query attention (GQA) and multi-head latent attention (MLA). The methods…

Machine Learning · Computer Science 2026-04-01 Timon Klein, Jonas Kusch, Sebastian Sager, Stefan Schnake, Steffen Schotthöfer

Long-context inference in large language models is bottlenecked by Key-Value (KV) cache loading during the decoding stage, where the sequential nature of generation requires repeatedly transferring the KV cache from off-chip High-Bandwidth…

Machine Learning · Computer Science 2026-03-03 Songtao Liu, Hongwu Peng, Zhiwei Zhang, Zhengyu Chen, Yue Guo

Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens individual head…

Computation and Language · Computer Science 2025-10-28 Zhanchao Zhou, Xiaodong Chen, Haoxing Chen, Zhenzhong Lan, Jianguo Li

The advent of pre-trained large language models (LLMs) has revolutionized various natural language processing tasks. These models predominantly employ an auto-regressive decoding mechanism that utilizes Key-Value (KV) caches to eliminate…

Computation and Language · Computer Science 2024-06-12 Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu

Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, improves the efficiency of large language models by projecting query, key, and value tensors into a compact latent space. This architectural change reduces the KV-cache size and…

Hardware Architecture · Computer Science 2026-04-10 Robin Geens, Marian Verhelst

Large language models have revolutionized data processing in numerous domains, with their ability to handle extended context reasoning receiving notable recognition. To speed up inference, maintaining a key-value (KV) cache memory is…

Computation and Language · Computer Science 2024-10-22 Zhen Yang, J. N. Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang

LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decoding limits parallelism. We analyze the…

Machine Learning · Computer Science 2025-05-28 Ted Zadouri, Hubert Strauss, Tri Dao

The key-value (KV) cache is a primary memory bottleneck in Transformers. We propose Low-Rank Key-Value (LRKV) attention, which reduces KV cache memory by exploiting redundancy across attention heads, while being compute efficient. Each…

Machine Learning · Computer Science 2026-04-09 James O'Neill, Robert Clancy, Mariia Matskevichus, Fergal Reid

Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression. Rather than caching keys and values separately, MLA stores their compressed latent representations, reducing memory…

Computation and Language · Computer Science 2025-09-09 Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, Emad Barsoum

Excessive memory requirements of key and value features (KV-cache) present significant challenges in the autoregressive inference of large language models (LLMs), restricting both the speed and length of text generation. Approaches such as…

Computation and Language · Computer Science 2024-06-18 Vinay Joshi, Prashant Laddha, Shambhavi Sinha, Om Ji Omer, Sreenivas Subramoney

Reducing the key-value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs), especially under latency and memory constraints. While Multi-Head Attention (MHA) offers strong representational…

Computation and Language · Computer Science 2025-09-23 Zhengge Cai, Haowen Hou

As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA)…

Computer Vision and Pattern Recognition · Computer Science 2026-01-19 Xiaoran Fan, Zhichao Sun, Tao Ji, Lixing Shen, Tao Gui