Multi-matrix Factorization Attention

Jingcheng Hu; Houyi Li; Yinmin Zhang; Zili Wang; Shuigeng Zhou; Xiangyu Zhang; Heung-Yeung Shum; Daxin Jiang

Machine Learning 2025-01-15 v2 Computation and Language

Abstract

We propose two novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants of standard Multi-Head Attention (MHA), including state-of-the-art methods such as MLA, fail to maintain comparably strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and the dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as the value cache through re-parameterization of the value projection. MFA's design enables strong model capacity under a tight KV cache budget, while MFA-KR suits even harsher KV cache limits at the cost of a minor performance trade-off. Notably, in our extensive large-scale experiments, the proposed architecture outperforms MLA and performs comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%, respectively.
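
To make the KV cache budget concrete, the following is a rough accounting sketch, not the authors' implementation. It assumes the cache stores one key vector and one value vector per KV head per layer for every token, that an MFA-style layer keeps a single shared KV head, and that a key-reuse layer derives values from the cached keys and so stores keys only. The function names and all model sizes are hypothetical and are not meant to reproduce the 56% or 93.7% figures above.

```python
# Illustrative KV cache accounting. Every configuration below is hypothetical.

def kv_cache_per_token(n_kv_heads: int, head_dim: int, n_layers: int,
                       store_values: bool = True) -> int:
    """Number of cached elements per token, summed over all layers.

    Each layer caches one key vector per KV head, plus one value vector
    per KV head unless values are derived from the cached keys.
    """
    tensors = 2 if store_values else 1
    return tensors * n_kv_heads * head_dim * n_layers


def reduction(baseline: int, candidate: int) -> float:
    """Fractional cache saving of `candidate` relative to `baseline`."""
    return 1.0 - candidate / baseline


# Hypothetical 24-layer baseline with 8 KV heads of dimension 64.
mha = kv_cache_per_token(n_kv_heads=8, head_dim=64, n_layers=24)

# Assumption: one shared KV head with a larger head dimension.
shared = kv_cache_per_token(n_kv_heads=1, head_dim=128, n_layers=24)

# Assumption: same layout, but only keys are cached.
key_reuse = kv_cache_per_token(n_kv_heads=1, head_dim=128, n_layers=24,
                               store_values=False)

print(f"elements per token: MHA={mha}, shared={shared}, key reuse={key_reuse}")
print(f"saving vs. MHA: shared={reduction(mha, shared):.1%}, "
      f"key reuse={reduction(mha, key_reuse):.1%}")
```

Under these assumptions the cache size depends only on the number of KV heads, the head dimension, and whether values are stored. This is why adding query heads through a factorized query projection would leave the cache unchanged.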

Keywords

key-value store; sequence alignment; nonnegative matrix factorization

Cite

@article{arxiv.2412.19255,
  title  = {Multi-matrix Factorization Attention},
  author = {Jingcheng Hu and Houyi Li and Yinmin Zhang and Zili Wang and Shuigeng Zhou and Xiangyu Zhang and Heung-Yeung Shum and Daxin Jiang},
  journal= {arXiv preprint arXiv:2412.19255},
  year   = {2025}
}
