MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

Machine Learning 2024-10-16 v3

Abstract

Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach that extends KV sharing across transformer layers to reduce memory usage beyond what is possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, shrinking the KV cache by up to a factor of 6x compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale. We provide code at https://github.com/zaydzuhri/pythia-mlkv
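
To make the 6x figure concrete, the following is a rough back-of-the-envelope sketch rather than the authors' code. It assumes a Pythia-160M-like shape (12 layers, 12 attention heads, head dimension 64), 2-byte cache values, and an MLKV variant with 2 KV heads in total across all layers; the function name `kv_cache_bytes` and these settings are illustrative assumptions, not taken from the repository.

```python
def kv_cache_bytes(total_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Approximate KV cache size in bytes (hypothetical helper).

    The leading factor of 2 accounts for storing one key tensor and
    one value tensor for every KV head in the model.
    """
    return 2 * total_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed Pythia-160M-like shape; treat these numbers as illustrative.
LAYERS, HEADS, HEAD_DIM = 12, 12, 64

# Total KV heads across the whole model for each sharing scheme.
total_kv_heads = {
    "MHA": LAYERS * HEADS,  # every query head has its own KV head
    "MQA": LAYERS * 1,      # one KV head per layer
    "MLKV": 2,              # assumed: 2 KV heads shared across all 12 layers
}

sizes = {
    name: kv_cache_bytes(n, HEAD_DIM, seq_len=2048, batch_size=1)
    for name, n in total_kv_heads.items()
}

# Ratio of the MQA cache to the MLKV cache under these assumptions.
mqa_over_mlkv = sizes["MQA"] / sizes["MLKV"]

for name, size in sizes.items():
    print(f"{name}: {size / 2**20:.2f} MiB")
print(f"MQA / MLKV cache ratio: {mqa_over_mlkv:.1f}")
```

Under these assumptions, MQA keeps 12 KV heads (one per layer) while the MLKV variant keeps 2, so the cache ratio is 12 / 2 = 6. Sequence length, batch size, and value precision cancel out of the ratio.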

Cite

@article{arxiv.2406.09297,
  title  = {MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding},
  author = {Zayd Muhammad Kawakibi Zuhri and Muhammad Farid Adilazuarda and Ayu Purwarianti and Alham Fikri Aji},
  journal= {arXiv preprint arXiv:2406.09297},
  year   = {2024}
}