Related papers: Dynamic Memory Compression: Retrofitting LLMs for …

Inference-Time Hyper-Scaling with KV Cache Compression

Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than…

Machine Learning · Computer Science 2025-11-10 Adrian Łańcucki , Konrad Staniszewski , Piotr Nawrot , Edoardo M. Ponti

CMT: A Memory Compression Method for Continual Knowledge Learning of Large Language Models

Large Language Models (LLMs) need to adapt to the continuous changes in data, tasks, and user preferences. Due to their massive size and the high costs associated with training, LLMs are not suitable for frequent retraining. However,…

Computation and Language · Computer Science 2024-12-11 Dongfang Li , Zetian Sun , Xinshuo Hu , Baotian Hu , Min Zhang

Memory Bank Compression for Continual Adaptation of Large Language Models

Large Language Models (LLMs) have become a mainstay for many everyday applications. However, as data evolve their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously…

Machine Learning · Computer Science 2026-01-05 Thomas Katraouras , Dimitrios Rafailidis

LoMA: Lossless Compressed Memory Attention

Large Language Models (LLMs) face limitations due to the high demand on GPU memory and computational resources when handling long contexts. While sparsify the Key-Value (KV) cache of transformer model is a typical strategy to alleviate…

Machine Learning · Computer Science 2024-02-06 Yumeng Wang , Zhenyang Xiao

LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), enabling faster inference by storing previously computed KV vectors. However, its memory consumption scales linearly…

Machine Learning · Computer Science 2024-10-07 Rongzhi Zhang , Kuang Wang , Liyuan Liu , Shuohang Wang , Hao Cheng , Chao Zhang , Yelong Shen

Reimagining Memory Access for LLM Inference: Compression-Aware Memory Controller Design

The efficiency of Large Language Model~(LLM) inference is often constrained by substantial memory bandwidth and capacity demands. Existing techniques, such as pruning, quantization, and mixture of experts/depth, reduce memory capacity…

Hardware Architecture · Computer Science 2025-04-23 Rui Xie , Asad Ul Haq , Linsen Ma , Yunhua Fang , Zirak Burzin Engineer , Liu Liu , Tong Zhang

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Large Language Models are growing in size, and we expect them to continue to do so, as larger models train quicker. However, this increase in size will severely impact inference costs. Therefore model compression is important, to retain the…

Machine Learning · Computer Science 2024-04-10 Georgy Tyukin

Dynamic Compressing Prompts for Efficient Inference of Large Language Models

Large Language Models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques. However, these techniques often require lengthy prompts, which increase computational costs and can…

Computation and Language · Computer Science 2025-04-16 Jinwu Hu , Wei Zhang , Yufeng Wang , Yu Hu , Bin Xiao , Mingkui Tan , Qing Du

Reducing Peak Memory Usage for Modern Multimodal Large Language Model Pipelines

Multimodal large language models (MLLMs) have recently demonstrated strong capabilities in understanding and generating responses from diverse visual inputs, including high-resolution images and long video sequences. As these models scale…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Junwan Kim , Hyunkyung Bae

Effectively Compress KV Heads for LLM

The advent of pre-trained large language models (LLMs) has revolutionized various natural language processing tasks. These models predominantly employ an auto-regressive decoding mechanism that utilizes Key-Value (KV) caches to eliminate…

Computation and Language · Computer Science 2024-06-12 Hao Yu , Zelan Yang , Shen Li , Yong Li , Jianxin Wu

KV Cache Compression for Inference Efficiency in LLMs: A Review

Withtherapid advancement of large language models (LLMs), the context length for inference has been continuously increasing, leading to an exponential growth in the demand for Key-Value (KV) caching. This has resulted in a significant…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-11 Yanyu Liu , Jingying Fu , Sixiang Liu , Yitian Zou , You Fu , Jiehan Zhou , Shouhua Zhang

D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models

Generative inference in Large Language Models (LLMs) is impeded by the growing memory demands of Key-Value (KV) cache, especially for longer sequences. Traditional KV cache eviction strategies, which discard less critical KV pairs based on…

Computation and Language · Computer Science 2025-03-14 Zhongwei Wan , Xinjian Wu , Yu Zhang , Yi Xin , Chaofan Tao , Zhihong Zhu , Xin Wang , Siqi Luo , Jing Xiong , Longyue Wang , Mi Zhang

X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression

Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression. Rather than caching keys and values separately, MLA stores their compressed latent representations, reducing memory…

Computation and Language · Computer Science 2025-09-09 Guihong Li , Mehdi Rezagholizadeh , Mingyu Yang , Vikram Appia , Emad Barsoum

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)

Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce…

Machine Learning · Computer Science 2026-01-05 Tianyi Zhang , Mohsen Hariri , Shaochen Zhong , Vipin Chaudhary , Yang Sui , Xia Hu , Anshumali Shrivastava

When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models

Large language models (LLMs) exhibit excellent performance in various tasks. However, the memory requirements of LLMs present a great challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a…

Computation and Language · Computer Science 2025-02-24 Weilan Wang , Yu Mao , Dongdong Tang , Hongchao Du , Nan Guan , Chun Jason Xue

Developing Adaptive Context Compression Techniques for Large Language Models (LLMs) in Long-Running Interactions

Large Language Models (LLMs) often experience performance degradation during long-running interactions due to increasing context length, memory saturation, and computational overhead. This paper presents an adaptive context compression…

Computer Vision and Pattern Recognition · Computer Science 2026-04-01 Payal Fofadiya , Sunil Tiwari

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

A critical approach for efficiently deploying computationally demanding large language models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of previously generated tokens, significantly reducing the need for…

Computation and Language · Computer Science 2024-09-10 Akide Liu , Jing Liu , Zizheng Pan , Yefei He , Gholamreza Haffari , Bohan Zhuang

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked…

Machine Learning · Computer Science 2025-06-10 Zhiyuan Liu , Yicun Yang , Yaojie Zhang , Junjie Chen , Chang Zou , Qingyuan Wei , Shaobo Wang , Linfeng Zhang

A Survey on Transformer Compression

Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV), specially for constructing large language models (LLM) and large vision models (LVM). Model compression methods reduce the memory…

Machine Learning · Computer Science 2024-04-09 Yehui Tang , Yunhe Wang , Jianyuan Guo , Zhijun Tu , Kai Han , Hailin Hu , Dacheng Tao

Recurrent Context Compression: Efficiently Expanding the Context Window of LLM

To extend the context length of Transformer-based large language models (LLMs) and improve comprehension capabilities, we often face limitations due to computational resources and bounded memory storage capacity. This work introduces a…

Computation and Language · Computer Science 2024-06-11 Chensen Huang , Guibo Zhu , Xuepeng Wang , Yifei Luo , Guojing Ge , Haoran Chen , Dong Yi , Jinqiao Wang