English
Related papers

Related papers: Dynamic Memory Compression: Retrofitting LLMs for …

200 papers

Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than…

Machine Learning · Computer Science 2025-11-10 Adrian Łańcucki , Konrad Staniszewski , Piotr Nawrot , Edoardo M. Ponti

Large Language Models (LLMs) need to adapt to the continuous changes in data, tasks, and user preferences. Due to their massive size and the high costs associated with training, LLMs are not suitable for frequent retraining. However,…

Computation and Language · Computer Science 2024-12-11 Dongfang Li , Zetian Sun , Xinshuo Hu , Baotian Hu , Min Zhang

Large Language Models (LLMs) have become a mainstay for many everyday applications. However, as data evolve their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously…

Machine Learning · Computer Science 2026-01-05 Thomas Katraouras , Dimitrios Rafailidis

Large Language Models (LLMs) face limitations due to the high demand on GPU memory and computational resources when handling long contexts. While sparsify the Key-Value (KV) cache of transformer model is a typical strategy to alleviate…

Machine Learning · Computer Science 2024-02-06 Yumeng Wang , Zhenyang Xiao

The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), enabling faster inference by storing previously computed KV vectors. However, its memory consumption scales linearly…

Machine Learning · Computer Science 2024-10-07 Rongzhi Zhang , Kuang Wang , Liyuan Liu , Shuohang Wang , Hao Cheng , Chao Zhang , Yelong Shen

The efficiency of Large Language Model~(LLM) inference is often constrained by substantial memory bandwidth and capacity demands. Existing techniques, such as pruning, quantization, and mixture of experts/depth, reduce memory capacity…

Hardware Architecture · Computer Science 2025-04-23 Rui Xie , Asad Ul Haq , Linsen Ma , Yunhua Fang , Zirak Burzin Engineer , Liu Liu , Tong Zhang

Large Language Models are growing in size, and we expect them to continue to do so, as larger models train quicker. However, this increase in size will severely impact inference costs. Therefore model compression is important, to retain the…

Machine Learning · Computer Science 2024-04-10 Georgy Tyukin

Large Language Models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques. However, these techniques often require lengthy prompts, which increase computational costs and can…

Computation and Language · Computer Science 2025-04-16 Jinwu Hu , Wei Zhang , Yufeng Wang , Yu Hu , Bin Xiao , Mingkui Tan , Qing Du

Multimodal large language models (MLLMs) have recently demonstrated strong capabilities in understanding and generating responses from diverse visual inputs, including high-resolution images and long video sequences. As these models scale…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Junwan Kim , Hyunkyung Bae

The advent of pre-trained large language models (LLMs) has revolutionized various natural language processing tasks. These models predominantly employ an auto-regressive decoding mechanism that utilizes Key-Value (KV) caches to eliminate…

Computation and Language · Computer Science 2024-06-12 Hao Yu , Zelan Yang , Shen Li , Yong Li , Jianxin Wu

Withtherapid advancement of large language models (LLMs), the context length for inference has been continuously increasing, leading to an exponential growth in the demand for Key-Value (KV) caching. This has resulted in a significant…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-11 Yanyu Liu , Jingying Fu , Sixiang Liu , Yitian Zou , You Fu , Jiehan Zhou , Shouhua Zhang

Generative inference in Large Language Models (LLMs) is impeded by the growing memory demands of Key-Value (KV) cache, especially for longer sequences. Traditional KV cache eviction strategies, which discard less critical KV pairs based on…

Computation and Language · Computer Science 2025-03-14 Zhongwei Wan , Xinjian Wu , Yu Zhang , Yi Xin , Chaofan Tao , Zhihong Zhu , Xin Wang , Siqi Luo , Jing Xiong , Longyue Wang , Mi Zhang

Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression. Rather than caching keys and values separately, MLA stores their compressed latent representations, reducing memory…

Computation and Language · Computer Science 2025-09-09 Guihong Li , Mehdi Rezagholizadeh , Mingyu Yang , Vikram Appia , Emad Barsoum

Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce…

Machine Learning · Computer Science 2026-01-05 Tianyi Zhang , Mohsen Hariri , Shaochen Zhong , Vipin Chaudhary , Yang Sui , Xia Hu , Anshumali Shrivastava

Large language models (LLMs) exhibit excellent performance in various tasks. However, the memory requirements of LLMs present a great challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a…

Computation and Language · Computer Science 2025-02-24 Weilan Wang , Yu Mao , Dongdong Tang , Hongchao Du , Nan Guan , Chun Jason Xue

Large Language Models (LLMs) often experience performance degradation during long-running interactions due to increasing context length, memory saturation, and computational overhead. This paper presents an adaptive context compression…

Computer Vision and Pattern Recognition · Computer Science 2026-04-01 Payal Fofadiya , Sunil Tiwari

A critical approach for efficiently deploying computationally demanding large language models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of previously generated tokens, significantly reducing the need for…

Computation and Language · Computer Science 2024-09-10 Akide Liu , Jing Liu , Zizheng Pan , Yefei He , Gholamreza Haffari , Bohan Zhuang

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked…

Machine Learning · Computer Science 2025-06-10 Zhiyuan Liu , Yicun Yang , Yaojie Zhang , Junjie Chen , Chang Zou , Qingyuan Wei , Shaobo Wang , Linfeng Zhang

Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV), specially for constructing large language models (LLM) and large vision models (LVM). Model compression methods reduce the memory…

Machine Learning · Computer Science 2024-04-09 Yehui Tang , Yunhe Wang , Jianyuan Guo , Zhijun Tu , Kai Han , Hailin Hu , Dacheng Tao

To extend the context length of Transformer-based large language models (LLMs) and improve comprehension capabilities, we often face limitations due to computational resources and bounded memory storage capacity. This work introduces a…

Computation and Language · Computer Science 2024-06-11 Chensen Huang , Guibo Zhu , Xuepeng Wang , Yifei Luo , Guojing Ge , Haoran Chen , Dong Yi , Jinqiao Wang
‹ Prev 1 2 3 10 Next ›