Related papers: InstCache: A Predictive Cache for LLM Serving

IC-Cache: Efficient Large Language Model Serving via In-context Caching

Large language models (LLMs) have excelled in various applications, yet serving them at scale is challenging due to their substantial resource demands and high latency. Our real-world studies reveal that over 70% of user requests to LLMs…

Machine Learning · Computer Science 2025-09-05 Yifan Yu , Yu Gan , Nikhil Sarda , Lillian Tsai , Jiaming Shen , Yanqi Zhou , Arvind Krishnamurthy , Fan Lai , Henry M. Levy , David Culler

Efficient LLM Inference with Kcache

Large Language Models (LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KV Cache technology is one of the most widely used techniques in the industry. It ensures…

Computation and Language · Computer Science 2024-04-30 Qiaozhi He , Zhihua Wu

Prompt Cache: Modular Attention Reuse for Low-Latency Inference

We present Prompt Cache, an approach for accelerating inference for large language models (LLM) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt…

Computation and Language · Computer Science 2024-04-26 In Gim , Guojun Chen , Seung-seob Lee , Nikhil Sarda , Anurag Khandelwal , Lin Zhong

Online Scheduling for LLM Inference with KV Cache Constraints

Large Language Model (LLM) inference, where a trained model generates text one word at a time in response to user prompts, is a computationally intensive process requiring efficient scheduling to optimize latency and resource utilization. A…

Machine Learning · Computer Science 2026-01-16 Patrick Jaillet , Jiashuo Jiang , Konstantina Mellou , Marco Molinaro , Chara Podimata , Zijie Zhou

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

KV cache has traditionally been stored in GPU memory to accelerate the decoding phase of large language model (LLM) inference. However, it is increasingly necessary to move KV caches outside GPU devices, to enable cache reuse across…

Machine Learning · Computer Science 2025-12-08 Yuhan Liu , Yihua Cheng , Jiayi Yao , Yuwei An , Xiaokun Chen , Shaoting Feng , Yuyang Huang , Samuel Shen , Rui Zhang , Kuntai Du , Junchen Jiang

MeanCache: User-Centric Semantic Caching for LLM Web Services

Large Language Models (LLMs) like ChatGPT and Llama have revolutionized natural language processing and search engine dynamics. However, these models incur exceptionally high computational costs. For instance, GPT-3 consists of 175 billion…

Machine Learning · Computer Science 2025-09-15 Waris Gill , Mohamed Elidrisi , Pallavi Kalapatapu , Ammar Ahmed , Ali Anwar , Muhammad Ali Gulzar

AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache

Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many real-world workloads such as classification, question answering, recommendation, and text embedding rely…

Computation and Language · Computer Science 2025-11-13 Dinghong Song , Yuan Feng , Yiwei Wang , Shangye Chen , Cyril Guyot , Filip Blagojevic , Hyeran Jeon , Pengfei Su , Dong Li

LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference

Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-time and large-scale deployment. While existing caching…

Computation and Language · Computer Science 2026-03-03 Harsh Vardhan Bansal

KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider

Serving large language models (LLMs) is important for cloud providers, and caching intermediate results (KV$) after processing each request substantially improves serving throughput and latency. However, there is limited understanding of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-17 Jiahao Wang , Jinbo Han , Xingda Wei , Sijie Shen , Dingyan Zhang , Chenguang Fang , Rong Chen , Wenyuan Yu , Haibo Chen

InputSnatch: Stealing Input in LLM Services via Timing Side-Channel Attacks

Large language models (LLMs) possess extensive knowledge and question-answering capabilities, having been widely deployed in privacy-sensitive domains like finance and medical consultation. During LLM inferences, cache-sharing methods are…

Cryptography and Security · Computer Science 2024-12-02 Xinyao Zheng , Husheng Han , Shangyi Shi , Qiyan Fang , Zidong Du , Xing Hu , Qi Guo

LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling

Large Language Models (LLMs) have achieved unprecedented success across various applications, but their substantial memory requirements pose significant challenges to current memory system designs, especially during inference. Our work…

Hardware Architecture · Computer Science 2025-12-02 Zhongchun Zhou , Chengtao Lai , Wei Zhang

XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference

Recently the generative Large Language Model (LLM) has achieved remarkable success in numerous applications. Notably its inference generates output tokens one-by-one, leading to many redundant computations. The widely-used KV-Cache…

Machine Learning · Computer Science 2024-12-10 Weizhuo Li , Zhigang Wang , Yu Gu , Ge Yu

FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework

Multi-modal Large Language Models (MLLMs) serving systems commonly employ KV-cache compression to reduce memory footprint. However, existing compression methods introduce significant processing overhead and queuing delays, particularly in…

Multimedia · Computer Science 2025-03-12 Jianian Zhu , Hang Wu , Haojie Wang , Yinghui Li , Biao Hou , Ruixuan Li , Jidong Zhai

SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models

Large Language Models (LLMs) have become increasingly popular, transforming a wide range of applications across various domains. However, the real-world effectiveness of their query cache systems has not been thoroughly investigated. In…

Computation and Language · Computer Science 2024-06-04 Jiaxing Li , Chi Xu , Feng Wang , Isaac M von Riedemann , Cong Zhang , Jiangchuan Liu

SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs

Transformer-based large language models (LLMs) have already achieved remarkable results on long-text tasks, but the limited GPU memory (VRAM) resources struggle to accommodate the linearly growing demand for key-value (KV) cache as the…

Computation and Language · Computer Science 2025-03-21 Shibo Jie , Yehui Tang , Kai Han , Zhi-Hong Deng , Jing Han

A Generative Caching System for Large Language Models

Caching has the potential to be of significant benefit for accessing large language models (LLMs) due to their high latencies which typically range from a small number of seconds to well over a minute. Furthermore, many LLMs charge money…

Databases · Computer Science 2025-03-25 Arun Iyengar , Ashish Kundu , Ramana Kompella , Sai Nandan Mamidi

ToolCaching: Towards Efficient Caching for LLM Tool-calling

Recent advances in Large Language Models (LLMs) have revolutionized web applications, enabling intelligent search, recommendation, and assistant services with natural language interfaces. Tool-calling extends LLMs with the ability to…

Software Engineering · Computer Science 2026-01-23 Yi Zhai , Dian Shen , Junzhou Luo , Bin Yang

LLM-dCache: Improving Tool-Augmented LLMs with GPT-Driven Localized Data Caching

As Large Language Models (LLMs) broaden their capabilities to manage thousands of API calls, they are confronted with complex data operations across vast datasets with significant overhead to the underlying system. In this work, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-24 Simranjit Singh , Michael Fore , Andreas Karatzas , Chaehong Lee , Yanan Jian , Longfei Shangguan , Fuxun Yu , Iraklis Anagnostopoulos , Dimitrios Stamoulis

Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation

Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved…

Machine Learning · Computer Science 2026-02-16 Xutong Liu , Baran Atalar , Xiangxiang Dai , Jinhang Zuo , Siwei Wang , John C. S. Lui , Wei Chen , Carlee Joe-Wong

LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models

Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As…

Machine Learning · Computer Science 2025-07-22 Dachuan Shi , Yonggan Fu , Xiangchi Yuan , Zhongzhi Yu , Haoran You , Sixu Li , Xin Dong , Jan Kautz , Pavlo Molchanov , Yingyan Lin