Related papers: InstCache: A Predictive Cache for LLM Serving

200 papers

Large language models (LLMs) have excelled in various applications, yet serving them at scale is challenging due to their substantial resource demands and high latency. Our real-world studies reveal that over 70% of user requests to LLMs…

Machine Learning · Computer Science 2025-09-05 Yifan Yu , Yu Gan , Nikhil Sarda , Lillian Tsai , Jiaming Shen , Yanqi Zhou , Arvind Krishnamurthy , Fan Lai , Henry M. Levy , David Culler

Large Language Models (LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KV Cache technology is one of the most widely used techniques in the industry. It ensures…

Computation and Language · Computer Science 2024-04-30 Qiaozhi He , Zhihua Wu

We present Prompt Cache, an approach for accelerating inference for large language models (LLM) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt…

Computation and Language · Computer Science 2024-04-26 In Gim , Guojun Chen , Seung-seob Lee , Nikhil Sarda , Anurag Khandelwal , Lin Zhong

Large Language Model (LLM) inference, where a trained model generates text one word at a time in response to user prompts, is a computationally intensive process requiring efficient scheduling to optimize latency and resource utilization. A…

Machine Learning · Computer Science 2026-01-16 Patrick Jaillet , Jiashuo Jiang , Konstantina Mellou , Marco Molinaro , Chara Podimata , Zijie Zhou

KV cache has traditionally been stored in GPU memory to accelerate the decoding phase of large language model (LLM) inference. However, it is increasingly necessary to move KV caches outside GPU devices, to enable cache reuse across…

Machine Learning · Computer Science 2025-12-08 Yuhan Liu , Yihua Cheng , Jiayi Yao , Yuwei An , Xiaokun Chen , Shaoting Feng , Yuyang Huang , Samuel Shen , Rui Zhang , Kuntai Du , Junchen Jiang

Large Language Models (LLMs) like ChatGPT and Llama have revolutionized natural language processing and search engine dynamics. However, these models incur exceptionally high computational costs. For instance, GPT-3 consists of 175 billion…

Machine Learning · Computer Science 2025-09-15 Waris Gill , Mohamed Elidrisi , Pallavi Kalapatapu , Ammar Ahmed , Ali Anwar , Muhammad Ali Gulzar

Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many real-world workloads such as classification, question answering, recommendation, and text embedding rely…

Computation and Language · Computer Science 2025-11-13 Dinghong Song , Yuan Feng , Yiwei Wang , Shangye Chen , Cyril Guyot , Filip Blagojevic , Hyeran Jeon , Pengfei Su , Dong Li

Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-time and large-scale deployment. While existing caching…

Computation and Language · Computer Science 2026-03-03 Harsh Vardhan Bansal

Serving large language models (LLMs) is important for cloud providers, and caching intermediate results (KV$) after processing each request substantially improves serving throughput and latency. However, there is limited understanding of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-17 Jiahao Wang , Jinbo Han , Xingda Wei , Sijie Shen , Dingyan Zhang , Chenguang Fang , Rong Chen , Wenyuan Yu , Haibo Chen

Large language models (LLMs) possess extensive knowledge and question-answering capabilities, having been widely deployed in privacy-sensitive domains like finance and medical consultation. During LLM inferences, cache-sharing methods are…

Cryptography and Security · Computer Science 2024-12-02 Xinyao Zheng , Husheng Han , Shangyi Shi , Qiyan Fang , Zidong Du , Xing Hu , Qi Guo

Large Language Models (LLMs) have achieved unprecedented success across various applications, but their substantial memory requirements pose significant challenges to current memory system designs, especially during inference. Our work…

Hardware Architecture · Computer Science 2025-12-02 Zhongchun Zhou , Chengtao Lai , Wei Zhang

Recently the generative Large Language Model (LLM) has achieved remarkable success in numerous applications. Notably its inference generates output tokens one-by-one, leading to many redundant computations. The widely-used KV-Cache…

Machine Learning · Computer Science 2024-12-10 Weizhuo Li , Zhigang Wang , Yu Gu , Ge Yu

Multi-modal Large Language Models (MLLMs) serving systems commonly employ KV-cache compression to reduce memory footprint. However, existing compression methods introduce significant processing overhead and queuing delays, particularly in…

Multimedia · Computer Science 2025-03-12 Jianian Zhu , Hang Wu , Haojie Wang , Yinghui Li , Biao Hou , Ruixuan Li , Jidong Zhai

Large Language Models (LLMs) have become increasingly popular, transforming a wide range of applications across various domains. However, the real-world effectiveness of their query cache systems has not been thoroughly investigated. In…

Computation and Language · Computer Science 2024-06-04 Jiaxing Li , Chi Xu , Feng Wang , Isaac M von Riedemann , Cong Zhang , Jiangchuan Liu

Transformer-based large language models (LLMs) have already achieved remarkable results on long-text tasks, but the limited GPU memory (VRAM) resources struggle to accommodate the linearly growing demand for key-value (KV) cache as the…

Computation and Language · Computer Science 2025-03-21 Shibo Jie , Yehui Tang , Kai Han , Zhi-Hong Deng , Jing Han

Caching has the potential to be of significant benefit for accessing large language models (LLMs) due to their high latencies which typically range from a small number of seconds to well over a minute. Furthermore, many LLMs charge money…

Databases · Computer Science 2025-03-25 Arun Iyengar , Ashish Kundu , Ramana Kompella , Sai Nandan Mamidi

Recent advances in Large Language Models (LLMs) have revolutionized web applications, enabling intelligent search, recommendation, and assistant services with natural language interfaces. Tool-calling extends LLMs with the ability to…

Software Engineering · Computer Science 2026-01-23 Yi Zhai , Dian Shen , Junzhou Luo , Bin Yang

As Large Language Models (LLMs) broaden their capabilities to manage thousands of API calls, they are confronted with complex data operations across vast datasets with significant overhead to the underlying system. In this work, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-24 Simranjit Singh , Michael Fore , Andreas Karatzas , Chaehong Lee , Yanan Jian , Longfei Shangguan , Fuxun Yu , Iraklis Anagnostopoulos , Dimitrios Stamoulis

Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved…

Machine Learning · Computer Science 2026-02-16 Xutong Liu , Baran Atalar , Xiangxiang Dai , Jinhang Zuo , Siwei Wang , John C. S. Lui , Wei Chen , Carlee Joe-Wong

Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As…

Machine Learning · Computer Science 2025-07-22 Dachuan Shi , Yonggan Fu , Xiangchi Yuan , Zhongzhi Yu , Haoran You , Sixu Li , Xin Dong , Jan Kautz , Pavlo Molchanov , Yingyan Lin