Related papers: IC-Cache: Efficient Large Language Model Serving v…

A Generative Caching System for Large Language Models

Caching has the potential to be of significant benefit for accessing large language models (LLMs) due to their high latencies which typically range from a small number of seconds to well over a minute. Furthermore, many LLMs charge money…

Databases · Computer Science 2025-03-25 Arun Iyengar, Ashish Kundu, Ramana Kompella, Sai Nandan Mamidi

ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models

Semantic caching significantly reduces computational costs and improves efficiency by storing and reusing large language model (LLM) responses. However, existing systems rely primarily on matching individual queries, lacking awareness of…

Computation and Language · Computer Science 2025-07-16 Jianxin Yan, Wangze Ni, Lei Chen, Xuemin Lin, Peng Cheng, Zhan Qin, Kui Ren

InstCache: A Predictive Cache for LLM Serving

The revolutionary capabilities of Large Language Models (LLMs) are attracting rapidly growing popularity and leading to soaring user requests to inference serving systems. Caching techniques, which leverage data reuse to reduce computation,…

Computation and Language · Computer Science 2025-07-15 Longwei Zou, Yan Liu, Jiamu Kang, Tingfeng Liu, Jiangang Kong, Yangdong Deng

SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models

Large Language Models (LLMs) have become increasingly popular, transforming a wide range of applications across various domains. However, the real-world effectiveness of their query cache systems has not been thoroughly investigated. In…

Computation and Language · Computer Science 2024-06-04 Jiaxing Li, Chi Xu, Feng Wang, Isaac M von Riedemann, Cong Zhang, Jiangchuan Liu

Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation

Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved…

Machine Learning · Computer Science 2026-02-16 Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Large Language Models (LLMs) demonstrate substantial potential across a diverse array of domains via request serving. However, as trends continue to push for expanding context sizes, the autoregressive nature of LLMs results in highly…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-08 Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin

Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains that are not attainable by a single model. In existing designs, LLMs communicate through text, forcing…

Computation and Language · Computer Science 2026-03-04 Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang

MeanCache: User-Centric Semantic Caching for LLM Web Services

Large Language Models (LLMs) like ChatGPT and Llama have revolutionized natural language processing and search engine dynamics. However, these models incur exceptionally high computational costs. For instance, GPT-3 consists of 175 billion…

Machine Learning · Computer Science 2025-09-15 Waris Gill, Mohamed Elidrisi, Pallavi Kalapatapu, Ammar Ahmed, Ali Anwar, Muhammad Ali Gulzar

EPIC: Efficient Position-Independent Caching for Serving Large Language Models

Large Language Models (LLMs) show great capabilities in a wide range of applications, but serving them efficiently becomes increasingly challenging as requests (prompts) become more complex. Context caching improves serving performance by…

Machine Learning · Computer Science 2025-05-28 Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie

LLM-dCache: Improving Tool-Augmented LLMs with GPT-Driven Localized Data Caching

As Large Language Models (LLMs) broaden their capabilities to manage thousands of API calls, they are confronted with complex data operations across vast datasets with significant overhead to the underlying system. In this work, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-24 Simranjit Singh, Michael Fore, Andreas Karatzas, Chaehong Lee, Yanan Jian, Longfei Shangguan, Fuxun Yu, Iraklis Anagnostopoulos, Dimitrios Stamoulis

Adaptive Contextual Caching for Mobile Edge Large Language Model Service

Mobile edge Large Language Model (LLM) deployments face inherent constraints, such as limited computational resources and network bandwidth. Although Retrieval-Augmented Generation (RAG) mitigates some challenges by integrating external…

Networking and Internet Architecture · Computer Science 2025-01-17 Guangyuan Liu, Yinqiu Liu, Jiacheng Wang, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong

Prompt Cache: Modular Attention Reuse for Low-Latency Inference

We present Prompt Cache, an approach for accelerating inference for large language models (LLM) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt…

Computation and Language · Computer Science 2024-04-26 In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong

Efficient LLM Inference with Kcache

Large Language Models (LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KV Cache technology is one of the most widely used techniques in the industry. It ensures…

Computation and Language · Computer Science 2024-04-30 Qiaozhi He, Zhihua Wu

TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses

Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency. However, preserving relevance to user queries using this approach proves difficult…

Machine Learning · Computer Science 2025-09-11 Muhammad Taha Cheema, Abeer Aamir, Khawaja Gul Muhammad, Naveed Anwar Bhatti, Ihsan Ayyub Qazi, Zafar Ayyub Qazi

GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching

Large Language Models (LLMs), such as GPT, have revolutionized artificial intelligence by enabling nuanced understanding and generation of human-like text across a wide range of applications. However, the high computational and financial…

Machine Learning · Computer Science 2024-12-10 Sajal Regmi, Chetan Phakami Pun

Distilling Many-Shot In-Context Learning into a Cheat Sheet

Recent advances in large language models (LLMs) enable effective in-context learning (ICL) with many-shot examples, but at the cost of high computational demand due to longer input tokens. To address this, we propose cheat-sheet ICL, which…

Computation and Language · Computer Science 2025-09-26 Ukyo Honda, Soichiro Murakami, Peinan Zhang

Large Language Models Know What Makes Exemplary Contexts

In-context learning (ICL) has proven to be a significant capability with the advancement of Large Language models (LLMs). By instructing LLMs using few-shot demonstrative examples, ICL enables them to perform a wide range of tasks without…

Computation and Language · Computer Science 2024-08-21 Quanyu Long, Jianda Chen, Wenya Wang, Sinno Jialin Pan

In-Context Learning can Perform Continual Learning Like Humans

Large language models (LLMs) can adapt to new tasks via in-context learning (ICL) without parameter updates, making them powerful learning engines for fast adaptation. While extensive research has examined ICL as a few-shot learner, whether…

Machine Learning · Computer Science 2025-09-30 Liuwang Kang, Fan Wang, Shaoshan Liu, Hung-Chyun Chou, Chuan Lin, Ning Ding

Cascadia: An Efficient Cascade Serving System for Large Language Models

Recent advances in large language models (LLMs) have intensified the need to deliver both rapid responses and high-quality outputs. More powerful models yield better results but incur higher inference latency, whereas smaller models are…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-01 Youhe Jiang, Fangcheng Fu, Wanru Zhao, Stephan Rabanser, Jintao Zhang, Nicholas D. Lane, Binhang Yuan

ToolCaching: Towards Efficient Caching for LLM Tool-calling

Recent advances in Large Language Models (LLMs) have revolutionized web applications, enabling intelligent search, recommendation, and assistant services with natural language interfaces. Tool-calling extends LLMs with the ability to…

Software Engineering · Computer Science 2026-01-23 Yi Zhai, Dian Shen, Junzhou Luo, Bin Yang