Related papers: IC-Cache: Efficient Large Language Model Serving v…

200 papers

Caching has the potential to be of significant benefit for accessing large language models (LLMs) due to their high latencies, which typically range from a small number of seconds to well over a minute. Furthermore, many LLMs charge money…

Databases · Computer Science 2025-03-25 Arun Iyengar, Ashish Kundu, Ramana Kompella, Sai Nandan Mamidi

Semantic caching significantly reduces computational costs and improves efficiency by storing and reusing large language model (LLM) responses. However, existing systems rely primarily on matching individual queries, lacking awareness of…

Computation and Language · Computer Science 2025-07-16 Jianxin Yan, Wangze Ni, Lei Chen, Xuemin Lin, Peng Cheng, Zhan Qin, Kui Ren

The revolutionary capabilities of Large Language Models (LLMs) are attracting rapidly growing popularity and leading to soaring user requests to inference serving systems. Caching techniques, which leverage data reuse to reduce computation,…

Computation and Language · Computer Science 2025-07-15 Longwei Zou, Yan Liu, Jiamu Kang, Tingfeng Liu, Jiangang Kong, Yangdong Deng

Large Language Models (LLMs) have become increasingly popular, transforming a wide range of applications across various domains. However, the real-world effectiveness of their query cache systems has not been thoroughly investigated. In…

Computation and Language · Computer Science 2024-06-04 Jiaxing Li, Chi Xu, Feng Wang, Isaac M von Riedemann, Cong Zhang, Jiangchuan Liu

Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved…

Machine Learning · Computer Science 2026-02-16 Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong

Large Language Models (LLMs) demonstrate substantial potential across a diverse array of domains via request serving. However, as trends continue to push for expanding context sizes, the autoregressive nature of LLMs results in highly…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-08 Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin

Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains that are not attainable by a single model. In existing designs, LLMs communicate through text, forcing…

Computation and Language · Computer Science 2026-03-04 Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang

Large Language Models (LLMs) like ChatGPT and Llama have revolutionized natural language processing and search engine dynamics. However, these models incur exceptionally high computational costs. For instance, GPT-3 consists of 175 billion…

Machine Learning · Computer Science 2025-09-15 Waris Gill, Mohamed Elidrisi, Pallavi Kalapatapu, Ammar Ahmed, Ali Anwar, Muhammad Ali Gulzar

Large Language Models (LLMs) show great capabilities in a wide range of applications, but serving them efficiently becomes increasingly challenging as requests (prompts) become more complex. Context caching improves serving performance by…

Machine Learning · Computer Science 2025-05-28 Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie

As Large Language Models (LLMs) broaden their capabilities to manage thousands of API calls, they are confronted with complex data operations across vast datasets with significant overhead to the underlying system. In this work, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-24 Simranjit Singh, Michael Fore, Andreas Karatzas, Chaehong Lee, Yanan Jian, Longfei Shangguan, Fuxun Yu, Iraklis Anagnostopoulos, Dimitrios Stamoulis

Mobile edge Large Language Model (LLM) deployments face inherent constraints, such as limited computational resources and network bandwidth. Although Retrieval-Augmented Generation (RAG) mitigates some challenges by integrating external…

Networking and Internet Architecture · Computer Science 2025-01-17 Guangyuan Liu, Yinqiu Liu, Jiacheng Wang, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong

We present Prompt Cache, an approach for accelerating inference for large language models (LLM) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt…

Computation and Language · Computer Science 2024-04-26 In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong

Large Language Models (LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KV Cache technology is one of the most widely used techniques in the industry. It ensures…

Computation and Language · Computer Science 2024-04-30 Qiaozhi He, Zhihua Wu

Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency. However, preserving relevance to user queries using this approach proves difficult…

Large Language Models (LLMs), such as GPT, have revolutionized artificial intelligence by enabling nuanced understanding and generation of human-like text across a wide range of applications. However, the high computational and financial…

Machine Learning · Computer Science 2024-12-10 Sajal Regmi, Chetan Phakami Pun

Recent advances in large language models (LLMs) enable effective in-context learning (ICL) with many-shot examples, but at the cost of high computational demand due to longer input tokens. To address this, we propose cheat-sheet ICL, which…

Computation and Language · Computer Science 2025-09-26 Ukyo Honda, Soichiro Murakami, Peinan Zhang

In-context learning (ICL) has proven to be a significant capability with the advancement of Large Language Models (LLMs). By instructing LLMs using few-shot demonstrative examples, ICL enables them to perform a wide range of tasks without…

Computation and Language · Computer Science 2024-08-21 Quanyu Long, Jianda Chen, Wenya Wang, Sinno Jialin Pan

Large language models (LLMs) can adapt to new tasks via in-context learning (ICL) without parameter updates, making them powerful learning engines for fast adaptation. While extensive research has examined ICL as a few-shot learner, whether…

Machine Learning · Computer Science 2025-09-30 Liuwang Kang, Fan Wang, Shaoshan Liu, Hung-Chyun Chou, Chuan Lin, Ning Ding

Recent advances in large language models (LLMs) have intensified the need to deliver both rapid responses and high-quality outputs. More powerful models yield better results but incur higher inference latency, whereas smaller models are…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-01 Youhe Jiang, Fangcheng Fu, Wanru Zhao, Stephan Rabanser, Jintao Zhang, Nicholas D. Lane, Binhang Yuan

Recent advances in Large Language Models (LLMs) have revolutionized web applications, enabling intelligent search, recommendation, and assistant services with natural language interfaces. Tool-calling extends LLMs with the ability to…

Software Engineering · Computer Science 2026-01-23 Yi Zhai, Dian Shen, Junzhou Luo, Bin Yang