Related papers: RAC: Relation-Aware Cache Replacement for Large La…

Cache Mechanism for Agent RAG Systems

Recent advances in Large Language Model (LLM)-based agents have been propelled by Retrieval-Augmented Generation (RAG), which grants the models access to vast external knowledge bases. Despite RAG's success in improving agent performance,…

Computation and Language · Computer Science 2025-11-06 Shuhang Lin, Zhencan Peng, Lingyao Li, Xiao Lin, Xi Zhu, Yongfeng Zhang

Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation

Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved…

Machine Learning · Computer Science 2026-02-16 Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong

Adaptive Contextual Caching for Mobile Edge Large Language Model Service

Mobile edge Large Language Model (LLM) deployments face inherent constraints, such as limited computational resources and network bandwidth. Although Retrieval-Augmented Generation (RAG) mitigates some challenges by integrating external…

Networking and Internet Architecture · Computer Science 2025-01-17 Guangyuan Liu, Yinqiu Liu, Jiacheng Wang, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong

RAC: Efficient LLM Factuality Correction with Retrieval Augmentation

Large Language Models (LLMs) exhibit impressive results across a wide range of natural language processing (NLP) tasks, yet they can often produce factually incorrect outputs. This paper introduces a simple but effective low-latency…

Computation and Language · Computer Science 2024-10-22 Changmao Li, Jeffrey Flanigan

RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, RAG introduces long…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-26 Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin

SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models

Large Language Models (LLMs) have become increasingly popular, transforming a wide range of applications across various domains. However, the real-world effectiveness of their query cache systems has not been thoroughly investigated. In…

Computation and Language · Computer Science 2024-06-04 Jiaxing Li, Chi Xu, Feng Wang, Isaac M von Riedemann, Cong Zhang, Jiangchuan Liu

IC-Cache: Efficient Large Language Model Serving via In-context Caching

Large language models (LLMs) have excelled in various applications, yet serving them at scale is challenging due to their substantial resource demands and high latency. Our real-world studies reveal that over 70% of user requests to LLMs…

Machine Learning · Computer Science 2025-09-05 Yifan Yu, Yu Gan, Nikhil Sarda, Lillian Tsai, Jiaming Shen, Yanqi Zhou, Arvind Krishnamurthy, Fan Lai, Henry M. Levy, David Culler

REFRAG: Rethinking RAG based Decoding

Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing…

Computation and Language · Computer Science 2025-10-14 Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan

ToolCaching: Towards Efficient Caching for LLM Tool-calling

Recent advances in Large Language Models (LLMs) have revolutionized web applications, enabling intelligent search, recommendation, and assistant services with natural language interfaces. Tool-calling extends LLMs with the ability to…

Software Engineering · Computer Science 2026-01-23 Yi Zhai, Dian Shen, Junzhou Luo, Bin Yang

Context Awareness Gate For Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) has emerged as a widely adopted approach to mitigate the limitations of large language models (LLMs) in answering domain-specific questions. Previous research has predominantly focused on improving the…

Machine Learning · Computer Science 2025-01-07 Mohammad Hassan Heydari, Arshia Hemmat, Erfan Naman, Afsaneh Fatemi

ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models

Semantic caching significantly reduces computational costs and improves efficiency by storing and reusing large language model (LLM) responses. However, existing systems rely primarily on matching individual queries, lacking awareness of…

Computation and Language · Computer Science 2025-07-16 Jianxin Yan, Wangze Ni, Lei Chen, Xuemin Lin, Peng Cheng, Zhan Qin, Kui Ren

Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems

Retrieval-Augmented Generation (RAG) systems critically depend on effective document chunking strategies to balance retrieval quality, latency, and operational cost. Traditional chunking approaches, such as fixed-size, rule-based, or fully…

Information Retrieval · Computer Science 2026-04-08 Uday Allu, Sonu Kedia, Tanmay Odapally, Biddwan Ahmed

LRC: Dependency-Aware Cache Management for Data Analytics Clusters

Memory caches are being aggressively used in today's data-parallel systems such as Spark, Tez, and Piccolo. However, prevalent systems employ rather simple cache management policies--notably the Least Recently Used (LRU) policy--that are…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-03-27 Yinghao Yu, Wei Wang, Jun Zhang, Khaled Ben Letaief

LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs -- No Silver Bullet for LC or RAG Routing

Effectively incorporating external knowledge into Large Language Models (LLMs) is crucial for enhancing their capabilities and addressing real-world needs. Retrieval-Augmented Generation (RAG) offers an effective method for achieving this…

Computation and Language · Computer Science 2025-03-06 Kuan Li, Liwen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Shuai Wang, Minhao Cheng

Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query

Large language models (LLMs) rely on key-value cache (KV cache) to accelerate decoding by reducing redundant computations. However, the KV cache memory usage grows substantially with longer text sequences, posing challenges for efficient…

Computation and Language · Computer Science 2025-11-18 Yixuan Wang, Shiyu Ji, Yijun Liu, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che

Clustered Retrieved Augmented Generation (CRAG)

Providing external knowledge to Large Language Models (LLMs) is a key point for using these models in real-world applications for several reasons, such as incorporating up-to-date content in a real-time manner, providing access to…

Computation and Language · Computer Science 2024-06-04 Simon Akesson, Frances A. Santos

RALLRec: Improving Retrieval Augmented Large Language Model Recommendation with Representation Learning

Large Language Models (LLMs) have been integrated into recommendation systems to enhance user behavior comprehension. The Retrieval Augmented Generation (RAG) technique is further incorporated into these systems to retrieve more relevant…

Information Retrieval · Computer Science 2025-02-12 Jian Xu, Sichun Luo, Xiangyu Chen, Haoming Huang, Hanxu Hou, Linqi Song

Rephrase and Contrast: Fine-Tuning Language Models for Enhanced Understanding of Communication and Computer Networks

Large language models (LLMs) are being widely researched across various disciplines, with significant recent efforts focusing on adapting LLMs for understanding of how communication networks operate. However, over-reliance on prompting…

Computation and Language · Computer Science 2024-10-22 Liujianfu Wang, Yuyang Du, Jingqi Lin, Kexin Chen, Soung Chang Liew

RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents

Owing to recent advancements, Large Language Models (LLMs) can now be deployed as agents for increasingly complex decision-making applications in areas including robotics, gaming, and API integration. However, reflecting past experiences in…

Machine Learning · Computer Science 2024-02-07 Tomoyuki Kagaya, Thong Jing Yuan, Yuxuan Lou, Jayashree Karlekar, Sugiri Pranata, Akira Kinose, Koki Oguri, Felix Wick, Yang You

REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering

Considering the limited internal parametric knowledge, retrieval-augmented generation (RAG) has been widely used to extend the knowledge scope of large language models (LLMs). Despite the extensive efforts on RAG research, in existing…

Computation and Language · Computer Science 2024-11-22 Yuhao Wang, Ruiyang Ren, Junyi Li, Wayne Xin Zhao, Jing Liu, Ji-Rong Wen