Related papers: Accelerating Deep Learning Classification with Err…

Approximations in Deep Learning

The design and implementation of Deep Learning (DL) models is currently receiving a lot of attention from both industry and academia. However, the computational workload associated with DL is often out of reach for low-power embedded…

Hardware Architecture · Computer Science 2022-12-09 Etienne Dupuis, Silviu-Ioan Filip, Olivier Sentieys, David Novo, Ian O'Connor, Alberto Bosio

Leveraging Approximate Caching for Faster Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) improves the reliability of large language model (LLM) answers by integrating external knowledge. However, RAG increases the end-to-end inference time since looking for relevant documents from large…

Databases · Computer Science 2025-10-28 Shai Bergman, Anne-Marie Kermarrec, Diana Petrescu, Rafael Pires, Mathis Randl, Martijn de Vos, Ji Zhang

DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they…

Computation and Language · Computer Science 2026-05-06 Jinyu Guo, Zhihan Zhang, Jiehui Xie, Md. Tamim Iqbal, Dongshen Han, Lik-Hang Lee, Sung-Ho Bae, Jie Zou, Yang Yang, Chaoning Zhang

Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

In modern GPU inference, cache efficiency remains a major bottleneck, and heuristic policies such as LRU can perform far worse than the offline optimum. Existing learning-based caching systems improve hit rates mainly through…

Machine Learning · Computer Science 2026-04-27 Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Shenyao Chen, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng

Improving the Performance of DNN-based Software Services using Automated Layer Caching

Deep Neural Networks (DNNs) have become an essential component in many application domains including web-based services. A variety of these services require high throughput and (close to) real-time features, for instance, to respond or…

Machine Learning · Computer Science 2022-09-20 Mohammadamin Abedi, Yanni Iouannou, Pooyan Jamshidi, Hadi Hemmati

Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation

Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved…

Machine Learning · Computer Science 2026-02-16 Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong

Accelerating Deep Learning Inference via Freezing

Over the last few years, Deep Neural Networks (DNNs) have become ubiquitous owing to their high accuracy on real-world tasks. However, this increase in accuracy comes at the cost of computationally expensive models leading to higher…

Machine Learning · Computer Science 2020-02-10 Adarsh Kumar, Arjun Balasubramanian, Shivaram Venkataraman, Aditya Akella

Robust Learning-Augmented Caching: An Experimental Study

Effective caching is crucial for the performance of modern-day computing systems. A key optimization problem arising in caching -- which item to evict to make room for a new item -- cannot be optimally solved without knowing the future.…

Machine Learning · Computer Science 2021-06-29 Jakub Chłędowski, Adam Polak, Bartosz Szabucki, Konrad Zolna

CacheNet: A Model Caching Framework for Deep Learning Inference on the Edge

The success of deep neural networks (DNN) in machine perception applications such as image classification and speech recognition comes at the cost of high computation and storage complexity. Inference of uncompressed large scale DNN models…

Machine Learning · Computer Science 2020-07-06 Yihao Fang, Shervin Manzuri Shalmani, Rong Zheng

Distributed Deep Learning at the Edge: A Novel Proactive and Cooperative Caching Framework for Mobile Edge Networks

This letter proposes two novel proactive cooperative caching approaches using deep learning (DL) to predict users' content demand in a mobile edge caching network. In the first approach, a (central) content server takes responsibilities to…

Networking and Internet Architecture · Computer Science 2018-12-14 Yuris Mulya Saputra, Dinh Thai Hoang, Diep N. Nguyen, Eryk Dutkiewicz, Dusit Niyato, Dong In Kim

A Simple Cache Model for Image Recognition

Training large-scale image recognition models is computationally expensive. This raises the question of whether there might be simple ways to improve the test performance of an already trained model without having to re-train or fine-tune…

Computer Vision and Pattern Recognition · Computer Science 2018-11-27 A. Emin Orhan

Advancing Semantic Caching for LLMs with Domain-Specific Embeddings and Synthetic Data

This report investigates enhancing semantic caching effectiveness by employing specialized, fine-tuned embedding models. Semantic caching relies on embedding similarity rather than exact key matching, presenting unique challenges in…

Machine Learning · Computer Science 2025-04-04 Waris Gill, Justin Cechmanek, Tyler Hutcherson, Srijith Rajamohan, Jen Agarwal, Muhammad Ali Gulzar, Manvinder Singh, Benoit Dion

Look-ups are not (yet) all you need for deep learning inference

Fast approximations to matrix multiplication have the potential to dramatically reduce the cost of neural network inference. Recent work on approximate matrix multiplication proposed to replace costly multiplications with table-lookups by…

Machine Learning · Computer Science 2022-07-14 Calvin McCarter, Nicholas Dronen

A Deep Reinforcement Learning-Based Caching Strategy for IoT Networks with Transient Data

The Internet of Things (IoT) has been continuously rising in the past few years, and its potentials are now more apparent. However, transient data generation and limited energy resources are the major bottlenecks of these networks. Besides,…

Networking and Internet Architecture · Computer Science 2022-03-25 Hongda Wu, Ali Nasehzadeh, Ping Wang

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked…

Machine Learning · Computer Science 2025-06-10 Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, Linfeng Zhang

A Deep Reinforcement Learning-Based Framework for Content Caching

Content caching at the edge nodes is a promising technique to reduce the data traffic in next-generation wireless networks. Inspired by the success of Deep Reinforcement Learning (DRL) in solving complicated control problems, this work…

Information Theory · Computer Science 2017-12-22 Chen Zhong, M. Cenk Gursoy, Senem Velipasalar

Deep Learning Approximation: Zero-Shot Neural Network Speedup

Neural networks offer high-accuracy solutions to a range of problems, but are costly to run in production systems because of computational and memory requirements during a forward pass. Given a trained network, we propose a technique called…

Computer Vision and Pattern Recognition · Computer Science 2018-06-18 Michele Pratusevich

Continuous Semantic Caching for Low-Cost LLM Serving

As Large Language Models (LLMs) become increasingly popular, caching responses so that they can be reused by users with semantically similar queries has become a vital strategy for reducing inference costs and latency. Existing caching…

Machine Learning · Computer Science 2026-04-23 Baran Atalar, Xutong Liu, Jinhang Zuo, Siwei Wang, Wei Chen, Carlee Joe-Wong

dKV-Cache: The Cache for Diffusion Language Models

Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive…

Computation and Language · Computer Science 2025-05-22 Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang

Performance Model for Similarity Caching

Similarity caching allows requests for an item to be served by a similar item. Applications include recommendation systems, multimedia retrieval, and machine learning. Recently, many similarity caching policies have been proposed, like…

Networking and Internet Architecture · Computer Science 2023-09-22 Younes Ben Mazziane, Sara Alouf, Giovanni Neglia, Daniel S. Menasche