Related papers: Accelerating Deep Learning Classification with Err…

200 papers

The design and implementation of Deep Learning (DL) models is currently receiving a lot of attention from both industry and academia. However, the computational workload associated with DL is often out of reach for low-power embedded…

Hardware Architecture · Computer Science 2022-12-09 Etienne Dupuis, Silviu-Ioan Filip, Olivier Sentieys, David Novo, Ian O'Connor, Alberto Bosio

Retrieval-augmented generation (RAG) improves the reliability of large language model (LLM) answers by integrating external knowledge. However, RAG increases the end-to-end inference time since looking for relevant documents from large…

The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they…

Computation and Language · Computer Science 2026-05-06 Jinyu Guo, Zhihan Zhang, Jiehui Xie, Md. Tamim Iqbal, Dongshen Han, Lik-Hang Lee, Sung-Ho Bae, Jie Zou, Yang Yang, Chaoning Zhang

In modern GPU inference, cache efficiency remains a major bottleneck, and heuristic policies such as LRU can perform far worse than the offline optimum. Existing learning-based caching systems improve hit rates mainly through…

Deep Neural Networks (DNNs) have become an essential component in many application domains including web-based services. A variety of these services require high throughput and (close to) real-time features, for instance, to respond or…

Machine Learning · Computer Science 2022-09-20 Mohammadamin Abedi, Yanni Iouannou, Pooyan Jamshidi, Hadi Hemmati

Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved…

Machine Learning · Computer Science 2026-02-16 Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong

Over the last few years, Deep Neural Networks (DNNs) have become ubiquitous owing to their high accuracy on real-world tasks. However, this increase in accuracy comes at the cost of computationally expensive models leading to higher…

Machine Learning · Computer Science 2020-02-10 Adarsh Kumar, Arjun Balasubramanian, Shivaram Venkataraman, Aditya Akella

Effective caching is crucial for the performance of modern-day computing systems. A key optimization problem arising in caching -- which item to evict to make room for a new item -- cannot be optimally solved without knowing the future.…

Machine Learning · Computer Science 2021-06-29 Jakub Chłędowski, Adam Polak, Bartosz Szabucki, Konrad Zolna

The success of deep neural networks (DNN) in machine perception applications such as image classification and speech recognition comes at the cost of high computation and storage complexity. Inference of uncompressed large scale DNN models…

Machine Learning · Computer Science 2020-07-06 Yihao Fang, Shervin Manzuri Shalmani, Rong Zheng

This letter proposes two novel proactive cooperative caching approaches using deep learning (DL) to predict users' content demand in a mobile edge caching network. In the first approach, a (central) content server takes responsibilities to…

Networking and Internet Architecture · Computer Science 2018-12-14 Yuris Mulya Saputra, Dinh Thai Hoang, Diep N. Nguyen, Eryk Dutkiewicz, Dusit Niyato, Dong In Kim

Training large-scale image recognition models is computationally expensive. This raises the question of whether there might be simple ways to improve the test performance of an already trained model without having to re-train or fine-tune…

Computer Vision and Pattern Recognition · Computer Science 2018-11-27 A. Emin Orhan

This report investigates enhancing semantic caching effectiveness by employing specialized, fine-tuned embedding models. Semantic caching relies on embedding similarity rather than exact key matching, presenting unique challenges in…

Fast approximations to matrix multiplication have the potential to dramatically reduce the cost of neural network inference. Recent work on approximate matrix multiplication proposed to replace costly multiplications with table-lookups by…

Machine Learning · Computer Science 2022-07-14 Calvin McCarter, Nicholas Dronen

The Internet of Things (IoT) has been continuously rising in the past few years, and its potentials are now more apparent. However, transient data generation and limited energy resources are the major bottlenecks of these networks. Besides,…

Networking and Internet Architecture · Computer Science 2022-03-25 Hongda Wu, Ali Nasehzadeh, Ping Wang

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked…

Machine Learning · Computer Science 2025-06-10 Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, Linfeng Zhang

Content caching at the edge nodes is a promising technique to reduce the data traffic in next-generation wireless networks. Inspired by the success of Deep Reinforcement Learning (DRL) in solving complicated control problems, this work…

Information Theory · Computer Science 2017-12-22 Chen Zhong, M. Cenk Gursoy, Senem Velipasalar

Neural networks offer high-accuracy solutions to a range of problems, but are costly to run in production systems because of computational and memory requirements during a forward pass. Given a trained network, we propose a technique called…

Computer Vision and Pattern Recognition · Computer Science 2018-06-18 Michele Pratusevich

As Large Language Models (LLMs) become increasingly popular, caching responses so that they can be reused by users with semantically similar queries has become a vital strategy for reducing inference costs and latency. Existing caching…

Machine Learning · Computer Science 2026-04-23 Baran Atalar, Xutong Liu, Jinhang Zuo, Siwei Wang, Wei Chen, Carlee Joe-Wong

Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive…

Computation and Language · Computer Science 2025-05-22 Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang

Similarity caching allows requests for an item to be served by a similar item. Applications include recommendation systems, multimedia retrieval, and machine learning. Recently, many similarity caching policies have been proposed, like…

Networking and Internet Architecture · Computer Science 2023-09-22 Younes Ben Mazziane, Sara Alouf, Giovanni Neglia, Daniel S. Menasche