Related papers: Accelerating Deep Learning Classification with Err…
The design and implementation of Deep Learning (DL) models is currently receiving a lot of attention from both industrials and academics. However, the computational workload associated with DL is often out of reach for low-power embedded…
Retrieval-augmented generation (RAG) improves the reliability of large language model (LLM) answers by integrating external knowledge. However, RAG increases the end-to-end inference time since looking for relevant documents from large…
The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they…
In modern GPU inference, cache efficiency remains a major bottleneck, and heuristic policies such as \textsc{LRU} can perform far worse than the offline optimum. Existing learning-based caching systems improve hit rates mainly through…
Deep Neural Networks (DNNs) have become an essential component in many application domains including web-based services. A variety of these services require high throughput and (close to) real-time features, for instance, to respond or…
Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved…
Over the last few years, Deep Neural Networks (DNNs) have become ubiquitous owing to their high accuracy on real-world tasks. However, this increase in accuracy comes at the cost of computationally expensive models leading to higher…
Effective caching is crucial for the performance of modern-day computing systems. A key optimization problem arising in caching -- which item to evict to make room for a new item -- cannot be optimally solved without knowing the future.…
The success of deep neural networks (DNN) in machine perception applications such as image classification and speech recognition comes at the cost of high computation and storage complexity. Inference of uncompressed large scale DNN models…
This letter proposes two novel proactive cooperative caching approaches using deep learning (DL) to predict users' content demand in a mobile edge caching network. In the first approach, a (central) content server takes responsibilities to…
Training large-scale image recognition models is computationally expensive. This raises the question of whether there might be simple ways to improve the test performance of an already trained model without having to re-train or fine-tune…
This report investigates enhancing semantic caching effectiveness by employing specialized, fine-tuned embedding models. Semantic caching relies on embedding similarity rather than exact key matching, presenting unique challenges in…
Fast approximations to matrix multiplication have the potential to dramatically reduce the cost of neural network inference. Recent work on approximate matrix multiplication proposed to replace costly multiplications with table-lookups by…
The Internet of Things (IoT) has been continuously rising in the past few years, and its potentials are now more apparent. However, transient data generation and limited energy resources are the major bottlenecks of these networks. Besides,…
Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked…
Content caching at the edge nodes is a promising technique to reduce the data traffic in next-generation wireless networks. Inspired by the success of Deep Reinforcement Learning (DRL) in solving complicated control problems, this work…
Neural networks offer high-accuracy solutions to a range of problems, but are costly to run in production systems because of computational and memory requirements during a forward pass. Given a trained network, we propose a techique called…
As Large Language Models (LLMs) become increasingly popular, caching responses so that they can be reused by users with semantically similar queries has become a vital strategy for reducing inference costs and latency. Existing caching…
Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive…
Similarity caching allows requests for an item to be served by a similar item. Applications include recommendation systems, multimedia retrieval, and machine learning. Recently, many similarity caching policies have been proposed, like…