English
Related papers

Related papers: WindVE: Collaborative CPU-NPU Vector Embedding

200 papers

Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions. While current approaches often…

Artificial Intelligence · Computer Science 2024-09-24 Rakshith Jayanth , Neelesh Gupta , Viktor Prasanna

In recent years, large language models have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, deploying these models for real-world applications often requires efficient inference solutions…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-13 Ditto PS , Jithin VG , Adarsh MS

Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-15 Xiangyu Li , Chengyu Yin , Weijun Wang , Jianyu Wei , Ting Cao , Yunxin Liu

In recent times, the emergence of Large Language Models (LLMs) has resulted in increasingly larger model size, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to facilitate low-memory…

Performance · Computer Science 2024-03-05 Xuanlei Zhao , Bin Jia , Haotian Zhou , Ziming Liu , Shenggan Cheng , Yang You

Neuromorphic hardware platforms can significantly lower the energy overhead of a machine learning inference task. We present a design-technology tradeoff analysis to implement such inference tasks on the processing elements (PEs) of a Non-…

Neural and Evolutionary Computing · Computer Science 2022-03-11 Shihao Song , Adarsha Balaji , Anup Das , Nagarajan Kandasamy

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry.…

Artificial Intelligence · Computer Science 2024-07-11 Pujiang He , Shan Zhou , Wenhuan Huang , Changqing Li , Duyi Wang , Bin Guo , Chen Meng , Sheng Gui , Weifei Yu , Yi Xie

Embedding-based retrieval methods construct vector indices to search for document representations that are most similar to the query representations. They are widely used in document retrieval due to low latency and decent recall…

In the conventional cloud service model, computing resources are allocated for tenants on a pay-per-use basis. However, the performance of applications that communicate inside this network is unpredictable because network resources are not…

Networking and Internet Architecture · Computer Science 2018-10-09 Feras Fattohi

Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workload such as chain-of-throught, complex reasoning, agent services significantly increase the inference cost by invoke the model…

Computation and Language · Computer Science 2025-11-27 Sihyeong Park , Sungryeol Jeon , Chaelyn Lee , Seokhun Jeon , Byung-Soo Kim , Jemin Lee

Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and…

Machine Learning · Computer Science 2024-11-26 Yilong Zhao , Shuo Yang , Kan Zhu , Lianmin Zheng , Baris Kasikci , Yang Zhou , Jiarong Xing , Ion Stoica

Large language models now serve millions of users daily, with providers incurring costs exceeding $700,000 per day. Each request requires token-by-token inference, making GPU scheduling central to latency, capacity, and cost. The difficulty…

Machine Learning · Computer Science 2026-05-18 Ruicheng Ao , Gan Luo , David Simchi-Levi , Xinshang Wang

As modern LLMs support thousands to millions of tokens, KV caches grow to hundreds of gigabytes, stressing memory capacity and bandwidth. Existing solutions, such as KV cache pruning and offloading, alleviate these but underutilize hardware…

Performance · Computer Science 2026-04-21 Mao Lin , Xi Wang , Guilherme Cox , Dong Li , Hyeran Jeon

Many popular machine learning models scale poorly when deployed on CPUs. In this paper we explore the reasons why and propose a simple, yet effective approach based on the well-known Divide-and-Conquer Principle to tackle this problem of…

Machine Learning · Computer Science 2023-03-03 Alex Kogan

Large language models~(LLMs) are known for their high demand on computing resources and memory due to their substantial model size, which leads to inefficient inference on moderate GPU systems. Techniques like quantization or pruning can…

Computational Engineering, Finance, and Science · Computer Science 2024-11-26 Wenxiang Lin , Xinglin Pan , Shaohuai Shi , Xuan Wang , Xiaowen Chu

With the widespread adoption of Large Language Models (LLMs), the demand for high-performance LLM inference services continues to grow. To meet this demand, a growing number of AI accelerators have been proposed, such as Google TPU, Huawei…

Hardware Architecture · Computer Science 2025-10-08 Tianhao Zhu , Dahu Feng , Erhu Feng , Yubin Xia

Deploying deep neural networks on mobile devices is increasingly important but remains challenging due to limited computing resources. On the other hand, their unified memory architecture and narrower gap between CPU and GPU performance…

Machine Learning · Computer Science 2026-02-20 Zhuojin Li , Marco Paolieri , Leana Golubchik

Large Language Models (LLMs) excel in natural language processing tasks but pose significant computational and memory challenges for edge deployment due to their intensive resource demands. This work addresses the efficiency of LLM…

Hardware Architecture · Computer Science 2025-07-02 Zhican Wang , Hongxiang Fan , Haroon Waris , Gang Wang , Zhenyu Li , Jianfei Jiang , Yanan Sun , Guanghui He

Retrieval-Augmented Generation (RAG) systems combine vector similarity search with large language models (LLMs) to deliver accurate, context-aware responses. However, co-locating the vector retriever and the LLM on shared GPU infrastructure…

Machine Learning · Computer Science 2026-01-21 Junkyum Kim , Divya Mahajan

LLM inference is essential for applications like text summarization, translation, and data analysis, but the high cost of GPU instances from Cloud Service Providers (CSPs) like AWS is a major burden. This paper proposes InferSave, a…

Machine Learning · Computer Science 2025-04-17 Kihyun Kim , Jinwoo Kim , Hyunsun Chung , Myung-Hoon Cha , Hong-Yeon Kim , Youngjae Kim

Federated Learning (FL) has become a viable technique for realizing privacy-enhancing distributed deep learning on the network edge. Heterogeneous hardware, unreliable client devices, and energy constraints often characterize edge computing…

Machine Learning · Computer Science 2024-11-05 Herbert Woisetschläger , Alexander Erben , Ruben Mayer , Shiqiang Wang , Hans-Arno Jacobsen
‹ Prev 1 2 3 10 Next ›