Related papers: WindVE: Collaborative CPU-NPU Vector Embedding

Benchmarking Edge AI Platforms for High-Performance ML Inference

Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions. While current approaches often…

Artificial Intelligence · Computer Science 2024-09-24 Rakshith Jayanth, Neelesh Gupta, Viktor Prasanna

Inference Acceleration for Large Language Models on CPUs

In recent years, large language models have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, deploying these models for real-world applications often requires efficient inference solutions…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-13 Ditto PS, Jithin VG, Adarsh MS

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-15 Xiangyu Li, Chengyu Yin, Weijun Wang, Jianyu Wei, Ting Cao, Yunxin Liu

HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

In recent times, the emergence of Large Language Models (LLMs) has resulted in increasingly larger model size, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to facilitate low-memory…

Performance · Computer Science 2024-03-05 Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You

Design-Technology Co-Optimization for NVM-based Neuromorphic Processing Elements

Neuromorphic hardware platforms can significantly lower the energy overhead of a machine learning inference task. We present a design-technology tradeoff analysis to implement such inference tasks on the processing elements (PEs) of a Non-…

Neural and Evolutionary Computing · Computer Science 2022-03-11 Shihao Song, Adarsha Balaji, Anup Das, Nagarajan Kandasamy

Inference Performance Optimization for Large Language Models on CPUs

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry.…

Artificial Intelligence · Computer Science 2024-07-11 Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie

Model-enhanced Vector Index

Embedding-based retrieval methods construct vector indices to search for document representations that are most similar to the query representations. They are widely used in document retrieval due to low latency and decent recall…

Information Retrieval · Computer Science 2023-11-10 Hailin Zhang, Yujing Wang, Qi Chen, Ruiheng Chang, Ting Zhang, Ziming Miao, Yingyan Hou, Yang Ding, Xupeng Miao, Haonan Wang, Bochen Pang, Yuefeng Zhan, Hao Sun, Weiwei Deng, Qi Zhang, Fan Yang, Xing Xie, Mao Yang, Bin Cui

Competitive Online Virtual Cluster Embedding Algorithms

In the conventional cloud service model, computing resources are allocated for tenants on a pay-per-use basis. However, the performance of applications that communicate inside this network is unpredictable because network resources are not…

Networking and Internet Architecture · Computer Science 2018-10-09 Feras Fattohi

A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase the inference cost by invoking the model…

Computation and Language · Computer Science 2025-11-27 Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, Jemin Lee

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and…

Machine Learning · Computer Science 2024-11-26 Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica

Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

Large language models now serve millions of users daily, with providers incurring costs exceeding $700,000 per day. Each request requires token-by-token inference, making GPU scheduling central to latency, capacity, and cost. The difficulty…

Machine Learning · Computer Science 2026-05-18 Ruicheng Ao, Gan Luo, David Simchi-Levi, Xinshang Wang

HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing

As modern LLMs support thousands to millions of tokens, KV caches grow to hundreds of gigabytes, stressing memory capacity and bandwidth. Existing solutions, such as KV cache pruning and offloading, alleviate these but underutilize hardware…

Performance · Computer Science 2026-04-21 Mao Lin, Xi Wang, Guilherme Cox, Dong Li, Hyeran Jeon

Improving Inference Performance of Machine Learning with the Divide-and-Conquer Principle

Many popular machine learning models scale poorly when deployed on CPUs. In this paper we explore the reasons why and propose a simple, yet effective approach based on the well-known Divide-and-Conquer Principle to tackle this problem of…

Machine Learning · Computer Science 2023-03-03 Alex Kogan

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Large language models (LLMs) are known for their high demand on computing resources and memory due to their substantial model size, which leads to inefficient inference on moderate GPU systems. Techniques like quantization or pruning can…

Computational Engineering, Finance, and Science · Computer Science 2024-11-26 Wenxiang Lin, Xinglin Pan, Shaohuai Shi, Xuan Wang, Xiaowen Chu

From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs

With the widespread adoption of Large Language Models (LLMs), the demand for high-performance LLM inference services continues to grow. To meet this demand, a growing number of AI accelerators have been proposed, such as Google TPU, Huawei…

Hardware Architecture · Computer Science 2025-10-08 Tianhao Zhu, Dahu Feng, Erhu Feng, Yubin Xia

Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution

Deploying deep neural networks on mobile devices is increasingly important but remains challenging due to limited computing resources. On the other hand, their unified memory architecture and narrower gap between CPU and GPU performance…

Machine Learning · Computer Science 2026-02-20 Zhuojin Li, Marco Paolieri, Leana Golubchik

VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator

Large Language Models (LLMs) excel in natural language processing tasks but pose significant computational and memory challenges for edge deployment due to their intensive resource demands. This work addresses the efficiency of LLM…

Hardware Architecture · Computer Science 2025-07-02 Zhican Wang, Hongxiang Fan, Haroon Waris, Gang Wang, Zhenyu Li, Jianfei Jiang, Yanan Sun, Guanghui He

VectorLiteRAG: Latency-Aware and Fine-Grained Resource Partitioning for Efficient RAG

Retrieval-Augmented Generation (RAG) systems combine vector similarity search with large language models (LLMs) to deliver accurate, context-aware responses. However, co-locating the vector retriever and the LLM on shared GPU infrastructure…

Machine Learning · Computer Science 2026-01-21 Junkyum Kim, Divya Mahajan

Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading

LLM inference is essential for applications like text summarization, translation, and data analysis, but the high cost of GPU instances from Cloud Service Providers (CSPs) like AWS is a major burden. This paper proposes InferSave, a…

Machine Learning · Computer Science 2025-04-17 Kihyun Kim, Jinwoo Kim, Hyunsun Chung, Myung-Hoon Cha, Hong-Yeon Kim, Youngjae Kim

FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems

Federated Learning (FL) has become a viable technique for realizing privacy-enhancing distributed deep learning on the network edge. Heterogeneous hardware, unreliable client devices, and energy constraints often characterize edge computing…

Machine Learning · Computer Science 2024-11-05 Herbert Woisetschläger, Alexander Erben, Ruben Mayer, Shiqiang Wang, Hans-Arno Jacobsen