Related papers: Accelerating Mobile Inference through Fine-Grained…

Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems

Large Language Models (LLMs) have achieved impressive results across various tasks, yet their high computational demands pose deployment challenges, especially on consumer-grade hardware. Mixture of Experts (MoE) models provide an efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-19 En-Ming Huang, Li-Shang Lin, Chun-Yi Lee

Deep Learning Inference on Heterogeneous Mobile Processors: Potentials and Pitfalls

There is a growing demand to deploy computation-intensive deep learning (DL) models on resource-constrained mobile devices for real-time intelligent applications. Equipped with a variety of processing units such as CPUs, GPUs, and NPUs, the…

Machine Learning · Computer Science 2024-05-06 Sicong Liu, Wentao Zhou, Zimu Zhou, Bin Guo, Minfan Wang, Cheng Fang, Zheng Lin, Zhiwen Yu

On-Device Neural Net Inference with Mobile GPUs

On-device inference of machine learning models for mobile phones is desirable due to its lower latency and increased privacy. Running such a compute-intensive task solely on the mobile CPU, however, can be difficult due to limited computing…

Machine Learning · Computer Science 2019-10-16 Juhyun Lee, Nikolay Chirkov, Ekaterina Ignasheva, Yury Pisarchyk, Mogan Shieh, Fabio Riccardi, Raman Sarokin, Andrei Kulik, Matthias Grundmann

Towards Collaborative Intelligence Friendly Architectures for Deep Learning

Modern mobile devices are equipped with high-performance hardware resources such as graphics processing units (GPUs), making the end-side intelligent services more feasible. Even recently, specialized silicons as neural engines are being…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-02-04 Amir Erfan Eshratifar, Amirhossein Esmaili, Massoud Pedram

Accelerating Exact and Approximate Inference for (Distributed) Discrete Optimization with GPUs

Discrete optimization is a central problem in artificial intelligence. The optimization of the aggregated cost of a network of cost functions arises in a variety of problems including (W)CSP, DCOP, as well as optimization in stochastic…

Artificial Intelligence · Computer Science 2018-01-12 Ferdinando Fioretto, Enrico Pontelli, William Yeoh, Rina Dechter

Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs

Deploying deep learning models in cloud clusters provides efficient and prompt inference services to accommodate the widespread application of deep learning. These clusters are usually equipped with host CPUs and accelerators with distinct…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-24 Zinuo Cai, Hao Wang, Tao Song, Yang Hua, Ruhui Ma, Haibing Guan

Enabling On-Device Smartphone GPU based Training: Lessons Learned

Deep Learning (DL) has shown impressive performance in many mobile applications. Most existing works have focused on reducing the computational and resource overheads of running Deep Neural Networks (DNN) inference on resource-constrained…

Machine Learning · Computer Science 2022-02-22 Anish Das, Young D. Kwon, Jagmohan Chauhan, Cecilia Mascolo

Neural Network Inference on Mobile SoCs

The ever-increasing demand from mobile Machine Learning (ML) applications calls for evermore powerful on-chip computing resources. Mobile devices are empowered with heterogeneous multi-processor Systems-on-Chips (SoCs) to process ML…

Machine Learning · Computer Science 2021-02-03 Siqi Wang, Anuj Pathania, Tulika Mitra

Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference

The common assumption in on-device AI is that GPUs, with their superior parallel processing, always provide the best performance for large language model (LLM) inference. In this work, we challenge this notion by empirically demonstrating…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-13 Haolin Zhang, Jeff Huang

Computer Vision Accelerators for Mobile Systems based on OpenCL GPGPU Co-Processing

In this paper, we present an OpenCL-based heterogeneous implementation of a computer vision algorithm -- image inpainting-based object removal algorithm -- on mobile devices. To take advantage of the computation power of the mobile…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-03-19 Guohui Wang, Yingen Xiong, Jay Yun, Joseph R. Cavallaro

APEX: Asynchronous Parallel CPU-GPU Execution for Online LLM Inference on Constrained GPUs

Deploying large language models (LLMs) for online inference is often constrained by limited GPU memory, particularly due to the growing KV cache during auto-regressive decoding. Hybrid GPU-CPU execution has emerged as a promising solution…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-16 Jiakun Fan, Yanglin Zhang, Xiangchen Li, Dimitrios S. Nikolopoulos

Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference

With the rapid advancement of artificial intelligence technologies such as ChatGPT, AI agents, and video generation, contemporary mobile systems have begun integrating these AI capabilities on local devices to enhance privacy and reduce…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-07 Le Chen, Dahu Feng, Erhu Feng, Yingrui Wang, Rong Zhao, Yubin Xia, Pinjie Xu, Haibo Chen

Intra-DP: A High Performance Collaborative Inference System for Mobile Edge Computing

Deploying deep neural networks (DNNs) on resource-constrained mobile devices presents significant challenges, particularly in achieving real-time performance while simultaneously coping with limited computational resources and battery life.…

Networking and Internet Architecture · Computer Science 2025-09-24 Zekai Sun, Xiuxian Guan, Zheng Lin, Zihan Fang, Xiangming Cai, Zhe Chen, Fangming Liu, Heming Cui, Jie Xiong, Wei Ni, Chau Yuen

Communication-Computation Efficient Device-Edge Co-Inference via AutoML

Device-edge co-inference, which partitions a deep neural network between a resource-constrained mobile device and an edge server, recently emerges as a promising paradigm to support intelligent mobile applications. To accelerate the…

Machine Learning · Computer Science 2021-09-01 Xinjie Zhang, Jiawei Shao, Yuyi Mao, Jun Zhang

Cloud-based or On-device: An Empirical Study of Mobile Deep Inference

Modern mobile applications are benefiting significantly from the advancement in deep learning, e.g., implementing real-time image recognition and conversational system. Given a trained deep learning model, applications usually need to…

Performance · Computer Science 2019-03-01 Tian Guo

Inference Performance Optimization for Large Language Models on CPUs

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry.…

Artificial Intelligence · Computer Science 2024-07-11 Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie

Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems

Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-14 Fabian Knorr, Philip Salzmann, Peter Thoman, Thomas Fahringer

MDInference: Balancing Inference Accuracy and Latency for Mobile Applications

Deep Neural Networks are allowing mobile devices to incorporate a wide range of features into user applications. However, the computational complexity of these models makes it difficult to run them effectively on resource-constrained mobile…

Performance · Computer Science 2020-04-02 Samuel S. Ogden, Tian Guo

Execution of Compound Multi-Kernel OpenCL Computations in Multi-CPU/Multi-GPU Environments

Current computational systems are heterogeneous by nature, featuring a combination of CPUs and GPUs. As the latter are becoming an established platform for high-performance computing, the focus is shifting towards the seamless programming…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-10-23 Fábio Soldado, Fernando Alexandre, Hervé Paulino

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-27 Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens