Related papers: Accelerating Machine Learning Inference with GPUs …

GPU-accelerated machine learning inference as a service for computing in neutrino experiments

Machine learning algorithms are becoming increasingly prevalent and performant in the reconstruction of events in accelerator-based neutrino experiments. These sophisticated algorithms can be computationally expensive. At the same time, the…

Computational Physics · Physics 2021-03-25 Michael Wang, Tingjun Yang, Maria Acosta Flechas, Philip Harris, Benjamin Hawks, Burt Holzman, Kyle Knoepfel, Jeffrey Krupa, Kevin Pedro, Nhan Tran

Queueing Analysis of GPU-Based Inference Servers with Dynamic Batching: A Closed-Form Characterization

GPU-accelerated computing is a key technology to realize high-speed inference servers using deep neural networks (DNNs). An important characteristic of GPU-based inference is that the computational efficiency, in terms of the processing…

Performance · Computer Science 2021-01-13 Yoshiaki Inoue

FPGA-accelerated machine learning inference as a service for particle physics computing

New heterogeneous computing paradigms on dedicated hardware with increased parallelization, such as Field Programmable Gate Arrays (FPGAs), offer exciting solutions with large potential gains. The growing applications of machine learning…

Data Analysis, Statistics and Probability · Physics 2019-10-17 Javier Duarte, Philip Harris, Scott Hauck, Burt Holzman, Shih-Chieh Hsu, Sergo Jindariani, Suffian Khan, Benjamin Kreis, Brian Lee, Mia Liu, Vladimir Lončar, Jennifer Ngadiuba, Kevin Pedro, Brandon Perez, Maurizio Pierini, Dylan Rankin, Nhan Tran, Matthew Trahms, Aristeidis Tsaris, Colin Versteeg, Ted W. Way, Dustin Werran, Zhenbin Wu

Benchmarking Edge AI Platforms for High-Performance ML Inference

Edge computing's growing prominence, due to its ability to reduce communication latency and enable real-time processing, is promoting the rise of high-performance, heterogeneous System-on-Chip solutions. While current approaches often…

Artificial Intelligence · Computer Science 2024-09-24 Rakshith Jayanth, Neelesh Gupta, Viktor Prasanna

DPU or GPU for Accelerating Neural Networks Inference -- Why not both? Split CNN Inference

Video and image streaming on edge devices requires low latency. To address this, Neural Networks (NNs) are widely used, and prior work mainly focuses on accelerating them with single hardware units such as Graphics Processing Units (GPUs),…

Hardware Architecture · Computer Science 2026-05-04 Ali Emre Oztas, Mahir Demir, James Garside, Mikel Luján

GPU coprocessors as a service for deep learning inference in high energy physics

In the next decade, the demands for computing in large scientific experiments are expected to grow tremendously. During the same time period, CPU performance increases will be limited. At the CERN Large Hadron Collider (LHC), these two…

Computational Physics · Physics 2021-04-26 Jeffrey Krupa, Kelvin Lin, Maria Acosta Flechas, Jack Dinsmore, Javier Duarte, Philip Harris, Scott Hauck, Burt Holzman, Shih-Chieh Hsu, Thomas Klijnsma, Mia Liu, Kevin Pedro, Dylan Rankin, Natchanon Suaysom, Matt Trahms, Nhan Tran

A Data-Center FPGA Acceleration Platform for Convolutional Neural Networks

Intensive computation is entering data centers with multiple workloads of deep learning. To balance the compute efficiency, performance, and total cost of ownership (TCO), the use of a field-programmable gate array (FPGA) with…

Computer Vision and Pattern Recognition · Computer Science 2019-09-19 Xiaoyu Yu, Yuwei Wang, Jie Miao, Ephrem Wu, Heng Zhang, Yu Meng, Bo Zhang, Biao Min, Dewei Chen, Jianlin Gao

Accelerating Multi-Model Inference by Merging DNNs of Different Weights

Standardized DNN models that have been proved to perform well on machine learning tasks are widely used and often adopted as-is to solve downstream tasks, forming the transfer learning paradigm. However, when serving multiple instances of…

Machine Learning · Computer Science 2020-09-29 Joo Seong Jeong, Soojeong Kim, Gyeong-In Yu, Yunseong Lee, Byung-Gon Chun

Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs

Deploying deep learning models in cloud clusters provides efficient and prompt inference services to accommodate the widespread application of deep learning. These clusters are usually equipped with host CPUs and accelerators with distinct…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-24 Zinuo Cai, Hao Wang, Tao Song, Yang Hua, Ruhui Ma, Haibing Guan

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

With the fast development of deep neural networks (DNNs), many real-world applications are adopting multiple models to conduct compound tasks, such as co-running classification, detection, and segmentation models on autonomous vehicles.…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-30 Fuxun Yu, Shawn Bray, Di Wang, Longfei Shangguan, Xulong Tang, Chenchen Liu, Xiang Chen

AutoGNN: End-to-End Hardware-Driven Graph Preprocessing for Enhanced GNN Performance

Graph neural network (GNN) inference faces significant bottlenecks in preprocessing, which often dominate overall inference latency. We introduce AutoGNN, an FPGA-based accelerator designed to address these challenges by leveraging FPGA's…

Hardware Architecture · Computer Science 2026-02-03 Seungkwan Kang, Seungjun Lee, Donghyun Gouk, Miryeong Kwon, Hyunkyu Choi, Junhyeok Jang, Sangwon Lee, Huiwon Choi, Jie Zhang, Wonil Choi, Mahmut Taylan Kandemir, Myoungsoo Jung

Inference Acceleration for Large Language Models on CPUs

In recent years, large language models have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, deploying these models for real-world applications often requires efficient inference solutions…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-13 Ditto PS, Jithin VG, Adarsh MS

Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs

Accelerating the deep learning inference is very important for real-time applications. In this paper, we propose a novel method to fuse the layers of convolutional neural networks (CNNs) on Graphics Processing Units (GPUs), which applies…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-30 Xueying Wang, Guangli Li, Xiao Dong, Jiansong Li, Lei Liu, Xiaobing Feng

Multi-user Co-inference with Batch Processing Capable Edge Server

Graphics processing units (GPUs) can improve deep neural network inference throughput via batch processing, where multiple tasks are concurrently processed. We focus on novel scenarios that the energy-constrained mobile devices offload…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-14 Wenqi Shi, Sheng Zhou, Zhisheng Niu, Miao Jiang, Lu Geng

FlowGNN: A Dataflow Architecture for Real-Time Workload-Agnostic Graph Neural Network Inference

Graph neural networks (GNNs) have recently exploded in popularity thanks to their broad applicability to graph-related problems such as quantum chemistry, drug discovery, and high energy physics. However, meeting demand for novel GNN models…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-20 Rishov Sarkar, Stefan Abi-Karam, Yuqi He, Lakshmi Sathidevi, Cong Hao

Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers

Cloud GPU servers have become the de facto way for deep learning practitioners to train complex models on large-scale datasets. However, it is challenging to determine the appropriate cluster configuration, e.g., server type and…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-08 Shijian Li, Robert J. Walls, Tian Guo

An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks

Ensembles of Deep Neural Networks (DNNs) have achieved qualitative predictions but they are computing and memory intensive. Therefore, the demand is growing to make them answer a heavy workload of requests with available computational…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-31 Pierrick Pochelu, Serge G. Petiton, Bruno Conche

Distributed Deep Learning Inference Acceleration using Seamless Collaboration in Edge Computing

This paper studies inference acceleration using distributed convolutional neural networks (CNNs) in collaborative edge computing. To ensure inference accuracy in inference task partitioning, we consider the receptive-field when performing…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-12 Nan Li, Alexandros Iosifidis, Qi Zhang

Reducing Down(stream)time: Pretraining Molecular GNNs using Heterogeneous AI Accelerators

The demonstrated success of transfer learning has popularized approaches that involve pretraining models from massive data sources and subsequent finetuning towards a specific task. While such approaches have become the norm in fields such…

Machine Learning · Computer Science 2022-11-10 Jenna A. Bilbrey, Kristina M. Herman, Henry Sprueill, Sotiris S. Xantheas, Payel Das, Manuel Lopez Roldan, Mike Kraus, Hatem Helal, Sutanay Choudhury

cuConv: A CUDA Implementation of Convolution for CNN Inference

Convolutions are the core operation of deep learning applications based on Convolutional Neural Networks (CNNs). Current GPU architectures are highly efficient for training and deploying deep CNNs, and hence, these are largely used in…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-28 Marc Jordà, Pedro Valero-Lara, Antonio J. Peña