Related papers: Symphony: Optimized DNN Model Serving using Deferr…

D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs

Hardware accelerators such as GPUs are required for real-time, low-latency inference with Deep Neural Networks (DNN). However, due to the inherent limits to the parallelism they can exploit, DNNs often under-utilize the capacity of today's…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-04-27 · Aditya Dhakal, Sameer G. Kulkarni, K. K. Ramakrishnan

An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks

Ensembles of Deep Neural Networks (DNNs) have achieved qualitative predictions but they are computing and memory intensive. Therefore, the demand is growing to make them answer a heavy workload of requests with available computational…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-08-31 · Pierrick Pochelu, Serge G. Petiton, Bruno Conche

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

With the fast development of deep neural networks (DNNs), many real-world applications are adopting multiple models to conduct compound tasks, such as co-running classification, detection, and segmentation models on autonomous vehicles.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-11-30 · Fuxun Yu, Shawn Bray, Di Wang, Longfei Shangguan, Xulong Tang, Chenchen Liu, Xiang Chen

Efficient Pipeline Planning for Expedited Distributed DNN Training

To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in pipeline. Earlier pipeline designs allow multiple…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-08-23 · Ziyue Luo, Xiaodong Yi, Guoping Long, Shiqing Fan, Chuan Wu, Jun Yang, Wei Lin

Balancing Efficiency and Flexibility for DNN Acceleration via Temporal GPU-Systolic Array Integration

Research interest in specialized hardware accelerators for deep neural networks (DNNs) has spiked recently owing to their superior performance and efficiency. However, today's DNN accelerators primarily focus on accelerating specific…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-06-11 · Cong Guo, Yangjie Zhou, Jingwen Leng, Yuhao Zhu, Zidong Du, Quan Chen, Chao Li, Bin Yao, Minyi Guo

Adaptive Scheduling for Edge-Assisted DNN Serving

Deep neural networks (DNNs) have been widely used in various video analytic tasks. These tasks demand real-time responses. Due to the limited processing power on mobile devices, a common way to support such real-time analytics is to offload…

Networking and Internet Architecture · Computer Science · 2023-05-04 · Jian He, Chenxi Yang, Zhaoyuan He, Ghufran Baig, Lili Qiu

Symphony: A Decentralized Multi-Agent Framework for Scalable Collective Intelligence

Most existing Large Language Model (LLM)-based agent frameworks rely on centralized orchestration, incurring high deployment costs, rigid communication topologies, and limited adaptability. To address these challenges, we introduce…

Machine Learning · Computer Science · 2025-08-28 · Ji Wang, Kashing Chen, Xinyuan Song, Ke Zhang, Lynn Ai, Eric Yang, Bill Shi

EdgeServing: Deadline-Aware Multi-DNN Serving at the Edge

As edge computing expands, serving multiple deep neural network (DNN) models on a single shared GPU has become a common yet challenging scenario, where each scheduling decision affects the tail latency of all concurrent queues. Existing…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-08 · Jiahe Cao, Xiaomeng Li, Qiang Liu, Tao Han, Ning Zhang, Weisong Shi

Exploration of Systolic-Vector Architecture with Resource Scheduling for Dynamic ML Workloads

As artificial intelligence (AI) and machine learning (ML) technologies disrupt a wide range of industries, cloud datacenters face ever-increasing demand in inference workloads. However, conventional CPU-based servers cannot handle excessive…

Hardware Architecture · Computer Science · 2022-06-08 · Jung-Hoon Kim, Sungyeob Yoo, Seungjae Moon, Joo-Young Kim

Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference

The evolution of Large Language Model (LLM) serving towards complex, distributed architectures--specifically the P/D-separated, large-scale DP+EP paradigm--introduces distinct scheduling challenges. Unlike traditional deployments where…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-12-19 · Jian Tian, Shuailong Li, Yang Cao, Wenbo Cui, Minghan Zhu, Wenkang Wu, Jianming Zhang, Yanpeng Wang, Zhiwen Xiao, Zhenyu Hou, Dou Shen

Demand Layering for Real-Time DNN Inference with Minimized Memory Usage

When executing a deep neural network (DNN), its model parameters are loaded into GPU memory before execution, incurring a significant GPU memory burden. There are studies that reduce GPU memory usage by exploiting CPU memory as a swap…

Machine Learning · Computer Science · 2022-10-11 · Mingoo Ji, Saehanseul Yi, Changjin Koo, Sol Ahn, Dongjoo Seo, Nikil Dutt, Jong-Chan Kim

Throughput Maximization of DNN Inference: Batching or Multi-Tenancy?

Deployment of real-time ML services on warehouse-scale infrastructures is on the increase. Therefore, decreasing latency and increasing throughput of deep neural network (DNN) inference applications that empower those services have…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-08-29 · Seyed Morteza Nabavinejad, Masoumeh Ebrahimi, Sherief Reda

HarmonyBatch: Batching multi-SLO DNN Inference with Heterogeneous Serverless Functions

Deep Neural Network (DNN) inference on serverless functions is gaining prominence due to its potential for substantial budget savings. Existing works on serverless DNN inference solely optimize batching requests from one application with a…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-05-10 · Jiabin Chen, Fei Xu, Yikun Gu, Li Chen, Fangming Liu, Zhi Zhou

GPU Cluster Scheduling for Network-Sensitive Deep Learning

We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity based consolidation of GPU resources based on the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler…

Performance · Computer Science · 2025-11-11 · Aakash Sharma, Vivek M. Bhasi, Sonali Singh, George Kesidis, Mahmut T. Kandemir, Chita R. Das

Harpagon: Minimizing DNN Serving Cost via Efficient Dispatching, Scheduling and Splitting

Advances in deep neural networks (DNNs) have significantly contributed to the development of real-time video processing applications. Efficient scheduling of DNN workloads in cloud-hosted inference systems is crucial to minimizing serving…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-12-01 · Zhixin Zhao, Yitao Hu, Ziqi Gong, Guotao Yang, Wenxin Li, Xiulong Liu, Keqiu Li, Hao Wang

Dynamic Space-Time Scheduling for GPU Inference

Serving deep neural networks in latency-critical interactive settings often requires GPU acceleration. However, the small batch sizes typical in online inference result in poor GPU utilization, a potential performance gap which GPU…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-01-03 · Paras Jain, Xiangxi Mo, Ajay Jain, Harikaran Subbaraj, Rehan Sohail Durrani, Alexey Tumanov, Joseph Gonzalez, Ion Stoica

ESG: Pipeline-Conscious Efficient Scheduling of DNN Workflows on Serverless Platforms with Shareable GPUs

Recent years have witnessed increasing interest in machine learning inference on serverless computing for its auto-scaling and cost-effective properties. Existing serverless computing, however, lacks effective job scheduling methods to…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-04-26 · Xinning Hui, Yuanchao Xu, Zhishan Guo, Xipeng Shen

Spatial Sharing of GPU for Autotuning DNN models

GPUs are used for training, inference, and tuning of machine learning models. However, Deep Neural Networks (DNNs) vary widely in their ability to exploit the full power of high-performance GPUs. Spatial sharing of a GPU enables multiplexing…

Neural and Evolutionary Computing · Computer Science · 2020-08-11 · Aditya Dhakal, Junguk Cho, Sameer G. Kulkarni, K. K. Ramakrishnan, Puneet Sharma

Scheduling Techniques of AI Models on Modern Heterogeneous Edge GPU -- A Critical Review

In recent years, the development of specialized edge computing devices has significantly increased, driven by the growing demand for AI models. These devices, such as the NVIDIA Jetson series, must efficiently handle increased data…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-06-03 · Ashiyana Abdul Majeed, Mahmoud Meribout

HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing

Giant Deep Neural Networks (DNNs) have become indispensable for accurate and robust support of large-scale cloud-based AI services. However, serving giant DNNs is prohibitively expensive from an energy consumption viewpoint easily…

Machine Learning · Computer Science · 2025-05-20 · Leyang Xue, Yao Fu, Luo Mai, Mahesh K. Marina