English
Related papers

Related papers: Multi-model Machine Learning Inference Serving wit…

200 papers

Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may compromise latency-sensitive scheduling, as…

Machine Learning · Computer Science 2025-12-25 Haidong Zhao , Nikolaos Georgantas

In the context of Machine Learning as a Service (MLaaS) clouds, the extensive use of Large Language Models (LLMs) often requires efficient management of significant query loads. When providing real-time inference services, several…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-25 Yiyuan He , Minxian Xu , Jingfeng Wu , Wanyi Zheng , Kejiang Ye , Chengzhong Xu

In cloud machine learning (ML) inference systems, providing low latency to end-users is of utmost importance. However, maximizing server utilization and system throughput is also crucial for ML service providers as it helps lower the…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-01 Yunseong Kim , Yujeong Choi , Minsoo Rhu

Recent developments in Generative AI, Computer Vision, and Natural Language Processing have led to an increased integration of AI models into various products. This widespread adoption of AI requires significant efforts in deploying these…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-23 Kamil Kojs

The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-sized models and infrequent requests. While existing serverless solutions follow exclusive GPU allocation, we take a step back to explore…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-16 Chuhao Xu , Zijun Li , Quan Chen , Han Zhao , Xueyan Tang , Minyi Guo

Large language models have demonstrated extraordinary performance in many AI tasks but are expensive to use, even after training, due to their requirement of high-end GPUs. Recently, a distributed system called PETALS was developed to lower…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Tingyang Sun , Ting He , Bo Ji , Parimal Parag

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-17 Akrit Mudvari , Yuang Jiang , Leandros Tassiulas

Serving deep neural networks in latency critical interactive settings often requires GPU acceleration. However, the small batch sizes typical in online inference results in poor GPU utilization, a potential performance gap which GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-01-03 Paras Jain , Xiangxi Mo , Ajay Jain , Harikaran Subbaraj , Rehan Sohail Durrani , Alexey Tumanov , Joseph Gonzalez , Ion Stoica

In cloud environments, GPU-based deep neural network (DNN) inference servers are required to meet the Service Level Objective (SLO) latency for each workload under a specified request rate, while also minimizing GPU resource consumption.…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-24 Munkyu Lee , Sihoon Seong , Minki Kang , Jihyuk Lee , Gap-Joo Na , In-Geol Chun , Dimitrios Nikolopoulos , Cheol-Ho Hong

Modern deployment of large language models (LLMs) frequently involves both inference serving and continuous retraining to stay aligned with evolving data and user feedback. Common practices separate these workloads onto distinct servers in…

Artificial Intelligence · Computer Science 2025-07-30 Yufei Li , Zexin Li , Yinglun Zhu , Cong Liu

Modern applications increasingly rely on inference serving systems to provide low-latency insights with a diverse set of machine learning models. Existing systems often utilize resource elasticity to scale with demand. However, many…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-13 Joel Wolfrath , Daniel Frink , Abhishek Chandra

Machine learning (ML) inference serving systems host deep neural network (DNN) models and schedule incoming inference requests across deployed GPUs. However, limited support for task prioritization and insufficient latency estimation under…

Machine Learning · Computer Science 2026-05-01 Haidong Zhao , Nikolaos Georgantas

The increasing adoption of large language models (LLMs) necessitates inference serving systems that can deliver both high throughput and low latency. Deploying LLMs with hundreds of billions of parameters on memory-constrained GPUs exposes…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-10 Bowen Pang , Kai Li , Feifan Wang

Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are…

Machine Learning · Computer Science 2024-03-05 Juntao Zhao , Borui Wan , Yanghua Peng , Haibin Lin , Chuan Wu

The rapid growth of generative AI and its integration into everyday workflows have significantly increased the demand for large language model (LLM) inference services. While proprietary models remain popular, recent advancements in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-28 Linyu Wu , Xiaoyuan Liu , Tianneng Shi , Zhe Ye , Dawn Song

As AI inference becomes mainstream, research has begun to focus on improving the energy consumption of inference servers. Inference kernels commonly underutilize a GPU's compute resources and waste power from idling components. To improve…

Systems and Control · Electrical Eng. & Systems 2025-06-17 Ryan Quach , Yidi Wang , Ali Jahanshahi , Daniel Wong , Hyoseung Kim

Mixture of Experts (MoE) LLMs, characterized by their sparse activation patterns, offer a promising approach to scaling language models while avoiding proportionally increasing the inference cost. However, their large parameter sizes…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Yichao Yuan , Lin Ma , Nishil Talati

The rapid adoption of machine learning (ML) has underscored the importance of serving ML models with high throughput and resource efficiency. Traditional approaches to managing increasing query demands have predominantly focused on hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-08 Sohaib Ahmad , Hui Guan , Ramesh K. Sitaraman

Deep Learning (DL) models have achieved superior performance. Meanwhile, computing hardware like NVIDIA GPUs also demonstrated strong computing scaling trends with 2x throughput and memory bandwidth for each generation. With such strong…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-26 Fuxun Yu , Di Wang , Longfei Shangguan , Minjia Zhang , Chenchen Liu , Xiang Chen

Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes--that load entire models to a single server--are unable to support this scale. One approach to support this scale is with distributed serving,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-13 Michael Lui , Yavuz Yetim , Özgür Özkan , Zhuoran Zhao , Shin-Yeh Tsai , Carole-Jean Wu , Mark Hempstead
‹ Prev 1 2 3 10 Next ›