Related papers: Optimizing Prediction Serving on Low-Latency Serve…

DataFlower: Exploiting the Data-flow Paradigm for Serverless Workflow Orchestration

Serverless computing that runs functions with auto-scaling is a popular task execution pattern in the cloud-native era. By connecting serverless functions into workflows, tenants can achieve complex functionality. Prior researches adopt the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-01 Zijun Li, Chuhao Xu, Quan Chen, Jieru Zhao, Chen Chen, Minyi Guo

ServeFlow: A Fast-Slow Model Architecture for Network Traffic Analysis

Network traffic analysis increasingly uses complex machine learning models as the internet consolidates and traffic gets more encrypted. However, over high-bandwidth networks, flows can easily arrive faster than model inference rates. The…

Networking and Internet Architecture · Computer Science 2024-10-25 Shinan Liu, Ted Shaowang, Gerry Wan, Jeewon Chae, Jonatas Marques, Sanjay Krishnan, Nick Feamster

Prediction-driven resource provisioning for serverless container runtimes

In recent years Serverless Computing has emerged as a compelling cloud based model for the development of a wide range of data-intensive applications. However, rapid container provisioning introduces non-trivial challenges for FaaS cloud…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-28 Dimitrios Tomaras, Michail Tsenos, Vana Kalogeraki

Hierarchical Prediction-based Management for LMaaS Systems

Large Language Models (LLMs) have revolutionized numerous domains, driving the rise of Language-Model-as-a-Service (LMaaS) platforms that process millions of queries daily. These platforms must minimize latency and meet Service Level…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-21 Zhihan Jiang, Yujie Huang, Guangba Yu, Junjie Huang, Jiazhen Gu, Michael R. Lyu

Clipper: A Low-Latency Online Prediction Serving System

Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-03-01 Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, Ion Stoica

PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems

Machine Learning models are often composed of pipelines of transformations. While this design allows to efficiently execute single model components at training time, prediction serving has different requirements such as low latency, high…

Machine Learning · Computer Science 2018-10-16 Yunseong Lee, Alberto Scolari, Byung-Gon Chun, Marco Domenico Santambrogio, Markus Weimer, Matteo Interlandi

Towards Designing a Self-Managed Machine Learning Inference Serving System in Public Cloud

We are witnessing an increasing trend towards using Machine Learning (ML) based prediction systems, spanning across different application domains, including product recommendation systems, personal assistant devices, facial recognition, etc.…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-24 Jashwant Raj Gunasekaran, Prashanth Thinakaran, Cyan Subhra Mishra, Mahmut Taylan Kandemir, Chita R. Das

AI-based Resource Allocation: Reinforcement Learning for Adaptive Auto-scaling in Serverless Environments

Serverless computing has emerged as a compelling new paradigm of cloud computing models in recent years. It promises the user services at large scale and low cost while eliminating the need for infrastructure management. On cloud provider…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-01 Lucia Schuler, Somaya Jamil, Niklas Kühl

Predicting Intermediate Storage Performance for Workflow Applications

Configuring a storage system to better serve an application is a challenging task complicated by a multidimensional, discrete configuration space and the high cost of space exploration (e.g., by running the application with different…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-06-11 Lauro Beltrão Costa, Abmar Barros, Samer Al-Kiswany, Hao Yang, Emalayan Vairavanathan, Matei Ripeanu

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). By harnessing the substantial near-GPU storage and memory capacities of inference servers,…

Machine Learning · Computer Science 2024-07-26 Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai

Taming Cold Starts: Proactive Serverless Scheduling with Model Predictive Control

Serverless computing has transformed cloud application deployment by introducing a fine-grained, event-driven execution model that abstracts away infrastructure management. Its on-demand nature makes it especially appealing for…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-14 Chanh Nguyen, Monowar Bhuyan, Erik Elmroth

Adaptive Serverless Resource Management via Slot-Survival Prediction and Event-Driven Lifecycle Control

Serverless computing eliminates infrastructure management overhead but introduces significant challenges regarding cold start latency and resource utilization. Traditional static resource allocation often leads to inefficiencies under…

Artificial Intelligence · Computer Science 2026-04-08 Zeyu Wang, Cuiqianhe Du, Renyue Zhang, Kejian Tong, Qi He, Qiyuan Tian

CloudProphet: A Machine Learning-Based Performance Prediction for Public Clouds

Computing servers have played a key role in developing and processing emerging compute-intensive applications in recent years. Consolidating multiple virtual machines (VMs) inside one server to run various applications introduces severe…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-29 Darong Huang, Luis Costero, Ali Pahlevan, Marina Zapater, David Atienza

A Survey of Serverless Machine Learning Model Inference

Recent developments in Generative AI, Computer Vision, and Natural Language Processing have led to an increased integration of AI models into various products. This widespread adoption of AI requires significant efforts in deploying these…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-23 Kamil Kojs

BeeFlow: Behavior Tree-based Serverless Workflow Modeling and Scheduling for Resource-Constrained Edge Clusters

Serverless computing has gained popularity in edge computing due to its flexible features, including the pay-per-use pricing model, auto-scaling capabilities, and multi-tenancy support. Complex Serverless-based applications typically rely…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-01 Ke Luo, Tao Ouyang, Zhi Zhou, Xu Chen

A System for Microserving of LLMs

The recent advances in LLMs bring a strong demand for efficient system support to improve overall serving efficiency. As LLM inference scales towards multiple GPUs and even multiple compute nodes, various coordination patterns, such as…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-18 Hongyi Jin, Ruihang Lai, Charlie F. Ruan, Yingcheng Wang, Todd C. Mowry, Xupeng Miao, Zhihao Jia, Tianqi Chen

Scalability Optimization in Cloud-Based AI Inference Services: Strategies for Real-Time Load Balancing and Automated Scaling

The rapid expansion of AI inference services in the cloud necessitates a robust scalability solution to manage dynamic workloads and maintain high performance. This study proposes a comprehensive scalability optimization framework for cloud…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-23 Yihong Jin, Ze Yang

Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud

This review report discusses the cold start latency in serverless inference and existing solutions. It particularly reviews the ServerlessLLM method, a system designed to address the cold start problem in serverless inference for large…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-26 Himel Ghosh

A Deep Reinforcement Learning based Algorithm for Time and Cost Optimized Scaling of Serverless Applications

Serverless computing has gained a strong traction in the cloud computing community in recent years. Among the many benefits of this novel computing model, the rapid auto-scaling capability of user applications takes prominence. However, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-23 Anupama Mampage, Shanika Karunasekera, Rajkumar Buyya

On the Cost of Model-Serving Frameworks: An Experimental Evaluation

In machine learning (ML), the inference phase is the process of applying pre-trained models to new, unseen data with the objective of making predictions. During the inference phase, end-users interact with ML services to gain insights,…

Machine Learning · Computer Science 2024-11-18 Pasquale De Rosa, Yérom-David Bromberg, Pascal Felber, Djob Mvondo, Valerio Schiavoni