Related papers: MLProxy: SLA-Aware Reverse Proxy for Machine Learn…

ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs

Serverless computing has grown rapidly for serving Large Language Model (LLM) inference due to its pay-as-you-go pricing, fine-grained GPU usage, and rapid scaling. However, our analysis reveals that current serverless can effectively serve…

Machine Learning · Computer Science 2025-05-21 Yifan Sui, Hao Wang, Hanfei Yu, Yitao Hu, Jianxun Li, Hao Wang

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and…

Machine Learning · Computer Science 2024-11-26 Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica

LazyBatching: An SLA-aware Batching System for Cloud Machine Learning Inference

In cloud ML inference systems, batching is an essential technique to increase throughput which helps optimize total-cost-of-ownership. Prior graph batching combines the individual DNN graphs into a single one, allowing multiple inputs to be…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-27 Yujeong Choi, Yunseong Kim, Minsoo Rhu

Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems

The use of machine learning (ML) inference for various applications is growing drastically. ML inference services engage with users directly, requiring fast and accurate responses. Moreover, these services face dynamic workloads of…

Machine Learning · Computer Science 2023-05-22 Mehran Salmani, Saeid Ghafouri, Alireza Sanaee, Kamran Razavi, Max Mühlhäuser, Joseph Doyle, Pooyan Jamshidi, Mohsen Sharifi

Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching

The increasing adoption of large language models (LLMs) necessitates inference serving systems that can deliver both high throughput and low latency. Deploying LLMs with hundreds of billions of parameters on memory-constrained GPUs exposes…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-10 Bowen Pang, Kai Li, Feifan Wang

Optimizing Prediction Serving on Low-Latency Serverless Dataflow

Prediction serving systems are designed to provide large volumes of low-latency inferences from machine learning models. These systems mix data processing and computationally intensive model inference and benefit from multiple heterogeneous…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-14 Vikram Sreekanti, Harikaran Subbaraj, Chenggang Wu, Joseph E. Gonzalez, Joseph M. Hellerstein

SERFLOW: A Cross-Service Cost Optimization Framework for SLO-Aware Dynamic ML Inference

Dynamic offloading of Machine Learning (ML) model partitions across different resource orchestration services, such as Function-as-a-Service (FaaS) and Infrastructure-as-a-Service (IaaS), can balance processing and transmission delays while…

Machine Learning · Computer Science 2025-11-03 Zongshun Zhang, Ibrahim Matta

BrownoutServe: SLO-Aware Inference Serving under Bursty Workloads for MoE-based LLMs

In recent years, the Mixture-of-Experts (MoE) architecture has been widely applied to large language models (LLMs), providing a promising solution that activates only a subset of the model's parameters during computation, thereby reducing…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-10 Jianmin Hu, Minxian Xu, Kejiang Ye, Chengzhong Xu

Towards Designing a Self-Managed Machine Learning Inference Serving System in Public Cloud

We are witnessing an increasing trend towards using Machine Learning (ML) based prediction systems, spanning across different application domains, including product recommendation systems, personal assistant devices, facial recognition, etc.…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-24 Jashwant Raj Gunasekaran, Prashanth Thinakaran, Cyan Subhra Mishra, Mahmut Taylan Kandemir, Chita R. Das

Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements

There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into end-user applications…

Databases · Computer Science 2025-11-05 Yuting Yang, Tiancheng Yuan, Jamal Hashim, Thiago Garrett, Jeffrey Qian, Ann Zhang, Yifan Wang, Weijia Song, Ken Birman

AI-based Resource Allocation: Reinforcement Learning for Adaptive Auto-scaling in Serverless Environments

Serverless computing has emerged as a compelling new paradigm of cloud computing models in recent years. It promises the user services at large scale and low cost while eliminating the need for infrastructure management. On cloud provider…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-01 Lucia Schuler, Somaya Jamil, Niklas Kühl

Scalable and Cost-Efficient ML Inference: Parallel Batch Processing with Serverless Functions

As data-intensive applications grow, batch processing in limited-resource environments faces scalability and resource management challenges. Serverless computing offers a flexible alternative, enabling dynamic resource allocation and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-18 Amine Barrak, Emna Ksontini

Towards Resource-Efficient Serverless LLM Inference with SLINFER

The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-sized models and infrequent requests. While existing serverless solutions follow exclusive GPU allocation, we take a step back to explore…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-16 Chuhao Xu, Zijun Li, Quan Chen, Han Zhao, Xueyan Tang, Minyi Guo

100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models

Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries, enabling users to specify functions and conditions in SQL that are evaluated by LLMs, thereby broadening significantly the kinds of…

Databases · Computer Science 2026-04-16 Yeounoh Chung, Rushabh Desai, Jian He, Yu Xiao, Thibaud Hottelier, Yves-Laurent Kom Samo, Pushkar Khadilkar, Xianshun Chen, Sam Idicula, Fatma Özcan, Alon Halevy, Yannis Papakonstantinou

LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems

Modern deployment of large language models (LLMs) frequently involves both inference serving and continuous retraining to stay aligned with evolving data and user feedback. Common practices separate these workloads onto distinct servers in…

Artificial Intelligence · Computer Science 2025-07-30 Yufei Li, Zexin Li, Yinglun Zhu, Cong Liu

Serving deep learning models in a serverless platform

Serverless computing has emerged as a compelling paradigm for the development and deployment of a wide range of event based cloud applications. At the same time, cloud providers and enterprise companies are heavily adopting machine learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-13 Vatche Ishakian, Vinod Muthusamy, Aleksander Slominski

Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection

Selective parameter activation provided by Mixture-of-Expert (MoE) models has made them a popular choice in modern foundational models. However, MoEs face a fundamental tension when employed for serving. Batching, critical for performance…

Machine Learning · Computer Science 2026-05-20 Vima Gupta, Jae Hyung Ju, Kartik Sinha, Ada Gavrilovska, Anand Padmanabha Iyer

λScale: Enabling Fast Scaling for Serverless Large Language Model Inference

Serverless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-09 Minchen Yu, Rui Yang, Chaobo Jia, Zhaoyuan Su, Sheng Yao, Tingfeng Lan, Yuchen Yang, Zirui Wang, Yue Cheng, Wei Wang, Ao Wang, Ruichuan Chen

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-17 Akrit Mudvari, Yuang Jiang, Leandros Tassiulas

SLA-MORL: SLA-Aware Multi-Objective Reinforcement Learning for HPC Resource Optimization

Dynamic resource allocation for machine learning workloads in cloud environments remains challenging due to competing objectives of minimizing training time and operational costs while meeting Service Level Agreement (SLA) constraints.…

Machine Learning · Computer Science 2025-08-06 Seraj Al Mahmud Mostafa, Aravind Mohan, Jianwu Wang