Related papers: MLProxy: SLA-Aware Reverse Proxy for Machine Learn…

200 papers

Serverless computing has grown rapidly for serving Large Language Model (LLM) inference due to its pay-as-you-go pricing, fine-grained GPU usage, and rapid scaling. However, our analysis reveals that current serverless can effectively serve…

Machine Learning · Computer Science 2025-05-21 Yifan Sui, Hao Wang, Hanfei Yu, Yitao Hu, Jianxun Li, Hao Wang

Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and…

Machine Learning · Computer Science 2024-11-26 Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica

In cloud ML inference systems, batching is an essential technique to increase throughput which helps optimize total-cost-of-ownership. Prior graph batching combines the individual DNN graphs into a single one, allowing multiple inputs to be…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-27 Yujeong Choi, Yunseong Kim, Minsoo Rhu

The use of machine learning (ML) inference for various applications is growing drastically. ML inference services engage with users directly, requiring fast and accurate responses. Moreover, these services face dynamic workloads of…

The increasing adoption of large language models (LLMs) necessitates inference serving systems that can deliver both high throughput and low latency. Deploying LLMs with hundreds of billions of parameters on memory-constrained GPUs exposes…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-10 Bowen Pang, Kai Li, Feifan Wang

Prediction serving systems are designed to provide large volumes of low-latency inferences from machine learning models. These systems mix data processing and computationally intensive model inference and benefit from multiple heterogeneous…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-14 Vikram Sreekanti, Harikaran Subbaraj, Chenggang Wu, Joseph E. Gonzalez, Joseph M. Hellerstein

Dynamic offloading of Machine Learning (ML) model partitions across different resource orchestration services, such as Function-as-a-Service (FaaS) and Infrastructure-as-a-Service (IaaS), can balance processing and transmission delays while…

Machine Learning · Computer Science 2025-11-03 Zongshun Zhang, Ibrahim Matta

In recent years, the Mixture-of-Experts (MoE) architecture has been widely applied to large language models (LLMs), providing a promising solution that activates only a subset of the model's parameters during computation, thereby reducing…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-10 Jianmin Hu, Minxian Xu, Kejiang Ye, Chengzhong Xu

We are witnessing an increasing trend towards using Machine Learning (ML) based prediction systems, spanning across different application domains, including product recommendation systems, personal assistant devices, facial recognition, etc.…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-24 Jashwant Raj Gunasekaran, Prashanth Thinakaran, Cyan Subhra Mishra, Mahmut Taylan Kandemir, Chita R. Das

There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into end-user applications…

Serverless computing has emerged as a compelling new paradigm of cloud computing models in recent years. It promises the user services at large scale and low cost while eliminating the need for infrastructure management. On cloud provider…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-01 Lucia Schuler, Somaya Jamil, Niklas Kühl

As data-intensive applications grow, batch processing in limited-resource environments faces scalability and resource management challenges. Serverless computing offers a flexible alternative, enabling dynamic resource allocation and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-18 Amine Barrak, Emna Ksontini

The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-sized models and infrequent requests. While existing serverless solutions follow exclusive GPU allocation, we take a step back to explore…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-16 Chuhao Xu, Zijun Li, Quan Chen, Han Zhao, Xueyan Tang, Minyi Guo

Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries, enabling users to specify functions and conditions in SQL that are evaluated by LLMs, thereby broadening significantly the kinds of…

Modern deployment of large language models (LLMs) frequently involves both inference serving and continuous retraining to stay aligned with evolving data and user feedback. Common practices separate these workloads onto distinct servers in…

Artificial Intelligence · Computer Science 2025-07-30 Yufei Li, Zexin Li, Yinglun Zhu, Cong Liu

Serverless computing has emerged as a compelling paradigm for the development and deployment of a wide range of event based cloud applications. At the same time, cloud providers and enterprise companies are heavily adopting machine learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-13 Vatche Ishakian, Vinod Muthusamy, Aleksander Slominski

Selective parameter activation provided by Mixture-of-Expert (MoE) models has made them a popular choice in modern foundational models. However, MoEs face a fundamental tension when employed for serving. Batching, critical for performance…

Machine Learning · Computer Science 2026-05-20 Vima Gupta, Jae Hyung Ju, Kartik Sinha, Ada Gavrilovska, Anand Padmanabha Iyer

Serverless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-09 Minchen Yu, Rui Yang, Chaobo Jia, Zhaoyuan Su, Sheng Yao, Tingfeng Lan, Yuchen Yang, Zirui Wang, Yue Cheng, Wei Wang, Ao Wang, Ruichuan Chen

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-17 Akrit Mudvari, Yuang Jiang, Leandros Tassiulas

Dynamic resource allocation for machine learning workloads in cloud environments remains challenging due to competing objectives of minimizing training time and operational costs while meeting Service Level Agreement (SLA) constraints.…

Machine Learning · Computer Science 2025-08-06 Seraj Al Mahmud Mostafa, Aravind Mohan, Jianwu Wang