Related papers: MLProxy: SLA-Aware Reverse Proxy for Machine Learn…
Serverless computing has grown rapidly for serving Large Language Model (LLM) inference due to its pay-as-you-go pricing, fine-grained GPU usage, and rapid scaling. However, our analysis reveals that current serverless can effectively serve…
Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and…
In cloud ML inference systems, batching is an essential technique to increase throughput which helps optimize total-cost-of-ownership. Prior graph batching combines the individual DNN graphs into a single one, allowing multiple inputs to be…
The use of machine learning (ML) inference for various applications is growing drastically. ML inference services engage with users directly, requiring fast and accurate responses. Moreover, these services face dynamic workloads of…
The increasing adoption of large language models (LLMs) necessitates inference serving systems that can deliver both high throughput and low latency. Deploying LLMs with hundreds of billions of parameters on memory-constrained GPUs exposes…
Prediction serving systems are designed to provide large volumes of low-latency inferences machine learning models. These systems mix data processing and computationally intensive model inference and benefit from multiple heterogeneous…
Dynamic offloading of Machine Learning (ML) model partitions across different resource orchestration services, such as Function-as-a-Service (FaaS) and Infrastructure-as-a-Service (IaaS), can balance processing and transmission delays while…
In recent years, the Mixture-of-Experts (MoE) architecture has been widely applied to large language models (LLMs), providing a promising solution that activates only a subset of the model's parameters during computation, thereby reducing…
We are witnessing an increasing trend towardsusing Machine Learning (ML) based prediction systems, span-ning across different application domains, including productrecommendation systems, personal assistant devices, facialrecognition, etc.…
There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into a end-user applications…
Serverless computing has emerged as a compelling new paradigm of cloud computing models in recent years. It promises the user services at large scale and low cost while eliminating the need for infrastructure management. On cloud provider…
As data-intensive applications grow, batch processing in limited-resource environments faces scalability and resource management challenges. Serverless computing offers a flexible alternative, enabling dynamic resource allocation and…
The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-sized models and infrequent requests. While existing serverless solutions follow exclusive GPU allocation, we take a step back to explore…
Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries, enabling users to specify functions and conditions in SQL that are evaluated by LLMs, thereby broadening significantly the kinds of…
Modern deployment of large language models (LLMs) frequently involves both inference serving and continuous retraining to stay aligned with evolving data and user feedback. Common practices separate these workloads onto distinct servers in…
Serverless computing has emerged as a compelling paradigm for the development and deployment of a wide range of event based cloud applications. At the same time, cloud providers and enterprise companies are heavily adopting machine learning…
Selective parameter activation provided by Mixture-of-Expert (MoE) models have made them a popular choice in modern foundational models. However, MoEs face a fundamental tension when employed for serving. Batching, critical for performance…
Serverless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup…
Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…
Dynamic resource allocation for machine learning workloads in cloud environments remains challenging due to competing objectives of minimizing training time and operational costs while meeting Service Level Agreement (SLA) constraints.…