Related papers: Multi-model Machine Learning Inference Serving wit…
Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may compromise latency-sensitive scheduling, as…
In the context of Machine Learning as a Service (MLaaS) clouds, the extensive use of Large Language Models (LLMs) often requires efficient management of significant query loads. When providing real-time inference services, several…
In cloud machine learning (ML) inference systems, providing low latency to end-users is of utmost importance. However, maximizing server utilization and system throughput is also crucial for ML service providers as it helps lower the…
Recent developments in Generative AI, Computer Vision, and Natural Language Processing have led to an increased integration of AI models into various products. This widespread adoption of AI requires significant efforts in deploying these…
The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-sized models and infrequent requests. While existing serverless solutions follow exclusive GPU allocation, we take a step back to explore…
Large language models have demonstrated extraordinary performance in many AI tasks but are expensive to use, even after training, due to their requirement of high-end GPUs. Recently, a distributed system called PETALS was developed to lower…
Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…
Serving deep neural networks in latency critical interactive settings often requires GPU acceleration. However, the small batch sizes typical in online inference results in poor GPU utilization, a potential performance gap which GPU…
In cloud environments, GPU-based deep neural network (DNN) inference servers are required to meet the Service Level Objective (SLO) latency for each workload under a specified request rate, while also minimizing GPU resource consumption.…
Modern deployment of large language models (LLMs) frequently involves both inference serving and continuous retraining to stay aligned with evolving data and user feedback. Common practices separate these workloads onto distinct servers in…
Modern applications increasingly rely on inference serving systems to provide low-latency insights with a diverse set of machine learning models. Existing systems often utilize resource elasticity to scale with demand. However, many…
Machine learning (ML) inference serving systems host deep neural network (DNN) models and schedule incoming inference requests across deployed GPUs. However, limited support for task prioritization and insufficient latency estimation under…
The increasing adoption of large language models (LLMs) necessitates inference serving systems that can deliver both high throughput and low latency. Deploying LLMs with hundreds of billions of parameters on memory-constrained GPUs exposes…
Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are…
The rapid growth of generative AI and its integration into everyday workflows have significantly increased the demand for large language model (LLM) inference services. While proprietary models remain popular, recent advancements in…
As AI inference becomes mainstream, research has begun to focus on improving the energy consumption of inference servers. Inference kernels commonly underutilize a GPU's compute resources and waste power from idling components. To improve…
Mixture of Experts (MoE) LLMs, characterized by their sparse activation patterns, offer a promising approach to scaling language models while avoiding proportionally increasing the inference cost. However, their large parameter sizes…
The rapid adoption of machine learning (ML) has underscored the importance of serving ML models with high throughput and resource efficiency. Traditional approaches to managing increasing query demands have predominantly focused on hardware…
Deep Learning (DL) models have achieved superior performance. Meanwhile, computing hardware like NVIDIA GPUs also demonstrated strong computing scaling trends with 2x throughput and memory bandwidth for each generation. With such strong…
Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes--that load entire models to a single server--are unable to support this scale. One approach to support this scale is with distributed serving,…