Related papers: Optimizing Prediction Serving on Low-Latency Serve…
Serverless computing that runs functions with auto-scaling is a popular task execution pattern in the cloud-native era. By connecting serverless functions into workflows, tenants can achieve complex functionality. Prior researches adopt the…
Network traffic analysis increasingly uses complex machine learning models as the internet consolidates and traffic gets more encrypted. However, over high-bandwidth networks, flows can easily arrive faster than model inference rates. The…
In recent years Serverless Computing has emerged as a compelling cloud based model for the development of a wide range of data-intensive applications. However, rapid container provisioning introduces non-trivial challenges for FaaS cloud…
Large Language Models (LLMs) have revolutionized numerous domains, driving the rise of Language-Model-as-a-Service (LMaaS) platforms that process millions of queries daily. These platforms must minimize latency and meet Service Level…
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and…
Machine Learning models are often composed of pipelines of transformations. While this design allows to efficiently execute single model components at training time, prediction serving has different requirements such as low latency, high…
We are witnessing an increasing trend towardsusing Machine Learning (ML) based prediction systems, span-ning across different application domains, including productrecommendation systems, personal assistant devices, facialrecognition, etc.…
Serverless computing has emerged as a compelling new paradigm of cloud computing models in recent years. It promises the user services at large scale and low cost while eliminating the need for infrastructure management. On cloud provider…
Configuring a storage system to better serve an application is a challenging task complicated by a multidimensional, discrete configuration space and the high cost of space exploration (e.g., by running the application with different…
This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). By harnessing the substantial near-GPU storage and memory capacities of inference servers,…
Serverless computing has transformed cloud application deployment by introducing a fine-grained, event-driven execution model that abstracts away infrastructure management. Its on-demand nature makes it especially appealing for…
Serverless computing eliminates infrastructure management overhead but introduces significant challenges regarding cold start latency and resource utilization. Traditional static resource allocation often leads to inefficiencies under…
Computing servers have played a key role in developing and processing emerging compute-intensive applications in recent years. Consolidating multiple virtual machines (VMs) inside one server to run various applications introduces severe…
Recent developments in Generative AI, Computer Vision, and Natural Language Processing have led to an increased integration of AI models into various products. This widespread adoption of AI requires significant efforts in deploying these…
Serverless computing has gained popularity in edge computing due to its flexible features, including the pay-per-use pricing model, auto-scaling capabilities, and multi-tenancy support. Complex Serverless-based applications typically rely…
The recent advances in LLMs bring a strong demand for efficient system support to improve overall serving efficiency. As LLM inference scales towards multiple GPUs and even multiple compute nodes, various coordination patterns, such as…
The rapid expansion of AI inference services in the cloud necessitates a robust scalability solution to manage dynamic workloads and maintain high performance. This study proposes a comprehensive scalability optimization framework for cloud…
This review report discusses the cold start latency in serverless inference and existing solutions. It particularly reviews the ServerlessLLM method, a system designed to address the cold start problem in serverless inference for large…
Serverless computing has gained a strong traction in the cloud computing community in recent years. Among the many benefits of this novel computing model, the rapid auto-scaling capability of user applications takes prominence. However, the…
In machine learning (ML), the inference phase is the process of applying pre-trained models to new, unseen data with the objective of making predictions. During the inference phase, end-users interact with ML services to gain insights,…