English
Related papers

Related papers: Cloud Native System for LLM Inference Serving

200 papers

The rapid rise of Large Language Models (LLMs) has revolutionized various artificial intelligence (AI) applications, from natural language processing to code generation. However, the computational demands of these models, particularly in…

Large language models (LLMs) are transforming society, powering applications from smartphone assistants to autonomous driving. Yet cloud-based LLM services alone cannot serve a growing class of applications, including those operating under…

Signal Processing · Electrical Eng. & Systems 2026-05-12 Liangqi Yuan , Wenzhi Fang , Shiqiang Wang , H. Vincent Poor , Christopher G. Brinton

Large Language Model (LLM) inference on large-scale systems is expected to dominate future cloud infrastructures. Efficient LLM inference in cloud environments with numerous AI accelerators is challenging, necessitating extensive…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-11 Ilias Bournias , Lukas Cavigelli , Georgios Zacharopoulos

Large language models (LLMs) are known for their exceptional performance across a range of natural language processing tasks, but their deployment comes at a high computational and financial cost. On the other hand, smaller language models…

Computation and Language · Computer Science 2024-09-24 Adarsh MS , Jithin VG , Ditto PS

We are witnessing an increasing trend towardsusing Machine Learning (ML) based prediction systems, span-ning across different application domains, including productrecommendation systems, personal assistant devices, facialrecognition, etc.…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-24 Jashwant Raj Gunasekaran , Prashanth Thinakaran , Cyan Subhra Mishra , Mahmut Taylan Kandemir , Chita R. Das

The rapid growth of generative AI and its integration into everyday workflows have significantly increased the demand for large language model (LLM) inference services. While proprietary models remain popular, recent advancements in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-28 Linyu Wu , Xiaoyuan Liu , Tianneng Shi , Zhe Ye , Dawn Song

This review report discusses the cold start latency in serverless inference and existing solutions. It particularly reviews the ServerlessLLM method, a system designed to address the cold start problem in serverless inference for large…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-26 Himel Ghosh

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-17 Akrit Mudvari , Yuang Jiang , Leandros Tassiulas

Businesses have made increasing adoption and incorporation of cloud technology into internal processes in the last decade. The cloud-based deployment provides on-demand availability without active management. More recently, the concept of…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-01 Ying Mao , Yuqi Fu , Suwen Gu , Wenrui Mu , Long Cheng , Qingzhi Liu

Large Language Models (LLMs) have demonstrated remarkable language understanding and generation capabilities. However, training, deploying, and accessing these models pose notable challenges, including resource-intensive demands, extended…

Computation and Language · Computer Science 2024-01-31 Souvika Sarkar , Mohammad Fakhruddin Babar , Monowar Hasan , Shubhra Kanti Karmaker

Large language models (LLMs) deliver impressive capabilities but incur substantial inference latency and cost, which hinders their deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference…

Artificial Intelligence · Computer Science 2026-03-24 Haoyu Qiao , Hao Zhang , Shanwen Mao , Siyao Cheng , Jie Liu

Recent developments in large language models (LLMs) have demonstrated their remarkable proficiency in a range of tasks. Compared to in-house homogeneous GPU clusters, deploying LLMs in cloud environments with diverse types of GPUs is…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-07 Youhe Jiang , Fangcheng Fu , Xiaozhe Yao , Taiyi Wang , Bin Cui , Ana Klimovic , Eiko Yoneki

The past few years has witnessed specialized large language model (LLM) inference systems, such as vLLM, SGLang, Mooncake, and DeepFlow, alongside rapid LLM adoption via services like ChatGPT. Driving these system design efforts is the…

Databases · Computer Science 2025-06-30 James Pan , Guoliang Li

As large language models (LLMs) evolve, deploying them solely in the cloud or compressing them for edge devices has become inadequate due to concerns about latency, privacy, cost, and personalization. This survey explores a collaborative…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-23 Senyao Li , Haozhao Wang , Wenchao Xu , Rui Zhang , Song Guo , Jingling Yuan , Xian Zhong , Tianwei Zhang , Ruixuan Li

Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-12 Yuhang Yao , Han Jin , Alay Dilipbhai Shah , Shanshan Han , Zijian Hu , Yide Ran , Dimitris Stripelis , Zhaozhuo Xu , Salman Avestimehr , Chaoyang He

The rapid expansion of AI inference services in the cloud necessitates a robust scalability solution to manage dynamic workloads and maintain high performance. This study proposes a comprehensive scalability optimization framework for cloud…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-23 Yihong Jin , Ze Yang

Large Language Models (LLMs) exhibit remarkable human-like predictive capabilities. However, it is challenging to deploy LLMs to provide efficient and adaptive inference services at the edge. This paper proposes a novel Cloud-Edge…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-10 Hongpeng Jin , Yanzhao Wu

Large Language Models (LLMs) demonstrate substantial potential across a diverse array of domains via request serving. However, as trends continue to push for expanding context sizes, the autoregressive nature of LLMs results in highly…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-08 Bin Lin , Chen Zhang , Tao Peng , Hanyu Zhao , Wencong Xiao , Minmin Sun , Anmin Liu , Zhipeng Zhang , Lanbo Li , Xiafei Qiu , Shen Li , Zhigang Ji , Tao Xie , Yong Li , Wei Lin

Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workload such as chain-of-throught, complex reasoning, agent services significantly increase the inference cost by invoke the model…

Computation and Language · Computer Science 2025-11-27 Sihyeong Park , Sungryeol Jeon , Chaelyn Lee , Seokhun Jeon , Byung-Soo Kim , Jemin Lee

With the rapid development of cloud computing systems and the increasing complexity of their infrastructure, intelligent mechanisms to detect and mitigate failures in real time are becoming increasingly important. Traditional methods of…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-20 Cheng Ji , Huaiying Luo
‹ Prev 1 2 3 10 Next ›