Related papers: Cloud Native System for LLM Inference Serving

Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda

The rapid rise of Large Language Models (LLMs) has revolutionized various artificial intelligence (AI) applications, from natural language processing to code generation. However, the computational demands of these models, particularly in…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-21 Minxian Xu, Jingfeng Wu, Shengye Song, Satish Narayana Srirama, Bahman Javad, Rajiv Ranjan, Devki Nandan Jha, Sa Wang, Wenhong Tian, Huanle Xu, Li Li, Zizhao Mo, Shuo Ren, Thomas Kunz, Petar Kochovski, Vlado Stankovski, Kejiang Ye, Chengzhong Xu, Rajkumar Buyya

Large Language Models over Networks: Collaborative Intelligence under Resource Constraints

Large language models (LLMs) are transforming society, powering applications from smartphone assistants to autonomous driving. Yet cloud-based LLM services alone cannot serve a growing class of applications, including those operating under…

Signal Processing · Electrical Eng. & Systems 2026-05-12 Liangqi Yuan, Wenzhi Fang, Shiqiang Wang, H. Vincent Poor, Christopher G. Brinton

AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality

Large Language Model (LLM) inference on large-scale systems is expected to dominate future cloud infrastructures. Efficient LLM inference in cloud environments with numerous AI accelerators is challenging, necessitating extensive…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-11 Ilias Bournias, Lukas Cavigelli, Georgios Zacharopoulos

Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance

Large language models (LLMs) are known for their exceptional performance across a range of natural language processing tasks, but their deployment comes at a high computational and financial cost. On the other hand, smaller language models…

Computation and Language · Computer Science 2024-09-24 Adarsh MS, Jithin VG, Ditto PS

Towards Designing a Self-Managed Machine Learning Inference Serving System in Public Cloud

We are witnessing an increasing trend towards using Machine Learning (ML) based prediction systems, spanning across different application domains, including product recommendation systems, personal assistant devices, facial recognition, etc.…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-24 Jashwant Raj Gunasekaran, Prashanth Thinakaran, Cyan Subhra Mishra, Mahmut Taylan Kandemir, Chita R. Das

DeServe: Towards Affordable Offline LLM Inference via Decentralization

The rapid growth of generative AI and its integration into everyday workflows have significantly increased the demand for large language model (LLM) inference services. While proprietary models remain popular, recent advancements in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-28 Linyu Wu, Xiaoyuan Liu, Tianneng Shi, Zhe Ye, Dawn Song

Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud

This review report discusses the cold start latency in serverless inference and existing solutions. It particularly reviews the ServerlessLLM method, a system designed to address the cold start problem in serverless inference for large…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-26 Himel Ghosh

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-17 Akrit Mudvari, Yuang Jiang, Leandros Tassiulas

Resource Management Schemes for Cloud-Native Platforms with Computing Containers of Docker and Kubernetes

Businesses have made increasing adoption and incorporation of cloud technology into internal processes in the last decade. The cloud-based deployment provides on-demand availability without active management. More recently, the concept of…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-01 Ying Mao, Yuqi Fu, Suwen Gu, Wenrui Mu, Long Cheng, Qingzhi Liu

LLMs as On-demand Customizable Service

Large Language Models (LLMs) have demonstrated remarkable language understanding and generation capabilities. However, training, deploying, and accessing these models pose notable challenges, including resource-intensive demands, extended…

Computation and Language · Computer Science 2024-01-31 Souvika Sarkar, Mohammad Fakhruddin Babar, Monowar Hasan, Shubhra Kanti Karmaker

ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models

Large language models (LLMs) deliver impressive capabilities but incur substantial inference latency and cost, which hinders their deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference…

Artificial Intelligence · Computer Science 2026-03-24 Haoyu Qiao, Hao Zhang, Shanwen Mao, Siyao Cheng, Jie Liu

ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments

Recent developments in large language models (LLMs) have demonstrated their remarkable proficiency in a range of tasks. Compared to in-house homogeneous GPU clusters, deploying LLMs in cloud environments with diverse types of GPUs is…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-07 Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, Eiko Yoneki

A Survey of LLM Inference Systems

The past few years have witnessed specialized large language model (LLM) inference systems, such as vLLM, SGLang, Mooncake, and DeepFlow, alongside rapid LLM adoption via services like ChatGPT. Driving these system design efforts is the…

Databases · Computer Science 2025-06-30 James Pan, Guoliang Li

Collaborative Inference and Learning between Edge SLMs and Cloud LLMs: A Survey of Algorithms, Execution, and Open Challenges

As large language models (LLMs) evolve, deploying them solely in the cloud or compressing them for edge devices has become inadequate due to concerns about latency, privacy, cost, and personalization. This survey explores a collaborative…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-23 Senyao Li, Haozhao Wang, Wenchao Xu, Rui Zhang, Song Guo, Jingling Yuan, Xian Zhong, Tianwei Zhang, Ruixuan Li

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-12 Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He

Scalability Optimization in Cloud-Based AI Inference Services: Strategies for Real-Time Load Balancing and Automated Scaling

The rapid expansion of AI inference services in the cloud necessitates a robust scalability solution to manage dynamic workloads and maintain high performance. This study proposes a comprehensive scalability optimization framework for cloud…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-23 Yihong Jin, Ze Yang

CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration

Large Language Models (LLMs) exhibit remarkable human-like predictive capabilities. However, it is challenging to deploy LLMs to provide efficient and adaptive inference services at the edge. This paper proposes a novel Cloud-Edge…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-10 Hongpeng Jin, Yanzhao Wu

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Large Language Models (LLMs) demonstrate substantial potential across a diverse array of domains via request serving. However, as trends continue to push for expanding context sizes, the autoregressive nature of LLMs results in highly…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-08 Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin

A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase the inference cost by invoking the model…

Computation and Language · Computer Science 2025-11-27 Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, Jemin Lee

Cloud-Based AI Systems: Leveraging Large Language Models for Intelligent Fault Detection and Autonomous Self-Healing

With the rapid development of cloud computing systems and the increasing complexity of their infrastructure, intelligent mechanisms to detect and mitigate failures in real time are becoming increasingly important. Traditional methods of…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-20 Cheng Ji, Huaiying Luo