Related papers: {\lambda}Scale: Enabling Fast Scaling for Serverle…

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). By harnessing the substantial near-GPU storage and memory capacities of inference servers,…

Machine Learning · Computer Science 2024-07-26 Yao Fu , Leyang Xue , Yeqi Huang , Andrei-Octavian Brabete , Dmitrii Ustiugov , Yuvraj Patel , Luo Mai

Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud

This review report discusses the cold start latency in serverless inference and existing solutions. It particularly reviews the ServerlessLLM method, a system designed to address the cold start problem in serverless inference for large…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-26 Himel Ghosh

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Large language models~(LLMs) are known for their high demand on computing resources and memory due to their substantial model size, which leads to inefficient inference on moderate GPU systems. Techniques like quantization or pruning can…

Computational Engineering, Finance, and Science · Computer Science 2024-11-26 Wenxiang Lin , Xinglin Pan , Shaohuai Shi , Xuan Wang , Xiaowen Chu

DeepServe: Serverless Large Language Model Serving at Scale

In this paper, we propose DEEPSERVE, a scalable and serverless AI platform designed to efficiently serve large language models (LLMs) at scale in cloud environments. DEEPSERVE addresses key challenges such as resource allocation, serving…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-10 Junhao Hu , Jiang Xu , Zhixia Liu , Yulong He , Yuetao Chen , Hao Xu , Jiang Liu , Jie Meng , Baoquan Zhang , Shining Wan , Gengyuan Dan , Zhiyu Dong , Zhihao Ren , Changhong Liu , Tao Xie , Dayun Lin , Qin Zhang , Yue Yu , Hao Feng , Xusheng Chen , Yizhou Shan

ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs

Serverless computing has grown rapidly for serving Large Language Model (LLM) inference due to its pay-as-you-go pricing, fine-grained GPU usage, and rapid scaling. However, our analysis reveals that current serverless can effectively serve…

Machine Learning · Computer Science 2025-05-21 Yifan Sui , Hao Wang , Hanfei Yu , Yitao Hu , Jianxun Li , Hao Wang

RAPID-LLM: Resilience-Aware Performance analysis of Infrastructure for Distributed LLM Training and Inference

RAPID-LLM is a unified performance modeling framework for large language model (LLM) training and inference on GPU clusters. It couples a DeepFlow-based frontend that generates hardware-aware, operator-level Chakra execution traces from an…

Performance · Computer Science 2025-12-23 George Karfakis , Faraz Tahmasebi , Binglu Chen , Lime Yao , Saptarshi Mitra , Tianyue Pan , Hyoukjun Kwon , Puneet Gupta

LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale

Recently, there has been an extensive research effort in building efficient large language model (LLM) inference serving systems. These efforts not only include innovations in the algorithm and software domains but also constitute…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-02 Jaehong Cho , Minsu Kim , Hyunmin Choi , Guseul Heo , Jongse Park

UELLM: A Unified and Efficient Approach for LLM Inference Serving

In the context of Machine Learning as a Service (MLaaS) clouds, the extensive use of Large Language Models (LLMs) often requires efficient management of significant query loads. When providing real-time inference services, several…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-25 Yiyuan He , Minxian Xu , Jingfeng Wu , Wanyi Zheng , Kejiang Ye , Chengzhong Xu

DeServe: Towards Affordable Offline LLM Inference via Decentralization

The rapid growth of generative AI and its integration into everyday workflows have significantly increased the demand for large language model (LLM) inference services. While proprietary models remain popular, recent advancements in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-28 Linyu Wu , Xiaoyuan Liu , Tianneng Shi , Zhe Ye , Dawn Song

FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication

Serverless computing offers attractive scalability, elasticity and cost-effectiveness. However, constraints on memory, CPU and function runtime have hindered its adoption for data-intensive applications and machine learning (ML) workloads.…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-25 Joe Oakley , Hakan Ferhatosmanoglu

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-17 Akrit Mudvari , Yuang Jiang , Leandros Tassiulas

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-12 Yuhang Yao , Han Jin , Alay Dilipbhai Shah , Shanshan Han , Zijian Hu , Yide Ran , Dimitris Stripelis , Zhaozhuo Xu , Salman Avestimehr , Chaoyang He

Understanding Capacity-Driven Scale-Out Neural Recommendation Inference

Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes--that load entire models to a single server--are unable to support this scale. One approach to support this scale is with distributed serving,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-13 Michael Lui , Yavuz Yetim , Özgür Özkan , Zhuoran Zhao , Shin-Yeh Tsai , Carole-Jean Wu , Mark Hempstead

AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality

Large Language Model (LLM) inference on large-scale systems is expected to dominate future cloud infrastructures. Efficient LLM inference in cloud environments with numerous AI accelerators is challenging, necessitating extensive…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-11 Ilias Bournias , Lukas Cavigelli , Georgios Zacharopoulos

Fast Distributed Inference Serving for Large Language Models

Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use…

Machine Learning · Computer Science 2024-09-26 Bingyang Wu , Yinmin Zhong , Zili Zhang , Shengyu Liu , Fangyue Liu , Yuanhang Sun , Gang Huang , Xuanzhe Liu , Xin Jin

LIME:Accelerating Collaborative Lossless LLM Inference on Memory-Constrained Edge Devices

Large language models (LLMs) have emerged as a powerful foundation for intelligent reasoning and decision-making, demonstrating substantial impact across a wide range of domains and applications. However, their massive parameter scales and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-29 Mingyu Sun , Xiao Zhang , Shen Qu , Yan Li , Mengbai Xiao , Yuan Yuan , Dongxiao Yu

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Large Language Models (LLMs) demonstrate substantial potential across a diverse array of domains via request serving. However, as trends continue to push for expanding context sizes, the autoregressive nature of LLMs results in highly…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-08 Bin Lin , Chen Zhang , Tao Peng , Hanyu Zhao , Wencong Xiao , Minmin Sun , Anmin Liu , Zhipeng Zhang , Lanbo Li , Xiafei Qiu , Shen Li , Zhigang Ji , Tao Xie , Yong Li , Wei Lin

EdgeShard: Efficient LLM Inference via Collaborative Edge Computing

Large language models (LLMs) have shown great potential in natural language processing and content generation. However, current LLMs heavily rely on cloud computing, leading to prolonged latency, high bandwidth cost, and privacy concerns.…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-24 Mingjin Zhang , Jiannong Cao , Xiaoming Shen , Zeyang Cui

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of…

Hardware Architecture · Computer Science 2024-07-23 Joyjit Kundu , Wenzhe Guo , Ali BanaGozar , Udari De Alwis , Sourav Sengupta , Puneet Gupta , Arindam Mallik

Improving the End-to-End Efficiency of Offline Inference for Multi-LLM Applications Based on Sampling and Simulation

As large language models (LLMs) have shown great success in many tasks, they are used in various applications. While a lot of works have focused on the efficiency of single-LLM application (e.g., offloading, request scheduling, parallelism…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-24 Jingzhi Fang , Yanyan Shen , Yue Wang , Lei Chen