Related papers: λScale: Enabling Fast Scaling for Serverle…

200 papers

This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). By harnessing the substantial near-GPU storage and memory capacities of inference servers,…

Machine Learning · Computer Science 2024-07-26 Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai

This review report discusses the cold start latency in serverless inference and existing solutions. It particularly reviews the ServerlessLLM method, a system designed to address the cold start problem in serverless inference for large…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-26 Himel Ghosh

Large language models (LLMs) are known for their high demand on computing resources and memory due to their substantial model size, which leads to inefficient inference on moderate GPU systems. Techniques like quantization or pruning can…

Computational Engineering, Finance, and Science · Computer Science 2024-11-26 Wenxiang Lin, Xinglin Pan, Shaohuai Shi, Xuan Wang, Xiaowen Chu

In this paper, we propose DEEPSERVE, a scalable and serverless AI platform designed to efficiently serve large language models (LLMs) at scale in cloud environments. DEEPSERVE addresses key challenges such as resource allocation, serving…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-10 Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Jie Meng, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, Yizhou Shan

Serverless computing has grown rapidly for serving Large Language Model (LLM) inference due to its pay-as-you-go pricing, fine-grained GPU usage, and rapid scaling. However, our analysis reveals that current serverless can effectively serve…

Machine Learning · Computer Science 2025-05-21 Yifan Sui, Hao Wang, Hanfei Yu, Yitao Hu, Jianxun Li, Hao Wang

RAPID-LLM is a unified performance modeling framework for large language model (LLM) training and inference on GPU clusters. It couples a DeepFlow-based frontend that generates hardware-aware, operator-level Chakra execution traces from an…

Recently, there has been an extensive research effort in building efficient large language model (LLM) inference serving systems. These efforts not only include innovations in the algorithm and software domains but also constitute…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-02 Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park

In the context of Machine Learning as a Service (MLaaS) clouds, the extensive use of Large Language Models (LLMs) often requires efficient management of significant query loads. When providing real-time inference services, several…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-25 Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu

The rapid growth of generative AI and its integration into everyday workflows have significantly increased the demand for large language model (LLM) inference services. While proprietary models remain popular, recent advancements in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-28 Linyu Wu, Xiaoyuan Liu, Tianneng Shi, Zhe Ye, Dawn Song

Serverless computing offers attractive scalability, elasticity and cost-effectiveness. However, constraints on memory, CPU and function runtime have hindered its adoption for data-intensive applications and machine learning (ML) workloads.…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-25 Joe Oakley, Hakan Ferhatosmanoglu

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-17 Akrit Mudvari, Yuang Jiang, Leandros Tassiulas

Large language models (LLMs) have surged in popularity and are extensively used in commercial applications, where the efficiency of model serving is crucial for the user experience. Most current research focuses on optimizing individual…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-12 Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He

Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes--that load entire models to a single server--are unable to support this scale. One approach to support this scale is with distributed serving,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-13 Michael Lui, Yavuz Yetim, Özgür Özkan, Zhuoran Zhao, Shin-Yeh Tsai, Carole-Jean Wu, Mark Hempstead

Large Language Model (LLM) inference on large-scale systems is expected to dominate future cloud infrastructures. Efficient LLM inference in cloud environments with numerous AI accelerators is challenging, necessitating extensive…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-11 Ilias Bournias, Lukas Cavigelli, Georgios Zacharopoulos

Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use…

Machine Learning · Computer Science 2024-09-26 Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, Xin Jin

Large language models (LLMs) have emerged as a powerful foundation for intelligent reasoning and decision-making, demonstrating substantial impact across a wide range of domains and applications. However, their massive parameter scales and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-29 Mingyu Sun, Xiao Zhang, Shen Qu, Yan Li, Mengbai Xiao, Yuan Yuan, Dongxiao Yu

Large Language Models (LLMs) demonstrate substantial potential across a diverse array of domains via request serving. However, as trends continue to push for expanding context sizes, the autoregressive nature of LLMs results in highly…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-08 Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin

Large language models (LLMs) have shown great potential in natural language processing and content generation. However, current LLMs heavily rely on cloud computing, leading to prolonged latency, high bandwidth cost, and privacy concerns.…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-24 Mingjin Zhang, Jiannong Cao, Xiaoming Shen, Zeyang Cui

Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of…

Hardware Architecture · Computer Science 2024-07-23 Joyjit Kundu, Wenzhe Guo, Ali BanaGozar, Udari De Alwis, Sourav Sengupta, Puneet Gupta, Arindam Mallik

As large language models (LLMs) have shown great success in many tasks, they are used in various applications. While much work has focused on the efficiency of single-LLM applications (e.g., offloading, request scheduling, parallelism…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-24 Jingzhi Fang, Yanyan Shen, Yue Wang, Lei Chen