λScale: Enabling Fast Scaling for Serverless Large Language Model Inference

Minchen Yu; Rui Yang; Chaobo Jia; Zhaoyuan Su; Sheng Yao; Tingfeng Lan; Yuchen Yang; Zirui Wang; Yue Cheng; Wei Wang; Ao Wang; Ruichuan Chen

Distributed, Parallel, and Cluster Computing 2026-03-09 v3

Abstract

Serverless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup overhead. This poses a significant challenge in efficiently scaling model instances to accommodate dynamic, bursty workloads commonly observed in real-world inference services. In this paper, we introduce λScale, an efficient serverless inference system to achieve fast model scaling. The key idea behind λScale is to leverage high-speed RDMA networks between GPU nodes for fast model multicast, while enabling distributed inference execution during model transmission -- referred to as "execute-while-load". λScale proposes an efficient model scaling scheme, λPipe, which supports adaptive model multicast and dynamically constructs execution pipelines across receiving nodes for collaborative, distributed inference. Additionally, λScale supports efficient model management across GPU and host memory, allowing fast scaling for models across different storage tiers. Evaluation results show that λScale enables fast model scaling and effectively handles load spikes, achieving up to 5x tail-latency improvement and 31.3% cost reduction compared to state-of-the-art solutions on real-world LLM inference traces.
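
The "execute-while-load" idea can be illustrated with a toy model. The sketch below is not the paper's algorithm: the rotated block schedule, the one-block-per-step transfer rate, and the function names `rotated_schedule`, `first_pipeline_step`, and `first_replica_step` are all assumptions made for illustration. It only compares the earliest step at which the receiving nodes collectively hold every model block (so a cross-node pipeline could in principle be formed) with the earliest step at which any single node holds a full replica.

```python
from typing import Dict, List, Optional, Set, Tuple

# One arrival event: (step, node_id, block_id). Hypothetical representation.
Arrival = Tuple[int, int, int]


def rotated_schedule(num_nodes: int, num_blocks: int) -> List[Arrival]:
    """Assumed toy schedule: each node receives one block per step,
    and each node starts from a different offset into the model."""
    stride = max(1, num_blocks // num_nodes)
    return [
        (step, node, (node * stride + step - 1) % num_blocks)
        for step in range(1, num_blocks + 1)
        for node in range(num_nodes)
    ]


def first_pipeline_step(arrivals: List[Arrival], num_blocks: int) -> Optional[int]:
    """Earliest step at which the nodes together hold every block."""
    seen: Set[int] = set()
    for step, _node, block in sorted(arrivals):
        seen.add(block)
        if len(seen) == num_blocks:
            return step
    return None


def first_replica_step(arrivals: List[Arrival], num_blocks: int) -> Optional[int]:
    """Earliest step at which some single node holds every block."""
    held: Dict[int, Set[int]] = {}
    for step, node, block in sorted(arrivals):
        blocks = held.setdefault(node, set())
        blocks.add(block)
        if len(blocks) == num_blocks:
            return step
    return None


if __name__ == "__main__":
    schedule = rotated_schedule(num_nodes=4, num_blocks=8)
    print("pipeline possible at step:", first_pipeline_step(schedule, 8))
    print("first full replica at step:", first_replica_step(schedule, 8))
```

Under these assumptions, with 4 nodes and 8 blocks the nodes jointly cover the model after 2 steps, while the first full replica appears only at step 8. The gap between the two is the window in which distributed execution during transmission could serve requests.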

Keywords

large language model inference, serverless computing, key-value cache

Cite

@article{arxiv.2502.09922,
  title  = {{$\lambda$}Scale: Enabling Fast Scaling for Serverless Large Language Model Inference},
  author = {Minchen Yu and Rui Yang and Chaobo Jia and Zhaoyuan Su and Sheng Yao and Tingfeng Lan and Yuchen Yang and Zirui Wang and Yue Cheng and Wei Wang and Ao Wang and Ruichuan Chen},
  journal= {arXiv preprint arXiv:2502.09922},
  year   = {2026}
}
