RTP-LLM: High-Performance Alibaba LLM Inference Engine

Authors: Boyu Tan, Jiarui Guo, Zongwei Lv, Hanbo Sun, Tong Yang, Kan Liu, Xinfei Shi, Zetao Hu, Yaxin Yu, Chi Zhang, Jianning Zhang, Xi Yang, Wei Zhang, Bo Cai, Silu Zhou, Xiyu Wang, Na He, Yinghao Yu, Wending Bao, Guiyang Huang, Yuxing Yuan, Juncheng Yin, Nan Wang, Lin Yang, Zechao Zhang, Lu Chen, Guoding Li, Tao Lan, Lin Qu

cs.OS · 2026-05 · v1


Abstract

Large Language Models (LLMs) have revolutionized AI applications, but deploying them at scale presents significant challenges. We present RTP-LLM, a high-performance inference engine for industrial-scale LLM deployment that is deployed across Alibaba Group and serves over 100 million users. RTP-LLM addresses fundamental bottlenecks through an integrated design. It optimizes model loading via file-order-driven I/O and parallel overlapping of I/O and communication. Its Prefill-Decode Disaggregation architecture decouples the compute-intensive prefill phase from the memory-bound decode phase and is combined with hierarchical, multi-tiered KV cache management that enables efficient cache reuse. In addition, RTP-LLM incorporates modular speculative decoding that supports multiple algorithms, adaptive KV cache quantization, and decoupled multimodal processing, with support for multi-level parallelism. We evaluate RTP-LLM across diverse model architectures (8B-235B parameters) using both controlled benchmarks and real production workloads. Compared with vLLM and SGLang, RTP-LLM achieves a 4.7x-6.3x model loading speedup, a 35-37% reduction in P95 TTFT latency with a 215% cache reuse improvement in production traffic scheduling, 1.12x-2.48x and 1.86x-2.52x throughput improvements in speculative decoding and multimodal inference, respectively, and a 35-40% batch latency reduction with a 1.9x-3.0x TTFT improvement in quantized inference. RTP-LLM's production-proven architecture and open-source availability make it a comprehensive solution for industrial LLM deployment.

Cite

@article{arxiv.2605.29639,
  title  = {RTP-LLM: High-Performance Alibaba LLM Inference Engine},
  author = {Boyu Tan and Jiarui Guo and Zongwei Lv and Hanbo Sun and Tong Yang and Kan Liu and Xinfei Shi and Zetao Hu and Yaxin Yu and Chi Zhang and Jianning Zhang and Xi Yang and Wei Zhang and Bo Cai and Silu Zhou and Xiyu Wang and Na He and Yinghao Yu and Wending Bao and Guiyang Huang and Yuxing Yuan and Juncheng Yin and Nan Wang and Lin Yang and Zechao Zhang and Lu Chen and Guoding Li and Tao Lan and Lin Qu},
  journal= {arXiv preprint arXiv:2605.29639},
  year   = {2026}
}
