Related papers: Efficiently Serving Large Multimodal Models Using …

EPD-Serve: A Flexible Multimodal EPD Disaggregation Inference Serving System On Ascend

With the widespread adoption of large multimodal models, efficient inference across text, image, audio, and video modalities has become critical. However, existing multimodal inference systems typically employ monolithic architectures that…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-21 Fan Bai, Pai Peng, Zhengzhi Tang, Zhe Wang, Gong Chen, Xiang Lu, Yinuo Li, Huan Lin, Weizhe Lin, Yaoyuan Wang, Xiaosong Li

Efficient Multi-round LLM Inference over Disaggregated Serving

With the rapid evolution of Large Language Models (LLMs), multi-round workflows, such as autonomous agents and iterative retrieval, have become increasingly prevalent. However, this raises hurdles for serving LLMs under prefill-decode (PD)…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-17 Wenhao He, Youhe Jiang, Penghao Zhao, Quanqing Xu, Eiko Yoneki, Bin Cui, Fangcheng Fu

DOPD: A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving

To meet strict Service-Level Objectives (SLOs), contemporary Large Language Models (LLMs) decouple the prefill and decoding stages and place them on separate GPUs to mitigate the distinct bottlenecks inherent to each phase. However, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-10 Junhan Liao, Minxian Xu, Wanyi Zheng, Yan Wang, Kejiang Ye, Rajkumar Buyya, Chengzhong Xu

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

DistServe improves the performance of large language models (LLMs) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the computation of prefill and decoding across…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-07 Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang

semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage

Existing large language model (LLM) serving systems fall into two categories: 1) a unified system where prefill phase and decode phase are co-located on the same GPU, sharing the unified computational resource and storage, and 2) a…

Computation and Language · Computer Science 2025-04-29 Ke Hong, Lufang Chen, Zhong Wang, Xiuhong Li, Qiuli Mao, Jianping Ma, Chao Xiong, Guanyu Wu, Buhe Han, Guohao Dai, Yun Liang, Yu Wang

Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation

Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators.…

Machine Learning · Computer Science 2025-04-11 Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu

P/D-Serve: Serving Disaggregated Large Language Model at Scale

Serving disaggregated large language models (LLMs) over tens of thousands of xPU devices (GPUs or NPUs) with reliable performance faces multiple challenges. 1) Ignoring the diversity (various prefixes and tidal requests), treating all the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-16 Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang

ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism

Multimodal large language models (MLLMs) extend LLMs to handle images, videos, and audio by incorporating feature extractors and projection modules. However, these additional components -- combined with complex inference pipelines and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-12 Zedong Liu, Shenggan Cheng, Guangming Tan, Yang You, Dingwen Tao

Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving

Prefill-Decode (PD) disaggregation has become the standard architecture for modern LLM inference engines, which alleviates the interference of two distinctive workloads. With the growing demand for multi-turn interactions in chatbots and…

Networking and Internet Architecture · Computer Science 2026-05-06 Zongze Li, Jingyu Liu, Zhen Xu, Yineng Zhang, Tahseen Rabbani, Ce Zhang

PDTrim: Targeted Pruning for Prefill-Decode Disaggregation in Inference

Large Language Models (LLMs) demonstrate exceptional capabilities across various tasks, but their deployment is constrained by high computational and memory costs. Model pruning provides an effective means to alleviate these demands.…

Computation and Language · Computer Science 2025-12-16 Hao Zhang, Mengsi Lyu, Zhuo Chen, Xingrun Xing, Yulong Ao, Yonghua Lin

SLO-Aware Compute Resource Allocation for Prefill-Decode Disaggregated LLM Inference

Prefill-Decode (P/D) disaggregation has emerged as a widely adopted optimization strategy for Large Language Model (LLM) inference. However, there currently exists no well-established methodology for determining the optimal number of P/D…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-06 Luchang Li, Dongfang Li, Bozhao Gong, Yu Zhang

P/D-Device: Disaggregated Large Language Model between Cloud and Devices

Serving disaggregated large language models has been widely adopted in industrial practice for enhanced performance. However, too many tokens generated in decoding phase, i.e., occupying the resources for a long time, essentially hamper the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-13 Yibo Jin, Yixu Xu, Yue Chen, Chengbin Wang, Tao Wang, Jiaqi Huang, Rongfei Zhang, Yiming Dong, Yuting Yan, Ke Cheng, Yingjie Zhu, Shulan Wang, Qianqian Tang, Shuaishuai Meng, Guanxin Cheng, Ze Wang, Shuyan Miao, Ketao Wang, Wen Liu, Yifan Yang, Tong Zhang, Anran Wang, Chengzhou Lu, Tiantian Dong, Yongsheng Zhang, Zhe Wang, Hefei Guo, Hongjie Liu, Wei Lu, Zhengyong Zhang

DSSD: Efficient Edge-Device LLM Deployment and Collaborative Inference via Distributed Split Speculative Decoding

Large language models (LLMs) have transformed natural language processing but face critical deployment challenges in device-edge systems due to resource limitations and communication overhead. To address these issues, collaborative…

Signal Processing · Electrical Eng. & Systems 2025-07-18 Jiahong Ning, Ce Zheng, Tingting Yang

Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing

Multimodal large language models (MLLMs) extend LLMs with visual understanding through a three-stage pipeline: multimodal preprocessing, vision encoding, and LLM inference. While these stages enhance capability, they introduce significant…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-22 Lingxiao Zhao, Haoran Zhou, Yuezhi Che, Dazhao Cheng

From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill

Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-token (TBT) while maximizing throughput under fixed compute, memory, and interconnect…

Machine Learning · Computer Science 2026-04-17 Gunjun Lee, Jiwon Kim, Jaiyoung Park, Younjoo Lee, Jung Ho Ahn

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D) disaggregation,…

Machine Learning · Computer Science 2026-05-28 Hanjiang Wu, Abhimanyu Rajeshkumar Bambhaniya, Sarbartha Banerjee, Tuhin Khare, Sudarshan Srinivasan, Suvinay Subramanian, Souvik Kundu, Madhu Kumar, Midhilesh Elavazhagan, William Won, Amir Yazdanbakhsh, Tushar Krishna

TokenScale: Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity

The architectural shift to prefill/decode (PD) disaggregation in LLM serving improves resource utilization but struggles with the bursty nature of modern workloads. Existing autoscaling policies, often retrofitted from monolithic systems…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-04 Ruiqi Lai, Hongrui Liu, Chengzhi Lu, Zonghao Liu, Siyu Cao, Siyang Shao, Yixin Zhang, Luo Mai, Dmitrii Ustiugov

eP-ALM: Efficient Perceptual Augmentation of Language Models

Large Language Models (LLMs) have so far impressed the world, with unprecedented capabilities that emerge in models at large scales. On the vision side, transformer models (i.e., ViT) are following the same trend, achieving the best…

Computer Vision and Pattern Recognition · Computer Science 2023-10-30 Mustafa Shukor, Corentin Dancette, Matthieu Cord

Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference

Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-28 Rongzhi Li, Ruogu Du, Zefang Chu, Sida Zhao, Chunlei Han, Zuocheng Shi, Yiwen Shao, Huanle Han, Long Huang, Zherui Liu, Shufan Liu

SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference

Large Language Models (LLMs) have gained popularity in recent years, driving up the demand for inference. LLM inference is composed of two phases with distinct characteristics: a compute-bound prefill phase followed by a memory-bound decode…

Hardware Architecture · Computer Science 2025-10-10 Hengrui Zhang, Pratyush Patel, August Ning, David Wentzlaff