Related papers: Efficiently Serving Large Multimodal Models Using …

200 papers

With the widespread adoption of large multimodal models, efficient inference across text, image, audio, and video modalities has become critical. However, existing multimodal inference systems typically employ monolithic architectures that…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-21 Fan Bai, Pai Peng, Zhengzhi Tang, Zhe Wang, Gong Chen, Xiang Lu, Yinuo Li, Huan Lin, Weizhe Lin, Yaoyuan Wang, Xiaosong Li

With the rapid evolution of Large Language Models (LLMs), multi-round workflows, such as autonomous agents and iterative retrieval, have become increasingly prevalent. However, this raises hurdles for serving LLMs under prefill-decode (PD)…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-17 Wenhao He, Youhe Jiang, Penghao Zhao, Quanqing Xu, Eiko Yoneki, Bin Cui, Fangcheng Fu

To meet strict Service-Level Objectives (SLOs), contemporary Large Language Models (LLMs) decouple the prefill and decoding stages and place them on separate GPUs to mitigate the distinct bottlenecks inherent to each phase. However, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-10 Junhan Liao, Minxian Xu, Wanyi Zheng, Yan Wang, Kejiang Ye, Rajkumar Buyya, Chengzhong Xu

DistServe improves the performance of large language model (LLM) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the computation of prefill and decoding across…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-07 Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang

Existing large language model (LLM) serving systems fall into two categories: 1) a unified system where prefill phase and decode phase are co-located on the same GPU, sharing the unified computational resource and storage, and 2) a…

Computation and Language · Computer Science 2025-04-29 Ke Hong, Lufang Chen, Zhong Wang, Xiuhong Li, Qiuli Mao, Jianping Ma, Chao Xiong, Guanyu Wu, Buhe Han, Guohao Dai, Yun Liang, Yu Wang

Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators.…

Machine Learning · Computer Science 2025-04-11 Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu

Serving disaggregated large language models (LLMs) over tens of thousands of xPU devices (GPUs or NPUs) with reliable performance faces multiple challenges. 1) Ignoring the diversity (various prefixes and tidal requests), treating all the…

Multimodal large language models (MLLMs) extend LLMs to handle images, videos, and audio by incorporating feature extractors and projection modules. However, these additional components -- combined with complex inference pipelines and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-12 Zedong Liu, Shenggan Cheng, Guangming Tan, Yang You, Dingwen Tao

Prefill-Decode (PD) disaggregation has become the standard architecture for modern LLM inference engines, which alleviates the interference of two distinctive workloads. With the growing demand for multi-turn interactions in chatbots and…

Networking and Internet Architecture · Computer Science 2026-05-06 Zongze Li, Jingyu Liu, Zhen Xu, Yineng Zhang, Tahseen Rabbani, Ce Zhang

Large Language Models (LLMs) demonstrate exceptional capabilities across various tasks, but their deployment is constrained by high computational and memory costs. Model pruning provides an effective means to alleviate these demands.…

Computation and Language · Computer Science 2025-12-16 Hao Zhang, Mengsi Lyu, Zhuo Chen, Xingrun Xing, Yulong Ao, Yonghua Lin

Prefill-Decode (P/D) disaggregation has emerged as a widely adopted optimization strategy for Large Language Model (LLM) inference. However, there currently exists no well-established methodology for determining the optimal number of P/D…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-06 Luchang Li, Dongfang Li, Bozhao Gong, Yu Zhang

Serving disaggregated large language models has been widely adopted in industrial practice for enhanced performance. However, too many tokens generated in the decoding phase, i.e., occupying the resources for a long time, essentially hamper the…

Large language models (LLMs) have transformed natural language processing but face critical deployment challenges in device-edge systems due to resource limitations and communication overhead. To address these issues, collaborative…

Signal Processing · Electrical Eng. & Systems 2025-07-18 Jiahong Ning, Ce Zheng, Tingting Yang

Multimodal large language models (MLLMs) extend LLMs with visual understanding through a three-stage pipeline: multimodal preprocessing, vision encoding, and LLM inference. While these stages enhance capability, they introduce significant…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-22 Lingxiao Zhao, Haoran Zhou, Yuezhi Che, Dazhao Cheng

Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-token (TBT) while maximizing throughput under fixed compute, memory, and interconnect…

Machine Learning · Computer Science 2026-04-17 Gunjun Lee, Jiwon Kim, Jaiyoung Park, Younjoo Lee, Jung Ho Ahn

Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D) disaggregation,…

The architectural shift to prefill/decode (PD) disaggregation in LLM serving improves resource utilization but struggles with the bursty nature of modern workloads. Existing autoscaling policies, often retrofitted from monolithic systems…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-04 Ruiqi Lai, Hongrui Liu, Chengzhi Lu, Zonghao Liu, Siyu Cao, Siyang Shao, Yixin Zhang, Luo Mai, Dmitrii Ustiugov

Large Language Models (LLMs) have so far impressed the world, with unprecedented capabilities that emerge in models at large scales. On the vision side, transformer models (i.e., ViT) are following the same trend, achieving the best…

Computer Vision and Pattern Recognition · Computer Science 2023-10-30 Mustafa Shukor, Corentin Dancette, Matthieu Cord

Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-28 Rongzhi Li, Ruogu Du, Zefang Chu, Sida Zhao, Chunlei Han, Zuocheng Shi, Yiwen Shao, Huanle Han, Long Huang, Zherui Liu, Shufan Liu

Large Language Models (LLMs) have gained popularity in recent years, driving up the demand for inference. LLM inference is composed of two phases with distinct characteristics: a compute-bound prefill phase followed by a memory-bound decode…

Hardware Architecture · Computer Science 2025-10-10 Hengrui Zhang, Pratyush Patel, August Ning, David Wentzlaff