Related papers: FastUSP: A Multi-Level Collaborative Acceleration …

SwiftFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs

Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-26 Jiacheng Yang, Jun Wu, Yaoyao Ding, Zhiying Xu, Yida Wang, Gennady Pekhimenko

Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling

Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Nevertheless, current diffusion acceleration methods based on distributed parallelism…

Computer Vision and Pattern Recognition · Computer Science 2026-02-26 Euisoo Jung, Byunghyun Kim, Hyunjin Kim, Seonghye Cho, Jae-Gil Lee

CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism

Diffusion Transformers (DiTs) are increasingly adopted in scientific computing, yet growing model sizes and resolutions make distributed multi-GPU inference essential. Ulysses sequence parallelism scales DiT inference but introduces…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-22 Bin Ma, Xingjian Ding, Tekin Bicer, Pengfei Su, Dong Li

Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking

Efficiently processing long sequences with Transformer models usually requires splitting the computations across accelerators via context parallelism. The dominant approaches in this family of methods, such as Ring Attention or DeepSpeed…

Machine Learning · Computer Science 2026-02-25 Ravi Ghadia, Maksim Abraham, Sergei Vorobyov, Max Ryabinin

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

Diffusion models have achieved great success in synthesizing high-quality images. However, generating high-resolution images with diffusion models is still challenging due to the enormous computational costs, resulting in a prohibitive…

Computer Vision and Pattern Recognition · Computer Science 2024-07-16 Muyang Li, Tianle Cai, Jiaxin Cao, Qinsheng Zhang, Han Cai, Junjie Bai, Yangqing Jia, Ming-Yu Liu, Kai Li, Song Han

LinFusion: 1 GPU, 1 Minute, 16K Image

Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this…

Computer Vision and Pattern Recognition · Computer Science 2024-10-18 Songhua Liu, Weihao Yu, Zhenxiong Tan, Xinchao Wang

Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference

Diffusion models have exhibited exciting capabilities in generating images and are also very promising for video creation. However, the inference speed of diffusion models is limited by the slow sampling process, restricting its use cases.…

Computer Vision and Pattern Recognition · Computer Science 2024-12-05 XiuYu Zhang, Zening Luo, Michelle E. Lu

USV: Unified Sparsification for Accelerating Video Diffusion Models

The scalability of high-fidelity video diffusion models (VDMs) is constrained by two key sources of redundancy: the quadratic complexity of global spatio-temporal attention and the computational overhead of long iterative denoising…

Computer Vision and Pattern Recognition · Computer Science 2025-12-08 Xinjian Wu, Hongmei Wang, Yuan Zhou, Qinglin Lu

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Large deep learning models have demonstrated strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique…

Machine Learning · Computer Science 2024-10-25 Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu

Achieving Super-Linear Speedup across Multi-FPGA for Real-Time DNN Inference

Real-time Deep Neural Network (DNN) inference with low-latency requirement has become increasingly important for numerous applications in both cloud computing (e.g., Apple's Siri) and edge computing (e.g., Google/Waymo's driverless car).…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-11 Weiwen Jiang, Edwin H.-M. Sha, Xinyi Zhang, Lei Yang, Qingfeng Zhuge, Yiyu Shi, Jingtong Hu

PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference

This paper presents PipeFusion, an innovative parallel methodology to tackle the high latency issues associated with generating high-resolution images using diffusion transformers (DiTs) models. PipeFusion partitions images into patches and…

Computer Vision and Pattern Recognition · Computer Science 2026-05-05 Jiarui Fang, Jinzhe Pan, Aoyu Li, Xibo Sun, Jiannan Wang

USP: A Unified Sequence Parallelism Approach for Long Context Generative AI

Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the…

Machine Learning · Computer Science 2024-07-03 Jiarui Fang, Shangchun Zhao

Linear Attention Sequence Parallelism

Sequence parallelism (SP) serves as a prevalent strategy to handle long sequences that exceed the memory limit of a single device. However, for linear sequence modeling methods like linear attention, existing SP approaches do not take…

Machine Learning · Computer Science 2025-05-19 Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism

Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. The…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Siqi Chen, Ke Hong, Tianchen Zhao, Ruiqi Xie, Zhenhua Zhu, Xudong Zhang, Yu Wang

Fused3S: Fast Sparse Attention on Tensor Cores

Sparse attention is a core building block in many leading neural network models, from graph-structured learning to sparse sequence modeling. It can be decomposed into a sequence of three sparse matrix operations (3S): sampled dense-dense…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-14 Zitong Li, Aparna Chandramowlishwaran

Accelerating Exact and Approximate Inference for (Distributed) Discrete Optimization with GPUs

Discrete optimization is a central problem in artificial intelligence. The optimization of the aggregated cost of a network of cost functions arises in a variety of problems including (W)CSP, DCOP, as well as optimization in stochastic…

Artificial Intelligence · Computer Science 2018-01-12 Ferdinando Fioretto, Enrico Pontelli, William Yeoh, Rina Dechter

TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication

Efficient parallelization of Large Language Models (LLMs) with long sequences is essential but challenging due to their significant computational and memory demands, particularly stemming from communication bottlenecks in attention…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-31 Zongwu Wang, Fangxin Liu, Mingshuai Li, Li Jiang

Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism

Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. However, their deployment is often limited by significant inference latency, primarily due to the…

Machine Learning · Computer Science 2025-10-14 Kunyun Wang, Bohan Li, Kai Yu, Minyi Guo, Jieru Zhao

APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs

While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through…

Machine Learning · Computer Science 2025-05-27 Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, system works for accelerating LLM training have focused on the…

Machine Learning · Computer Science 2023-10-05 Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He