FastUSP: A Multi-Level Collaborative Acceleration Framework for Distributed Diffusion Model Inference

Guandong Li

Computer Vision and Pattern Recognition 2026-02-12 v1

Authors: Guandong Li

Abstract

Large-scale diffusion models such as FLUX (12B parameters) and Stable Diffusion 3 (8B parameters) require multi-GPU parallelism for efficient inference. Unified Sequence Parallelism (USP), which combines Ulysses and Ring attention mechanisms, has emerged as the state-of-the-art approach for distributed attention computation. However, existing USP implementations suffer from significant inefficiencies, including excessive kernel launch overhead and suboptimal computation-communication scheduling. In this paper, we propose FastUSP, a multi-level optimization framework that integrates compile-level optimization (graph compilation with CUDA Graphs and computation-communication reordering), communication-level optimization (FP8 quantized collective communication), and operator-level optimization (pipelined Ring attention with double buffering). We evaluate FastUSP on FLUX (12B) and Qwen-Image models across 2, 4, and 8 NVIDIA RTX 5090 GPUs. On FLUX, FastUSP achieves a consistent 1.12× to 1.16× end-to-end speedup over baseline USP, with compile-level optimization contributing the dominant improvement. On Qwen-Image, FastUSP achieves a 1.09× speedup on 2 GPUs; on 4-8 GPUs, we identify a PyTorch Inductor compatibility limitation with Ring attention that prevents compile optimization, while baseline USP scales to between 1.30× and 1.46× of 2-GPU performance. We further provide a detailed analysis of the performance characteristics of distributed diffusion inference, revealing that kernel launch overhead, rather than communication latency, is the primary bottleneck on modern high-bandwidth GPU interconnects.
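As a rough illustration of the operator-level idea mentioned in the abstract, the sketch below simulates Ring attention with a double buffer in a single NumPy process. It is not the FastUSP implementation: the function names (`full_attention`, `ring_attention_sim`), the buffer names, and the choice of NumPy are assumptions made for illustration, and the "prefetch" is an ordinary list lookup standing in for the asynchronous send/receive that real hardware would overlap with computation.

```python
import numpy as np


def full_attention(q, k, v):
    """Reference softmax attention computed in one shot."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


def ring_attention_sim(q, k, v, world_size):
    """Single-process simulation of Ring attention with two KV buffers.

    Each simulated rank owns one query shard and one key/value shard.
    At every step it computes on `current` while `incoming` stands in
    for the block a neighbouring rank would be sending (hypothetical
    stand-in for an asynchronous point-to-point transfer).
    """
    q_shards = np.array_split(q, world_size)
    kv_shards = list(zip(np.array_split(k, world_size),
                         np.array_split(v, world_size)))
    scale = 1.0 / np.sqrt(q.shape[-1])
    outputs = []
    for rank in range(world_size):
        q_local = q_shards[rank]
        running_max = np.full((q_local.shape[0], 1), -np.inf)
        running_sum = np.zeros((q_local.shape[0], 1))
        acc = np.zeros((q_local.shape[0], v.shape[-1]))
        current = kv_shards[rank]
        for step in range(world_size):
            # Fill the second buffer before computing on the first one.
            incoming = kv_shards[(rank - step - 1) % world_size]
            k_blk, v_blk = current
            scores = (q_local @ k_blk.T) * scale
            # Online softmax: rescale earlier partial results to the new max.
            new_max = np.maximum(running_max,
                                 scores.max(axis=-1, keepdims=True))
            correction = np.exp(running_max - new_max)
            probs = np.exp(scores - new_max)
            running_sum = running_sum * correction + probs.sum(axis=-1, keepdims=True)
            acc = acc * correction + probs @ v_blk
            running_max = new_max
            current = incoming  # swap buffers for the next step
        outputs.append(acc / running_sum)
    return np.concatenate(outputs)
```

The blockwise result matches ordinary attention up to floating-point error, which is what allows key/value blocks to be streamed around the ring while the previous block is still being processed.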

Keywords

parallel programming, large language model inference, GPU computing

Cite

@article{arxiv.2602.10940,
  title  = {FastUSP: A Multi-Level Collaborative Acceleration Framework for Distributed Diffusion Model Inference},
  author = {Guandong Li},
  journal= {arXiv preprint arXiv:2602.10940},
  year   = {2026}
}
