Related papers: Counting Path Configurations in Parallel Diffusion
Originally proposed by Duffy et al., Diffusion is a variant of chip-firing in which chips from flow from places of high concentration to places of low concentration. In the variant, Perturbation Diffusion, the first step involves a…
We study a variant of the chip-firing game called \emph{diffusion}. In diffusion on a graph, each vertex of the graph is initially labelled with an integer interpreted as the number of chips at that vertex, and at each subsequent step, each…
Diffusion models have achieved great success in synthesizing high-quality images. However, generating high-resolution images with diffusion models is still challenging due to the enormous computational costs, resulting in a prohibitive…
This paper presents PipeFusion, an innovative parallel methodology to tackle the high latency issues associated with generating high-resolution images using diffusion transformers (DiTs) models. PipeFusion partitions images into patches and…
Diffusion models have garnered significant interest from the community for their great generative ability across various applications. However, their typical multi-step sequential-denoising nature gives rise to high cumulative latency,…
Numerical solution of reaction-diffusion equations in three dimensions is one of the most challenging applied mathematical problems. Since these simulations are very time consuming, any ideas and strategies aiming at the reduction of CPU…
Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Nevertheless, current diffusion acceleration methods based on distributed parallelism…
Diffusion models have exhibited exciting capabilities in generating images and are also very promising for video creation. However, the inference speed of diffusion models is limited by the slow sampling process, restricting its use cases.…
Previous work has shown that there are two major complexity barriers in the synthesis of fault-tolerant distributed programs: (1) generation of fault-span, the set of states reachable in the presence of faults, and (2) resolving deadlock…
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy,…
Diffusion Transformers (DiTs) are increasingly adopted in scientific computing, yet growing model sizes and resolutions make distributed multi-GPU inference essential. Ulysses sequence parallelism scales DiT inference but introduces…
Diffusion-based generation is increasingly powering production content pipelines; however, deploying these models at scale remains a significant challenge. Model weights frequently exceed the memory capacity of commodity GPUs, while the…
We study how parallel chip-firing on the complete graph K_n changes behavior as we vary the total number of chips. Surprisingly, the activity of the system, defined as the average number of firings per time step, does not increase smoothly…
Graphical chip-firing is a discrete dynamical system where chips are placed on the vertices of a graph and exchanged via simple firing moves. Recent work has sought to generalize chip-firing on graphs to higher dimensions, wherein graphs…
We study the cycles generated by the chip firing game associated with n-cube orientations. We show the existence of the cycles generated by parallel evolutions of even lengths from 2 to $2^n$ on $H_n$ (n >= 1), and of odd lengths different…
Diffusion models have achieved remarkable success in generating high-fidelity content but suffer from slow, iterative sampling, resulting in high latency that limits their use in interactive applications. We introduce DRiffusion, a parallel…
Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. However, their deployment is often limited by significant inference latency, primarily due to the…
The computational complexity of internal diffusion-limited aggregation (DLA) is examined from both a theoretical and a practical point of view. We show that for two or more dimensions, the problem of predicting the cluster from a given set…
Parallelization techniques have become ubiquitous for accelerating inference and training of deep neural networks. Despite this, several operations are still performed in a sequential manner. For instance, the forward and backward passes…
Recently, diffusion models have achieved significant advances in vision, text, and robotics. However, they still face slow generation speeds due to sequential denoising processes. To address this, a parallel sampling method based on Picard…