Related papers: Model-Based Warp Overlapped Tiling for Image Proce…
The strategy of using CUDA-compatible GPUs as a parallel computation solution to improve the performance of programs has been more and more widely approved during the last two years since the CUDA platform was released. Its benefit extends…
We propose a new hybrid topology optimization algorithm based on multigrid approach that combines the parallelization strategy of CPU using OpenMP and heavily multithreading capabilities of modern Graphics Processing Units (GPU). In…
Consecutive matrix multiplications are commonly used in graph neural networks and sparse linear solvers. These operations frequently access the same matrices for both reading and writing. While reusing these matrices improves data locality,…
With the rapid innovation of GPUs, heterogeneous GPU clusters in both public clouds and on-premise data centers have become increasingly commonplace. In this paper, we demonstrate how pipeline parallelism, a technique wellstudied for…
We present a new algorithm to quickly generate high-performance GPU implementations of complex imaging and vision pipelines, directly from high-level Halide algorithm code. It is fully automatic, requiring no schedule templates or…
This paper presents PipeFusion, an innovative parallel methodology to tackle the high latency issues associated with generating high-resolution images using diffusion transformers (DiTs) models. PipeFusion partitions images into patches and…
Polyhedral compilers perform optimizations such as tiling and parallelization; when doing both, they usually generate code that executes "barrier-synchronized wavefronts" of tiles. We present a system to express and generate code for hybrid…
Stochastic simulations need multiple replications in order to build confidence intervals for their results. Even if we do not need a large amount of replications, it is a good practice to speed-up the whole simulation time using the…
The rapid growth in machine learning models, especially in natural language processing and computer vision, has led to challenges when running these models on hardware with limited resources. This paper introduces Superpipeline, a new…
Asynchronous tasks, when created with over-decomposition, enable automatic computation-communication overlap which can substantially improve performance and scalability. This is not only applicable to traditional CPU-based systems, but also…
The main objective of this work consists in analyzing sub-structuring method for the parallel solution of sparse linear systems with matrices arising from the discretization of partial differential equations such as finite element, finite…
This paper presents a Graphics Processing Units (GPUs) acceleration method of an iterative scheme for gas-kinetic model equations. Unlike the previous GPU parallelization of explicit kinetic schemes, this work features a fast converging…
Pipeline Parallelism (PP) serves as a crucial technique for training Large Language Models (LLMs), owing to its capability to alleviate memory pressure from model states with relatively low communication overhead. However, in long-context…
Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory…
This paper introduces a framework for distributed parallel image signal extrapolation. Since high-quality image signal processing often comes along with a high computational complexity, a parallel execution is desirable. The proposed…
Contemporary GPUs are designed to handle long-latency operations effectively; however, challenges such as core occupancy (number of warps in a core) and pipeline width can impede their latency management. This is particularly evident in…
The simplex algorithm has been successfully used for many years in solving linear programming (LP) problems. Due to the intensive computations required (especially for the solution of large LP problems), parallel approaches have also…
Furthering our understanding of many of today's interesting problems in plasma physics---including plasma based acceleration and magnetic reconnection with pair production due to quantum electrodynamic effects---requires large-scale kinetic…
As inference workloads for large language models (LLMs) scale to meet growing user demand, pipeline parallelism (PP) has become a widely adopted strategy for multi-GPU deployment, particularly in cross-node setups, to improve key-value (KV)…
This paper introduces sTiles, a GPU-accelerated framework for factorizing sparse structured symmetric matrices. By leveraging tile algorithms for fine-grained computations, sTiles uses a structure-aware task execution flow to handle…