Related papers: Model-Based Warp Overlapped Tiling for Image Proce…

Tiling for Performance Tuning on Different Models of GPUs

The strategy of using CUDA-compatible GPUs as a parallel computation solution to improve the performance of programs has been more and more widely approved during the last two years since the CUDA platform was released. Its benefit extends…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-01-12 Chang Xu, Steven R. Kirk, Samantha Jenkins

Efficient hybrid topology optimization using GPU and homogenization based multigrid approach

We propose a new hybrid topology optimization algorithm based on multigrid approach that combines the parallelization strategy of CPU using OpenMP and heavily multithreading capabilities of modern Graphics Processing Units (GPU). In…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-01 Arya Prakash Padhi, Souvik Chakraborty, Anupam Chakrabarti, Rajib Chowdhury

Improving Locality in Sparse and Dense Matrix Multiplications

Consecutive matrix multiplications are commonly used in graph neural networks and sparse linear solvers. These operations frequently access the same matrices for both reading and writing. While reusing these matrices improves data locality,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-02 Mohammad Mahdi Salehi Dezfuli, Kazem Cheshmi

PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via Pool-Based Pipeline Parallelism

With the rapid innovation of GPUs, heterogeneous GPU clusters in both public clouds and on-premise data centers have become increasingly commonplace. In this paper, we demonstrate how pipeline parallelism, a technique well-studied for…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-28 Z. Jonny Kong, Qiang Xu, Y. Charlie Hu

Efficient Automatic Scheduling of Imaging and Vision Pipelines for the GPU

We present a new algorithm to quickly generate high-performance GPU implementations of complex imaging and vision pipelines, directly from high-level Halide algorithm code. It is fully automatic, requiring no schedule templates or…

Programming Languages · Computer Science 2023-08-29 Luke Anderson, Andrew Adams, Karima Ma, Tzu-Mao Li, Tian Jin, Jonathan Ragan-Kelley

PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference

This paper presents PipeFusion, an innovative parallel methodology to tackle the high latency issues associated with generating high-resolution images using diffusion transformers (DiTs) models. PipeFusion partitions images into patches and…

Computer Vision and Pattern Recognition · Computer Science 2026-05-05 Jiarui Fang, Jinzhe Pan, Aoyu Li, Xibo Sun, Jiannan Wang

Hybrid Static/Dynamic Schedules for Tiled Polyhedral Programs

Polyhedral compilers perform optimizations such as tiling and parallelization; when doing both, they usually generate code that executes "barrier-synchronized wavefronts" of tiles. We present a system to express and generate code for hybrid…

Programming Languages · Computer Science 2016-10-25 Tian Jin, Nirmal Prajapati, Waruna Ranasinghe, Guillaume Iooss, Yun Zou, Sanjay Rajopadhye, David Wonnacott

Warp-Level Parallelism: Enabling Multiple Replications In Parallel on GPU

Stochastic simulations need multiple replications in order to build confidence intervals for their results. Even if we do not need a large amount of replications, it is a good practice to speed up the whole simulation time using the…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-01-08 Jonathan Passerat-Palmbach, Jonathan Caux, Pridi Siregar, Claude Mazel, David Hill

Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models

The rapid growth in machine learning models, especially in natural language processing and computer vision, has led to challenges when running these models on hardware with limited resources. This paper introduces Superpipeline, a new…

Machine Learning · Computer Science 2024-10-14 Reza Abbasi, Sernam Lim

Improving Scalability with GPU-Aware Asynchronous Tasks

Asynchronous tasks, when created with over-decomposition, enable automatic computation-communication overlap which can substantially improve performance and scalability. This is not only applicable to traditional CPU-based systems, but also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-23 Jaemin Choi, David F. Richards, Laxmikant V. Kale

Parallel Sub-Structuring Methods for solving Sparse Linear Systems on a cluster of GPU

The main objective of this work consists in analyzing sub-structuring method for the parallel solution of sparse linear systems with matrices arising from the discretization of partial differential equations such as finite element, finite…

Numerical Analysis · Mathematics 2021-08-31 Abal-Kassim Cheik Ahamed, Frédéric Magoulès

GPU acceleration of an iterative scheme for gas-kinetic model equations with memory reduction techniques

This paper presents a Graphics Processing Units (GPUs) acceleration method of an iterative scheme for gas-kinetic model equations. Unlike the previous GPU parallelization of explicit kinetic schemes, this work features a fast converging…

Computational Physics · Physics 2020-01-08 Lianhua Zhu, Peng Wang, Songze Chen, Zhaoli Guo, Yonghao Zhang

SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training

Pipeline Parallelism (PP) serves as a crucial technique for training Large Language Models (LLMs), owing to its capability to alleviate memory pressure from model states with relatively low communication overhead. However, in long-context…

Machine Learning · Computer Science 2025-04-22 Zhouyang Li, Yuliang Liu, Wei Zhang, Tailing Yuan, Bin Chen, Chengru Song, Di Zhang

Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modelling

Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory…

Computational Engineering, Finance, and Science · Computer Science 2019-06-20 Fabio Luporini, Michael Lange, Christian T. Jacobs, Gerard J. Gorman, J. Ramanujam, Paul H. J. Kelly

Distributed Parallel Image Signal Extrapolation Framework using Message Passing Interface

This paper introduces a framework for distributed parallel image signal extrapolation. Since high-quality image signal processing often comes along with a high computational complexity, a parallel execution is desirable. The proposed…

Image and Video Processing · Electrical Eng. & Systems 2022-07-04 Jürgen Seiler, André Kaup

WaSP: Warp Scheduling to Mimic Prefetching in Graphics Workloads

Contemporary GPUs are designed to handle long-latency operations effectively; however, challenges such as core occupancy (number of warps in a core) and pipeline width can impede their latency management. This is particularly evident in…

Hardware Architecture · Computer Science 2024-04-10 Diya Joseph, Juan Luis Aragón, Joan-Manuel Parcerisa, Antonio Gonzalez

A Hybrid Multi-GPU Implementation of Simplex Algorithm with CPU Collaboration

The simplex algorithm has been successfully used for many years in solving linear programming (LP) problems. Due to the intensive computations required (especially for the solution of large LP problems), parallel approaches have also…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-22 Basilis Mamalis, Marios Perlitis

Dynamic load balancing with enhanced shared-memory parallelism for particle-in-cell codes

Furthering our understanding of many of today's interesting problems in plasma physics, including plasma based acceleration and magnetic reconnection with pair production due to quantum electrodynamic effects, requires large-scale kinetic…

Computational Physics · Physics 2020-10-28 Kyle G. Miller, Roman P. Lee, Adam Tableman, Anton Helm, Ricardo A. Fonseca, Viktor K. Decyk, Warren B. Mori

SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference

As inference workloads for large language models (LLMs) scale to meet growing user demand, pipeline parallelism (PP) has become a widely adopted strategy for multi-GPU deployment, particularly in cross-node setups, to improve key-value (KV)…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-30 Yongchao He, Bohan Zhao, Zheng Cao

sTiles: An Accelerated Computational Framework for Sparse Factorizations of Structured Matrices

This paper introduces sTiles, a GPU-accelerated framework for factorizing sparse structured symmetric matrices. By leveraging tile algorithms for fine-grained computations, sTiles uses a structure-aware task execution flow to handle…

Performance · Computer Science 2025-01-07 Esmail Abdul Fattah, Hatem Ltaief, Havard Rue, David Keyes