Related papers

Related papers: A Versatile Software Systolic Execution Model for …

200 papers

Accelerated computing is widely used in high-performance computing. Therefore, it is crucial to experiment and discover how to better utilize the latest generations of GPUs on relevant applications. In this paper, we present results and share…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-08-13 · Baodi Shan, Mauricio Araya-Polo

We propose a language and compiler to productively build high-performance software systolic arrays that run on GPUs. Based on a rigorous mathematical foundation (uniform recurrence equations and space-time transform), our language has…

Programming Languages · Computer Science · 2020-11-02 · Hongbo Rong, Xiaochen Hao, Yun Liang, Lidong Xu, Hong H Jiang, Pradeep Dubey

Stencil computations are widely used in HPC applications. Today, many HPC platforms use GPUs as accelerators. As a result, understanding how to perform stencil computations fast on GPUs is important. While implementation strategies for…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-09-16 · Ryuichi Sai, John Mellor-Crummey, Xiaozhu Meng, Mauricio Araya-Polo, Jie Meng

Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-02-04 · Kazuaki Matsumura, Hamid Reza Zohouri, Mohamed Wahib, Toshio Endo, Satoshi Matsuoka

Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel as many times as there are time/algorithm steps. The termination of each kernel implicitly acts the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-05-15 · Lingqi Zhang, Mohamed Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, Satoshi Matsuoka

Finite-difference methods based on high-order stencils are widely used in seismic simulations, weather forecasting, computational fluid dynamics, and other scientific applications. Achieving HPC-level stencil computations on one…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-07-09 · Ryuichi Sai, John Mellor-Crummey, Jinfan Xu, Mauricio Araya-Polo

The simulation of the two-dimensional Ising model is used as a benchmark to show the computational capabilities of Graphics Processing Units (GPUs). The rich programming environment now available on GPUs and flexible hardware capabilities…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-08-26 · Joshua Romero, Mauro Bisson, Massimiliano Fatica, Massimo Bernaschi

Over the last ten years, graphics processors have become the de facto accelerator for data-parallel tasks in various branches of high-performance computing, including machine learning and computational sciences. However, with the recent…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-05-28 · Johannes Pekkilä, Oskar Lappi, Fredrik Robertsén, Maarit J. Korpi-Lagg

Stencil computations are a fundamental kernel in scientific computing, critical for simulations in domains such as fluid dynamics and climate modeling. However, these computations are often memory-bound on traditional High-Performance…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-11 · Elia Belli, Daniele De Sensi

Efficient GPU execution of convolution operators is governed by memory-access efficiency, on-chip data reuse, and execution mapping rather than arithmetic throughput alone. This paper presents a controlled operator-level study of CUDA…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-04-30 · Huriyeh Babak, Melanie Schaller

The computation- and memory-intensive nature of DNNs limits their use in many mobile and embedded contexts. Application-specific integrated circuit (ASIC) hardware accelerators employ matrix multiplication units (such as systolic arrays)…

Hardware Architecture · Computer Science · 2024-02-02 · Ruiqi Sun, Yinchen Ni, Xin He, Jie Zhao, An Zou

In this era of diverse and heterogeneous computer architectures, the programmability issues, such as productivity and portable efficiency, are crucial to software development and algorithm design. One way to approach the problem is to step…

Mathematical Software · Computer Science · 2012-07-10 · Mauro Bianco, Ugo Varetto

Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-08-08 · Dominik Ernst, Markus Holzer, Georg Hager, Matthias Knorr, Gerhard Wellein

A major bottleneck in scenario-based Sample Average Approximation (SAA) for stochastic programming (SP) is the cost of solving an exact second-stage problem for every scenario, especially when each scenario contains an NP-hard combinatorial…

Optimization and Control · Mathematics · 2026-05-12 · Jingyi Zhao, Linxin Yang, Haohua Zhang, Qile He, Tian Ding

Block iterative methods are extremely important as smoothers for multigrid methods, as preconditioners for Krylov methods, and as solvers for diagonally dominant linear systems. Developing robust and efficient algorithms suitable for…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-07-16 · Manuel Birke, Bobby Philip, Zhen Wang, Mark Berrill

Graphics Processing Units (GPUs) have become an integral part of High-Performance Computing to achieve Exascale performance. The main goal of GPU application developers is to tune their code extensively to obtain optimal performance,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2023-05-04 · Gargi Alavani, Santonu Sarkar

Research interest in specialized hardware accelerators for deep neural networks (DNNs) has spiked recently owing to their superior performance and efficiency. However, today's DNN accelerators primarily focus on accelerating specific…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-06-11 · Cong Guo, Yangjie Zhou, Jingwen Leng, Yuhao Zhu, Zidong Du, Quan Chen, Chao Li, Bin Yao, Minyi Guo

The convolution computation is widely used in many fields, especially in CNNs. Because of the rapid growth of training data in CNNs, GPUs have been used for acceleration, and memory-efficient algorithms have become a focus because of their…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-12-02 · Qiong Chang, Masaki Onishi, Tsutomu Maruyama

This paper presents Systolic-CNN, an OpenCL-defined scalable, run-time-flexible FPGA accelerator architecture, optimized for accelerating the inference of various convolutional neural networks (CNNs) in multi-tenancy cloud/edge computing.…

Hardware Architecture · Computer Science · 2020-12-08 · Akshay Dua, Yixing Li, Fengbo Ren

Graphics Processing Units (GPUs) have become the standard in accelerating scientific applications on heterogeneous systems. However, as GPUs are getting faster, one potential performance bottleneck with GPU-accelerated applications is the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-05-01 · Jonah Ekelund, Stefano Markidis, Ivy Peng