Related papers: Accelerator Codesign as Non-Linear Optimization

Stencil Computations on AMD and Nvidia Graphics Processors: Performance and Tuning Strategies

Over the last ten years, graphics processors have become the de facto accelerator for data-parallel tasks in various branches of high-performance computing, including machine learning and computational sciences. However, with the recent…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-28 Johannes Pekkilä , Oskar Lappi , Fredrik Robertsén , Maarit J. Korpi-Lagg

Accelerating High-Order Stencils on GPUs

Stencil computations are widely used in HPC applications. Today, many HPC platforms use GPUs as accelerators. As a result, understanding how to perform stencil computations fast on GPUs is important. While implementation strategies for…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-16 Ryuichi Sai , John Mellor-Crummey , Xiaozhu Meng , Mauricio Araya-Polo , Jie Meng

A Semi-Decoupled Approach to Fast and Optimal Hardware-Software Co-Design of Neural Accelerators

In view of the performance limitations of fully-decoupled designs for neural architectures and accelerators, hardware-software co-design has been emerging to fully reap the benefits of flexible design spaces and optimize neural network…

Hardware Architecture · Computer Science 2022-03-29 Bingqian Lu , Zheyu Yan , Yiyu Shi , Shaolei Ren

Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL

Recent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-16 Hamid Reza Zohouri , Artur Podobas , Satoshi Matsuoka

Evaluation of Programming Models and Performance for Stencil Computation on Current GPU Architectures

Accelerated computing is widely used in high-performance computing. Therefore, it is crucial to experiment and discover how to better utilize GPUGPUs latest generations on relevant applications. In this paper, we present results and share…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-13 Baodi Shan , Mauricio Araya-Polo

MMStencil: Optimizing High-order Stencils on Multicore CPU using Matrix Unit

Matrix-accelerated stencil computation is a hot research topic, yet its application to three-dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of matrix units on multicore CPUs, we analyze matrix-based…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-16 Yinuo Wang , Tianqi Mao , Lin Gan , Wubing Wan , Zeyu Song , Jiayu Fu , Lanke He , Wenqiang Wang , Zekun Yin , Wei Xue , Guangwen Yang

Effective Algorithm-Accelerator Co-design for AI Solutions on Edge Devices

High quality AI solutions require joint optimization of AI algorithms, such as deep neural networks (DNNs), and their hardware accelerators. To improve the overall solution quality as well as to boost the design productivity, efficient…

Hardware Architecture · Computer Science 2020-10-16 Cong Hao , Yao Chen , Xiaofan Zhang , Yuhong Li , Jinjun Xiong , Wen-mei Hwu , Deming Chen

AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs

Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-04 Kazuaki Matsumura , Hamid Reza Zohouri , Mohamed Wahib , Toshio Endo , Satoshi Matsuoka

GPU Support for Automatic Generation of Finite-Differences Stencil Kernels

The growth of data to be processed in the Oil & Gas industry matches the requirements imposed by evolving algorithms based on stencil computations, such as Full Waveform Inversion and Reverse Time Migration. Graphical processing units…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-05 Vitor Hugo Mickus Rodrigues , Lucas Cavalcante , Maelso Bruno Pereira , Fabio Luporini , István Reguly , Gerard Gorman , Samuel Xavier de Souza

Algorithm and Hardware Co-design for Reconfigurable CNN Accelerator

Recent advances in algorithm-hardware co-design for deep neural networks (DNNs) have demonstrated their potential in automatically designing neural architectures and hardware designs. Nevertheless, it is still a challenging optimization…

Machine Learning · Computer Science 2021-11-29 Hongxiang Fan , Martin Ferianc , Zhiqiang Que , He Li , Shuanglong Liu , Xinyu Niu , Wayne Luk

Late Breaking Results: Fast System Technology Co-Optimization Framework for Emerging Technology Based on Graph Neural Networks

This paper proposes a fast system technology co-optimization (STCO) framework that optimizes power, performance, and area (PPA) for next-generation IC design, addressing the challenges and opportunities presented by novel materials and…

Emerging Technologies · Computer Science 2024-10-31 Tianliang Ma , Guangxi Fan , Xuguang Sun , Zhihui Deng , Kainlu Low , Leilai Shao

Analytical Performance Estimation during Code Generation on Modern GPUs

Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-08 Dominik Ernst , Markus Holzer , Georg Hager , Matthias Knorr , Gerhard Wellein

Learned Hardware/Software Co-Design of Neural Accelerators

The use of deep learning has grown at an exponential rate, giving rise to numerous specialized hardware and software systems for deep learning. Because the design space of deep learning software stacks and hardware accelerators is diverse…

Machine Learning · Computer Science 2020-10-06 Zhan Shi , Chirag Sakhuja , Milad Hashemi , Kevin Swersky , Calvin Lin

A Portable Framework for Accelerating Stencil Computations on Modern Node Architectures

Finite-difference methods based on high-order stencils are widely used in seismic simulations, weather forecasting, computational fluid dynamics, and other scientific applications. Achieving HPC-level stencil computations on one…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-09 Ryuichi Sai , John Mellor-Crummey , Jinfan Xu , Mauricio Araya-Polo

GPU acceleration and performance of the particle-beam-dynamics code Elegant

Elegant is an accelerator physics and particle-beam dynamics code widely used for modeling and design of a variety of high-energy particle accelerators and accelerator-based systems. In this paper we discuss a recently developed version of…

Computational Physics · Physics 2018-11-22 J. R. King , I. V. Pogorelov , K. M. Amyx , M. Borland , R. Soliday

Accelerating Lattice QCD Multigrid on GPUs Using Fine-Grained Parallelization

The past decade has witnessed a dramatic acceleration of lattice quantum chromodynamics calculations in nuclear and particle physics. This has been due to both significant progress in accelerating the iterative linear solvers using…

High Energy Physics - Lattice · Physics 2016-12-26 M. A. Clark , Bálint Joó , Alexei Strelchenko , Michael Cheng , Arjun Gambhir , Richard Brower

A Tool for Automatically Suggesting Source-Code Optimizations for Complex GPU Kernels

Future computing systems, from handhelds to supercomputers, will undoubtedly be more parallel and heterogeneous than todays systems to provide more performance and energy efficiency. Thus, GPUs are increasingly being used to accelerate…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-18 Saeed Taheri , Apan Qasem , Martin Burtscher

Parallel-in-Time Nonlinear Optimal Control via GPU-native Sequential Convex Programming

Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic…

Robotics · Computer Science 2026-03-13 Yilin Zou , Zhong Zhang , Maxime Robic , Fanghua Jiang

Hardware-Efficient Template-Based Deep CNNs Accelerator Design

Acceleration of Convolutional Neural Network (CNN) on edge devices has recently achieved a remarkable performance in image classification and object detection applications. This paper proposes an efficient and scalable CNN-based SoC-FPGA…

Hardware Architecture · Computer Science 2022-07-29 Azzam Alhussain , Mingjie Lin

Mitigating the Bandwidth Wall via Data-Streaming System-Accelerator Co-Design

Transformers have revolutionized AI in natural language processing and computer vision, but their large computation and memory demands pose major challenges for hardware acceleration. In practice, end-to-end throughput is often limited by…

Hardware Architecture · Computer Science 2026-03-20 Qunyou Liu , Marina Zapater , David Atienza