Related papers: Accelerator Codesign as Non-Linear Optimization
Over the last ten years, graphics processors have become the de facto accelerator for data-parallel tasks in various branches of high-performance computing, including machine learning and computational sciences. However, with the recent…
Stencil computations are widely used in HPC applications. Today, many HPC platforms use GPUs as accelerators. As a result, understanding how to perform stencil computations fast on GPUs is important. While implementation strategies for…
In view of the performance limitations of fully-decoupled designs for neural architectures and accelerators, hardware-software co-design has been emerging to fully reap the benefits of flexible design spaces and optimize neural network…
Recent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of…
Accelerated computing is widely used in high-performance computing. Therefore, it is crucial to experiment and discover how to better utilize GPUGPUs latest generations on relevant applications. In this paper, we present results and share…
Matrix-accelerated stencil computation is a hot research topic, yet its application to three-dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of matrix units on multicore CPUs, we analyze matrix-based…
High quality AI solutions require joint optimization of AI algorithms, such as deep neural networks (DNNs), and their hardware accelerators. To improve the overall solution quality as well as to boost the design productivity, efficient…
Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving…
The growth of data to be processed in the Oil & Gas industry matches the requirements imposed by evolving algorithms based on stencil computations, such as Full Waveform Inversion and Reverse Time Migration. Graphical processing units…
Recent advances in algorithm-hardware co-design for deep neural networks (DNNs) have demonstrated their potential in automatically designing neural architectures and hardware designs. Nevertheless, it is still a challenging optimization…
This paper proposes a fast system technology co-optimization (STCO) framework that optimizes power, performance, and area (PPA) for next-generation IC design, addressing the challenges and opportunities presented by novel materials and…
Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations,…
The use of deep learning has grown at an exponential rate, giving rise to numerous specialized hardware and software systems for deep learning. Because the design space of deep learning software stacks and hardware accelerators is diverse…
Finite-difference methods based on high-order stencils are widely used in seismic simulations, weather forecasting, computational fluid dynamics, and other scientific applications. Achieving HPC-level stencil computations on one…
Elegant is an accelerator physics and particle-beam dynamics code widely used for modeling and design of a variety of high-energy particle accelerators and accelerator-based systems. In this paper we discuss a recently developed version of…
The past decade has witnessed a dramatic acceleration of lattice quantum chromodynamics calculations in nuclear and particle physics. This has been due to both significant progress in accelerating the iterative linear solvers using…
Future computing systems, from handhelds to supercomputers, will undoubtedly be more parallel and heterogeneous than todays systems to provide more performance and energy efficiency. Thus, GPUs are increasingly being used to accelerate…
Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic…
Acceleration of Convolutional Neural Network (CNN) on edge devices has recently achieved a remarkable performance in image classification and object detection applications. This paper proposes an efficient and scalable CNN-based SoC-FPGA…
Transformers have revolutionized AI in natural language processing and computer vision, but their large computation and memory demands pose major challenges for hardware acceleration. In practice, end-to-end throughput is often limited by…