Related papers: Characterizing Optimizations to Memory Access Patt…

AIWC: OpenCL-based Architecture-Independent Workload Characterisation

Measuring performance-critical characteristics of application workloads is important both for developers, who must understand and optimize the performance of codes, as well as designers and integrators of HPC systems, who must ensure that…

Software Engineering · Computer Science 2018-11-01 Beau Johnston, Josh Milthorpe

OpenCL Performance Prediction using Architecture-Independent Features

OpenCL is an attractive model for heterogeneous high-performance computing systems, with wide support from hardware vendors and significant performance portability. To support efficient scheduling on HPC systems it is necessary to perform…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-04 Beau Johnston, Greg Falzon, Josh Milthorpe

Memory and Parallelism Analysis Using a Platform-Independent Approach

Emerging computing architectures such as near-memory computing (NMC) promise improved performance for applications by reducing the data movement between CPU and memory. However, detecting such applications is not a trivial task. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-19 Stefano Corda, Gagandeep Singh, Ahsan Javed Awan, Roel Jordans, Henk Corporaal

Performance and Portability of Accelerated Lattice Boltzmann Applications with OpenACC

An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power efficient accelerators. Designing efficient applications for these systems has been troublesome in the past as…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-03-02 E. Calore, A. Gabbana, J. Kraus, S. F. Schifano, R. Tripiccione

Dwarfs on Accelerators: Enhancing OpenCL Benchmarking for Heterogeneous Computing Architectures

For reasons of both performance and energy efficiency, high-performance computing (HPC) hardware is becoming increasingly heterogeneous. The OpenCL framework supports portable programming across a wide range of computing devices and is…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-01 Beau Johnston, Josh Milthorpe

Parallel Approximation Algorithms for Facility-Location Problems

This paper presents the design and analysis of parallel approximation algorithms for facility-location problems, including NC and RNC algorithms for (metric) facility location, $k$-center, $k$-median, and $k$-means. These problems…

Data Structures and Algorithms · Computer Science 2010-06-11 Guy E. Blelloch, Kanat Tangwongsan

Joint Hardware-Workload Co-Optimization for In-Memory Computing Accelerators

Software-hardware co-design is essential for optimizing in-memory computing (IMC) hardware accelerators for neural networks. However, most existing optimization frameworks target a single workload, leading to highly specialized hardware…

Hardware Architecture · Computer Science 2026-03-05 Olga Krestinskaya, Mohammed E. Fouda, Ahmed Eltawil, Khaled N. Salama

Platform Independent Software Analysis for Near Memory Computing

Near-memory Computing (NMC) promises improved performance for the applications that can exploit the features of emerging memory technologies such as 3D-stacked memory. However, it is not trivial to find such applications and specialized…

Performance · Computer Science 2019-06-26 Stefano Corda, Gagandeep Singh, Ahsan Javed Awan, Roel Jordans, Henk Corporaal

Analog or Digital In-memory Computing? Benchmarking through Quantitative Modeling

In-Memory Computing (IMC) has emerged as a promising paradigm for energy-efficient, throughput-efficient and area-efficient machine learning at the edge. However, the differences in hardware architectures, array dimensions, and fabrication…

Signal Processing · Electrical Eng. & Systems 2024-05-27 Jiacong Sun, Pouya Houshmand, Marian Verhelst

Evaluating SYCL as a Unified Programming Model for Heterogeneous Systems

High-performance computing (HPC) applications are increasingly executed in heterogeneous environments, introducing new challenges for programming and software portability. SYCL has emerged as a leading model designed to simplify…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-20 Ami Marowka

A Unified Spatially Coupled Code Design: Threshold, Cycles, and Locality

Spatially-Coupled (SC)-LDPC codes are known to have outstanding error-correction performance and low decoding latency. Whereas previous works on LDPC and SC-LDPC codes mostly take either an asymptotic or a finite-length design approach, in…

Information Theory · Computer Science 2022-09-02 Homa Esfahanizadeh, Eshed Ram, Yuval Cassuto, Lara Dolecek

pocl: A Performance-Portable OpenCL Implementation

OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-23 Pekka Jääskeläinen, Carlos Sánchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, Heikki Berg

Using the IBM Analog In-Memory Hardware Acceleration Kit for Neural Network Training and Inference

Analog In-Memory Computing (AIMC) is a promising approach to reduce the latency and energy consumption of Deep Neural Network (DNN) inference and training. However, the noisy and non-linear device characteristics, and the non-ideal…

Emerging Technologies · Computer Science 2024-01-29 Manuel Le Gallo, Corey Lammie, Julian Buechel, Fabio Carta, Omobayode Fagbohungbe, Charles Mackin, Hsinyu Tsai, Vijay Narayanan, Abu Sebastian, Kaoutar El Maghraoui, Malte J. Rasch

AI Accelerators for Large Language Model Inference: Architecture Analysis and Scaling Strategies

The rapid growth of large-language models (LLMs) is driving a new wave of specialized hardware for inference. This paper presents the first workload-centric, cross-architectural performance study of commercial AI accelerators, spanning…

Hardware Architecture · Computer Science 2025-06-10 Amit Sharma

EnviroLLM: Resource Tracking and Optimization for Local AI

Large language models (LLMs) are increasingly deployed locally for privacy and accessibility, yet users lack tools to measure their resource usage, environmental impact, and efficiency metrics. This paper presents EnviroLLM, an open-source…

Machine Learning · Computer Science 2025-12-16 Troy Allen

AIPerf: Automated machine learning as an AI-HPC benchmark

The plethora of complex artificial intelligence (AI) algorithms and available high performance computing (HPC) power stimulates the expeditious development of AI components with heterogeneous designs. Consequently, the need for cross-stack…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-16 Zhixiang Ren, Yongheng Liu, Tianhui Shi, Lei Xie, Yue Zhou, Jidong Zhai, Youhui Zhang, Yunquan Zhang, Wenguang Chen

Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: programming productivity, performance, and energy consumption

Many modern parallel computing systems are heterogeneous at their node level. Such nodes may comprise general purpose CPUs and accelerators (such as GPU or Intel Xeon Phi) that provide high performance with suitable energy-consumption…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-19 Suejb Memeti, Lu Li, Sabri Pllana, Joanna Kolodziej, Christoph Kessler

Analog Foundation Models

Analog in-memory computing (AIMC) is a promising compute paradigm to improve speed and power efficiency of neural network inference beyond the limits of conventional von Neumann-based architectures. However, AIMC introduces fundamental…

Machine Learning · Computer Science 2025-10-28 Julian Büchel, Iason Chalas, Giovanni Acampa, An Chen, Omobayode Fagbohungbe, Sidney Tsai, Kaoutar El Maghraoui, Manuel Le Gallo, Abbas Rahimi, Abu Sebastian

Towards Efficient IMC Accelerator Design Through Joint Hardware-Workload Co-optimization

Designing generalized in-memory computing (IMC) hardware that efficiently supports a variety of workloads requires extensive design space exploration, which is infeasible to perform manually. Optimizing hardware individually for each…

Hardware Architecture · Computer Science 2025-02-04 Olga Krestinskaya, Mohammed E. Fouda, Ahmed Eltawil, Khaled N. Salama

Benchmarking and modeling of analog and digital SRAM in-memory computing architectures

In-memory-computing is emerging as an efficient hardware paradigm for deep neural network accelerators at the edge, enabling to break the memory wall and exploit massive computational parallelism. Two design models have surged: analog…

Hardware Architecture · Computer Science 2023-05-31 Pouya Houshmand, Jiacong Sun, Marian Verhelst