Related papers: Accelerating Irregular Applications via Efficient …

Strategies for Efficient Executions of Irregular Message-Driven Parallel Applications on GPU Systems

Message-driven executions with over-decomposition of tasks constitute an important model for parallel programming and have been demonstrated for irregular applications. Supporting efficient execution of such message-driven irregular…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-14 Vasudevan Rengasamy , Sathish Vadhiyar

SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures

Near-Data-Processing (NDP) architectures present a promising way to alleviate data movement costs and can provide significant performance and energy benefits to parallel applications. Typically, NDP architectures support several NDP units,…

Hardware Architecture · Computer Science 2021-02-16 Christina Giannoula , Nandita Vijaykumar , Nikela Papadopoulou , Vasileios Karakostas , Ivan Fernandez , Juan Gómez-Luna , Lois Orosa , Nectarios Koziris , Georgios Goumas , Onur Mutlu

Processor Allocation for Optimistic Parallelization of Irregular Programs

Optimistic parallelization is a promising approach for the parallelization of irregular algorithms: potentially interfering tasks are launched dynamically, and the runtime system detects conflicts between concurrent activities, aborting and…

Programming Languages · Computer Science 2012-06-28 Francesco Versaci , Keshav Pingali

Reducing overheads of dynamic scheduling on heterogeneous chips

In recent processor development, we have witnessed the integration of GPU and CPUs into a single chip. The result of this integration is a reduction of the data communication overheads. This enables an efficient collaboration of both…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-07 Francisco Corbera , Andrés Rodríguez , Rafael Asenjo , Angeles Navarro , Antonio Vilches , María J. Garzarán

Compiler-Assisted Workload Consolidation For Efficient Dynamic Parallelism on GPU

GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset.…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-18 Hancheng Wu , Da Li , Michela Becchi

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2)…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Xinyao Yi

NearPM: A Near-Data Processing System for Storage-Class Applications

Persistent Memory (PM) technologies enable program recovery to a consistent state in a case of failure. To ensure this crash-consistent behavior, programs need to enforce persist ordering by employing mechanisms, such as logging and…

Computational Engineering, Finance, and Science · Computer Science 2023-04-03 Yasas Seneviratne , Korakit Seemakhupt , Sihang Liu , Samira Khan

UpDown: Programmable fine-grained Events for Scalable Performance on Irregular Applications

Applications with irregular data structures, data-dependent control flows and fine-grained data transfers (e.g., real-world graph computations) perform poorly on cache-based systems. We propose the UpDown accelerator that supports…

Hardware Architecture · Computer Science 2024-07-31 Andronicus Rajasukumar , Jiya Su , Yuqing , Wang , Tianshuo Su , Marziyeh Nourian , Jose M Monsalve Diaz , Tianchi Zhang , Jianru Ding , Wenyi Wang , Ziyi Zhang , Moubarak Jeje , Henry Hoffmann , Yanjing Li , Andrew A. Chien

Exploring the Limits of GPUs With Parallel Graph Algorithms

In this paper, we explore the limits of graphics processors (GPUs) for general purpose parallel computing by studying problems that require highly irregular data access patterns: parallel graph algorithms for list ranking and connected…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-02-25 Frank Dehne , Kumanan Yogaratnam

Scheduling Task-parallel Applications in Dynamically Asymmetric Environments

Shared resource interference is observed by applications as dynamic performance asymmetry. Prior art has developed approaches to reduce the impact of performance asymmetry mainly at the operating system and architectural levels. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-24 Jing Chen , Pirah Noor Soomro , Mustafa Abduljabbar , Madhavan Manivannan , Miquel Pericas

Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones

The growth in the use of computationally intensive statistical procedures, especially with Big Data, has necessitated the usage of parallel computation on diverse platforms such as multicore, GPU, clusters and clouds. However, slowdown due…

Computation · Statistics 2014-09-23 Norman Matloff

Study and evaluation of an Irregular Graph Algorithm on Multicore and GPU Processor Architectures

One area of Computing applications which poses significant challenge of performance scalability on Chip Multiprocessors(CMP's) are Irregular applications. Such applications have very little computation and unpredictable memory access…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-03-09 Varun Nagpal

Design and implementation of self-adaptable parallel algorithms for scientific computing on highly heterogeneous HPC platforms

Traditional heterogeneous parallel algorithms, designed for heterogeneous clusters of workstations, are based on the assumption that the absolute speed of the processors does not depend on the size of the computational task. This assumption…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-09-15 Alexey Lastovetsky , Ravi Reddy , Vladimir Rychkov , David Clarke

GRAPHOPT: constrained-optimization-based parallelization of irregular graphs

Sparse, irregular graphs show up in various applications like linear algebra, machine learning, engineering simulations, robotic control, etc. These graphs have a high degree of parallelism, but their execution on parallel threads of modern…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-17 Nimish Shah , Wannes Meert , Marian Verhelst

LazyPIM: Efficient Support for Cache Coherence in Processing-in-Memory Architectures

Processing-in-memory (PIM) architectures have seen an increase in popularity recently, as the high internal bandwidth available within 3D-stacked memory provides greater incentive to move some computation into the logic layer of the memory.…

Hardware Architecture · Computer Science 2017-06-13 Amirali Boroumand , Saugata Ghose , Minesh Patel , Hasan Hassan , Brandon Lucia , Nastaran Hajinazar , Kevin Hsieh , Krishna T. Malladi , Hongzhong Zheng , Onur Mutlu

Making Applications Faster by Asynchronous Execution: Slowing Down Processes or Relaxing MPI Collectives

Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-06 Ayesha Afzal , Georg Hager , Stefano Markidis , Gerhard Wellein

Membrane: Accelerating Database Analytics with Bank-Level DRAM-PIM Filtering

In-memory database query processing frequently involves substantial data transfers between the CPU and memory, leading to inefficiencies due to Von Neumann bottleneck. Processing-in-Memory (PIM) architectures offer a viable solution to…

Hardware Architecture · Computer Science 2025-04-10 Akhil Shekar , Kevin Gaffney , Martin Prammer , Khyati Kiyawat , Lingxi Wu , Helena Caminal , Zhenxing Fan , Yimin Gao , Ashish Venkat , José F. Martínez , Jignesh Patel , Kevin Skadron

Irregular Accesses Reorder Unit: Improving GPGPU Memory Coalescing for Graph-Based Workloads

GPGPU architectures have become established as the dominant parallelization and performance platform achieving exceptional popularization and empowering domains such as regular algebra, machine learning, image detection and self-driving…

Hardware Architecture · Computer Science 2022-03-17 Albert Segura , Jose-Maria Arnau , Antonio Gonzalez

ICP: Exploiting Instruction Correlation for Prefetching Irregular Memory Accesses

Irregular memory accesses pose challenges for effective and efficient data prefetching. While temporal prefetchers have recently shown promise for irregular memory access patterns, their effectiveness fundamentally depends on temporal…

Hardware Architecture · Computer Science 2026-05-18 Mengming Li , Chenlu Miao , Buqing Xu , Qijun Zhang , Xiangfeng Sun , Ceyu Xu , Yuan Xie , Wenkai Li , Shang Liu , Zhiyao Xie

ASAP: Asynchronous Approximate Data-Parallel Computation

Emerging workloads, such as graph processing and machine learning are approximate because of the scale of data involved and the stochastic nature of the underlying algorithms. These algorithms are often distributed over multiple machines…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-12-28 Asim Kadav , Erik Kruus