Related papers: Accelerating Irregular Applications via Efficient …
Message-driven executions with over-decomposition of tasks constitute an important model for parallel programming and have been demonstrated for irregular applications. Supporting efficient execution of such message-driven irregular…
Near-Data-Processing (NDP) architectures present a promising way to alleviate data movement costs and can provide significant performance and energy benefits to parallel applications. Typically, NDP architectures support several NDP units,…
Optimistic parallelization is a promising approach for the parallelization of irregular algorithms: potentially interfering tasks are launched dynamically, and the runtime system detects conflicts between concurrent activities, aborting and…
In recent processor development, we have witnessed the integration of GPU and CPUs into a single chip. The result of this integration is a reduction of the data communication overheads. This enables an efficient collaboration of both…
GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset.…
Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2)…
Persistent Memory (PM) technologies enable program recovery to a consistent state in a case of failure. To ensure this crash-consistent behavior, programs need to enforce persist ordering by employing mechanisms, such as logging and…
Applications with irregular data structures, data-dependent control flows and fine-grained data transfers (e.g., real-world graph computations) perform poorly on cache-based systems. We propose the UpDown accelerator that supports…
In this paper, we explore the limits of graphics processors (GPUs) for general purpose parallel computing by studying problems that require highly irregular data access patterns: parallel graph algorithms for list ranking and connected…
Shared resource interference is observed by applications as dynamic performance asymmetry. Prior art has developed approaches to reduce the impact of performance asymmetry mainly at the operating system and architectural levels. In this…
The growth in the use of computationally intensive statistical procedures, especially with Big Data, has necessitated the usage of parallel computation on diverse platforms such as multicore, GPU, clusters and clouds. However, slowdown due…
One area of Computing applications which poses significant challenge of performance scalability on Chip Multiprocessors(CMP's) are Irregular applications. Such applications have very little computation and unpredictable memory access…
Traditional heterogeneous parallel algorithms, designed for heterogeneous clusters of workstations, are based on the assumption that the absolute speed of the processors does not depend on the size of the computational task. This assumption…
Sparse, irregular graphs show up in various applications like linear algebra, machine learning, engineering simulations, robotic control, etc. These graphs have a high degree of parallelism, but their execution on parallel threads of modern…
Processing-in-memory (PIM) architectures have seen an increase in popularity recently, as the high internal bandwidth available within 3D-stacked memory provides greater incentive to move some computation into the logic layer of the memory.…
Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI…
In-memory database query processing frequently involves substantial data transfers between the CPU and memory, leading to inefficiencies due to Von Neumann bottleneck. Processing-in-Memory (PIM) architectures offer a viable solution to…
GPGPU architectures have become established as the dominant parallelization and performance platform achieving exceptional popularization and empowering domains such as regular algebra, machine learning, image detection and self-driving…
Irregular memory accesses pose challenges for effective and efficient data prefetching. While temporal prefetchers have recently shown promise for irregular memory access patterns, their effectiveness fundamentally depends on temporal…
Emerging workloads, such as graph processing and machine learning are approximate because of the scale of data involved and the stochastic nature of the underlying algorithms. These algorithms are often distributed over multiple machines…