Related papers: Optimizing Fine-Grained Parallelism Through Dynami…
Nowadays, latency-critical, high-performance applications are parallelized even on power-constrained client systems to improve performance. However, an important scenario of fine-grained tasking on simultaneous multithreading CPU cores in…
Task parallelism as employed by the OpenMP task construct, although ideal for tackling irregular problems or typical producer/consumer schemes, bears some potential for performance bottlenecks if locality of data access is important, which…
Task parallelism as employed by the OpenMP task construct or some Intel Threading Building Blocks (TBB) components, although ideal for tackling irregular problems or typical producer/consumer schemes, bears some potential for performance…
The aim of parallel computing is to increase an application performance by executing the application on multiple processors. OpenMP is an API that supports multi platform shared memory programming model and shared-memory programs are…
Parallel processing is considered as todays and future trend for improving performance of computers. Computing devices ranging from small embedded systems to big clusters of computers rely on parallelizing applications to reduce execution…
Processors with large numbers of cores are becoming commonplace. In order to take advantage of the available resources in these systems, the programming paradigm has to move towards increased parallelism. However, increasing the level of…
Asymmetric multicore processors (AMPs) couple high-performance big cores and low-power small cores with the same instruction-set architecture but different features, such as clock frequency or microarchitecture. Previous work has shown that…
OpenMP has been the de facto standard for single node parallelism for more than a decade. Recently, asynchronous many-task runtime (AMT) systems have increased in popularity as a new programming paradigm for high performance computing…
The simplex algorithm has been successfully used for many years in solving linear programming (LP) problems. Due to the intensive computations required (especially for the solution of large LP problems), parallel approaches have also…
Many modern parallel computing systems are heterogeneous at their node level. Such nodes may comprise general purpose CPUs and accelerators (such as, GPU, or Intel Xeon Phi) that provide high performance with suitable energy-consumption…
Exascale computing systems will exhibit high degrees of hierarchical parallelism, with thousands of computing nodes and hundreds of cores per node. Efficiently exploiting hierarchical parallelism is challenging due to load imbalance that…
Work-stealing is a widely used technique for balancing irregular parallel workloads, and most modern runtime systems adopt lock-free work-stealing deques to reduce contention and improve scalability. However, existing algorithms are…
OpenMP is the de-facto standard for shared memory systems in High-Performance Computing (HPC). It includes a task-based model that offers a high-level of abstraction to effectively exploit highly dynamic structured and unstructured…
The trend towards highly parallel multi-processing is ubiquitous in all modern computer architectures, ranging from handheld devices to large-scale HPC systems; yet many applications are struggling to fully utilise the multiple levels of…
There are two intertwined factors that affect performance of concurrent data structures: the ability of processes to access the data in parallel and the cost of synchronization. It has been observed that for a large class of…
The Simplex tableau has been broadly used and investigated in the industry and academia. With the advent of the big data era, ever larger problems are posed to be solved in ever larger machines whose architecture type did not exist in the…
Parallel task-based programming models, like OpenMP, allow application developers to easily create a parallel version of their sequential codes. The standard OpenMP 4.0 introduced the possibility of describing a set of data dependences per…
Furthering our understanding of many of today's interesting problems in plasma physics---including plasma based acceleration and magnetic reconnection with pair production due to quantum electrodynamic effects---requires large-scale kinetic…
Task graphs have been studied for decades as a foundation for scheduling irregular parallel applications and incorporated in programming models such as OpenMP. While many high-performance parallel libraries are based on task graphs, they…
Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…