Related papers: DCAFE: Dynamic load-balanced loop Chunking & Aggre…

200 papers

Parallel applications often rely on work stealing schedulers in combination with fine-grained tasking to achieve high performance and scalability. However, reducing the total energy consumption in the context of work stealing runtimes is…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-31 Jing Chen, Madhavan Manivannan, Mustafa Abduljabbar, Miquel Pericàs

Achieving efficient task parallelism on many-core architectures is an important challenge. The widely used GNU OpenMP implementation of the popular OpenMP parallel programming model incurs high overhead for fine-grained, short-running tasks…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-20 Wenyi Wang, Maxime Gonthier, Poornima Nookala, Haochen Pan, Ian Foster, Ioan Raicu, Kyle Chard

In this case study, we investigate the impact of workload balance on the performance of multi-FPGA codes. We start with an application in which two distinct kernels run in parallel on two SRC-6 MAP processors. We observe that one of the MAP…

Astrophysics · Physics 2007-11-14 Volodymyr V. Kindratenko, Robert J. Brunner, Adam D. Myers

GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset.…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-18 Hancheng Wu, Da Li, Michela Becchi

Precise estimation of model inference latency is crucial for time-critical mobile edge applications, enabling devices to calculate latency margins against deadlines and trade them for enhanced model performance or resource savings. However,…

Hardware Architecture · Computer Science 2026-04-20 Jiesong Chen, Jun You, Zhidan Liu, Zhenjiang Li

A low-cap power budget is challenging for exascale computing. Dynamic Voltage and Frequency Scaling (DVFS) and Uncore Frequency Scaling (UFS) are the two widely used techniques for limiting the HPC application's energy footprint. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-05 Sunil Kumar, Akshat Gupta, Vivek Kumar, Sridutt Bhalachandra

We study parallel algorithms for the minimisation and equivalence checking of Deterministic Finite Automata (DFAs). Regarding DFA minimisation, we implement four different massively parallel algorithms on Graphics Processing Units (GPUs).…

Formal Languages and Automata Theory · Computer Science 2025-08-29 Jan Heemstra, Jan Martens, Anton Wijs

We present techniques to parallelize membership tests for Deterministic Finite Automata (DFAs). Our method searches arbitrary regular expressions by matching multiple bytes in parallel using speculation. We partition the input string into…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-07-23 Yousun Ko, Minyoung Jung, Yo-Sub Han, Bernd Burgstaller

Current approaches to scheduling workloads on heterogeneous systems with specialized accelerators often rely on manual partitioning, offloading tasks with specific compute patterns to accelerators. This method requires extensive…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-12 Zhenyu Bai, Dan Wu, Pranav Dangi, Dhananjaya Wijerathne, Venkata Pavan Kumar Miriyala, Tulika Mitra

The main computing tasks of a finite element (FE) code for solving partial differential equations (PDEs) are the algebraic system assembly and the iterative solver. This work focuses on the first task, in the context of a hybrid MPI+X…

Mathematical Software · Computer Science 2019-05-28 Marta Garcia-Gasulla, Guillaume Houzeaux, Roger Ferrer, Antoni Artigues, Victor López, Jesús Labarta, Mariano Vázquez

We study the performance behaviour of a seismic simulation using the ExaHyPE engine with a specific focus on memory characteristics and energy needs. ExaHyPE combines dynamically adaptive mesh refinement (AMR) with ADER-DG. It is…

Scientific applications often contain large and computationally intensive parallel loops. Dynamic loop self scheduling (DLS) is used to achieve a balanced load execution of such applications on high performance computing (HPC) systems.…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-07 Ali Mohammed, Aurelien Cavelan, Florina M. Ciorba

In parallel iterative applications, computational efficiency is essential for addressing large problems. Load imbalance is one of the major performance degradation factors of parallel applications. Therefore, distributing, cleverly, and as…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-18 Anthony Boulmier, Franck Raynaud, Nabil Abdennadher, Bastien Chopard

Several classic problems in graph processing and computational geometry are solved via incremental algorithms, which split computation into a series of small tasks acting on shared state, which gets updated progressively. While the…

Data Structures and Algorithms · Computer Science 2020-03-24 Dan Alistarh, Nikita Koval, Giorgi Nadiradze

As energy efficiency became a critical factor in the embedded systems domain, dynamic voltage and frequency scaling (DVFS) techniques have emerged as means to control the system's power and energy efficiency. Additionally, due to the…

Hardware Architecture · Computer Science 2016-01-11 Jonatan Waern, Per Ekemark, Konstantinos Koukos, Stefanos Kaxiras, Alexandra Jimborean

Edge computing systems struggle to efficiently manage multiple concurrent deep neural network (DNN) workloads while meeting strict latency requirements, minimizing power consumption, and maintaining environmental sustainability. This paper…

Machine Learning · Computer Science 2025-03-07 Varatheepan Paramanayakam, Andreas Karatzas, Dimitrios Stamoulis, Iraklis Anagnostopoulos

Efficient implementations of parallel applications on heterogeneous hybrid architectures require a careful balance between computations and communications with accelerator devices. Even if most of the communication time can be overlapped by…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-09-22 Raphaël Bleuse, Thierry Gautier, João V. F. Lima, Grégory Mounié, Denis Trystram

We introduce Differential Performance Evaluation (DPE), a framework designed to reliably evaluate Large Language Models (LLMs) for efficient code generation. Traditional coding benchmarks often fail to provide reliable insights into code…

Software Engineering · Computer Science 2024-08-14 Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, Lingming Zhang

Finite-element (FE) discretisations have emerged as a powerful real-space alternative to large-scale Kohn-Sham density functional theory (DFT) calculations, offering systematic convergence, excellent parallel scalability, while…

Computational Physics · Physics 2025-12-11 Gourab Panigrahi, Phani Motamarri

The rapid adoption of large language models (LLMs) is pushing AI accelerators toward increasingly powerful and specialized designs. Instead of further complicating software development with deeply hierarchical scratchpad memories (SPMs) and…

Hardware Architecture · Computer Science 2025-12-09 Zhongchun Zhou, Chengtao Lai, Yuhang Gu, Wei Zhang