Related papers: DCAFE: Dynamic load-balanced loop Chunking & Aggre…

ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes

Parallel applications often rely on work stealing schedulers in combination with fine-grained tasking to achieve high performance and scalability. However, reducing the total energy consumption in the context of work stealing runtimes is…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-31 Jing Chen, Madhavan Manivannan, Mustafa Abduljabbar, Miquel Pericàs

Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems

Achieving efficient task parallelism on many-core architectures is an important challenge. The widely used GNU OpenMP implementation of the popular OpenMP parallel programming model incurs high overhead for fine-grained, short-running tasks…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-20 Wenyi Wang, Maxime Gonthier, Poornima Nookala, Haochen Pan, Ian Foster, Ioan Raicu, Kyle Chard

Dynamic load-balancing on multi-FPGA systems: a case study

In this case study, we investigate the impact of workload balance on the performance of multi-FPGA codes. We start with an application in which two distinct kernels run in parallel on two SRC-6 MAP processors. We observe that one of the MAP…

Astrophysics · Physics 2007-11-14 Volodymyr V. Kindratenko, Robert J. Brunner, Adam D. Myers

Compiler-Assisted Workload Consolidation For Efficient Dynamic Parallelism on GPU

GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset.…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-18 Hancheng Wu, Da Li, Michela Becchi

Taming Asynchronous CPU-GPU Coupling for Frequency-aware Latency Estimation on Mobile Edge

Precise estimation of model inference latency is crucial for time-critical mobile edge applications, enabling devices to calculate latency margins against deadlines and trade them for enhanced model performance or resource savings. However,…

Hardware Architecture · Computer Science 2026-04-20 Jiesong Chen, Jun You, Zhidan Liu, Zhenjiang Li

Cuttlefish: Library for Achieving Energy Efficiency in Multicore Parallel Programs

A low-cap power budget is challenging for exascale computing. Dynamic Voltage and Frequency Scaling (DVFS) and Uncore Frequency Scaling (UFS) are the two widely used techniques for limiting the HPC application's energy footprint. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-05 Sunil Kumar, Akshat Gupta, Vivek Kumar, Sridutt Bhalachandra

Evaluating Massively Parallel Algorithms for DFA Minimisation, Equivalence Checking and Inclusion Checking

We study parallel algorithms for the minimisation and equivalence checking of Deterministic Finite Automata (DFAs). Regarding DFA minimisation, we implement four different massively parallel algorithms on Graphics Processing Units (GPUs).…

Formal Languages and Automata Theory · Computer Science 2025-08-29 Jan Heemstra, Jan Martens, Anton Wijs

A Speculative Parallel DFA Membership Test for Multicore, SIMD and Cloud Computing Environments

We present techniques to parallelize membership tests for Deterministic Finite Automata (DFAs). Our method searches arbitrary regular expressions by matching multiple bytes in parallel using speculation. We partition the input string into…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-07-23 Yousun Ko, Minyoung Jung, Yo-Sub Han, Bernd Burgstaller

Data-aware Dynamic Execution of Irregular Workloads on Heterogeneous Systems

Current approaches to scheduling workloads on heterogeneous systems with specialized accelerators often rely on manual partitioning, offloading tasks with specific compute patterns to accelerators. This method requires extensive…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-12 Zhenyu Bai, Dan Wu, Pranav Dangi, Dhananjaya Wijerathne, Venkata Pavan Kumar Miriyala, Tulika Mitra

MPI+X: task-based parallelization and dynamic load balance of finite element assembly

The main computing tasks of a finite element (FE) code for solving partial differential equations (PDEs) are the algebraic system assembly and the iterative solver. This work focuses on the first task, in the context of a hybrid MPI+X…

Mathematical Software · Computer Science 2019-05-28 Marta Garcia-Gasulla, Guillaume Houzeaux, Roger Ferrer, Antoni Artigues, Victor López, Jesús Labarta, Mariano Vázquez

Studies on the energy and deep memory behaviour of a cache-oblivious, task-based hyperbolic PDE solver

We study the performance behaviour of a seismic simulation using the ExaHyPE engine with a specific focus on memory characteristics and energy needs. ExaHyPE combines dynamically adaptive mesh refinement (AMR) with ADER-DG. It is…

Mathematical Software · Computer Science 2019-06-18 Dominic E. Charrier, Benjamin Hazelwood, Ekaterina Tutlyaeva, Michael Bader, Michael Dumbser, Andrey Kudryavtsev, Alexander Moskovsky, Tobias Weinzierl

rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Parallel Independent Tasks

Scientific applications often contain large and computationally intensive parallel loops. Dynamic loop self scheduling (DLS) is used to achieve a balanced load execution of such applications on high performance computing (HPC) systems.…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-07 Ali Mohammed, Aurelien Cavelan, Florina M. Ciorba

On the Benefits of Anticipating Load Imbalance for Performance Optimization of Parallel Applications

In parallel iterative applications, computational efficiency is essential for addressing large problems. Load imbalance is one of the major performance degradation factors of parallel applications. Therefore, distributing, cleverly, and as…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-18 Anthony Boulmier, Franck Raynaud, Nabil Abdennadher, Bastien Chopard

Efficiency Guarantees for Parallel Incremental Algorithms under Relaxed Schedulers

Several classic problems in graph processing and computational geometry are solved via incremental algorithms, which split computation into a series of small tasks acting on shared state, which gets updated progressively. While the…

Data Structures and Algorithms · Computer Science 2020-03-24 Dan Alistarh, Nikita Koval, Giorgi Nadiradze

Profiling-Assisted Decoupled Access-Execute

As energy efficiency became a critical factor in the embedded systems domain, dynamic voltage and frequency scaling (DVFS) techniques have emerged as means to control the system's power and energy efficiency. Additionally, due to the…

Hardware Architecture · Computer Science 2016-01-11 Jonatan Waern, Per Ekemark, Konstantinos Koukos, Stefanos Kaxiras, Alexandra Jimborean

Ecomap: Sustainability-Driven Optimization of Multi-Tenant DNN Execution on Edge Servers

Edge computing systems struggle to efficiently manage multiple concurrent deep neural network (DNN) workloads while meeting strict latency requirements, minimizing power consumption, and maintaining environmental sustainability. This paper…

Machine Learning · Computer Science 2025-03-07 Varatheepan Paramanayakam, Andreas Karatzas, Dimitrios Stamoulis, Iraklis Anagnostopoulos

Scheduling data flow program in xkaapi: A new affinity based Algorithm for Heterogeneous Architectures

Efficient implementations of parallel applications on heterogeneous hybrid architectures require a careful balance between computations and communications with accelerator devices. Even if most of the communication time can be overlapped by…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-09-22 Raphaël Bleuse, Thierry Gautier, João V. F. Lima, Grégory Mounié, Denis Trystram

Evaluating Language Models for Efficient Code Generation

We introduce Differential Performance Evaluation (DPE), a framework designed to reliably evaluate Large Language Models (LLMs) for efficient code generation. Traditional coding benchmarks often fail to provide reliable insights into code…

Software Engineering · Computer Science 2024-08-14 Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, Lingming Zhang

Matrix-free algorithms for fast ab initio calculations on distributed CPU architectures using finite-element discretization

Finite-element (FE) discretisations have emerged as a powerful real-space alternative to large-scale Kohn-Sham density functional theory (DFT) calculations, offering systematic convergence, excellent parallel scalability, while…

Computational Physics · Physics 2025-12-11 Gourab Panigrahi, Phani Motamarri

DCO: Dynamic Cache Orchestration for LLM Accelerators through Predictive Management

The rapid adoption of large language models (LLMs) is pushing AI accelerators toward increasingly powerful and specialized designs. Instead of further complicating software development with deeply hierarchical scratchpad memories (SPMs) and…

Hardware Architecture · Computer Science 2025-12-09 Zhongchun Zhou, Chengtao Lai, Yuhang Gu, Wei Zhang