Related papers: Inner Loop Optimizations in Mapping Single Threade…

Loop Unrolling in Multi-pipeline ASIP Design

Application Specific Instruction-set Processor (ASIP) is one of the popular processor design techniques for embedded systems which allows customizability in processor design without overly hindering design flexibility. Multi-pipeline ASIPs…

Programming Languages · Computer Science 2014-02-05 Rajitha Navarathna, Swarnalatha Radhakrishnan, Roshan Ragel

An Improving Method for Loop Unrolling

In this paper we review main ideas mentioned in several other papers which talk about optimization techniques used by compilers. Here we focus on loop unrolling technique and its effect on power consumption, energy usage and also its impact…

Programming Languages · Computer Science 2013-08-13 Meisam Booshehri, Abbas Malekpour, Peter Luksch

Exploring Thread Coarsening on FPGA

Over the past few years, there has been an increased interest in including FPGAs in data centers and high-performance computing clusters along with GPUs and other accelerators. As a result, it has become increasingly important to have a…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-14 Mostafa Eghbali Zarch, Reece Neff, Michela Becchi

Enhancing the performance of Decoupled Software Pipeline through Backward Slicing

The rapidly increasing number of cores available in multicore processors does not necessarily lead directly to a commensurate increase in performance: programs written in conventional languages, such as C, need careful restructuring,…

Programming Languages · Computer Science 2015-01-28 Esraa Alwan, John Fitch, Julian Padget

Fast and simple inner-loop algorithms of static / dynamic BLP estimations

This study investigates computationally efficient inner-loop algorithms for estimating static/dynamic BLP models. It provides the following ideas for reducing the number of inner-loop iterations: (1). Add a term relating to the outside…

Econometrics · Economics 2025-04-25 Takeshi Fukasawa

LayerPipe: Accelerating Deep Neural Network Training by Intra-Layer and Inter-Layer Gradient Pipelining and Multiprocessor Scheduling

The time required for training the neural networks increases with size, complexity, and depth. Training model parameters by backpropagation inherently creates feedback loops. These loops hinder efficient pipelining and scheduling of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-12-30 Nanda K. Unnikrishnan, Keshab K. Parhi

Energy-Efficiency Evaluation of OpenMP Loop Transformations and Runtime Constructs

OpenMP is the de facto API for parallel programming in HPC applications. These programs are often computed in data centers, where energy consumption is a major issue. Whereas previous work has focused almost entirely on performance, we here…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-12 Henrik Valter, Axel Karlsson, Miquel Pericàs

OpenMP Loop Scheduling Revisited: Making a Case for More Schedules

In light of continued advances in loop scheduling, this work revisits the OpenMP loop scheduling by outlining the current state of the art in loop scheduling and presenting evidence that the existing OpenMP schedules are insufficient for…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-09-11 Florina M. Ciorba, Christian Iwainsky, Patrick Buder

Software Pipelining for Quantum Loop Programs

We propose a method for performing software pipelining on quantum for-loop programs, exploiting parallelism in and across iterations. We redefine concepts that are useful in program optimization, including array aliasing, instruction…

Quantum Physics · Physics 2020-12-25 Jingzhe Guo, Mingsheng Ying

Mapping Matters: Application Process Mapping on 3-D Processor Topologies

Applications' performance is influenced by the mapping of processes to computing nodes, the frequency and volume of exchanges among processing elements, the network capacity, and the routing protocol. A poor mapping of application processes…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-11 Jonas H. Müller Korndörfer, Mario Bielert, Laércio L. Pilla, Florina M. Ciorba

An evaluation of a microprocessor with two independent hardware execution threads coupled through a shared cache

We investigate the utility of augmenting a microprocessor with a single execution pipeline by adding a second copy of the execution pipeline in parallel with the existing one. The resulting dual-hardware-threaded microprocessor has two…

Hardware Architecture · Computer Science 2023-05-30 Madhav P. Desai

FusionStitching: Boosting Execution Efficiency of Memory Intensive Computations for DL Workloads

Performance optimization is the art of continuous seeking a harmonious mapping between the application domain and hardware. Recent years have witnessed a surge of deep learning (DL) applications in industry. Conventional wisdom for…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-27 Guoping Long, Jun Yang, Wei Lin

Effect of Thread Level Parallelism on the Performance of Optimum Architecture for Embedded Applications

According to the increasing complexity of network application and internet traffic, network processor as a subset of embedded processors have to process more computation intensive tasks. By scaling down the feature size and emersion of chip…

Hardware Architecture · Computer Science 2012-04-13 Mehdi Alipour, Hojjat Taghdisi

Adaptive Performance Optimization under Power Constraint in Multi-thread Applications with Diverse Scalability

In modern data centers, energy usage represents one of the major factors affecting operational costs. Power capping is a technique that limits the power consumption of individual systems, which allows reducing the overall power demand at…

Performance · Computer Science 2017-09-05 Stefano Conoci, Pierangelo Di Sanzo, Bruno Ciciani, Francesco Quaglia

Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance

Token generation speed is critical to power the next wave of AI inference applications. GPUs significantly underperform during token generation due to synchronization overheads at kernel boundaries, utilizing only 21% of their peak memory…

Computation and Language · Computer Science 2024-11-01 David Koeplinger, Darshan Gandhi, Pushkar Nandkar, Nathan Sheeley, Matheen Musaddiq, Leon Zhang, Reid Goodbar, Matthew Shaffer, Han Wang, Angela Wang, Mingran Wang, Raghu Prabhakar

Hardware-Aware Data and Instruction Mapping for AI Tasks: Balancing Parallelism, I/O and Memory Tradeoffs

We introduce a mapping framework for deep learning inference that takes advantage of predictable neural network behavior to plan both computation and communication ahead of time. The framework generates a unified stream of instructions and…

Hardware Architecture · Computer Science 2025-09-05 Md Rownak Hossain Chowdhury, Mostafizur Rahman

SYNPA: SMT Performance Analysis and Allocation of Threads to Cores in ARM Processors

Simultaneous multithreading processors improve throughput over single-threaded processors thanks to sharing internal core resources among instructions from distinct threads. However, resource sharing introduces inter-thread interference…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-20 Marta Navarro, Josué Feliu, Salvador Petit, María E. Gómez, Julio Sahuquillo

Single-Loop Deterministic and Stochastic Interior-Point Algorithms for Nonlinearly Constrained Optimization

An interior-point algorithm framework is proposed, analyzed, and tested for solving nonlinearly constrained continuous optimization problems. The main setting of interest is when the objective and constraint functions may be nonlinear…

Optimization and Control · Mathematics 2024-08-30 Frank E. Curtis, Xin Jiang, Qi Wang

Using Deep Neural Networks for Estimating Loop Unrolling Factor

Optimizing programs requires deep expertise. On one hand, it is a tedious task, because it requires a lot of tests to find out the best combination of optimizations to apply with their best factors. On the other hand, this task is critical,…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-12 Asma Balamane, Zina Taklit

Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping

The multi-pumping resource sharing technique can overcome the limitations commonly found in single-clocked FPGA designs by allowing hardware components to operate at a higher clock frequency than the surrounding system. However, this…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-11 Carl-Johannes Johnsen, Tiziano De Matteis, Tal Ben-Nun, Johannes de Fine Licht, Torsten Hoefler