Related papers: FlipFlop: A Static Analysis-based Energy Optimizat…

Prediction of Performance and Power Consumption of GPGPU Applications

Graphics Processing Units (GPUs) have become an integral part of High-Performance Computing to achieve Exascale performance. The main goal of GPU application developers is to tune their code extensively to obtain optimal performance,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-04 Gargi Alavani, Santonu Sarkar

GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-30 Ruifan Chu, Anbang Wang, Xiuxiu Bai, Shuai Liu, Xiaoshe Dong

AI Application Benchmarking: Power-Aware Performance Analysis for Vision and Language Models

Artificial Intelligence (AI) workloads drive a rapid expansion of high-performance computing (HPC) infrastructures and increase their power and energy demands towards a critical level. AI benchmarks representing state-of-the-art workloads…

Performance · Computer Science 2026-03-18 Martin Mayr, Sebastian Wind, Lukas Schröder, Georg Hager, Harald Köstler, Gerhard Wellein

Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity

Modern GPU software stacks demand developers who can anticipate performance bottlenecks before ever launching a kernel; misjudging floating-point workloads upstream can derail tuning, scheduling, and even hardware procurement. Yet despite…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-05 Gregory Bolet, Giorgis Georgakoudis, Konstantinos Parasyris, Harshitha Menon, Niranjan Hasabnis, Kirk W. Cameron, Gal Oren

ALCOP: Automatic Load-Compute Pipelining in Deep Learning Compiler for AI-GPUs

Pipelining between data loading and computation is a critical tensor program optimization for GPUs. In order to unleash the high performance of latest GPUs, we must perform a synergetic optimization of multi-stage pipelining across the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-09 Guyue Huang, Yang Bai, Liu Liu, Yuke Wang, Bei Yu, Yufei Ding, Yuan Xie

EnergAIzer: Fast and Accurate GPU Power Estimation Framework for AI Workloads

As AI workloads drive increases in datacenter power consumption, accurate GPU power estimation is critical for proactive power management. However, existing power models face a scalability bottleneck not in the modeling techniques…

Hardware Architecture · Computer Science 2026-04-23 Kyungmi Lee, Zhiye Song, Eun Kyung Lee, Xin Zhang, Tamar Eilam, Anantha P. Chandrakasan

FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale

Graph Neural Networks (GNNs) have shown great superiority on non-Euclidean graph data, achieving ground-breaking performance on various graph-related tasks. As a practical solution to train GNN on large graphs with billions of nodes and…

Machine Learning · Computer Science 2024-09-24 Zeyu Zhu, Peisong Wang, Qinghao Hu, Gang Li, Xiaoyao Liang, Jian Cheng

Dynamic GPU Energy Optimization for Machine Learning Training Workloads

GPUs are widely used to accelerate the training of machine learning workloads. As modern machine learning models become increasingly larger, they require a longer time to train, leading to higher GPU energy consumption. This paper presents…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-06 Farui Wang, Weizhe Zhang, Shichao Lai, Meng Hao, Zheng Wang

A Data-Driven Frequency Scaling Approach for Deadline-aware Energy Efficient Scheduling on Graphics Processing Units (GPUs)

Modern computing paradigms, such as cloud computing, are increasingly adopting GPUs to boost their computing capabilities primarily due to the heterogeneous nature of AI/ML/deep learning workloads. However, the energy consumption of GPUs is…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-29 Shashikant Ilager, Rajeev Muralidhar, Kotagiri Rammohanrao, Rajkumar Buyya

Going green: optimizing GPUs for energy efficiency through model-steered auto-tuning

Graphics Processing Units (GPUs) have revolutionized the computing landscape over the past decade. However, the growing energy demands of data centres and computing facilities equipped with GPUs come with significant capital and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-15 Richard Schoonhoven, Bram Veenboer, Ben van Werkhoven, Kees Joost Batenburg

On the Impact of Device-Level Techniques on Energy-Efficiency of Neural Network Accelerators

Energy-efficiency is a key concern for neural network applications. To alleviate this issue, hardware acceleration using FPGAs or GPUs can provide better energy-efficiency than general-purpose processors. However, further improvement of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-29 Seyed Morteza Nabavinejad, Behzad Salami

8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks

Performance optimization can be a daunting task especially as the hardware architecture becomes more and more complex. This paper takes a kernel from the Materials Science code BerkeleyGW, and demonstrates a few performance analysis and…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-24 Charlene Yang

A Metaprogramming and Autotuning Framework for Deploying Deep Learning Applications

In recent years, deep neural networks (DNNs) have yielded strong results on a wide range of applications. Graphics Processing Units (GPUs) have been one key enabling factor leading to the current popularity of DNNs. However, despite…

Neural and Evolutionary Computing · Computer Science 2016-11-22 Matthew W. Moskewicz, Ali Jannesari, Kurt Keutzer

GPA: A GPU Performance Advisor Based on Instruction Sampling

Developing efficient GPU kernels can be difficult because of the complexity of GPU architectures and programming models. Existing performance tools only provide coarse-grained suggestions at the kernel level, if any. In this paper, we…

Performance · Computer Science 2020-11-25 Keren Zhou, Xiaozhu Meng, Ryuichi Sai, John Mellor-Crummey

FinGraV: Methodology for Fine-Grain GPU Power Visibility and Insights

Ubiquity of AI makes optimizing GPU power a priority as large GPU-based clusters are often employed to train and serve AI models. An important first step in optimizing GPU power consumption is high-fidelity and fine-grain power measurement…

Hardware Architecture · Computer Science 2025-04-01 Varsha Singhania, Shaizeen Aga, Mohamed Assem Ibrahim

Power Constrained Autotuning using Graph Neural Networks

Recent advances in multi- and many-core processors have led to significant improvements in the performance of scientific computing applications. However, the addition of a large number of complex cores has also increased the overall power…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-23 Akash Dutta, Jee Choi, Ali Jannesari

Green MLOps: Closed-Loop, Energy-Aware Inference with NVIDIA Triton, FastAPI, and Bio-Inspired Thresholding

Energy efficiency is a first-order concern in AI deployment, as long-running inference can exceed training in cumulative carbon impact. We propose a bio-inspired framework that maps protein-folding energy basins to inference cost landscapes…

Machine Learning · Computer Science 2026-01-09 Mustapha Hamdi, Mourad Jabou

ECLIP: Energy-efficient and Practical Co-Location of ML Inference on Spatially Partitioned GPUs

As AI inference becomes mainstream, research has begun to focus on improving the energy consumption of inference servers. Inference kernels commonly underutilize a GPU's compute resources and waste power from idling components. To improve…

Systems and Control · Electrical Eng. & Systems 2025-06-17 Ryan Quach, Yidi Wang, Ali Jahanshahi, Daniel Wong, Hyoseung Kim

Autotuning GPU Kernels via Static and Predictive Analysis

Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-06-30 Robert V. Lim, Boyana Norris, Allen D. Malony

Power Consumption Analysis of Parallel Algorithms on GPUs

Due to their highly parallel multi-core architecture, GPUs are being increasingly used in a wide range of computationally intensive applications. Compared to CPUs, GPUs can achieve higher performance at accelerating the programs'…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-05 Frédéric Magoulès, Abal-Kassim Cheik Ahamed, Alban Desmaison, Jean-Christophe Léchenet, François Mayer, Haifa Ben Salem, Thomas Zhu