Related papers: DeepProf: Performance Analysis for Deep Learning A…

Learning to Optimize Tensor Programs

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective…

Machine Learning · Computer Science 2019-01-10 Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Hierarchical Roofline Performance Analysis for Deep Learning Applications

This paper presents a practical methodology for collecting performance data necessary to conduct hierarchical Roofline analysis on NVIDIA GPUs. It discusses the extension of the Empirical Roofline Toolkit for broader support of a range of…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-26 Charlene Yang, Yunsong Wang, Steven Farrell, Thorsten Kurth, Samuel Williams

Using Graph Neural Networks to model the performance of Deep Neural Networks

With the unprecedented proliferation of machine learning software, there is an ever-increasing need to generate efficient code for such applications. State-of-the-art deep-learning compilers like TVM and Halide incorporate a learning-based…

Machine Learning · Computer Science 2021-08-31 Shikhar Singh, Benoit Steiner, James Hegarty, Hugh Leather

DeepContext: A Context-aware, Cross-platform, and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads

Effective performance profiling and analysis are essential for optimizing training and inference of deep learning models, especially given the growing complexity of heterogeneous computing environments. However, existing tools often lack…

Performance · Computer Science 2024-11-06 Qidong Zhao, Hao Wu, Yuming Hao, Zilingfeng Ye, Jiajia Li, Xu Liu, Keren Zhou

Deep Learning Models on CPUs: A Methodology for Efficient Training

GPUs have been favored for training deep learning models due to their highly parallelized architecture. As a result, most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when…

Machine Learning · Computer Science 2023-06-21 Quchen Fu, Ramesh Chukka, Keith Achorn, Thomas Atta-fosu, Deepak R. Canchi, Zhongwei Teng, Jules White, Douglas C. Schmidt

Benchmarking State-of-the-Art Deep Learning Software Tools

Deep learning has been shown as a successful machine learning method for a variety of tasks, and its popularity results in numerous open-source deep learning software tools. Training a deep network is usually a very time-consuming process.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-02-20 Shaohuai Shi, Qiang Wang, Pengfei Xu, Xiaowen Chu

Comparative Analysis of CPU and GPU Profiling for Deep Learning Models

Deep Learning (DL) and Machine Learning (ML) applications are rapidly increasing in recent days. Massive amounts of data are being generated over the internet which can derive meaningful results by the use of ML and DL algorithms. Hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-12 Dipesh Gyawali

Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing

General Purpose Graphics Processing Unit (GPGPU) computing plays a transformative role in deep learning and machine learning by leveraging the computational advantages of parallel processing. Through the power of Compute Unified Device…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-20 Ming Li, Ziqian Bi, Tianyang Wang, Yizhu Wen, Qian Niu, Xinyuan Song, Zekun Jiang, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Ming Liu

Time-Based Roofline for Deep Learning Performance Analysis

Deep learning applications are usually very compute-intensive and require a long run time for training and inference. This has been tackled by researchers from both hardware and software sides, and in this paper, we propose a Roofline-based…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-24 Yunsong Wang, Charlene Yang, Steven Farrell, Yan Zhang, Thorsten Kurth, Samuel Williams

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Deep learning frameworks have been widely deployed on GPU servers for deep learning applications in both academia and industry. In training deep neural networks (DNNs), there are many standard processes or algorithms, such as convolution…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-21 Shaohuai Shi, Qiang Wang, Xiaowen Chu

Understanding Training Efficiency of Deep Learning Recommendation Models at Scale

The use of GPUs has proliferated for machine learning workflows and is now considered mainstream for many deep learning models. Meanwhile, when training state-of-the-art personal recommendation models, which consume the highest number of…

Hardware Architecture · Computer Science 2020-11-12 Bilge Acun, Matthew Murphy, Xiaodong Wang, Jade Nie, Carole-Jean Wu, Kim Hazelwood

tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads

Machine Learning applications on HPC systems have been gaining popularity in recent years. The upcoming large scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-05 Steven W. D. Chien, Artur Podobas, Ivy B. Peng, Stefano Markidis

Using Graph Properties to Speed-up GPU-based Graph Traversal: A Model-driven Approach

While it is well-known and acknowledged that the performance of graph algorithms is heavily dependent on the input data, there has been surprisingly little research to quantify and predict the impact the graph structure has on performance.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-04 Merijn Verstraaten, Ana Lucia Varbanescu, Cees de Laat

Benchmarking GPU and TPU Performance with Graph Neural Networks

Many artificial intelligence (AI) devices have been developed to accelerate the training and inference of neural networks models. The most common ones are the Graphics Processing Unit (GPU) and Tensor Processing Unit (TPU). They are highly…

Machine Learning · Computer Science 2022-10-25 Xiangyang Ju, Yunsong Wang, Daniel Murnane, Nicholas Choma, Steven Farrell, Paolo Calafiura

A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces

Large-scale GPU traces play a critical role in identifying performance bottlenecks within heterogeneous High-Performance Computing (HPC) architectures. However, the sheer volume and complexity of a single trace of data make performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-22 Ankur Lahiry, Ayush Pokharel, Banooqa Banday, Seth Ockerman, Amal Gueroudji, Mohammad Zaeed, Tanzima Z. Islam, Line Pouchard

XES Tensorflow - Process Prediction using the Tensorflow Deep-Learning Framework

Predicting the next activity of a running process is an important aspect of process management. Recently, artificial neural networks, so called deep-learning approaches, have been proposed to address this challenge. This demo paper…

Machine Learning · Computer Science 2017-05-04 Joerg Evermann, Jana-Rebecca Rehse, Peter Fettke

Democratizing AI: A Comparative Study in Deep Learning Efficiency and Future Trends in Computational Processing

The exponential growth in data has intensified the demand for computational power to train large-scale deep learning models. However, the rapid growth in model size and complexity raises concerns about equal and fair access to computational…

Performance · Computer Science 2026-04-03 Lisan Al Amin, Md Ismail Hossain, Rupak Kumar Das, Mahbubul Islam, Abdulaziz Tabbakh

Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks

Advancing research in the emerging field of deep graph learning requires new tools to support tensor computation over graphs. In this paper, we present the design principles and implementation of Deep Graph Library (DGL). DGL distills the…

Machine Learning · Computer Science 2020-08-26 Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, Zheng Zhang

On the performance of deep learning models for time series classification in streaming

Processing data streams arriving at high speed requires the development of models that can provide fast and accurate predictions. Although deep neural networks are the state-of-the-art for many machine learning tasks, their performance in…

Machine Learning · Computer Science 2020-04-07 Pedro Lara-Benítez, Manuel Carranza-García, Francisco Martínez-Álvarez, José C. Riquelme

TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs

Precise hardware performance models play a crucial role in code optimizations. They can assist compilers in making heuristic decisions or aid autotuners in identifying the optimal configuration for a given program. For example, the…

Machine Learning · Computer Science 2023-12-07 Phitchaya Mangpo Phothilimthana, Sami Abu-El-Haija, Kaidi Cao, Bahare Fatemi, Mike Burrows, Charith Mendis, Bryan Perozzi