English
Related papers

Related papers: DeepProf: Performance Analysis for Deep Learning A…

200 papers

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective…

Machine Learning · Computer Science 2019-01-10 Tianqi Chen , Lianmin Zheng , Eddie Yan , Ziheng Jiang , Thierry Moreau , Luis Ceze , Carlos Guestrin , Arvind Krishnamurthy

This paper presents a practical methodology for collecting performance data necessary to conduct hierarchical Roofline analysis on NVIDIA GPUs. It discusses the extension of the Empirical Roofline Toolkit for broader support of a range of…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-26 Charlene Yang , Yunsong Wang , Steven Farrell , Thorsten Kurth , Samuel Williams

With the unprecedented proliferation of machine learning software, there is an ever-increasing need to generate efficient code for such applications. State-of-the-art deep-learning compilers like TVM and Halide incorporate a learning-based…

Machine Learning · Computer Science 2021-08-31 Shikhar Singh , Benoit Steiner , James Hegarty , Hugh Leather

Effective performance profiling and analysis are essential for optimizing training and inference of deep learning models, especially given the growing complexity of heterogeneous computing environments. However, existing tools often lack…

Performance · Computer Science 2024-11-06 Qidong Zhao , Hao Wu , Yuming Hao , Zilingfeng Ye , Jiajia Li , Xu Liu , Keren Zhou

GPUs have been favored for training deep learning models due to their highly parallelized architecture. As a result, most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when…

Deep learning has been shown as a successful machine learning method for a variety of tasks, and its popularity results in numerous open-source deep learning software tools. Training a deep network is usually a very time-consuming process.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-02-20 Shaohuai Shi , Qiang Wang , Pengfei Xu , Xiaowen Chu

Deep Learning(DL) and Machine Learning(ML) applications are rapidly increasing in recent days. Massive amounts of data are being generated over the internet which can derive meaningful results by the use of ML and DL algorithms. Hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-12 Dipesh Gyawali

General Purpose Graphics Processing Unit (GPGPU) computing plays a transformative role in deep learning and machine learning by leveraging the computational advantages of parallel processing. Through the power of Compute Unified Device…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-20 Ming Li , Ziqian Bi , Tianyang Wang , Yizhu Wen , Qian Niu , Xinyuan Song , Zekun Jiang , Junyu Liu , Benji Peng , Sen Zhang , Xuanhe Pan , Jiawei Xu , Jinlang Wang , Keyu Chen , Caitlyn Heqi Yin , Pohsun Feng , Ming Liu

Deep learning applications are usually very compute-intensive and require a long run time for training and inference. This has been tackled by researchers from both hardware and software sides, and in this paper, we propose a Roofline-based…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-24 Yunsong Wang , Charlene Yang , Steven Farrell , Yan Zhang , Thorsten Kurth , Samuel Williams

Deep learning frameworks have been widely deployed on GPU servers for deep learning applications in both academia and industry. In training deep neural networks (DNNs), there are many standard processes or algorithms, such as convolution…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-21 Shaohuai Shi , Qiang Wang , Xiaowen Chu

The use of GPUs has proliferated for machine learning workflows and is now considered mainstream for many deep learning models. Meanwhile, when training state-of-the-art personal recommendation models, which consume the highest number of…

Hardware Architecture · Computer Science 2020-11-12 Bilge Acun , Matthew Murphy , Xiaodong Wang , Jade Nie , Carole-Jean Wu , Kim Hazelwood

Machine Learning applications on HPC systems have been gaining popularity in recent years. The upcoming large scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-05 Steven W. D. Chien , Artur Podobas , Ivy B. Peng , Stefano Markidis

While it is well-known and acknowledged that the performance of graph algorithms is heavily dependent on the input data, there has been surprisingly little research to quantify and predict the impact the graph structure has on performance.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-04 Merijn Verstraaten , Ana Lucia Varbanescu , Cees de Laat

Many artificial intelligence (AI) devices have been developed to accelerate the training and inference of neural networks models. The most common ones are the Graphics Processing Unit (GPU) and Tensor Processing Unit (TPU). They are highly…

Machine Learning · Computer Science 2022-10-25 xiangyang Ju , Yunsong Wang , Daniel Murnane , Nicholas Choma , Steven Farrell , Paolo Calafiura

Large-scale GPU traces play a critical role in identifying performance bottlenecks within heterogeneous High-Performance Computing (HPC) architectures. However, the sheer volume and complexity of a single trace of data make performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-22 Ankur Lahiry , Ayush Pokharel , Banooqa Banday , Seth Ockerman , Amal Gueroudji , Mohammad Zaeed , Tanzima Z. Islam , Line Pouchard

Predicting the next activity of a running process is an important aspect of process management. Recently, artificial neural networks, so called deep-learning approaches, have been proposed to address this challenge. This demo paper…

Machine Learning · Computer Science 2017-05-04 Joerg Evermann , Jana-Rebecca Rehse , Peter Fettke

The exponential growth in data has intensified the demand for computational power to train large-scale deep learning models. However, the rapid growth in model size and complexity raises concerns about equal and fair access to computational…

Performance · Computer Science 2026-04-03 Lisan Al Amin , Md Ismail Hossain , Rupak Kumar Das , Mahbubul Islam , Abdulaziz Tabbakh

Advancing research in the emerging field of deep graph learning requires new tools to support tensor computation over graphs. In this paper, we present the design principles and implementation of Deep Graph Library (DGL). DGL distills the…

Processing data streams arriving at high speed requires the development of models that can provide fast and accurate predictions. Although deep neural networks are the state-of-the-art for many machine learning tasks, their performance in…

Machine Learning · Computer Science 2020-04-07 Pedro Lara-Benítez , Manuel Carranza-García , Francisco Martínez-Álvarez , José C. Riquelme

Precise hardware performance models play a crucial role in code optimizations. They can assist compilers in making heuristic decisions or aid autotuners in identifying the optimal configuration for a given program. For example, the…

‹ Prev 1 2 3 10 Next ›