Related papers: Analyzing Machine Learning Workloads Using a Detai…

Deep Learning Models on CPUs: A Methodology for Efficient Training

GPUs have been favored for training deep learning models due to their highly parallelized architecture. As a result, most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when…

Machine Learning · Computer Science 2023-06-21 Quchen Fu, Ramesh Chukka, Keith Achorn, Thomas Atta-fosu, Deepak R. Canchi, Zhongwei Teng, Jules White, Douglas C. Schmidt

Modeling Deep Learning Accelerator Enabled GPUs

The efficacy of deep learning has resulted in its use in a growing number of applications. The Volta graphics processor unit (GPU) architecture from NVIDIA introduced a specialized functional unit, the "tensor core", that helps meet the…

Mathematical Software · Computer Science 2019-02-22 Md Aamir Raihan, Negar Goli, Tor Aamodt

A Runtime-Based Computational Performance Predictor for Deep Neural Network Training

Deep learning researchers and practitioners usually leverage GPUs to help train their deep neural networks (DNNs) faster. However, choosing which GPU to use is challenging both because (i) there are many options, and (ii) users grapple with…

Machine Learning · Computer Science 2021-06-09 Geoffrey X. Yu, Yubo Gao, Pavel Golikov, Gennady Pekhimenko

Comparative Analysis of CPU and GPU Profiling for Deep Learning Models

Deep Learning (DL) and Machine Learning (ML) applications have been increasing rapidly in recent days. Massive amounts of data are being generated over the internet, from which ML and DL algorithms can derive meaningful results. Hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-12 Dipesh Gyawali

Brief Announcement: On the Limits of Parallelizing Convolutional Neural Networks on GPUs

GPUs are currently the platform of choice for training neural networks. However, training a deep neural network (DNN) is a time-consuming process even on GPUs because of the massive number of parameters that have to be learned. As a result,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-05-29 Behnam Pourghassemi, Chenghao Zhang, Joo Hwan Lee, Aparna Chandramowlishwaran

μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching

NVIDIA cuDNN is a low-level library that provides GPU kernels frequently used in deep learning. Specifically, cuDNN implements several equivalent convolution algorithms, whose performance and memory footprint may vary considerably,…

Machine Learning · Computer Science 2018-04-16 Yosuke Oyama, Tal Ben-Nun, Torsten Hoefler, Satoshi Matsuoka

Code generation and runtime techniques for enabling data-efficient deep learning training on GPUs

As deep learning models scale, their training cost has surged significantly. Due to both hardware advancements and limitations in current software stacks, the need for data efficiency has risen. Data efficiency refers to the effective…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-09 Kun Wu

Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training

Deep learning has become widely used in complex AI applications. Yet, training a deep neural network (DNN) model requires a considerable amount of computation, long running time, and much energy. Nowadays, many-core AI accelerators (e.g.,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-12 Yuxin Wang, Qiang Wang, Shaohuai Shi, Xin He, Zhenheng Tang, Kaiyong Zhao, Xiaowen Chu

Benchmarking State-of-the-Art Deep Learning Software Tools

Deep learning has been shown to be a successful machine learning method for a variety of tasks, and its popularity has resulted in numerous open-source deep learning software tools. Training a deep network is usually a very time-consuming process.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-02-20 Shaohuai Shi, Qiang Wang, Pengfei Xu, Xiaowen Chu

High performance and energy efficient inference for deep learning on ARM processors

We evolve PyDTNN, a framework for distributed parallel training of Deep Neural Networks (DNNs), into an efficient inference tool for convolutional neural networks. Our optimization process on multicore ARM processors involves several…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-20 Adrián Castelló, Sergio Barrachina, Manuel F. Dolz, Enrique S. Quintana-Ortí, Pau San Juan

PyTorch-Direct: Enabling GPU Centric Data Access for Very Large Graph Neural Network Training with Irregular Accesses

With the increasing adoption of graph neural networks (GNNs) in the machine learning community, GPUs have become an essential tool to accelerate GNN training. However, training GNNs on very large graphs that do not fit in GPU memory is…

Machine Learning · Computer Science 2021-01-21 Seung Won Min, Kun Wu, Sitao Huang, Mert Hidayetoğlu, Jinjun Xiong, Eiman Ebrahimi, Deming Chen, Wen-mei Hwu

Learning to Optimize Tensor Programs

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective…

Machine Learning · Computer Science 2019-01-10 Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Operation-Level Performance Benchmarking of Graph Neural Networks for Scientific Applications

As Graph Neural Networks (GNNs) increase in popularity for scientific machine learning, their training and inference efficiency is becoming increasingly critical. Additionally, the deep learning field as a whole is trending towards wider…

Machine Learning · Computer Science 2022-07-21 Ryien Hosseini, Filippo Simini, Venkatram Vishwanath

A Metaprogramming and Autotuning Framework for Deploying Deep Learning Applications

In recent years, deep neural networks (DNNs) have yielded strong results on a wide range of applications. Graphics Processing Units (GPUs) have been one key enabling factor leading to the current popularity of DNNs. However, despite…

Neural and Evolutionary Computing · Computer Science 2016-11-22 Matthew W. Moskewicz, Ali Jannesari, Kurt Keutzer

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Deep learning frameworks have been widely deployed on GPU servers for deep learning applications in both academia and industry. In training deep neural networks (DNNs), there are many standard processes or algorithms, such as convolution…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-21 Shaohuai Shi, Qiang Wang, Xiaowen Chu

Benchmarking TPU, GPU, and CPU Platforms for Deep Learning

Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware specialization to improve performance. To systematically benchmark deep learning platforms, we introduce ParaDnn, a parameterized…

Machine Learning · Computer Science 2019-10-23 Yu Emma Wang, Gu-Yeon Wei, David Brooks

Study of the Efficiency of GPU Scalability for Artificial Intelligence Training (Estudio de la eficiencia en la escalabilidad de GPUs para el entrenamiento de Inteligencia Artificial)

Training large-scale deep learning models has become a key challenge for the scientific community and industry. While the massive use of GPUs can significantly speed up training times, this approach has a negative impact on efficiency. In…

Machine Learning · Computer Science 2025-09-04 David Cortes, Carlos Juiz, Belen Bermejo

ænet-PyTorch: a GPU-supported implementation for machine learning atomic potentials training

In this work, we present ænet-PyTorch, a PyTorch-based implementation for training artificial neural network-based machine learning interatomic potentials. Developed as an extension of the atomic energy network (ænet),…

Disordered Systems and Neural Networks · Physics 2023-05-10 Jon Lopez-Zorrilla, Xabier M. Aretxabaleta, Inwon Yue, Inigo Etxebarria, Hegoi Manzano, Nongnuch Artrith

DeepProf: Performance Analysis for Deep Learning Applications via Mining GPU Execution Patterns

Deep learning applications are computation-intensive and often employ GPUs as the underlying computing devices. Deep learning frameworks provide powerful programming interfaces, but the gap between source code and practical GPU operations…

Software Engineering · Computer Science 2017-07-13 Jiazhen Gu, Huan Liu, Yangfan Zhou, Xin Wang

Democratizing AI: A Comparative Study in Deep Learning Efficiency and Future Trends in Computational Processing

The exponential growth in data has intensified the demand for computational power to train large-scale deep learning models. However, the rapid growth in model size and complexity raises concerns about equal and fair access to computational…

Performance · Computer Science 2026-04-03 Lisan Al Amin, Md Ismail Hossain, Rupak Kumar Das, Mahbubul Islam, Abdulaziz Tabbakh