Related papers: A Metaprogramming and Autotuning Framework for Dep…

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Deep learning frameworks have been widely deployed on GPU servers for deep learning applications in both academia and industry. In training deep neural networks (DNNs), there are many standard processes or algorithms, such as convolution…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-08-21 Shaohuai Shi , Qiang Wang , Xiaowen Chu

A Framework to Enable Algorithmic Design Choice Exploration in DNNs

Deep learning technologies, particularly deep neural networks (DNNs), have demonstrated significant success across many domains. This success has been accompanied by substantial advancements and innovations in the algorithms behind the…

Machine Learning · Computer Science 2025-04-14 Timothy L. Cronin , Sanmukh Kuppannagari

Spatial Sharing of GPU for Autotuning DNN models

GPUs are used for training, inference, and tuning the machine learning models. However, Deep Neural Network (DNN) vary widely in their ability to exploit the full power of high-performance GPUs. Spatial sharing of GPU enables multiplexing…

Neural and Evolutionary Computing · Computer Science 2020-08-11 Aditya Dhakal , Junguk Cho , Sameer G. Kulkarni , K. K. Ramakrishnan , Puneet Sharma

Latency optimized Deep Neural Networks (DNNs): An Artificial Intelligence approach at the Edge using Multiprocessor System on Chip (MPSoC)

Almost in every heavily computation-dependent application, from 6G communication systems to autonomous driving platforms, a large portion of computing should be near to the client side. Edge computing (AI at Edge) in mobile devices is one…

Hardware Architecture · Computer Science 2024-07-29 Seyed Nima Omidsajedi , Rekha Reddy , Jianming Yi , Jan Herbst , Christoph Lipps , Hans Dieter Schotten

{\mu}-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching

NVIDIA cuDNN is a low-level library that provides GPU kernels frequently used in deep learning. Specifically, cuDNN implements several equivalent convolution algorithms, whose performance and memory footprint may vary considerably,…

Machine Learning · Computer Science 2018-04-16 Yosuke Oyama , Tal Ben-Nun , Torsten Hoefler , Satoshi Matsuoka

Deep Learning Models on CPUs: A Methodology for Efficient Training

GPUs have been favored for training deep learning models due to their highly parallelized architecture. As a result, most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when…

Machine Learning · Computer Science 2023-06-21 Quchen Fu , Ramesh Chukka , Keith Achorn , Thomas Atta-fosu , Deepak R. Canchi , Zhongwei Teng , Jules White , Douglas C. Schmidt

Brief Announcement: On the Limits of Parallelizing Convolutional Neural Networks on GPUs

GPUs are currently the platform of choice for training neural networks. However, training a deep neural network (DNN) is a time-consuming process even on GPUs because of the massive number of parameters that have to be learned. As a result,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-05-29 Behnam Pourghassemi , Chenghao Zhang , Joo Hwan Lee , Aparna Chandramowlishwaran

MetaML-Pro: Cross-Stage Design Flow Automation for Efficient Deep Learning Acceleration

This paper presents a unified framework for codifying and automating optimization strategies to efficiently deploy deep neural networks (DNNs) on resource-constrained hardware, such as FPGAs, while maintaining high performance, accuracy,…

Hardware Architecture · Computer Science 2026-02-11 Zhiqiang Que , Jose G. F. Coutinho , Ce Guo , Hongxiang Fan , Wayne Luk

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

The training process of Deep Neural Network (DNN) is compute-intensive, often taking days to weeks to train a DNN model. Therefore, parallel execution of DNN training on GPUs is a widely adopted approach to speed up the process nowadays.…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-29 Chi-Chung Chen , Chia-Lin Yang , Hsiang-Yun Cheng

SlimEdge: Performance and Device Aware Distributed DNN Deployment on Resource-Constrained Edge Hardware

Distributed deep neural networks (DNNs) have become central to modern computer vision, yet their deployment on resource-constrained edge devices remains hindered by substantial parameter counts, computational demands, and the probability of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-17 Mahadev Sunil Kumar , Arnab Raha , Debayan Das , Gopakumar G , Rounak Chatterjee , Amitava Mukherjee

Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training

Deep learning has become widely used in complex AI applications. Yet, training a deep neural network (DNNs) model requires a considerable amount of calculations, long running time, and much energy. Nowadays, many-core AI accelerators (e.g.,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-12 Yuxin Wang , Qiang Wang , Shaohuai Shi , Xin He , Zhenheng Tang , Kaiyong Zhao , Xiaowen Chu

Learning to Optimize Tensor Programs

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective…

Machine Learning · Computer Science 2019-01-10 Tianqi Chen , Lianmin Zheng , Eddie Yan , Ziheng Jiang , Thierry Moreau , Luis Ceze , Carlos Guestrin , Arvind Krishnamurthy

Learning on Hardware: A Tutorial on Neural Network Accelerators and Co-Processors

Deep neural networks (DNNs) have the advantage that they can take into account a large number of parameters, which enables them to solve complex tasks. In computer vision and speech recognition, they have a better accuracy than common…

Machine Learning · Computer Science 2021-04-20 Lukas Baischer , Matthias Wess , Nima TaheriNejad

Bring Your Own Codegen to Deep Learning Compiler

Deep neural networks (DNNs) have been ubiquitously applied in many applications, and accelerators are emerged as an enabler to support the fast and efficient inference tasks of these applications. However, to achieve high model coverage…

Machine Learning · Computer Science 2021-05-10 Zhi Chen , Cody Hao Yu , Trevor Morris , Jorn Tuyls , Yi-Hsiang Lai , Jared Roesch , Elliott Delaye , Vin Sharma , Yida Wang

Efficient Processing of Deep Neural Networks: A Tutorial and Survey

Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at…

Computer Vision and Pattern Recognition · Computer Science 2017-08-15 Vivienne Sze , Yu-Hsin Chen , Tien-Ju Yang , Joel Emer

Enabling Deep Learning on Edge Devices

Deep neural networks (DNNs) have succeeded in many different perception tasks, e.g., computer vision, natural language processing, reinforcement learning, etc. The high-performed DNNs heavily rely on intensive resource consumption. For…

Machine Learning · Computer Science 2022-10-10 Zhongnan Qu

Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications

Deep neural networks (DNNs) have achieved unprecedented success in the field of artificial intelligence (AI), including computer vision, natural language processing and speech recognition. However, their superior performance comes at the…

Machine Learning · Computer Science 2022-04-26 Han Cai , Ji Lin , Yujun Lin , Zhijian Liu , Haotian Tang , Hanrui Wang , Ligeng Zhu , Song Han

Towards making the most of NLP-based device mapping optimization for OpenCL kernels

Nowadays, we are living in an era of extreme device heterogeneity. Despite the high variety of conventional CPU architectures, accelerator devices, such as GPUs and FPGAs, also appear in the foreground exploding the pool of available…

Machine Learning · Computer Science 2022-08-31 Petros Vavaroutsos , Ioannis Oroutzoglou , Dimosthenis Masouros , Dimitrios Soudris

GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs

As the emerging trend of graph-based deep learning, Graph Neural Networks (GNNs) excel for their capability to generate high-quality node feature vectors (embeddings). However, the existing one-size-fits-all GNN implementations are…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-22 Yuke Wang , Boyuan Feng , Gushu Li , Shuangchen Li , Lei Deng , Yuan Xie , Yufei Ding

DF-GNN: Dynamic Fusion Framework for Attention Graph Neural Networks on GPUs

Attention Graph Neural Networks (AT-GNNs), such as GAT and Graph Transformer, have demonstrated superior performance compared to other GNNs. However, existing GNN systems struggle to efficiently train AT-GNNs on GPUs due to their intricate…

Machine Learning · Computer Science 2024-11-26 Jiahui Liu , Zhenkun Cai , Zhiyong Chen , Minjie Wang