Related papers: Code generation and runtime techniques for enablin…

PyTorch-Direct: Enabling GPU Centric Data Access for Very Large Graph Neural Network Training with Irregular Accesses

With the increasing adoption of graph neural networks (GNNs) in the machine learning community, GPUs have become an essential tool to accelerate GNN training. However, training GNNs on very large graphs that do not fit in GPU memory is…

Machine Learning · Computer Science 2021-01-21 Seung Won Min, Kun Wu, Sitao Huang, Mert Hidayetoğlu, Jinjun Xiong, Eiman Ebrahimi, Deming Chen, Wen-mei Hwu

Deep Learning Models on CPUs: A Methodology for Efficient Training

GPUs have been favored for training deep learning models due to their highly parallelized architecture. As a result, most studies on training optimization focus on GPUs. There is often a trade-off, however, between cost and efficiency when…

Machine Learning · Computer Science 2023-06-21 Quchen Fu, Ramesh Chukka, Keith Achorn, Thomas Atta-fosu, Deepak R. Canchi, Zhongwei Teng, Jules White, Douglas C. Schmidt

Profiling and Improving the PyTorch Dataloader for high-latency Storage: A Technical Report

A growing number of Machine Learning Frameworks recently made Deep Learning accessible to a wider audience of engineers, scientists, and practitioners, by allowing straightforward use of complex neural network architectures and algorithms.…

Machine Learning · Computer Science 2022-12-08 Ivan Svogor, Christian Eichenberger, Markus Spanring, Moritz Neun, Michael Kopp

Data-efficient LLM Fine-tuning for Code Generation

Large language models (LLMs) have demonstrated significant potential in code generation tasks. However, there remains a performance gap between open-source and closed-source models. To address this gap, existing approaches typically…

Computation and Language · Computer Science 2025-04-18 Weijie Lv, Xuan Xia, Sheng-Jun Huang

Fine-Tuning GPT-5 for GPU Kernel Generation

Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs)…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-12 Ali Tehrani, Yahya Emara, Essam Wissam, Wojciech Paluch, Waleed Atallah, Łukasz Dudziak, Mohamed S. Abdelfattah

Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System

The recent huge advance of Large Language Models (LLMs) is mainly driven by the increase in the number of parameters. This has led to substantial memory capacity requirements, necessitating the use of dozens of GPUs just to meet the…

Hardware Architecture · Computer Science 2024-03-12 Hongsun Jang, Jaeyong Song, Jaewon Jung, Jaeyoung Park, Youngsok Kim, Jinho Lee

Optimizing High-Throughput Distributed Data Pipelines for Reproducible Deep Learning at Scale

Training massive-scale deep learning models on datasets spanning tens of terabytes presents critical challenges in hardware utilization and training reproducibility. In this paper, we identify and resolve profound data-loading bottlenecks…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-24 Kashish Mittal, Di Yu, Roozbeh Ketabi, Arushi Arora, Brendon Lapp, Peng Zhang

Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning

Recent work targeting large language models (LLMs) for code generation demonstrated that increasing the amount of training data through synthetic code generation often leads to exceptional performance. In this paper we explore data pruning…

Software Engineering · Computer Science 2024-07-09 Yun-Da Tsai, Mingjie Liu, Haoxing Ren

Analyzing Machine Learning Workloads Using a Detailed GPU Simulator

Most deep neural networks deployed today are trained using GPUs via high-level frameworks such as TensorFlow and PyTorch. This paper describes changes we made to the GPGPU-Sim simulator to enable it to run PyTorch by running PTX kernels…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-01-29 Jonathan Lew, Deval Shah, Suchita Pati, Shaylin Cattell, Mengchi Zhang, Amruth Sandhupatla, Christopher Ng, Negar Goli, Matthew D. Sinclair, Timothy G. Rogers, Tor Aamodt

BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing

Graph neural networks (GNNs) have extended the success of deep neural networks (DNNs) to non-Euclidean graph data, achieving ground-breaking performance on various tasks such as node classification and graph property prediction.…

Machine Learning · Computer Science 2021-12-17 Tianfeng Liu, Yangrui Chen, Dan Li, Chuan Wu, Yibo Zhu, Jun He, Yanghua Peng, Hongzheng Chen, Hongzhi Chen, Chuanxiong Guo

A Metaprogramming and Autotuning Framework for Deploying Deep Learning Applications

In recent years, deep neural networks (DNNs) have yielded strong results on a wide range of applications. Graphics Processing Units (GPUs) have been one key enabling factor leading to the current popularity of DNNs. However, despite…

Neural and Evolutionary Computing · Computer Science 2016-11-22 Matthew W. Moskewicz, Ali Jannesari, Kurt Keutzer

PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation

High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-07-17 Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul Ivanov, Ahmed Fasih

LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme

Graph Neural Networks (GNNs) are widely used today in recommendation systems, fraud detection, and node/link classification tasks. Real world GNNs continue to scale in size and require a large memory footprint for storing graphs and…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-31 Jeongmin Brian Park, Kun Wu, Vikram Sharma Mailthody, Zaid Quresh, Scott Mahlke, Wen-mei Hwu

A Novel Memory-Efficient Deep Learning Training Framework via Error-Bounded Lossy Compression

Deep neural networks (DNNs) are becoming increasingly deeper, wider, and non-linear due to the growing demands on prediction accuracy and analysis quality. When training a DNN model, the intermediate activation data must be saved in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-24 Sian Jin, Guanpeng Li, Shuaiwen Leon Song, Dingwen Tao

Efficient Training of Convolutional Neural Nets on Large Distributed Systems

Deep Neural Networks (DNNs) have achieved impressive accuracy in many application domains including image classification. Training of DNNs is an extremely compute-intensive process and is solved using variants of the stochastic…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-03 Sameer Kumar, Dheeraj Sreedhar, Vaibhav Saxena, Yogish Sabharwal, Ashish Verma

A Runtime-Based Computational Performance Predictor for Deep Neural Network Training

Deep learning researchers and practitioners usually leverage GPUs to help train their deep neural networks (DNNs) faster. However, choosing which GPU to use is challenging both because (i) there are many options, and (ii) users grapple with…

Machine Learning · Computer Science 2021-06-09 Geoffrey X. Yu, Yubo Gao, Pavel Golikov, Gennady Pekhimenko

Moving Stuff Around: A study on efficiency of moving documents into memory for Neural IR models

When training neural rankers using Large Language Models, it's expected that a practitioner would make use of multiple GPUs to accelerate the training time. By using more devices, deep learning frameworks, like PyTorch, allow the user to…

Information Retrieval · Computer Science 2022-06-24 Arthur Câmara, Claudia Hauff

Operation-Level Performance Benchmarking of Graph Neural Networks for Scientific Applications

As Graph Neural Networks (GNNs) increase in popularity for scientific machine learning, their training and inference efficiency is becoming increasingly critical. Additionally, the deep learning field as a whole is trending towards wider…

Machine Learning · Computer Science 2022-07-21 Ryien Hosseini, Filippo Simini, Venkatram Vishwanath

Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture

Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems. Training GCN requires the minibatch generator traversing graphs and sampling the sparsely located neighboring nodes to obtain their…

Machine Learning · Computer Science 2021-08-17 Seung Won Min, Kun Wu, Sitao Huang, Mert Hidayetoğlu, Jinjun Xiong, Eiman Ebrahimi, Deming Chen, Wen-mei Hwu

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-30 Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, Soumith Chintala