Related papers: Heterogeneous CPU+GPU Stochastic Gradient Descent …

Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers

Motivated by extreme multi-label classification applications, we consider training deep learning models over sparse data in multi-GPU servers. The variance in the number of non-zero features across training batches and the intrinsic GPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-15 Yujing Ma, Florin Rusu, Kesheng Wu, Alexander Sim

Stochastic Gradient Descent on Highly-Parallel Architectures

There is an increased interest in building data analytics frameworks with advanced algebraic capabilities both in industry and academia. Many of these frameworks, e.g., TensorFlow and BIDMach, implement their compute-intensive primitives in…

Databases · Computer Science 2018-02-27 Yujing Ma, Florin Rusu, Martin Torres

Biased Local SGD for Efficient Deep Learning on Heterogeneous Systems

Most parallel neural network training methods assume homogeneous computing resources. For example, synchronous data-parallel SGD suffers from significant synchronization overhead under heterogeneous workloads, often forcing practitioners to…

Machine Learning · Computer Science 2026-02-24 Jihyun Lim, Junhyuk Jo, Chanhyeok Ko, Young Min Go, Jimin Hwa, Sunwoo Lee

A block-random algorithm for learning on distributed, heterogeneous data

Most deep learning models are based on deep neural networks with multiple layers between input and output. The parameters defining these layers are initialized using random values and are "learned" from data, typically using stochastic…

Machine Learning · Computer Science 2019-03-05 Prakash Mohan, Marc T. Henry de Frahan, Ryan King, Ray W. Grout

GOGH: Correlation-Guided Orchestration of GPUs in Heterogeneous Clusters

The growing demand for computational resources in machine learning has made efficient resource allocation a critical challenge, especially in heterogeneous hardware clusters where devices vary in capability, age, and energy efficiency.…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-20 Ahmad Raeisi, Mahdi Dolati, Sina Darabi, Sadegh Talebi, Patrick Eugster, Ahmad Khonsari

A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures

Given their increasing size and complexity, the need for efficient execution of deep neural networks has become increasingly pressing in the design of heterogeneous High-Performance Computing (HPC) and edge platforms, leading to a wide…

Hardware Architecture · Computer Science 2025-05-23 Serena Curzel, Fabrizio Ferrandi, Leandro Fiorin, Daniele Ielmini, Cristina Silvano, Francesco Conti, Luca Bompani, Luca Benini, Enrico Calore, Sebastiano Fabio Schifano, Cristian Zambelli, Maurizio Palesi, Giuseppe Ascia, Enrico Russo, Valeria Cardellini, Salvatore Filippone, Francesco Lo Presti, Stefania Perri

Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems

We propose a generic algorithmic building block to accelerate training of machine learning models on heterogeneous compute systems. Our scheme allows to efficiently employ compute accelerators such as GPUs and FPGAs for the training of…

Machine Learning · Computer Science 2017-11-08 Celestine Dünner, Thomas Parnell, Martin Jaggi

Efficient Matrix Factorization on Heterogeneous CPU-GPU Systems

Matrix Factorization (MF) has been widely applied in machine learning and data mining. A large number of algorithms have been studied to factorize matrices. Among them, stochastic gradient descent (SGD) is a commonly used method.…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-30 Yuanhang Yu, Dong Wen, Ying Zhang, Xiaoyang Wang, Wenjie Zhang, Xuemin Lin

Scaling Deep Learning on GPU and Knights Landing clusters

The speed of deep neural networks training has become a big bottleneck of deep learning research and development. For example, training GoogleNet by ImageNet dataset on one Nvidia K20 GPU needs 21 days. To speed up the training process, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-11 Yang You, Aydin Buluc, James Demmel

Large-Scale Stochastic Learning using GPUs

In this work we propose an accelerated stochastic learning system for very large-scale applications. Acceleration is achieved by mapping the training algorithm onto massively parallel processors: we demonstrate a parallel, asynchronous GPU…

Machine Learning · Computer Science 2017-02-24 Thomas Parnell, Celestine Dünner, Kubilay Atasu, Manolis Sifalakis, Haris Pozidis

Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning

CPU-GPU heterogeneous architectures are now commonly used in a wide variety of computing systems from mobile devices to supercomputers. Maximizing the throughput for multi-programmed workloads on such systems is indispensable as one single…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-08 Issa Saba, Eishi Arima, Dai Liu, Martin Schulz

Using HPC infrastructures for deep learning applications in fusion research

In the fusion community, the use of high performance computing (HPC) has been mostly dominated by heavy-duty plasma simulations, such as those based on particle-in-cell and gyrokinetic codes. However, there has been a growing interest in…

Computational Physics · Physics 2021-06-14 Diogo R. Ferreira

Dual-Delayed Asynchronous SGD for Arbitrarily Heterogeneous Data

We consider the distributed learning problem with data dispersed across multiple workers under the orchestration of a central server. Asynchronous Stochastic Gradient Descent (SGD) has been widely explored in such a setting to reduce the…

Machine Learning · Computer Science 2024-05-28 Xiaolu Wang, Yuchang Sun, Hoi-To Wai, Jun Zhang

Heterogeneous computing in a strongly-connected CPU-GPU environment: fast multiple time-evolution equation-based modeling accelerated using data-driven approach

We propose a CPU-GPU heterogeneous computing method for solving time-evolution partial differential equation problems many times with guaranteed accuracy, in short time-to-solution and low energy-to-solution. On a single-GH200 node, the…

Computational Engineering, Finance, and Science · Computer Science 2024-10-01 Tsuyoshi Ichimura, Kohei Fujita, Muneo Hori, Lalith Maddegedara, Jack Wells, Alan Gray, Ian Karlin, John Linford

Heterogeneous Acceleration Pipeline for Recommendation System Training

Recommendation models rely on deep learning networks and large embedding tables, resulting in computationally and memory-intensive processes. These models are typically trained using hybrid CPU-GPU or GPU-only configurations. The hybrid…

Hardware Architecture · Computer Science 2024-04-30 Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, Prashant J. Nair

ABS-SGD: A Delayed Synchronous Stochastic Gradient Descent Algorithm with Adaptive Batch Size for Heterogeneous GPU Clusters

As the size of models and datasets grows, it has become increasingly common to train models in parallel. However, existing distributed stochastic gradient descent (SGD) algorithms suffer from insufficient utilization of computational…

Machine Learning · Computer Science 2023-08-30 Xin Zhou, Ling Chen, Houming Wu

GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms

Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning model for representation learning on graphs. It is challenging to accelerate training of GCNs, due to (1) substantial and irregular data communication to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-09 Hanqing Zeng, Viktor Prasanna

Asynchronous Decentralized Parallel Stochastic Gradient Descent

Most commonly used distributed machine learning systems are either synchronous or centralized asynchronous. Synchronous algorithms like AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous algorithms using a…

Optimization and Control · Mathematics 2018-09-26 Xiangru Lian, Wei Zhang, Ce Zhang, Ji Liu

Puzzle: Scheduling Multiple Deep Learning Models on Mobile Device with Heterogeneous Processors

As deep learning models are increasingly deployed on mobile devices, modern mobile devices incorporate deep learning-specific accelerators to handle the growing computational demands, thus increasing their hardware heterogeneity. However,…

Machine Learning · Computer Science 2025-08-26 Duseok Kang, Yunseong Lee, Junghoon Kim

HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis

Single-Program-Multiple-Data (SPMD) parallelism has recently been adopted to train large deep neural networks (DNNs). Few studies have explored its applicability on heterogeneous clusters, to fully exploit available resources for large…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-12 Shiwei Zhang, Lansong Diao, Chuan Wu, Zongyan Cao, Siyu Wang, Wei Lin