Related papers: Orchestrate: Infrastructure for Enabling Paralleli…

Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach

As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies-data, model, sequence, and…

Machine Learning · Computer Science 2025-03-13 Ruifeng She , Bowen Pang , Kai Li , Zehua Liu , Tao Zhong

Distributed Training and Optimization Of Neural Networks

Deep learning models are yielding increasingly better performance thanks to multiple factors. To be successful, a model may have a large number of parameters or a complex architecture and be trained on a large dataset. This leads to large…

Machine Learning · Computer Science 2022-12-20 Jean-Roch Vlimant , Junqi Yin

CHOPT: Automated Hyperparameter Optimization Framework for Cloud-Based Machine Learning Platforms

Many hyperparameter optimization (HyperOpt) methods assume restricted computing resources and mainly focus on enhancing performance. Here we propose a novel cloud-based HyperOpt (CHOPT) framework which can efficiently utilize shared…

Machine Learning · Computer Science 2018-10-17 Jinwoong Kim , Minkyu Kim , Heungseok Park , Ernar Kusdavletov , Dongjun Lee , Adrian Kim , Ji-Hoon Kim , Jung-Woo Ha , Nako Sung

Federated Optimization: Distributed Optimization Beyond the Datacenter

We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are distributed (unevenly) over an extremely large number of nodes, but the goal remains to…

Machine Learning · Computer Science 2015-11-12 Jakub Konečný , Brendan McMahan , Daniel Ramage

Using Meta-heuristics and Machine Learning for Software Optimization of Parallel Computing Systems: A Systematic Literature Review

While modern parallel computing systems offer high performance, utilizing these powerful computing resources to the highest possible extent demands advanced knowledge of various hardware architectures and parallel programming models.…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-03 Suejb Memeti , Sabri Pllana , Alecio Binotto , Joanna Kolodziej , Ivona Brandic

Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Lauritz Thamsen , Dominik Scheinert , Jonathan Will , Jonathan Bader , Odej Kao

Orchestra: Unsupervised Federated Learning via Globally Consistent Clustering

Federated learning is generally used in tasks where labels are readily available (e.g., next word prediction). Relaxing this constraint requires design of unsupervised learning techniques that can support desirable properties for federated…

Machine Learning · Computer Science 2022-06-14 Ekdeep Singh Lubana , Chi Ian Tang , Fahim Kawsar , Robert P. Dick , Akhil Mathur

Online Job Scheduling in Distributed Machine Learning Clusters

Nowadays, large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train on a large dataset and derive the prediction/inference model, e.g., a deep neural…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-04 Yixin Bao , Yanghua Peng , Chuan Wu , Zongpeng Li

Exploiting Parallelism Opportunities with Deep Learning Frameworks

State-of-the-art machine learning frameworks support a wide variety of design features to enable a flexible machine learning programming interface and to ease the programmability burden on machine learning developers. Identifying and using…

Machine Learning · Computer Science 2020-07-01 Yu Emma Wang , Carole-Jean Wu , Xiaodong Wang , Kim Hazelwood , David Brooks

Enhancing Multi-Objective Optimization through Machine Learning-Supported Multiphysics Simulation

This paper presents a methodological framework for training, self-optimising, and self-organising surrogate models to approximate and speed up multiobjective optimisation of technical systems based on multiphysics simulations. At the hand…

Machine Learning · Computer Science 2024-04-04 Diego Botache , Jens Decke , Winfried Ripken , Abhinay Dornipati , Franz Götz-Hahn , Mohamed Ayeb , Bernhard Sick

Assisted Learning for Organizations with Limited Imbalanced Data

In the era of big data, many big organizations are integrating machine learning into their work pipelines to facilitate data analysis. However, the performance of their trained models is often restricted by limited and imbalanced data…

Machine Learning · Computer Science 2024-03-05 Cheng Chen , Jiaying Zhou , Jie Ding , Yi Zhou

Bi-objective Optimisation of Data-parallel Applications on Heterogeneous Platforms for Performance and Energy via Workload Distribution

Performance and energy are the two most important objectives for optimisation on modern parallel platforms. Latest research demonstrated the importance of workload distribution as a decision variable in the bi-objective optimisation for…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-10 Hamidreza Khaleghzadeh , Muhammad Fahad , Arsalan Shahid , Ravi Reddy Manumachu , Alexey Lastovetsky

HARP: Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters

With the rapid evolution of GPU architectures, the heterogeneity of model training infrastructures is steadily increasing. In such environments, effectively utilizing all available heterogeneous accelerators becomes critical for distributed…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-05 Antian Liang , Zhigang Zhao , Kai Zhang , Xuri Shi , Chuantao Li , Chunxiao Wang , Zhenying He , Yinan Jing , X. Sean Wang

Optimal Algorithm Allocation for Single Robot Cloud Systems

In order for a robot to perform a task, several algorithms need to be executed, sometimes, simultaneously. Algorithms can be run either on the robot itself or, upon request, be performed on cloud infrastructure. The term cloud…

Robotics · Computer Science 2022-02-09 Saeid Alirezazadeh , Luís A. Alexandre

ShadowSync: Performing Synchronization in the Background for Highly Scalable Distributed Training

Recommendation systems are often trained with a tremendous amount of data, and distributed training is the workhorse to shorten the training time. While the training throughput can be increased by simply adding more workers, it is also…

Machine Learning · Computer Science 2021-02-24 Qinqing Zheng , Bor-Yiing Su , Jiyan Yang , Alisson Azzolini , Qiang Wu , Ou Jin , Shri Karandikar , Hagay Lupesko , Liang Xiong , Eric Zhou

Machine Learning-based Orchestration of Containers: A Taxonomy and Future Directions

Containerization is a lightweight application virtualization technology, providing high environmental consistency, operating system distribution portability, and resource isolation. Existing mainstream cloud service providers have…

Machine Learning · Computer Science 2021-08-23 Zhiheng Zhong , Minxian Xu , Maria Alejandra Rodriguez , Chengzhong Xu , Rajkumar Buyya

Workflow-Driven Modeling for the Compute Continuum: An Optimization Approach to Automated System and Workload Scheduling

The convergence of IoT, Edge, Cloud, and HPC technologies creates a compute continuum that merges cloud scalability and flexibility with HPC's computational power and specialized optimizations. However, integrating cloud and HPC resources…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-20 Aasish Kumar Sharma , Christian Boehme , Patrick Gelß , Ramin Yahyapour , Julian Kunkel

Scaling Studies for Efficient Parameter Search and Parallelism for Large Language Model Pre-training

AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-12 Michael Benington , Leo Phan , Chris Pierre Paul , Evan Shoemaker , Priyanka Ranade , Torstein Collett , Grant Hodgson Perez , Christopher Krieger

Distributed Training Large-Scale Deep Architectures

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-21 Shang-Xuan Zou , Chun-Yen Chen , Jui-Lin Wu , Chun-Nan Chou , Chia-Chin Tsao , Kuan-Chieh Tung , Ting-Wei Lin , Cheng-Lung Sung , Edward Y. Chang

Service Orchestration in the Computing Continuum: Structural Challenges and Vision

The Computing Continuum (CC) integrates different layers of processing infrastructure, from Edge to Cloud, to optimize service quality through ubiquitous and reliable computation. Compared to central architectures, however, heterogeneous…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-18 Boris Sedlak , Víctor Casamayor Pujol , Ildefons Magrans de Abril , Praveen Kumar Donta , Adel N. Toosi , Schahram Dustdar