Related papers: Orchestrate: Infrastructure for Enabling Paralleli…

200 papers

As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies-data, model, sequence, and…

Machine Learning · Computer Science 2025-03-13 Ruifeng She , Bowen Pang , Kai Li , Zehua Liu , Tao Zhong

Deep learning models are yielding increasingly better performance thanks to multiple factors. To be successful, a model may have a large number of parameters or a complex architecture and be trained on a large dataset. This leads to large…

Machine Learning · Computer Science 2022-12-20 Jean-Roch Vlimant , Junqi Yin

Many hyperparameter optimization (HyperOpt) methods assume restricted computing resources and mainly focus on enhancing performance. Here we propose a novel cloud-based HyperOpt (CHOPT) framework which can efficiently utilize shared…

Machine Learning · Computer Science 2018-10-17 Jinwoong Kim , Minkyu Kim , Heungseok Park , Ernar Kusdavletov , Dongjun Lee , Adrian Kim , Ji-Hoon Kim , Jung-Woo Ha , Nako Sung

We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are distributed (unevenly) over an extremely large number of nodes, but the goal remains to…

Machine Learning · Computer Science 2015-11-12 Jakub Konečný , Brendan McMahan , Daniel Ramage

While modern parallel computing systems offer high performance, utilizing these powerful computing resources to the highest possible extent demands advanced knowledge of various hardware architectures and parallel programming models.…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-03 Suejb Memeti , Sabri Pllana , Alecio Binotto , Joanna Kolodziej , Ivona Brandic

Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Lauritz Thamsen , Dominik Scheinert , Jonathan Will , Jonathan Bader , Odej Kao

Federated learning is generally used in tasks where labels are readily available (e.g., next word prediction). Relaxing this constraint requires design of unsupervised learning techniques that can support desirable properties for federated…

Machine Learning · Computer Science 2022-06-14 Ekdeep Singh Lubana , Chi Ian Tang , Fahim Kawsar , Robert P. Dick , Akhil Mathur

Nowadays large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train a large dataset and derive the prediction/inference model, e.g., a deep neural…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-04 Yixin Bao , Yanghua Peng , Chuan Wu , Zongpeng Li

State-of-the-art machine learning frameworks support a wide variety of design features to enable a flexible machine learning programming interface and to ease the programmability burden on machine learning developers. Identifying and using…

Machine Learning · Computer Science 2020-07-01 Yu Emma Wang , Carole-Jean Wu , Xiaodong Wang , Kim Hazelwood , David Brooks

This paper presents a methodological framework for training, self-optimising, and self-organising surrogate models to approximate and speed up multiobjective optimisation of technical systems based on multiphysics simulations. At the hand…

Machine Learning · Computer Science 2024-04-04 Diego Botache , Jens Decke , Winfried Ripken , Abhinay Dornipati , Franz Götz-Hahn , Mohamed Ayeb , Bernhard Sick

In the era of big data, many big organizations are integrating machine learning into their work pipelines to facilitate data analysis. However, the performance of their trained models is often restricted by limited and imbalanced data…

Machine Learning · Computer Science 2024-03-05 Cheng Chen , Jiaying Zhou , Jie Ding , Yi Zhou

Performance and energy are the two most important objectives for optimisation on modern parallel platforms. Latest research demonstrated the importance of workload distribution as a decision variable in the bi-objective optimisation for…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-10 Hamidreza Khaleghzadeh , Muhammad Fahad , Arsalan Shahid , Ravi Reddy Manumachu , Alexey Lastovetsky

With the rapid evolution of GPU architectures, the heterogeneity of model training infrastructures is steadily increasing. In such environments, effectively utilizing all available heterogeneous accelerators becomes critical for distributed…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-05 Antian Liang , Zhigang Zhao , Kai Zhang , Xuri Shi , Chuantao Li , Chunxiao Wang , Zhenying He , Yinan Jing , X. Sean Wang

In order for a robot to perform a task, several algorithms need to be executed, sometimes, simultaneously. Algorithms can be run either on the robot itself or, upon request, be performed on cloud infrastructure. The term cloud…

Robotics · Computer Science 2022-02-09 Saeid Alirezazadeh , Luís A. Alexandre

Recommendation systems are often trained with a tremendous amount of data, and distributed training is the workhorse to shorten the training time. While the training throughput can be increased by simply adding more workers, it is also…

Machine Learning · Computer Science 2021-02-24 Qinqing Zheng , Bor-Yiing Su , Jiyan Yang , Alisson Azzolini , Qiang Wu , Ou Jin , Shri Karandikar , Hagay Lupesko , Liang Xiong , Eric Zhou

Containerization is a lightweight application virtualization technology, providing high environmental consistency, operating system distribution portability, and resource isolation. Existing mainstream cloud service providers have…

Machine Learning · Computer Science 2021-08-23 Zhiheng Zhong , Minxian Xu , Maria Alejandra Rodriguez , Chengzhong Xu , Rajkumar Buyya

The convergence of IoT, Edge, Cloud, and HPC technologies creates a compute continuum that merges cloud scalability and flexibility with HPC's computational power and specialized optimizations. However, integrating cloud and HPC resources…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-20 Aasish Kumar Sharma , Christian Boehme , Patrick Gelß , Ramin Yahyapour , Julian Kunkel

AI accelerator processing capabilities and memory constraints largely dictate the scale at which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-12 Michael Benington , Leo Phan , Chris Pierre Paul , Evan Shoemaker , Priyanka Ranade , Torstein Collett , Grant Hodgson Perez , Christopher Krieger

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-21 Shang-Xuan Zou , Chun-Yen Chen , Jui-Lin Wu , Chun-Nan Chou , Chia-Chin Tsao , Kuan-Chieh Tung , Ting-Wei Lin , Cheng-Lung Sung , Edward Y. Chang

The Computing Continuum (CC) integrates different layers of processing infrastructure, from Edge to Cloud, to optimize service quality through ubiquitous and reliable computation. Compared to central architectures, however, heterogeneous…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-18 Boris Sedlak , Víctor Casamayor Pujol , Ildefons Magrans de Abril , Praveen Kumar Donta , Adel N. Toosi , Schahram Dustdar