Related papers: DeepSpark: A Spark-Based Distributed Deep Learning…

A Data and Model-Parallel, Distributed and Scalable Framework for Training of Deep Networks in Apache Spark

Training deep networks is expensive and time-consuming with the training period increasing with data size and growth in model parameters. In this paper, we provide a framework for distributed training of deep networks over a cluster of CPUs…

Machine Learning · Statistics 2017-08-22 Disha Shrivastava , Santanu Chaudhury , Dr. Jayadeva

Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning

Deep learning is a popular machine learning technique and has been applied to many real-world problems. However, training a deep neural network is very time-consuming, especially on big data. It has become difficult for a single machine to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-04 Xing Zhao , Aijun An , Junfeng Liu , Bao Xin Chen

BigDL: A Distributed Deep Learning Framework for Big Data

This paper presents BigDL (a distributed deep learning framework for Apache Spark), which has been used by a variety of users in the industry for building deep learning applications on production big data platforms. It allows deep learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-13 Jason Dai , Yiheng Wang , Xin Qiu , Ding Ding , Yao Zhang , Yanzhang Wang , Xianyan Jia , Cherry Zhang , Yan Wan , Zhichao Li , Jiao Wang , Shengsheng Huang , Zhongyuan Wu , Yang Wang , Yuhao Yang , Bowen She , Dongjie Shi , Qi Lu , Kai Huang , Guoqiong Song

SparkNet: Training Deep Networks in Spark

Training deep networks is a time-consuming process, with networks for object recognition often requiring multiple days to train. For this reason, leveraging the resources of a cluster to speed up training is an important area of work.…

Machine Learning · Statistics 2016-03-01 Philipp Moritz , Robert Nishihara , Ion Stoica , Michael I. Jordan

Distributed Deep Learning using Stochastic Gradient Staleness

Despite the notable success of deep neural networks (DNNs) in solving complex tasks, the training process still remains considerable challenges. A primary obstacle is the substantial time required for training, particularly as high…

Machine Learning · Computer Science 2025-09-09 Viet Hoang Pham , Hyo-Sung Ahn

Modeling Scalability of Distributed Machine Learning

Present day machine learning is computationally intensive and processes large amounts of data. It is implemented in a distributed fashion in order to address these scalability issues. The work is parallelized across a number of computing…

Machine Learning · Computer Science 2017-03-28 Alexander Ulanov , Andrey Simanovsky , Manish Marwah

A Big Data Analysis Framework Using Apache Spark and Deep Learning

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta , Hardeo Thakur , Ritvik Shrivastava , Pulkit Kumar , Sreyashi Nag

EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices

In recent years, advances in deep learning have resulted in unprecedented leaps in diverse tasks spanning from speech and object recognition to context awareness and health monitoring. As a result, an increasing number of AI-enabled…

Machine Learning · Computer Science 2019-05-20 Mario Almeida , Stefanos Laskaridis , Ilias Leontiadis , Stylianos I. Venieris , Nicholas D. Lane

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we…

Machine Learning · Computer Science 2018-09-18 Tal Ben-Nun , Torsten Hoefler

Deep Learning with Apache SystemML

Enterprises operate large data lakes using Hadoop and Spark frameworks that (1) run a plethora of tools to automate powerful data preparation/transformation pipelines, (2) run on shared, large clusters to (3) perform many different…

Machine Learning · Computer Science 2018-02-14 Niketan Pansare , Michael Dusenberry , Nakul Jindal , Matthias Boehm , Berthold Reinwald , Prithviraj Sen

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-30 Shen Li , Yanli Zhao , Rohan Varma , Omkar Salpekar , Pieter Noordhuis , Teng Li , Adam Paszke , Jeff Smith , Brian Vaughan , Pritam Damania , Soumith Chintala

A Linear Algebraic Approach to Model Parallelism in Deep Learning

Training deep neural networks (DNNs) in large-cluster computing environments is increasingly necessary, as networks grow in size and complexity. Local memory and processing limitations require robust data and model parallelism for crossing…

Machine Learning · Computer Science 2020-06-08 Russell J. Hewett , Thomas J. Grady

Mobile Big Data Analytics Using Deep Learning and Apache Spark

The proliferation of mobile devices, such as smartphones and Internet of Things (IoT) gadgets, results in the recent mobile big data (MBD) era. Collecting MBD is unprofitable unless suitable analytics and learning methods are utilized for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-16 Mohammad Abu Alsheikh , Dusit Niyato , Shaowei Lin , Hwee-Pink Tan , Zhu Han

On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper…

Databases · Computer Science 2015-07-10 Olivier Curé , Hubert Naacke , Mohamed-Amine Baazizi , Bernd Amann

MMLSpark: Unifying Machine Learning Ecosystems at Massive Scales

We introduce Microsoft Machine Learning for Apache Spark (MMLSpark), an ecosystem of enhancements that expand the Apache Spark distributed computing library to tackle problems in Deep Learning, Micro-Service Orchestration, Gradient…

Machine Learning · Computer Science 2019-06-24 Mark Hamilton , Sudarshan Raghunathan , Ilya Matiach , Andrew Schonhoffer , Anand Raman , Eli Barzilay , Karthik Rajendran , Dalitso Banda , Casey Jisoo Hong , Manon Knoertzer , Ben Brodsky , Minsoo Thigpen , Janhavi Suresh Mahajan , Courtney Cochrane , Abhiram Eswaran , Ari Green

Splash: User-friendly Programming Interface for Parallelizing Stochastic Algorithms

Stochastic algorithms are efficient approaches to solving machine learning and optimization problems. In this paper, we propose a general framework called Splash for parallelizing stochastic algorithms on multi-node distributed systems.…

Machine Learning · Computer Science 2015-09-24 Yuchen Zhang , Michael I. Jordan

Declarative Data Pipeline for Large Scale ML Services

Modern distributed data processing systems struggle to balance performance, maintainability, and developer productivity when integrating machine learning at scale. These challenges intensify in large collaborative environments due to high…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-07 Yunzhao Yang , Runhui Wang , Xuanqing Liu , Adit Krishnan , Yefan Tao , Yuqian Deng , Kuangyou Yao , Peiyuan Sun , Henrik Johnson , Aditi sinha , Davor Golac , Gerald Friedland , Usman Shakeel , Daryl Cooke , Joe Sullivan , Madhusudhanan Chandrasekaran , Chris Kong

InferSpark: Statistical Inference at Scale

The Apache Spark stack has enabled fast large-scale data processing. Despite a rich library of statistical models and inference algorithms, it does not give domain users the ability to develop their own models. The emergence of…

Databases · Computer Science 2017-10-10 Zhuoyue Zhao , Jialing Pei , Eric Lo , Kenny Q. Zhu , Chris Liu

A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks

The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields.…

Machine Learning · Computer Science 2022-07-04 Daniel Nichols , Siddharth Singh , Shu-Huai Lin , Abhinav Bhatele

Efficient Strong Scaling Through Burst Parallel Training

As emerging deep neural network (DNN) models continue to grow in size, using large GPU clusters to train DNNs is becoming an essential requirement to achieving acceptable training times. In this paper, we consider the case where future…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-05-25 Seo Jin Park , Joshua Fried , Sunghyun Kim , Mohammad Alizadeh , Adam Belay