Related papers: A Big Data Analysis Framework Using Apache Spark a…

Deep Learning with Apache SystemML

Enterprises operate large data lakes using Hadoop and Spark frameworks that (1) run a plethora of tools to automate powerful data preparation/transformation pipelines, (2) run on shared, large clusters to (3) perform many different…

Machine Learning · Computer Science 2018-02-14 Niketan Pansare, Michael Dusenberry, Nakul Jindal, Matthias Boehm, Berthold Reinwald, Prithviraj Sen

MLlib: Machine Learning in Apache Spark

Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib…

Machine Learning · Computer Science 2015-05-27 Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar

Mobile Big Data Analytics Using Deep Learning and Apache Spark

The proliferation of mobile devices, such as smartphones and Internet of Things (IoT) gadgets, results in the recent mobile big data (MBD) era. Collecting MBD is unprofitable unless suitable analytics and learning methods are utilized for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-16 Mohammad Abu Alsheikh, Dusit Niyato, Shaowei Lin, Hwee-Pink Tan, Zhu Han

BigDL: A Distributed Deep Learning Framework for Big Data

This paper presents BigDL (a distributed deep learning framework for Apache Spark), which has been used by a variety of users in the industry for building deep learning applications on production big data platforms. It allows deep learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-13 Jason Dai, Yiheng Wang, Xin Qiu, Ding Ding, Yao Zhang, Yanzhang Wang, Xianyan Jia, Cherry Zhang, Yan Wan, Zhichao Li, Jiao Wang, Shengsheng Huang, Zhongyuan Wu, Yang Wang, Yuhao Yang, Bowen She, Dongjie Shi, Qi Lu, Kai Huang, Guoqiong Song

Large-Scale Intelligent Microservices

Deploying Machine Learning (ML) algorithms within databases is a challenge due to the varied computational footprints of modern ML algorithms and the myriad of database technologies each with its own restrictive syntax. We introduce an…

Artificial Intelligence · Computer Science 2022-03-17 Mark Hamilton, Nick Gonsalves, Christina Lee, Anand Raman, Brendan Walsh, Siddhartha Prasad, Dalitso Banda, Lucy Zhang, Mei Gao, Lei Zhang, William T. Freeman

Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

Most of the popular Big Data analytics tools evolved to adapt their working environment to extract valuable information from a vast amount of unstructured data. The ability of data mining techniques to filter this helpful information from…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-23 Taha Tekdogan, Ali Cakmak

Flexible and Scalable Deep Learning with MMLSpark

In this work we detail a novel open source library, called MMLSpark, that combines the flexible deep learning library Cognitive Toolkit, with the distributed computing framework Apache Spark. To achieve this, we have contributed Java…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-10-30 Mark Hamilton, Sudarshan Raghunathan, Akshaya Annavajhala, Danil Kirsanov, Eduardo de Leon, Eli Barzilay, Ilya Matiach, Joe Davison, Maureen Busch, Miruna Oprescu, Ratan Sur, Roope Astala, Tong Wen, ChangYoung Park

Big Data analytics. Three use cases with R, Python and Spark

Management and analysis of big data are systematically associated with a data distributed architecture in the Hadoop and now Spark frameworks. This article offers an introduction for statisticians to these technologies by comparing the…

Applications · Statistics 2016-10-03 Philippe Besse, Brendan Guillouet, Jean-Michel Loubes

Technical Report: On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-30 Bilal Akil, Ying Zhou, Uwe Röhm

MMLSpark: Unifying Machine Learning Ecosystems at Massive Scales

We introduce Microsoft Machine Learning for Apache Spark (MMLSpark), an ecosystem of enhancements that expand the Apache Spark distributed computing library to tackle problems in Deep Learning, Micro-Service Orchestration, Gradient…

Machine Learning · Computer Science 2019-06-24 Mark Hamilton, Sudarshan Raghunathan, Ilya Matiach, Andrew Schonhoffer, Anand Raman, Eli Barzilay, Karthik Rajendran, Dalitso Banda, Casey Jisoo Hong, Manon Knoertzer, Ben Brodsky, Minsoo Thigpen, Janhavi Suresh Mahajan, Courtney Cochrane, Abhiram Eswaran, Ari Green

Performance Benefits of DataMPI: A Case Study with BigDataBench

Apache Hadoop and Spark are gaining prominence in Big Data processing and analytics. Both are widely deployed at Internet companies. On the other hand, high-performance data analysis requirements are causing academic and…

Performance · Computer Science 2014-03-17 Fan Liang, Chen Feng, Xiaoyi Lu, Zhiwei Xu

Declarative Data Pipeline for Large Scale ML Services

Modern distributed data processing systems struggle to balance performance, maintainability, and developer productivity when integrating machine learning at scale. These challenges intensify in large collaborative environments due to high…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-07 Yunzhao Yang, Runhui Wang, Xuanqing Liu, Adit Krishnan, Yefan Tao, Yuqian Deng, Kuangyou Yao, Peiyuan Sun, Henrik Johnson, Aditi sinha, Davor Golac, Gerald Friedland, Usman Shakeel, Daryl Cooke, Joe Sullivan, Madhusudhanan Chandrasekaran, Chris Kong

Understanding and Optimizing the Performance of Distributed Machine Learning Applications on Apache Spark

In this paper we explore the performance limits of Apache Spark for machine learning applications. We begin by analyzing the characteristics of a state-of-the-art distributed machine learning algorithm implemented in Spark and compare it to…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-21 Celestine Dünner, Thomas Parnell, Kubilay Atasu, Manolis Sifalakis, Haralampos Pozidis

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-01 George K. Thiruvathukal, Cameron Christensen, Xiaoyong Jin, François Tessier, Venkatram Vishwanath

Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics

The effective utilization at scale of complex machine learning (ML) techniques for HEP use cases poses several technological challenges, most importantly on the actual implementation of dedicated end-to-end data pipelines. A solution to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-17 Matteo Migliorini, Riccardo Castellotti, Luca Canali, Marco Zanetti

A Data and Model-Parallel, Distributed and Scalable Framework for Training of Deep Networks in Apache Spark

Training deep networks is expensive and time-consuming with the training period increasing with data size and growth in model parameters. In this paper, we provide a framework for distributed training of deep networks over a cluster of CPUs…

Machine Learning · Statistics 2017-08-22 Disha Shrivastava, Santanu Chaudhury, Dr. Jayadeva

Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation

Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-01 Subhadip Mitra

Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations, in particular many linear algebra computations that are the basis for solving common machine learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-31 Alex Gittens, Kai Rothauge, Shusen Wang, Michael W. Mahoney, Lisa Gerhardt, Prabhat, Jey Kottalam, Michael Ringenburg, Kristyn Maschhoff

Analyzing billion-objects catalog interactively: Apache Spark for physicists

Apache Spark is a Big Data framework for working on large distributed datasets. Although widely used in the industry, it remains rather limited in the academic community or often restricted to software engineers. The goal of this paper is…

Instrumentation and Methods for Astrophysics · Physics 2019-07-17 S. Plaszczynski, J. Peloton, C. Arnault, J. E. Campagne

Understanding the Challenges and Assisting Developers with Developing Spark Applications

To process data more efficiently, big data frameworks provide data abstractions to developers. However, due to the abstraction, there may be many challenges for developers to understand and debug the data processing code. To uncover the…

Software Engineering · Computer Science 2021-03-29 Zehao Wang