Related papers: Deep Learning with Apache SystemML

A Big Data Analysis Framework Using Apache Spark and Deep Learning

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta , Hardeo Thakur , Ritvik Shrivastava , Pulkit Kumar , Sreyashi Nag

BigDL: A Distributed Deep Learning Framework for Big Data

This paper presents BigDL (a distributed deep learning framework for Apache Spark), which has been used by a variety of users in the industry for building deep learning applications on production big data platforms. It allows deep learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-13 Jason Dai , Yiheng Wang , Xin Qiu , Ding Ding , Yao Zhang , Yanzhang Wang , Xianyan Jia , Cherry Zhang , Yan Wan , Zhichao Li , Jiao Wang , Shengsheng Huang , Zhongyuan Wu , Yang Wang , Yuhao Yang , Bowen She , Dongjie Shi , Qi Lu , Kai Huang , Guoqiong Song

Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics

The effective utilization at scale of complex machine learning (ML) techniques for HEP use cases poses several technological challenges, most importantly on the actual implementation of dedicated end-to-end data pipelines. A solution to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-17 Matteo Migliorini , Riccardo Castellotti , Luca Canali , Marco Zanetti

A Data and Model-Parallel, Distributed and Scalable Framework for Training of Deep Networks in Apache Spark

Training deep networks is expensive and time-consuming with the training period increasing with data size and growth in model parameters. In this paper, we provide a framework for distributed training of deep networks over a cluster of CPUs…

Machine Learning · Statistics 2017-08-22 Disha Shrivastava , Santanu Chaudhury , Dr. Jayadeva

Large-Scale Intelligent Microservices

Deploying Machine Learning (ML) algorithms within databases is a challenge due to the varied computational footprints of modern ML algorithms and the myriad of database technologies each with its own restrictive syntax. We introduce an…

Artificial Intelligence · Computer Science 2022-03-17 Mark Hamilton , Nick Gonsalves , Christina Lee , Anand Raman , Brendan Walsh , Siddhartha Prasad , Dalitso Banda , Lucy Zhang , Mei Gao , Lei Zhang , William T. Freeman

Declarative Data Pipeline for Large Scale ML Services

Modern distributed data processing systems struggle to balance performance, maintainability, and developer productivity when integrating machine learning at scale. These challenges intensify in large collaborative environments due to high…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-07 Yunzhao Yang , Runhui Wang , Xuanqing Liu , Adit Krishnan , Yefan Tao , Yuqian Deng , Kuangyou Yao , Peiyuan Sun , Henrik Johnson , Aditi sinha , Davor Golac , Gerald Friedland , Usman Shakeel , Daryl Cooke , Joe Sullivan , Madhusudhanan Chandrasekaran , Chris Kong

Mobile Big Data Analytics Using Deep Learning and Apache Spark

The proliferation of mobile devices, such as smartphones and Internet of Things (IoT) gadgets, results in the recent mobile big data (MBD) era. Collecting MBD is unprofitable unless suitable analytics and learning methods are utilized for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-16 Mohammad Abu Alsheikh , Dusit Niyato , Shaowei Lin , Hwee-Pink Tan , Zhu Han

DeepSpark: A Spark-Based Distributed Deep Learning Framework for Commodity Clusters

The increasing complexity of deep neural networks (DNNs) has made it challenging to exploit existing large-scale data processing pipelines for handling massive data and parameters involved in DNN training. Distributed computing platforms…

Machine Learning · Computer Science 2016-10-04 Hanjoo Kim , Jaehong Park , Jaehee Jang , Sungroh Yoon

MMLSpark: Unifying Machine Learning Ecosystems at Massive Scales

We introduce Microsoft Machine Learning for Apache Spark (MMLSpark), an ecosystem of enhancements that expand the Apache Spark distributed computing library to tackle problems in Deep Learning, Micro-Service Orchestration, Gradient…

Machine Learning · Computer Science 2019-06-24 Mark Hamilton , Sudarshan Raghunathan , Ilya Matiach , Andrew Schonhoffer , Anand Raman , Eli Barzilay , Karthik Rajendran , Dalitso Banda , Casey Jisoo Hong , Manon Knoertzer , Ben Brodsky , Minsoo Thigpen , Janhavi Suresh Mahajan , Courtney Cochrane , Abhiram Eswaran , Ari Green

Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation

Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-01 Subhadip Mitra

Towards A Domain-Customized Automated Machine Learning Framework For Networks and Systems

Clouds gather a vast volume of telemetry from their networked systems which contain valuable information that can help solve many of the problems that continue to plague them. However, it is hard to extract useful information from such raw…

Networking and Internet Architecture · Computer Science 2020-04-28 Behnaz Arzani , Bita Rouhani

Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

Most of the popular Big Data analytics tools evolved to adapt their working environment to extract valuable information from a vast amount of unstructured data. The ability of data mining techniques to filter this helpful information from…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-23 Taha Tekdogan , Ali Cakmak

A Survey on Geographically Distributed Big-Data Processing using MapReduce

Hadoop and Spark are widely used distributed processing frameworks for large-scale data processing in an efficient and fault-tolerant manner on private or public clouds. These big-data processing systems are extensively used by many…

Databases · Computer Science 2017-07-07 Shlomi Dolev , Patricia Florissi , Ehud Gudes , Shantanu Sharma , Ido Singer

High-Dimensional Data Processing: Benchmarking Machine Learning and Deep Learning Architectures in Local and Distributed Environments

This document reports the sequence of practices and methodologies implemented during the Big Data course. It details the workflow beginning with the processing of the Epsilon dataset through group and individual strategies, followed by text…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-12 Julian Rodriguez , Piotr Lopez , Emiliano Lerma , Rafael Medrano , Jacobo Hernandez

SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle

Machine learning (ML) applications become increasingly common in many domains. ML systems to execute these workloads include numerical computing frameworks and libraries, ML algorithm libraries, and specialized systems for deep neural…

Databases · Computer Science 2020-01-09 Matthias Boehm , Iulian Antonov , Sebastian Baunsgaard , Mark Dokter , Robert Ginthoer , Kevin Innerebner , Florijan Klezin , Stefanie Lindstaedt , Arnab Phani , Benjamin Rath , Berthold Reinwald , Shafaq Siddiqi , Sebastian Benjamin Wrede

Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing

Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-11 Jonathan Will , Lauritz Thamsen , Jonathan Bader , Dominik Scheinert , Odej Kao

KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics

Modern advanced analytics applications make use of machine learning techniques and contain multiple steps of domain-specific and general-purpose processing with high resource requirements. We present KeystoneML, a system that captures and…

Machine Learning · Computer Science 2016-11-01 Evan R. Sparks , Shivaram Venkataraman , Tomer Kaftan , Michael J. Franklin , Benjamin Recht

A Comparison of Big Data Frameworks on a Layered Dataflow Model

In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models, for which only…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-17 Claudia Misale , Maurizio Drocco , Marco Aldinucci , Guy Tremblay

Exploiting Apache Spark platform for CMS computing analytics

The CERN IT provides a set of Hadoop clusters featuring more than 5 PBytes of raw storage with different open-source, user-level tools available for analytical purposes. The CMS experiment started collecting a large set of computing…

Data Analysis, Statistics and Probability · Physics 2017-11-03 Marco Meoni , Valentin Kuznetsov , Luca Menichetti , Justinas Rumševičius , Tommaso Boccali , Daniele Bonacorsi

Performance Evaluation of Distributed Computing Environments with Hadoop and Spark Frameworks

Recently, due to rapid development of information and communication technologies, the data are created and consumed in the avalanche way. Distributed computing create preconditions for analyzing and processing such Big Data by distributing…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-30 Vladyslav Taran , Oleg Alienin , Sergii Stirenko , A. Rojbi , Yuri Gordienko