Related papers: MLlib: Machine Learning in Apache Spark

A Big Data Analysis Framework Using Apache Spark and Deep Learning

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta, Hardeo Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag

Demonstration of LogicLib: An Expressive Multi-Language Interface over Scalable Datalog System

With the ever-increasing volume of data, there is an urgent need to provide expressive and efficient tools to support Big Data analytics. The declarative logical language Datalog has proven very effective at expressing concisely graph,…

Databases · Computer Science 2022-09-07 Mingda Li, Jin Wang, Guorui Xiao, Youfu Li, Carlo Zaniolo

Flexible and Scalable Deep Learning with MMLSpark

In this work we detail a novel open source library, called MMLSpark, that combines the flexible deep learning library Cognitive Toolkit, with the distributed computing framework Apache Spark. To achieve this, we have contributed Java…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-10-30 Mark Hamilton, Sudarshan Raghunathan, Akshaya Annavajhala, Danil Kirsanov, Eduardo de Leon, Eli Barzilay, Ilya Matiach, Joe Davison, Maureen Busch, Miruna Oprescu, Ratan Sur, Roope Astala, Tong Wen, ChangYoung Park

Spark NLP: Natural Language Understanding at Scale

Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant and accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark…

Computation and Language · Computer Science 2021-01-27 Veysel Kocaman, David Talby

MMLSpark: Unifying Machine Learning Ecosystems at Massive Scales

We introduce Microsoft Machine Learning for Apache Spark (MMLSpark), an ecosystem of enhancements that expand the Apache Spark distributed computing library to tackle problems in Deep Learning, Micro-Service Orchestration, Gradient…

Machine Learning · Computer Science 2019-06-24 Mark Hamilton, Sudarshan Raghunathan, Ilya Matiach, Andrew Schonhoffer, Anand Raman, Eli Barzilay, Karthik Rajendran, Dalitso Banda, Casey Jisoo Hong, Manon Knoertzer, Ben Brodsky, Minsoo Thigpen, Janhavi Suresh Mahajan, Courtney Cochrane, Abhiram Eswaran, Ari Green

Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations, in particular many linear algebra computations that are the basis for solving common machine learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-31 Alex Gittens, Kai Rothauge, Shusen Wang, Michael W. Mahoney, Lisa Gerhardt, Prabhat, Jey Kottalam, Michael Ringenburg, Kristyn Maschhoff

Designing and building the mlpack open-source machine learning library

mlpack is an open-source C++ machine learning library with an emphasis on speed and flexibility. Since its original inception in 2007, it has grown to be a large project implementing a wide variety of machine learning algorithms, from…

Mathematical Software · Computer Science 2017-08-31 Ryan R. Curtin, Marcus Edel

MLPACK: A Scalable C++ Machine Learning Library

MLPACK is a state-of-the-art, scalable, multi-platform C++ machine learning library released in late 2011 offering both a simple, consistent API accessible to novice users and high performance and flexibility to expert users by leveraging…

Mathematical Software · Computer Science 2021-06-24 Ryan R. Curtin, James R. Cline, N. P. Slagle, William B. March, Parikshit Ram, Nishant A. Mehta, Alexander G. Gray

Large-Scale Intelligent Microservices

Deploying Machine Learning (ML) algorithms within databases is a challenge due to the varied computational footprints of modern ML algorithms and the myriad of database technologies each with its own restrictive syntax. We introduce an…

Artificial Intelligence · Computer Science 2022-03-17 Mark Hamilton, Nick Gonsalves, Christina Lee, Anand Raman, Brendan Walsh, Siddhartha Prasad, Dalitso Banda, Lucy Zhang, Mei Gao, Lei Zhang, William T. Freeman

Alchemist: An Apache Spark <=> MPI Interface

The Apache Spark framework for distributed computation is popular in the data analytics community due to its ease of use, but its MapReduce-style programming model can incur significant overheads when performing computations that do not map…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-06 Alex Gittens, Kai Rothauge, Shusen Wang, Michael W. Mahoney, Jey Kottalam, Lisa Gerhardt, Prabhat, Michael Ringenburg, Kristyn Maschhoff

Mobile Big Data Analytics Using Deep Learning and Apache Spark

The proliferation of mobile devices, such as smartphones and Internet of Things (IoT) gadgets, results in the recent mobile big data (MBD) era. Collecting MBD is unprofitable unless suitable analytics and learning methods are utilized for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-16 Mohammad Abu Alsheikh, Dusit Niyato, Shaowei Lin, Hwee-Pink Tan, Zhu Han

Understanding and Optimizing the Performance of Distributed Machine Learning Applications on Apache Spark

In this paper we explore the performance limits of Apache Spark for machine learning applications. We begin by analyzing the characteristics of a state-of-the-art distributed machine learning algorithm implemented in Spark and compare it to…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-21 Celestine Dünner, Thomas Parnell, Kubilay Atasu, Manolis Sifalakis, Haralampos Pozidis

Deep Learning with Apache SystemML

Enterprises operate large data lakes using Hadoop and Spark frameworks that (1) run a plethora of tools to automate powerful data preparation/transformation pipelines, (2) run on shared, large clusters to (3) perform many different…

Machine Learning · Computer Science 2018-02-14 Niketan Pansare, Michael Dusenberry, Nakul Jindal, Matthias Boehm, Berthold Reinwald, Prithviraj Sen

mlpy: Machine Learning Python

mlpy is a Python Open Source Machine Learning library built on top of NumPy/SciPy and the GNU Scientific Libraries. mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems and it is…

Mathematical Software · Computer Science 2012-03-02 Davide Albanese, Roberto Visintainer, Stefano Merler, Samantha Riccadonna, Giuseppe Jurman, Cesare Furlanello

The MADlib Analytics Library or MAD Skills, the SQL

MADlib is a free, open source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data…

Databases · Computer Science 2015-03-20 Joe Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, Arun Kumar

MLitB: Machine Learning in the Browser

With few exceptions, the field of Machine Learning (ML) research has largely ignored the browser as a computational engine. Beyond an educational resource for ML, the browser has vast potential to not only improve the state-of-the-art in ML…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-06-18 Edward Meeds, Remco Hendriks, Said Al Faraby, Magiel Bruntink, Max Welling

Large-scale text processing pipeline with Apache Spark

In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has…

Computation and Language · Computer Science 2019-12-03 Alexey Svyatkovskiy, Kosuke Imai, Mary Kroeger, Yuki Shiraito

Big Data analytics. Three use cases with R, Python and Spark

Management and analysis of big data are systematically associated with a data distributed architecture in the Hadoop and now Spark frameworks. This article offers an introduction for statisticians to these technologies by comparing the…

Applications · Statistics 2016-10-03 Philippe Besse, Brendan Guillouet, Jean-Michel Loubes

BigDL: A Distributed Deep Learning Framework for Big Data

This paper presents BigDL (a distributed deep learning framework for Apache Spark), which has been used by a variety of users in the industry for building deep learning applications on production big data platforms. It allows deep learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-13 Jason Dai, Yiheng Wang, Xin Qiu, Ding Ding, Yao Zhang, Yanzhang Wang, Xianyan Jia, Cherry Zhang, Yan Wan, Zhichao Li, Jiao Wang, Shengsheng Huang, Zhongyuan Wu, Yang Wang, Yuhao Yang, Bowen She, Dongjie Shi, Qi Lu, Kai Huang, Guoqiong Song

InferSpark: Statistical Inference at Scale

The Apache Spark stack has enabled fast large-scale data processing. Despite a rich library of statistical models and inference algorithms, it does not give domain users the ability to develop their own models. The emergence of…

Databases · Computer Science 2017-10-10 Zhuoyue Zhao, Jialing Pei, Eric Lo, Kenny Q. Zhu, Chris Liu