Related papers: MLlib: Machine Learning in Apache Spark

200 papers

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta, Hardeo Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag

With the ever-increasing volume of data, there is an urgent need to provide expressive and efficient tools to support Big Data analytics. The declarative logical language Datalog has proven very effective at expressing concisely graph,…

Databases · Computer Science 2022-09-07 Mingda Li, Jin Wang, Guorui Xiao, Youfu Li, Carlo Zaniolo

In this work we detail a novel open source library, called MMLSpark, that combines the flexible deep learning library Cognitive Toolkit, with the distributed computing framework Apache Spark. To achieve this, we have contributed Java…

Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant and accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark…

Computation and Language · Computer Science 2021-01-27 Veysel Kocaman, David Talby

We introduce Microsoft Machine Learning for Apache Spark (MMLSpark), an ecosystem of enhancements that expand the Apache Spark distributed computing library to tackle problems in Deep Learning, Micro-Service Orchestration, Gradient…

Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations, in particular many linear algebra computations that are the basis for solving common machine learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-31 Alex Gittens, Kai Rothauge, Shusen Wang, Michael W. Mahoney, Lisa Gerhardt, Prabhat, Jey Kottalam, Michael Ringenburg, Kristyn Maschhoff

mlpack is an open-source C++ machine learning library with an emphasis on speed and flexibility. Since its original inception in 2007, it has grown to be a large project implementing a wide variety of machine learning algorithms, from…

Mathematical Software · Computer Science 2017-08-31 Ryan R. Curtin, Marcus Edel

MLPACK is a state-of-the-art, scalable, multi-platform C++ machine learning library released in late 2011 offering both a simple, consistent API accessible to novice users and high performance and flexibility to expert users by leveraging…

Mathematical Software · Computer Science 2021-06-24 Ryan R. Curtin, James R. Cline, N. P. Slagle, William B. March, Parikshit Ram, Nishant A. Mehta, Alexander G. Gray

Deploying Machine Learning (ML) algorithms within databases is a challenge due to the varied computational footprints of modern ML algorithms and the myriad of database technologies each with its own restrictive syntax. We introduce an…

The Apache Spark framework for distributed computation is popular in the data analytics community due to its ease of use, but its MapReduce-style programming model can incur significant overheads when performing computations that do not map…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-06 Alex Gittens, Kai Rothauge, Shusen Wang, Michael W. Mahoney, Jey Kottalam, Lisa Gerhardt, Prabhat, Michael Ringenburg, Kristyn Maschhoff

The proliferation of mobile devices, such as smartphones and Internet of Things (IoT) gadgets, results in the recent mobile big data (MBD) era. Collecting MBD is unprofitable unless suitable analytics and learning methods are utilized for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-16 Mohammad Abu Alsheikh, Dusit Niyato, Shaowei Lin, Hwee-Pink Tan, Zhu Han

In this paper we explore the performance limits of Apache Spark for machine learning applications. We begin by analyzing the characteristics of a state-of-the-art distributed machine learning algorithm implemented in Spark and compare it to…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-21 Celestine Dünner, Thomas Parnell, Kubilay Atasu, Manolis Sifalakis, Haralampos Pozidis

Enterprises operate large data lakes using Hadoop and Spark frameworks that (1) run a plethora of tools to automate powerful data preparation/transformation pipelines, (2) run on shared, large clusters to (3) perform many different…

Machine Learning · Computer Science 2018-02-14 Niketan Pansare, Michael Dusenberry, Nakul Jindal, Matthias Boehm, Berthold Reinwald, Prithviraj Sen

mlpy is a Python Open Source Machine Learning library built on top of NumPy/SciPy and the GNU Scientific Libraries. mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems and it is…

Mathematical Software · Computer Science 2012-03-02 Davide Albanese, Roberto Visintainer, Stefano Merler, Samantha Riccadonna, Giuseppe Jurman, Cesare Furlanello

MADlib is a free, open source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data…

With few exceptions, the field of Machine Learning (ML) research has largely ignored the browser as a computational engine. Beyond an educational resource for ML, the browser has vast potential to not only improve the state-of-the-art in ML…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-06-18 Edward Meeds, Remco Hendriks, Said Al Faraby, Magiel Bruntink, Max Welling

In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has…

Computation and Language · Computer Science 2019-12-03 Alexey Svyatkovskiy, Kosuke Imai, Mary Kroeger, Yuki Shiraito

Management and analysis of big data are systematically associated with a data distributed architecture in the Hadoop and now Spark frameworks. This article offers an introduction for statisticians to these technologies by comparing the…

Applications · Statistics 2016-10-03 Philippe Besse, Brendan Guillouet, Jean-Michel Loubes

This paper presents BigDL (a distributed deep learning framework for Apache Spark), which has been used by a variety of users in the industry for building deep learning applications on production big data platforms. It allows deep learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-13 Jason Dai, Yiheng Wang, Xin Qiu, Ding Ding, Yao Zhang, Yanzhang Wang, Xianyan Jia, Cherry Zhang, Yan Wan, Zhichao Li, Jiao Wang, Shengsheng Huang, Zhongyuan Wu, Yang Wang, Yuhao Yang, Bowen She, Dongjie Shi, Qi Lu, Kai Huang, Guoqiong Song

The Apache Spark stack has enabled fast large-scale data processing. Despite a rich library of statistical models and inference algorithms, it does not give domain users the ability to develop their own models. The emergence of…

Databases · Computer Science 2017-10-10 Zhuoyue Zhao, Jialing Pei, Eric Lo, Kenny Q. Zhu, Chris Liu