English
Related papers

Related papers: An Executable Sequential Specification for Spark A…

200 papers

Stochastic algorithms are efficient approaches to solving machine learning and optimization problems. In this paper, we propose a general framework called Splash for parallelizing stochastic algorithms on multi-node distributed systems.…

Machine Learning · Computer Science 2015-09-24 Yuchen Zhang , Michael I. Jordan

As more and more devices connect to Internet of Things, unbounded streams of data will be generated, which have to be processed "on the fly" in order to trigger automated actions and deliver real-time services. Spark Streaming is a popular…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-09-12 Jia-Chun Lin , Ming-Chang Lee , Ingrid Chieh Yu , Einar Broch Johnsen

We introduce SparkCL, an open source unified programming framework based on Java, OpenCL and the Apache Spark framework. The motivation behind this work is to bring unconventional compute cores such as FPGAs/GPUs/APUs/DSPs and future core…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-05-06 Oren Segal , Philip Colangelo , Nasibeh Nasiri , Zhuo Qian , Martin Margala

Big data processing is a hot topic in today's computer science world. There is a significant demand for analysing big data to satisfy many requirements of many industries. Emergence of the Kappa architecture created a strong requirement for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-17 Shelan Perera , Ashansa Perera , Kamal Hakimzadeh

The need for modern data analytics to combine relational, procedural, and map-reduce-style functional processing is widely recognized. State-of-the-art systems like Spark have added SQL front-ends and relational query optimization, which…

We introduce a sampling framework to support approximate computing with estimated error bounds in Spark. Our framework allows sampling to be performed at the beginning of a sequence of multiple transformations ending in an aggregation…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-07 Guangyan Hu , Desheng Zhang , Sandro Rigo , Thu D. Nguyen

Prior work on Automatically Scalable Computation (ASC) suggests that it is possible to parallelize sequential computation by building a model of whole-program execution, using that model to predict future computations, and then…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-09-21 Peter Kraft , Amos Waterland , Daniel Y Fu , Anitha Gollamudi , Shai Szulanski , Margo Seltzer

The Apache Spark stack has enabled fast large-scale data processing. Despite a rich library of statistical models and inference algorithms, it does not give domain users the ability to develop their own models. The emergence of…

Databases · Computer Science 2017-10-10 Zhuoyue Zhao , Jialing Pei , Eric Lo , Kenny Q. Zhu , Chris Liu

The increasing complexity of deep neural networks (DNNs) has made it challenging to exploit existing large-scale data processing pipelines for handling massive data and parameters involved in DNN training. Distributed computing platforms…

Machine Learning · Computer Science 2016-10-04 Hanjoo Kim , Jaehong Park , Jaehee Jang , Sungroh Yoon

As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-01 George K. Thiruvathukal , Cameron Christensen , Xiaoyong Jin , François Tessier , Venkatram Vishwanath

With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analysis Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-17 Shanjiang Tang , Bingsheng He , Ce Yu , Yusen Li , Kun Li

The difficulty of developing reliable parallel software is generating interest in deterministic environments, where a given program and input can yield only one possible result. Languages or type systems can enforce determinism in new code,…

Operating Systems · Computer Science 2010-02-01 Amittai Aviram , Bryan Ford

Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper…

Databases · Computer Science 2015-07-10 Olivier Curé , Hubert Naacke , Mohamed-Amine Baazizi , Bernd Amann

Custom policy-learning pipelines in Spark fail for two coupled systems reasons: rowwise Python execution makes inference impractical, and driver-side candidate materialization makes split search fragile at feature scale. We present Spark…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-29 Zeyu Bai

Spark has been established as an attractive platform for big data analysis, since it manages to hide most of the complexities related to parallelism, fault tolerance and cluster setting from developers. However, this comes at the expense of…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-07-26 Panagiotis Petridis , Anastasios Gounaris , Jordi Torres

Distributed computation is always a tricky topic to deal with, especially in context of various requirements in various scenarios. A popular solution is to use Apache Spark with a setup of multiple systems forming a cluster. However, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-04-18 Anuran Roy , Sridhar Raj S

Large-scale data processing is increasingly done using distributed computing frameworks like Apache Spark, which have a considerable number of configurable parameters that affect runtime performance. For optimal performance, these…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-07 Raunaq Suri , Ilan Gofman , Guangwei Yu , Jesse C. Cresswell

Big data analytics requires high programmer productivity and high performance simultaneously on large-scale clusters. However, current big data analytics frameworks (e.g. Apache Spark) have prohibitive runtime overheads since they are…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-12 Ehsan Totoni , Todd A. Anderson , Tatiana Shpeisman

We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-07-14 Reza Bosagh Zadeh , Xiangrui Meng , Aaron Staple , Burak Yavuz , Li Pu , Shivaram Venkataraman , Evan Sparks , Alexander Ulanov , Matei Zaharia

We present GraSSP, a novel approach to perform automated parallelization relying on recent advances in formal verification and synthesis. GraSSP augments an existing sequential program with an additional functionality to decompose data…

Programming Languages · Computer Science 2016-11-24 Grigory Fedyukovich , Rastislav Bodík
‹ Prev 1 2 3 10 Next ›