English
Related papers

Related papers: In-Memory Indexed Caching for Distributed Data Pro…

200 papers

Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-08 Jonathan Will , Lauritz Thamsen , Dominik Scheinert , Odej Kao

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both, batch and stream…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-04-29 Ahsan Javed Awan , Mats Brorsson , Vladimir Vlassov , Eduard Ayguade

In last decade, data analytics have rapidly progressed from traditional disk-based processing to modern in-memory processing. However, little effort has been devoted at enhancing performance at micro-architecture level. This paper…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-17 Ahsan Javed Awan , Mats Brorsson , Vladimir Vlassov , Eduard Ayguade

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both, batch and stream…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-31 Ahsan Javed Awan , Mats Brorsson , Vladimir Vlassov , Eduard Ayguade

Deploying Machine Learning (ML) algorithms within databases is a challenge due to the varied computational footprints of modern ML algorithms and the myriad of database technologies each with its own restrictive syntax. We introduce an…

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-30 Bilal Akil , Ying Zhou , Uwe Röhm

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta , Hardeo Thakur , Ritvik Shrivastava , Pulkit Kumar , Sreyashi Nag

Spark is an in-memory analytics platform that targets commodity server environments today. It relies on the Hadoop Distributed File System (HDFS) to persist intermediate checkpoint states and final processing results. In Spark, immutable…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-22 Mijung Kim , Jun Li , Haris Volos , Manish Marwah , Alexander Ulanov , Kimberly Keeton , Joseph Tucek , Lucy Cherkasova , Le Xu , Pradeep Fernando

Analyzing the increasingly large volumes of data that are available today, possibly including the application of custom machine learning models, requires the utilization of distributed frameworks. This can result in serious productivity…

Databases · Computer Science 2019-08-20 Phanwadee Sinthong , Michael J. Carey

Data processing frameworks such as Apache Beam and Apache Spark are used for a wide range of applications, from logs analysis to data preparation for DNN training. It is thus unsurprising that there has been a large amount of work on…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-07 Ubaid Ullah Hafeez , Martin Maas , Mustafa Uysal , Richard McDougall

In the era of big data and cloud computing, large amounts of data are generated from user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such…

Performance · Computer Science 2018-05-09 Zhengyu Yang , Danlin Jia , Stratis Ioannidis , Ningfang Mi , Bo Sheng

The Apache Spark stack has enabled fast large-scale data processing. Despite a rich library of statistical models and inference algorithms, it does not give domain users the ability to develop their own models. The emergence of…

Databases · Computer Science 2017-10-10 Zhuoyue Zhao , Jialing Pei , Eric Lo , Kenny Q. Zhu , Chris Liu

As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-01 George K. Thiruvathukal , Cameron Christensen , Xiaoyong Jin , François Tessier , Venkatram Vishwanath

Data frames in scripting languages are essential abstractions for processing structured data. However, existing data frame solutions are either not distributed (e.g., Pandas in Python) and therefore have limited scalability, or they are not…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-11 Ehsan Totoni , Wajih Ul Hassan , Todd A. Anderson , Tatiana Shpeisman

Data analytic applications built upon big data processing frameworks such as Apache Spark are an important class of applications. Many of these applications are not latency-sensitive and thus can run as batch jobs in data centers. By…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-03 Vicent Sanz Marco , Ben Taylor , Barry Porter , Zheng Wang

We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-07-14 Reza Bosagh Zadeh , Xiangrui Meng , Aaron Staple , Burak Yavuz , Li Pu , Shivaram Venkataraman , Evan Sparks , Alexander Ulanov , Matei Zaharia

Access plan recommendation is a query optimization approach that executes new queries using prior created query execution plans (QEPs). The query optimizer divides the query space into clusters in the mentioned method. However, traditional…

Databases · Computer Science 2022-10-14 Elham Azhir , Mehdi Hosseinzadeh , Faheem Khan , Amir Mosavi

In this paper we explore the performance limits of Apache Spark for machine learning applications. We begin by analyzing the characteristics of a state-of-the-art distributed machine learning algorithm implemented in Spark and compare it to…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-21 Celestine Dünner , Thomas Parnell , Kubilay Atasu , Manolis Sifalakis , Haralampos Pozidis

We present a novel architecture for sparse pattern processing, using flash storage with embedded accelerators. Sparse pattern processing on large data sets is the essence of applications such as document search, natural language processing,…

Hardware Architecture · Computer Science 2017-01-25 Sang-Woo Jun , Huy T. Nguyen , Vijay N. Gadepally , Arvind

Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this paper, we present new techniques…

Databases · Computer Science 2019-07-17 Mingjie Tang , Yongyang Yu , Walid G. Aref , Ahmed R. Mahmood , Qutaibah M. Malluhi , Mourad Ouzzani
‹ Prev 1 2 3 10 Next ›