Related papers: HiFrames: High Performance Data Frames in a Script…

HPAT: High Performance Analytics with Scripting Ease-of-Use

Big data analytics requires high programmer productivity and high performance simultaneously on large-scale clusters. However, current big data analytics frameworks (e.g. Apache Spark) have prohibitive runtime overheads since they are…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-12 Ehsan Totoni, Todd A. Anderson, Tatiana Shpeisman

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-01 George K. Thiruvathukal, Cameron Christensen, Xiaoyong Jin, François Tessier, Venkatram Vishwanath

In-Memory Indexed Caching for Distributed Data Processing

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de-facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-09 Alexandru Uta, Bogdan Ghit, Ankur Dave, Jan Rellermeyer, Peter Boncz

Flare: Native Compilation for Heterogeneous Workloads in Apache Spark

The need for modern data analytics to combine relational, procedural, and map-reduce-style functional processing is widely recognized. State-of-the-art systems like Spark have added SQL front-ends and relational query optimization, which…

Databases · Computer Science 2017-03-27 Grégory M. Essertel, Ruby Y. Tahboub, James M. Decker, Kevin J. Brown, Kunle Olukotun, Tiark Rompf

RDD-Eclat: Approaches to Parallelize Eclat Algorithm on Spark RDD Framework (Extended Version)

Frequent itemset mining (FIM) is a highly computational and data intensive algorithm. Therefore, parallel and distributed FIM algorithms have been designed to process large volume of data in a reduced time. Recently, a number of FIM…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-26 Pankaj Singh, Sudhakar Singh, P K Mishra, Rakhi Garg

HEP-Frame: an Efficient Tool for Big Data Applications at the LHC

HEP-Frame is a new C++ package designed to efficiently perform analyses of data sets from a very large number of events, like those available at the Large Hadron Collider (LHC) at CERN, Geneva. It mainly targets high performance servers and…

High Energy Physics - Experiment · Physics 2023-03-10 A. Pereira, A. Onofre, A. Proenca

Towards Scalable Dataframe Systems

Dataframes are a popular abstraction to represent, prepare, and analyze data. Despite the remarkable success of dataframe libraries in R and Python, dataframes face performance issues even on moderately large datasets. Moreover, there is…

Databases · Computer Science 2020-06-03 Devin Petersohn, Stephen Macke, Doris Xin, William Ma, Doris Lee, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, Aditya Parameswaran

hMDAP: A Hybrid Framework for Multi-paradigm Data Analytical Processing on Spark

We propose hMDAP, a hybrid framework for large-scale data analytical processing on Spark, to support multi-paradigm process (incl. OLAP, machine learning, and graph analysis etc.) in distributed environments. The framework features a…

Databases · Computer Science 2017-01-17 Xiaowang Zhang, Jiahui Zhang, Zhiyong Feng

High Performance Dataframes from Parallel Processing Patterns

The data science community today has embraced the concept of Dataframes as the de facto standard for data representation and manipulation. Ease of use, massive operator coverage, and popularization of R and Python languages have heavily…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-06 Niranda Perera, Supun Kamburugamuve, Chathura Widanage, Vibhatha Abeykoon, Ahmet Uyar, Kaiying Shan, Hasara Maithree, Damitha Lenadora, Thejaka Amila Kanewala, Geoffrey Fox

Does Big Data Require Complex Systems? A Performance Comparison Between Spark and Unicage Shell Scripts

The paradigm of big data is characterized by the need to collect and process data sets of great volume, arriving at the systems with great velocity, in a variety of formats. Spark is a widely used big data processing system that can be…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-29 Duarte M. Nascimento, Miguel Ferreira, Miguel L. Pardal

GraphMat: High performance graph analytics made productive

Given the growing importance of large-scale graph analytics, there is a need to improve the performance of graph analysis frameworks without compromising on productivity. GraphMat is our solution to bridge this gap between a user-friendly…

Performance · Computer Science 2015-03-26 Narayanan Sundaram, Nadathur Rajagopalan Satish, Md Mostofa Ali Patwary, Subramanya R Dulloor, Satya Gautam Vadlamudi, Dipankar Das, Pradeep Dubey

RDD-Eclat: Approaches to Parallelize Eclat Algorithm on Spark RDD Framework

Initially, a number of frequent itemset mining (FIM) algorithms were designed on Hadoop MapReduce, a distributed big data processing framework. But, due to heavy disk I/O, MapReduce is found to be inefficient for such highly…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-12-16 Pankaj Singh, Sudhakar Singh, P. K. Mishra, Rakhi Garg

Sparklen: A Statistical Learning Toolkit for High-Dimensional Hawkes Processes in Python

This paper introduces Sparklen, a statistical learning toolkit for Hawkes processes in Python, designed to bring together efficiency and ease of use. The purpose of this package is to provide the Python community with a complete suite of…

Methodology · Statistics 2025-03-31 Romain Edmond Lacoste

MPIgnite: An MPI-Like Language and Prototype Implementation for Apache Spark

Scale-out parallel processing based on MPI is a 25-year-old standard with at least another decade of preceding history of enabling technologies in the High Performance Computing community. Newer frameworks such as MapReduce, Hadoop, and…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-18 Brandon L. Morris, Anthony Skjellum

COMPARE: Accelerating Groupwise Comparison in Relational Databases for Data Analytics

Data analysis often involves comparing subsets of data across many dimensions for finding unusual trends and patterns. While the comparison between subsets of data can be expressed using SQL, they tend to be complex to write, and suffer…

Databases · Computer Science 2021-07-28 Tarique Siddiqui, Surajit Chaudhuri, Vivek Narasayya

AFrame: Extending DataFrames for Large-Scale Modern Data Analysis (Extended Version)

Analyzing the increasingly large volumes of data that are available today, possibly including the application of custom machine learning models, requires the utilization of distributed frameworks. This can result in serious productivity…

Databases · Computer Science 2019-08-20 Phanwadee Sinthong, Michael J. Carey

Performance Benefits of DataMPI: A Case Study with BigDataBench

Apache Hadoop and Spark are gaining prominence in Big Data processing and analytics. Both of them are widely deployed at Internet companies. On the other hand, high-performance data analysis requirements are causing academic and…

Performance · Computer Science 2014-03-17 Fan Liang, Chen Feng, Xiaoyi Lu, Zhiwei Xu

A Survey on Spark Ecosystem for Big Data Processing

With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-17 Shanjiang Tang, Bingsheng He, Ce Yu, Yusen Li, Kun Li

Translation of Array-Based Loops to Distributed Data-Parallel Programs

Large volumes of data generated by scientific experiments and simulations come in the form of arrays, while programs that analyze these data are frequently expressed in terms of array operations in an imperative, loop-based language. But,…

Databases · Computer Science 2020-03-24 Leonidas Fegaras, Md Hasanuzzaman Noor

Shark: SQL and Rich Analytics at Scale

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics…

Databases · Computer Science 2012-11-28 Reynold Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica