Related papers: Analyzing billion-objects catalog interactively: A…

AXS: A framework for fast astronomical data processing based on Apache Spark

We introduce AXS (Astronomy eXtensions for Spark), a scalable open-source astronomical data analysis framework built on Apache Spark, a widely used industry-standard engine for big data processing. Building on capabilities present in Spark,…

Instrumentation and Methods for Astrophysics · Physics 2019-07-10 Petar Zečević, Colin T. Slater, Mario Jurić, Andrew J. Connolly, Sven Lončarić, Eric C. Bellm, V. Zach Golkhou, Krzysztof Suberlak

Scientific Computing Meets Big Data Technology: An Astronomy Use Case

Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-03-15 Zhao Zhang, Kyle Barbary, Frank Austin Nothaft, Evan Sparks, Oliver Zahn, Michael J. Franklin, David A. Patterson, Saul Perlmutter

Scaling pair count to next galaxy surveys

Counting pairs of galaxies or stars according to their distance is at the core of real-space correlation analyses performed in astrophysics and cosmology. Upcoming galaxy surveys (LSST, Euclid) will measure properties of billions of…

Instrumentation and Methods for Astrophysics · Physics 2022-01-04 S. Plaszczynski, J. E. Campagne, J. Peloton, C. Arnault

FITS Data Source for Apache Spark

We investigate the performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys. Apache Spark's attempts to address big data problems have hitherto proved successful in the industry, but…

Instrumentation and Methods for Astrophysics · Physics 2018-10-17 Julien Peloton, Christian Arnault, Stéphane Plaszczynski

A Big Data Analysis Framework Using Apache Spark and Deep Learning

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta, Hardeo Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag

Distributed Streaming Analytics on Large-scale Oceanographic Data using Apache Spark

Real-world data from diverse domains require real-time scalable analysis. Large-scale data processing frameworks or engines such as Hadoop fall short when results are needed on-the-fly. Apache Spark's streaming library is increasingly…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-02 Janak Dahal, Elias Ioup, Shaikh Arifuzzaman, Mahdi Abdelguerfi

Large-scale text processing pipeline with Apache Spark

In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has…

Computation and Language · Computer Science 2019-12-03 Alexey Svyatkovskiy, Kosuke Imai, Mary Kroeger, Yuki Shiraito

Mining Area Skyline Objects from Map-based Big Data using Apache Spark Framework

The computation of the skyline provides a mechanism for utilizing multiple location-based criteria to identify optimal data points. However, the efficiency of these computations diminishes and becomes more challenging as the input data…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-05 Chen Li, Ye Zhu, Yang Cao, Jinli Zhang, Annisa Annisa, Debo Cheng, Yasuhiko Morimoto

InferSpark: Statistical Inference at Scale

The Apache Spark stack has enabled fast large-scale data processing. Despite a rich library of statistical models and inference algorithms, it does not give domain users the ability to develop their own models. The emergence of…

Databases · Computer Science 2017-10-10 Zhuoyue Zhao, Jialing Pei, Eric Lo, Kenny Q. Zhu, Chris Liu

CMS Analysis and Data Reduction with Apache Spark

Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-02 Oliver Gutsche, Luca Canali, Illia Cremer, Matteo Cremonesi, Peter Elmer, Ian Fisk, Maria Girone, Bo Jayatilaka, Jim Kowalkowski, Viktor Khristenko, Evangelos Motesnitsalis, Jim Pivarski, Saba Sehrish, Kacper Surdy, Alexey Svyatkovskiy

An Information Theoretic Feature Selection Framework for Big Data under Apache Spark

With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Among many techniques, feature selection has been growing in interest as an important tool to identify relevant features on…

Artificial Intelligence · Computer Science 2016-10-20 Sergio Ramírez-Gallego, Héctor Mouriño-Talín, David Martínez-Rego, Verónica Bolón-Canedo, José Manuel Benítez, Amparo Alonso-Betanzos, Francisco Herrera

Mobile Big Data Analytics Using Deep Learning and Apache Spark

The proliferation of mobile devices, such as smartphones and Internet of Things (IoT) gadgets, results in the recent mobile big data (MBD) era. Collecting MBD is unprofitable unless suitable analytics and learning methods are utilized for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-16 Mohammad Abu Alsheikh, Dusit Niyato, Shaowei Lin, Hwee-Pink Tan, Zhu Han

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-01 George K. Thiruvathukal, Cameron Christensen, Xiaoyong Jin, François Tessier, Venkatram Vishwanath

Using Big Data Technologies for HEP Analysis

The HEP community is approaching an era where the excellent performance of the particle accelerators in delivering collisions at a high rate will force the experiments to record a large amount of information. The growing size of the datasets…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-02 Matteo Cremonesi, Claudio Bellini, Bianny Bian, Luca Canali, Vasileios Dimakopoulos, Peter Elmer, Ian Fisk, Maria Girone, Oliver Gutsche, Siew-Yan Hoh, Bo Jayatilaka, Viktor Khristenko, Andrea Luiselli, Andrew Melo, Evangelos Evangelos, Dominick Olivito, Jacopo Pazzini, Jim Pivarski, Alexey Svyatkovskiy, Marco Zanetti

Understanding the Challenges and Assisting Developers with Developing Spark Applications

To process data more efficiently, big data frameworks provide data abstractions to developers. However, due to the abstraction, there may be many challenges for developers to understand and debug the data processing code. To uncover the…

Software Engineering · Computer Science 2021-03-29 Zehao Wang

Large-Scale Intelligent Microservices

Deploying Machine Learning (ML) algorithms within databases is a challenge due to the varied computational footprints of modern ML algorithms and the myriad of database technologies each with its own restrictive syntax. We introduce an…

Artificial Intelligence · Computer Science 2022-03-17 Mark Hamilton, Nick Gonsalves, Christina Lee, Anand Raman, Brendan Walsh, Siddhartha Prasad, Dalitso Banda, Lucy Zhang, Mei Gao, Lei Zhang, William T. Freeman

Comparing Two Different Approaches in Big Data and Business Analysis for Churn Prediction with the Focus on How Apache Spark Employed

Due to the significant importance of Big Data analysis, especially in business-related topics such as improving services, finding potential customers, and selecting practical approaches to manage income and expenses, many companies attempt…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-01 Mohammad Sina Kiarostami

A Survey on Spark Ecosystem for Big Data Processing

With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-17 Shanjiang Tang, Bingsheng He, Ce Yu, Yusen Li, Kun Li

Reproducible Experiments for Comparing Apache Flink and Apache Spark on Public Clouds

Big data processing is a hot topic in today's computer science world. There is a significant demand for analysing big data to satisfy many requirements of many industries. Emergence of the Kappa architecture created a strong requirement for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-17 Shelan Perera, Ashansa Perera, Kamal Hakimzadeh

Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations (in particular, many linear algebra computations that are the basis for solving common machine learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-31 Alex Gittens, Kai Rothauge, Shusen Wang, Michael W. Mahoney, Lisa Gerhardt, Prabhat, Jey Kottalam, Michael Ringenburg, Kristyn Maschhoff