Related papers: FITS Data Source for Apache Spark

AXS: A framework for fast astronomical data processing based on Apache Spark

We introduce AXS (Astronomy eXtensions for Spark), a scalable open-source astronomical data analysis framework built on Apache Spark, a widely used industry-standard engine for big data processing. Building on capabilities present in Spark,…

Instrumentation and Methods for Astrophysics · Physics 2019-07-10 Petar Zečević, Colin T. Slater, Mario Jurić, Andrew J. Connolly, Sven Lončarić, Eric C. Bellm, V. Zach Golkhou, Krzysztof Suberlak

Analyzing billion-objects catalog interactively: Apache Spark for physicists

Apache Spark is a Big Data framework for working on large distributed datasets. Although widely used in the industry, it remains rather limited in the academic community or often restricted to software engineers. The goal of this paper is…

Instrumentation and Methods for Astrophysics · Physics 2019-07-17 S. Plaszczynski, J. Peloton, C. Arnault, J. E. Campagne

Scaling pair count to next galaxy surveys

Counting pairs of galaxies or stars according to their distance is at the core of real-space correlation analyses performed in astrophysics and cosmology. Upcoming galaxy surveys (LSST, Euclid) will measure properties of billions of…

Instrumentation and Methods for Astrophysics · Physics 2022-01-04 S. Plaszczynski, J. E. Campagne, J. Peloton, C. Arnault

Scientific Computing Meets Big Data Technology: An Astronomy Use Case

Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-03-15 Zhao Zhang, Kyle Barbary, Frank Austin Nothaft, Evan Sparks, Oliver Zahn, Michael J. Franklin, David A. Patterson, Saul Perlmutter

Reproducible Experiments for Comparing Apache Flink and Apache Spark on Public Clouds

Big data processing is a hot topic in today's computer science world. There is a significant demand for analysing big data to satisfy many requirements of many industries. Emergence of the Kappa architecture created a strong requirement for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-17 Shelan Perera, Ashansa Perera, Kamal Hakimzadeh

Exploiting Apache Spark platform for CMS computing analytics

The CERN IT provides a set of Hadoop clusters featuring more than 5 PBytes of raw storage with different open-source, user-level tools available for analytical purposes. The CMS experiment started collecting a large set of computing…

Data Analysis, Statistics and Probability · Physics 2017-11-03 Marco Meoni, Valentin Kuznetsov, Luca Menichetti, Justinas Rumševičius, Tommaso Boccali, Daniele Bonacorsi

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-01 George K. Thiruvathukal, Cameron Christensen, Xiaoyong Jin, François Tessier, Venkatram Vishwanath

Using Big Data Technologies for HEP Analysis

The HEP community is approaching an era where the excellent performance of the particle accelerators in delivering collisions at a high rate will force the experiments to record a large amount of information. The growing size of the datasets…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-02 Matteo Cremonesi, Claudio Bellini, Bianny Bian, Luca Canali, Vasileios Dimakopoulos, Peter Elmer, Ian Fisk, Maria Girone, Oliver Gutsche, Siew-Yan Hoh, Bo Jayatilaka, Viktor Khristenko, Andrea Luiselli, Andrew Melo, Evangelos Evangelos, Dominick Olivito, Jacopo Pazzini, Jim Pivarski, Alexey Svyatkovskiy, Marco Zanetti

A Big Data Analysis Framework Using Apache Spark and Deep Learning

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta, Hardeo Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag

Identifying the potential of Near Data Computing for Apache Spark

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics by being a unified framework for both batch and stream…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-31 Ahsan Javed Awan, Mats Brorsson, Vladimir Vlassov, Eduard Ayguade

Performance Evaluation of Linear Regression Algorithm in Cluster Environment

Cluster computing was introduced to replace the superiority of supercomputers. Cluster computing is able to overcome problems that cannot be dealt with effectively by supercomputers. In this paper, we are going to evaluate the performance…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-15 Cinantya Paramita, Fauzi Adi Rafrastara, Usman Sudibyo, R. I. W. Agung Wibowo

InferSpark: Statistical Inference at Scale

The Apache Spark stack has enabled fast large-scale data processing. Despite a rich library of statistical models and inference algorithms, it does not give domain users the ability to develop their own models. The emergence of…

Databases · Computer Science 2017-10-10 Zhuoyue Zhao, Jialing Pei, Eric Lo, Kenny Q. Zhu, Chris Liu

GeoFlink: A Distributed and Scalable Framework for the Real-time Processing of Spatial Streams

Apache Flink is an open-source system for scalable processing of batch and streaming data. Flink does not natively support efficient processing of spatial data streams, which is a requirement of many applications dealing with spatial data.…

Databases · Computer Science 2020-08-04 Salman Ahmed Shaikh, Komal Mariam, Hiroyuki Kitagawa, Kyoung-Sook Kim

Rethinking Storage Management for Data Processing Pipelines in Cloud Data Centers

Data processing frameworks such as Apache Beam and Apache Spark are used for a wide range of applications, from logs analysis to data preparation for DNN training. It is thus unsurprising that there has been a large amount of work on…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-07 Ubaid Ullah Hafeez, Martin Maas, Mustafa Uysal, Richard McDougall

Efficient Fuzz Testing for Apache Spark Using Framework Abstraction

The emerging data-intensive applications are increasingly dependent on data-intensive scalable computing (DISC) systems, such as Apache Spark, to process large data. Despite their popularity, DISC applications are hard to test. In recent…

Software Engineering · Computer Science 2021-03-10 Qian Zhang, Jiyuan Wang, Muhammad Ali Gulzar, Rohan Padhye, Miryung Kim

Solving All-Pairs Shortest-Paths Problem in Large Graphs Using Apache Spark

Algorithms for computing All-Pairs Shortest-Paths (APSP) are critical building blocks underlying many practical applications. The standard sequential algorithms, such as Floyd-Warshall and Johnson, quickly become infeasible for large input…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-08 Frank Schoeneman, Jaroslaw Zola

On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper…

Databases · Computer Science 2015-07-10 Olivier Curé, Hubert Naacke, Mohamed-Amine Baazizi, Bernd Amann

Understanding the Challenges and Assisting Developers with Developing Spark Applications

To process data more efficiently, big data frameworks provide data abstractions to developers. However, due to the abstraction, there may be many challenges for developers to understand and debug the data processing code. To uncover the…

Software Engineering · Computer Science 2021-03-29 Zehao Wang

CMS Analysis and Data Reduction with Apache Spark

Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-02 Oliver Gutsche, Luca Canali, Illia Cremer, Matteo Cremonesi, Peter Elmer, Ian Fisk, Maria Girone, Bo Jayatilaka, Jim Kowalkowski, Viktor Khristenko, Evangelos Motesnitsalis, Jim Pivarski, Saba Sehrish, Kacper Surdy, Alexey Svyatkovskiy

FAIR: A Hadoop-based Hybrid Model for Faculty Information Retrieval System

In an era of ever-expanding data and knowledge, we lack a centralized system that maps all faculty members to their research work. This problem has not been addressed in the past, and it becomes challenging for students to connect with the right…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-06-27 Noopur Gupta, Rakesh K. Lenka, Rabindra K. Barik, Harishchandra Dubey