Related papers: AXS: A framework for fast astronomical data proces…

Analyzing billion-objects catalog interactively: Apache Spark for physicists

Apache Spark is a Big Data framework for working on large distributed datasets. Although widely used in industry, its use remains rather limited in the academic community and is often restricted to software engineers. The goal of this paper is…

Instrumentation and Methods for Astrophysics · Physics 2019-07-17 S. Plaszczynski, J. Peloton, C. Arnault, J. E. Campagne

FITS Data Source for Apache Spark

We investigate the performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys. Apache Spark's attempts to address big data problems have hitherto proved successful in industry, but…

Instrumentation and Methods for Astrophysics · Physics 2018-10-17 Julien Peloton, Christian Arnault, Stéphane Plaszczynski

The Astronomy Commons Platform: A Deployable Cloud-Based Analysis Platform for Astronomy

We present a scalable, cloud-based science platform solution designed to enable next-to-the-data analyses of terabyte-scale astronomical tabular datasets. The presented platform is built on Amazon Web Services (over Kubernetes and S3…

Instrumentation and Methods for Astrophysics · Physics 2022-08-03 Steven Stetzler, Mario Jurić, Kyle Boone, Andrew Connolly, Colin T. Slater, Petar Zečević

Scientific Computing Meets Big Data Technology: An Astronomy Use Case

Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-03-15 Zhao Zhang, Kyle Barbary, Frank Austin Nothaft, Evan Sparks, Oliver Zahn, Michael J. Franklin, David A. Patterson, Saul Perlmutter

Architecture of processing and analysis system for big astronomical data

This work explores the use of big data technologies deployed in the cloud for processing of astronomical data. We have applied Hadoop and Spark to the task of co-adding astronomical images. We compared the overhead and execution time of…

Instrumentation and Methods for Astrophysics · Physics 2017-04-03 Ivan Kolosov, Sergey Gerasimov, Alexander Meshcheryakov

Scaling pair count to next galaxy surveys

Counting pairs of galaxies or stars according to their distance is at the core of real-space correlation analyses performed in astrophysics and cosmology. Upcoming galaxy surveys (LSST, Euclid) will measure properties of billions of…

Instrumentation and Methods for Astrophysics · Physics 2022-01-04 S. Plaszczynski, J. E. Campagne, J. Peloton, C. Arnault

Mining Area Skyline Objects from Map-based Big Data using Apache Spark Framework

The computation of the skyline provides a mechanism for utilizing multiple location-based criteria to identify optimal data points. However, the efficiency of these computations diminishes and becomes more challenging as the input data…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-05 Chen Li, Ye Zhu, Yang Cao, Jinli Zhang, Annisa Annisa, Debo Cheng, Yasuhiko Morimoto

Distributed Streaming Analytics on Large-scale Oceanographic Data using Apache Spark

Real-world data from diverse domains require real-time scalable analysis. Large-scale data processing frameworks or engines such as Hadoop fall short when results are needed on-the-fly. Apache Spark's streaming library is increasingly…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-02 Janak Dahal, Elias Ioup, Shaikh Arifuzzaman, Mahdi Abdelguerfi

GeoFlink: A Distributed and Scalable Framework for the Real-time Processing of Spatial Streams

Apache Flink is an open-source system for scalable processing of batch and streaming data. Flink does not natively support efficient processing of spatial data streams, which is a requirement of many applications dealing with spatial data.…

Databases · Computer Science 2020-08-04 Salman Ahmed Shaikh, Komal Mariam, Hiroyuki Kitagawa, Kyoung-Sook Kim

Astronomical Data Fusion Tool Based on PostgreSQL

With the application of advanced astronomical technologies, equipment and methods all over the world, astronomy now covers the radio, infrared, visible light, ultraviolet, X-ray and gamma-ray bands, and has entered the era of full wavelength…

Instrumentation and Methods for Astrophysics · Physics 2016-11-09 Bo Han, Yanxia Zhang, Shoubo Zhong, Yongheng Zhao

A Big Data Analysis Framework Using Apache Spark and Deep Learning

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta, Hardeo Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag

AFrame: Extending DataFrames for Large-Scale Modern Data Analysis (Extended Version)

Analyzing the increasingly large volumes of data that are available today, possibly including the application of custom machine learning models, requires the utilization of distributed frameworks. This can result in serious productivity…

Databases · Computer Science 2019-08-20 Phanwadee Sinthong, Michael J. Carey

An Information Theoretic Feature Selection Framework for Big Data under Apache Spark

With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Among many techniques, feature selection has been growing in interest as an important tool to identify relevant features on…

Artificial Intelligence · Computer Science 2016-10-20 Sergio Ramírez-Gallego, Héctor Mouriño-Talín, David Martínez-Rego, Verónica Bolón-Canedo, José Manuel Benítez, Amparo Alonso-Betanzos, Francisco Herrera

Reproducible Experiments for Comparing Apache Flink and Apache Spark on Public Clouds

Big data processing is a hot topic in today's computer science world. There is a significant demand for analysing big data to satisfy many requirements of many industries. Emergence of the Kappa architecture created a strong requirement for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-17 Shelan Perera, Ashansa Perera, Kamal Hakimzadeh

Making Access to Astronomical Software More Efficient

Access to astronomical data through archives and VO is essential but does not solve all problems. Availability of appropriate software for analyzing the data is often equally important for the efficiency with which a researcher can publish…

Instrumentation and Methods for Astrophysics · Physics 2010-04-27 P. Grosbol, D. Tody

SMA-X: Versatile information sharing in and around telescopes, and beyond

We developed the SMA eXchange (SMA-X) as a real-time data sharing solution, built atop a central Redis database. SMA-X is a storage convention, facilitated by a set of server-side Lua scripts (or Redis functions) which enable efficient…

Instrumentation and Methods for Astrophysics · Physics 2025-01-29 Attila Kovács, Paul K. Grimes, Christopher Moriarty, Robert Wilson

InferSpark: Statistical Inference at Scale

The Apache Spark stack has enabled fast large-scale data processing. Despite a rich library of statistical models and inference algorithms, it does not give domain users the ability to develop their own models. The emergence of…

Databases · Computer Science 2017-10-10 Zhuoyue Zhao, Jialing Pei, Eric Lo, Kenny Q. Zhu, Chris Liu

SPIN: A Fast and Scalable Matrix Inversion Method in Apache Spark

The growth of big data in domains such as Earth Sciences, Social Networks, Physical Sciences, etc. has led to an immense need for efficient and scalable linear algebra operations, e.g. matrix inversion. Existing methods for efficient and…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-16 Chandan Misra, Sourangshu Bhattacharya, Soumya K. Ghosh

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-01 George K. Thiruvathukal, Cameron Christensen, Xiaoyong Jin, François Tessier, Venkatram Vishwanath

AKARI-CAS --- Online Service for AKARI All-Sky Catalogues

The AKARI All-Sky Catalogues are an important infrared astronomical database for next-generation astronomy that take over the IRAS catalog. We have developed an online service, AKARI Catalogue Archive Server (AKARI-CAS), for astronomers.…

Instrumentation and Methods for Astrophysics · Physics 2011-07-28 C. Yamauchi, S. Fujishima, N. Ikeda, K. Inada, M. Katano, H. Kataza, S. Makiuti, K. Matsuzaki, S. Takita, Y. Yamamoto, I. Yamamura