Related papers: AXS: A framework for fast astronomical data proces…
Apache Spark is a Big Data framework for working on large distributed datasets. Although widely used in the industry, it remains rather limited in the academic community or often restricted to software engineers. The goal of this paper is…
We investigate the performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys. Apache Spark attempts to address big data problems have hitherto proved successful in the industry, but…
We present a scalable, cloud-based science platform solution designed to enable next-to-the-data analyses of terabyte-scale astronomical tabular datasets. The presented platform is built on Amazon Web Services (over Kubernetes and S3…
Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to…
This work explores the use of big data technologies deployed in the cloud for processing of astronomical data. We have applied Hadoop and Spark to the task of co-adding astronomical images. We compared the overhead and execution time of…
Counting pairs of galaxies or stars according to their distance is at the core of real-space correlation analyzes performed in astrophysics and cosmology. Upcoming galaxy surveys (LSST, Euclid) will measure properties of billions of…
The computation of the skyline provides a mechanism for utilizing multiple location-based criteria to identify optimal data points. However, the efficiency of these computations diminishes and becomes more challenging as the input data…
Real-world data from diverse domains require real-time scalable analysis. Large-scale data processing frameworks or engines such as Hadoop fall short when results are needed on-the-fly. Apache Spark's streaming library is increasingly…
Apache Flink is an open-source system for scalable processing of batch and streaming data. Flink does not natively support efficient processing of spatial data streams, which is a requirement of many applications dealing with spatial data.…
With the application of advanced astronomical technologies, equipments and methods all over the world, astronomy covers from radio, infrared, visible light, ultraviolet, X-ray and gamma ray band, and enters into the era of full wavelength…
With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…
Analyzing the increasingly large volumes of data that are available today, possibly including the application of custom machine learning models, requires the utilization of distributed frameworks. This can result in serious productivity…
With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Among many techniques, feature selection has been growing in interest as an important tool to identify relevant features on…
Big data processing is a hot topic in today's computer science world. There is a significant demand for analysing big data to satisfy many requirements of many industries. Emergence of the Kappa architecture created a strong requirement for…
Access to astronomical data through archives and VO is essential but does not solve all problems. Availability of appropriate software for analyzing the data is often equally important for the efficiency with which a researcher can publish…
We developed the SMA eXchange (SMA-X) as a real-time data sharing solution, built atop a central Redis database. SMA-X is a storage convention, facilitated by a set of server-side Lua scripts (or Redis functions) which enable efficient…
The Apache Spark stack has enabled fast large-scale data processing. Despite a rich library of statistical models and inference algorithms, it does not give domain users the ability to develop their own models. The emergence of…
The growth of big data in domains such as Earth Sciences, Social Networks, Physical Sciences, etc. has lead to an immense need for efficient and scalable linear algebra operations, e.g. Matrix inversion. Existing methods for efficient and…
As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…
The AKARI All-Sky Catalogues are an important infrared astronomical database for next-generation astronomy that take over the IRAS catalog. We have developed an online service, AKARI Catalogue Archive Server (AKARI-CAS), for astronomers.…