Related papers: Analyzing billion-objects catalog interactively: A…
We introduce AXS (Astronomy eXtensions for Spark), a scalable open-source astronomical data analysis framework built on Apache Spark, a widely used industry-standard engine for big data processing. Building on capabilities present in Spark,…
Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to…
Counting pairs of galaxies or stars according to their distance is at the core of real-space correlation analyzes performed in astrophysics and cosmology. Upcoming galaxy surveys (LSST, Euclid) will measure properties of billions of…
We investigate the performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys. Apache Spark attempts to address big data problems have hitherto proved successful in the industry, but…
With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…
Real-world data from diverse domains require real-time scalable analysis. Large-scale data processing frameworks or engines such as Hadoop fall short when results are needed on-the-fly. Apache Spark's streaming library is increasingly…
In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has…
The computation of the skyline provides a mechanism for utilizing multiple location-based criteria to identify optimal data points. However, the efficiency of these computations diminishes and becomes more challenging as the input data…
The Apache Spark stack has enabled fast large-scale data processing. Despite a rich library of statistical models and inference algorithms, it does not give domain users the ability to develop their own models. The emergence of…
Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new…
With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Among many techniques, feature selection has been growing in interest as an important tool to identify relevant features on…
The proliferation of mobile devices, such as smartphones and Internet of Things (IoT) gadgets, results in the recent mobile big data (MBD) era. Collecting MBD is unprofitable unless suitable analytics and learning methods are utilized for…
As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…
The HEP community is approaching an era were the excellent performances of the particle accelerators in delivering collision at high rate will force the experiments to record a large amount of information. The growing size of the datasets…
To process data more efficiently, big data frameworks provide data abstractions to developers. However, due to the abstraction, there may be many challenges for developers to understand and debug the data processing code. To uncover the…
Deploying Machine Learning (ML) algorithms within databases is a challenge due to the varied computational footprints of modern ML algorithms and the myriad of database technologies each with its own restrictive syntax. We introduce an…
Due to the significant importance of Big Data analysis, especially in business-related topics such as improving services, finding potential customers, and selecting practical approaches to manage income and expenses, many companies attempt…
With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analysis Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…
Big data processing is a hot topic in today's computer science world. There is a significant demand for analysing big data to satisfy many requirements of many industries. Emergence of the Kappa architecture created a strong requirement for…
Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations---in particular, many linear algebra computations that are the basis for solving common machine learning…