Related papers: Analyzing billion-objects catalog interactively: A…

200 papers

We introduce AXS (Astronomy eXtensions for Spark), a scalable open-source astronomical data analysis framework built on Apache Spark, a widely used industry-standard engine for big data processing. Building on capabilities present in Spark,…

Instrumentation and Methods for Astrophysics · Physics 2019-07-10 Petar Zečević, Colin T. Slater, Mario Jurić, Andrew J. Connolly, Sven Lončarić, Eric C. Bellm, V. Zach Golkhou, Krzysztof Suberlak

Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-03-15 Zhao Zhang, Kyle Barbary, Frank Austin Nothaft, Evan Sparks, Oliver Zahn, Michael J. Franklin, David A. Patterson, Saul Perlmutter

Counting pairs of galaxies or stars according to their distance is at the core of real-space correlation analyses performed in astrophysics and cosmology. Upcoming galaxy surveys (LSST, Euclid) will measure properties of billions of…

Instrumentation and Methods for Astrophysics · Physics 2022-01-04 S. Plaszczynski, J. E. Campagne, J. Peloton, C. Arnault

We investigate the performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys. Apache Spark's attempts to address big data problems have hitherto proved successful in industry, but…

Instrumentation and Methods for Astrophysics · Physics 2018-10-17 Julien Peloton, Christian Arnault, Stéphane Plaszczynski

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta, Hardeo Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag

Real-world data from diverse domains require real-time scalable analysis. Large-scale data processing frameworks or engines such as Hadoop fall short when results are needed on-the-fly. Apache Spark's streaming library is increasingly…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-02 Janak Dahal, Elias Ioup, Shaikh Arifuzzaman, Mahdi Abdelguerfi

In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has…

Computation and Language · Computer Science 2019-12-03 Alexey Svyatkovskiy, Kosuke Imai, Mary Kroeger, Yuki Shiraito

The computation of the skyline provides a mechanism for utilizing multiple location-based criteria to identify optimal data points. However, the efficiency of these computations diminishes and becomes more challenging as the input data…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-05 Chen Li, Ye Zhu, Yang Cao, Jinli Zhang, Annisa Annisa, Debo Cheng, Yasuhiko Morimoto

The Apache Spark stack has enabled fast large-scale data processing. Despite a rich library of statistical models and inference algorithms, it does not give domain users the ability to develop their own models. The emergence of…

Databases · Computer Science 2017-10-10 Zhuoyue Zhao, Jialing Pei, Eric Lo, Kenny Q. Zhu, Chris Liu

Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new…

With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Among many techniques, feature selection has been growing in interest as an important tool to identify relevant features on…

The proliferation of mobile devices, such as smartphones and Internet of Things (IoT) gadgets, results in the recent mobile big data (MBD) era. Collecting MBD is unprofitable unless suitable analytics and learning methods are utilized for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-16 Mohammad Abu Alsheikh, Dusit Niyato, Shaowei Lin, Hwee-Pink Tan, Zhu Han

As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-01 George K. Thiruvathukal, Cameron Christensen, Xiaoyong Jin, François Tessier, Venkatram Vishwanath

The HEP community is approaching an era where the excellent performance of the particle accelerators in delivering collisions at a high rate will force the experiments to record a large amount of information. The growing size of the datasets…

To process data more efficiently, big data frameworks provide data abstractions to developers. However, due to the abstraction, there may be many challenges for developers to understand and debug the data processing code. To uncover the…

Software Engineering · Computer Science 2021-03-29 Zehao Wang

Deploying Machine Learning (ML) algorithms within databases is a challenge due to the varied computational footprints of modern ML algorithms and the myriad of database technologies each with its own restrictive syntax. We introduce an…

Due to the significant importance of Big Data analysis, especially in business-related topics such as improving services, finding potential customers, and selecting practical approaches to manage income and expenses, many companies attempt…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-01 Mohammad Sina Kiarostami

With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-17 Shanjiang Tang, Bingsheng He, Ce Yu, Yusen Li, Kun Li

Big data processing is a hot topic in today's computer science world. There is a significant demand for analysing big data to satisfy the requirements of many industries. The emergence of the Kappa architecture created a strong requirement for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-17 Shelan Perera, Ashansa Perera, Kamal Hakimzadeh

Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations (in particular, many linear algebra computations that are the basis for solving common machine learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-05-31 Alex Gittens, Kai Rothauge, Shusen Wang, Michael W. Mahoney, Lisa Gerhardt, Prabhat, Jey Kottalam, Michael Ringenburg, Kristyn Maschhoff