Related papers: Mining Area Skyline Objects from Map-based Big Dat…

Sampling Based Approximate Skyline Calculation on Big Data

The existing algorithms for processing skyline queries cannot adapt to big data. This paper proposes two approximate skyline algorithms based on sampling. The first algorithm obtains a fixed size sample and computes the approximate skyline…

Databases · Computer Science 2020-10-16 Xingxing Xiao, Jianzhong Li

Optimization Strategies for Parallel Computation of Skylines

Skyline queries are one of the most widely adopted tools for Multi-Criteria Analysis, with applications covering diverse domains, including, e.g., Database Systems, Data Mining, and Decision Making. Skylines indeed offer a useful overview…

Databases · Computer Science 2024-11-25 Paolo Ciaccia, Davide Martinenghi

Partitioning Strategies for Parallel Computation of Flexible Skylines

While classical skyline queries identify interesting data within large datasets, flexible skylines introduce preferences through constraints on attribute weights, and further reduce the data returned. However, computing these queries can be…

Databases · Computer Science 2025-01-08 Emilio De Lorenzis, Davide Martinenghi

Integration of Skyline Queries into Spark SQL

Skyline queries are frequently used in data analytics and multi-criteria decision support applications to filter relevant information from big amounts of data. Apache Spark is a popular framework for processing big, distributed data. The…

Databases · Computer Science 2022-10-10 Lukas Grasmann, Reinhard Pichler, Alexander Selzer

Analyzing billion-objects catalog interactively: Apache Spark for physicists

Apache Spark is a Big Data framework for working on large distributed datasets. Although widely used in the industry, it remains rather limited in the academic community or often restricted to software engineers. The goal of this paper is…

Instrumentation and Methods for Astrophysics · Physics 2019-07-17 S. Plaszczynski, J. Peloton, C. Arnault, J. E. Campagne

Computing Skylines on Distributed Data

In this paper we study skyline queries in the distributed computational model, where we have $s$ remote sites and a central coordinator (the query node); each site holds a piece of data, and the coordinator wants to compute the skyline of…

Databases · Computer Science 2016-11-03 Haoyu Zhang, Qin Zhang

Large-scale text processing pipeline with Apache Spark

In this paper, we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has…

Computation and Language · Computer Science 2019-12-03 Alexey Svyatkovskiy, Kosuke Imai, Mary Kroeger, Yuki Shiraito

SkyCell: A Space-Pruning Based Parallel Skyline Algorithm

Skyline computation is an essential database operation that has many applications in multi-criteria decision making scenarios such as recommender systems. Existing algorithms have focused on checking point domination, which lack efficiency…

Databases · Computer Science 2021-07-22 Chuanwen Li, Yu Gu, Jianzhong Qi, Ge Yu

A Survey of Skyline Query Processing

Living in the Information Age allows almost everyone to have access to a large amount of information and options to choose from in order to fulfill their needs. In many cases, the amount of information available and the rate of change may hide…

Databases · Computer Science 2017-04-07 Christos Kalyvas, Theodoros Tzouramanis

Efficient Computation of Subspace Skyline over Categorical Domains

Platforms such as AirBnB, Zillow, Yelp, and related sites have transformed the way we search for accommodation, restaurants, etc. The underlying datasets in such applications have numerous attributes that are mostly Boolean or Categorical.…

Databases · Computer Science 2017-05-31 Md Farhadur Rahman, Abolfazl Asudeh, Nick Koudas, Gautam Das

Probabilistic Skyline Query Processing over Uncertain Data Streams in Edge Computing Environments

With the advancement of technology, the data generated in our lives is getting faster and faster, and the amount of data that various applications need to process becomes extremely huge. Therefore, we need to put more effort into analyzing…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-08 Chuan-Chi Lai, Chuan-Ming Liu, Yan-Lin Chen, Li-Chun Wang

Spatial Skyline Queries: An Efficient Geometric Algorithm

As more data-intensive applications emerge, advanced retrieval semantics, such as ranking or skylines, have attracted attention. Geographic information systems are such an application with massive spatial data. Our goal is to efficiently…

Databases · Computer Science 2009-03-19 Wanbin Son, Mu-Woong Lee, Hee-Kap Ahn, Seung-won Hwang

Understanding and Optimizing the Performance of Distributed Machine Learning Applications on Apache Spark

In this paper we explore the performance limits of Apache Spark for machine learning applications. We begin by analyzing the characteristics of a state-of-the-art distributed machine learning algorithm implemented in Spark and compare it to…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-21 Celestine Dünner, Thomas Parnell, Kubilay Atasu, Manolis Sifalakis, Haralampos Pozidis

Caching Stars in the Sky: A Semantic Caching Approach to Accelerate Skyline Queries

Multi-criteria decision making has been made possible with the advent of skyline queries. However, processing such queries for high dimensional datasets remains a time consuming task. Real-time applications are thus infeasible, especially…

Databases · Computer Science 2011-06-13 Arnab Bhattacharya, B. Palvali Teja, Sourav Dutta

Scientific Computing Meets Big Data Technology: An Astronomy Use Case

Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-03-15 Zhao Zhang, Kyle Barbary, Frank Austin Nothaft, Evan Sparks, Oliver Zahn, Michael J. Franklin, David A. Patterson, Saul Perlmutter

Optimizing and accelerating space-time Ripley's K function based on Apache Spark for distributed spatiotemporal point pattern analysis

With increasing point of interest (POI) datasets available with fine-grained spatial and temporal attributes, space-time Ripley's K function has been regarded as a powerful approach to analyze spatiotemporal point process. However,…

Computation · Statistics 2019-12-11 Yuan Wang, Zhipeng Gui, Huayi Wu, Dehua Peng, Jinghang Wu, Zousen Cui

Scaling pair count to next galaxy surveys

Counting pairs of galaxies or stars according to their distance is at the core of real-space correlation analyzes performed in astrophysics and cosmology. Upcoming galaxy surveys (LSST, Euclid) will measure properties of billions of…

Instrumentation and Methods for Astrophysics · Physics 2022-01-04 S. Plaszczynski, J. E. Campagne, J. Peloton, C. Arnault

AXS: A framework for fast astronomical data processing based on Apache Spark

We introduce AXS (Astronomy eXtensions for Spark), a scalable open-source astronomical data analysis framework built on Apache Spark, a widely used industry-standard engine for big data processing. Building on capabilities present in Spark,…

Instrumentation and Methods for Astrophysics · Physics 2019-07-10 Petar Zečević, Colin T. Slater, Mario Jurić, Andrew J. Connolly, Sven Lončarić, Eric C. Bellm, V. Zach Golkhou, Krzysztof Suberlak

SPIN: A Fast and Scalable Matrix Inversion Method in Apache Spark

The growth of big data in domains such as Earth Sciences, Social Networks, Physical Sciences, etc. has led to an immense need for efficient and scalable linear algebra operations, e.g. Matrix inversion. Existing methods for efficient and…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-16 Chandan Misra, Sourangshu Bhattacharya, Soumya K. Ghosh

A Big Data Analysis Framework Using Apache Spark and Deep Learning

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta, Hardeo Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag