Related papers: LocationSpark: In-memory Distributed Spatial Query…

Effective Spatial Data Partitioning for Scalable Query Processing

Recently, MapReduce based spatial query systems have emerged as a cost effective and scalable solution to large scale spatial data processing and analytics. MapReduce based systems achieve massive scalability by partitioning the data and…

Databases · Computer Science 2015-09-04 Ablimit Aji, Vo Hoang, Fusheng Wang

Scalable Relational Query Processing on Big Matrix Data

The use of large-scale machine learning methods is becoming ubiquitous in many applications ranging from business intelligence to self-driving cars. These methods require a complex computation pipeline consisting of various types of…

Databases · Computer Science 2021-11-10 Yongyang Yu, Mingjie Tang, Walid G. Aref

Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?

Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-08 Jonathan Will, Lauritz Thamsen, Dominik Scheinert, Odej Kao

Cache-based Multi-query Optimization for Data-intensive Scalable Computing Frameworks

In modern large-scale distributed systems, analytics jobs submitted by various users often share similar work, for example scanning and processing the same subset of data. Instead of optimizing jobs independently, which may result in…

Databases · Computer Science 2018-05-23 Pietro Michiardi, Damiano Carra, Sara Migliorini

SCOPE: Scalable Composite Optimization for Learning on Spark

Many machine learning models, such as logistic regression (LR) and support vector machine (SVM), can be formulated as composite optimization problems. Recently, many distributed stochastic optimization (DSO) methods have been proposed to…

Machine Learning · Statistics 2016-12-13 Shen-Yi Zhao, Ru Xiang, Ying-Hao Shi, Peng Gao, Wu-Jun Li

Enhancing In-Memory Spatial Indexing with Learned Search

Spatial data is ubiquitous. Massive amounts of data are generated every day from a plethora of sources such as billions of GPS-enabled devices (e.g., cell phones, cars, and sensors), consumer-based applications (e.g., Uber and Strava), and…

Databases · Computer Science 2023-09-13 Varun Pandey, Alexander van Renen, Eleni Tzirita Zacharatou, Andreas Kipf, Ibrahim Sabek, Jialin Ding, Volker Markl, Alfons Kemper

NumS: Scalable Array Programming for the Cloud

Scientists increasingly rely on Python tools to perform scalable distributed memory array operations using rich, NumPy-like expressions. However, many of these tools rely on dynamic schedulers optimized for abstract task graphs, which often…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-07-14 Melih Elibol, Vinamra Benara, Samyu Yagati, Lianmin Zheng, Alvin Cheung, Michael I. Jordan, Ion Stoica

Multi-Resource Parallel Query Scheduling and Optimization

Scheduling query execution plans is a particularly complex problem in shared-nothing parallel systems, where each site consists of a collection of local time-shared (e.g., CPU(s) or disk(s)) and space-shared (e.g., memory) resources and…

Databases · Computer Science 2014-04-01 Minos Garofalakis, Yannis Ioannidis

Partitioning, Indexing and Querying Spatial Data on Cloud

The number of mobile devices (e.g., smartphones, wearable technologies) is rapidly growing. In line with this trend, a massive amount of spatial data is being collected since these devices allow users to geo-tag user-generated content.…

Databases · Computer Science 2016-12-20 Afsin Akdogan

REPOSE: Distributed Top-k Trajectory Similarity Search with Local Reference Point Tries

Trajectory similarity computation is a fundamental component in a variety of real-world applications, such as ridesharing, road planning, and transportation optimization. Recent advances in mobile devices have enabled an unprecedented…

Databases · Computer Science 2021-01-27 Bolong Zheng, Lianggui Weng, Xi Zhao, Kai Zeng, Xiaofang Zhou, Christian S. Jensen

A Stochastic Large-scale Machine Learning Algorithm for Distributed Features and Observations

As the size of modern data sets exceeds the disk and memory capacities of a single computer, machine learning practitioners have resorted to parallel and distributed computing. Given that optimization is one of the pillars of machine…

Machine Learning · Statistics 2019-12-10 Biyi Fang, Diego Klabjan

Data Placement and Replica Selection for Improving Co-location in Distributed Environments

Increasing need for large-scale data analytics in a number of application domains has led to a dramatic rise in the number of distributed data management systems, both parallel relational databases, and systems that support alternative…

Databases · Computer Science 2013-02-19 K. Ashwin Kumar, Amol Deshpande, Samir Khuller

Skip Hash: A Fast Ordered Map Via Software Transactional Memory

Scalable ordered maps must ensure that range queries, which operate over many consecutive keys, provide intuitive semantics (e.g., linearizability) without degrading the performance of concurrent insertions and removals. These goals are…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-11 Matthew Rodriguez, Vitaly Aksenov, Michael Spear

Sparkle: Optimizing Spark for Large Memory Machines and Analytics

Spark is an in-memory analytics platform that targets commodity server environments today. It relies on the Hadoop Distributed File System (HDFS) to persist intermediate checkpoint states and final processing results. In Spark, immutable…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-22 Mijung Kim, Jun Li, Haris Volos, Manish Marwah, Alexander Ulanov, Kimberly Keeton, Joseph Tucek, Lucy Cherkasova, Le Xu, Pradeep Fernando

Diagonal Scaling: A Multi-Dimensional Resource Model and Optimization Framework for Distributed Databases

Modern cloud databases present scaling as a binary decision: scale-out by adding nodes or scale-up by increasing per-node resources. This one-dimensional view is limiting because database performance, cost, and coordination overhead emerge…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-05 Shahir Abdullah, Syed Rohit Zaman

Query and Resource Optimizations: A Case for Breaking the Wall in Big Data Systems

Modern big data systems run on cloud environments where resources are shared amongst several users and applications. As a result, declarative user queries in these environments need to be optimized and executed over resources that…

Databases · Computer Science 2019-06-18 Alekh Jindal, Lalitha Viswanathan, Konstantinos Karanasos

A Survey on Spark Ecosystem for Big Data Processing

With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-17 Shanjiang Tang, Bingsheng He, Ce Yu, Yusen Li, Kun Li

WISK: A Workload-aware Learned Index for Spatial Keyword Queries

Spatial objects often come with textual information, such as Points of Interest (POIs) with their descriptions, which are referred to as geo-textual data. To retrieve such data, spatial keyword queries that take into account both spatial…

Databases · Computer Science 2023-04-17 Yufan Sheng, Xin Cao, Yixiang Fang, Kaiqi Zhao, Jianzhong Qi, Gao Cong, Wenjie Zhang

Performance Evaluation of Query Plan Recommendation with Apache Hadoop and Apache Spark

Access plan recommendation is a query optimization approach that executes new queries using previously created query execution plans (QEPs). In this method, the query optimizer divides the query space into clusters. However, traditional…

Databases · Computer Science 2022-10-14 Elham Azhir, Mehdi Hosseinzadeh, Faheem Khan, Amir Mosavi

STREAK: An Efficient Engine for Processing Top-k SPARQL Queries with Spatial Filters

The importance of geo-spatial data in critical applications such as emergency response, transportation, and agriculture has prompted the adoption of the recent GeoSPARQL standard in many RDF processing engines. In addition to large…

Databases · Computer Science 2017-10-23 Jyoti Leeka, Srikanta Bedathur, Debajyoti Bera, Sriram Lakshminarasimhan