Related papers: Extensible Data Skipping

GeoFlink: A Distributed and Scalable Framework for the Real-time Processing of Spatial Streams

Apache Flink is an open-source system for scalable processing of batch and streaming data. Flink does not natively support efficient processing of spatial data streams, which is a requirement of many applications dealing with spatial data.…

Databases · Computer Science 2020-08-04 Salman Ahmed Shaikh, Komal Mariam, Hiroyuki Kitagawa, Kyoung-Sook Kim

Humboldt: Metadata-Driven Extensible Data Discovery

Data discovery is crucial for data management and analysis and can benefit from better utilization of metadata. For example, users may want to search data using queries like "find the tables created by Alex and endorsed by Mike that…

Databases · Computer Science 2024-08-22 Alex Bäuerle, Çağatay Demiralp, Michael Stonebraker

Pushing the Boundaries of Crowd-enabled Databases with Query-driven Schema Expansion

By incorporating human workers into the query execution process, crowd-enabled databases facilitate intelligent, social capabilities like completing missing data at query time or performing cognitive operators. But despite all their…

Databases · Computer Science 2015-03-20 Joachim Selke, Christoph Lofi, Wolf-Tilo Balke

A Survey on Geographically Distributed Big-Data Processing using MapReduce

Hadoop and Spark are widely used distributed processing frameworks for large-scale data processing in an efficient and fault-tolerant manner on private or public clouds. These big-data processing systems are extensively used by many…

Databases · Computer Science 2017-07-07 Shlomi Dolev, Patricia Florissi, Ehud Gudes, Shantanu Sharma, Ido Singer

Accelerating Large-scale Data Exploration through Data Diffusion

Data-intensive applications often require exploratory analysis of large datasets. If analysis is performed on distributed resources, data locality can be crucial to high throughput and performance. We propose a "data diffusion" approach…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-17 Ioan Raicu, Yong Zhao, Ian Foster, Alex Szalay

A Linked Data Application Framework to Enable Rapid Prototyping

Application developers, in our experience, tend to hesitate when dealing with linked data technologies. To reduce their initial hurdle and enable rapid prototyping, we propose in this paper a framework for building linked data applications.…

Databases · Computer Science 2021-04-29 Markus Schröder, Christian Jilek, Andreas Dengel

Understanding the Challenges and Assisting Developers with Developing Spark Applications

To process data more efficiently, big data frameworks provide data abstractions to developers. However, due to the abstraction, there may be many challenges for developers to understand and debug the data processing code. To uncover the…

Software Engineering · Computer Science 2021-03-29 Zehao Wang

DySkew: Dynamic Data Redistribution for Skew-Resilient Snowpark UDF Execution

Snowflake revolutionized data warehousing with an elastic architecture that decouples compute and storage, enabling scalable solutions for diverse data analytics needs. Building on this foundation, Snowflake has advanced its AI Data Cloud…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-15 Chenwei Xie, Urjeet Shrestha, Corbin McElhanney, Lukas Lorimer, Gopal V, Zihao Ye, Yi Pan, Nic Crouch, Elliott Brossard, Florian Funke, Yuxiong He

MetaFlow: a Scalable Metadata Lookup Service for Distributed File Systems in Data Centers

In large-scale distributed file systems, efficient metadata operations are critical since most file operations have to interact with metadata servers first. In existing distributed hash table (DHT) based metadata management systems, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-11 Peng Sun, Yonggang Wen, Ta Nguyen Binh Duong, Haiyong Xie

HiFrames: High Performance Data Frames in a Scripting Language

Data frames in scripting languages are essential abstractions for processing structured data. However, existing data frame solutions are either not distributed (e.g., Pandas in Python) and therefore have limited scalability, or they are not…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-11 Ehsan Totoni, Wajih Ul Hassan, Todd A. Anderson, Tatiana Shpeisman

Technical Report: On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-30 Bilal Akil, Ying Zhou, Uwe Röhm

In-Memory Indexed Caching for Distributed Data Processing

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de-facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-09 Alexandru Uta, Bogdan Ghit, Ankur Dave, Jan Rellermeyer, Peter Boncz

Object Proxy Patterns for Accelerating Distributed Applications

Workflow and serverless frameworks have empowered new approaches to distributed application design by abstracting compute resources. However, their typically limited or one-size-fits-all support for advanced data flow patterns leaves…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-03 J. Gregory Pauloski, Valerie Hayot-Sasson, Logan Ward, Alexander Brace, André Bauer, Kyle Chard, Ian Foster

ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation

Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller…

Digital Libraries · Computer Science 2017-02-06 Helge Holzmann, Vinay Goel, Avishek Anand

Provenance-based Data Skipping (TechReport)

Database systems analyze queries to determine upfront which data is needed for answering them and use indexes and other physical design techniques to speed-up access to that data. However, for important classes of queries, e.g., HAVING and…

Databases · Computer Science 2021-05-31 Xing Niu, Ziyu Liu, Pengyuan Li, Boris Glavic

Evaluating the Impact Of Spatial Features Of Mobility Data and Index Choice On Database Performance

The growing number of moving Internet-of-Things (IoT) devices has led to a surge in moving object data, powering applications such as traffic routing, hotspot detection, or weather forecasting. When managing such data, spatial database…

Databases · Computer Science 2025-10-22 Tim C. Rese, Alexandra Kapp, David Bermbach

SOFA: An Extensible Logical Optimizer for UDF-heavy Dataflows

Recent years have seen an increased interest in large-scale analytical dataflows on non-relational data. These dataflows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant…

Databases · Computer Science 2013-11-26 Astrid Rheinländer, Arvid Heise, Fabian Hueske, Ulf Leser, Felix Naumann

Enhancing Software Development Process (ESDP) using Data Mining Integrated Environment

Nowadays, it has become a basic need to reuse existing Application Programming Interface (API), Class Libraries, and frameworks for rapid software development. Software developers often reuse this by calling the respective APIs or…

Software Engineering · Computer Science 2020-05-07 Ziaur Rahman

A Survey on Spark Ecosystem for Big Data Processing

With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-17 Shanjiang Tang, Bingsheng He, Ce Yu, Yusen Li, Kun Li

Data Diffusion: Dynamic Resource Provision and Data-Aware Scheduling for Data Intensive Applications

Data intensive applications often involve the analysis of large datasets that require large amounts of compute and storage resources. While dedicated compute and/or storage farms offer good task/data throughput, they suffer low resource…

Distributed, Parallel, and Cluster Computing · Computer Science 2008-08-27 Ioan Raicu, Yong Zhao, Ian Foster, Alex Szalay