Related papers: Apache VXQuery: A Scalable XQuery Implementation

Parallelization of XPath Queries using Modern XQuery Processors

A practical and promising approach to parallelizing XPath queries was proposed by Bordawekar et al. in 2009, which enables parallelization on top of existing XML database engines. Although they experimentally demonstrated the speedup by…

Databases · Computer Science 2018-06-21 Shigeyuki Sato , Wei Hao , Kiminori Matsuzaki

Scalable Relational Query Processing on Big Matrix Data

The use of large-scale machine learning methods is becoming ubiquitous in many applications ranging from business intelligence to self-driving cars. These methods require a complex computation pipeline consisting of various types of…

Databases · Computer Science 2021-11-10 Yongyang Yu , Mingjie Tang , Walid G. Aref

Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and…

Databases · Computer Science 2020-10-09 Edmon Begoli , Jesús Camacho Rodríguez , Julian Hyde , Michael J. Mior , Daniel Lemire

AFrame: Extending DataFrames for Large-Scale Modern Data Analysis (Extended Version)

Analyzing the increasingly large volumes of data that are available today, possibly including the application of custom machine learning models, requires the utilization of distributed frameworks. This can result in serious productivity…

Databases · Computer Science 2019-08-20 Phanwadee Sinthong , Michael J. Carey

PolyFrame: A Retargetable Query-based Approach to Scaling DataFrames (Extended Version)

In the last few years, the field of data science has been growing rapidly as various businesses have adopted statistical and machine learning techniques to empower their decision making and applications. Scaling data analysis, possibly…

Databases · Computer Science 2021-02-11 Phanwadee Sinthong , Michael J. Carey

An Application Driven Analysis of the ParalleX Execution Model

Exascale systems, expected to emerge by the end of the next decade, will require the exploitation of billion-way parallelism at multiple hierarchical levels in order to achieve the desired sustained performance. The task of assessing future…

Distributed, Parallel, and Cluster Computing · Computer Science 2011-09-27 Matthew Anderson , Maciej Brodowicz , Hartmut Kaiser , Thomas Sterling

Performance Evaluation of Query Plan Recommendation with Apache Hadoop and Apache Spark

Access plan recommendation is a query optimization approach that executes new queries using prior created query execution plans (QEPs). The query optimizer divides the query space into clusters in the mentioned method. However, traditional…

Databases · Computer Science 2022-10-14 Elham Azhir , Mehdi Hosseinzadeh , Faheem Khan , Amir Mosavi

On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper…

Databases · Computer Science 2015-07-10 Olivier Curé , Hubert Naacke , Mohamed-Amine Baazizi , Bernd Amann

SPARQL query processing with Apache Spark

The number of linked data sources and the size of the linked open data graph keep growing every day. As a consequence, semantic RDF services are more and more confronted to various "big data" problems. Query processing is one of them and…

Databases · Computer Science 2016-11-04 Hubert Naacke , Olivier Curé , Bernd Amann

Benchmarking scalability of stream processing frameworks deployed as microservices in the cloud

Context: The combination of distributed stream processing with microservice architectures is an emerging pattern for building data-intensive software systems. In such systems, stream processing frameworks such as Apache Flink, Apache Kafka…

Software Engineering · Computer Science 2023-11-02 Sören Henning , Wilhelm Hasselbring

Evaluating Hive and Spark SQL with BigBench

The objective of this work was to utilize BigBench [1] as a Big Data benchmark and evaluate and compare two processing engines: MapReduce [2] and Spark [3]. MapReduce is the established engine for processing data on Hadoop. Spark is a…

Databases · Computer Science 2016-01-14 Todor Ivanov , Max-Georg Beer

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-01 George K. Thiruvathukal , Cameron Christensen , Xiaoyong Jin , François Tessier , Venkatram Vishwanath

Automatic Parallelization of Sequential Programs

Prior work on Automatically Scalable Computation (ASC) suggests that it is possible to parallelize sequential computation by building a model of whole-program execution, using that model to predict future computations, and then…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-09-21 Peter Kraft , Amos Waterland , Daniel Y Fu , Anitha Gollamudi , Shai Szulanski , Margo Seltzer

High-Speed Query Processing over High-Speed Networks

Modern database clusters entail two levels of networks: connecting CPUs and NUMA regions inside a single server in the small and multiple servers in the large. The huge performance gap between these two types of networks used to slow down…

Databases · Computer Science 2015-11-03 Wolf Roediger , Tobias Muehlbauer , Alfons Kemper , Thomas Neumann

A New Execution Model and Executor for Adaptively Optimizing the Performance of Parallel Algorithms Using HPX Runtime System

Developing parallel algorithms efficiently requires careful management of concurrency across diverse hardware architectures. C++ executors provide a standardized interface that simplifies the development process, allowing developers to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-22 Karame Mohammadiporshokooh , Steven R. Brandt , Hartmut Kaiser

Rumble: Data Independence for Large Messy Data Sets

This paper introduces Rumble, a query execution engine for large, heterogeneous, and nested collections of JSON objects built on top of Apache Spark. While data sets of this type are more and more wide-spread, most existing tools are built…

Databases · Computer Science 2020-10-21 Ingo Müller , Ghislain Fourny , Stefan Irimescu , Can Berker Cikis , Gustavo Alonso

Squire: A General-Purpose Accelerator to Exploit Fine-Grain Parallelism on Dependency-Bound Kernels

Multiple HPC applications are often bottlenecked by compute-intensive kernels implementing complex dependency patterns (data-dependency bound). Traditional general-purpose accelerators struggle to effectively exploit fine-grain parallelism…

Hardware Architecture · Computer Science 2025-10-24 Rubén Langarita , Jesús Alastruey-Benedé , Pablo Ibáñez-Marín , Santiago Marco-Sola , Miquel Moretó , Adrià Armejach

Towards a Scalable and Distributed Infrastructure for Deep Learning Applications

Although recent scaling up approaches to training deep neural networks have proven to be effective, the computational intensity of large and complex models, as well as the availability of large-scale datasets, require deep learning…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-04-21 Bita Hasheminezhad , Shahrzad Shirzad , Nanmiao Wu , Patrick Diehl , Hannes Schulz , Hartmut Kaiser

Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

Apache Hive is an open-source relational database system for analytic big-data workloads. In this paper we describe the key innovations on the journey from batch tool to fully fledged enterprise data warehousing system. We present a hybrid…

Databases · Computer Science 2019-03-27 Jesús Camacho-Rodríguez , Ashutosh Chauhan , Alan Gates , Eugene Koifman , Owen O'Malley , Vineet Garg , Zoltan Haindrich , Sergey Shelukhin , Prasanth Jayachandran , Siddharth Seth , Deepak Jaiswal , Slim Bouguerra , Nishant Bangarwa , Sankar Hariappan , Anishek Agarwal , Jason Dere , Daniel Dai , Thejas Nair , Nita Dembla , Gopal Vijayaraghavan , Günther Hagleitner

The Family of MapReduce and Large Scale Data Processing Systems

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a…

Databases · Computer Science 2013-02-14 Sherif Sakr , Anna Liu , Ayman G. Fayoumi