Related papers: Scalable Relational Query Processing on Big Matrix…

LocationSpark: In-memory Distributed Spatial Query Processing and Optimization

Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this paper, we present new techniques…

Databases · Computer Science 2019-07-17 Mingjie Tang, Yongyang Yu, Walid G. Aref, Ahmed R. Mahmood, Qutaibah M. Malluhi, Mourad Ouzzani

Cache-based Multi-query Optimization for Data-intensive Scalable Computing Frameworks

In modern large-scale distributed systems, analytics jobs submitted by various users often share similar work, for example scanning and processing the same subset of data. Instead of optimizing jobs independently, which may result in…

Databases · Computer Science 2018-05-23 Pietro Michiardi, Damiano Carra, Sara Migliorini

Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?

Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-08 Jonathan Will, Lauritz Thamsen, Dominik Scheinert, Odej Kao

Auto-Differentiation of Relational Computations for Very Large Scale Machine Learning

The relational data model was designed to facilitate large-scale data management and analytics. We consider the problem of how to differentiate computations expressed relationally. We show experimentally that a relational engine running an…

Machine Learning · Computer Science 2023-06-08 Yuxin Tang, Zhimin Ding, Dimitrije Jankov, Binhang Yuan, Daniel Bourgeois, Chris Jermaine

Matrix Computations and Optimization in Apache Spark

We describe matrix computations available in the cluster programming framework, Apache Spark. Out of the box, Spark provides abstractions and implementations for distributed matrices and optimization routines using these matrices. When…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-07-14 Reza Bosagh Zadeh, Xiangrui Meng, Aaron Staple, Burak Yavuz, Li Pu, Shivaram Venkataraman, Evan Sparks, Alexander Ulanov, Matei Zaharia

Scaling-Up In-Memory Datalog Processing: Observations and Techniques

Recursive query processing has experienced a recent resurgence, as a result of its use in many modern application domains, including data integration, graph analytics, security, program analysis, networking and decision making. Due to the…

Databases · Computer Science 2018-12-11 Zhiwei Fan, Jianqiao Zhu, Zuyu Zhang, Aws Albarghouthi, Paraschos Koutris, Jignesh Patel

Large-Scale Intelligent Microservices

Deploying Machine Learning (ML) algorithms within databases is a challenge due to the varied computational footprints of modern ML algorithms and the myriad of database technologies each with its own restrictive syntax. We introduce an…

Artificial Intelligence · Computer Science 2022-03-17 Mark Hamilton, Nick Gonsalves, Christina Lee, Anand Raman, Brendan Walsh, Siddhartha Prasad, Dalitso Banda, Lucy Zhang, Mei Gao, Lei Zhang, William T. Freeman

Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Lauritz Thamsen, Dominik Scheinert, Jonathan Will, Jonathan Bader, Odej Kao

Efficient query evaluation techniques over large amount of distributed linked data

As RDF becomes more widely established and the amount of linked data is rapidly increasing, the efficient querying of large amounts of data becomes a significant challenge. In this paper, we propose a family of algorithms for querying large…

Databases · Computer Science 2022-09-13 Eleftherios Kalogeros, Manolis Gergatsoulis, Matthew Damigos, Christos Nomikos

Effective Spatial Data Partitioning for Scalable Query Processing

Recently, MapReduce based spatial query systems have emerged as a cost effective and scalable solution to large scale spatial data processing and analytics. MapReduce based systems achieve massive scalability by partitioning the data and…

Databases · Computer Science 2015-09-04 Ablimit Aji, Vo Hoang, Fusheng Wang

Declarative Data Pipeline for Large Scale ML Services

Modern distributed data processing systems struggle to balance performance, maintainability, and developer productivity when integrating machine learning at scale. These challenges intensify in large collaborative environments due to high…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-07 Yunzhao Yang, Runhui Wang, Xuanqing Liu, Adit Krishnan, Yefan Tao, Yuqian Deng, Kuangyou Yao, Peiyuan Sun, Henrik Johnson, Aditi sinha, Davor Golac, Gerald Friedland, Usman Shakeel, Daryl Cooke, Joe Sullivan, Madhusudhanan Chandrasekaran, Chris Kong

A Survey on Spark Ecosystem for Big Data Processing

With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-17 Shanjiang Tang, Bingsheng He, Ce Yu, Yusen Li, Kun Li

SPIN: A Fast and Scalable Matrix Inversion Method in Apache Spark

The growth of big data in domains such as Earth Sciences, Social Networks, Physical Sciences, etc. has led to an immense need for efficient and scalable linear algebra operations, e.g. matrix inversion. Existing methods for efficient and…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-01-16 Chandan Misra, Sourangshu Bhattacharya, Soumya K. Ghosh

SPARQL query processing with Apache Spark

The number of linked data sources and the size of the linked open data graph keep growing every day. As a consequence, semantic RDF services are increasingly confronted with various "big data" problems. Query processing is one of them and…

Databases · Computer Science 2016-11-04 Hubert Naacke, Olivier Curé, Bernd Amann

Apache VXQuery: A Scalable XQuery Implementation

The wide use of XML for document management and data exchange has created the need to query large repositories of XML data. To efficiently query such large data collections and take advantage of parallelism, we have implemented Apache…

Databases · Computer Science 2015-04-02 E. Preston Carman, Till Westmann, Vinayak R. Borkar, Michael J. Carey, Vassilis J. Tsotras

Flare: Native Compilation for Heterogeneous Workloads in Apache Spark

The need for modern data analytics to combine relational, procedural, and map-reduce-style functional processing is widely recognized. State-of-the-art systems like Spark have added SQL front-ends and relational query optimization, which…

Databases · Computer Science 2017-03-27 Grégory M. Essertel, Ruby Y. Tahboub, James M. Decker, Kevin J. Brown, Kunle Olukotun, Tiark Rompf

Performance Evaluation of Query Plan Recommendation with Apache Hadoop and Apache Spark

Access plan recommendation is a query optimization approach that executes new queries using prior created query execution plans (QEPs). The query optimizer divides the query space into clusters in the mentioned method. However, traditional…

Databases · Computer Science 2022-10-14 Elham Azhir, Mehdi Hosseinzadeh, Faheem Khan, Amir Mosavi

Research Challenges in Relational Database Management Systems for LLM Queries

Large language models (LLMs) have become essential for applications such as text summarization, sentiment analysis, and automated question-answering. Recently, LLMs have also been integrated into relational database management systems to…

Databases · Computer Science 2025-08-29 Kerem Akillioglu, Anurag Chakraborty, Sairaj Voruganti, M. Tamer Özsu

Memory-Based Multi-Processing Method For Big Data Computation

The evolution of the Internet and computer applications has generated a colossal amount of data. This data is referred to as Big Data and consists of huge-volume, high-velocity, and variable datasets that need to be managed at the right…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-13 Youssef Bassil

Scalable Co-Clustering for Large-Scale Data through Dynamic Partitioning and Hierarchical Merging

Co-clustering simultaneously clusters rows and columns, revealing more fine-grained groups. However, existing co-clustering methods suffer from poor scalability and cannot handle large-scale data. This paper presents a novel and scalable…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-20 Zihan Wu, Zhaoke Huang, Hong Yan