Related papers: Zero-Cost, Arrow-Enabled Data Interface for Apache…

Leveraging Apache Arrow for Zero-copy, Zero-serialization Cluster Shared Memory

This paper describes a distributed implementation of Apache Arrow that can leverage cluster-shared load-store addressable memory that is hardware-coherent only within each node. The implementation is built on the ThymesisFlow prototype that…

Emerging Technologies · Computer Science 2024-04-05 Philip Groet, Joost Hoozemans, Andreas Grapentin, Felix Eberhardt, Zaid Al-Ars, H. Peter Hofstee

In-Memory Indexed Caching for Distributed Data Processing

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de-facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-09 Alexandru Uta, Bogdan Ghit, Ankur Dave, Jan Rellermeyer, Peter Boncz

Technical Report: On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-30 Bilal Akil, Ying Zhou, Uwe Röhm

Benchmarking Apache Arrow Flight -- A wire-speed protocol for data transfer, querying and microservices

Moving structured data between different big data frameworks and/or data warehouses/storage systems often causes significant overhead. Most of the time more than 80% of the total time spent in accessing data is elapsed in…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-11 Tanveer Ahmad, Zaid Al Ars, H. Peter Hofstee

Processing Particle Data Flows with SmartNICs

Many distributed applications implement complex data flows and need a flexible mechanism for routing data between producers and consumers. Recent advances in programmable network interface cards, or SmartNICs, represent an opportunity to…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-14 Jianshen Liu, Carlos Maltzahn, Matthew L. Curry, Craig Ulmer

A Big Data Analysis Framework Using Apache Spark and Deep Learning

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta, Hardeo Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag

Towards an Arrow-native Storage System

With the ever-increasing dataset sizes, several file formats like Parquet, ORC, and Avro have been developed to store data efficiently and to save network and interconnect bandwidth at the price of additional CPU utilization. However, with…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-24 Jayjeet Chakraborty, Ivo Jimenez, Sebastiaan Alvarez Rodriguez, Alexandru Uta, Jeff LeFevre, Carlos Maltzahn

On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper…

Databases · Computer Science 2015-07-10 Olivier Curé, Hubert Naacke, Mohamed-Amine Baazizi, Bernd Amann

Understanding the Challenges and Assisting Developers with Developing Spark Applications

To process data more efficiently, big data frameworks provide data abstractions to developers. However, due to the abstraction, there may be many challenges for developers to understand and debug the data processing code. To uncover the…

Software Engineering · Computer Science 2021-03-29 Zehao Wang

Large-Scale Intelligent Microservices

Deploying Machine Learning (ML) algorithms within databases is a challenge due to the varied computational footprints of modern ML algorithms and the myriad of database technologies each with its own restrictive syntax. We introduce an…

Artificial Intelligence · Computer Science 2022-03-17 Mark Hamilton, Nick Gonsalves, Christina Lee, Anand Raman, Brendan Walsh, Siddhartha Prasad, Dalitso Banda, Lucy Zhang, Mei Gao, Lei Zhang, William T. Freeman

AAFLOW: Scalable Patterns for Agentic AI Workflows

Agentic workflows in large language model systems integrate retrieval, reasoning, and memory, but existing frameworks suffer from scalability and reproducibility limitations due to fragmented data orchestration, serialization overhead, and…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-05 Arup Kumar Sarker, Mills Staylor, Aymen Alsaadi, Gregor von Laszewski, Shantenu Jha, Geoffrey Fox

Distributed Record Linkage in Healthcare Data with Apache Spark

Healthcare data is a valuable resource for research, analysis, and decision-making in the medical field. However, healthcare data is often fragmented and distributed across various sources, making it challenging to combine and analyze…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-12 Mohammad Heydari, Reza Sarshar, Mohammad Ali Soltanshahi

AIR: A Light-Weight Yet High-Performance Dataflow Engine based on Asynchronous Iterative Routing

Distributed Stream Processing Systems (DSPSs) are among the currently most emerging topics in data management, with applications ranging from real-time event monitoring to processing complex dataflow programs and big data analytics. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-06 Vinu E. Venugopal, Martin Theobald, Samira Chaychi, Amal Tawakuli

Understanding and Optimizing the Performance of Distributed Machine Learning Applications on Apache Spark

In this paper we explore the performance limits of Apache Spark for machine learning applications. We begin by analyzing the characteristics of a state-of-the-art distributed machine learning algorithm implemented in Spark and compare it to…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-21 Celestine Dünner, Thomas Parnell, Kubilay Atasu, Manolis Sifalakis, Haralampos Pozidis

Flare: Native Compilation for Heterogeneous Workloads in Apache Spark

The need for modern data analytics to combine relational, procedural, and map-reduce-style functional processing is widely recognized. State-of-the-art systems like Spark have added SQL front-ends and relational query optimization, which…

Databases · Computer Science 2017-03-27 Grégory M. Essertel, Ruby Y. Tahboub, James M. Decker, Kevin J. Brown, Kunle Olukotun, Tiark Rompf

Smarter Together: Creating Agentic Communities of Practice through Shared Experiential Learning

The transition from human-centric to agent-centric software development practices is disrupting existing knowledge sharing environments for software developers. Traditional peer-to-peer repositories and developer communities for shared…

Artificial Intelligence · Computer Science 2025-11-12 Valentin Tablan, Scott Taylor, Gabriel Hurtado, Kristoffer Bernhem, Anders Uhrenholt, Gabriele Farei, Karo Moilanen

AARC: Automated Affinity-aware Resource Configuration for Serverless Workflows

Serverless computing is increasingly adopted for its ability to manage complex, event-driven workloads without the need for infrastructure provisioning. However, traditional resource allocation in serverless platforms couples CPU and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-03 Lingxiao Jin, Zinuo Cai, Zebin Chen, Hongyu Zhao, Ruhui Ma

Identifying the potential of Near Data Computing for Apache Spark

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both batch and stream…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-31 Ahsan Javed Awan, Mats Brorsson, Vladimir Vlassov, Eduard Ayguade

Intermediate Data Caching Optimization for Multi-Stage and Parallel Big Data Frameworks

In the era of big data and cloud computing, large amounts of data are generated from user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such…

Performance · Computer Science 2018-05-09 Zhengyu Yang, Danlin Jia, Stratis Ioannidis, Ningfang Mi, Bo Sheng

Studio e confronto delle strutture di Apache Spark

English. This document is designed to study the data structures that can be used in the Apache Spark framework and to evaluate the best performing ones to implement solutions; in particular, we will evaluate advantages / disadvantages…

Databases · Computer Science 2018-10-30 Massimiliano Morrelli