Related papers: Zero-Cost, Arrow-Enabled Data Interface for Apache…

200 papers

This paper describes a distributed implementation of Apache Arrow that can leverage cluster-shared load-store addressable memory that is hardware-coherent only within each node. The implementation is built on the ThymesisFlow prototype that…

Emerging Technologies · Computer Science 2024-04-05 Philip Groet, Joost Hoozemans, Andreas Grapentin, Felix Eberhardt, Zaid Al-Ars, H. Peter Hofstee

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de-facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-09 Alexandru Uta, Bogdan Ghit, Ankur Dave, Jan Rellermeyer, Peter Boncz

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-30 Bilal Akil, Ying Zhou, Uwe Röhm

Moving structured data between different big data frameworks and/or data warehouses/storage systems often causes significant overhead. Most of the time more than 80% of the total time spent in accessing data is elapsed in…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-11 Tanveer Ahmad, Zaid Al Ars, H. Peter Hofstee

Many distributed applications implement complex data flows and need a flexible mechanism for routing data between producers and consumers. Recent advances in programmable network interface cards, or SmartNICs, represent an opportunity to…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-14 Jianshen Liu, Carlos Maltzahn, Matthew L. Curry, Craig Ulmer

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta, Hardeo Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag

With the ever-increasing dataset sizes, several file formats like Parquet, ORC, and Avro have been developed to store data efficiently and to save network and interconnect bandwidth at the price of additional CPU utilization. However, with…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-24 Jayjeet Chakraborty, Ivo Jimenez, Sebastiaan Alvarez Rodriguez, Alexandru Uta, Jeff LeFevre, Carlos Maltzahn

Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper…

Databases · Computer Science 2015-07-10 Olivier Curé, Hubert Naacke, Mohamed-Amine Baazizi, Bernd Amann

To process data more efficiently, big data frameworks provide data abstractions to developers. However, due to the abstraction, there may be many challenges for developers to understand and debug the data processing code. To uncover the…

Software Engineering · Computer Science 2021-03-29 Zehao Wang

Deploying Machine Learning (ML) algorithms within databases is a challenge due to the varied computational footprints of modern ML algorithms and the myriad of database technologies each with its own restrictive syntax. We introduce an…

Agentic workflows in large language model systems integrate retrieval, reasoning, and memory, but existing frameworks suffer from scalability and reproducibility limitations due to fragmented data orchestration, serialization overhead, and…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-05 Arup Kumar Sarker, Mills Staylor, Aymen Alsaadi, Gregor von Laszewski, Shantenu Jha, Geoffrey Fox

Healthcare data is a valuable resource for research, analysis, and decision-making in the medical field. However, healthcare data is often fragmented and distributed across various sources, making it challenging to combine and analyze…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-12 Mohammad Heydari, Reza Sarshar, Mohammad Ali Soltanshahi

Distributed Stream Processing Systems (DSPSs) are currently among the fastest-emerging topics in data management, with applications ranging from real-time event monitoring to processing complex dataflow programs and big data analytics. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-06 Vinu E. Venugopal, Martin Theobald, Samira Chaychi, Amal Tawakuli

In this paper we explore the performance limits of Apache Spark for machine learning applications. We begin by analyzing the characteristics of a state-of-the-art distributed machine learning algorithm implemented in Spark and compare it to…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-21 Celestine Dünner, Thomas Parnell, Kubilay Atasu, Manolis Sifalakis, Haralampos Pozidis

The need for modern data analytics to combine relational, procedural, and map-reduce-style functional processing is widely recognized. State-of-the-art systems like Spark have added SQL front-ends and relational query optimization, which…

The transition from human-centric to agent-centric software development practices is disrupting existing knowledge sharing environments for software developers. Traditional peer-to-peer repositories and developer communities for shared…

Artificial Intelligence · Computer Science 2025-11-12 Valentin Tablan, Scott Taylor, Gabriel Hurtado, Kristoffer Bernhem, Anders Uhrenholt, Gabriele Farei, Karo Moilanen

Serverless computing is increasingly adopted for its ability to manage complex, event-driven workloads without the need for infrastructure provisioning. However, traditional resource allocation in serverless platforms couples CPU and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-03 Lingxiao Jin, Zinuo Cai, Zebin Chen, Hongyu Zhao, Ruhui Ma

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both batch and stream…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-31 Ahsan Javed Awan, Mats Brorsson, Vladimir Vlassov, Eduard Ayguade

In the era of big data and cloud computing, large amounts of data are generated from user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such…

Performance · Computer Science 2018-05-09 Zhengyu Yang, Danlin Jia, Stratis Ioannidis, Ningfang Mi, Bo Sheng

English. This document is designed to study the data structures that can be used in the Apache Spark framework and to evaluate the best-performing ones for implementing solutions; in particular, we will evaluate advantages/disadvantages…

Databases · Computer Science 2018-10-30 Massimiliano Morrelli