Related papers: AFrame: Extending DataFrames for Large-Scale Moder…

PolyFrame: A Retargetable Query-based Approach to Scaling DataFrames (Extended Version)

In the last few years, the field of data science has been growing rapidly as various businesses have adopted statistical and machine learning techniques to empower their decision making and applications. Scaling data analysis, possibly…

Databases · Computer Science 2021-02-11 Phanwadee Sinthong, Michael J. Carey

A Big Data Analysis Framework Using Apache Spark and Deep Learning

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta, Hardeo Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag

An IDEA: An Ingestion Framework for Data Enrichment in AsterixDB

Big Data today is being generated at an unprecedented rate from various sources such as sensors, applications, and devices, and it often needs to be enriched based on other reference information to support complex analytical queries.…

Databases · Computer Science 2020-08-18 Xikui Wang, Michael J. Carey

Admire framework: Distributed data mining on data grid platforms

In this paper, we present the ADMIRE architecture; a new framework for developing novel and innovative data mining techniques to deal with very large and distributed heterogeneous datasets in both commercial and academic applications. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-03-30 Nhien-An Le-Khac, M-Tahar Kechadi, Joe Carthy

Towards Interactive, Adaptive and Result-aware Big Data Analytics

As data volumes grow across applications, analytics of large amounts of data is becoming increasingly important. Big data processing frameworks such as Apache Hadoop, Apache AsterixDB, and Apache Spark have been built to meet this demand. A…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-15 Avinash Kumar

In-Memory Indexed Caching for Distributed Data Processing

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de-facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-09 Alexandru Uta, Bogdan Ghit, Ankur Dave, Jan Rellermeyer, Peter Boncz

DynaHash: Efficient Data Rebalancing in Apache AsterixDB (Extended Version)

Parallel shared-nothing data management systems have been widely used to exploit a cluster of machines for efficient and scalable data processing. When a cluster needs to be dynamically scaled in or out, data must be efficiently rebalanced.…

Databases · Computer Science 2021-05-25 Chen Luo, Michael J. Carey

AsterixDB: A Scalable, Open Source BDMS

AsterixDB is a new, full-function BDMS (Big Data Management System) with a feature set that distinguishes it from other platforms in today's open source Big Data ecosystem. Its features make it well-suited to applications like web data…

Databases · Computer Science 2014-07-03 Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak Borkar, Yingyi Bu, Michael Carey, Inci Cetindil, Madhusudan Cheelangi, Khurram Faraaz, Eugenia Gabrielova, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Guangqiang Li, Ji Mahn Ok, Nicola Onose, Pouria Pirzadeh, Vassilis Tsotras, Rares Vernica, Jian Wen, Till Westmann

Technical Report: On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-30 Bilal Akil, Ying Zhou, Uwe Röhm

DAME: A Distributed Data Mining & Exploration Framework within the Virtual Observatory

Nowadays, many scientific areas share the same broad requirements of being able to deal with massive and distributed datasets while, when possible, being integrated with services and applications. In order to solve the growing gap between…

Instrumentation and Methods for Astrophysics · Physics 2011-12-06 M. Brescia, S. Cavuoti, R. D'Abrusco, O. Laurino, G. Longo

Introducing Schema Inference as a Scalable SQL Function [Extended Version]

This paper introduces a novel approach to schema inference as an on-demand function integrated directly within a DBMS, targeting NoSQL databases where schema flexibility can create challenges. Unlike previous methods relying on external…

Databases · Computer Science 2024-11-21 Calvin Dani, Shiva Jahangiri, Thomas Hütter

Apache VXQuery: A Scalable XQuery Implementation

The wide use of XML for document management and data exchange has created the need to query large repositories of XML data. To efficiently query such large data collections and take advantage of parallelism, we have implemented Apache…

Databases · Computer Science 2015-04-02 E. Preston Carman, Till Westmann, Vinayak R. Borkar, Michael J. Carey, Vassilis J. Tsotras

The Family of MapReduce and Large Scale Data Processing Systems

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a…

Databases · Computer Science 2013-02-14 Sherif Sakr, Anna Liu, Ayman G. Fayoumi

Evolving Large-Scale Data Stream Analytics based on Scalable PANFIS

Many distributed machine learning frameworks have recently been built to speed up the large-scale data learning process. However, most distributed machine learning used in these frameworks still uses an offline algorithm model which cannot…

Artificial Intelligence · Computer Science 2018-07-19 Mahardhika Pratama, Choiru Za'in, Eric Pardede

On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper…

Databases · Computer Science 2015-07-10 Olivier Curé, Hubert Naacke, Mohamed-Amine Baazizi, Bernd Amann

Declarative Data Pipeline for Large Scale ML Services

Modern distributed data processing systems struggle to balance performance, maintainability, and developer productivity when integrating machine learning at scale. These challenges intensify in large collaborative environments due to high…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-07 Yunzhao Yang, Runhui Wang, Xuanqing Liu, Adit Krishnan, Yefan Tao, Yuqian Deng, Kuangyou Yao, Peiyuan Sun, Henrik Johnson, Aditi Sinha, Davor Golac, Gerald Friedland, Usman Shakeel, Daryl Cooke, Joe Sullivan, Madhusudhanan Chandrasekaran, Chris Kong

MaRe: a MapReduce-Oriented Framework for Processing Big Data with Application Containers

Background. Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has been proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-05-10 Marco Capuccini, Martin Dahlö, Salman Toor, Ola Spjuth

Aixel: A Unified, Adaptive and Extensible System for AI-powered Data Analysis

A growing trend in modern data analysis is the integration of data management with learning, guided by accuracy, latency, and cost requirements. In practice, applications draw data of different formats from many sources. In the meanwhile,…

Databases · Computer Science 2025-10-15 Meihui Zhang, Liming Wang, Chi Zhang, Zhaojing Luo

FRAME: Forward Recursive Adaptive Model Extraction - A Technique for Advance Feature Selection

The challenges in feature selection, particularly in balancing model accuracy, interpretability, and computational efficiency, remain a critical issue in advancing machine learning methodologies. To address these complexities, this study…

Machine Learning · Computer Science 2026-01-06 Nachiket Kapure, Harsh Joshi, Parul Kumari, Rajeshwari Mistri, Manasi Mali

Scalable Ray Tracing Using the Distributed FrameBuffer

Image- and data-parallel rendering across multiple nodes on high-performance computing systems is widely used in visualization to provide higher frame rates, support large data sets, and render data in situ. Specifically for in situ…

Graphics · Computer Science 2023-05-15 Will Usher, Ingo Wald, Jefferson Amstutz, Johannes Günther, Carson Brownlee, Valerio Pascucci