Related papers: Towards Scalable Dataframe Systems

High Performance Dataframes from Parallel Processing Patterns

The data science community today has embraced the concept of Dataframes as the de facto standard for data representation and manipulation. Ease of use, massive operator coverage, and popularization of R and Python languages have heavily…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-06 Niranda Perera, Supun Kamburugamuve, Chathura Widanage, Vibhatha Abeykoon, Ahmet Uyar, Kaiying Shan, Hasara Maithree, Damitha Lenadora, Thejaka Amila Kanewala, Geoffrey Fox

Increasing Scalability of Process Mining using Event Dataframes: How Data Structure Matters

Process Mining is a branch of Data Science that aims to extract process-related information from event data contained in information systems, that is steadily increasing in amount. Many algorithms, and a general-purpose open source…

Databases · Computer Science 2019-08-01 Alessandro Berti

An Empirical Study on How the Developers Discussed about Pandas Topics

Pandas is defined as a software library which is used for data analysis in Python programming language. As pandas is a fast, easy and open source data analysis tool, it is rapidly used in different software engineering projects like…

Software Engineering · Computer Science 2023-05-11 Sajib Kumar Saha Joy, Farzad Ahmed, Al Hasib Mahamud, Nibir Chandra Mandal

In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes

The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-06 Niranda Perera, Arup Kumar Sarker, Mills Staylor, Gregor von Laszewski, Kaiying Shan, Supun Kamburugamuve, Chathura Widanage, Vibhatha Abeykoon, Thejaka Amila Kanewela, Geoffrey Fox

PolyFrame: A Retargetable Query-based Approach to Scaling DataFrames (Extended Version)

In the last few years, the field of data science has been growing rapidly as various businesses have adopted statistical and machine learning techniques to empower their decision making and applications. Scaling data analysis, possibly…

Databases · Computer Science 2021-02-11 Phanwadee Sinthong, Michael J. Carey

Efficient Dataframe Systems: Lazy Fat Pandas on a Diet

Pandas is widely used for data science applications, but users often run into problems when datasets are larger than memory. There are several frameworks based on lazy evaluation that handle large datasets, but the programs have to be…

Databases · Computer Science 2025-01-15 Bhushan Pal Singh, Priyesh Kumar, Chiranmoy Bhattacharya, S. Sudarshan

Evaluation of Dataframe Libraries for Data Preparation on a Single Machine

Data preparation is a trial-and-error process that typically involves countless iterations over the data to define the best pipeline of operators for a given task. With tabular data, practitioners often perform that burdensome activity on…

Databases · Computer Science 2024-11-22 Angelo Mozzillo, Luca Zecchini, Luca Gagliardelli, Adeel Aslam, Sonia Bergamaschi, Giovanni Simonini

Transparent Synchronous Dataflow

Dataflow programming is a popular and convenient programming paradigm in systems modelling, optimisation, and machine learning. It has a number of advantages, for instance the lack of control flow allows computation to be carried out in…

Programming Languages · Computer Science 2021-03-03 Steven W. T. Cheung, Dan R. Ghica, Koko Muroya

A Comparison of Big Data Frameworks on a Layered Dataflow Model

In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models, for which only…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-17 Claudia Misale, Maurizio Drocco, Marco Aldinucci, Guy Tremblay

PaPy: Parallel and Distributed Data-processing Pipelines in Python

PaPy, which stands for parallel pipelines in Python, is a highly flexible framework that enables the construction of robust, scalable workflows for either generating or processing voluminous datasets. A workflow is created from user-written…

Programming Languages · Computer Science 2014-07-17 Marcin Cieslik, Cameron Mura

HiFrames: High Performance Data Frames in a Scripting Language

Data frames in scripting languages are essential abstractions for processing structured data. However, existing data frame solutions are either not distributed (e.g., Pandas in Python) and therefore have limited scalability, or they are not…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-11 Ehsan Totoni, Wajih Ul Hassan, Todd A. Anderson, Tatiana Shpeisman

AFrame: Extending DataFrames for Large-Scale Modern Data Analysis (Extended Version)

Analyzing the increasingly large volumes of data that are available today, possibly including the application of custom machine learning models, requires the utilization of distributed frameworks. This can result in serious productivity…

Databases · Computer Science 2019-08-20 Phanwadee Sinthong, Michael J. Carey

MADAS -- A Python framework for assessing similarity in materials-science data

Computational materials science produces large quantities of data, both in terms of high-throughput calculations and individual studies. Extracting knowledge from this large and heterogeneous pool of data is challenging due to the wide…

Materials Science · Physics 2024-10-23 Martin Kuban, Santiago Rigamonti, Claudia Draxl

gadfly: A pandas-based Framework for Analyzing GADGET Simulation Data

We present the first public release (v0.1) of the open-source GADGET Dataframe Library: gadfly. The aim of this package is to leverage the capabilities of the broader python scientific computing ecosystem by providing tools for analyzing…

Instrumentation and Methods for Astrophysics · Physics 2016-10-19 Jacob Hummel

The Family of MapReduce and Large Scale Data Processing Systems

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a…

Databases · Computer Science 2013-02-14 Sherif Sakr, Anna Liu, Ayman G. Fayoumi

mAPN: Modeling, Analysis, and Exploration of Algorithmic and Parallelism Adaptivity

Using parallel embedded systems these days is increasing. They are getting more complex due to integrating multiple functionalities in one application or running numerous ones concurrently. This concerns a wide range of applications,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-07-18 Hasna Bouraoui, Chadlia Jerad, Omar Romdhani, Jeronimo Castrillon

Scalpel: The Python Static Analysis Framework

Despite being the most popular programming language, Python has not yet received enough attention from the community. To the best of our knowledge, there is no general static analysis framework proposed to facilitate the implementation of…

Software Engineering · Computer Science 2022-02-25 Li Li, Jiawei Wang, Haowei Quan

Transformation of Python Applications into Function-as-a-Service Deployments

New cloud programming and deployment models pose challenges to software application engineers who are looking, often in vain, for tools to automate any necessary code adaptation and transformation. Function-as-a-Service interfaces are…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-24 Josef Spillner

Almost Continuous Transformations of Software and Higher-order Dataflow Programming

We consider two classes of stream-based computations which admit taking linear combinations of execution runs: probabilistic sampling and generalized animation. The dataflow architecture is a natural platform for programming with streams.…

Programming Languages · Computer Science 2016-01-06 Michael Bukatin, Steve Matthews

DFS: A Dataset File System for Data Discovering Users

Many research questions can be answered quickly and efficiently using data already collected for previous research. This practice is called secondary data analysis (SDA), and has gained popularity due to lower costs and improved research…

Digital Libraries · Computer Science 2020-04-07 Yasith Jayawardana, Sampath Jayarathna