English
Related papers

Related papers: Towards Scalable Dataframe Systems

200 papers

The data science community today has embraced the concept of Dataframes as the de facto standard for data representation and manipulation. Ease of use, massive operator coverage, and popularization of R and Python languages have heavily…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-06 Niranda Perera , Supun Kamburugamuve , Chathura Widanage , Vibhatha Abeykoon , Ahmet Uyar , Kaiying Shan , Hasara Maithree , Damitha Lenadora , Thejaka Amila Kanewala , Geoffrey Fox

Process Mining is a branch of Data Science that aims to extract process-related information from event data contained in information systems, that is steadily increasing in amount. Many algorithms, and a general-purpose open source…

Databases · Computer Science 2019-08-01 Alessandro Berti

Pandas is defined as a software library which is used for data analysis in Python programming language. As pandas is a fast, easy and open source data analysis tool, it is rapidly used in different software engineering projects like…

Software Engineering · Computer Science 2023-05-11 Sajib Kumar Saha Joy , Farzad Ahmed , Al Hasib Mahamud , Nibir Chandra Mandal

The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more…

In the last few years, the field of data science has been growing rapidly as various businesses have adopted statistical and machine learning techniques to empower their decision making and applications. Scaling data analysis, possibly…

Databases · Computer Science 2021-02-11 Phanwadee Sinthong , Michael J. Carey

Pandas is widely used for data science applications, but users often run into problems when datasets are larger than memory. There are several frameworks based on lazy evaluation that handle large datasets, but the programs have to be…

Databases · Computer Science 2025-01-15 Bhushan Pal Singh , Priyesh Kumar , Chiranmoy Bhattacharya , S. Sudarshan

Data preparation is a trial-and-error process that typically involves countless iterations over the data to define the best pipeline of operators for a given task. With tabular data, practitioners often perform that burdensome activity on…

Dataflow programming is a popular and convenient programming paradigm in systems modelling, optimisation, and machine learning. It has a number of advantages, for instance the lacks of control flow allows computation to be carried out in…

Programming Languages · Computer Science 2021-03-03 Steven W. T. Cheung , Dan R. Ghica , Koko Muroya

In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models, for which only…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-17 Claudia Misale , Maurizio Drocco , Marco Aldinucci , Guy Tremblay

PaPy, which stands for parallel pipelines in Python, is a highly flexible framework that enables the construction of robust, scalable workflows for either generating or processing voluminous datasets. A workflow is created from user-written…

Programming Languages · Computer Science 2014-07-17 Marcin Cieslik , Cameron Mura

Data frames in scripting languages are essential abstractions for processing structured data. However, existing data frame solutions are either not distributed (e.g., Pandas in Python) and therefore have limited scalability, or they are not…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-11 Ehsan Totoni , Wajih Ul Hassan , Todd A. Anderson , Tatiana Shpeisman

Analyzing the increasingly large volumes of data that are available today, possibly including the application of custom machine learning models, requires the utilization of distributed frameworks. This can result in serious productivity…

Databases · Computer Science 2019-08-20 Phanwadee Sinthong , Michael J. Carey

Computational materials science produces large quantities of data, both in terms of high-throughput calculations and individual studies. Extracting knowledge from this large and heterogeneous pool of data is challenging due to the wide…

Materials Science · Physics 2024-10-23 Martin Kuban , Santiago Rigamonti , Claudia Draxl

We present the first public release (v0.1) of the open-source GADGET Dataframe Library: gadfly. The aim of this package is to leverage the capabilities of the broader python scientific computing ecosystem by providing tools for analyzing…

Instrumentation and Methods for Astrophysics · Physics 2016-10-19 Jacob Hummel

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a…

Databases · Computer Science 2013-02-14 Sherif Sakr , Anna Liu , Ayman G. Fayoumi

Using parallel embedded systems these days is increasing. They are getting more complex due to integrating multiple functionalities in one application or running numerous ones concurrently. This concerns a wide range of applications,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-07-18 Hasna Bouraoui , Chadlia Jerad , Omar Romdhani , Jeronimo Castrillon

Despite being the most popular programming language, Python has not yet received enough attention from the community. To the best of our knowledge, there is no general static analysis framework proposed to facilitate the implementation of…

Software Engineering · Computer Science 2022-02-25 Li Li , Jiawei Wang , Haowei Quan

New cloud programming and deployment models pose challenges to software application engineers who are looking, often in vain, for tools to automate any necessary code adaptation and transformation. Function-as-a-Service interfaces are…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-24 Josef Spillner

We consider two classes of stream-based computations which admit taking linear combinations of execution runs: probabilistic sampling and generalized animation. The dataflow architecture is a natural platform for programming with streams.…

Programming Languages · Computer Science 2016-01-06 Michael Bukatin , Steve Matthews

Many research questions can be answered quickly and efficiently using data already collected for previous research. This practice is called secondary data analysis (SDA), and has gained popularity due to lower costs and improved research…

Digital Libraries · Computer Science 2020-04-07 Yasith Jayawardana , Sampath Jayarathna
‹ Prev 1 2 3 10 Next ›