Related papers: Principles for data analysis workflows

Tools and Recommendations for Reproducible Teaching

It is recommended that teacher-scholars of data science adopt reproducible workflows in their research as scholars and teach reproducible workflows to their students. In this paper, we propose a third dimension to reproducibility practices…

Other Statistics · Statistics 2024-07-23 Mine Dogucu, Mine Cetinkaya-Rundel

How are Software Repositories Mined? A Systematic Literature Review of Workflows, Methodologies, Reproducibility, and Tools

With the advent of open source software, a veritable treasure trove of previously proprietary software development data was made available. This opened the field of empirical software engineering research to anyone in academia. Data that is…

Software Engineering · Computer Science 2022-04-19 Adam Tutko, Austin Z. Henley, Audris Mockus

Scaling Systematic Literature Reviews with Machine Learning Pipelines

Systematic reviews, which entail the extraction of data from large numbers of scientific documents, are an ideal avenue for the application of machine learning. They are vital to many fields of science and philanthropy, but are very…

Computation and Language · Computer Science 2020-10-12 Seraphina Goldfarb-Tarrant, Alexander Robertson, Jasmina Lazic, Theodora Tsouloufi, Louise Donnison, Karen Smyth

Progressive Data Science: Potential and Challenges

Data science requires time-consuming iterative manual activities. In particular, activities such as data selection, preprocessing, transformation, and mining, highly depend on iterative trial-and-error processes that could be sped-up…

Human-Computer Interaction · Computer Science 2019-09-13 Cagatay Turkay, Nicola Pezzotti, Carsten Binnig, Hendrik Strobelt, Barbara Hammer, Daniel A. Keim, Jean-Daniel Fekete, Themis Palpanas, Yunhai Wang, Florin Rusu

Preprocessing Methods and Pipelines of Data Mining: An Overview

Data mining is about obtaining new knowledge from existing datasets. However, the data in the existing datasets can be scattered, noisy, and even incomplete. Although lots of effort is spent on developing or fine-tuning data mining models…

Machine Learning · Computer Science 2019-06-21 Canchen Li

Design Principles for Data Analysis

The data science revolution has led to an increased interest in the practice of data analysis. While much has been written about statistical thinking, a complementary form of thinking that appears in the practice of data analysis is design…

Methodology · Statistics 2023-05-24 Lucy D'Agostino McGowan, Roger D. Peng, Stephanie C. Hicks

A Primer on the Data Cleaning Pipeline

The availability of both structured and unstructured databases, such as electronic health data, social media data, patent data, and surveys that are often updated in real time, among others, has grown rapidly over the past decade. With this…

Databases · Computer Science 2023-07-26 Rebecca C. Steorts

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Data cleaning is the initial stage of any machine learning project and is one of the most critical processes in data analysis. It is a critical step in ensuring that the dataset is devoid of incorrect or erroneous data. It can be done…

Databases · Computer Science 2021-09-16 Ga Young Lee, Lubna Alzamil, Bakhtiyar Doskenov, Arash Termehchy

A System for Quantifying Data Science Workflows with Fine-Grained Procedural Logging and a Pilot Study

It is important for researchers to understand precisely how data scientists turn raw data into insights, including typical programming patterns, workflow, and methodology. This paper contributes a novel system, called DataInquirer, that…

Human-Computer Interaction · Computer Science 2024-05-29 Jinjin Zhao, Avidgor Gal, Sanjay Krishnan

The Fundamental Principles of Reproducibility

Reproducibility is a confused terminology. In this paper, I take a fundamental view on reproducibility rooted in the scientific method. The scientific method is analysed and characterised in order to develop the terminology required to…

Machine Learning · Computer Science 2022-01-19 Odd Erik Gundersen

Provide Proactive Reproducible Analysis Transparency with Every Publication

The high incidence of irreproducible research has led to urgent appeals for transparency and equitable practices in open science. For the scientific disciplines that rely on computationally intensive analyses of large data sets, a granular…

Computational Engineering, Finance, and Science · Computer Science 2024-08-20 Paul Meijer, Nicole Howard, Jessica Liang, Autumn Kelsey, Sathya Subramanian, Ed Johnson, Paul Mariz, James Harvey, Madeline Ambrose, Vitalii Tereshchenko, Aldan Beaubien, Neelima Inala, Yousef Aggoune, Stark Pister, Anne Vetto, Melissa Kinsey, Tom Bumol, Ananda Goldrath, Xiaojun Li, Troy Torgerson, Peter Skene, Lauren Okada, Christian La France, Zach Thomson, Lucas Graybuck

Opinionated practices for teaching reproducibility: motivation, guided instruction and practice

In the data science courses at the University of British Columbia, we define data science as the study, development and practice of reproducible and auditable processes to obtain insight from data. While reproducibility is core to our…

Computers and Society · Computer Science 2022-07-26 Joel Ostblom, Tiffany Timbers

The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large

Increasingly larger number of software systems today are including data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages from acquisition, to cleaning/curation, to modeling,…

Software Engineering · Computer Science 2022-02-15 Sumon Biswas, Mohammad Wardat, Hridesh Rajan

An Integrated Framework for Process Discovery Algorithm Evaluation

Process mining offers techniques to exploit event data by providing insights and recommendations to improve business processes. The growing amount of algorithms for process discovery has raised the question of which algorithms perform best…

Software Engineering · Computer Science 2018-06-20 Toon Jouck, Alfredo Bolt, Benoît Depaire, Massimiliano de Leoni, Wil M. P. van der Aalst

Provenance and data differencing for workflow reproducibility analysis

One of the foundations of science is that researchers must publish the methodology used to achieve their results so that others can attempt to reproduce them. This has the added benefit of allowing methods to be adopted and adapted for…

Databases · Computer Science 2014-06-05 Paolo Missier, Simon Woodman, Hugo Hiden, Paul Watson

Dataset Distillation Meets Provable Subset Selection

Deep learning has grown tremendously over recent years, yielding state-of-the-art results in various fields. However, training such models requires huge amounts of data, increasing the computational time and cost. To address this, dataset…

Machine Learning · Computer Science 2023-07-18 Murad Tukan, Alaa Maalouf, Margarita Osadchy

Goals, Process, and Challenges of Exploratory Data Analysis: An Interview Study

How do analysis goals and context affect exploratory data analysis (EDA)? To investigate this question, we conducted semi-structured interviews with 18 data analysts. We characterize common exploration goals: profiling (assessing data…

Human-Computer Interaction · Computer Science 2019-11-05 Kanit Wongsuphasawat, Yang Liu, Jeffrey Heer

Progressive Analytics: A Computation Paradigm for Exploratory Data Analysis

Exploring data requires a fast feedback loop from the analyst to the system, with a latency below about 10 seconds because of human cognitive limitations. When data becomes large or analysis becomes complex, sequential computations can no…

Human-Computer Interaction · Computer Science 2016-07-19 Jean-Daniel Fekete, Romain Primet

Three principles for modernizing an undergraduate regression analysis course

As data have become more prevalent in academia, industry, and daily life, it is imperative that undergraduate students are equipped with the skills needed to analyze data in the modern environment. In recent years there has been a lot of…

Other Statistics · Statistics 2023-07-10 Maria Tackett

Uniform-in-Phase-Space Data Selection with Iterative Normalizing Flows

Improvements in computational and experimental capabilities are rapidly increasing the amount of scientific data that is routinely generated. In applications that are constrained by memory and computational intensity, excessively large…

Machine Learning · Computer Science 2023-02-28 Malik Hassanaly, Bruce A. Perry, Michael E. Mueller, Shashank Yellapantula