Related papers: A Static Analysis Framework for Data Science Noteb…
Data science pipelines to train and evaluate models with machine learning may contain bugs just like any other code. Leakage between training and test data can lead to overestimating the model's accuracy during offline evaluations, possibly…
Saving, or checkpointing, intermediate results during interactive data exploration can potentially boost user productivity. However, existing studies on this topic are limited, as they primarily rely on small-scale experiments with human…
Background. Jupyter notebooks are one of the main tools used by data scientists. Notebooks include features (configuration scripts, markdown, images, etc.) that make them challenging to analyze compared to traditional software. As a result,…
It is important for researchers to understand precisely how data scientists turn raw data into insights, including typical programming patterns, workflow, and methodology. This paper contributes a novel system, called DataInquirer, that…
How can we develop visual analytics (VA) tools that can be easily adopted? Visualization researchers have developed a large number of web-based VA tools to help data scientists in a wide range of tasks. However, adopting these standalone…
The massive trend of integrating data-driven AI capabilities into traditional software systems is rising new intriguing challenges. One of such challenges is achieving a smooth transition from the explorative phase of Machine Learning…
How can we better organize code in computational notebooks? Notebooks have become a popular tool among data scientists, as they seamlessly weave text and code together, supporting users to rapidly iterate and document code experiments.…
Just like other software, spreadsheets can contain significant faults. Static analysis is an accepted and well-established technique in software engineering known for its capability to discover faults. In recent years, a growing number of…
Data Scientists often use notebooks to develop Data Science (DS) pipelines, particularly since they allow to selectively execute parts of the pipeline. However, notebooks for DS have many well-known flaws. We focus on the following ones in…
Data exploration is an important aspect of the workflow of mixed-methods researchers, who conduct both qualitative and quantitative analysis. However, there currently exists few tools that adequately support both types of analysis…
This paper proposes the use of notebooks for the design documentation and tool interaction in the rigorous design of embedded systems. Conventionally, a notebook is a sequence of cells alternating between (textual) code and prose to form a…
Computational notebooks are notoriously prone to reproducibility failures. By permitting out-of-order cell execution, notebooks accumulate hidden state and implicit dependencies that cause interactive executions to silently diverge from…
Computational notebooks, which integrate code, documentation, tags, and visualizations into a single document, have become increasingly popular for data analysis tasks. With the advent of immersive technologies, these notebooks have evolved…
The development of data science expertise requires tacit, process-oriented skills that are difficult to teach directly. This study addresses the resulting challenge of empirically understanding how the problem-solving processes of experts…
Data science workflows are human-centered processes involving on-demand programming and analysis. While programmable and interactive interfaces such as widgets embedded within computational notebooks are suitable for these workflows, they…
In software engineering, numerous studies have focused on the analysis of fine-grained logs, leading to significant innovations in areas such as refactoring, security, and code completion. However, no similar studies have been conducted for…
Computational notebooks (e.g., Jupyter, Google Colab) are widely used for interactive data science and machine learning. In those frameworks, users can start a session, then execute cells (i.e., a set of statements) to create variables,…
Designing a static analysis is generally a substantial undertaking, requiring significant expertise in both program analysis and the domain of the program analysis, and significant development resources. As a result, most program analyses…
Interactive visualization can support fluid exploration but is often limited to predetermined tasks. Scripting can support a vast range of queries but may be more cumbersome for free-form exploration. Embedding interactive visualization in…
Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that…