Related papers: Enabling Collaborative Data Science Development wi…
The introduction of machine learning (ML) components in software projects has created the need for software engineers to collaborate with data scientists and other specialists. While collaboration can always be challenging, ML introduces…
With the increased interest in computational sciences, machine learning (ML), pattern recognition (PR) and big data, governmental agencies, academia and manufacturers are overwhelmed by the constant influx of new algorithms and techniques…
Developers in data science and other domains frequently use computational notebooks to create exploratory analyses and prototype models. However, they often struggle to incorporate existing software engineering tooling into these…
The recent success of machine learning (ML) has led to an explosive growth both in terms of new systems and algorithms built in industry and academia, and new applications built by an ever-growing community of data science (DS)…
Data engineering is one of the fastest-growing fields within machine learning (ML). As ML becomes more common, the appetite for data grows more ravenous. But ML requires more data than individual teams of data engineers can readily produce,…
Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving…
Social science research increasingly demands data-driven insights, yet researchers often face barriers such as lack of technical expertise, inconsistent data formats, and limited access to reliable datasets.Social science research…
There are many science applications that require scalable task-level parallelism and support for flexible execution and coupling of ensembles of simulations. Most high-performance system software and middleware, however, are designed to…
Incorporating Machine Learning (ML) into existing systems is a demand that has grown among several organizations. However, the development of ML-enabled systems encompasses several social and technical challenges, which must be addressed by…
Everybody wants to analyse their data, but only few posses the data science expertise to to this. Motivated by this observation we introduce a novel framework and system \textsc{VisualSynth} for human-machine collaboration in data science.…
High-quality data has become increasingly important to software engineers in designing and implementing today's software, for example, as an input to machine-learning algorithms and visualisation- and analytics-based features. Open data -…
Data-driven science is an emerging paradigm where scientific discoveries depend on the execution of computational AI models against rich, discipline-specific datasets. With modern machine learning frameworks, anyone can develop and execute…
We present the Open MatSci ML Toolkit: a flexible, self-contained, and scalable Python-based framework to apply deep learning models and methods on scientific data with a specific focus on materials science and the OpenCatalyst Dataset. Our…
Data science has employed great research efforts in developing advanced analytics, improving data models and cultivating new algorithms. However, not many authors have come across the organizational and socio-technical challenges that arise…
Open Science aims to foster openness and collaboration in research, leading to more significant scientific and social impact. However, practicing Open Science comes with several challenges and is currently not properly rewarded. In this…
The advent of data-driven science in the 21st century brought about the need for well-organized structured data and associated infrastructure able to facilitate the applications of Artificial Intelligence and Machine Learning. We present an…
We use the emergent field of Complex Networks to analyze the network of scientific collaborations between entities (universities, research organizations, industry related companies,...) which collaborate in the context of the so-called…
In recent years, Machine Learning (ML) components have been increasingly integrated into the core systems of organizations. Engineering such systems presents various challenges from both a theoretical and practical perspective. One of the…
Large language models (LLMs) have achieved remarkable progress across domains and applications but face challenges such as high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns. Small language…
Data science tasks involving tabular data present complex challenges that require sophisticated problem-solving approaches. We propose AutoKaggle, a powerful and user-centric framework that assists data scientists in completing daily data…