Related papers: A Case for Dataset Specific Profiling
Artificial intelligence (AI) provides many opportunities to improve private and public life. Discovering patterns and structures in large troves of data in an automated manner is a core component of data science, and currently drives…
The breakthrough in Deep Learning neural networks has transformed the use of AI and machine learning technologies for the analysis of very large experimental datasets. These datasets are typically generated by large-scale experimental…
A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine…
In this work, we reflect on the data-driven modeling paradigm that is gaining ground in AI-driven automation of patient care. We argue that the repurposing of existing real-world patient datasets for machine learning may not always…
Dataset Search -- the process of finding appropriate datasets for a given task -- remains a critical yet under-explored challenge in data science workflows. Assessing dataset suitability for a task (e.g., training a classification model) is…
Machine Science, or Data-driven Research, is a new and interesting scientific methodology that uses advanced computational techniques to identify, retrieve, classify and analyse data in order to generate hypotheses and develop models. In…
While data science has emerged as a contentious new scientific field, enormous debates and discussions have been made on it why we need data science and what makes it as a science. In reviewing hundreds of pieces of literature which include…
In this paper we argue that data science is a coherent and novel approach to empirical problems that, in its most general form, does not build understanding about phenomena. Within the new type of mathematization at work in data science,…
Empirical and LLM-based research in model-driven engineering increasingly relies on datasets of software models, for instance, to train or evaluate machine learning techniques for modeling support. These datasets have a significant impact…
The rapid advancement of large language models has fundamentally shifted the bottleneck in AI development from computational power to data availability-with countless valuable datasets remaining hidden across specialized repositories,…
Many real-world scientific processes are governed by complex nonlinear dynamic systems that can be represented by differential equations. Recently, there has been increased interest in learning, or discovering, the forms of the equations…
Data intensive research requires the support of appropriate datasets. However, it is often time-consuming to discover usable datasets matching a specific research topic. We formulate the dataset discovery problem on an attributed…
Data science is an integrated workflow of technical, analytical, communication, and ethical skills, but current AI benchmarks focus mostly on constituent parts. We test whether AI models can generate end-to-end data science projects. To do…
Continuous dynamical systems, characterized by differential equations, are ubiquitously used to model several important problems: plasma dynamics, flow through porous media, weather forecasting, and epidemic dynamics. Recently, a wide range…
As we are fast approaching the beginning of a paradigm shift in the field of science, Data driven science (the so called fourth science paradigm) is going to be the driving force in research and innovation. From medicine to biodiversity and…
With the explosion of applications of Data Science, the field is has come loose from its foundations. This article argues for a new program of applied research in areas familiar to researchers in Bayesian methods in AI that are needed to…
Data-driven approaches, most prominently deep learning, have become powerful tools for prediction in many domains. A natural question to ask is whether data-driven methods could also be used to predict global weather patterns days in…
The advent of data-driven science in the 21st century brought about the need for well-organized structured data and associated infrastructure able to facilitate the applications of Artificial Intelligence and Machine Learning. We present an…
Successful data-driven science requires complex data engineering pipelines to clean, transform, and alter data in preparation for machine learning, and robust results can only be achieved when each step in the pipeline can be justified, and…
Recent research has helped to cultivate growing awareness that machine learning systems fueled by big data can create or exacerbate troubling disparities in society. Much of this research comes from outside of the practicing data science…