Related papers: Data Readiness Levels

Data Readiness for Natural Language Processing

This document concerns data readiness in the context of machine learning and Natural Language Processing. It describes how an organization may proceed to identify, make available, validate, and prepare data to facilitate automated analysis…

Computers and Society · Computer Science 2020-10-01 Fredrik Olsson, Magnus Sahlgren

Data Readiness Report

Data exploration and quality analysis is an important yet tedious process in the AI pipeline. Current practices of data cleaning and data readiness assessment for machine learning tasks are mostly conducted in an arbitrary manner which…

Databases · Computer Science 2020-10-16 Shazia Afzal, Rajmohan C, Manish Kesarwani, Sameep Mehta, Hima Patel

Technical Report on Data Integration and Preparation

AI application developers typically begin with a dataset of interest and a vision of the end analytic or insight they wish to gain from the data at hand. Although these are two very important components of an AI workflow, one often spends…

Databases · Computer Science 2021-03-04 El Kindi Rezig, Michael Cafarella, Vijay Gadepally

Empowering Tabular Data Preparation with Language Models: Why and How?

Data preparation is a critical step in enhancing the usability of tabular data and thus boosts downstream data-driven tasks. Traditional methods often face challenges in capturing the intricate relationships within tables and adapting to…

Artificial Intelligence · Computer Science 2025-08-05 Mengshi Chen, Yuxiang Sun, Tengchao Li, Jianwei Wang, Kai Wang, Xuemin Lin, Ying Zhang, Wenjie Zhang

Towards "all-inclusive" Data Preparation to ensure Data Quality

Data preparation, especially data cleaning, is very important to ensure data quality and to improve the output of automated decision systems. Since there is no single tool that covers all steps required, a combination of tools -- namely a…

Databases · Computer Science 2023-08-29 Valerie Restat

AI Competitions and Benchmarks: Dataset Development

Machine learning is now used in many applications thanks to its ability to predict, generate, or discover patterns from large quantities of data. However, the process of collecting and transforming data for practical use is intricate. Even…

Machine Learning · Computer Science 2024-04-16 Romain Egele, Julio C. S. Jacques Junior, Jan N. van Rijn, Isabelle Guyon, Xavier Baró, Albert Clapés, Prasanna Balaprakash, Sergio Escalera, Thomas Moeslund, Jun Wan

Exploratory Visual Analysis for Increasing Data Readiness in Artificial Intelligence Projects

We present experiences and lessons learned from increasing data readiness of heterogeneous data for artificial intelligence projects using visual analysis methods. Increasing the data readiness level involves understanding both the data as…

Methodology · Statistics 2024-09-09 Mattias Tiger, Daniel Jakobsson, Anders Ynnerman, Fredrik Heintz, Daniel Jönsson

Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata

Data is central to the development and evaluation of machine learning (ML) models. However, the use of problematic or inappropriate datasets can result in harms when the resulting models are deployed. To encourage responsible AI practice…

Human-Computer Interaction · Computer Science 2022-08-25 Amy K. Heger, Liz B. Marquis, Mihaela Vorvoreanu, Hanna Wallach, Jennifer Wortman Vaughan

Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data…

Computation and Language · Computer Science 2025-11-14 Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Ningyu Zhang, Huajun Chen

Towards Next-Generation LLM Training: From the Data-Centric Perspective

Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks and domains, with data playing a central role in enabling these advances. Despite this success, the preparation and effective utilization of…

Computation and Language · Computer Science 2026-03-17 Hao Liang, Zhengyang Zhao, Zhaoyang Han, Meiyi Qiang, Xiaochen Ma, Bohan Zeng, Qifeng Cai, Zhiyu Li, Linpeng Tang, Weinan E, Wentao Zhang

Data Science Methodologies: Current Challenges and Future Approaches

Data science has employed great research efforts in developing advanced analytics, improving data models and cultivating new algorithms. However, not many authors have come across the organizational and socio-technical challenges that arise…

Machine Learning · Computer Science 2022-01-17 Iñigo Martinez, Elisabeth Viles, Igor G. Olaizola

Space Trusted Autonomy Readiness Levels

Technology Readiness Levels are a mainstay for organizations that fund, develop, test, acquire, or use technologies. Technology Readiness Levels provide a standardized assessment of a technology's maturity and enable consistent comparison…

Computers and Society · Computer Science 2022-10-25 Kerianne L. Hobbs, Joseph B. Lyons, Martin S. Feather, Benjamen P Bycroft, Sean Phillips, Michelle Simon, Mark Harter, Kenneth Costello, Yuri Gawdiak, Stephen Paine

Dataset Management Platform for Machine Learning

The quality of the data in a dataset can have a substantial impact on the performance of a machine learning model that is trained and/or evaluated using the dataset. Effective dataset management, including tasks such as data cleanup,…

Databases · Computer Science 2023-03-16 Ze Mao, Yang Xu, Erick Suarez

Unfolding Data Quality Dimensions in Practice: A Survey

Data quality describes the degree to which data meet specific requirements and are fit for use by humans and/or downstream tasks (e.g., artificial intelligence). Data quality can be assessed across multiple high-level concepts called…

Databases · Computer Science 2025-07-24 Vasileios Papastergios, Lisa Ehrlinger, Anastasios Gounaris

Lost in the Pipeline: How Well Do Large Language Models Handle Data Preparation?

Large language models have recently demonstrated their exceptional capabilities in supporting and automating various tasks. Among the tasks worth exploring for testing large language model capabilities, we considered data preparation, a…

Computation and Language · Computer Science 2025-12-01 Matteo Spreafico, Ludovica Tassini, Camilla Sancricca, Cinzia Cappiello

First Study on Data Readiness Level

We introduce the idea of Data Readiness Level (DRL) to measure the relative richness of data to answer specific questions often encountered by data scientists. We first approach the problem in its full generality explaining its desired…

Information Retrieval · Computer Science 2017-02-08 Hui Guan, Thanos Gentimis, Hamid Krim, James Keiser

Technology Readiness Levels for Machine Learning Systems

The development and deployment of machine learning (ML) systems can be executed easily with modern tools, but the process is typically rushed and means-to-an-end. The lack of diligence can lead to technical debt, scope creep and misaligned…

Machine Learning · Computer Science 2023-01-11 Alexander Lavin, Ciarán M. Gilligan-Lee, Alessya Visnjic, Siddha Ganju, Dava Newman, Atılım Güneş Baydin, Sujoy Ganguly, Danny Lange, Amit Sharma, Stephan Zheng, Eric P. Xing, Adam Gibson, James Parr, Chris Mattmann, Yarin Gal

Collaboration Challenges in Building ML-Enabled Systems: Communication, Documentation, Engineering, and Process

The introduction of machine learning (ML) components in software projects has created the need for software engineers to collaborate with data scientists and other specialists. While collaboration can always be challenging, ML introduces…

Software Engineering · Computer Science 2022-02-14 Nadia Nahar, Shurui Zhou, Grace Lewis, Christian Kästner

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Data cleaning is the initial stage of any machine learning project and is one of the most critical processes in data analysis. It is a critical step in ensuring that the dataset is devoid of incorrect or erroneous data. It can be done…

Databases · Computer Science 2021-09-16 Ga Young Lee, Lubna Alzamil, Bakhtiyar Doskenov, Arash Termehchy

Towards Next Generation Data Engineering Pipelines

Data engineering pipelines are a widespread way to provide high-quality data for all kinds of data science applications. However, numerous challenges still remain in the composition and operation of such pipelines. Data engineering…

Databases · Computer Science 2025-07-30 Kevin M. Kramer, Valerie Restat, Sebastian Strasser, Uta Störl, Meike Klettke