Related papers: Datasheets for Datasets

Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure

Rising concern for the societal implications of artificial intelligence systems has inspired demands for greater transparency and accountability. However the datasets which empower machine learning are often used, shared and re-used with…

Machine Learning · Computer Science 2021-02-02 Ben Hutchinson, Andrew Smart, Alex Hanna, Emily Denton, Christina Greer, Oddur Kjartansson, Parker Barnes, Margaret Mitchell

Datasheets for Machine Learning Sensors

Machine learning (ML) is becoming prevalent in embedded AI sensing systems. These "ML sensors" enable context-sensitive, real-time data collection and decision-making across diverse applications ranging from anomaly detection in industrial…

Machine Learning · Computer Science 2025-10-29 Matthew Stewart, Yuke Zhang, Pete Warden, Yasmine Omri, Shvetank Prakash, Jacob Huckelberry, Joao Henrique Santos, Shawn Hymel, Benjamin Yeager Brown, Jim MacArthur, Nat Jeffries, Emanuel Moss, Mona Sloane, Brian Plancher, Vijay Janapa Reddi

AI Competitions and Benchmarks: Dataset Development

Machine learning is now used in many applications thanks to its ability to predict, generate, or discover patterns from large quantities of data. However, the process of collecting and transforming data for practical use is intricate. Even…

Machine Learning · Computer Science 2024-04-16 Romain Egele, Julio C. S. Jacques Junior, Jan N. van Rijn, Isabelle Guyon, Xavier Baró, Albert Clapés, Prasanna Balaprakash, Sergio Escalera, Thomas Moeslund, Jun Wan

Data and its (dis)contents: A survey of dataset development and use in machine learning research

Datasets have played a foundational role in the advancement of machine learning research. They form the basis for the models we design and deploy, as well as our primary medium for benchmarking and evaluation. Furthermore, the ways in which…

Machine Learning · Computer Science 2021-11-16 Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, Alex Hanna

Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI

As research and industry moves towards large-scale models capable of numerous downstream tasks, the complexity of understanding multi-modal datasets that give nuance to models rapidly increases. A clear and thorough understanding of a…

Human-Computer Interaction · Computer Science 2022-04-05 Mahima Pushkarna, Andrew Zaldivar, Oddur Kjartansson

Healthsheet: Development of a Transparency Artifact for Health Datasets

Machine learning (ML) approaches have demonstrated promising results in a wide range of healthcare applications. Data plays a crucial role in developing ML-based healthcare systems that directly affect people's lives. Many of the ethical…

Artificial Intelligence · Computer Science 2022-03-01 Negar Rostamzadeh, Diana Mincu, Subhrajit Roy, Andrew Smart, Lauren Wilcox, Mahima Pushkarna, Jessica Schrouff, Razvan Amironesei, Nyalleng Moorosi, Katherine Heller

Datasheets for Healthcare AI: A Framework for Transparency and Bias Mitigation

The use of AI in healthcare has the potential to improve patient care, optimize clinical workflows, and enhance decision-making. However, bias, data incompleteness, and inaccuracies in training datasets can lead to unfair outcomes and…

Computers and Society · Computer Science 2025-01-13 Marjia Siddik, Harshvardhan J. Pandit

Network Report: A Structured Description for Network Datasets

The rapid development of network science and technologies depends on shareable datasets. Currently, there is no standard practice for reporting and sharing network datasets. Some network dataset providers only share links, while others…

Social and Information Networks · Computer Science 2022-06-09 Xinyi Zheng, Ryan A. Rossi, Nesreen Ahmed, Dominik Moritz

Dataset Management Platform for Machine Learning

The quality of the data in a dataset can have a substantial impact on the performance of a machine learning model that is trained and/or evaluated using the dataset. Effective dataset management, including tasks such as data cleanup,…

Databases · Computer Science 2023-03-16 Ze Mao, Yang Xu, Erick Suarez

Position Paper on Dataset Engineering to Accelerate Science

Data is a critical element in any discovery process. In the last decades, we observed exponential growth in the volume of available data and the technology to manipulate it. However, data is only practical when one can structure it for a…

Machine Learning · Computer Science 2023-03-13 Emilio Vital Brazil, Eduardo Soares, Lucas Villa Real, Leonardo Azevedo, Vinicius Segura, Luiz Zerkowski, Renato Cerqueira

MetaLead: A Comprehensive Human-Curated Leaderboard Dataset for Transparent Reporting of Machine Learning Experiments

Leaderboards are crucial in the machine learning (ML) domain for benchmarking and tracking progress. However, creating leaderboards traditionally demands significant manual effort. In recent years, efforts have been made to automate…

Machine Learning · Computer Science 2026-02-02 Roelien C. Timmer, Necva Bölücü, Stephen Wan

Synthetic Data for Feature Selection

Feature selection is an important and active field of research in machine learning and data science. Our goal in this paper is to propose a collection of synthetic datasets that can be used as a common reference point for feature selection…

Machine Learning · Computer Science 2022-11-08 Firuz Kamalov, Hana Sulieman, Aswani Kumar Cherukuri

Completeness of Datasets Documentation on ML/AI repositories: an Empirical Investigation

ML/AI is the field of computer science and computer engineering that arguably received the most attention and funding over the last decade. Data is the key element of ML/AI, so it is becoming increasingly important to ensure that users are…

Digital Libraries · Computer Science 2025-03-19 Marco Rondina, Antonio Vetrò, Juan Carlos De Martin

Dataset Definition Standard (DDS)

This document gives a set of recommendations to build and manipulate the datasets used to develop and/or validate machine learning models such as deep neural networks. This document is one of the 3 documents defined in [1] to ensure the…

Databases · Computer Science 2021-01-11 Cyril Cappi, Camille Chapdelaine, Laurent Gardes, Eric Jenn, Baptiste Lefevre, Sylvaine Picard, Thomas Soumarmon

Datasets of Visualization for Machine Learning

Datasets of visualization play a crucial role in automating data-driven visualization pipelines, serving as the foundation for supervised model training and algorithm benchmarking. In this paper, we survey the literature on visualization…

Human-Computer Interaction · Computer Science 2024-07-24 Can Liu, Ruike Jiang, Shaocong Tan, Jiacheng Yu, Chaofan Yang, Hanning Shao, Xiaoru Yuan

Dataset search: a survey

Generating value from data requires the ability to find, access and make sense of datasets. There are many efforts underway to encourage data sharing and reuse, from scientific publishers asking authors to submit data alongside manuscripts…

Databases · Computer Science 2022-11-10 Adriane Chapman, Elena Simperl, Laura Koesten, George Konstantinidis, Luis-Daniel Ibáñez-Gonzalez, Emilia Kacprzak, Paul Groth

A Survey on Dataset Distillation: Approaches, Applications and Future Directions

Dataset distillation is attracting more attention in machine learning as training sets continue to grow and the cost of training state-of-the-art models becomes increasingly high. By synthesizing datasets with high information density,…

Machine Learning · Computer Science 2023-08-25 Jiahui Geng, Zongxiong Chen, Yuandou Wang, Herbert Woisetschlaeger, Sonja Schimmler, Ruben Mayer, Zhiming Zhao, Chunming Rong

Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata

Data is central to the development and evaluation of machine learning (ML) models. However, the use of problematic or inappropriate datasets can result in harms when the resulting models are deployed. To encourage responsible AI practice…

Human-Computer Interaction · Computer Science 2022-08-25 Amy K. Heger, Liz B. Marquis, Mihaela Vorvoreanu, Hanna Wallach, Jennifer Wortman Vaughan

A Framework for Deprecating Datasets: Standardizing Documentation, Identification, and Communication

Datasets are central to training machine learning (ML) models. The ML community has recently made significant improvements to data stewardship and documentation practices across the model development life cycle. However, the act of…

Computers and Society · Computer Science 2022-05-11 Alexandra Sasha Luccioni, Frances Corry, Hamsini Sridharan, Mike Ananny, Jason Schultz, Kate Crawford

A Critical Field Guide for Working with Machine Learning Datasets

Machine learning datasets are powerful but unwieldy. Despite the fact that large datasets commonly contain problematic material--whether from a technical, legal, or ethical perspective--datasets are valuable resources when handled carefully…

Computers and Society · Computer Science 2025-01-28 Sarah Ciston, Mike Ananny, Kate Crawford