Related papers: Dataset Management Platform for Machine Learning
Machine learning is now used in many applications thanks to its ability to predict, generate, or discover patterns from large quantities of data. However, the process of collecting and transforming data for practical use is intricate. Even…
Datasets have played a foundational role in the advancement of machine learning research. They form the basis for the models we design and deploy, as well as our primary medium for benchmarking and evaluation. Furthermore, the ways in which…
Dataset distillation has emerged as a strategy to overcome the hurdles associated with large datasets by learning a compact set of synthetic data that retains essential information from the original dataset. While distilled data can be used…
This document gives a set of recommendations to build and manipulate the datasets used to develop and/or validate machine learning models such as deep neural networks. This document is one of the 3 documents defined in [1] to ensure the…
Data exploration and quality analysis is an important yet tedious process in the AI pipeline. Current practices of data cleaning and data readiness assessment for machine learning tasks are mostly conducted in an arbitrary manner which…
In this paper, we delve into the critical aspect of dataset quality assessment in machine learning classification tasks. Leveraging a variety of nine distinct datasets, each crafted for classification tasks with varying complexity levels,…
Camera images are ubiquitous in machine learning research. They also play a central role in the delivery of important services spanning medicine and environmental surveying. However, the application of machine learning models in these…
Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here software engineering needs to be re-thought where data…
The machine learning community currently has no standardized process for documenting datasets, which can lead to severe consequences in high-stakes domains. To address this gap, we propose datasheets for datasets. In the electronics…
Imitation learning from large multi-task demonstration datasets has emerged as a promising path for building generally-capable robots. As a result, 1000s of hours have been spent on building such large-scale datasets around the globe.…
Machine learning datasets are powerful but unwieldy. Despite the fact that large datasets commonly contain problematic material--whether from a technical, legal, or ethical perspective--datasets are valuable resources when handled carefully…
As machine learning systems grow in scale, so do their training data requirements, forcing practitioners to automate and outsource the curation of training data in order to achieve state-of-the-art performance. The absence of trustworthy…
The performance of machine learning models relies heavily on the quality of input data, yet real-world applications often face significant data-related challenges. A common issue arises when curating training data or deploying models: two…
Rising concern for the societal implications of artificial intelligence systems has inspired demands for greater transparency and accountability. However the datasets which empower machine learning are often used, shared and re-used with…
The quality of underlying training data is very crucial for building performant machine learning models with wider generalizabilty. However, current machine learning (ML) tools lack streamlined processes for improving the data quality. So,…
Data cleaning is the initial stage of any machine learning project and is one of the most critical processes in data analysis. It is a critical step in ensuring that the dataset is devoid of incorrect or erroneous data. It can be done…
Machine learning inference is increasingly being executed locally on mobile and embedded platforms, due to the clear advantages in latency, privacy and connectivity. In this paper, we present approaches for online resource management in…
Data mining is about obtaining new knowledge from existing datasets. However, the data in the existing datasets can be scattered, noisy, and even incomplete. Although lots of effort is spent on developing or fine-tuning data mining models…
Workflow technology is rapidly evolving and, rather than being limited to modeling the control flow in business processes, is becoming a key mechanism to perform advanced data management, such as big data analytics. This survey focuses on…
Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many…