Related papers: Learning Over Dirty Data Without Cleaning

Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach

The wide use of machine learning is fundamentally changing the software development paradigm (a.k.a. Software 2.0) where data becomes a first-class citizen, on par with code. As machine learning is used in sensitive applications, it becomes…

Databases · Computer Science · 2019-04-25 · Ki Hyun Tae, Yuji Roh, Young Hun Oh, Hyunsu Kim, Steven Euijong Whang

ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models

Data cleaning is often an important step to ensure that predictive models, such as regression and classification, are not affected by systematic errors such as inconsistent, out-of-date, or outlier data. Identifying dirty data is often a…

Databases · Computer Science · 2016-01-18 · Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, Ken Goldberg

Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets

High-quality, error-free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models. However, real world datasets often suffer from errors due to sensor malfunctions, data entry mistakes, or…

Machine Learning · Computer Science · 2025-03-11 · Tommaso Bendinelli, Artur Dox, Christian Holz

Learn, Unlearn and Relearn: An Online Learning Paradigm for Deep Neural Networks

Deep neural networks (DNNs) are often trained on the premise that the complete training data set is provided ahead of time. However, in real-world scenarios, data often arrive in chunks over time. This leads to important considerations…

Machine Learning · Computer Science · 2023-03-21 · Vijaya Raghavan T. Ramkumar, Elahe Arani, Bahram Zonooz

DeepDB: Learn from Data, not from Queries!

The typical approach for learned DBMS components is to capture the behavior by running a representative set of queries and use the observations to train a machine learning model. This workload-driven approach, however, has two major…

Databases · Computer Science · 2019-09-04 · Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, Carsten Binnig

Learning Models over Relational Data: A Brief Tutorial

This tutorial overviews the state of the art in learning models over relational databases and makes the case for a first-principles approach that exploits recent developments in database research. The input to learning classification and…

Databases · Computer Science · 2019-11-18 · Maximilian Schleich, Dan Olteanu, Mahmoud Abo-Khamis, Hung Q. Ngo, XuanLong Nguyen

Deep Self-Learning From Noisy Labels

ConvNets achieve good results when training from clean data, but learning from noisy labels significantly degrades performances and remains challenging. Unlike previous works constrained by many conditions, making them infeasible to real…

Computer Vision and Pattern Recognition · Computer Science · 2019-08-21 · Jiangfan Han, Ping Luo, Xiaogang Wang

Learning Relational Tabular Data without Shared Features

Learning relational tabular data has gained significant attention recently, but most studies focus on single tables, overlooking the potential of cross-table learning. Cross-table learning, especially in scenarios where tables lack shared…

Machine Learning · Computer Science · 2025-02-17 · Zhaomin Wu, Shida Wang, Ziyang Wang, Bingsheng He

Distance-based Data Cleaning: A Survey (Technical Report)

With the rapid development of the internet technology, dirty data are commonly observed in various real scenarios, e.g., owing to unreliable sensor reading, transmission and collection from heterogeneous sources. To deal with their negative…

Databases · Computer Science · 2020-11-24 · Yu Sun, Jian Zhang

An Effective Data-Driven Approach for Localizing Deep Learning Faults

Deep Learning (DL) applications are being used to solve problems in critical domains (e.g., autonomous driving or medical diagnosis systems). Thus, developers need to debug their systems to ensure that the expected behavior is delivered.…

Software Engineering · Computer Science · 2023-07-19 · Mohammad Wardat, Breno Dantas Cruz, Wei Le, Hridesh Rajan

Pattern-Driven Data Cleaning

Data is inherently dirty and there has been a sustained effort to come up with different approaches to clean it. A large class of data repair algorithms rely on data-quality rules and integrity constraints to detect and repair the data. A…

Databases · Computer Science · 2017-12-29 · El Kindi Rezig, Mourad Ouzzani, Walid G. Aref, Ahmed K. Elmagarmid, Ahmed R. Mahmood

Model Debiasing by Learnable Data Augmentation

Deep Neural Networks are well known for efficiently fitting training data, yet experiencing poor generalization capabilities whenever some kind of bias dominates over the actual task labels, resulting in models learning "shortcuts". In…

Machine Learning · Computer Science · 2024-08-12 · Pietro Morerio, Ruggero Ragonesi, Vittorio Murino

Few Clean Instances Help Denoising Distant Supervision

Existing distantly supervised relation extractors usually rely on noisy data for both model training and evaluation, which may lead to garbage-in-garbage-out systems. To alleviate the problem, we study whether a small clean dataset could…

Computation and Language · Computer Science · 2022-09-15 · Yufang Liu, Ziyin Huang, Yijun Wang, Changzhi Sun, Man Lan, Yuanbin Wu, Xiaofeng Mou, Ding Wang

Relational Deep Learning: Graph Representation Learning on Relational Databases

Much of the world's most valued data is stored in relational databases and data warehouses, where the data is organized into many tables connected by primary-foreign key relations. However, building machine learning models using this data…

Machine Learning · Computer Science · 2023-12-11 · Matthias Fey, Weihua Hu, Kexin Huang, Jan Eric Lenssen, Rishabh Ranjan, Joshua Robinson, Rex Ying, Jiaxuan You, Jure Leskovec

Exploring Learning Complexity for Efficient Downstream Dataset Pruning

The ever-increasing fine-tuning cost of large-scale pre-trained models gives rise to the importance of dataset pruning, which aims to reduce dataset size while maintaining task performance. However, existing dataset pruning methods require…

Machine Learning · Computer Science · 2025-05-09 · Wenyu Jiang, Zhenlong Liu, Zejian Xie, Songxin Zhang, Bingyi Jing, Hongxin Wei

Database Meets Deep Learning: Challenges and Opportunities

Deep learning has recently become very popular on account of its incredible success in many complex data-driven applications, such as image classification and speech recognition. The database community has worked on data-driven applications…

Databases · Computer Science · 2020-01-22 · Wei Wang, Meihui Zhang, Gang Chen, H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan

Automated Data Curation for Robust Language Model Fine-Tuning

Large Language Models have become the de facto approach to sequence-to-sequence text generation tasks, but for specialized tasks/domains, a pretrained LLM lacks specific capabilities to produce accurate or well-formatted responses.…

Computation and Language · Computer Science · 2024-03-20 · Jiuhai Chen, Jonas Mueller

Confidence-based Reliable Learning under Dual Noises

Deep neural networks (DNNs) have achieved remarkable success in a variety of computer vision tasks, where massive labeled images are routinely required for model optimization. Yet, the data collected from the open world are unavoidably…

Computer Vision and Pattern Recognition · Computer Science · 2023-02-13 · Peng Cui, Yang Yue, Zhijie Deng, Jun Zhu

DsDm: Model-Aware Dataset Selection with Datamodels

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior.…

Machine Learning · Computer Science · 2024-01-24 · Logan Engstrom, Axel Feldmann, Aleksander Madry

Learning Deep Neural Networks under Agnostic Corrupted Supervision

Training deep neural models in the presence of corrupted supervision is challenging as the corrupted data points may significantly impact the generalization performance. To alleviate this problem, we present an efficient robust algorithm…

Machine Learning · Computer Science · 2021-02-16 · Boyang Liu, Mengying Sun, Ding Wang, Pang-Ning Tan, Jiayu Zhou