Related papers: Data Cleaning Using Large Language Models

Cocoon: Semantic Table Profiling Using Large Language Models

Data profilers play a crucial role in the preprocessing phase of data analysis by identifying quality issues such as missing, extreme, or erroneous values. Traditionally, profilers have relied solely on statistical methods, which lead to…

Databases · Computer Science 2024-04-22 Zezhou Huang, Eugene Wu

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Data cleaning is the initial stage of any machine learning project and is one of the most critical processes in data analysis. It is a critical step in ensuring that the dataset is devoid of incorrect or erroneous data. It can be done…

Databases · Computer Science 2021-09-16 Ga Young Lee, Lubna Alzamil, Bakhtiyar Doskenov, Arash Termehchy

Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets

High-quality, error-free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models. However, real world datasets often suffer from errors due to sensor malfunctions, data entry mistakes, or…

Machine Learning · Computer Science 2025-03-11 Tommaso Bendinelli, Artur Dox, Christian Holz

Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach

The wide use of machine learning is fundamentally changing the software development paradigm (a.k.a. Software 2.0) where data becomes a first-class citizen, on par with code. As machine learning is used in sensitive applications, it becomes…

Databases · Computer Science 2019-04-25 Ki Hyun Tae, Yuji Roh, Young Hun Oh, Hyunsu Kim, Steven Euijong Whang

Human-Centric Data Cleaning [Vision]

Data Cleaning refers to the process of detecting and fixing errors in the data. Human involvement is instrumental at several stages of this process, e.g., to identify and repair errors, to validate computed repairs, etc. There is currently…

Databases · Computer Science 2018-01-03 El Kindi Rezig, Mourad Ouzzani, Ahmed K. Elmagarmid, Walid G. Aref

Flowco: Rethinking Data Analysis in the Age of LLMs

Conducting data analysis typically involves authoring code to transform, visualize, analyze, and interpret data. Large language models (LLMs) are now capable of generating such code for simple, routine analyses. LLMs promise to democratize…

Human-Computer Interaction · Computer Science 2025-04-22 Stephen N. Freund, Brooke Simon, Emery D. Berger, Eunice Jun

Data Cleaning and Machine Learning: A Systematic Literature Review

Context: Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model is highly dependent on the quality of the data it has been trained on, there is a growing…

Machine Learning · Computer Science 2024-06-03 Pierre-Olivier Côté, Amin Nikanjam, Nafisa Ahmed, Dmytro Humeniuk, Foutse Khomh

Step-by-Step Data Cleaning Recommendations to Improve ML Prediction Accuracy

Data quality is crucial in machine learning (ML) applications, as errors in the data can significantly impact the prediction accuracy of the underlying ML model. Therefore, data cleaning is an integral component of any ML pipeline. However,…

Databases · Computer Science 2025-03-17 Sedir Mohammed, Felix Naumann, Hazar Harmouch

LLMClean: Context-Aware Tabular Data Cleaning via LLM-Generated OFDs

Machine learning's influence is expanding rapidly, now integral to decision-making processes from corporate strategy to the advancements in Industry 4.0. The efficacy of Artificial Intelligence broadly hinges on the caliber of data used…

Databases · Computer Science 2024-04-30 Fabian Biester, Mohamed Abdelaal, Daniel Del Gaudio

Lingua Manga: A Generic Large Language Model Centric System for Data Curation

Data curation is a wide-ranging area which contains many critical but time-consuming data processing tasks. However, the diversity of such tasks makes it challenging to develop a general-purpose data curation system. To address this issue,…

Databases · Computer Science 2023-09-04 Zui Chen, Lei Cao, Sam Madden

CODECLEANER: Elevating Standards with A Robust Data Contamination Mitigation Toolkit

Data contamination presents a critical barrier preventing widespread industrial adoption of advanced software engineering techniques that leverage code language models (CLMs). This phenomenon occurs when evaluation data inadvertently…

Software Engineering · Computer Science 2024-11-19 Jialun Cao, Songqiang Chen, Wuqi Zhang, Hau Ching Lo, Shing-Chi Cheung

A Hybrid Data Cleaning Framework using Markov Logic Networks

With the increase of dirty data, data cleaning turns into a crux of data analysis. Most of the existing algorithms rely on either qualitative techniques (e.g., data rules) or quantitative ones (e.g., statistical methods). In this paper, we…

Databases · Computer Science 2019-03-15 Yunjun Gao, Congcong Ge, Xiaoye Miao, Haobo Wang, Bin Yao, Qing Li

Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment

Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. While various…

Artificial Intelligence · Computer Science 2025-10-15 Samuel Yeh, Sharon Li

An Open Source Data Contamination Report for Large Language Models

Data contamination in model evaluation has become increasingly prevalent with the growing popularity of large language models. It allows models to "cheat" via memorisation instead of displaying true capabilities. Therefore, contamination…

Computation and Language · Computer Science 2024-01-30 Yucheng Li, Frank Guerin, Chenghua Lin

A Survey on Data Contamination for Large Language Models

Recent advancements in Large Language Models (LLMs) have demonstrated significant progress in various areas, such as text generation and code synthesis. However, the reliability of performance evaluation has come under scrutiny due to data…

Computation and Language · Computer Science 2025-06-06 Yuxing Cheng, Yi Chang, Yuan Wu

A Survey on Data Selection for Language Models

A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as…

Computation and Language · Computer Science 2024-08-05 Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang

Can Large Language Models Replace Data Scientists in Biomedical Research?

Data science plays a critical role in biomedical research, but it requires professionals with expertise in coding and medical data analysis. Large language models (LLMs) have shown great potential in supporting medical tasks and performing…

Artificial Intelligence · Computer Science 2025-04-10 Zifeng Wang, Benjamin Danek, Ziwei Yang, Zheng Chen, Jimeng Sun

Data Wrangling Task Automation Using Code-Generating Language Models

Ensuring data quality in large tabular datasets is a critical challenge, typically addressed through data wrangling tasks. Traditional statistical methods, though efficient, cannot often understand the semantic context and deep learning…

Machine Learning · Computer Science 2025-02-25 Ashlesha Akella, Krishnasuri Narayanam

Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models

Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets, allowing the trained model to outperform baselines trained on the full dataset. However, the expanding body of research…

Computation and Language · Computer Science 2025-02-25 Ziche Liu, Rui Ke, Yajiao Liu, Feng Jiang, Haizhou Li

Classifying Proposals of Decentralized Autonomous Organizations Using Large Language Models

Our study demonstrates the effective use of Large Language Models (LLMs) for automating the classification of complex datasets. We specifically target proposals of Decentralized Autonomous Organizations (DAOs), as the classification of…

Computers and Society · Computer Science 2024-07-04 Christian Ziegler, Marcos Miranda, Guangye Cao, Gustav Arentoft, Doo Wan Nam