Related papers: Badgers: generating data quality deficits with Pyt…

Data Quality Toolkit: Automatic assessment of data quality and remediation for machine learning datasets

The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Various tools and techniques are available that assess data quality with respect to general cleaning and profiling checks.…

Machine Learning · Computer Science 2021-09-07 Nitin Gupta, Hima Patel, Shazia Afzal, Naveen Panwar, Ruhi Sharma Mittal, Shanmukha Guttula, Abhinav Jain, Lokesh Nagalapatti, Sameep Mehta, Sandeep Hans, Pranay Lohia, Aniya Aggarwal, Diptikalyan Saha

PuckTrick: A Library for Making Synthetic Data More Realistic

The increasing reliance on machine learning (ML) models for decision-making requires high-quality training data. However, access to real-world datasets is often restricted due to privacy concerns, proprietary restrictions, and incomplete…

Machine Learning · Computer Science 2026-04-30 Alessandra Agostini, Andrea Maurino, Blerina Spahiu

Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning

Imbalanced-learn is an open-source Python toolbox aiming at providing a wide range of methods to cope with the problem of imbalanced datasets frequently encountered in machine learning and pattern recognition. The implemented…

Machine Learning · Computer Science 2016-09-22 Guillaume Lemaitre, Fernando Nogueira, Christos K. Aridas

tsBNgen: A Python Library to Generate Time Series Data from an Arbitrary Dynamic Bayesian Network Structure

Synthetic data is widely used in various domains. This is because many modern algorithms require lots of data for efficient training, and data collection and labeling are usually time-consuming and prone to errors.…

Machine Learning · Computer Science 2020-09-11 Manie Tadayon, Greg Pottie

balance -- a Python package for balancing biased data samples

Surveys are an important research tool, providing unique measurements on subjective experiences such as sentiment and opinions that cannot be measured by other means. However, because survey data is collected from a self-selected group of…

Computation · Statistics 2023-07-14 Tal Sarig, Tal Galili, Roee Eilat

NApy: Efficient Statistics in Python for Large-Scale Heterogeneous Data with Enhanced Support for Missing Data

Existing Python libraries and tools lack the ability to efficiently compute statistical test results for large datasets in the presence of missing values. This presents an issue as soon as constraints on runtime and memory availability…

Mathematical Software · Computer Science 2025-05-02 Fabian Woller, Lis Arend, Christian Fuchsberger, Markus List, David B. Blumenthal

Outlier Detection for Improved Data Quality and Diversity in Dialog Systems

In a corpus of data, outliers are either errors: mistakes in the data that are counterproductive, or are unique: informative samples that improve model robustness. Identifying outliers can lead to better datasets by (1) removing noise in…

Computation and Language · Computer Science 2019-04-08 Stefan Larson, Anish Mahendran, Andrew Lee, Jonathan K. Kummerfeld, Parker Hill, Michael A. Laurenzano, Johann Hauswald, Lingjia Tang, Jason Mars

PyResBugs: A Dataset of Residual Python Bugs for Natural Language-Driven Fault Injection

This paper presents PyResBugs, a curated dataset of residual bugs, i.e., defects that persist undetected during traditional testing but later surface in production, collected from major Python frameworks. Each bug in the dataset is paired…

Software Engineering · Computer Science 2025-05-12 Domenico Cotroneo, Giuseppe De Rosa, Pietro Liguori

A Survey of Bugs in AI-Generated Code

Developers are widely using AI code-generation models, aiming to increase productivity and efficiency. However, there are also quality concerns regarding the AI-generated code. The generated code is produced by models trained on publicly…

Software Engineering · Computer Science 2025-12-08 Ruofan Gao, Amjed Tahir, Peng Liang, Teo Susnjak, Foutse Khomh

Improving Data Quality through Deep Learning and Statistical Models

Traditional data quality control methods are based on users' experience or previously established business rules, which limits performance and makes for a very time-consuming process with lower than desirable accuracy. Utilizing…

Artificial Intelligence · Computer Science 2018-10-17 Wei Dai, Kenji Yoshigoe, William Parsley

PyGOD: A Python Library for Graph Outlier Detection

PyGOD is an open-source Python library for detecting outliers in graph data. As the first comprehensive library of its kind, PyGOD supports a wide array of leading graph-based methods for outlier detection under an easy-to-use,…

Machine Learning · Computer Science 2024-06-04 Kay Liu, Yingtong Dou, Xueying Ding, Xiyang Hu, Ruitong Zhang, Hao Peng, Lichao Sun, Philip S. Yu

News Signals: An NLP Library for Text and Time Series

We present an open-source Python library for building and using datasets where inputs are clusters of textual data, and outputs are sequences of real values representing one or more time series signals. The news-signals library supports…

Computation and Language · Computer Science 2023-12-19 Chris Hokamp, Demian Gholipour Ghalandari, Parsa Ghaffari

chatter: a Python library for applying information theory and AI/ML models to animal communication

The study of animal communication often involves categorizing units into types (e.g. syllables in songbirds, or notes in humpback whales). While this approach is useful in many cases, it necessarily flattens the complexity and nuance…

Sound · Computer Science 2025-12-23 Mason Youngblood

Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets

High-quality, error-free datasets are a key ingredient in building reliable, accurate, and unbiased machine learning (ML) models. However, real-world datasets often suffer from errors due to sensor malfunctions, data entry mistakes, or…

Machine Learning · Computer Science 2025-03-11 Tommaso Bendinelli, Artur Dox, Christian Holz

Augraphy: A Data Augmentation Library for Document Images

This paper introduces Augraphy, a Python library for constructing data augmentation pipelines which produce distortions commonly seen in real-world document image datasets. Augraphy stands apart from other data augmentation tools by…

Computer Vision and Pattern Recognition · Computer Science 2023-03-28 Alexander Groleau, Kok Wei Chee, Stefan Larson, Samay Maini, Jonathan Boarman

Detecting Quality Problems in Data Models by Clustering Heterogeneous Data Values

Data is of high quality if it is fit for its intended use. The quality of data is influenced by the underlying data model and its quality. One major quality problem is the heterogeneity of data as quality aspects such as understandability…

Machine Learning · Computer Science 2021-11-15 Viola Wenz, Arno Kesper, Gabriele Taentzer

PyODDS: An End-to-End Outlier Detection System

PyODDS is an end-to-end Python system for outlier detection with database support. PyODDS provides outlier detection algorithms which meet the demands of users in different fields, with or without a data science or machine learning background. PyODDS…

Machine Learning · Computer Science 2019-10-14 Yuening Li, Daochen Zha, Na Zou, Xia Hu

When in Doubt, Cascade: Towards Building Efficient and Capable Guardrails

Large language models (LLMs) have convincing performance in a variety of downstream tasks. However, these systems are prone to generating undesirable outputs such as harmful and biased text. In order to remedy such generations, the…

Computation and Language · Computer Science 2025-08-08 Manish Nagireddy, Inkit Padhi, Soumya Ghosh, Prasanna Sattigeri

pyribs: A Bare-Bones Python Library for Quality Diversity Optimization

Recent years have seen a rise in the popularity of quality diversity (QD) optimization, a branch of optimization that seeks to find a collection of diverse, high-performing solutions to a given problem. To grow further, we believe the QD…

Neural and Evolutionary Computing · Computer Science 2023-04-18 Bryon Tjanaka, Matthew C. Fontaine, David H. Lee, Yulun Zhang, Nivedit Reddy Balam, Nathaniel Dennler, Sujay S. Garlanka, Nikitas Dimitri Klapsis, Stefanos Nikolaidis

Quality Assessment of Tabular Data using Large Language Models and Code Generation

Reliable data quality is crucial for downstream analysis of tabular datasets, yet rule-based validation often struggles with inefficiency, human intervention, and high computational costs. We present a three-stage framework that combines…

Software Engineering · Computer Science 2025-09-23 Ashlesha Akella, Akshar Kaul, Krishnasuri Narayanam, Sameep Mehta