Related papers: Generative Data Refinement: Just Ask for Better Da…

Regularizing Neural Networks with Meta-Learning Generative Models

This paper investigates methods for improving generative data augmentation for deep learning. Generative data augmentation leverages the synthetic samples produced by generative models as an additional dataset for classification with small…

Machine Learning · Computer Science 2023-10-24 Shin'ya Yamaguchi, Daiki Chijiwa, Sekitoshi Kanai, Atsutoshi Kumagai, Hisashi Kashima

Boosting Statistic Learning with Synthetic Data from Pretrained Large Models

The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? While they can generate vast amounts of datasets, only a subset meaningfully…

Machine Learning · Statistics 2025-05-09 Jialong Jiang, Wenkang Hu, Jian Huang, Yuling Jiao, Xu Liu

On the Stability of Iterative Retraining of Generative Models on their own Data

Deep generative models have made tremendous progress in modeling complex data, often exhibiting generation quality that surpasses a typical human's ability to discern the authenticity of samples. Undeniably, a key driver of this success is…

Machine Learning · Computer Science 2024-04-03 Quentin Bertrand, Avishek Joey Bose, Alexandre Duplessis, Marco Jiralerspong, Gauthier Gidel

Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences

The rapid progress in generative models has resulted in impressive leaps in generation quality, blurring the lines between synthetic and real data. Web-scale datasets are now prone to the inevitable contamination by synthetic data, directly…

Computer Vision and Pattern Recognition · Computer Science 2024-07-16 Damien Ferbach, Quentin Bertrand, Avishek Joey Bose, Gauthier Gidel

Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition

High-fidelity generative models are increasingly needed in privacy-sensitive scenarios, where access to data is severely restricted due to regulatory and copyright constraints. This scarcity hampers model development--ironically, in…

Computer Vision and Pattern Recognition · Computer Science 2026-04-10 Xuemei Jia, Jiawei Du, Hui Wei, Jun Chen, Joey Tianyi Zhou, Zheng Wang

A Framework for Auditable Synthetic Data Generation

Synthetic data has gained significant momentum thanks to sophisticated machine learning tools that enable the synthesis of high-dimensional datasets. However, many generation techniques do not give the data controller control over what…

Cryptography and Security · Computer Science 2022-11-22 Florimond Houssiau, Samuel N. Cohen, Lukasz Szpruch, Owen Daniel, Michaela G. Lawrence, Robin Mitra, Henry Wilde, Callum Mole

Guided Data Repair

In this paper we present GDR, a Guided Data Repair framework that incorporates user feedback in the cleaning process to enhance and accelerate existing automatic repair techniques while minimizing user involvement. GDR consults the user on…

Databases · Computer Science 2011-03-17 Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani, Ihab F. Ilyas

Harnessing large-language models to generate private synthetic text

Differentially private training algorithms like DP-SGD protect sensitive training data by ensuring that trained models do not reveal private information. An alternative approach, which this paper studies, is to use a sensitive dataset to…

Machine Learning · Computer Science 2024-01-12 Alexey Kurakin, Natalia Ponomareva, Umar Syed, Liam MacDermed, Andreas Terzis

A Unified Framework for Generative Data Augmentation: A Comprehensive Survey

Generative data augmentation (GDA) has emerged as a promising technique to alleviate data scarcity in machine learning applications. This thesis presents a comprehensive survey and unified framework of the GDA landscape. We first provide an…

Machine Learning · Computer Science 2024-04-23 Yunhao Chen, Zihui Yan, Yunjie Zhu

Rethinking Anonymity Claims in Synthetic Data Generation: A Model-Centric Privacy Attack Perspective

Training generative machine learning models to produce synthetic tabular data has become a popular approach for enhancing privacy in data sharing. As this typically involves processing sensitive personal information, releasing either the…

Cryptography and Security · Computer Science 2026-02-02 Georgi Ganev, Emiliano De Cristofaro

Boosting Data Analytics With Synthetic Volume Expansion

Synthetic data generation, a cornerstone of Generative Artificial Intelligence, promotes a paradigm shift in data science by addressing data scarcity and privacy while enabling unprecedented performance. As synthetic data becomes more…

Machine Learning · Statistics 2024-03-12 Xiaotong Shen, Yifei Liu, Rex Shen

Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems

Synthetic datasets are important for evaluating and testing machine learning models. When evaluating real-life recommender systems, high-dimensional categorical (and sparse) datasets are often considered. Unfortunately, there are not many…

Information Retrieval · Computer Science 2024-12-11 Miha Malenšek, Blaž Škrlj, Blaž Mramor, Jure Demšar

How Good Are Synthetic Medical Images? An Empirical Study with Lung Ultrasound

Acquiring large quantities of data and annotations is known to be effective for developing high-performing deep learning models, but is difficult and expensive to do in the healthcare context. Adding synthetic training data using generative…

Image and Video Processing · Electrical Eng. & Systems 2023-10-06 Menghan Yu, Sourabh Kulhare, Courosh Mehanian, Charles B Delahunt, Daniel E Shea, Zohreh Laverriere, Ishan Shah, Matthew P Horning

Generative Models for Synthetic Data: Transforming Data Mining in the GenAI Era

Generative models such as Large Language Models, Diffusion Models, and generative adversarial networks have recently revolutionized the creation of synthetic data, offering scalable solutions to data scarcity, privacy, and annotation…

Machine Learning · Computer Science 2025-08-28 Dawei Li, Yue Huang, Ming Li, Tianyi Zhou, Xiangliang Zhang, Huan Liu

Enabling Synthetic Data adoption in regulated domains

The switch from a Model-Centric to a Data-Centric mindset is putting emphasis on data and its quality rather than algorithms, bringing forward new challenges. In particular, the sensitive nature of the information in highly regulated…

Machine Learning · Computer Science 2022-04-14 Giorgio Visani, Giacomo Graffi, Mattia Alfero, Enrico Bagli, Davide Capuzzo, Federico Chesani

Maximizing the Potential of Synthetic Data: Insights from Random Matrix Theory

Synthetic data has gained attention for training large language models, but poor-quality data can harm performance (see, e.g., Shumailov et al. (2023); Seddik et al. (2024)). A potential solution is data pruning, which retains only…

Machine Learning · Computer Science 2024-10-14 Aymane El Firdoussi, Mohamed El Amine Seddik, Soufiane Hayou, Reda Alami, Ahmed Alzubaidi, Hakim Hacid

Transitioning from Real to Synthetic data: Quantifying the bias in model

With the advent of generative modeling techniques, synthetic data and its use has penetrated across various domains from unstructured data such as image, text to structured dataset modeling healthcare outcome, risk decisioning in financial…

Machine Learning · Computer Science 2021-05-11 Aman Gupta, Deepak Bhatt, Anubha Pandey

Data Pruning in Generative Diffusion Models

Data pruning is the problem of identifying a core subset that is most beneficial to training and discarding the remainder. While pruning strategies are well studied for discriminative models like those used in classification, little…

Machine Learning · Computer Science 2025-03-17 Rania Briq, Jiangtao Wang, Stefan Kesselheim

GDeR: Safeguarding Efficiency, Balancing, and Robustness via Prototypical Graph Pruning

Training high-quality deep models necessitates vast amounts of data, resulting in overwhelming computational and memory demands. Recently, data pruning, distillation, and coreset selection have been developed to streamline data volume by…

Machine Learning · Computer Science 2024-10-18 Guibin Zhang, Haonan Dong, Yuchen Zhang, Zhixun Li, Dingshuo Chen, Kai Wang, Tianlong Chen, Yuxuan Liang, Dawei Cheng, Kun Wang

Examining the Expanding Role of Synthetic Data Throughout the AI Development Pipeline

Alongside the growth of generative AI, we are witnessing a surge in the use of synthetic data across all stages of the AI development pipeline. It is now common practice for researchers and practitioners to use one large generative model…

Human-Computer Interaction · Computer Science 2025-05-14 Shivani Kapania, Stephanie Ballard, Alex Kessler, Jennifer Wortman Vaughan