Related papers

Related papers: Generative Data Refinement: Just Ask for Better Da…

200 papers

This paper investigates methods for improving generative data augmentation for deep learning. Generative data augmentation leverages the synthetic samples produced by generative models as an additional dataset for classification with small…

Machine Learning · Computer Science 2023-10-24 Shin'ya Yamaguchi, Daiki Chijiwa, Sekitoshi Kanai, Atsutoshi Kumagai, Hisashi Kashima

The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? While they can generate vast amounts of datasets, only a subset meaningfully…

Machine Learning · Statistics 2025-05-09 Jialong Jiang, Wenkang Hu, Jian Huang, Yuling Jiao, Xu Liu

Deep generative models have made tremendous progress in modeling complex data, often exhibiting generation quality that surpasses a typical human's ability to discern the authenticity of samples. Undeniably, a key driver of this success is…

Machine Learning · Computer Science 2024-04-03 Quentin Bertrand, Avishek Joey Bose, Alexandre Duplessis, Marco Jiralerspong, Gauthier Gidel

The rapid progress in generative models has resulted in impressive leaps in generation quality, blurring the lines between synthetic and real data. Web-scale datasets are now prone to the inevitable contamination by synthetic data, directly…

Computer Vision and Pattern Recognition · Computer Science 2024-07-16 Damien Ferbach, Quentin Bertrand, Avishek Joey Bose, Gauthier Gidel

High-fidelity generative models are increasingly needed in privacy-sensitive scenarios, where access to data is severely restricted due to regulatory and copyright constraints. This scarcity hampers model development--ironically, in…

Computer Vision and Pattern Recognition · Computer Science 2026-04-10 Xuemei Jia, Jiawei Du, Hui Wei, Jun Chen, Joey Tianyi Zhou, Zheng Wang

Synthetic data has gained significant momentum thanks to sophisticated machine learning tools that enable the synthesis of high-dimensional datasets. However, many generation techniques do not give the data controller control over what…

Cryptography and Security · Computer Science 2022-11-22 Florimond Houssiau, Samuel N. Cohen, Lukasz Szpruch, Owen Daniel, Michaela G. Lawrence, Robin Mitra, Henry Wilde, Callum Mole

In this paper we present GDR, a Guided Data Repair framework that incorporates user feedback in the cleaning process to enhance and accelerate existing automatic repair techniques while minimizing user involvement. GDR consults the user on…

Databases · Computer Science 2011-03-17 Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani, Ihab F. Ilyas

Differentially private training algorithms like DP-SGD protect sensitive training data by ensuring that trained models do not reveal private information. An alternative approach, which this paper studies, is to use a sensitive dataset to…

Machine Learning · Computer Science 2024-01-12 Alexey Kurakin, Natalia Ponomareva, Umar Syed, Liam MacDermed, Andreas Terzis

Generative data augmentation (GDA) has emerged as a promising technique to alleviate data scarcity in machine learning applications. This thesis presents a comprehensive survey and unified framework of the GDA landscape. We first provide an…

Machine Learning · Computer Science 2024-04-23 Yunhao Chen, Zihui Yan, Yunjie Zhu

Training generative machine learning models to produce synthetic tabular data has become a popular approach for enhancing privacy in data sharing. As this typically involves processing sensitive personal information, releasing either the…

Cryptography and Security · Computer Science 2026-02-02 Georgi Ganev, Emiliano De Cristofaro

Synthetic data generation, a cornerstone of Generative Artificial Intelligence, promotes a paradigm shift in data science by addressing data scarcity and privacy while enabling unprecedented performance. As synthetic data becomes more…

Machine Learning · Statistics 2024-03-12 Xiaotong Shen, Yifei Liu, Rex Shen

Synthetic datasets are important for evaluating and testing machine learning models. When evaluating real-life recommender systems, high-dimensional categorical (and sparse) datasets are often considered. Unfortunately, there are not many…

Information Retrieval · Computer Science 2024-12-11 Miha Malenšek, Blaž Škrlj, Blaž Mramor, Jure Demšar

Acquiring large quantities of data and annotations is known to be effective for developing high-performing deep learning models, but is difficult and expensive to do in the healthcare context. Adding synthetic training data using generative…

Image and Video Processing · Electrical Eng. & Systems 2023-10-06 Menghan Yu, Sourabh Kulhare, Courosh Mehanian, Charles B Delahunt, Daniel E Shea, Zohreh Laverriere, Ishan Shah, Matthew P Horning

Generative models such as Large Language Models, Diffusion Models, and generative adversarial networks have recently revolutionized the creation of synthetic data, offering scalable solutions to data scarcity, privacy, and annotation…

Machine Learning · Computer Science 2025-08-28 Dawei Li, Yue Huang, Ming Li, Tianyi Zhou, Xiangliang Zhang, Huan Liu

The switch from a Model-Centric to a Data-Centric mindset is putting emphasis on data and its quality rather than algorithms, bringing forward new challenges. In particular, the sensitive nature of the information in highly regulated…

Machine Learning · Computer Science 2022-04-14 Giorgio Visani, Giacomo Graffi, Mattia Alfero, Enrico Bagli, Davide Capuzzo, Federico Chesani

Synthetic data has gained attention for training large language models, but poor-quality data can harm performance (see, e.g., Shumailov et al. (2023); Seddik et al. (2024)). A potential solution is data pruning, which retains only…

Machine Learning · Computer Science 2024-10-14 Aymane El Firdoussi, Mohamed El Amine Seddik, Soufiane Hayou, Reda Alami, Ahmed Alzubaidi, Hakim Hacid

With the advent of generative modeling techniques, synthetic data and its use has penetrated across various domains from unstructured data such as image, text to structured dataset modeling healthcare outcome, risk decisioning in financial…

Machine Learning · Computer Science 2021-05-11 Aman Gupta, Deepak Bhatt, Anubha Pandey

Data pruning is the problem of identifying a core subset that is most beneficial to training and discarding the remainder. While pruning strategies are well studied for discriminative models like those used in classification, little…

Machine Learning · Computer Science 2025-03-17 Rania Briq, Jiangtao Wang, Stefan Kesselheim

Training high-quality deep models necessitates vast amounts of data, resulting in overwhelming computational and memory demands. Recently, data pruning, distillation, and coreset selection have been developed to streamline data volume by…

Machine Learning · Computer Science 2024-10-18 Guibin Zhang, Haonan Dong, Yuchen Zhang, Zhixun Li, Dingshuo Chen, Kai Wang, Tianlong Chen, Yuxuan Liang, Dawei Cheng, Kun Wang

Alongside the growth of generative AI, we are witnessing a surge in the use of synthetic data across all stages of the AI development pipeline. It is now common practice for researchers and practitioners to use one large generative model…

Human-Computer Interaction · Computer Science 2025-05-14 Shivani Kapania, Stephanie Ballard, Alex Kessler, Jennifer Wortman Vaughan