Related papers: Balanced Mixed-Type Tabular Data Synthesis with Di…

200 papers

Training data has been proven to be one of the most critical components in training generative AI. However, obtaining high-quality data remains challenging, with data privacy issues presenting a significant hurdle. To address the need for…

Computation and Language · Computer Science 2025-06-18 Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong, Chun-Ming Xia, Fei Dai

Synthesizing high-quality tabular data is an important topic in many data science tasks, ranging from dataset augmentation to privacy protection. However, developing expressive generative models for tabular data is challenging due to its…

Machine Learning · Computer Science 2025-02-18 Juntong Shi, Minkai Xu, Harper Hua, Hengrui Zhang, Stefano Ermon, Jure Leskovec

Diffusion model has become a main paradigm for synthetic data generation in many subfields of modern machine learning, including computer vision, language model, or speech synthesis. In this paper, we leverage the power of diffusion model…

Machine Learning · Statistics 2023-11-20 Namjoon Suh, Xiaofeng Lin, Din-Yin Hsieh, Merhdad Honarkhah, Guang Cheng

Generating synthetic tabular data is critical in machine learning, especially when real data is limited or sensitive. Traditional generative models often face challenges due to the unique characteristics of tabular data, such as mixed data…

Machine Learning · Computer Science 2024-10-30 Vitaliy Kinakh, Slava Voloshynovskiy

Due to their data-driven nature, Machine Learning (ML) models are susceptible to bias inherited from data, especially in classification problems where class and group imbalances are prevalent. Class imbalance (in the classification target)…

Machine Learning · Computer Science 2024-09-10 Emmanouil Panagiotou, Arjun Roy, Eirini Ntoutsi

AI fairness seeks to improve the transparency and explainability of AI systems by ensuring that their outcomes genuinely reflect the best interests of users. Data augmentation, which involves generating synthetic data from existing…

Machine Learning · Computer Science 2024-10-22 Christina Hastings Blow, Lijun Qian, Camille Gibson, Pamela Obiomon, Xishuang Dong

Data imputation and data generation have important applications for many domains, like healthcare and finance, where incomplete or missing data can hinder accurate analysis and decision-making. Diffusion models have emerged as powerful…

Machine Learning · Computer Science 2025-06-10 Mario Villaizán-Vallelado, Matteo Salvatori, Carlos Segura, Ioannis Arapakis

Generative AI models have recently achieved astonishing results in quality and are consequently employed in a fast-growing number of applications. However, since they are highly data-driven, relying on billion-sized datasets randomly…

Deep generative models have made rapid progress in image, text, audio, and video generation, and are increasingly being applied to structured records. For tabular data, however, generative modeling remains difficult: a dataset may contain…

Machine Learning · Computer Science 2026-05-25 Zhong Li, Qi Huang, Lincen Yang, Jiayang Shi, Zhao Yang, Niki van Stein, Thomas Bäck, Matthijs van Leeuwen

Recent advances in tabular data generation have greatly enhanced synthetic data quality. However, extending diffusion models to tabular data is challenging due to the intricately varied distributions and a blend of data types of tabular…

Diffusion-based tabular data synthesis models have yielded promising results. However, when the data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models. This is…

Machine Learning · Computer Science 2025-11-12 Zuqing Li, Junhao Gan, Jianzhong Qi

Score-based generative models, commonly referred to as diffusion models, have proven to be successful at generating text and image data. However, their adaptation to mixed-type tabular data remains underexplored. In this work, we propose…

Machine Learning · Computer Science 2026-03-27 Markus Mueller, Kathrin Gruber, Dennis Fok

Diffusion models have shown their effectiveness in generation tasks by well-approximating the underlying probability distribution. However, diffusion models are known to suffer from an amplified inherent bias from the training data in terms…

Machine Learning · Computer Science 2024-10-04 Yujin Choi, Jinseong Park, Hoki Kim, Jaewook Lee, Saerom Park

Tabular data is one of the most prevalent and important data formats in real-world applications such as healthcare, finance, and education. However, its effective use in machine learning is often constrained by data scarcity, privacy…

Machine Learning · Computer Science 2025-07-18 Ruxue Shi, Yili Wang, Mengnan Du, Xu Shen, Yi Chang, Xin Wang

Recent progress in generative AI, especially diffusion models, has demonstrated significant utility in text-to-image synthesis. Particularly in healthcare, these models offer immense potential in generating synthetic datasets and training…

Computer Vision and Pattern Recognition · Computer Science 2025-04-08 Yan Luo, Muhammad Osama Khan, Congcong Wen, Muhammad Muneeb Afzal, Titus Fidelis Wuermeling, Min Shi, Yu Tian, Yi Fang, Mengyu Wang

Realistic synthetic tabular data generation encounters significant challenges in preserving privacy, especially when dealing with sensitive information in domains like finance and healthcare. In this paper, we introduce Federated…

Machine Learning · Computer Science 2024-01-15 Timur Sattarov, Marco Schreyer, Damian Borth

The sharing of microdata, such as fund holdings and derivative instruments, by regulatory institutions presents a unique challenge due to strict data confidentiality and privacy regulations. These challenges often hinder the ability of both…

Machine Learning · Computer Science 2023-09-06 Timur Sattarov, Marco Schreyer, Damian Borth

The increasing demand for privacy-preserving data analytics in various domains necessitates solutions for synthetic data generation that rigorously uphold privacy standards. We introduce the DP-FedTabDiff framework, a novel integration of…

Machine Learning · Computer Science 2025-09-01 Timur Sattarov, Marco Schreyer, Damian Borth

With the advent of generative modeling techniques, synthetic data and its use has penetrated across various domains from unstructured data such as image, text to structured dataset modeling healthcare outcome, risk decisioning in financial…

Machine Learning · Computer Science 2021-05-11 Aman Gupta, Deepak Bhatt, Anubha Pandey

Diffusion models have been the predominant generative model for tabular data generation. However, they face the conundrum of modeling under a separate versus a unified data representation. The former encounters the challenge of jointly…

Machine Learning · Computer Science 2025-12-23 Jacob Si, Zijing Ou, Mike Qu, Zhengrui Xiang, Yingzhen Li