Related papers: Generative Modeling of Complex Data
Tabular data is common yet typically incomplete, small in volume, and access-restricted due to privacy concerns. Synthetic data generation offers potential solutions. Many metrics exist for evaluating the quality of synthetic tabular data;…
Synthetic data generation has been widely adopted in software testing, data privacy, imbalanced learning, and artificial intelligence explanation. In all such contexts, it is crucial to generate plausible data samples. A common assumption…
Many security and network applications require having large datasets to train the machine learning models. Limited data access is a well-known problem in the security domain. Recent studies have shown the potential of Transformer models to…
This article provides a comprehensive synthesis of the recent developments in synthetic data generation via deep generative models, focusing on tabular datasets. We specifically outline the importance of synthetic data generation in the…
Synthetic data serves as an alternative in training machine learning models, particularly when real-world data is limited or inaccessible. However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging…
Tabular data is a common form of organizing data. Multiple models are available to generate synthetic tabular datasets where observations are independent, but few have the ability to produce relational datasets. Modeling relational data is…
Synthetic tabular data generation has received increasing attention in recent years, particularly with the emergence of foundation models for tabular data. The breakthrough success of TabPFN (Hollmann et al.,2025), which leverages vast…
Two-dimensional array-based datasets are pervasive in a variety of domains. Current approaches for generative modeling have typically been limited to conventional image datasets and performed in the pixel domain which do not explicitly…
Synthetic datasets are important for evaluating and testing machine learning models. When evaluating real-life recommender systems, high-dimensional categorical (and sparse) datasets are often considered. Unfortunately, there are not many…
We propose a new framework for generating cross-sectional synthetic datasets via disjoint generative models. In this paradigm, a dataset is partitioned into disjoint subsets that are supplied to separate instances of generative models. The…
Deep Generative Models (DGMs) have been shown to be powerful tools for generating tabular data, as they have been increasingly able to capture the complex distributions that characterize them. However, to generate realistic synthetic data,…
Generative modeling has recently seen many exciting developments with the advent of deep generative architectures such as Variational Auto-Encoders (VAE) or Generative Adversarial Networks (GAN). The ability to draw synthetic i.i.d.…
Synthetic data is often positioned as a solution to replace sensitive fixed-size datasets with a source of unlimited matching data, freed from privacy concerns. There has been much progress in synthetic data generation over the last decade,…
The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? While they can generate vast amounts of datasets, only a subset meaningfully…
Large monolithic generative models trained on massive amounts of data have become an increasingly dominant approach in AI research. In this paper, we argue that we should instead construct large generative systems by composing smaller…
The ability to generate synthetic data has a variety of use cases across different domains. In education research, there is a growing need to have access to synthetic data to test certain concepts and ideas. In recent years, several deep…
Generative modelling has become the standard approach for synthesising tabular data. However, different use cases demand synthetic data to comply with different requirements to be useful in practice. In this survey, we review deep…
Synthetic data generation has emerged as a crucial topic for financial institutions, driven by multiple factors, such as privacy protection and data augmentation. Many algorithms have been proposed for synthetic data generation but reaching…
Synthetic data generation becomes prevalent as a solution to privacy leakage and data shortage. Generative models are designed to generate a realistic synthetic dataset, which can precisely express the data distribution for the real…
Ensuring the generalisability of clinical machine learning (ML) models across diverse healthcare settings remains a significant challenge due to variability in patient demographics, disease prevalence, and institutional practices. Existing…