Related papers: Generating Multidimensional Clusters With Support …

Natural Language-Based Synthetic Data Generation for Cluster Analysis

Cluster analysis relies on effective benchmarks for evaluating and comparing different algorithms. Simulation studies on synthetic data are popular because important features of the data sets, such as the overlap between clusters, or the…

Machine Learning · Computer Science 2025-02-19 Michael J. Zellinger, Peter Bühlmann

Syntgen: A system to generate temporal networks with user specified topology

Network representations can help reveal the behavior of complex systems. Useful information can be derived from the network properties and invariants, such as components, clusters or cliques, as well as from their changes over time. The…

Social and Information Networks · Computer Science 2019-03-18 Luis Ramada Pereira, Rui J. Lopes, Jorge Louçã

FASTGEN: Fast and Cost-Effective Synthetic Tabular Data Generation with LLMs

Synthetic data generation has emerged as an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity. Large language models (LLMs) have demonstrated remarkable capabilities in producing…

Machine Learning · Computer Science 2025-07-22 Anh Nguyen, Sam Schafft, Nicholas Hale, John Alfaro

MMM and MMMSynth: Clustering of heterogeneous tabular data, and synthetic data generation

We provide new algorithms for two tasks relating to heterogeneous tabular datasets: clustering, and synthetic data generation. Tabular datasets typically consist of heterogeneous data types (numerical, ordinal, categorical) in columns, but…

Machine Learning · Computer Science 2024-04-22 Chandrani Kumari, Rahul Siddharthan

PolyGen: Fully Synthetic Vision-Language Training via Multi-Generator Ensembles

Synthetic data offers a scalable solution for vision-language pre-training, yet current state-of-the-art methods typically rely on scaling up a single generative backbone, which introduces generator-specific spectral biases and limits…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Leonardo Brusini, Cristian Sbrolli, Eugenio Lomurno, Toshihiko Yamasaki, Matteo Matteucci

Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods

Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data. Unfortunately, many large-scale datasets are highly sensitive, such as healthcare data, and are not widely available…

Machine Learning · Computer Science 2020-12-09 James Jordon, Alan Wilson, Mihaela van der Schaar

Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems

Synthetic datasets are important for evaluating and testing machine learning models. When evaluating real-life recommender systems, high-dimensional categorical (and sparse) datasets are often considered. Unfortunately, there are not many…

Information Retrieval · Computer Science 2024-12-11 Miha Malenšek, Blaž Škrlj, Blaž Mramor, Jure Demšar

DRtool: An Interactive Tool for Analyzing High-Dimensional Clusterings

When faced with new data, we often conduct a cluster analysis to obtain a better understanding of the data's structure and the archetypical samples present in the data. This process often includes visualization of the data, either as a way…

Applications · Statistics 2026-04-06 Justin Lin, Julia Fukuyama

On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

Within the evolving landscape of deep learning, the dilemma of data quantity and quality has been a long-standing problem. The recent advent of Large Language Models (LLMs) offers a data-centric solution to alleviate the limitations of…

Computation and Language · Computer Science 2024-06-24 Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang

A Framework for Auditable Synthetic Data Generation

Synthetic data has gained significant momentum thanks to sophisticated machine learning tools that enable the synthesis of high-dimensional datasets. However, many generation techniques do not give the data controller control over what…

Cryptography and Security · Computer Science 2022-11-22 Florimond Houssiau, Samuel N. Cohen, Lukasz Szpruch, Owen Daniel, Michaela G. Lawrence, Robin Mitra, Henry Wilde, Callum Mole

Synthetic Data for Model Selection

Recent breakthroughs in synthetic data generation approaches made it possible to produce highly photorealistic images which are hardly distinguishable from real ones. Furthermore, synthetic generation pipelines have the potential to…

Computer Vision and Pattern Recognition · Computer Science 2023-07-06 Alon Shoshan, Nadav Bhonker, Igor Kviatkovsky, Matan Fintz, Gerard Medioni

GenSyn: A Multi-stage Framework for Generating Synthetic Microdata using Macro Data Sources

Individual-level data (microdata) that characterizes a population is essential for studying many real-world problems. However, acquiring such data is not straightforward due to cost and privacy constraints, and access is often limited to…

Machine Learning · Computer Science 2022-12-13 Angeela Acharya, Siddhartha Sikdar, Sanmay Das, Huzefa Rangwala

Second-order Control of Complex Systems with Correlated Synthetic Data

The generation of synthetic data is an essential tool to study complex systems, allowing for example to test models of these in precisely controlled settings, or to parametrize simulation models when data is missing. This paper focuses on…

Applications · Statistics 2019-11-25 Juste Raimbault

MC-GEN: Multi-level Clustering for Private Synthetic Data Generation

With the development of machine learning and data science, data sharing is very common between companies and research institutes to avoid data scarcity. However, sharing original datasets that contain private information can cause privacy…

Machine Learning · Computer Science 2022-11-30 Mingchen Li, Di Zhuang, J. Morris Chang

Boosting Synthetic Data Generation with Effective Nonlinear Causal Discovery

Synthetic data generation has been widely adopted in software testing, data privacy, imbalanced learning, and artificial intelligence explanation. In all such contexts, it is crucial to generate plausible data samples. A common assumption…

Artificial Intelligence · Computer Science 2024-10-16 Martina Cinquini, Fosca Giannotti, Riccardo Guidotti

Generation of Synthetic Multi-Resolution Time Series Load Data

The availability of large datasets is crucial for the development of new power system applications and tools; unfortunately, very few are publicly and freely available. We designed an end-to-end generative framework for the creation of…

Systems and Control · Electrical Eng. & Systems 2022-07-26 Andrea Pinceti, Lalitha Sankar, Oliver Kosut

Generation and Simulation of Synthetic Datasets with Copulas

This paper proposes a new method to generate synthetic data sets based on copula models. Our goal is to produce surrogate data resembling real data in terms of marginal and joint distributions. We present a complete and reliable algorithm…

Machine Learning · Computer Science 2022-04-01 Regis Houssou, Mihai-Cezar Augustin, Efstratios Rappos, Vivien Bonvin, Stephan Robert-Nicoud

Cluster Aware Mobility Encounter Dataset Enlargement

The recent emerging fields in data processing and manipulation have facilitated the need for synthetic data generation. This is also valid for mobility encounter dataset generation. Synthetic data generation might be useful to run…

Networking and Internet Architecture · Computer Science 2020-02-21 Rajarshi Haldar, Salih Safa Bacanli, Moayad Aloqaily, Adel Ben Mnaouer, Damla Turgut

Synthetic data for unsupervised polyp segmentation

Deep learning has shown excellent performance in analysing medical images. However, datasets are difficult to obtain due to privacy issues, standardization problems, and lack of annotations. We address these problems by producing realistic…

Image and Video Processing · Electrical Eng. & Systems 2022-02-18 Enric Moreu, Kevin McGuinness, Noel E. O'Connor

Kubric: A scalable dataset generator

Data is the driving force of machine learning, with the amount and quality of training data often being more important for the performance of a system than architecture and training details. But collecting, processing and annotating real…

Computer Vision and Pattern Recognition · Computer Science 2022-03-08 Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Mehdi S. M. Sajjadi, Matan Sela, Vincent Sitzmann, Austin Stone, Deqing Sun, Suhani Vora, Ziyu Wang, Tianhao Wu, Kwang Moo Yi, Fangcheng Zhong, Andrea Tagliasacchi