Related papers: Spatial Data Generators

Evaluation of Categorical Generative Models -- Bridging the Gap Between Real and Synthetic Data

The machine learning community has mainly relied on real data to benchmark algorithms as it provides compelling evidence of model applicability. Evaluation on synthetic datasets can be a powerful tool to provide a better understanding of a…

Machine Learning · Computer Science 2022-11-01 Florence Regol, Anja Kroon, Mark Coates

Should I use Synthetic Data for That? An Analysis of the Suitability of Synthetic Data for Data Sharing and Augmentation

Recent advances in generative modelling have led many to see synthetic data as the go-to solution for a range of problems around data access, scarcity, and under-representation. In this paper, we study three prominent use cases: (1) Sharing…

Machine Learning · Computer Science 2026-02-04 Bogdan Kulynych, Theresa Stadler, Jean Louis Raisaro, Carmela Troncoso

Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods

Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data. Unfortunately, many large-scale datasets are highly sensitive, such as healthcare data, and are not widely available…

Machine Learning · Computer Science 2020-12-09 James Jordon, Alan Wilson, Mihaela van der Schaar

Partition-based differentially private synthetic data generation

Private synthetic data sharing is preferred as it keeps the distribution and nuances of original data compared to summary statistics. The state-of-the-art methods adopt a select-measure-generate paradigm, but measuring large domain…

Cryptography and Security · Computer Science 2023-10-11 Meifan Zhang, Dihang Deng, Lihua Yin

A Framework for Auditable Synthetic Data Generation

Synthetic data has gained significant momentum thanks to sophisticated machine learning tools that enable the synthesis of high-dimensional datasets. However, many generation techniques do not give the data controller control over what…

Cryptography and Security · Computer Science 2022-11-22 Florimond Houssiau, Samuel N. Cohen, Lukasz Szpruch, Owen Daniel, Michaela G. Lawrence, Robin Mitra, Henry Wilde, Callum Mole

Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems

Synthetic datasets are important for evaluating and testing machine learning models. When evaluating real-life recommender systems, high-dimensional categorical (and sparse) datasets are often considered. Unfortunately, there are not many…

Information Retrieval · Computer Science 2024-12-11 Miha Malenšek, Blaž Škrlj, Blaž Mramor, Jure Demšar

A primer on synthetic health data

Recent advances in deep generative models have greatly expanded the potential to create realistic synthetic health datasets. These synthetic datasets aim to preserve the characteristics, patterns, and overall scientific conclusions derived…

Machine Learning · Computer Science 2024-07-04 Jennifer A Bartell, Sander Boisen Valentin, Anders Krogh, Henning Langberg, Martin Bøgsted

Meta-Sim: Learning to Generate Synthetic Datasets

Training models to high-end performance requires availability of large labeled datasets, which are expensive to get. The goal of our work is to automatically synthesize labeled datasets that are relevant for a downstream task. We propose…

Computer Vision and Pattern Recognition · Computer Science 2019-04-29 Amlan Kar, Aayush Prakash, Ming-Yu Liu, Eric Cameracci, Justin Yuan, Matt Rusiniak, David Acuna, Antonio Torralba, Sanja Fidler

Benchmarking Differentially Private Synthetic Data Generation Algorithms

This work presents a systematic benchmark of differentially private synthetic data generation algorithms that can generate tabular data. Utility of the synthetic data is evaluated by measuring whether the synthetic data preserve the…

Cryptography and Security · Computer Science 2022-02-16 Yuchao Tao, Ryan McKenna, Michael Hay, Ashwin Machanavajjhala, Gerome Miklau

Synthetic Data Generation for Economists

As more tech companies engage in rigorous economic analyses, we are confronted with a data problem: in-house papers cannot be replicated due to use of sensitive, proprietary, or private data. Readers are left to assume that the obscured…

General Economics · Economics 2020-11-10 Allison Koenecke, Hal Varian

Synthetic Dataset Generation with Itemset-Based Generative Models

This paper proposes three different data generators, tailored to transactional datasets, based on existing itemset-based generative models. All these generators are intuitive and easy to implement and show satisfactory performance. The…

Databases · Computer Science 2020-07-15 Christian Lezcano, Marta Arias

Synthetic Data for Feature Selection

Feature selection is an important and active field of research in machine learning and data science. Our goal in this paper is to propose a collection of synthetic datasets that can be used as a common reference point for feature selection…

Machine Learning · Computer Science 2022-11-08 Firuz Kamalov, Hana Sulieman, Aswani Kumar Cherukuri

Synthetic Data Generation for Bridging Sim2Real Gap in a Production Environment

Synthetic data is being used lately for training deep neural networks in computer vision applications such as object detection, object segmentation and 6D object pose estimation. Domain randomization hereby plays an important role in…

Computer Vision and Pattern Recognition · Computer Science 2024-05-13 Parth Rawal, Mrunal Sompura, Wolfgang Hintze

Machine Learning for Synthetic Data Generation: A Review

Machine learning heavily relies on data, but real-world applications often encounter various data-related issues. These include data of poor quality, insufficient data points leading to under-fitting of machine learning models, and…

Machine Learning · Computer Science 2025-04-07 Yingzhou Lu, Lulu Chen, Yuanyuan Zhang, Minjie Shen, Huazheng Wang, Xiao Wang, Capucine van Rechem, Tianfan Fu, Wenqi Wei

GenSyn: A Multi-stage Framework for Generating Synthetic Microdata using Macro Data Sources

Individual-level data (microdata) that characterizes a population is essential for studying many real-world problems. However, acquiring such data is not straightforward due to cost and privacy constraints, and access is often limited to…

Machine Learning · Computer Science 2022-12-13 Angeela Acharya, Siddhartha Sikdar, Sanmay Das, Huzefa Rangwala

Using Synthetic Data to estimate the True Error is theoretically and practically doable

Accurately evaluating model performance is crucial for deploying machine learning systems in real-world applications. Traditional methods often require a sufficiently large labeled test set to ensure a reliable evaluation. However, in many…

Machine Learning · Computer Science 2025-11-04 Hai Hoang Thanh, Duy-Tung Nguyen, Hung The Tran, Khoat Than

Generating Synthetic Multispectral Satellite Imagery from Sentinel-2

Multi-spectral satellite imagery provides valuable data at global scale for many environmental and socio-economic applications. Building supervised machine learning models based on this imagery, however, may require ground reference labels…

Computer Vision and Pattern Recognition · Computer Science 2020-12-08 Tharun Mohandoss, Aditya Kulkarni, Daniel Northrup, Ernest Mwebaze, Hamed Alemohammad

Evolving Spatially Aggregated Features from Satellite Imagery for Regional Modeling

Satellite imagery and remote sensing provide explanatory variables at relatively high resolutions for modeling geospatial phenomena, yet regional summaries are often desirable for analysis and actionable insight. In this paper, we propose a…

Machine Learning · Statistics 2017-12-15 Sam Kriegman, Marcin Szubert, Josh C. Bongard, Christian Skalka

Synthetic Test Data Generation Using Recurrent Neural Networks: A Position Paper

Testing in production-like test environments is an essential part of quality assurance processes in many industries. Provisioning of such test environments, for information-intensive services, involves setting up databases that are…

Software Engineering · Computer Science 2024-07-09 Razieh Behjati, Erik Arisholm, Chao Tan, Margrethe M. Bedregal

Towards Scalable Generation of Realistic Test Data for Duplicate Detection

Due to the increasing volume, volatility, and diversity of data in virtually all areas of our lives, the ability to detect duplicates in potentially linked data sources is more important than ever before. However, while research is already…

Databases · Computer Science 2024-01-01 Fabian Panse, Wolfram Wingerath, Benjamin Wollmer