Related papers: On Consistent Bayesian Inference from Synthetic Da…

Generation and analysis of synthetic data via Bayesian networks: a robust approach for uncertainty quantification via Bayesian paradigm

Safe and reliable disclosure of information from confidential data is a challenging statistical problem. A common approach considers the generation of synthetic data, to be disclosed instead of the original data. Efficient approaches ought…

Methodology · Statistics 2024-03-04 Larissa N. A. Martins, Flávio B. Gonçalves, Thais P. Galletti

One Step to Efficient Synthetic Data

A common approach to synthetic data is to sample from a fitted model. We show that under general assumptions, this approach results in a sample with inefficient estimators and whose joint distribution is inconsistent with the true…

Statistics Theory · Mathematics 2026-02-18 Jordan Awan, Zhanrui Cai

Foundations of Bayesian Learning from Synthetic Data

There is significant growth and interest in the use of synthetic data as an enabler for machine learning in environments where the release of real data is restricted due to privacy or availability constraints. Despite a large number of…

Machine Learning · Computer Science 2020-11-25 Harrison Wilde, Jack Jewson, Sebastian Vollmer, Chris Holmes

Harnessing Synthetic Data from Generative AI for Statistical Inference

The emergence of generative AI models has dramatically expanded the availability and use of synthetic data across scientific, industrial, and policy domains. While these developments open new possibilities for data analysis, they also raise…

Machine Learning · Statistics 2026-03-06 Ahmad Abdel-Azim, Ruoyu Wang, Xihong Lin

Synthetic likelihood in misspecified models

Bayesian synthetic likelihood is a widely used approach for conducting Bayesian analysis in complex models where evaluation of the likelihood is infeasible but simulation from the assumed model is tractable. We analyze the behaviour of the…

Statistics Theory · Mathematics 2026-04-17 David T. Frazier, Christopher Drovandi, David J. Nott

Privacy-preserving data sharing via probabilistic modelling

Differential privacy allows quantifying privacy loss resulting from accessing sensitive personal data. Repeated accesses to underlying data incur increasing loss. Releasing data as privacy-preserving synthetic data would avoid this…

Machine Learning · Statistics 2021-06-10 Joonas Jälkö, Eemil Lagerspetz, Jari Haukka, Sasu Tarkoma, Antti Honkela, Samuel Kaski

Best Practices and Lessons Learned on Synthetic Data

The success of AI models relies on the availability of large, diverse, and high-quality datasets, which can be challenging to obtain due to data scarcity, privacy concerns, and high costs. Synthetic data has emerged as a promising solution…

Computation and Language · Computer Science 2024-08-13 Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai

Generating Higher-Fidelity Synthetic Datasets with Privacy Guarantees

This paper considers the problem of enhancing user privacy in common machine learning development tasks, such as data annotation and inspection, by substituting the real data with samples from a generative adversarial network. We propose…

Machine Learning · Statistics 2020-03-03 Aleksei Triastcyn, Boi Faltings

Inference With Combining Rules From Multiple Differentially Private Synthetic Datasets

Differential privacy (DP) has been accepted as a rigorous criterion for measuring the privacy protection offered by random mechanisms used to obtain statistics or, as we will study here, synthetic datasets from confidential data. Methods to…

Methodology · Statistics 2024-05-09 Leila Nombo, Anne-Sophie Charest

Robust Bayesian Regression with Synthetic Posterior

Although linear regression models are fundamental tools in statistical science, the estimation results can be sensitive to outliers. While several robust methods have been proposed in frequentist frameworks, statistical inference is not…

Methodology · Statistics 2020-07-15 Shintaro Hashimoto, Shonosuke Sugasawa

Bayesian inference using synthetic likelihood: asymptotics and adjustments

Implementing Bayesian inference is often computationally challenging in applications involving complex models, and sometimes calculating the likelihood itself is difficult. Synthetic likelihood is one approach for carrying out inference…

Computation · Statistics 2021-03-15 David T. Frazier, David J. Nott, Christopher Drovandi, Robert Kohn

On the Stability of General Bayesian Inference

We study the stability of posterior predictive inferences to the specification of the likelihood model and perturbations of the data generating process. In modern big data analyses, useful broad structural judgements may be elicited from…

Methodology · Statistics 2024-04-30 Jack Jewson, Jim Q. Smith, Chris Holmes

Bayesian Pseudo Posterior Synthesis for Data Privacy Protection

Statistical agencies utilize models to synthesize respondent-level data for release to the general public as an alternative to the actual data records. A Bayesian model synthesizer encodes privacy protection by employing a hierarchical…

Statistics Theory · Mathematics 2020-05-19 Jingchen Hu, Terrance D. Savitsky

A Framework for Auditable Synthetic Data Generation

Synthetic data has gained significant momentum thanks to sophisticated machine learning tools that enable the synthesis of high-dimensional datasets. However, many generation techniques do not give the data controller control over what…

Cryptography and Security · Computer Science 2022-11-22 Florimond Houssiau, Samuel N. Cohen, Lukasz Szpruch, Owen Daniel, Michaela G. Lawrence, Robin Mitra, Henry Wilde, Callum Mole

A Bias-Variance Decomposition for Ensembles over Multiple Synthetic Datasets

Recent studies have highlighted the benefits of generating multiple synthetic datasets for supervised learning, from increased accuracy to more effective model selection and uncertainty estimation. These benefits have clear empirical…

Machine Learning · Computer Science 2025-04-28 Ossi Räisä, Antti Honkela

Continual Release of Differentially Private Synthetic Data from Longitudinal Data Collections

Motivated by privacy concerns in long-term longitudinal studies in medical and social science research, we study the problem of continually releasing differentially private synthetic data from longitudinal data collections. We introduce a…

Data Structures and Algorithms · Computer Science 2024-05-28 Mark Bun, Marco Gaboardi, Marcel Neunhoeffer, Wanrong Zhang

Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods

Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data. Unfortunately, many large-scale datasets are highly sensitive, such as healthcare data, and are not widely available…

Machine Learning · Computer Science 2020-12-09 James Jordon, Alan Wilson, Mihaela van der Schaar

Differentially Private Statistical Inference through $\beta$-Divergence One Posterior Sampling

Differential privacy guarantees allow the results of a statistical analysis involving sensitive data to be released without compromising the privacy of any individual taking part. Achieving such guarantees generally requires the injection…

Machine Learning · Statistics 2023-10-31 Jack Jewson, Sahra Ghalebikesabi, Chris Holmes

Measuring the quality of Synthetic data for use in competitions

Machine learning has the potential to assist many communities in using the large datasets that are becoming more and more available. Unfortunately, much of that potential is not being realized because it would require sharing data in a way…

Machine Learning · Computer Science 2018-07-02 James Jordon, Jinsung Yoon, Mihaela van der Schaar

Combining support for hypotheses over heterogeneous studies with Bayesian Evidence Synthesis: A simulation study

Scientific claims gain credibility by replicability, especially if replication under different circumstances and varying designs yields equivalent results. Aggregating results over multiple studies is, however, not straightforward, and when…

Methodology · Statistics 2023-12-27 Thom Benjamin Volker, Irene Klugkist